# Introduction to Machine Learning in Python

## What is Machine Learning?

### Machine Learning at a Glance

<img src="resources/imgs/ml-wordle-436.jpg" width="60%">

### Machine learning teaches machines how to carry out tasks by themselves. It is that simple.
### The complexity comes with the details.

_W. Richert & L.P. Coelho, 2013
Building Machine Learning Systems with Python_

Machine learning is the process of automatically **extracting knowledge** from data, usually with the goal of making **predictions** on _new_, _unseen_ data.

A classic example is a _spam filter_: the user keeps labeling incoming emails as either spam or not spam.

A machine learning algorithm then "learns" what distinguishes spam from normal emails, and can predict for new emails whether they are spam or not.

<font size="4"> Central to machine learning is the concept of **making decisions automatically** from data, **without the user specifying explicit rules** for how these decisions should be made.</font>

For the case of emails, the user doesn't provide a list of words or characteristics that make an email spam. Instead, the user provides examples of spam and non-spam emails.
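The idea of learning from examples rather than hand-written rules can be sketched in a few lines. This is only an illustration: the six emails and their labels are made up, and the choice of a bag-of-words representation (`CountVectorizer`) with a naive Bayes classifier (`MultinomialNB`) is an assumption, not something the text prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training examples (illustrative only): the user supplies labeled
# emails, not a list of "spammy" words or rules.
emails = [
    "win a free prize now",
    "free money click now",
    "cheap pills free offer",
    "meeting agenda for monday",
    "lunch with the project team",
    "please review the attached report",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# Turn each email into word counts, then fit a naive Bayes classifier on them.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# Predict for emails the model has never seen.
new_emails = ["free prize offer", "agenda for the team meeting"]
predictions = spam_filter.predict(new_emails)
for text, label in zip(new_emails, predictions):
    print(f"{text!r} -> {label}")
```

Note that no word list appears anywhere in the code: which words indicate spam is inferred from the labeled examples alone.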

The second central concept is **generalization**. 

The goal of a machine learning algorithm is to predict on new, previously unseen data. We are not interested in marking emails that a human has already labeled as spam or not spam. Instead, we want to make the user's life easier by making an automatic decision for new incoming email.
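One common way to estimate how well a model generalizes is to hold out part of the labeled data and evaluate only on that part. The sketch below assumes the Iris dataset bundled with scikit-learn and a k-nearest-neighbors classifier; both are stand-ins chosen for brevity, not part of the spam example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to play the role of "new, unseen" data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The model only ever sees the training portion.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy on the held-out data is the estimate of generalization.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on unseen data:   {test_accuracy:.2f}")
```

The number that matters is the second one: a model that scores well on its own training data but poorly on held-out data has memorized rather than generalized.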

There are two kinds of machine learning we will talk about in these notebooks: 

* **Supervised learning;** 
* **Unsupervised learning.**

### Supervised Learning

In **Supervised Learning**, we have a dataset consisting of both **input features** and a **desired output** (aka target or ground truth), such as in the spam / no-spam example.

The task is to construct a model (or program) which is able to predict the desired output of an **unseen** object
given the set of features.

<img src="resources/imgs/ml_supervised_example.png" width="100%" />

Supervised learning is further broken down into two categories, **classification** and **regression**.

In classification, the label is discrete (a.k.a. _categorical data_, often encoded as _integer values_), such as "spam" or "no spam".
In other words, it provides a clear-cut distinction between categories.

In regression, the label is continuous, i.e. a _float output_, such as the age of an object.
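The difference shows up directly in the type of the target array and of the prediction. The following sketch uses a tiny made-up dataset with one feature; the category names `"small"`/`"large"` and the linear rule `2x + 1` are hypothetical, as is the choice of a decision tree and a linear model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# One input feature, six samples (made-up data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification target: a discrete category per sample.
y_class = np.array(["small", "small", "small", "large", "large", "large"])

# Regression target: a continuous value per sample (here exactly 2x + 1).
y_reg = 2.0 * X.ravel() + 1.0

clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

x_new = np.array([[2.5]])
class_prediction = clf.predict(x_new)[0]   # one of the known categories
value_prediction = reg.predict(x_new)[0]   # a floating-point number

print("classification:", class_prediction)
print("regression:    ", round(float(value_prediction), 3))
```

Both estimators share the same `fit` / `predict` interface; only the kind of label differs.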

### Other Examples

Some more complicated examples are:

- Image Classification: Given a dataset of images labeled with different objects or classes, supervised learning algorithms can be used to build models that classify new images into the correct categories. This application is widely used in domains such as autonomous vehicles, medical imaging, and facial recognition. Examples are:
    - given a multicolor image of an object through a telescope, determine whether that object is a star, a quasar, or a galaxy.
    - given a photograph of a person, identify the person in the photo.

- Sentiment Analysis: Supervised learning can be used for sentiment analysis in natural language processing (NLP). By training on labeled data containing text and corresponding sentiment labels (positive, negative, neutral), models can be developed to determine the sentiment of new, unseen text data.

- Spam Detection: In email filtering, supervised learning algorithms can be used to distinguish between spam and legitimate emails. By training on a dataset of labeled emails, the model learns to identify patterns and characteristics associated with spam messages.

- Credit Risk Assessment: In the financial industry, supervised learning can be employed to predict credit risk for loan applicants. By using historical data on borrowers and their repayment behavior, models can be built to evaluate the creditworthiness of new loan applicants.

- Medical Diagnosis: Supervised learning can be used in medical applications to assist in diagnosing diseases. By training on labeled medical records and corresponding diagnoses, models can be developed to aid in the identification of certain conditions or diseases.

What these tasks have in common is that there are one or more unknown
quantities associated with the object that need to be determined from other
observed quantities.

### For example

* In astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a **classification problem**: the label is from three distinct categories. 

* On the other hand, we might wish to estimate the age of an object based on such observations: this would be a **regression problem**, because the label (age) is a continuous quantity.

### Unsupervised Learning

In **Unsupervised Learning** there is no desired output associated with the data.

Instead, we are interested in extracting some form of knowledge or model from the given data.

In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself.

Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and
*density estimation*. 
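Two of these tasks, clustering and dimensionality reduction, can be sketched on synthetic data. The example assumes three well-separated groups generated with `make_blobs` (the centers are made up) and uses K-means and PCA as representative algorithms. The true group labels are never passed to either algorithm and are used only afterwards to check the result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic data: three well-separated groups in five dimensions (assumed centers).
centers = np.array(
    [
        [0, 0, 0, 0, 0],
        [10, 10, 0, 0, 0],
        [-10, 10, 0, 0, 0],
    ],
    dtype=float,
)
X, true_groups = make_blobs(
    n_samples=300, centers=centers, cluster_std=1.0, random_state=0
)

# Clustering: assign each sample to one of three groups using X only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the five features down to two.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Compare discovered clusters with the groups used to generate the data.
agreement = adjusted_rand_score(true_groups, cluster_labels)
print("reduced shape:", X_2d.shape)
print(f"agreement with generating groups: {agreement:.2f}")
```

The cluster numbers themselves are arbitrary; the adjusted Rand index compares groupings, not label values, which is why it is used here.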

<img src="resources/imgs/ml_unsupervised_example.png" width="100%" />

Unsupervised learning is often harder to understand and to evaluate.

Sometimes supervised and unsupervised learning are combined: for example, unsupervised learning can be used to find useful
features in heterogeneous data, and these features can then be used within a supervised
framework.
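Such a combination might look like the following sketch, where PCA (unsupervised) extracts a smaller set of features and logistic regression (supervised) is trained on them. The digits dataset bundled with scikit-learn and the choice of 16 components are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 8x8 digit images flattened to 64 features each.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Unsupervised step (PCA ignores y) followed by a supervised classifier.
model = make_pipeline(
    PCA(n_components=16),
    LogisticRegression(max_iter=5000),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"features before PCA: {X.shape[1]}, after PCA: 16")
print(f"accuracy on held-out digits: {test_accuracy:.2f}")
```

Wrapping both steps in one pipeline ensures the PCA projection is learned from the training data only and then reused unchanged on the test data.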

### Other Examples

Some more involved unsupervised learning problems are:

- given detailed observations of distant galaxies, determine which features or combinations of
  features best summarize the information.
- given a mixture of two sound sources (for example, a person talking over some music),
  separate the two (this is called the [blind source separation](http://en.wikipedia.org/wiki/Blind_signal_separation) problem).
- given a large collection of news articles, find recurring topics inside these articles.
- given a collection of images, cluster similar images together (for example, to group them when visualizing a collection).

Some of the most common applications are:

- Clustering: Unsupervised learning can be used for clustering similar data points together based on their intrinsic characteristics. This application is employed in market segmentation, customer profiling, and anomaly detection.

- Dimensionality Reduction: In cases where datasets have a large number of features, unsupervised learning techniques like Principal Component Analysis (PCA) can be utilized to reduce the dimensionality of the data while preserving its essential information.

- Recommendation Systems: Unsupervised learning can be used to build recommendation systems that suggest products, movies, or content to users based on their past behavior and preferences.

- Anomaly Detection: Unsupervised learning algorithms can be used to identify rare and unusual patterns or outliers in datasets, which can be crucial for detecting fraudulent transactions, defective products, or anomalies in system logs.
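As one concrete instance from the list above, anomaly detection can be sketched with an isolation forest. The data here is entirely synthetic: 200 points drawn around the origin plus five hand-placed distant points. The `contamination` value (the fraction of samples the detector is asked to flag) is an assumed setting, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 "normal" points around the origin, plus 5 hand-placed distant points.
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
injected_outliers = np.array(
    [[9.0, 9.0], [-9.0, 8.0], [8.0, -9.0], [-8.0, -8.0], [1.0, 11.0]]
)
X = np.vstack([normal_points, injected_outliers])

# No labels are given: the detector only sees X.
detector = IsolationForest(contamination=0.05, random_state=0)
predictions = detector.fit_predict(X)  # +1 = inlier, -1 = flagged as outlier

outlier_flags = predictions[-5:]  # predictions for the injected points
n_flagged = int((predictions == -1).sum())
print("flags for injected points:", outlier_flags.tolist())
print("total points flagged:", n_flagged)
```

Because `contamination` fixes roughly how many points get flagged, some ordinary points at the edge of the cloud are flagged as well; in practice this threshold has to be chosen with the application in mind.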

### There are many machine (and deep) learning frameworks in the Python ecosystem

The choice depends on the data you have and the task you want to tackle.
TensorFlow is one of the leading deep learning frameworks. Other widely popular frameworks include Keras, Caffe, and PyTorch.


Scikit-learn is a widely used machine learning library for Python. While it is not specifically designed for deep learning, it offers a wide range of classical machine learning algorithms, such as SVMs, decision trees, and random forests. Scikit-learn is well-documented, easy to use, and serves as a great starting point for those new to machine learning.

# Scikit-learn at a Glance

Scikit-learn is one of the most popular machine learning libraries for Python:

- It is built on top of other scientific libraries such as NumPy and SciPy, making it easy to integrate with the Python data science ecosystem.
- It includes supervised learning algorithms such as linear regression, logistic regression, support vector machines (SVM), decision trees, random forests, and more. Additionally, it offers unsupervised learning algorithms such as clustering (K-means, DBSCAN) and dimensionality reduction techniques (PCA, t-SNE).
- It provides a wide range of tools for data preprocessing and feature engineering, including handling missing values, feature scaling, one-hot encoding, and more.
- It offers various metrics for evaluating the performance of machine learning models, and provides utilities for cross-validation, hyperparameter tuning, and model selection.
- It integrates well with other Python data science libraries, such as Pandas for data manipulation and Seaborn/Matplotlib for data visualization.
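Several of the pieces listed above (preprocessing, an estimator, cross-validation, and hyperparameter tuning) fit together as in the sketch below. The Iris dataset, the SVM classifier, and the small grid of `C` values are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Preprocessing (feature scaling) and the model are chained into one estimator.
pipeline = make_pipeline(StandardScaler(), SVC())

# Cross-validation: five train/validation splits, one accuracy score per split.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")

# Hyperparameter tuning: try a few values of the SVM's C parameter.
# make_pipeline names each step after its lowercased class, hence "svc__C".
param_grid = {"svc__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

The same `fit` / `predict` / `score` interface applies to the pipeline, the grid search, and the individual estimators, which is what makes these components interchangeable.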

<img src="resources/imgs/scikit-learn-cheatsheet.png" width="80%" />