### What is "learning"?

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." *Tom M. Mitchell*.

### What's the difference between traditional programming and machine learning?

Traditional programming is rule-based: you provide the rules and the data, and you get the answers. Machine Learning instead receives the data and the answers and tries to find the rules.

### What are some kinds of Machine Learning?

Machine Learning algorithms can be divided into different categories:

- Supervised Learning: the labels of the data are provided. For example, image classification tasks in which each training image comes with its class label.
- Unsupervised Learning: the labels are not provided. For example, clustering algorithms such as K-Means.

### Scaling and Normalizing the data

Data often needs to be preprocessed before features can be compared. When features have very different ranges (say you are comparing age and income), *normalization* makes them comparable by rescaling each feature to range between 0 and 1 (or between two specified values, as long as you use the same range for every feature). **Normalization** is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks. You can use `MinMaxScaler` from `sklearn` to achieve it.
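As a minimal sketch of normalization, the snippet below rescales two features with very different ranges. The toy age/income values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: columns are age (years) and income (currency units)
data = np.array([[25, 20000.0], [40, 50000.0], [60, 120000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))  # (0, 1) is also the default range
data_scaled = scaler.fit_transform(data)

# Each column now has minimum 0 and maximum 1
print(data_scaled)
```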

**Standardization** rescales each feature to have mean 0 and standard deviation 1. It is most meaningful when the feature is roughly normally distributed; it does not change the shape of the distribution. You can use `StandardScaler` from `sklearn` to achieve it. Standardization is useful when your data has varying scales and the algorithm you are using works best with centered, Gaussian-like features, such as linear regression, logistic regression, and linear discriminant analysis.

**Remember**: you always `fit_transform` (or `fit` and `transform`) on the training data, then you *only* `transform` on the validation/test data.

```
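from sklearn.preprocessing import StandardScaler

# train_data and test_data are assumed to be 2D arrays of shape (n_samples, n_features)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_data)  # learn mean/std on the training set and apply them
test_scaled = scaler.transform(test_data)  # reuse the training mean/std, never refit on test data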
```

### Training, Test, Validation

Since you want to build algorithms that generalize to unseen data, it's necessary to keep some data aside to test your model on. If you tested on the same data you used for training, the measured performance would be overly optimistic and would say little about how the model behaves on new data. A common practice is to split the available dataset into three parts: a training set (around 60-70% of your dataset), a validation set (around 20-30% of your dataset) and a test set with the remaining data. Since datasets are often stored in sorted order, it's convenient to shuffle the data before splitting.

*Training set*: the portion of the dataset you will train your model on.

*Validation set*: the portion of the dataset you will evaluate your model on during training, to decide which parameters to use (for example: if you need to decide which kernel to use in SVM, you will use the validation set to compare the accuracies).

*Test set*: the portion of the dataset that you will touch only at the end, to test your model's performance.
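One possible way to obtain the three sets is to call `train_test_split` twice, as in the sketch below. The iris dataset, the split fractions and the `random_state` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out roughly 10% as the test set (shuffling is on by default)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, shuffle=True, stratify=y, random_state=0
)
# ...then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, shuffle=True, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The `stratify` argument keeps the class proportions similar in every split, which helps on small datasets.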

**Warning - repetita iuvant**: if you perform PCA, standard scaling, or any other transformation that is learned from the data, always remember to `fit` on the training set only, and then `transform` the validation/test sets.

### KFold Cross-Validation

It's a way to split the data to choose the best parameters or the best Machine Learning model. Instead of splitting the data into training and validation sets once, you divide your dataset into *k* subsets (folds): you train on *k-1* of them and test on the remaining one. You repeat this *k* times, so that each subset is used exactly once for testing, and then average the *k* scores.
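A minimal sketch of 5-fold cross-validation with `sklearn` follows. The dataset, the model and the value of *k* are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold: train on 4 folds, evaluate on the held-out fold
scores = cross_val_score(model, X, y, cv=kfold)
print(scores, scores.mean())
```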


### Underfitting, Overfitting

**Underfitting**: when a model is too simple to capture the structure of the data. Often spotted by low accuracy already on the training set.

**Overfitting**: when a model is too complex and also learns the noise in the data. Often spotted by validation accuracy that decreases (or stalls) while training accuracy keeps rising.
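To make the two symptoms concrete, the sketch below compares training and validation accuracy for decision trees of increasing depth on synthetic, noisy data. The dataset parameters and depths are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which 20% of the labels are assigned at random (noise)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_val, y_val))
    print(depth, results[depth])
```

A very shallow tree tends to score poorly even on the training set (underfitting), while the unrestricted tree fits the training set almost perfectly yet does worse on the validation set (overfitting).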

### Bias-variance tradeoff

It is the property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters. The problem is that it is hard to minimize these two sources of error simultaneously.

From Wikipedia:

- The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

### One hot encoding

It's a way to represent categorical data, often to transform the labels. For example:

- If you are considering the iris dataset, the three possible classes are: setosa, virginica and versicolor. Since computers work with numbers, you can represent setosa as the number 0, virginica as the number 1 and versicolor as the number 2. However, numbers carry an implicit order: 2>1>0. To avoid introducing this relationship, you can one-hot encode:
    - setosa is represented by the vector \[1,0,0\]
    - virginica is represented by the vector \[0,1,0\]
    - versicolor is represented by the vector \[0,0,1\]
    
In general, if you have *n* classes, the vector will have *n* entries (*e.g.* for four classes, you'll have \[1,0,0,0\], \[0,1,0,0\], \[0,0,1,0\], \[0,0,0,1\]).
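The encoding above can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical; in practice `sklearn` offers `OneHotEncoder` for the same purpose.

```python
classes = ["setosa", "virginica", "versicolor"]

def one_hot(label, classes):
    """Return a list with a 1 at the position of `label` and 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    return [1 if label == c else 0 for c in classes]

labels = ["virginica", "setosa", "versicolor"]
encoded = [one_hot(label, classes) for label in labels]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```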

### Features and Target

In a supervised learning setting, there are usually features that you will use to make predictions (petal length, petal width, sepal length, sepal width for the iris dataset) and a target to predict (setosa, virginica, versicolor for the iris dataset). In general, we indicate with $X$ the features and with $y$ the target.
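For instance, the copy of the iris dataset bundled with `sklearn` exposes features and target separately, as this short sketch shows:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # target: the species of each flower, encoded as an integer

print(iris.feature_names)
print(X.shape, y.shape)
```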

### Clustering

It's the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some defined sense) to each other than to those in other groups (clusters). There are many types of clustering: hierarchical clustering (based on distance), which produces dendrograms, as well as centroid-based clustering, where the clusters are represented by a central vector. These are usually *unsupervised learning* algorithms.

### K-Means 

K-Means is a clustering algorithm which groups the given *n* observations into *k* clusters around *k* centroids. The algorithm is as follows:
- Choose *k* points among the available data. They will be the initial centroids, each representing a different cluster
- Assign each item to the cluster of the closest centroid
- Calculate the new mean of each cluster, which becomes the new centroid
- Repeat until the convergence criterion is met
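The steps above can be sketched with NumPy as follows. This is an illustrative implementation, not the one used by `sklearn`; the function name `k_means`, the toy points and the choice of the first *k* points as initial centroids are assumptions made to keep the example reproducible.

```python
import numpy as np

def k_means(points, k, max_iters=100):
    # Step 1: take k data points as initial centroids. This sketch simply uses
    # the first k points (an assumption); real implementations usually pick
    # them at random or with k-means++.
    centroids = points[:k].copy()
    for _ in range(max_iters):
        # Step 2: assign every point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two small, well-separated groups of 2D points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```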

### How to evaluate the quality of clusters

There are many ways to measure the quality of a clustering. A popular one is the *silhouette* value, which ranges between -1 and 1: the higher the value, the better the clustering. It measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
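As a sketch, the silhouette can be used to compare different values of *k* on synthetic data. The blob centers and the candidate values of *k* below are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic clusters
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=0.5, random_state=0
)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
print(scores, best_k)
```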


### K-nearest neighbors (KNN)

KNN is a supervised non-parametric classification algorithm. This means you need training data, but there's no actual training to do. For each test data point, the algorithm checks the classes of its k nearest neighbours (according to a distance, often the Euclidean one) and lets them vote: the class that gets the most votes is assigned to the test point.
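The voting procedure can be sketched in plain Python. The function name `knn_predict` and the toy points are hypothetical; `sklearn` provides `KNeighborsClassifier` for real use.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Sort the training points by Euclidean distance to the query point
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Let the k closest neighbours vote (ties go to the label seen first)
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.5, 1.5)))  # a
print(knn_predict(train_points, train_labels, (6.5, 6.5)))  # b
```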

### PCA

It's a technique often used for dimensionality reduction (reducing the number of features in the data, to avoid complexity) or as a feature engineering technique to improve the quality of the features. What PCA does is find the Principal Components, which are the axes in the feature space that best explain the variance of the data.

PCA is *unsupervised*, so no labeled data are needed.
In addition, PCA is used to visualize multidimensional data on a plane by using the first two principal components as axes of the plane to represent the data.
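A minimal sketch of projecting the iris data onto its first two principal components follows. The dataset choice is an assumption, and the whole dataset is fitted at once only because the goal here is visualization rather than model evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)  # the labels y are not used

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance explained by each component
```

The two columns of `X_2d` can then be used as the x and y coordinates of a scatter plot.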

### LDA

Similarly to PCA, it is used for dimensionality reduction. The main difference is that it's a *supervised* algorithm, meaning that the labels are needed. When plotting the data, data points belonging to different classes are usually better separated than in PCA.

**For both PCA and LDA** the data should be normalized first, so that features with large ranges do not dominate the result.
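For comparison with PCA, here is a sketch of LDA used for dimensionality reduction on the iris data (an illustrative choice). Note that, unlike PCA, `fit_transform` also receives the labels.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# With 3 classes, LDA can produce at most 3 - 1 = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)  # supervised: the labels y are required

print(X_lda.shape)
```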

### Decision Tree

It's a Machine Learning algorithm used mainly in classification problems. The name comes from the fact that it builds a tree with the nodes containing conditional statements about the features. The nodes can be divided into three types:
- Decision nodes (representing the feature)
- Chance nodes (representing the conditional statements)
- Leaves (the targets)

It can lead to overfitting if not pruned, but it's very easy to interpret since it produces human-readable rules.
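To see the human-readable rules, one can train a small tree and print it with `export_text`, as sketched below. The dataset and the depth limit, which acts as a simple guard against overfitting, are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limiting the depth keeps the tree small and easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```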

### Random Forests

It's an ensemble learning method mostly used for classification. It constructs many decision trees during training, varying the features as well as the data points used to train each tree. Its strength is that it produces robust results even with small datasets, in comparison with other machine learning models.
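A minimal sketch with `sklearn` follows. The dataset, the number of trees and the other hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# bootstrap=True varies the data points seen by each tree,
# max_features="sqrt" varies the features considered at each split
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(accuracy)
```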

### Linear Regression

It's a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. As the name suggests, it's used for regression tasks. The goal of the algorithm is to compute the parameters of the line that minimizes the overall distance from the points, typically measured as the sum of squared residuals.
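As a sketch, a least-squares line can be fitted with NumPy alone. The toy data below is noise-free and generated from a known line, purely so the result is easy to check.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # noise-free toy data lying on the line y = 2x + 1

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least squares: find the parameters minimizing the sum of squared residuals
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)
```

In `sklearn`, `LinearRegression` solves the same problem and exposes the parameters as `coef_` and `intercept_`.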

### Logistic Regression

Logistic Regression is another machine learning model that computes the probability of a certain class or event. It is used mainly for binary classification problems, such as disease/no-disease problems. The main novelty of this method is the logistic function or sigmoid function:

$$\sigma(t) = \frac{1}{1+e^{-t}}$$

which maps any real number to a value between 0 and 1 that can be interpreted as a probability.
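The formula can be written directly in Python. The two-branch form below is one common way to avoid overflow for large negative inputs; it computes the same function.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t use the equivalent form e^t / (1 + e^t)
    z = math.exp(t)
    return z / (1.0 + z)

print(sigmoid(0.0))   # 0.5
print(sigmoid(10.0))  # close to 1
print(sigmoid(-10.0)) # close to 0
```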

### Support Vector Machines

It's another supervised learning algorithm for classification, which finds the *maximum-margin hyperplane* that separates the elements of different classes. The *support vectors* are the training points closest to this hyperplane, and they alone determine its position. It's usually very powerful, also because of the possibility of using kernels, such as the polynomial kernel and the RBF kernel, to implicitly project the data into a higher-dimensional space in which it's easier to separate them.
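To illustrate the effect of kernels, the sketch below compares a linear and an RBF kernel on two concentric circles, a synthetic dataset chosen here because no straight line can separate its classes.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer circle
X, y = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

accuracy = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    accuracy[kernel] = clf.score(X_test, y_test)
    print(kernel, accuracy[kernel], "support vectors:", len(clf.support_vectors_))
```

On data like this the RBF kernel is expected to separate the classes far better than the linear one.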
