# Supervised Learning
In this notebook we briefly explain the most widely used supervised learning algorithms for both classification and regression, apply them to real datasets, and compare their performance. We use the Python package [scikit-learn](https://scikit-learn.org/) (sklearn), which implements many machine learning algorithms.
Before comparing the models on the selected dataset, let us introduce a fundamental concept that helps in understanding overfitting, underfitting, and regularization techniques: the bias-variance decomposition.

## Bias Variance Decomposition
The error of an estimator (in our case, the machine learning model) can be decomposed into two terms: a bias term and a variance term. Imagine we could train the same model on many different datasets. The bias measures the error of the average of the predictions of these models, while the variance measures how much the individual predictions spread around that average. Mathematically:
$$
Err(x) = \mathbb{E}_{\hat{f}}[(\mathbb{E}_{\hat{f}}[\hat{f}(x)]-f(x))^2] + \mathbb{E}_{\hat{f}}[(\hat{f}(x)-\mathbb{E}_{\hat{f}}[\hat{f}(x)])^2]
$$
where $\hat{f}$ is the predicted function and $f$ is the true function. The first term is the squared bias and the second term is the variance. Note that the expectation is taken over the models trained on different datasets.
Bias and variance are represented in the following image, borrowed from this [great article](http://scott.fortmann-roe.com/docs/BiasVariance.html) on bias-variance decomposition:
<p align="center">
  <img src="../imgs/biasvariance.png" width="30%"/>
</p>
As we can see from the picture, an estimator with high bias and low variance returns, on average, a value that differs from the true value, but its predictions are close to each other. An estimator with low bias and high variance, instead, returns the correct value on average, but its predictions vary a lot from one training set to another.
The bias-variance decomposition is a different way to express the concepts of overfitting and underfitting. If we increase the model capacity we reduce the bias but may increase the variance (overfitting); if we decrease it, the opposite happens (underfitting), as shown in the figure:
<p align="center">
  <img src="../imgs/overunderfitting.png" width="30%"/>
</p>
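
The trade-off can also be checked numerically. The following is a minimal sketch, not part of the original notebook: it assumes a hypothetical true function `true_f`, a fixed set of inputs, and datasets that differ only in their noise, and it uses polynomial fits of increasing degree as a stand-in for increasing model capacity. The squared bias and the variance are estimated exactly as in the formula above, by averaging over the models trained on the different datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function f, chosen only for illustration
    return np.sin(2 * np.pi * x)

# Fixed training inputs; each simulated dataset differs only in its noise
x_train = np.linspace(0.0, 1.0, 20)
# Points x at which the two terms of Err(x) are estimated
x_eval = np.linspace(0.0, 1.0, 50)

def bias_variance(degree, n_datasets=200, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        # Draw a new noisy dataset and train one model on it
        y_train = true_f(x_train) + rng.normal(0.0, noise, x_train.size)
        poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[i] = poly(x_eval)
    mean_pred = preds.mean(axis=0)  # estimate of E[f_hat(x)]
    bias2 = np.mean((mean_pred - true_f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

# Higher degree = higher model capacity
results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias2, variance) in results.items():
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

In this setting the low-degree fit should show the larger bias and the high-degree fit the larger variance, which is the pattern described by the figure.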



## Regression
For the regression tasks we will use the Diabetes dataset, which represents patients in a 10-dimensional feature space. The features are:
- `age`: age in years
- `sex`
- `bmi`: body mass index
- `bp`: average blood pressure
- `s1`: tc, total serum cholesterol
- `s2`: ldl, low-density lipoproteins
- `s3`: hdl, high-density lipoproteins
- `s4`: tch, total cholesterol / HDL
- `s5`: ltg, possibly log of serum triglycerides level
- `s6`: glu, blood sugar level

The feature variables are mean centered and scaled by the standard deviation times the square root of `n_samples`, so that the sum of squares of each column is 1.
The targets $Y$ are integers in the range $[25,346]$ and are a quantitative measure of disease progression one year after the measurements.
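
To make the scaling concrete, here is a small sketch on a hypothetical raw feature matrix (`raw` is random data, not the actual Diabetes measurements). It assumes the population standard deviation, which is what `np.std` computes by default, and shows that dividing the centered columns by the standard deviation times the square root of the number of samples gives columns with zero mean and unit sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw features: 442 samples, 10 features on an arbitrary scale
raw = rng.normal(loc=50.0, scale=12.0, size=(442, 10))

n_samples = raw.shape[0]
centered = raw - raw.mean(axis=0)
# Scale by (standard deviation * sqrt(n_samples)); np.std uses ddof=0 by default
scaled = centered / (raw.std(axis=0) * np.sqrt(n_samples))

print("column means:", np.abs(scaled.mean(axis=0)).max())
print("column sums of squares:", (scaled ** 2).sum(axis=0))
```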

In [101]:
import numpy as np
import sklearn
from sklearn.datasets import load_diabetes

In [102]:
X,Y = load_diabetes(return_X_y=True)
print(f"{X.shape},{Y.shape}")

(442, 10),(442,)


We use 90% of the data for training and 10% for testing. We first shuffle the data, so that the split does not depend on any ordering of the samples in the dataset.

In [103]:
X, Y = sklearn.utils.shuffle(X, Y, random_state=9)

# The first 90% of the shuffled samples are used for training, the rest for testing
n_train = int(0.9 * X.shape[0])
X_train, Y_train = X[:n_train], Y[:n_train]
X_test, Y_test = X[n_train:], Y[n_train:]

print(f"{X_train.shape},{Y_train.shape}")
print(f"{X_test.shape},{Y_test.shape}")

(397, 10),(397,)
(45, 10),(45,)
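
The same kind of split can also be obtained with sklearn's `train_test_split`, which shuffles and splits in one call. The sketch below uses placeholder arrays (`X_demo`, `Y_demo`) with the same shapes as the Diabetes data. It produces the same 397/45 sizes, but the samples that end up in each part are not guaranteed to match the manual split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as the Diabetes data (illustration only)
X_demo = np.arange(442 * 10, dtype=float).reshape(442, 10)
Y_demo = np.arange(442, dtype=float)

# Shuffle and keep 10% of the samples for testing
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, shuffle=True, random_state=9
)
print(f"{X_tr.shape},{Y_tr.shape}")
print(f"{X_te.shape},{Y_te.shape}")
```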


### K Nearest Neighbors
Every estimator in sklearn has a `score` method that provides a default evaluation criterion for the problem it is designed to solve. For regressors, such as `KNeighborsRegressor`, the documentation states that `score` returns the coefficient of determination $R^2$, defined as $(1-\frac{u}{v})$, where $u$ is the residual sum of squares `((y_true - y_pred) ** 2).sum()` and $v$ is the total sum of squares `((y_true - y_true.mean()) ** 2).sum()`. The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.

The cell below fits a `KNeighborsClassifier`, which treats every distinct target value as a separate class. Its `score` method returns the mean accuracy (the fraction of samples whose target is predicted exactly), not $R^2$, which explains the very low values obtained on a continuous target.

In [104]:
from sklearn.neighbors import KNeighborsClassifier
# The classifier treats each distinct target value as a class label
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train,Y_train)
# For a classifier, score() returns the mean accuracy, not R^2
print(f"Mean accuracy (Trainset): {model.score(X_train,Y_train)}")
print(f"Mean accuracy (Testset): {model.score(X_test,Y_test)}")

Mean accuracy (Trainset): 0.17884130982367757
Mean accuracy (Testset): 0.0


In [105]:
Y_pred = model.predict(X_test)
print(f"Some predictions on the Testset:\n\t\tY_true:{Y_test[0:10]} \n\t\tY_pred:{Y_pred[0:10]}")
Y_pred = model.predict(X_train)
print(f"Some predictions on the Trainset:\n\t\tY_true:{Y_train[0:10]},\n\t\tY_pred:{Y_pred[0:10]}")

Some predictions on the Testset:
		Y_true:[138.  91.  99. 170. 317. 281. 142. 144. 155. 280.] 
		Y_pred:[144. 171. 220.  87. 192. 109. 107.  25. 150. 195.]
Some predictions on the Trainset:
		Y_true:[104. 118. 186. 132. 199. 215. 279. 135.  65.  70.],
		Y_pred:[ 55.  65. 186.  44.  71.  67. 109.  50.  65.  71.]
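
Since the target is continuous, `KNeighborsRegressor` is the nearest-neighbors estimator whose `score` follows the $(1-\frac{u}{v})$ definition given above. The following is a minimal sketch that assumes the same shuffle seed and 90/10 split as before. The names `knn_reg`, `r2_manual`, and `r2_sklearn` are introduced here for illustration. It recomputes the coefficient of determination by hand and compares it with the value returned by `score`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils import shuffle

# Rebuild the same shuffled 90/10 split used in this notebook
X, Y = load_diabetes(return_X_y=True)
X, Y = shuffle(X, Y, random_state=9)
n_train = int(0.9 * X.shape[0])
X_train, Y_train = X[:n_train], Y[:n_train]
X_test, Y_test = X[n_train:], Y[n_train:]

# The regressor predicts the mean target of the 5 nearest training samples
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, Y_train)

# Coefficient of determination from its definition, 1 - u/v
Y_pred = knn_reg.predict(X_test)
u = ((Y_test - Y_pred) ** 2).sum()          # residual sum of squares
v = ((Y_test - Y_test.mean()) ** 2).sum()   # total sum of squares
r2_manual = 1 - u / v
r2_sklearn = knn_reg.score(X_test, Y_test)

print(f"R^2 computed by hand (Testset): {r2_manual}")
print(f"R^2 from score()     (Testset): {r2_sklearn}")
```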


### Linear Regression
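This is the standard linear regression we implemented from scratch in the previous lecture. Given the matrix of inputs $\mathbb{X}$ and the vector of targets $\mathbf{y}$, the weights $\mathbf{w}$ are chosen to minimize the objective (sklearn's `LinearRegression` also fits an intercept by default):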
$$
|| \mathbf{y} - \mathbb{X} \mathbf{w} ||^2
$$
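
The objective above can be minimized directly with a least-squares solver. The sketch below is an illustration on synthetic data (`X_demo`, `y_demo`, and `w_true` are made up for this example) and uses `np.linalg.lstsq`. The intercept is handled by appending a column of ones to the inputs, and the check at the end verifies that the gradient of the objective, $-2\,\mathbb{X}^T(\mathbf{y} - \mathbb{X}\mathbf{w})$, vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression problem (illustration only)
X_demo = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ w_true + 4.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so that the last weight acts as the intercept
X_b = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])

# Weights minimizing ||y - Xw||^2
w_hat, *_ = np.linalg.lstsq(X_b, y_demo, rcond=None)

# Gradient of the objective at the solution; it should be numerically zero
gradient = -2 * X_b.T @ (y_demo - X_b @ w_hat)

print("estimated weights and intercept:", w_hat)
print("largest gradient component:", np.abs(gradient).max())
```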

In [106]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,Y_train)
print(f"Coefficient of determination (Trainset): {model.score(X_train,Y_train)}")
print(f"Coefficient of determination (Testset): {model.score(X_test,Y_test)}")

Coefficient of determination (Trainset): 0.5413921609396912
Coefficient of determination (Testset): 0.29612044394293746


In [107]:
Y_pred = model.predict(X_test)
print(f"Some predictions on the Testset:\n\t\tY_true:{Y_test[0:10].astype(np.int32)} \n\t\tY_pred:{Y_pred[0:10].astype(np.int32)}")
Y_pred = model.predict(X_train)
print(f"Some predictions on the Trainset:\n\t\tY_true:{Y_train[0:10].astype(np.int32)},\n\t\tY_pred:{Y_pred[0:10].astype(np.int32)}")

Some predictions on the Testset:
		Y_true:[138  91  99 170 317 281 142 144 155 280] 
		Y_pred:[171 150 231  90 224 196 190 124 219 235]
Some predictions on the Trainset:
		Y_true:[104 118 186 132 199 215 279 135  65  70],
		Y_pred:[ 75  96 202 121 111 248 216 126 122  62]


### Regularization
Regularization is an inductive bias that favors some hypotheses over others. The most used regularization techniques are:
- __L1 regularization__, where the L1 norm of the parameters:
$$
L1(w\in \mathbb{R}^d) = \sum_{i=1}^d |w_i|
$$
is minimized together with the objective. Linear regression with L1 regularization is called Lasso Regression.
- __L2 regularization__, where the L2 norm of the parameters:
$$
L2(w\in \mathbb{R}^d) = \sqrt{\sum_{i=1}^d w_i^2}
$$
is minimized together with the objective (in practice the squared norm is usually penalized, as in sklearn's `Ridge`). Linear regression with L2 regularization is called Ridge Regression.

In both cases the penalty is weighted by a coefficient (the `alpha` parameter in sklearn) that controls the strength of the regularization.
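
To see the effect of the penalty on the parameters, the following sketch solves the L2-regularized least-squares problem in closed form for increasing penalty weights. It is an illustration under stated assumptions: synthetic data, no intercept, a squared L2 penalty, and a hypothetical helper `ridge_weights` that is not part of sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression problem (illustration only)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

def ridge_weights(X, y, alpha):
    # Minimizer of ||y - Xw||^2 + alpha * ||w||_2^2 (no intercept in this sketch)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_weights(X_demo, y_demo, alpha) for alpha in alphas]
l1_norms = [np.linalg.norm(w, ord=1) for w in weights]  # sum of |w_i|
l2_norms = [np.linalg.norm(w, ord=2) for w in weights]  # sqrt of sum of w_i^2

for alpha, l1, l2 in zip(alphas, l1_norms, l2_norms):
    print(f"alpha={alpha:6.1f}: L1 norm={l1:.3f}, L2 norm={l2:.3f}")
```

The L2 norm of the solution shrinks as `alpha` grows, which is how the penalty restricts the set of hypotheses the model prefers.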


### Lasso Linear Regression

In [108]:
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.)
model.fit(X_train,Y_train)
print(f"Coefficient of determination (Trainset): {model.score(X_train,Y_train)}")
print(f"Coefficient of determination (Testset): {model.score(X_test,Y_test)}")

Coefficient of determination (Trainset): 0.382943118645549
Coefficient of determination (Testset): 0.21692207839766886


In [109]:
Y_pred = model.predict(X_test)
print(f"Some predictions on the Testset:\n\t\tY_true:{Y_test[0:10].astype(np.int32)} \n\t\tY_pred:{Y_pred[0:10].astype(np.int32)}")
Y_pred = model.predict(X_train)
print(f"Some predictions on the Trainset:\n\t\tY_true:{Y_train[0:10].astype(np.int32)},\n\t\tY_pred:{Y_pred[0:10].astype(np.int32)}")

Some predictions on the Testset:
		Y_true:[138  91  99 170 317 281 142 144 155 280] 
		Y_pred:[158 140 189 125 184 164 168 130 167 177]
Some predictions on the Trainset:
		Y_true:[104 118 186 132 199 215 279 135  65  70],
		Y_pred:[111 127 187 131 128 197 188 127 135 103]
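
A characteristic effect of the L1 penalty is that it can set some coefficients exactly to zero. The sketch below illustrates this on synthetic data rather than on the Diabetes dataset: the target is built from only the first two of ten features, and the value `alpha=0.5` is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
# Synthetic data: only the first two of the ten features influence the target
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("LinearRegression coefficients:", np.round(ols.coef_, 3))
print("Lasso coefficients:           ", np.round(lasso.coef_, 3))
print(f"exact zeros: LinearRegression={n_zero_ols}, Lasso={n_zero_lasso}")
```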


### Ridge Regression

In [110]:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1)
model.fit(X_train,Y_train)
print(f"Coefficient of determination (Trainset): {model.score(X_train,Y_train)}")
print(f"Coefficient of determination (Testset): {model.score(X_test,Y_test)}")

Coefficient of determination (Trainset): 0.4613537648625582
Coefficient of determination (Testset): 0.32552492063229144


In [111]:
Y_pred = model.predict(X_test)
print(f"Some predictions on the Testset:\n\t\tY_true:{Y_test[0:10].astype(np.int32)} \n\t\tY_pred:{Y_pred[0:10].astype(np.int32)}")
Y_pred = model.predict(X_train)
print(f"Some predictions on the Trainset:\n\t\tY_true:{Y_train[0:10].astype(np.int32)},\n\t\tY_pred:{Y_pred[0:10].astype(np.int32)}")

Some predictions on the Testset:
		Y_true:[138  91  99 170 317 281 142 144 155 280] 
		Y_pred:[178 163 185 102 210 174 177 124 201 193]
Some predictions on the Trainset:
		Y_true:[104 118 186 132 199 215 279 135  65  70],
		Y_pred:[100 130 193 140 131 208 193 133 143  89]
