# Machine Learning pipelines

## What will I learn if I study this notebook?
You will learn how ML experts suggest:
- to report the performance of your models,
- to select the hyperparameters of your models.

TL;DR
- It doesn't make sense to quantify how good your algorithm is over all possible data-generating distributions, since the **no free lunch theorem** states that, averaged over all of them, every algorithm has the same performance. Instead,
- you want to quantify (measure) how good your algorithm is on the group of data-generating distributions related to a specific problem;
- in particular, you want to estimate how good the algorithm will be when it processes new data (data outside the set that was used for training).
- To estimate that **out-of-sample performance** of an algorithm you can use **k-fold cross-validation**.
- That out-of-sample performance is, in practice, reported along with a 95% **confidence interval** (you will need to estimate the score and the **standard error**).
- Since algorithms usually have **hyperparameters**, there is the question of how to select them:
  - You can select the best hyperparameters by the **one-standard-error rule**.
  - Another option is to integrate the hyperparameter selection into the algorithm itself by using **cross-validation**. To estimate the out-of-sample performance of the resulting algorithm you can again use cross-validation, effectively getting what is known as **nested cross-validation**.


## How good is this algorithm?

It doesn't make sense to quantify how good your algorithm is over all possible data-generating distributions: the **no free lunch theorem** states that, averaged over all of them, every algorithm has the same performance. Instead, you want to quantify (measure) how good your algorithm is on the group of data-generating distributions related to a specific problem. In particular, you want to estimate how good the algorithm will be when it processes new data (data outside the set that was used for training).

"The **no free lunch theorem** for machine learning {cite}`wolpert_free_1997` states that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points" From {cite}`goodfellow_deep_2016`, Pag. 113

## Can I say that my algorithm is the best?

"In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95 percent confidence interval for the error of algorithm A is less than the lower bound of the 95 percent confidence interval for the error of algorithm B."
From {cite}`goodfellow_deep_2016`, Pag. 125
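
The criterion in the quote amounts to checking that two intervals do not overlap. Below is a minimal sketch, assuming per-fold error values for each algorithm and a normal-approximation interval (mean plus or minus 1.96 standard errors). The function names and the error values are hypothetical and only illustrate the comparison.

```python
import numpy as np


def confidence_interval(errors, z=1.96):
    """Return (lower, upper) of the normal-approximation interval for the mean error."""
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean()
    # Standard error of the mean: sample standard deviation over sqrt(n).
    se = errors.std(ddof=1) / np.sqrt(len(errors))
    return mean - z * se, mean + z * se


def is_better(errors_a, errors_b):
    """True if A's upper bound lies below B's lower bound (lower error is better)."""
    _, upper_a = confidence_interval(errors_a)
    lower_b, _ = confidence_interval(errors_b)
    return upper_a < lower_b


# Hypothetical per-fold errors, made up for illustration only.
errors_a = [0.10, 0.12, 0.11, 0.09, 0.10]
errors_b = [0.20, 0.22, 0.19, 0.21, 0.20]
print(is_better(errors_a, errors_b))
```

If the two intervals overlap, the function returns `False` in both directions, which reads as "no claim can be made" rather than "the algorithms are equal".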

TODO

Add plot with an example to understand the statement visually.

So how can I estimate the out-of-sample error and the confidence interval (the dot and the bars)?

We can use the following code:

```python
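# reg, X, y and k are defined in the code cells of this notebook
scores = cross_val_score(reg, X, y, cv=k, scoring='neg_mean_squared_error')
se = scores.std(ddof=1) / len(scores) ** 0.5  # standard error of the mean fold score
print("Negative MSE: %0.2f (+/- %0.2f)" % (scores.mean(), 1.96 * se))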
```



To understand the explanation we need two theoretical results regarding the statistic $\bar{X}$, which is an unbiased estimator of the mean $\mu$.

- $\hat{\mu}$: the mean of the scores
- $\hat{SE}$: the estimated standard error

95% confidence interval: performance = $\hat{\mu} \pm 1.96\,\hat{SE}$
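
The two results are not spelled out in this notebook. A plausible reading, stated here as an assumption, is that they are the standard facts about the sample mean $\bar{X}$ of $n$ independent draws with mean $\mu$ and variance $\sigma^2$:

$$
\mathbb{E}[\bar{X}] = \mu, \qquad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}
$$

Under that reading, the standard error is estimated from the sample standard deviation $s$ of the scores, and the interval follows from an approximately normal $\bar{X}$:

$$
\hat{SE} = \frac{s}{\sqrt{n}}, \qquad \hat{\mu} \pm 1.96\,\hat{SE}
$$

With k-fold cross-validation, $n$ would be the number of folds $k$. The fold scores share training data and are therefore not truly independent, so the interval should be read as an approximation.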

## Cross-validation

>..."**cross-validation** can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as **model assessment**, whereas the process of selecting the proper level of flexibility for a model is known as **model selection**."
>
> From {cite}`james_introduction_2014`, Pag. 175

There are several types of **cross-validation** procedures:
- k-fold cross-validation
- leave-one-out cross-validation
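
To make the difference between the two procedures concrete, here is a small sketch using scikit-learn's `KFold` and `LeaveOneOut` splitters on eight toy samples. The toy data and the choice of four folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Eight toy samples; the values are irrelevant, only the indices matter here.
X = np.arange(8).reshape(-1, 1)

# k-fold: split the data into k folds, each used once as the test set.
kfold_splits = list(KFold(n_splits=4).split(X))
# Leave-one-out: the special case where every test set holds a single sample.
loo_splits = list(LeaveOneOut().split(X))

for train_idx, test_idx in kfold_splits:
    print("k-fold  train:", train_idx, "test:", test_idx)

print("k-fold splits:", len(kfold_splits))
print("leave-one-out splits:", len(loo_splits))
```

Leave-one-out fits the model once per sample, so it becomes expensive on large datasets, while k-fold keeps the number of fits fixed at k.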

Now that we have our cross-validation scores and the estimates of the standard errors, **how should we select the best hyperparameter for our model?**
We can follow the **one-standard-error rule**:

>..."if we repeated cross-validation using a different set of cross-validation folds, then the precise model with the lowest estimated test error would surely change. In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve."
>
>From {cite}`james_introduction_2014` Pag. 214
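
The rule in the quote can be sketched in a few lines. The function name and the error values below are hypothetical. The sketch assumes the candidate models are ordered from simplest to most complex, and that the threshold is the lowest error plus the standard error of that same model.

```python
import numpy as np


def one_standard_error_rule(mean_errors, standard_errors):
    """Index of the simplest model whose error is within one SE of the minimum.

    Assumes the models are ordered from simplest to most complex.
    """
    mean_errors = np.asarray(mean_errors, dtype=float)
    standard_errors = np.asarray(standard_errors, dtype=float)
    best = np.argmin(mean_errors)
    threshold = mean_errors[best] + standard_errors[best]
    # First (simplest) model whose estimated error does not exceed the threshold.
    return int(np.argmax(mean_errors <= threshold))


# Hypothetical cross-validated MSE and standard errors for model sizes 1..5.
mean_errors = [10.0, 6.0, 5.2, 5.0, 5.1]
standard_errors = [0.9, 0.5, 0.4, 0.3, 0.4]
selected = one_standard_error_rule(mean_errors, standard_errors)
print("selected model size:", selected + 1)
```

Here the lowest error belongs to model size 4, but size 3 is selected because its error lies within one standard error of that minimum.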

>..."One problem is that no unbiased estimators of the variance of such average error estimator exist {cite}`bengio_unbiased_2004`, but approximations are typically used."
>
> From {cite}`goodfellow_deep_2016` Pag. 119
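
The alternative mentioned in the TL;DR, integrating the hyperparameter selection into the algorithm and then cross-validating the result (nested cross-validation), can be sketched with scikit-learn's `GridSearchCV` wrapped in `cross_val_score`. The `Ridge` model, the `alpha` grid and the fold counts are assumptions chosen for illustration, not choices made in this notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Inner loop: GridSearchCV picks alpha by cross-validation on each training split.
inner_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="neg_mean_squared_error",
)

# Outer loop: cross_val_score treats the whole search as one estimator
# and scores it on data the inner search never saw.
outer_scores = cross_val_score(
    inner_search, X, y, cv=5, scoring="neg_mean_squared_error"
)

standard_error = outer_scores.std(ddof=1) / len(outer_scores) ** 0.5
print("Negative MSE: %0.2f (+/- %0.2f)" % (outer_scores.mean(), 1.96 * standard_error))
```

Because the outer test folds are never used to choose `alpha`, the reported score is not biased by the selection step.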

## Discussions
This topic can be confusing and tricky to implement.

A small mistake: they say 95% confidence interval but use 2 (which gives ~95.45%); they should use 1.96 instead (for 95%):
https://github.com/scikit-learn/scikit-learn/issues/1940

In the first implementation, someone got confused and divided by the number of samples:
https://github.com/scikit-learn/scikit-learn/issues/6059
https://github.com/scikit-learn/scikit-learn/pull/6072



## References

```{bibliography} ./references.bib
:filter: docname in docnames
```

In [None]:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()
X = diabetes.data[:16]
y = diabetes.target[:16]
reg = LinearRegression()

k = 4
scores = cross_val_score(reg, X, y, cv=k,scoring='neg_mean_squared_error')
scores

array([ -22862.90680958,  -36520.18051482, -124486.1325756 ,
         -2205.68974692])

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

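# Contiguous, unshuffled folds, as produced by KFold (what cv=k uses for a regressor)
inds = list(range(len(X)))
fold_length = len(inds) // k  # assumes len(X) is divisible by k

scores = []

for i in range(0, len(inds), fold_length):
    # Hold out one contiguous block as the test fold and train on the rest
    test_inds = inds[i:i + fold_length]
    train_inds = np.delete(inds, test_inds)

    reg.fit(X[train_inds], y[train_inds])

    y_pred = reg.predict(X[test_inds])
    y_true = y[test_inds]

    # Negate the MSE to match scoring='neg_mean_squared_error'
    scores.append(-mean_squared_error(y_true, y_pred))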

scores = np.array(scores)
scores

array([ -22862.90680958,  -36520.18051482, -124486.1325756 ,
         -2205.68974692])

In [None]:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()
X = diabetes.data[:16]
y = diabetes.target[:16]
reg = LinearRegression()

k = 4
scores = cross_val_score(reg, X, y, cv=k)
scores

array([ -9.5593196 , -37.69944289, -12.77935441,  -2.11896031])

In [None]:
from sklearn.metrics import r2_score
import numpy as np

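# Same manual k-fold loop, now scored with R^2 (the default score of LinearRegression,
# which cross_val_score uses when no scoring argument is given)
inds = list(range(len(X)))
fold_length = len(inds) // k  # assumes len(X) is divisible by k

scores = []

for i in range(0, len(inds), fold_length):
    # Hold out one contiguous block as the test fold and train on the rest
    test_inds = inds[i:i + fold_length]
    train_inds = np.delete(inds, test_inds)

    reg.fit(X[train_inds], y[train_inds])

    y_pred = reg.predict(X[test_inds])
    y_true = y[test_inds]

    scores.append(r2_score(y_true, y_pred))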

scores = np.array(scores)
scores

array([ -9.5593196 , -37.69944289, -12.77935441,  -2.11896031])