# Evaluate the performance of ML algos with Resampling

In the lecture we discussed the basic principles of static TVT splitting.

There are clever techniques from statistics called "**resampling**" methods: they allow you to make accurate estimates of how well your algorithm performs. A decent and more formal definition: a resampling method involves repeatedly drawing samples from a training dataset and refitting a model on each sample to obtain additional information about that model.

Note that in resampling we do not lose data for training! In fact, IMPORTANT: once we have estimated the performance of our algorithm, **we can then re-train the final algorithm on the entire training dataset** and get it ready for operational use. Only at that point do you check the model on the test data.
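The workflow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the dataset used in this notebook), and the names `X_demo`, `y_demo` and `final_model` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption, not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 1) Put the test set aside; it is not touched during resampling
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# 2) Estimate performance by resampling the training set only
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=300), X_tr, y_tr, cv=cv)
print("CV estimate: %.3f (+/- %.3f)" % (cv_scores.mean(), cv_scores.std()))

# 3) Re-train the final model on the entire training set
final_model = LogisticRegression(max_iter=300).fit(X_tr, y_tr)

# 4) Only now check the final model on the test data
test_accuracy = final_model.score(X_te, y_te)
print("Test accuracy: %.3f" % test_accuracy)
```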

We are going to look at 4 different techniques that we can use to split up our training dataset and create useful estimates of performance for our ML algorithms:

1. Split into Train and Test Sets
1. k-fold Cross-Validation
1. Leave One Out Cross-Validation
1. Repeated Random Test-Train Splits

## 0. Import the data

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AMLBas2324/main/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

## 1. Split into Train and Test Sets

The simplest method that we can use to evaluate the performance of a ML algorithm is to separate the dataset, and use (at least) different training and testing datasets (e.g. 2/3 and 1/3, but choices may vary).

This algorithm evaluation technique is very fast, and has pros and cons:
* _Pro_. It is **ideal for large datasets** (millions of records): splitting a large dataset into largish sub-datasets allows that 1) each split of the data is **not too tiny**, and 2) both are **representative** of the underlying problem. Because of the speed that this choice brings, it is useful to use this approach when the algorithm you are investigating is slow to train.
* _Con_. **High variance**. This means that differences in the training and test dataset can result in meaningful differences in the estimate of accuracy.





In [None]:
from sklearn.model_selection import train_test_split  # <--- note from where we are taking this module...
from sklearn.linear_model import LogisticRegression

In [None]:
array = data.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
test_size = 0.33
seed = 2

NOTE: importance of the random seed (read later).

IMPORTANT: before running the next cell, notice that we put a magic function at the start, and try to guess why.

In [None]:
%%time
# Evaluate using a train and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
#model = LogisticRegression()
model = LogisticRegression(solver='lbfgs', max_iter=300)
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))

We should see that the estimated accuracy for the model is approximately 70-80% (NOTE: depends on e.g. seed, etc).


**Importance of the random seed**. Note that in addition to specifying the size of the split, we also specify the random seed. Because the split of the data is random, we want to ensure that the results are reproducible. By specifying the random seed we ensure that we get the same random numbers each time we run the code and in turn the same split of data. This is important if we want to compare this result to the estimated accuracy of another ML algorithm or the same algorithm with a different configuration. To ensure the comparison was apples-for-apples, we must ensure that they are trained and tested on exactly the same data.
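A quick way to convince yourself of this is to split a toy array of row indices twice with the same seed and once with a different one. The array `indices` and the variable names below are assumptions made only for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(20)  # toy stand-in for the row indices of a dataset

# Same seed twice: the two splits are identical
_, test_a = train_test_split(indices, test_size=0.33, random_state=2)
_, test_b = train_test_split(indices, test_size=0.33, random_state=2)

# Different seed: the split will in general be different
_, test_c = train_test_split(indices, test_size=0.33, random_state=3)

print("seed 2:      ", np.sort(test_a))
print("seed 2 again:", np.sort(test_b))
print("seed 3:      ", np.sort(test_c))
print("same seed, same split:", np.array_equal(test_a, test_b))
```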

### <font color='red'>Exercise 1</font>

Try to change the seed, and re-train. Does the accuracy change? Is it reproducible for a fixed seed? For different seeds, could you measure its variance? (This is up to your curiosity: a few tries are enough to get a feeling, but you can also do more extensive and clever tests.)

In [None]:
# type your code below

---

### <font color='red'>Exercise 2</font>

What happens if I check accuracy on the _train_ set (conceptually wrong)? Do I see something different or not? What is the drawback if I make this mistake?

In [None]:
# type your code below

---

### <font color='red'>Exercise 3</font>

What if you change the training/test ratio?

In [None]:
# type your code below

---

## 2. K-fold Cross-Validation

If you have done the exercises above, you might have experienced that you are open to quite some variance with a single train-test set split (e.g. an accuracy with an error, presumably largish). **Cross-validation** is an approach that you can use to estimate the performance of a ML algorithm with less variance than a single train-test set split.

It works by **splitting the dataset into k-parts** (e.g. $k=5$ or $k=10$). Each split of the data is called a $fold$. The algorithm is trained on $k-1$ folds (with 1 held back), and then tested on the held-back fold. This is also repeated, so that _each_ fold of the dataset is given a chance to be the held-back test set. So you repeat it k times. After running cross-validation you end up with $k$ different performance scores that you can summarize using a mean and a standard deviation.
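To make the procedure concrete, here is a minimal sketch that writes out the loop by hand instead of relying on a helper function. It assumes synthetic data from `make_classification` (not the notebook's dataset), and `times_tested` is a hypothetical counter added only to check that each sample is held back exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=2)
fold_scores = []
times_tested = np.zeros(len(X_demo), dtype=int)

for train_idx, test_idx in kf.split(X_demo):
    # Train on k-1 folds, test on the held-back fold
    clf = LogisticRegression(max_iter=300)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))
    times_tested[test_idx] += 1

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean: %.3f, std: %.3f" % (fold_scores.mean(), fold_scores.std()))
print("Every sample tested exactly once:", bool(np.all(times_tested == 1)))
```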

The result is a more reliable estimate of the performance of the algorithm on new data. It is **more accurate** because the algorithm is trained and evaluated multiple times on different data.

**The choice of $k$ is a trade-off** between reasonably large size of each test partition, and a number that allows enough repetitions of the train-test evaluation of the algorithm.

**$k$ values of $3$, $5$ and $10$ are common** (at least for modest-size datasets in the thousands or tens of thousands of records). In the example below we use 5-fold cross-validation.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score   # <---
from sklearn.linear_model import LogisticRegression

In [None]:
%%time
# Evaluate using Cross Validation
num_folds = 5
seed = 2

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#model = LogisticRegression()
model = LogisticRegression(solver='lbfgs', max_iter=300)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

You can see that we report both the mean and the standard deviation of the performance measure.


This method, at least in this example, is not enormously better than the previous one. But if you think about and work through the exercises suggested below, you might revise this idea a bit. Also, consider that getting about the same accuracy with a decent standard deviation, without having to isolate and never use e.g. 30% of your dataset, is of non-negligible value in itself!

Some additional considerations:

* often a model takes a really long time to train, so k-fold CV is computationally expensive, and train/test splitting is easier

* there is actually a spectrum of CV techniques, from k=2 to k=n. k=2 is still k-fold CV, and it is the closest to the basic train/test splitting: a 50/50 split in which the two halves then swap roles. k=n is also k-fold CV, but it becomes equivalent to Leave-One-Out cross validation (see next)! The choice between the two ends of this spectrum is a bias vs variance tradeoff, and there is no inherently "right" a-priori choice, though there will be plenty of approaches that will eventually perform better empirically.
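The two ends of this spectrum can be inspected directly on a toy array. The sketch below assumes 6 made-up samples in `X_toy` and no shuffling, so the folds are formed in index order.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(12).reshape(6, 2)  # 6 toy samples (an assumption for this sketch)
n = len(X_toy)

# k = n: every test fold contains exactly one sample, as in leave-one-out
kfold_tests = [test.tolist() for _, test in KFold(n_splits=n).split(X_toy)]
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]
print("KFold(n_splits=n) test folds:", kfold_tests)
print("Same as LeaveOneOut:", kfold_tests == loo_tests)

# k = 2: each half of the data serves once as the test set
for train, test in KFold(n_splits=2).split(X_toy):
    print("train:", train, "test:", test)
```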

### <font color='red'>Exercise 4</font>

<div class="alert alert-block alert-info">
What if I change the number of folds?
</div>

In [None]:
# type your code below

### <font color='red'>Exercise 5</font>

Focus on the seeds only: try changing the seed in static splitting and in cross-validation, and see how the variability of the results changes.

In [None]:
# type your code below

## 3. Leave One Out Cross-Validation

It is a variation of cross-validation, actually one of its extremes. The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data.

You can configure cross-validation so that the size of the fold is 1 ($k=n$, i.e. $k$ is set to the number of observations in your dataset).

In [None]:
from sklearn.model_selection import LeaveOneOut       # <---
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [None]:
%%time
# Evaluate using Leave One Out Cross Validation
loocv = LeaveOneOut()
#model = LogisticRegression()
model = LogisticRegression(solver='lbfgs', max_iter=500)
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

(*NOTE: probably not so visible in this small example, but the time it took to run this is longer than for the previous one.*)

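You can see in the standard deviation that the score has **higher variance** than the k-fold cross-validation results described above. This is expected: each leave-one-out fold is tested on a single observation, so every per-fold score is either 0 or 1.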
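The size of that standard deviation has a simple origin, which the following sketch illustrates on synthetic data (an assumption, not the notebook's dataset). Since every leave-one-out score is a single right-or-wrong prediction, the standard deviation of the scores is fully determined by the mean accuracy $p$ as $\sqrt{p(1-p)}$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=60, n_features=8, random_state=0)

loo_scores = cross_val_score(LogisticRegression(max_iter=500), X_demo, y_demo, cv=LeaveOneOut())

p = loo_scores.mean()
print("Distinct fold scores:", np.unique(loo_scores))
print("Mean accuracy p: %.3f" % p)
print("Std of fold scores: %.3f" % loo_scores.std())
print("sqrt(p * (1 - p)):  %.3f" % np.sqrt(p * (1 - p)))
```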

## 4. Repeated Random Test-Train Splits

Another variation on k-fold cross-validation is to **create a random split of the data** like the train/test split described above, but **repeat multiple times the process of splitting and evaluation of the algorithm**, like cross-validation.

This has pros and cons:

* Pro: it has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross-validation. You can also repeat the process as many times as needed to improve the accuracy of the estimate.
* Con: A down side is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation.
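The redundancy mentioned in the last point can be made visible by counting how often each sample ends up in a test set. This is a small sketch on 30 toy samples; `count_test_appearances` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_toy = np.arange(30).reshape(-1, 1)  # 30 toy samples (an assumption for this sketch)

def count_test_appearances(splitter, X):
    """Count how many times each sample lands in a test set."""
    counts = np.zeros(len(X), dtype=int)
    for _, test_idx in splitter.split(X):
        counts[test_idx] += 1
    return counts

ss_counts = count_test_appearances(ShuffleSplit(n_splits=10, test_size=0.33, random_state=7), X_toy)
kf_counts = count_test_appearances(KFold(n_splits=10, shuffle=True, random_state=7), X_toy)

# Repeated random splits: samples are reused in the test set an uneven number of times
print("ShuffleSplit:", ss_counts)
# k-fold: every sample is tested exactly once
print("KFold:       ", kf_counts)
```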

The example below splits the data into a 67%/33% train/test split and repeats the process 100 times.

In [None]:
from sklearn.model_selection import ShuffleSplit      # <---
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [None]:
%%time
# Evaluate using Shuffle Split Cross Validation
n_splits = 100
test_size = 0.33
seed = 7

kfold = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
#model = LogisticRegression()
model = LogisticRegression(solver='lbfgs', max_iter=500)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

We should see that in this case the distribution of the performance measure is roughly on par with the "standard" k-fold cross-validation above.

## OK, fine, but.. what techniques to use when?!?

Discussion at the lecture.

There are some tips to consider what resampling technique to use in different circumstances.

* Generally k-fold cross-validation is the gold standard for evaluating the performance of a ML algorithm on unseen data with k set to 3, 5, or 10 (typically 10).
* Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
* Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.
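To get a feeling for these trade-offs, one option is to run the four techniques side by side and look at the number of model fits and the time each one takes. The sketch below assumes a small synthetic dataset; the dictionary `splitters`, its labels and the number of repetitions are choices made only for this illustration.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

# Small synthetic dataset (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

splitters = {
    "single split (1 repetition)": ShuffleSplit(n_splits=1, test_size=0.33, random_state=2),
    "10-fold CV": KFold(n_splits=10, shuffle=True, random_state=2),
    "leave-one-out": LeaveOneOut(),
    "repeated random splits": ShuffleSplit(n_splits=20, test_size=0.33, random_state=2),
}

summary = {}
for name, cv in splitters.items():
    start = time.perf_counter()
    scores = cross_val_score(LogisticRegression(max_iter=500), X_demo, y_demo, cv=cv)
    elapsed = time.perf_counter() - start
    # One score per split, i.e. one model fit per split
    summary[name] = (len(scores), scores.mean(), elapsed)
    print("%-28s fits=%3d  mean acc=%.3f  time=%.2fs" % (name, len(scores), scores.mean(), elapsed))
```

The timings depend on the machine and on the model, so they are best read as relative costs rather than absolute numbers.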

The best (only?) possible advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions. GOLDEN RULE: **if in doubt, just use 10-fold cross-validation**.

## Summary

What we did:

* we discovered 4 statistical techniques that we can use to estimate the performance of ML algorithms, called Resampling.

## What's next

Next we will see how you can evaluate the performance of classification and regression algorithms using a suite of different metrics and built-in evaluation reports.