<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/CrossValidation1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

# **Intro to Cross Validation (CV)**

When training a model the first thing we do is:
>divide the data into training and test sets

The model is trained on the training set. <br>
Then to test the accuracy of the model predictions, it is tested using data it has never seen before... the test set. 

Once we have a model that is close to our performance criteria, we can do hyperparameter tuning to select the best hyperparameters for the model. 

The problem with this approach is:
>We may now be fitting the hyperparameters to the test set. <br>
**This creates a risk of overfitting on the test set**, because the hyperparameters can be tweaked until the model performs optimally on that particular data. <br>

A common way to solve this problem is to hold out a further part of the data as a validation set. Train on the training set, tune the hyperparameters against the validation set, and, when ready, run the final evaluation on the test set.
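
A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions, the `random_state` values, and the variable names below are assumptions chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic data; the size is arbitrary and only for illustration
X, y = make_blobs(n_samples=100, random_state=1)

# First hold out 20% of all samples
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)

# Then split the remainder again: 25% of the remaining 80% is 20% of the total
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

# Only 60 of the 100 samples are left for training
print(X_train.shape, X_val.shape, X_test.shape)
```

The shapes make the first downside listed below concrete: every extra held-out set shrinks the data available for fitting the model.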

The downsides of this method are: 
1. Reducing the amount of data available for training the model. 
2. The results of the model performance are variable depending upon the random selection of the training and test sets. 

Both problems can be addressed by using **cross-validation (CV)**.

If we use Cross-validation, a validation set is no longer necessary. We need only the training and the test set. 
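
One way this can look in practice is sketched below, assuming scikit-learn's `GridSearchCV` is used for the tuning step. The hyperparameter grid for `C`, the choice of 5 folds, and `max_iter=1000` are illustrative assumptions. Cross-validation runs on the training set only, and the test set is touched once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data (sizes chosen only for illustration)
X, y = make_classification(n_samples=200, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)

# Hypothetical grid of candidate values for the regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold CV on the training set only
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best C:", search.best_params_["C"])
print("Mean CV accuracy: %.3f" % search.best_score_)
# The test set is used once, after the hyperparameters are chosen
print("Test accuracy: %.3f" % search.score(X_test, y_test))
```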

There are several cross-validation methods. <br>
We will look at the basic methods.<br>
You can do a deeper dive into CV by<br>
checking this [link](https://scikit-learn.org/stable/modules/cross_validation.html) <br>
Basic Cross Validation Methods we will explore in these two notebooks: <br>
- Holdout
- K-fold CV
- Leave One Out (LOOCV)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# **Holdout**
This is the basic method for testing models. <br>
1. Split the dataset into training and test sets.<br>
2. Train the model on the training set<br>
3. Test the model on the test set. 

**Create synthetic data**<br>
The dataset we are creating has 100 samples and three clusters<br>

In [None]:
# create a synthetic dataset to use for the train-test split examples
from sklearn.datasets import make_blobs
# create dataset (make_blobs generates three clusters by default)
X, y = make_blobs(n_samples=100)

**Train-test split**

Split the dataset into training and test sets: use 80% of the data for training and 20% for testing.<br>
Note the accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, 
                                                    random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Using the same data, do another 80/20 split, this time without a fixed `random_state`.<br>
Note the accuracy of the model this time.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

The accuracy changes depending on the train-test split, even when the split uses the same percentages.

The issue of the accuracy changing with the dataset split can be addressed using cross-validation techniques.
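
To see how much a holdout estimate can move, the sketch below repeats the 80/20 split with ten different seeds and reports the spread of the accuracies. The seeds, the `cluster_std=3.0` setting (used to make the clusters overlap), and the fixed `random_state` on the model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Overlapping clusters so that the task is not trivially separable
X, y = make_blobs(n_samples=100, cluster_std=3.0, random_state=1)

accuracies = []
for seed in range(10):
    # Same 80/20 proportions every time, only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    # Fix the model seed so the split is the only source of variation
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

print("min=%.3f max=%.3f std=%.3f"
      % (min(accuracies), max(accuracies), np.std(accuracies)))
```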

# **K-Fold Cross Validation**

K-fold CV uses the following algorithm: 
1. Randomly split the data into K equally sized parts
2. Select one of the K parts to be the test set and train the model on the remaining K-1 parts
3. Repeat K times, so that each part is used as the test set exactly once


In [None]:
from IPython.display import Image
Image("images/grid_search_cross_validation.png" , width=640)

KFold divides all the samples into k groups of samples, called folds, of equal sizes (if possible).<br>
 The prediction function is learned using k-1 folds, and the fold left out is used for testing.

The KFold general procedure:

- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
>Take the group as a hold out or test data set<br>
>Take the remaining groups as a training data set<br>
>Fit a model on the training set and evaluate it on the test set<br>
>Retain the evaluation score and discard the model<br>

Summarize the skill of the model using the sample of model evaluation scores
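
The procedure above can be written out by hand with NumPy, which may help show what `KFold` and `cross_val_score` automate. This is a minimal sketch, not scikit-learn's implementation; `k = 5`, the seed, and `max_iter=1000` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
k = 5  # number of folds, chosen arbitrarily for this sketch

# Shuffle the dataset randomly (by shuffling the row indices)
rng = np.random.default_rng(1)
indices = rng.permutation(len(X))
# Split the shuffled indices into k groups
folds = np.array_split(indices, k)

scores = []
for i in range(k):
    # Take one group as the test set and the remaining groups as training data
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit a fresh model on the training folds and evaluate it on the test fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Retain the score; the model is replaced on the next iteration
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Summarize the skill of the model using the collected scores
print("Accuracy: %.3f (%.3f)" % (np.mean(scores), np.std(scores)))
```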

The k value must be chosen carefully for your data sample.<br>

A poorly chosen value for k may give a misleading idea of the skill of the model, such as a score with high variance (one that may change a lot based on the data used to fit the model) or high bias (such as an overestimate of the skill of the model).<br>

Three common tactics for choosing a value for k are as follows:<br>

- Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.<br>
- k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.<br>
- k=n: The value for k is fixed to n, where n is the size of the dataset, to give each test sample an opportunity to be used in the hold out dataset. This approach is called leave-one-out cross-validation.<br>
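
The effect of k on the size of each train/test group can be checked directly. The sketch below assumes a dataset of n=100 samples with placeholder features, and the k values are picked only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n = 100
# Placeholder features: only the number of rows matters for the split sizes
X = np.zeros((n, 1))

sizes = {}
for k in (2, 5, 10, n):
    # Look at the first split only; with n divisible by k all folds are equal
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    sizes[k] = (len(train_idx), len(test_idx))
    print("k=%d: train size=%d, test size=%d" % (k, len(train_idx), len(test_idx)))
```

As k grows, each model is trained on more of the data but evaluated on fewer samples; at k=n every test set holds a single sample.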

With k=3, three models are trained and evaluated, with each fold given a chance to be the held-out test set.<br>

For example:<br>

Model1: Trained on Fold1 + Fold2, Tested on Fold3<br>
Model2: Trained on Fold2 + Fold3, Tested on Fold1<br>
Model3: Trained on Fold1 + Fold3, Tested on Fold2<br>
The models are then discarded after they are evaluated as they have served their purpose.<br>

The skill scores are collected for each model and summarized for use.<br>



In [None]:
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)

print(kf)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Set up the K-fold

In [None]:
dataIn=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

In [None]:
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

In [None]:
# enumerate splits
# the printed numbers are the indices of the elements, not their values
for train, test in kfold.split(dataIn):
	print('train: %s, test: %s' % (train, test))

In [None]:
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare cross validation
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# enumerate splits
for train, test in kfold.split(data):
	print('train: %s, test: %s' % (data[train], data[test]))

K-fold on a larger dataset

In [None]:
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

In [None]:
# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
for i in range(len(scores)):
  print("Test#:",i,"=",scores[i],"\n")
# report performance
print('Average of scores is Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

**Assignment**<br>
Increase the number of data samples to 1000<br>
Check the accuracy average of different k values (try 2-15)<br>
Report to the class your findings

The expectation is:<br>
low values of k will result in a noisy estimate of model performance <br>
large values of k will result in a better estimate of model performance.



# **Leave One Out (LOOCV)**

To calculate the best 'k', we would train the model on all available data and estimate the performance on a separate large and representative hold-out dataset.<br>
The performance on this hold-out dataset would represent the “true” performance of the model and any cross-validation performances on the training dataset would represent an estimate of this score.<br>

The KFold method works well if your dataset is large enough that a sizeable portion of it can be held out in each fold. 

If you do not have a large dataset, you can use Leave One Out CV. It is a computationally expensive version of cross-validation where k=N, and N is the total number of examples in the training dataset. <br>

Each sample in the training set is given an opportunity to be used alone as the test set. LOOCV is rarely used for large datasets because of the compute expense, but it provides a good estimate of model performance given the available data.
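
The k=N relationship can be checked with a short sketch: `LeaveOneOut()` and an unshuffled `KFold(n_splits=len(X))` should produce the same splits and therefore the same scores. The dataset size of 50 and `max_iter=1000` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small dataset: LOOCV fits one model per sample
X, y = make_classification(n_samples=50, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
model = LogisticRegression(max_iter=1000)

# One score per sample; each score is 0 or 1 because the test set is one sample
loo_scores = cross_val_score(model, X, y, scoring="accuracy", cv=LeaveOneOut())
# K-fold with k equal to the number of samples and no shuffling
kfold_scores = cross_val_score(model, X, y, scoring="accuracy",
                               cv=KFold(n_splits=len(X)))

print("Number of models fitted:", len(loo_scores))
print("LOOCV accuracy: %.3f" % np.mean(loo_scores))
print("KFold(k=N) accuracy: %.3f" % np.mean(kfold_scores))
```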

In [None]:
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

In [None]:
# create the dataset
#Change the dataset size here:
def get_dataset(n_samples=100):
	X, y = make_classification(n_samples=n_samples, n_features=20, 
                            n_informative=15, n_redundant=5, 
                            random_state=1)
	return X, y

# retrieve the model to be evaluated
def get_model():
	model = LogisticRegression()
	return model

In [None]:
# evaluate the model using a given test condition
def evaluate_model(cv):
	# get the dataset
	X, y = get_dataset()
	# get the model
	model = get_model()
	# evaluate the model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return scores
	return mean(scores), scores.min(), scores.max()

Running the example first reports the LOOCV, <br>
then the mean, min, and max accuracy for each k value that was evaluated.

In [None]:
# calculate the ideal test condition
ideal, _, _ = evaluate_model(LeaveOneOut())
print('Ideal: %.3f' % ideal)

In [None]:
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs = list(),list(),list()
# evaluate each k value
for k in folds:
	# define the test condition
	cv = KFold(n_splits=k, shuffle=True, random_state=1)
	# evaluate k value
	k_mean, k_min, k_max = evaluate_model(cv)
	# report performance
	print('> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
	# store mean accuracy
	means.append(k_mean)
	# store min and max relative to the mean
	mins.append(k_mean - k_min)
	maxs.append(k_max - k_mean)


In the plot below, the red line is the ideal accuracy (the LOOCV estimate).

In [None]:
# line plot of k mean values with min/max error bars
pyplot.errorbar(folds, means, yerr=[mins, maxs], fmt='o')
# plot the ideal case in a separate color
pyplot.plot(folds, [ideal for _ in range(len(folds))], color='r')
# show the plot
pyplot.show()

The plot shows that, for this model on this dataset, **most k values underestimate the performance of the model compared to the ideal case**. <br>
The results suggest that k=10 alone may be slightly optimistic and that k=13 might be a more accurate estimate.

**Assignment**<br>
Change the size of the dataset to a larger number. <br>
Rerun the LOOCV. <br>
Report to the class what happened<br>