<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cross-validation" data-toc-modified-id="Cross-validation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cross-validation</a></span></li><li><span><a href="#The-Validation-Set-Approach" data-toc-modified-id="The-Validation-Set-Approach-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Validation Set Approach</a></span><ul class="toc-item"><li><span><a href="#-LEAVE-ONE-OUT-CROSS-VALIDATION--LOOCV" data-toc-modified-id="-LEAVE-ONE-OUT-CROSS-VALIDATION--LOOCV-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><b> LEAVE ONE OUT CROSS VALIDATION- LOOCV</b></a></span></li></ul></li><li><span><a href="#K-Fold-Cross-Validation" data-toc-modified-id="K-Fold-Cross-Validation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>K-Fold Cross-Validation</a></span></li><li><span><a href="#REPEATED-K-FOLD-CROSS-VALIDATION" data-toc-modified-id="REPEATED-K-FOLD-CROSS-VALIDATION-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>REPEATED K-FOLD CROSS-VALIDATION</a></span></li><li><span><a href="#REPEATED-K-FOLD-CROSS-VALIDATION" data-toc-modified-id="REPEATED-K-FOLD-CROSS-VALIDATION-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>REPEATED K-FOLD CROSS-VALIDATION</a></span></li></ul></div>

In [40]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (train_test_split,LeaveOneOut,KFold,RepeatedKFold,
                                     cross_val_score)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR,SVC
from sklearn.metrics import make_scorer,mean_squared_error,r2_score
import sklearn
import pandas as pd

## Cross-validation

One of the most reliable ways to check how well a machine learning model generalizes is ***cross-validation***. **Cross-validation** refers to a set of methods for measuring the performance of a given predictive model. These methods can be computationally expensive because they fit the same model multiple times on different subsets of the training data. Cross-validation techniques generally involve the following process:

1.  Divide the available data set into two sets: a training set and a testing (validation) set.

2.  Train the model using the training set.

3.  Test the effectiveness of the model on the reserved (testing) sample of the data set and estimate the prediction error.

**Cross-validation methods for assessing model performance include:**

         Validation set approach (or data split)
         Leave One Out Cross Validation
         k-fold Cross Validation
         Repeated k-fold Cross Validation
         


## The Validation Set Approach

1.  Randomly divide the available data set into two parts: a training data set and a validation data set.

2.  Train the model on the training data set.

3.  Use the trained model to predict the observations in the validation set, and compute the prediction error with a model performance metric. This tests how well the model generalizes to new observations.

# Marketing Data Set
**Description**

The impact of three advertising media (youtube, facebook and newspaper) on sales. The data are the advertising budgets in thousands of dollars, along with the sales.

In [9]:
marketing=pd.read_csv('data/marking.csv',index_col=0)

In [10]:
marketing.shape[1]-1

3

In [11]:
print("The advertising dataset has "+
      str(marketing.shape[0])+ ' observations and '+str(marketing.shape[1]-1)
      +' features and a label named sales ')

The advertising dataset has 200 observations and 3 features and a label named sales 


In [12]:
marketing.head(4)

Unnamed: 0,youtube,facebook,newspaper,sales
1,276.12,45.36,83.04,26.52
2,53.4,47.16,54.12,12.48
3,20.64,55.08,83.16,11.16
4,181.8,49.56,70.2,22.2


In [126]:
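# Standardize the three feature columns (note: the scaler is fit on all rows, before the train/test split)
scale=StandardScaler()
features=scale.fit_transform(marketing.iloc[:,:3])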

In [127]:
features[0:3]

array([[ 0.96985227,  0.98152247,  1.77894547],
       [-1.19737623,  1.08280781,  0.66957876],
       [-1.51615499,  1.52846331,  1.78354865]])

In [128]:
X_train, X_test, y_train, y_test=train_test_split(features,marketing.iloc[:,-1],
                                                  test_size=0.3,random_state=50)

In [129]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((140, 3), (60, 3), (140,), (60,))

In [130]:
lrg=LinearRegression()
svm=SVR(kernel='linear')
lrg.fit(X_train,y_train)
svm.fit(X_train,y_train)

SVR(kernel='linear')

In [131]:
lrg.coef_,svm.coef_

(array([4.88659082, 3.21452514, 0.16418342]),
 array([[4.40824062, 3.57237032, 0.09602809]]))

In [132]:
lrg_predictions=lrg.predict(X_test)
svm_predictions=svm.predict(X_test)

In [133]:
compare=pd.DataFrame({"y-true":y_test,"y-predicted":lrg_predictions})
compare.head(10)

Unnamed: 0,y-true,y-predicted
113,16.92,16.637655
166,14.28,18.010867
13,11.04,12.726141
74,13.2,12.108743
145,13.68,12.246996
21,21.6,22.129354
200,16.08,18.427477
9,5.76,4.205481
40,25.8,24.685959
89,15.48,14.379462


In [134]:
MSE_svm=mean_squared_error(y_pred=svm_predictions,y_true=y_test)
MSE_svm

3.7105713808019356

In [135]:
MSE_lrg=mean_squared_error(y_pred=lrg_predictions,y_true=y_test)
RMSE=np.sqrt(MSE_lrg)
R2=r2_score(y_pred=lrg_predictions,y_true=y_test)

In [136]:
R2,MSE_lrg,RMSE

(0.859057555097746, 3.776792977082012, 1.9433972772138002)

Using RMSE, the prediction error rate is calculated by dividing the RMSE by the average value of the outcome variable. The result should be as small as possible.

In [137]:
RMSE/np.mean(marketing['sales'])

0.1154927959359244
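
As a hedged sketch of the ratio computed above, the helper below divides the RMSE by the mean of the observed outcome using NumPy only. The function name `prediction_error_rate` and the three-value arrays are hypothetical and chosen for illustration; they are not part of scikit-learn or of the marketing data.

```python
import numpy as np

def prediction_error_rate(y_true, y_pred):
    """Return RMSE divided by the mean of the observed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)

# Made-up values for illustration only (not from the marketing data)
y_true_demo = [10.0, 20.0, 30.0]
y_pred_demo = [12.0, 18.0, 30.0]
rate_demo = prediction_error_rate(y_true_demo, y_pred_demo)
print(rate_demo)
```

Read this way, the value of about 0.115 obtained on the marketing data means the typical prediction error is roughly 11.5% of average sales.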

# NOTE
The validation set approach is only useful when a large data set is available. Because the model is trained on only a subset of the data, it may miss patterns that are present only in the held-out data, which leads to higher bias. The estimate of the test error rate can also vary considerably, depending on precisely which observations end up in the training set and which in the validation set.
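
The dependence on the particular split can be illustrated with a minimal sketch. It assumes synthetic data from `make_regression` (not the marketing data) and refits the same linear model on ten different random 70/30 splits; only the `random_state` of the split changes between runs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, used here only so the sketch is self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=3,
                                 noise=10.0, random_state=0)

split_mse = []
for seed in range(10):
    # Same data and model; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    split_mse.append(mean_squared_error(y_te, model.predict(X_te)))

print("min MSE:", min(split_mse), "max MSE:", max(split_mse))
```

The gap between the smallest and largest MSE gives a rough sense of how much a single validation-set estimate can move from one split to another.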


<h3><b> LEAVE ONE OUT CROSS VALIDATION- LOOCV</b></h3>

LOOCV is a special case of k-fold cross-validation where the number of folds equals the number of instances in the data set. It involves splitting the data set into two parts. However, instead of creating two subsets of comparable size, only a single data point is reserved as the test set.
The model is trained on the training set, which consists of all the data points except the reserved point, and the test error is computed on the reserved data point. The process is repeated until each of the n data points has served as the test set, and the n test errors are then averaged.

***LeaveOneOut() is equivalent to KFold(n_splits=n)***
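
The equivalence stated above can be checked with a small sketch. It assumes a five-element toy array and an unshuffled `KFold`, and compares the train/test indices produced by both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.array([1, 3, 4, 5, 7])
n = len(X_demo)

# Collect (train indices, test indices) pairs from each splitter
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=n).split(X_demo)]

same_splits = loo_splits == kf_splits
print(same_splits)
```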

In [138]:
X=np.array([1,3,4,5,7])
loocv=LeaveOneOut()
for train,test in loocv.split(X):
    print("%s %s"%(X[train],X[test]))

[3 4 5 7] [1]
[1 4 5 7] [3]
[1 3 5 7] [4]
[1 3 4 7] [5]
[1 3 4 5] [7]


get_n_splits(X[, y, groups]): returns the number of splitting iterations in the cross-validator.
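
A short sketch of `get_n_splits`, assuming five made-up observations: for `LeaveOneOut` the count depends on the data passed in, while for `KFold` it is fixed by `n_splits`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(10).reshape(5, 2)  # 5 made-up observations, 2 features

n_loo = LeaveOneOut().get_n_splits(X_demo)     # one split per observation
n_kf = KFold(n_splits=3).get_n_splits(X_demo)  # fixed by n_splits
print(n_loo, n_kf)
```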

In [139]:
Y=np.array(marketing.iloc[:,-1])

In [140]:
loocv_obj=LeaveOneOut()
error=[]
for train_idx,test_idx in loocv_obj.split(features):
    X_train=features[train_idx]
    y_train=Y[train_idx]
    X_test=features[test_idx]
    y_test=Y[test_idx]
    lrg1=LinearRegression()
    lrg1.fit(X_train,y_train)
    pred=lrg1.predict(X_test)
    MSE=mean_squared_error(y_test,pred)  # argument order is (y_true, y_pred)
    error.append(MSE)
np.mean(error),np.sqrt(np.mean(error))

(4.243535712820084, 2.0599843962564583)

In [141]:
lrg1=LinearRegression()
loocv=LeaveOneOut()
mse=make_scorer(mean_squared_error)

In [142]:
#sorted(sklearn.metrics.get_scorer_names())  # sklearn.metrics.SCORERS was removed in scikit-learn 1.3

In [143]:
scores=cross_val_score(lrg1,features,Y,scoring=mse,cv=loocv)

print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(scores)) + 
      ", RMSE: " + str(np.sqrt(np.mean(scores))))

Folds: 200, MSE: 4.243535712820084, RMSE: 2.0599843962564583


In [144]:
# "neg_mean_squared_error" returns the negated MSE (higher is better), so the absolute value recovers the MSE
scores = cross_val_score(lrg1, features,Y, scoring="neg_mean_squared_error", cv=loocv,n_jobs=1)
print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(np.abs(scores))) +
       ", RMSE: " + str(np.sqrt(np.mean(np.abs(scores)))))

Folds: 200, MSE: 4.243535712820084, RMSE: 2.0599843962564583




## K-Fold Cross-Validation 

In practice, if we have enough data, we set aside part of the data set, known as the validation set, and use it to measure the performance of the model's predictions. Since data are often scarce, this is usually not possible, and the best practice in such situations is to use **K-fold cross-validation**.

**K-fold cross-validation** 

1. Randomly split the data set into k subsets (or k folds).
2. Train the model on k-1 subsets.
3. Test the model on the reserved subset and record the prediction error.
4. Repeat this process until each of the k subsets has served as the test set.
5. Average the k validation scores. This average is used as the validation score for the model and is known as the cross-validation error.
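
The five steps above can be sketched by hand with NumPy only. This is an illustrative sketch under stated assumptions, not how scikit-learn implements `KFold`: the data are made up, the model is ordinary least squares fitted with `np.linalg.lstsq`, and k = 5 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = 2*x1 - x2 + small noise
X_demo = rng.normal(size=(30, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=30)

k = 5
indices = rng.permutation(len(X_demo))  # step 1: shuffle the row indices
folds = np.array_split(indices, k)      #         and cut them into k subsets

fold_mse = []
for i in range(k):                      # step 4: each fold is the test set once
    test_idx = folds[i]
    # step 2: train on the remaining k-1 subsets
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Ordinary least squares with an intercept column
    A_train = np.column_stack([np.ones(len(train_idx)), X_demo[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y_demo[train_idx], rcond=None)

    # step 3: prediction error on the reserved subset
    A_test = np.column_stack([np.ones(len(test_idx)), X_demo[test_idx]])
    pred = A_test @ coef
    fold_mse.append(np.mean((y_demo[test_idx] - pred) ** 2))

cv_error = np.mean(fold_mse)            # step 5: average over the k folds
print(cv_error)
```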



In [145]:
X=np.array([1,3,4,5,7])
kcv=KFold(n_splits=3)
for train,test in kcv.split(X):
    print("%s %s"%(X[train],X[test]))

[4 5 7] [1 3]
[1 3 7] [4 5]
[1 3 4 5] [7]


In [146]:
cv4=KFold(n_splits=4)

In [147]:
scores = cross_val_score(lrg1, features,Y, scoring="neg_mean_squared_error", cv=cv4,n_jobs=1)
print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(np.abs(scores))) +
       ", RMSE: " + str(np.sqrt(np.mean(np.abs(scores)))))

Folds: 4, MSE: 4.280873800694744, RMSE: 2.0690272595339927


##  REPEATED K-FOLD CROSS-VALIDATION

The process of splitting the data into k folds can be repeated a number of times. This is called repeated k-fold cross-validation.

n_splits: the number of folds.

n_repeats: for repeated k-fold cross-validation only, the number of complete sets of folds to compute.
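
As a small sketch of how the two parameters combine in scikit-learn's `RepeatedKFold`, assuming six made-up observations: the total number of model fits is the number of folds times the number of repeats, and each observation lands in a test fold once per repeat.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.arange(12).reshape(6, 2)  # 6 made-up observations

rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
total_fits = rkf.get_n_splits(X_demo)  # n_splits * n_repeats

# Count how often each observation is used for testing
test_counts = np.zeros(len(X_demo), dtype=int)
for _, test_idx in rkf.split(X_demo):
    test_counts[test_idx] += 1

print(total_fits, test_counts)
```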

In [148]:
X=np.array([1,3,4,5,7])
rkcv=RepeatedKFold(n_splits=3,n_repeats=2)
for train,test in rkcv.split(X):
    print("%s %s"%(X[train],X[test]))

[3 4 5] [1 7]
[1 5 7] [3 4]
[1 3 4 7] [5]
[1 3 7] [4 5]
[1 4 5] [3 7]
[3 4 5 7] [1]


In [149]:
r2cv4 = RepeatedKFold(n_splits=4, n_repeats=2, random_state=40)

In [150]:
scores = cross_val_score(lrg1, features,Y, scoring="neg_mean_squared_error", cv=r2cv4 ,n_jobs=1)
print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(np.abs(scores))) +
       ", RMSE: " + str(np.sqrt(np.mean(np.abs(scores)))))

Folds: 8, MSE: 4.238083749539515, RMSE: 2.0586606688669007


# Cross-Validation on Classification Problems 

In [41]:
from sklearn.datasets import load_iris

In [151]:
x_iris,y_iris=load_iris(return_X_y=True)

In [152]:
x_iris=StandardScaler().fit_transform(x_iris)

# K-FOLD CROSS-VALIDATION

In [153]:
clr=SVC(kernel='linear')

In [154]:
clr_scores=cross_val_score(clr,x_iris,y_iris,scoring='accuracy',cv=5)

In [155]:
clr_scores

array([0.96666667, 1.        , 0.93333333, 0.93333333, 1.        ])

In [156]:
print(f"{np.round(clr_scores.mean(),2)} accuracy with a standard deviation of {np.round(clr_scores.std(),2)}")

0.97 accuracy with a standard deviation of 0.03
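
A note on what an integer `cv` does here: according to the scikit-learn documentation, when the estimator is a classifier and the target is binary or multiclass, an integer `cv` is handled with `StratifiedKFold`, which keeps class proportions similar across folds. The sketch below, which reloads iris so that it is self-contained, checks that both calls give the same fold scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)
clf = SVC(kernel="linear")

# Integer cv versus an explicit, unshuffled StratifiedKFold
scores_default = cross_val_score(clf, X_demo, y_demo, scoring="accuracy", cv=5)
scores_stratified = cross_val_score(clf, X_demo, y_demo, scoring="accuracy",
                                    cv=StratifiedKFold(n_splits=5))

same_scores = np.allclose(scores_default, scores_stratified)
print(same_scores)
```

By contrast, `RepeatedKFold` does not stratify; `RepeatedStratifiedKFold` is the stratified counterpart.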


##  REPEATED K-FOLD CROSS-VALIDATION

In [157]:
# 5 folds repeated twice (named r2cv5 to avoid confusion with the 4-fold splitter used earlier)
r2cv5 = RepeatedKFold(n_splits=5, n_repeats=2, random_state=40)
clr_scores_repeat=cross_val_score(clr,x_iris,y_iris,scoring='accuracy',cv=r2cv5)

In [158]:
print(f"{np.round(clr_scores_repeat.mean(),2)} accuracy with a standard deviation of {np.round(clr_scores_repeat.std(),2)}")

0.97 accuracy with a standard deviation of 0.02
