<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cross-validation" data-toc-modified-id="Cross-validation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cross-validation</a></span></li><li><span><a href="#The-Validation-Set-Approach" data-toc-modified-id="The-Validation-Set-Approach-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Validation Set Approach</a></span></li><li><span><a href="#LEAVE-ONE-OUT-CROSS-VALIDATION--LOOCV" data-toc-modified-id="LEAVE-ONE-OUT-CROSS-VALIDATION--LOOCV-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>LEAVE ONE OUT CROSS VALIDATION- LOOCV</a></span></li><li><span><a href="#K-Fold-Cross-Validation" data-toc-modified-id="K-Fold-Cross-Validation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>K-Fold Cross-Validation</a></span></li><li><span><a href="#REPEATED-K-FOLD-CROSS-VALIDATION" data-toc-modified-id="REPEATED-K-FOLD-CROSS-VALIDATION-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>REPEATED K-FOLD CROSS-VALIDATION</a></span></li></ul></div>

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,LeaveOneOut,KFold,RepeatedKFold
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer,mean_squared_error,r2_score
import sklearn
import pandas as pd

## Cross-validation 
***Cross-validation*** is one of the best techniques for checking how well a machine learning model generalizes. It is a resampling approach and can be computationally expensive, because it involves fitting the same model multiple times on different subsets of the training data. **Cross-validation** refers to a set of methods for measuring the performance of a given predictive model. The general procedure is:

1. Divide the available data set into two sets: a training set and a testing (validation) set.

2. Train the model on the training set.

3. Test the effectiveness of the model on the reserved (testing) sample and estimate the prediction error.

**Cross-validation methods for assessing model performance:**

1. Validation set approach (or data split)
2. Leave-one-out cross-validation (LOOCV)
3. k-fold cross-validation
4. Repeated k-fold cross-validation


## The Validation Set Approach

1. Randomly divide the available data set into two parts: a training set and a validation set.
    
2. Train the model on the training set.
    
3. Use the trained model to predict the observations in the validation set, and compute the prediction error with a model performance metric. This tests how well the model generalizes to new observations.

# Marketing Data Set
**Description**

 The impact of three advertising media (youtube, facebook and newspaper) on sales. The data are the advertising budgets in thousands of dollars, along with the sales.

In [2]:
marketing=pd.read_csv('data/marking.csv',index_col=0)

In [3]:
print("The advertising experiment has dataset "+
      str(marketing.shape[0])+ ' observations and '+str(marketing.shape[1])+' features')

The advertising experiment has dataset 200 observations and 4 features


In [4]:
marketing.head(4)

Unnamed: 0,youtube,facebook,newspaper,sales
1,276.12,45.36,83.04,26.52
2,53.4,47.16,54.12,12.48
3,20.64,55.08,83.16,11.16
4,181.8,49.56,70.2,22.2


In [5]:
scale=StandardScaler()
features=scale.fit_transform(marketing.iloc[:,:3])

In [6]:
features[0:3]

array([[ 0.96985227,  0.98152247,  1.77894547],
       [-1.19737623,  1.08280781,  0.66957876],
       [-1.51615499,  1.52846331,  1.78354865]])

In [7]:
X_train, X_test, y_train, y_test=train_test_split(features,marketing.iloc[:,-1],
                                                  test_size=0.3,random_state=50)

In [8]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((140, 3), (60, 3), (140,), (60,))

In [9]:
lrg=LinearRegression()
svm=SVR(kernel='linear')

In [10]:
lrg.fit(X_train,y_train)
svm.fit(X_train,y_train)

SVR(kernel='linear')

In [11]:
lrg.coef_,svm.coef_

(array([4.88659082, 3.21452514, 0.16418342]),
 array([[4.40824062, 3.57237032, 0.09602809]]))

In [12]:
lrg_predictions=lrg.predict(X_test)
svm_predictions=svm.predict(X_test)

In [13]:
MSE_svm=mean_squared_error(y_pred=svm_predictions,y_true=y_test)
MSE_svm

3.7105713808019356

In [14]:
MSE_lrg=mean_squared_error(y_pred=lrg_predictions,y_true=y_test)
RMSE=np.sqrt(MSE_lrg)
R2=r2_score(y_pred=lrg_predictions,y_true=y_test)

In [15]:
R2,MSE_lrg,RMSE

(0.8590575550977458, 3.776792977082013, 1.9433972772138004)

The prediction error rate is computed by dividing the RMSE by the mean of the outcome variable. The smaller this ratio, the better.
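Written out, the ratio looks as follows. The symbols are notation assumed here rather than taken from the notebook: $y_i$ are the observed values, $\hat{y}_i$ the predictions, $m$ the number of validation observations, and $\bar{y}$ the mean of the outcome variable.

$$
\text{error rate} = \frac{\mathrm{RMSE}}{\bar{y}},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}
$$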

In [16]:
RMSE/np.mean(marketing['sales'])

0.11549279593592442

# NOTE
The validation set approach is only useful when a large data set is available. Because the model is trained on only a subset of the data, it may fail to capture patterns that are present only in the held-out data, which leads to higher bias. The estimate of the test error rate can also be highly variable, depending on precisely which observations fall in the training set and which fall in the validation set.
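The variability mentioned in the note can be illustrated with a minimal sketch. It uses NumPy only and a small synthetic data set; the data, the `holdout_mse` helper and the seeds are assumptions made for illustration and are not the marketing data or the models used in this notebook.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3*x + noise
rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])  # intercept column + one feature


def holdout_mse(seed, test_fraction=0.3):
    """Fit least squares on one random training split and return the validation MSE."""
    perm = np.random.default_rng(seed).permutation(n)
    n_test = int(round(test_fraction * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    residuals = y[test_idx] - X[test_idx] @ beta
    return float(np.mean(residuals ** 2))


# Same data, same model, ten different random splits
mses = [holdout_mse(seed) for seed in range(10)]
print("min MSE:", min(mses), "max MSE:", max(mses))
```

The spread between the smallest and largest value comes only from which observations land in the validation set, which is the motivation for the resampling methods that follow.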


## LEAVE ONE OUT CROSS VALIDATION- LOOCV

Leave out one data point and build the model on the rest of the data set. Each of the n samples is used once as the test set while the remaining n-1 samples form the training set. The LOOCV estimate of the test error is the average of these n test error estimates.
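As a formula, with notation assumed here for illustration ($\hat{y}_i$ is the prediction for observation $i$ from the model fitted without that observation):

$$
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i,
\qquad
\mathrm{MSE}_i = \left(y_i - \hat{y}_i\right)^2
$$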

***`LeaveOneOut()` is equivalent to `KFold(n_splits=n)`, where n is the number of samples.***
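The equivalence can be checked directly on a tiny array. This is only a sketch: `X_demo` is a hypothetical 5-sample array, and `KFold` is used with its default `shuffle=False`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(10).reshape(5, 2)  # 5 hypothetical samples, 2 features
n = len(X_demo)

# Collect (train indices, test indices) for both splitters
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kfold_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=n).split(X_demo)]

# Both splitters hold out exactly one sample per iteration, in the same order
print(loo_splits == kfold_splits)
```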

In [17]:
X=np.array([[1,3],[4,5],[10,7]])
y=np.array([1,3,2])

In [18]:
loocv=LeaveOneOut()

`get_n_splits(X[, y, groups])` returns the number of splitting iterations in the cross-validator.

In [19]:
loocv.get_n_splits(X)

3

In [20]:
for train_idx,test_idx in loocv.split(X):
    print('train indx',train_idx,'test idx',test_idx)
    train_x,test_x=X[train_idx],X[test_idx]
    y_train,y_test=y[train_idx],y[test_idx]
    print(train_x,y_train,'\n \n',test_x,y_test)

train indx [1 2] test idx [0]
[[ 4  5]
 [10  7]] [3 2] 
 
 [[1 3]] [1]
train indx [0 2] test idx [1]
[[ 1  3]
 [10  7]] [1 2] 
 
 [[4 5]] [3]
train indx [0 1] test idx [2]
[[1 3]
 [4 5]] [1 3] 
 
 [[10  7]] [2]


In [21]:
Y=np.array(marketing.iloc[:,-1])

In [22]:
loocv_obj=LeaveOneOut()

In [23]:
error=[]
for train_idx,test_idx in loocv_obj.split(features):
    # hold out a single observation and train on the remaining n-1
    X_train=features[train_idx]
    y_train=Y[train_idx]
    X_test=features[test_idx,:]
    y_test=Y[test_idx]
    lrg1=LinearRegression()
    lrg1.fit(X_train,y_train)
    pred=lrg1.predict(X_test)
    # squared error on the held-out observation (y_true first, then y_pred)
    MSE=mean_squared_error(y_test,pred)
    error.append(MSE)

In [24]:
np.mean(error),np.sqrt(np.mean(error))

(4.2435357128200835, 2.059984396256458)

In [25]:
lrg1=LinearRegression()
mse=make_scorer(mean_squared_error)

In [26]:
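#sorted(sklearn.metrics.get_scorer_names())  # lists the built-in scoring names; the older SCORERS dict was removed from recent scikit-learn releases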

In [27]:
scores=cross_val_score(lrg1,features,Y,scoring=mse,cv=loocv)

print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(scores)) + 
      ", RMSE: " + str(np.sqrt(np.mean(scores))))

Folds: 200, MSE: 4.2435357128200835, RMSE: 2.059984396256458


In [28]:
# "neg_mean_squared_error" returns -MSE for each fold, so flip the sign before averaging
scores = cross_val_score(lrg1, features,Y, scoring="neg_mean_squared_error", cv=loocv,n_jobs=1)
print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(-scores)) +
       ", RMSE: " + str(np.sqrt(np.mean(-scores))))

Folds: 200, MSE: 4.2435357128200835, RMSE: 2.059984396256458
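scikit-learn's built-in scorers follow a higher-is-better convention, which is why the `neg_mean_squared_error` scores are negative and have to be converted back to positive values before averaging. The sketch below shows this on synthetic data; `X_demo`, `y_demo` and the coefficients are assumptions for illustration, not the marketing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=40)

neg_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=KFold(n_splits=4),
)

# Every entry is <= 0; negating recovers the usual (positive) MSE per fold
mse_per_fold = -neg_scores
print(neg_scores)
print(mse_per_fold.mean())
```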




## K-Fold Cross-Validation 

In practice, if we have enough data, we set aside part of the data set, known as the validation set, and use it to measure the performance of our model's predictions. Data are often scarce, however, so this is not always possible; the best practice in such situations is to use **K-fold cross-validation**.

**K-fold cross-validation** 

1. Randomly split the data set into k subsets (folds).
2. Train the model on k-1 of the subsets.
3. Test the model on the reserved subset and record the prediction error.
4. Repeat this process until each of the k subsets has served as the test set.
5. Average the k recorded errors. This average is the validation score for the model and is known as the cross-validation error.
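The five steps can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: `k_fold_indices` and `cv_error` are hypothetical helpers, and the "model" is a toy one that just predicts the mean of the training targets.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k random, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # step 1: random split
    folds = [indices[i::k] for i in range(k)]   # k disjoint subsets
    for i in range(k):                          # step 4: each fold is the test set once
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cv_error(y, k, seed=0):
    """Cross-validation MSE of a toy model that predicts the training mean."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(y), k, seed):
        mean_train = sum(y[j] for j in train_idx) / len(train_idx)               # step 2
        mse = sum((y[j] - mean_train) ** 2 for j in test_idx) / len(test_idx)    # step 3
        fold_errors.append(mse)
    return sum(fold_errors) / k                 # step 5: average over the k folds


y_toy = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
print(cv_error(y_toy, k=4))
```

The sketch assumes `k` is no larger than the number of samples, so that no fold is empty.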



In [30]:
cv4=KFold(n_splits=4)  # shuffle=False by default, so the folds are contiguous blocks of rows

In [31]:
scores = cross_val_score(lrg1, features,Y, scoring="neg_mean_squared_error", cv=cv4,n_jobs=1)
# scores hold -MSE per fold; negate to report the positive MSE
print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(-scores)) +
       ", RMSE: " + str(np.sqrt(np.mean(-scores))))

Folds: 4, MSE: 4.280873800694743, RMSE: 2.0690272595339927


##  REPEATED K-FOLD CROSS-VALIDATION

The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation. In scikit-learn's `RepeatedKFold`, two parameters control it:

`n_splits`: the number of folds.

`n_repeats`: the number of complete sets of folds to compute (repeated k-fold cross-validation only).
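As a small sketch of how the two settings interact, the splitter below produces `n_splits * n_repeats` train/test pairs in total. `X_demo` is a hypothetical 8-sample array and the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.arange(16).reshape(8, 2)  # 8 hypothetical samples
rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=0)

splits = list(rkf.split(X_demo))
print(rkf.get_n_splits(X_demo), len(splits))  # total number of train/test pairs

# Within one repetition, the 4 test folds together cover every sample exactly once
first_repeat_test = np.sort(np.concatenate([te for _, te in splits[:4]]))
print(first_repeat_test)
```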

In [32]:
r2cv4 = RepeatedKFold(n_splits=4, n_repeats=2, random_state=40)

In [33]:
scores = cross_val_score(lrg1, features,Y, scoring="neg_mean_squared_error", cv=r2cv4 ,n_jobs=1)
# scores hold -MSE per fold; negate to report the positive MSE
print("Folds: " + str(len(scores)) + ", MSE: " + str(np.mean(-scores)) +
       ", RMSE: " + str(np.sqrt(np.mean(-scores))))

Folds: 8, MSE: 4.2380837495395145, RMSE: 2.0586606688669007
