### Variations of k-fold cross validation

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_csv('Social_Network_Ads.csv')
X = df.iloc[:, 2:4]   # columns 2 and 3 (Age, EstimatedSalary); a slice such as 1:2 would select only a single column
y = df.iloc[:, 4]

df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [3]:
# Scale
from sklearn.preprocessing import StandardScaler
X_sca = StandardScaler()
X = X_sca.fit_transform(X)

In [4]:
from __future__ import division
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

In [5]:
kfold_cv = KFold(n_splits=10)
correct = 0
total = 0
for train_indices, test_indices in kfold_cv.split(X):
    X_train, X_test, y_train, y_test = X[train_indices], X[test_indices], \
                                        y[train_indices], y[test_indices]
    clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    correct += accuracy_score(y_test, clf.predict(X_test))
    total += 1
print("Accuracy: {0:.2f}".format(correct/total))

Accuracy: 0.82


In [6]:
from sklearn.svm import SVC  # support vector classifier
# X_train and y_train are left over from the last fold of the KFold loop above
clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)

In [7]:
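# Apply 10-fold cross validation with the cross_val_score helper.
# Note: X_train and y_train are the training split left over from the last
# fold of the KFold loop, so this runs on that subset, not on all of X.
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(clf, X_train, y_train, cv=10)
print('\n', accuracies)
print('\n', accuracies.mean())
print('\n', accuracies.std())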


 [0.7027027  0.7027027  0.94594595 0.97222222 0.97222222 0.88888889
 0.86111111 0.74285714 0.82857143 0.91428571]

 0.8531510081510081

 0.10004154029821953
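The manual loop and `cross_val_score` do the same work when they are given the same splitter. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the CSV used in this notebook; the names `X_demo`, `y_demo`, `manual_scores` and `helper_scores` are hypothetical. Note that when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds by default, so passing `cv=10` is not the same as passing `KFold(n_splits=10)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic two-feature data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)

kfold = KFold(n_splits=10)

# Manual loop: fit on each training fold, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    model = SVC(kernel='linear').fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], preds))
manual_scores = np.array(manual_scores)

# Same splitter object handed to cross_val_score instead of an integer.
helper_scores = cross_val_score(SVC(kernel='linear'), X_demo, y_demo, cv=kfold)

# Compare the per-fold accuracies from both approaches.
print(np.allclose(manual_scores, helper_scores))
```

Passing the splitter object explicitly makes it clear which splitting strategy is being evaluated.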


#### Leave-one-out CV

Another type of cross validation is leave-one-out cross validation. Out of the <b>*n*</b> samples, one is left out for testing and the model is trained on the remaining <b>*n* - 1</b> samples; this is repeated so that every sample is left out exactly once. When <b>*k*</b> in k-fold validation is equal to the number of samples, k-fold validation is the same as leave-one-out cross validation.
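The equivalence between the two splitters can be checked directly. This is a minimal sketch on a hypothetical six-sample array (`X_toy` is not part of the notebook's data); it compares the index splits produced by `LeaveOneOut` with those of `KFold` when `n_splits` equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Six hypothetical samples with two features each.
X_toy = np.arange(12).reshape(6, 2)
n = len(X_toy)

# Collect the (train, test) index lists produced by each splitter.
loo_splits = [(tr.tolist(), te.tolist())
              for tr, te in LeaveOneOut().split(X_toy)]
kfold_splits = [(tr.tolist(), te.tolist())
                for tr, te in KFold(n_splits=n).split(X_toy)]

# Each splitter should yield n splits with a single test sample each.
print(loo_splits == kfold_splits)
```

Because leave-one-out fits one model per sample, it becomes expensive on large datasets.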

In [8]:
df1 = pd.read_csv('Social_Network_Ads.csv')
X1 = df1.iloc[:, 2:4]
y1 = df1.iloc[:, 4]

df1.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [9]:
#scale
from sklearn.preprocessing import StandardScaler
X_sca = StandardScaler()
X1 = X_sca.fit_transform(X1)

In [10]:
from __future__ import division
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

loo_cv = LeaveOneOut()
correct = 0
total = 0
for train_indices, test_indices in loo_cv.split(X1):
#     uncomment these lines to print splits
#     print("Train Indices: {}...".format(train_indices[:4]))
#     print("Test Indices: {}...".format(test_indices[:4]))
#     print("Training SVC model using this configuration")
    X1_train, X1_test, y1_train, y1_test = X1[train_indices], X1[test_indices], \
                                        y1[train_indices], y1[test_indices]
    clf = SVC(kernel='linear', random_state=42).fit(X1_train, y1_train)
    correct += accuracy_score(y1_test, clf.predict(X1_test))
    total += 1
print("Accuracy: {0:.2f}".format(correct/total))

Accuracy: 0.84


### Implementing Stratified k-fold

Plain k-fold validation does not preserve the class proportions of the output variable when it splits the data into folds. Imagine training a Naive Bayes classifier with 2-fold validation on 10 samples, where 5 are positive and 5 are negative. With an unfortunate split (for example, when the rows are sorted by label and the folds are not shuffled), one fold contains all the positive samples and the other contains all the negative ones. A Naive Bayes classifier trained on the all-positive fold will estimate the prior probability of the positive class as 100%, i.e. the model will "think" the output is always positive (which is obviously wrong!). To tackle this scenario we use a stratified split, which preserves the class proportions of the original dataset in every fold: if the original dataset has 50% positive and 50% negative outputs, then each training set will also have roughly 50% positive and 50% negative outputs.
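The scenario can be reproduced on a toy label vector. This sketch assumes 10 hypothetical samples sorted by label (`X_toy`, `y_toy` and the helper `positive_fraction` are illustrative names, not part of the notebook) and reports the fraction of positive labels in each training set for `KFold` and `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Ten hypothetical samples sorted by label: 5 positives, then 5 negatives.
y_toy = np.array([1] * 5 + [0] * 5)
X_toy = np.zeros((10, 1))  # the feature values do not matter for splitting

def positive_fraction(splitter):
    """Fraction of positive labels in each training set of the splitter."""
    return [float(y_toy[train_idx].mean())
            for train_idx, _ in splitter.split(X_toy, y_toy)]

# Unshuffled KFold cuts the sorted data into two single-class halves.
plain = positive_fraction(KFold(n_splits=2))

# StratifiedKFold spreads both classes across the folds.
strat = positive_fraction(StratifiedKFold(n_splits=2))

print("KFold:          ", plain)
print("StratifiedKFold:", strat)
```

With plain `KFold` each training set contains a single class, while the stratified training sets stay as close to the original 50/50 balance as the fold sizes allow.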

In [11]:
df2 = pd.read_csv('Social_Network_Ads.csv')
X2 = df2.iloc[:, 2:4]
y2 = df2.iloc[:, 4]

df2.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [12]:
#scale it
from sklearn.preprocessing import StandardScaler
X2_sca = StandardScaler()
X2 = X2_sca.fit_transform(X2)

In [15]:
from __future__ import division 
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import accuracy_score 
from sklearn.svm import SVC 

strat_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) 
correct = 0 
total = 0 

for train_indices, test_indices in strat_cv.split(X2, y2):
    # print the first few indices of each split
    print("Train Indices: {}...".format(train_indices[:4]))
    print("Test Indices: {}...".format(test_indices[:4]))
    print("Training SVC model using this configuration")
    X_train, X_test, y_train, y_test = X2[train_indices], X2[test_indices], \
                                        y2[train_indices], y2[test_indices]
    clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    correct += accuracy_score(y_test, clf.predict(X_test))
    total += 1
    # running mean accuracy over the folds evaluated so far
    print("Accuracy: {0:.2f}".format(correct/total))



Train Indices: [0 1 2 3]...
Test Indices: [22 23 60 75]...
Training SVC model using this configuration
Accuracy: 0.83
Train Indices: [0 1 2 3]...
Test Indices: [ 5  9 13 36]...
Training SVC model using this configuration
Accuracy: 0.85
Train Indices: [0 1 2 3]...
Test Indices: [ 8 17 25 28]...
Training SVC model using this configuration
Accuracy: 0.88
Train Indices: [0 1 2 5]...
Test Indices: [ 3  4 14 31]...
Training SVC model using this configuration
Accuracy: 0.88
Train Indices: [0 1 3 4]...
Test Indices: [ 2 11 15 18]...
Training SVC model using this configuration
Accuracy: 0.86
Train Indices: [1 2 3 4]...
Test Indices: [ 0 16 19 26]...
Training SVC model using this configuration
Accuracy: 0.84
Train Indices: [0 1 2 3]...
Test Indices: [ 6  7 12 20]...
Training SVC model using this configuration
Accuracy: 0.84
Train Indices: [0 2 3 4]...
Test Indices: [ 1 30 42 45]...
Training SVC model using this configuration
Accuracy: 0.84
Train Indices: [0 1 2 3]...
Test Indices: [39 43 46 90].

### To validate Time Series data

Time series data is data indexed by time, for instance stock prices. The goal is to predict future values, such as a future stock price, from past observations. The splitting techniques above do not respect time order: a training fold can contain samples that come later than the test samples, so we would end up predicting the past from the future, which shouldn't be permitted. We should always predict the future from the past. This can be achieved using the <b>TimeSeriesSplit</b> class.

In [16]:
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.random.rand(10, 2)
y = np.random.rand(10)
print(X)
print(y)

[[0.26834453 0.53584406]
 [0.13858101 0.68815886]
 [0.55775302 0.39856144]
 [0.52501334 0.8958961 ]
 [0.05706948 0.59143032]
 [0.05462498 0.41478816]
 [0.07377104 0.75373714]
 [0.1303596  0.24446038]
 [0.66597346 0.25103964]
 [0.27948098 0.32449053]]
[0.63643492 0.77520179 0.01045411 0.15540252 0.80098639 0.17076189
 0.12586894 0.8345535  0.6752844  0.60468472]


In [17]:
tss = TimeSeriesSplit(n_splits=7)

for train_indices, test_indices in tss.split(X):
    print("Train indices: {0} Test indices: {1}".format(train_indices, test_indices))

Train indices: [0 1 2] Test indices: [3]
Train indices: [0 1 2 3] Test indices: [4]
Train indices: [0 1 2 3 4] Test indices: [5]
Train indices: [0 1 2 3 4 5] Test indices: [6]
Train indices: [0 1 2 3 4 5 6] Test indices: [7]
Train indices: [0 1 2 3 4 5 6 7] Test indices: [8]
Train indices: [0 1 2 3 4 5 6 7 8] Test indices: [9]
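The ordering property visible in the output above can also be checked programmatically. This is a small sketch on a hypothetical ten-row array (`X_ts` and the helper `respects_time_order` are illustrative names); it tests whether every training index comes before every test index, and contrasts `TimeSeriesSplit` with an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Ten hypothetical time-ordered rows with two features each.
X_ts = np.arange(20).reshape(10, 2)

def respects_time_order(splitter):
    """True if, in every split, all training rows precede all test rows."""
    return all(train_idx.max() < test_idx.min()
               for train_idx, test_idx in splitter.split(X_ts))

# TimeSeriesSplit only ever trains on rows that precede the test rows.
tss_ordered = respects_time_order(TimeSeriesSplit(n_splits=7))

# KFold also trains on rows that come after the test rows.
kfold_ordered = respects_time_order(KFold(n_splits=5))

print("TimeSeriesSplit:", tss_ordered)
print("KFold:          ", kfold_ordered)
```

The training window grows with each split, so later splits train on more history than earlier ones.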
