# Cross validation

In this script, I collect different ways to perform cross-validation.

## Using numpy APIs

This script was taken from
http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

In [1]:
from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])

0.98

If we assume that the features are not time-dependent, we can proceed manually as follows, using only NumPy to create the K-fold sets:

In [2]:
import numpy as np

In [8]:
perm = np.random.permutation(X_digits.shape[0])
X_shuff = X_digits[perm]
y_shuff = y_digits[perm]

In [11]:
X_folds = np.array_split(X_shuff, 3)
y_folds = np.array_split(y_shuff, 3)

In [12]:
for arr in X_folds:
    print(arr.shape)

(599, 64)
(599, 64)
(599, 64)


In [13]:
scores = []
for k in range(len(X_folds)):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test  = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test  = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)

[0.9849749582637729, 0.9749582637729549, 0.9816360601001669]
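The loop above can also be packaged as a small generator that yields train/test index arrays instead of copies of the data. This is only a sketch: the name `kfold_indices` and its `seed` parameter are hypothetical helpers, not part of NumPy or scikit-learn.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled K-fold split.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)          # shuffle the sample indices once
    folds = np.array_split(perm, n_splits)     # list of index arrays, one per fold
    for k in range(n_splits):
        test_idx = folds[k]
        # every fold except the k-th one goes into the training set
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(10, n_splits=3):
    print(len(train_idx), len(test_idx))
```

With the digits data, each pair could then be used as `svc.fit(X_digits[train_idx], y_digits[train_idx]).score(X_digits[test_idx], y_digits[test_idx])`.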


## Using Scikit-learn APIs

If there is no time dependence, we can alternatively use the tools from scikit-learn.

First on some toy data:

In [16]:
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: {} | test: {}'.format(train_indices, test_indices))

Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]


Then on the digits data, with a one-liner (possible because we use the estimator's default metric):

In [17]:
from sklearn.model_selection import KFold, cross_val_score
k_fold = KFold(n_splits=3)
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
         for train, test in k_fold.split(X_digits)]

[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]

Or alternatively, using `cross_val_score`:

In [24]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)

array([ 0.93489149,  0.95659432,  0.93989983])
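These scores are lower than the ones from the NumPy section, most likely because `KFold` does not shuffle by default (`shuffle=False`), whereas the manual version permuted the samples first. A minimal sketch of the shuffled equivalent, where `random_state=0` is an arbitrary choice made only for reproducibility:

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(C=1, kernel='linear')

# shuffle=True permutes the sample indices before cutting them into folds
shuffled_k_fold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(svc, digits.data, digits.target, cv=shuffled_k_fold)
print(scores)
```

Shuffling is only appropriate when the samples are not time-dependent, which is the assumption made throughout this section.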

## Time series

Here is a simple way to create a training set and a validation set from time-ordered data. It assumes that the whole data set is stored in a pandas DataFrame. For the full model and tutorial, see this link: https://machinelearningmastery.com/persistence-time-series-forecasting-with-python/

In [None]:
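# Split into train and test sets, keeping the time order (no shuffling)
# `dataframe` is assumed to be an existing pandas DataFrame
X = dataframe.values
train_size = int(len(X) * 0.66)  # first 66% of the rows go to training
# Row 0 is skipped, presumably because its lagged value is NaN (see the linked tutorial)
train, test = X[1:train_size], X[train_size:]
# Column 0 is used as the input, column 1 as the target
train_X, train_y = train[:, 0], train[:, 1]
test_X, test_y = test[:, 0], test[:, 1]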
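The cell above leaves `dataframe` undefined. Below is a minimal sketch of one way to build it, assuming (as the `[:, 0]` / `[:, 1]` indexing suggests) that column 0 holds the previous observation and column 1 the current one. The series values and the column names `t-1` and `t` are made up for illustration.

```python
import pandas as pd

# Hypothetical univariate series; any time-ordered observations would do
series = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0, 17.0, 19.0])

# Column 0 is the value at the previous step, column 1 the value to predict
dataframe = pd.concat([series.shift(1), series], axis=1)
dataframe.columns = ['t-1', 't']

X = dataframe.values
train_size = int(len(X) * 0.66)
# Row 0 is skipped because its lagged value is NaN after the shift
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:, 0], train[:, 1]
test_X, test_y = test[:, 0], test[:, 1]
print(train.shape, test.shape)
```

Because the split is a plain slice, every training row precedes every test row in time, which is the property that shuffled K-fold would break.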