# Notebook Instructions
<i>You can run the notebook document sequentially (one cell at a time) by pressing <b>Shift + Enter</b>. While a cell is running, a [*] is displayed on the left. Once the cell has run, a number such as [8] is displayed instead, indicating the order in which the cell was run in the notebook.</i>

<i>Enter edit mode by pressing <b>`Enter`</b> or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.</i>

# Cross validation

Cross-validation estimates the performance of a model by training and evaluating it on multiple train-validation splits of the data. In this notebook, we implement k-fold cross-validation to evaluate a random forest model.

## Create a random forest model - you already know this!

In [1]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
data = pd.read_csv('../data_modules/AAPL_2008_2018.csv')

# Returns
data['ret1'] = data.Adj_Close.pct_change()
data['ret5'] = data['ret1'].rolling(5).sum()
data['ret10'] = data['ret1'].rolling(10).sum()
data['ret20'] = data['ret1'].rolling(20).sum()
data['ret40'] = data['ret1'].rolling(40).sum()

# Standard Deviation
data['std5'] = data['ret1'].rolling(5).std()
data['std10'] = data['ret1'].rolling(10).std()
data['std20'] = data['ret1'].rolling(20).std()
data['std40'] = data['ret1'].rolling(40).std()

# Future returns
data['retFut1'] = data.ret1.shift(-1)

# Define predictor variables (X) and a target variable (y)
data = data.dropna()
predictor_list = ['ret1', 'ret5', 'ret10', 'ret20',
                  'ret40', 'std5', 'std10', 'std20', 'std40']
X = data[predictor_list]
y = np.where(data.retFut1 > 0, 1, -1)

seed = 42
random_forest = RandomForestClassifier(
    n_estimators=20,
    max_features=0.6,
    min_samples_leaf=400,
    random_state=seed,
    bootstrap=True
)

## K-fold cross-validation

The KFold class from the sklearn.model_selection package provides train/test indices that split the data into k consecutive folds. Each fold is used once as the test set, while the remaining k - 1 folds form the training set.

The parameters are:
1. n_splits: the number of folds. It must be at least 2.
2. shuffle: True/False, whether the data should be shuffled before it is split into batches.
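
To make the idea of consecutive folds concrete, here is a minimal pure-Python sketch of how unshuffled k-fold indices can be formed. The helper name `consecutive_folds` is hypothetical and is not part of sklearn; it only mimics the behaviour of KFold with shuffle=False, where the first `n_samples % n_splits` folds receive one extra sample.

```python
def consecutive_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for unshuffled k-fold splitting.

    Hypothetical helper for illustration; sklearn's KFold is the real tool.
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_samples:
        raise ValueError("n_splits cannot exceed the number of samples")

    # The first (n_samples % n_splits) folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        # Everything outside the test block is used for training
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(consecutive_folds(n_samples=10, n_splits=3))
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: test={test} train={train}")
```

Every sample appears in exactly one test fold, so each observation is used for validation once and for training k - 1 times.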

We also use the cross_val_score function from the sklearn.model_selection module to run the cross-validation. The function cross_val_score takes as input:

1. the estimator (model)
2. the predictor variables
3. the target variable
4. the cross-validation splitting strategy (cv), such as a number of folds or a KFold object

The function returns an array containing the estimator's score for each run of the cross-validation. You can use the help function to see the details of cross_val_score.
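
The following sketch shows roughly what cross_val_score does for a classifier: it fits a fresh copy of the estimator on each training fold and scores it on the matching test fold. The data here is synthetic, and the names `X_demo`, `y_demo`, `kf_demo` and `model` are illustrative assumptions rather than this notebook's variables.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, not the AAPL data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.where(X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0, 1, -1)

model = RandomForestClassifier(n_estimators=20, random_state=42)
kf_demo = KFold(n_splits=5, shuffle=False)

# Manual loop: fit on the training fold, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # For classifiers, .score() returns the mean accuracy
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The same computation using the library helper
library_scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo)

print(manual_scores)
print(library_scores)
```

Because the estimator has a fixed random_state and both approaches use the same folds, the two arrays should agree.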

In [3]:
from sklearn.model_selection import KFold

# Split dataset into k consecutive folds
kf = KFold(n_splits=5, shuffle=False)
kf.split(X)

<generator object _BaseKFold.split at 0x287dec160>

In [4]:
from sklearn.model_selection import cross_val_score

# Returns an array with one score per fold (accuracy for a classifier).
# Passing the KFold object itself is equivalent to passing kf.split(X).
scores = cross_val_score(random_forest, X, y, cv=kf)

# Print the scores
print(scores)

[0.53137255 0.56862745 0.51372549 0.51866405 0.52455796]


After running cross-validation, we end up with 5 performance scores (one per fold), which are summarized below using their mean and standard deviation.

In [8]:
"Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0)

'Accuracy: 53.139% (1.953%)'

The above cross-validation accuracy scores and their standard deviation provide a more reliable measure of the model's performance than a single train-test split, because the model is trained and evaluated on several different subsets of the data. Here, we have used the k-fold method to evaluate the random forest classifier model.
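
As a small worked check of the summary statistics, the sketch below recomputes the mean and standard deviation from the fold scores printed earlier, using only the standard library. Note that NumPy's `std()` defaults to the population standard deviation (ddof=0); with only 5 folds, the sample standard deviation (ddof=1) is noticeably larger. The variable names are illustrative.

```python
import statistics

# Fold scores copied from the output printed earlier in this notebook
fold_scores = [0.53137255, 0.56862745, 0.51372549, 0.51866405, 0.52455796]

mean_acc = statistics.mean(fold_scores)
# Population standard deviation (divides by n); matches NumPy's default ddof=0
pop_std = statistics.pstdev(fold_scores)
# Sample standard deviation (divides by n - 1); equivalent to ddof=1
sample_std = statistics.stdev(fold_scores)

print("Accuracy: %.3f%% (population std %.3f%%, sample std %.3f%%)"
      % (mean_acc * 100.0, pop_std * 100.0, sample_std * 100.0))
```

Which of the two to report is a judgment call; the important point is to state which one was used.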