# Cross Validation

Date: May 12, 2018

Cross-validation is a statistical method for evaluating and comparing regression and other learning algorithms. It divides the data into complementary segments: at least one used to learn or train a model and another used to validate the model.

In typical cross-validation, the training and validation sets must cross over in successive rounds such that each data point has a chance of being validated against.

Cross-validation can be applied in three contexts: performance estimation, model selection, and tuning a learning model's parameters. In that sense it is also a technique for monitoring machine learning models.

Pointers and Tips
 - Train the model for each split
 - Average the scores
 - Train model on all the data
 - Establishes a lower bound score
 - Important to have a baseline
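
One way to act on the baseline pointer is to cross-validate a trivial model next to the real one. The sketch below is only an illustration under stated assumptions: the data is synthetic (not the Airbnb listings), `cv_rmse` is a hypothetical helper, and scikit-learn's `DummyRegressor`, which always predicts the training mean, stands in as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the Airbnb listings)
rng = np.random.RandomState(1)
X = rng.randint(1, 9, size=(200, 1)).astype(float)   # e.g. "accommodates" 1..8
y = 40.0 * X[:, 0] + rng.normal(0, 10, size=200)     # e.g. a noisy "price"

kf = KFold(n_splits=5, shuffle=True, random_state=1)

def cv_rmse(model):
    """Hypothetical helper: mean RMSE of `model` across the folds in `kf`."""
    mses = -cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    return np.sqrt(mses).mean()

baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))  # always predicts the mean
knn_rmse = cv_rmse(KNeighborsRegressor(n_neighbors=5))
print("baseline RMSE:", baseline_rmse)
print("knn RMSE:     ", knn_rmse)
```

If a model cannot beat the mean-only baseline under the same folds, its features or hyperparameters probably need another look.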

Read more --> http://leitang.net/papers/ency-cross-validation.pdf 

Nearest Neighbor --> http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms

![img92.gif](attachment:img92.gif)



In [4]:
import numpy as np
import pandas as pd
dc_listings = pd.read_csv("dc_airbnb.csv")

# Holdout validation or 2-Fold involves:

- splitting the full dataset into 2 partitions or segments: a training set and a test set
- training the model on the training set,
- using the trained model to predict labels on the test set,
- computing an error metric to understand the model's effectiveness,
- switching the training and test sets and repeating,
- averaging the errors.

In [5]:
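# regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex anchor)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)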
dc_listings['price'] = stripped_dollars.astype('float')

np.random.seed(1)

#shuffle
s_index = np.random.permutation(dc_listings.shape[0])
dc_listings = dc_listings.loc[s_index]

split_one = dc_listings.iloc[0:1862]
split_two = dc_listings.iloc[1862:]

In [6]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

#cross validation involves iterating while switching train and test splits
train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

# for documentation on algorithm = auto parameter of KNeighborsRegressor
# see http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
# first iteration
train_columns = ['accommodates']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto', metric='euclidean')
knn.fit(train_one[train_columns], train_one['price'])
prediction1 = knn.predict(test_one[train_columns])

# score performance using MSE and RMSE
# RMSE is the root mean squared error, i.e. the square root of the MSE
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
y_true = test_one['price'].to_numpy()
y_pred = prediction1
iteration_one_mse = mean_squared_error(y_true,y_pred)
iteration_one_rmse = iteration_one_mse**0.5    

# second iteration
train_columns = ['accommodates']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto', metric='euclidean')
knn.fit(train_two[train_columns], train_two['price'])
prediction2 = knn.predict(test_two[train_columns])

# score performance using MSE and RMSE
# RMSE is the root mean squared error, i.e. the square root of the MSE
y_true = test_two['price'].to_numpy()
y_pred = prediction2
iteration_two_mse = mean_squared_error(y_true,y_pred)
iteration_two_rmse = iteration_two_mse**0.5   

avg_rmse = np.mean([iteration_one_rmse,iteration_two_rmse])
print(avg_rmse)

128.96254732948216


# K-fold cross validation 

K-fold cross validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different segments or subsets of the data to avoid the issues of train/test validation.

The algorithm for k-fold cross validation involves:

- splitting the full dataset into k equal length partitions.
- selecting k-1 partitions as the training set and
- selecting the remaining partition as the test set
- training the model on the training set.
- using the trained model to predict labels on the test fold.
- computing the test fold's error metric.
- repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration.
- calculating the mean of the k error values.
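
The steps above can be sketched for an arbitrary k without hard-coding row ranges. This is a minimal illustration, assuming synthetic data in place of the listings and using `np.array_split` so that the partitions are as close to equal length as the row count allows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the Airbnb listings)
rng = np.random.RandomState(1)
X = rng.randint(1, 9, size=(103, 1)).astype(float)
y = 40.0 * X[:, 0] + rng.normal(0, 10, size=103)

k = 5
shuffled = rng.permutation(len(X))    # shuffle the row positions
folds = np.array_split(shuffled, k)   # k near-equal partitions of indices

fold_rmses = []
for i in range(k):
    test_idx = folds[i]                                   # one partition is the test set
    train_idx = np.concatenate(
        [folds[j] for j in range(k) if j != i])           # the other k-1 are the training set
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], knn.predict(X[test_idx]))
    fold_rmses.append(mse ** 0.5)                         # RMSE for this fold

mean_rmse = np.mean(fold_rmses)
print(fold_rmses)
print(mean_rmse)
```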

In [7]:
# Generally 5 folds are used
dc_listings.loc[dc_listings.index[0:745], "fold"] = 1
dc_listings.loc[dc_listings.index[745:1490], "fold"] = 2
dc_listings.loc[dc_listings.index[1490:2234], "fold"] = 3
dc_listings.loc[dc_listings.index[2234:2978], "fold"] = 4
dc_listings.loc[dc_listings.index[2978:3723], "fold"] = 5

In [8]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

#cross validation involves iterating while switching train and test splits
train_split = dc_listings[dc_listings['fold'].isin([2,3,4,5])]
test_split = dc_listings[dc_listings['fold'].isin([1])]

# first iteration; n_neighbors defaults to 5, so passing it explicitly is optional
# train on the hypothesis that 'accommodates' influences the price
train_columns = ['accommodates']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto', metric='euclidean')
knn.fit(train_split[train_columns], train_split['price'])
prediction = knn.predict(test_split[train_columns])

# score performance using MSE and RMSE
# RMSE is the root mean squared error, i.e. the square root of the MSE
y_true = test_split['price'].to_numpy()
y_pred = prediction
iteration_one_mse = mean_squared_error(y_true,y_pred)
iteration_one_rmse = iteration_one_mse**0.5  
print(iteration_one_rmse)

107.04609155929425


In [9]:
# Try wrapping the training model in a function
# Use np.mean to calculate the mean.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

fold_ids = [1,2,3,4,5]

def train_and_validate(df,folds):
    
    thermses = []
    for f in folds:
        train_split = df[df['fold'] != f]
        test_split = df[df['fold'] == f]
        # train on the split
        train_columns = ['accommodates']
        knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto', metric='euclidean')
        knn.fit(train_split[train_columns], train_split['price'])
    
        prediction = knn.predict(test_split[train_columns])

        # performance using MSE and RMSE
        # RMSE is the root mean squared error, i.e. the square root of the MSE
        y_true = test_split['price'].to_numpy()
        y_pred = prediction
        iteration_one_mse = mean_squared_error(y_true,y_pred)
        iteration_one_rmse = iteration_one_mse**0.5
        thermses.append(iteration_one_rmse)
    return thermses

rmses = train_and_validate(dc_listings, fold_ids)
avg_rmse = np.mean(rmses)
print(rmses)
print(avg_rmse)
print(np.std(rmses))

[107.04609155929425, 136.62225078440179, 153.0273362676136, 107.39207160219395, 146.9242838376558]
130.20240681023188
19.4850676211877


# Using KFold and Cross_Val_Score

As written, however, the function has many limitations. If we want to change the number of folds, for example, we need to make the function more general so that it also handles randomizing the order of the rows in the dataframe and splitting them into folds.
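
As a rough idea of what that generalization could look like, the sketch below shuffles a dataframe and labels its rows with fold ids 1 through k. The helper name `assign_folds` and the tiny `demo` frame are hypothetical, made up for illustration only.

```python
import numpy as np
import pandas as pd

def assign_folds(df, k, seed=1):
    """Hypothetical helper: return a shuffled copy of df with a 'fold' column (1..k)."""
    # sample(frac=1) returns all rows in random order
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    # Row position i is mapped to fold (i * k // n) + 1: contiguous, near-equal folds
    shuffled["fold"] = (np.arange(n) * k // n) + 1
    return shuffled

# Made-up demo frame standing in for dc_listings
demo = pd.DataFrame({"accommodates": range(1, 11),
                     "price": range(100, 1100, 100)})
demo = assign_folds(demo, k=3)
print(demo["fold"].value_counts().sort_index())
```

A frame prepared this way could then be passed to a function like `train_and_validate` together with `range(1, k + 1)` as the fold ids.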

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds. The scikit-learn library makes it easy for us to quickly experiment with these different knobs when it comes to building a better model. 

A KFold object generates the train/test index splits for each fold, and we pass it to the cross_val_score() function, also from sklearn.model_selection. Together, the class and the function allow us to compactly train and test using k-fold cross validation.
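
To see what KFold produces on its own, the short sketch below calls its `split()` method on a toy array of ten rows (an assumption for illustration) and prints the row positions assigned to the training and test sets in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # toy data: ten rows, one feature
kf = KFold(n_splits=5, shuffle=True, random_state=1)

seen_test = []
# split() yields one (train indices, test indices) pair per fold
for fold_number, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(fold_number, "train:", train_idx, "test:", test_idx)
    seen_test.extend(test_idx.tolist())
```

Each row position appears in exactly one test fold, which is what lets cross_val_score() score every observation once.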

In [10]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
import numpy as np


knn = KNeighborsRegressor()
kf = KFold(n_splits=5, shuffle=True, random_state=1)
X1 = dc_listings[['accommodates']]
y1 = dc_listings['price']  # 1-D target; scikit-learn expects y as a vector
mses = cross_val_score(estimator=knn, X=X1, y=y1, cv=kf, scoring='neg_mean_squared_error')
mses = np.abs(mses)
avg_rmse = np.mean(np.sqrt(mses))
print(avg_rmse)

# compare to above rmse - in a way we are building some type of baseline score for the model

134.66322485283825


# Choosing the right k value 

Choosing the right k for k-fold cross validation is more of an art and less of a science. A k value of 2 is really just holdout validation. On the other end, setting k equal to n (the number of observations in the data set) is known as leave-one-out cross validation, or LOOCV for short. Through lots of trial and error, data scientists have largely converged on 10 as the standard k value, with 5 also in common use.
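
The k = n case can be tried directly with scikit-learn's `LeaveOneOut` splitter. The sketch below uses a small synthetic dataset (an assumption, since LOOCV on the full listings would fit one model per row) and also runs `KFold` with `n_splits` set to the number of rows to show that the two give the same per-row errors.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic stand-in data (an assumption, not the Airbnb listings)
rng = np.random.RandomState(1)
X = rng.randint(1, 9, size=(60, 1)).astype(float)
y = 40.0 * X[:, 0] + rng.normal(0, 10, size=60)

model = KNeighborsRegressor(n_neighbors=5)

# One fold per observation: each test set holds exactly one row
loo_mses = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                            scoring="neg_mean_squared_error")
# KFold with k = n (no shuffling) produces the same single-row test sets
n_fold_mses = -cross_val_score(model, X, y, cv=KFold(n_splits=len(X)),
                               scoring="neg_mean_squared_error")

print(len(loo_mses))
# With single-row folds, pool the squared errors before taking the root
print(np.sqrt(loo_mses.mean()))
```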

In [11]:
from sklearn.model_selection import cross_val_score, KFold

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  126.19223474162338 std RMSE:  1.1069911739514842
5 folds:  avg RMSE:  134.66322485283825 std RMSE:  17.06396645051883
7 folds:  avg RMSE:  128.76102940687468 std RMSE:  15.113036472055438
9 folds:  avg RMSE:  130.97403954965455 std RMSE:  17.552727811846307
10 folds:  avg RMSE:  129.38243003886737 std RMSE:  22.858028610253182
11 folds:  avg RMSE:  128.5209174821928 std RMSE:  21.039336959245166
13 folds:  avg RMSE:  128.66536927916962 std RMSE:  29.931738536000207
15 folds:  avg RMSE:  127.74903938013544 std RMSE:  30.22520525014004
17 folds:  avg RMSE:  125.08689980063376 std RMSE:  34.7037432777398
19 folds:  avg RMSE:  123.24952437277325 std RMSE:  37.93258646028018
21 folds:  avg RMSE:  129.74292153412665 std RMSE:  36.31090634328946
23 folds:  avg RMSE:  129.91038839859866 std RMSE:  37.34861558876879


# Bias and Variance

Parting thoughts: a lower RMSE does NOT always mean that a model is more accurate. A model's error also comes from two other sources, bias and variance.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.

The standard deviation of the RMSE values can serve as a rough proxy for a model's variance, while the average RMSE is a rough proxy for a model's bias. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.
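
As a rough illustration of reading the two proxies side by side, the sketch below cross-validates k-nearest neighbors models with several `n_neighbors` values and records the mean and standard deviation of the fold RMSEs for each. The data is synthetic and the candidate values are arbitrary, so only the pattern of the comparison carries over, not the numbers.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the Airbnb listings)
rng = np.random.RandomState(1)
X = rng.uniform(1, 8, size=(300, 1))
y = 40.0 * X[:, 0] + rng.normal(0, 10, size=300)

kf = KFold(n_splits=10, shuffle=True, random_state=1)

summary = {}
for n_neighbors in [1, 5, 25, 100]:   # arbitrary candidate values
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    mses = -cross_val_score(model, X, y, cv=kf,
                            scoring="neg_mean_squared_error")
    rmses = np.sqrt(mses)
    # mean RMSE ~ rough proxy for bias, std of RMSE ~ rough proxy for variance
    summary[n_neighbors] = (rmses.mean(), rmses.std())
    print(n_neighbors, "avg RMSE:", rmses.mean(), "std RMSE:", rmses.std())
```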