# Cross Validation
### Learn how to use k-fold cross validation to perform more rigorous testing.

##### Contents:
- Holdout Validation
- K-fold Cross Validation
    - add new df column with fold number
        - df.loc[df.index[0:744], "fold"] = 1
- first iteration with univariate knn model
- define function for training models
- K-Fold with Scikit-Learn
    - KFold()
    - cross_val_score()
- Exploring different K values
    - leave-one-out cross-validation (LOOCV)
- Bias-Variance Tradeoff
    - standard deviation of RMSE as proxy for model variance
    - average RMSE as proxy for model bias    

## 1: Introduction

In an earlier mission, we learned about train/test validation, a simple technique for testing a machine learning model's accuracy on new data that the model wasn't trained on. In this mission, we'll focus on more robust techniques.

To start, we'll focus on the **holdout validation** technique, which involves:

- splitting the full dataset into 2 partitions:
    - a training set
    - a test set
- training the model on the training set,
- using the trained model to predict labels on the test set,
- computing an error metric to understand the model's effectiveness,
- switching the training and test sets and repeating,
- averaging the errors.

In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in our model performance.
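The holdout steps above can be sketched on synthetic data. This is a minimal illustration, not the course's solution: the data, the through-the-origin line used as a placeholder model, and the helper names `fit_slope`, `rmse`, and `holdout_rmse` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable

# Synthetic stand-in data (hypothetical): price grows with capacity, plus noise.
accommodates = rng.integers(1, 9, size=200)
price = 40.0 * accommodates + rng.normal(0.0, 25.0, size=200)

# Shuffle the row positions and cut them into two equal halves (50/50 split).
order = rng.permutation(len(price))
half_a, half_b = order[:100], order[100:]

def fit_slope(x, y):
    """Least-squares slope of a line through the origin (placeholder model)."""
    return float(np.dot(x, y) / np.dot(x, x))

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def holdout_rmse(train_idx, test_idx):
    """Fit on the training rows, then score on the test rows."""
    slope = fit_slope(accommodates[train_idx], price[train_idx])
    return rmse(price[test_idx], slope * accommodates[test_idx])

rmse_one = holdout_rmse(half_a, half_b)   # train on the first half, test on the second
rmse_two = holdout_rmse(half_b, half_a)   # switch the roles and repeat
avg_rmse = (rmse_one + rmse_two) / 2      # average the two errors
print(rmse_one, rmse_two, avg_rmse)
```

Because both halves have the same size, any difference between the two RMSE values comes from which rows landed in each half, not from how much data each model saw.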

<img src="img/holdout_validation.png">

Let's start by splitting the data set into 2 nearly equivalent halves.

#### Instructions:
- Use the numpy.random.permutation() function to shuffle the ordering of the rows in dc_listings.
- Select the first 1862 rows and assign to split_one.
- Select the remaining 1861 rows and assign to split_two.

In [1]:
import numpy as np
import pandas as pd

dc_listings = pd.read_csv("data/dc_airbnb.csv")
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
# '$' is a regex anchor, so ask pandas to match it literally.
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')

In [9]:
# iloc expects row positions, so permute positions rather than index labels.
dc_listings = dc_listings.iloc[np.random.permutation(len(dc_listings))]
split_one = dc_listings.iloc[:1862]
split_two = dc_listings.iloc[1862:]

## 2: Holdout Validation

Now that we've split our data set into 2 dataframes, let's:

- train a k-nearest neighbors model on the first half,
- test this model on the second half,
- train a k-nearest neighbors model on the second half,
- test this model on the first half.

#### Instructions:
- Train a k-nearest neighbors model (using 5 neighbors) that:
    - Uses the accommodates column from train_one for training and
    - Tests it on test_one.
- Assign the resulting RMSE value to iteration_one_rmse.
- Train a k-nearest neighbors model (using 5 neighbors) that:
    - Uses the accommodates column from train_two for training and
    - Tests it on test_two.
- Assign the resulting RMSE value to iteration_two_rmse.
- Use numpy.mean() to calculate the average of the 2 RMSE values and assign to avg_rmse.

In [27]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_one[['accommodates']], train_one.price)
predictions = knn.predict(test_one[['accommodates']])
iteration_one_rmse = mean_squared_error(predictions, test_one.price) ** (1/2)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_two[['accommodates']], train_two.price)
predictions = knn.predict(test_two[['accommodates']])
iteration_two_rmse = mean_squared_error(predictions, test_two.price) ** (1/2)

avg_rmse = np.mean([iteration_one_rmse, iteration_two_rmse])
avg_rmse

131.5077027164597
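
For reference, the metric computed in each iteration above is the root mean squared error. The notation here is introduced for illustration and is not taken from the course: $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of rows in the test half.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{RMSE}_{\text{avg}} = \frac{\mathrm{RMSE}_1 + \mathrm{RMSE}_2}{2}
$$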

## 3: K-Fold Cross Validation

If we average the two RMSE values from the last step, we get an RMSE value of approximately **131.51** in the run shown above (the exact value changes with each random shuffle). Holdout validation is actually a specific example of a larger class of validation techniques called **k-fold cross-validation**. While holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, each of the two models is trained on only half of the available data. K-fold cross validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.

Here's the algorithm for k-fold cross validation:

- splitting the full dataset into k equal length partitions,
    - selecting k-1 partitions as the training set and
    - selecting the remaining partition as the test set
- training the model on the training set,
- using the trained model to predict labels on the test fold,
- computing the test fold's error metric,
- repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration,
- calculating the mean of the k error values.
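
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `k_fold_rmses`, the toy prices, and the placeholder model (predict the mean training price for every test row) are assumptions, chosen so the fold rotation itself stays in focus.

```python
import numpy as np

def k_fold_rmses(y, k):
    """Return one RMSE per fold, using a mean-price placeholder model."""
    # Split the row positions into k nearly equal, contiguous partitions.
    folds = np.array_split(np.arange(len(y)), k)
    rmses = []
    for i, test_idx in enumerate(folds):
        # The other k-1 partitions together form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        prediction = y[train_idx].mean()      # "train" the placeholder model
        errors = y[test_idx] - prediction     # same prediction for every test row
        rmses.append(float(np.sqrt(np.mean(errors ** 2))))
    return rmses

# Toy target values (hypothetical), ten rows split into five folds of two.
prices = np.array([100.0, 120.0, 80.0, 150.0, 90.0,
                   110.0, 130.0, 70.0, 95.0, 105.0])
fold_rmses = k_fold_rmses(prices, k=5)
print(fold_rmses)            # one error value per fold
print(np.mean(fold_rmses))   # the k-fold estimate is their mean
```

Calling `k_fold_rmses(prices, k=2)` reproduces the holdout scheme: two folds, each used once as the test set.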

Holdout validation is essentially a version of k-fold cross validation when k is equal to 2. Generally, 5 or 10 folds are used for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross validation:

<img src="img/kfold_cross_validation.png">

As you increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases. Let's start by manually partitioning the data set into 5 folds. Instead of splitting into 5 dataframes, let's add a column that specifies which fold the row belongs to. This way, we can easily select the training and test rows for each iteration.

#### Instructions:
- Add a new column to dc_listings named fold that contains the fold number each row belongs to:
    - Fold 1 should have rows from index position 0 up to, but not including, 744.
    - Fold 2 should have rows from index position 744 up to, but not including, 1488.
    - Fold 3 should have rows from index position 1488 up to, but not including, 2232.
    - Fold 4 should have rows from index position 2232 up to, but not including, 2976.
    - Fold 5 should have rows from index position 2976 up to, but not including, 3723 (the end of the dataframe).
- Display the unique value counts for the fold column to confirm that each fold has roughly the same number of elements.

In [30]:
# DataFrame.set_value() was removed in pandas 1.0; assigning through .loc does the same job.
dc_listings.loc[dc_listings.index[0:744], "fold"] = 1
dc_listings.loc[dc_listings.index[744:1488], "fold"] = 2
dc_listings.loc[dc_listings.index[1488:2232], "fold"] = 3
dc_listings.loc[dc_listings.index[2232:2976], "fold"] = 4
dc_listings.loc[dc_listings.index[2976:3723], "fold"] = 5
dc_listings.head(10)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,fold
2061,84%,82%,2,4,Private room,1.0,1.0,1.0,120.0,$40.00,,2,1125,2,38.895562,-76.993420,Washington,20002,DC,1.0
2168,91%,50%,1,2,Entire home/apt,1.0,1.5,1.0,135.0,,,1,1125,4,38.899199,-76.986642,Washington,20002,DC,1.0
2856,100%,100%,1,2,Entire home/apt,1.0,1.0,1.0,109.0,$25.00,,3,1125,25,38.927794,-77.037579,Washington,20009,DC,1.0
221,98%,52%,49,4,Entire home/apt,1.0,1.0,1.0,185.0,,,3,90,8,38.905627,-77.027356,Washington,20001,DC,1.0
2629,100%,87%,1,2,Private room,1.0,1.0,1.0,75.0,$20.00,,2,1125,20,38.918617,-77.002885,Washington,20002,DC,1.0
917,100%,100%,1,2,Entire home/apt,1.0,1.0,1.0,120.0,$20.00,,2,1125,2,38.911746,-77.059957,Washington,20007,DC,1.0
2153,100%,100%,1,4,Entire home/apt,1.0,1.0,2.0,85.0,$25.00,$150.00,2,15,71,38.896228,-76.973565,Washington,20002,DC,1.0
956,100%,92%,1,4,Entire home/apt,2.0,1.0,2.0,172.0,$30.00,,2,365,20,38.917344,-77.038756,Washington,20009,DC,1.0
189,82%,82%,2,2,Entire home/apt,1.0,1.0,1.0,250.0,$85.00,"$1,500.00",2,1125,17,38.900339,-77.016231,Washington,20001,DC,1.0
1236,100%,100%,1,2,Entire home/apt,1.0,1.0,1.0,215.0,,,3,29,3,38.907180,-77.036503,Washington,20036,DC,1.0
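
As a quick sanity check on the slice boundaries used in the cell above, the fold sizes can be worked out without touching the dataframe. This is a small sketch; the names `boundaries` and `fold_sizes` are introduced here for illustration.

```python
# Slice boundaries from the cell above; each slice is half-open: [start, stop).
boundaries = [0, 744, 1488, 2232, 2976, 3723]

# Pair each boundary with the next one and number the folds from 1.
fold_sizes = {
    fold: stop - start
    for fold, (start, stop) in enumerate(zip(boundaries, boundaries[1:]), start=1)
}
print(fold_sizes)                 # {1: 744, 2: 744, 3: 744, 4: 744, 5: 747}
print(sum(fold_sizes.values()))   # 3723
```

The folds are not perfectly equal because 3723 is not divisible by 5; the last fold absorbs the three extra rows.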


## 4: First Iteration

Let's start by performing the first iteration of k-fold cross validation on a simple, univariate model.

#### Instructions:
- Train a k-nearest neighbors model using the accommodates column as the sole feature from folds 2 to 5 as the training set.
- Use the model to make predictions on the test set (accommodates column from fold 1) and assign the predicted labels to labels.
- Calculate the RMSE value by comparing the price column with the predicted labels.
- Assign the RMSE value to iteration_one_rmse.

In [32]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Training
model = KNeighborsRegressor()
train_iteration_one = dc_listings[dc_listings["fold"] != 1].copy() # copies added to prevent SettingWithCopy warnings
test_iteration_one = dc_listings[dc_listings["fold"] == 1].copy()
model.fit(train_iteration_one[["accommodates"]], train_iteration_one["price"])

# Predicting
labels = model.predict(test_iteration_one[["accommodates"]])
test_iteration_one["predicted_price"] = labels
iteration_one_mse = mean_squared_error(test_iteration_one["price"], test_iteration_one["predicted_price"])
iteration_one_rmse = iteration_one_mse ** (1/2)

## 5: Function For Training Models

From the first iteration, we achieved an RMSE value of about **114.57** in this run (it reappears as the first entry in the output of the next cell; the exact value depends on the random shuffle). One iteration tells us little by itself, so let's calculate the RMSE values for the remaining iterations. To make the iteration process easier, let's wrap the code we wrote in the previous screen in a function.

#### Instructions:
- Write a function named train_and_validate that takes in a dataframe as the first parameter (df) and a list of fold values (1 to 5 in our case) as the second parameter (folds). This function should:
    - Train n models (where n is number of folds) and perform k-fold cross validation (using n folds). Use the default k value for the KNeighborsRegressor class.
    - Return a list of RMSE values, where the first element is the RMSE for when fold 1 was the test set, the second element is the RMSE for when fold 2 was the test set, and so on.
- Use the train_and_validate function to return the list of RMSE values for the dc_listings Dataframe and assign to rmses.
- Calculate the mean of these values and assign to avg_rmse.
- Display both rmses and avg_rmse.

In [34]:
# Use np.mean to calculate the mean.
import numpy as np
fold_ids = [1,2,3,4,5]

def train_and_validate(df, folds):
    fold_rmses = []
    for fold in folds:
        # Train
        model = KNeighborsRegressor()
        train = df[df["fold"] != fold].copy()
        test = df[df["fold"] == fold].copy()
        model.fit(train[["accommodates"]], train["price"])
        # Predict
        labels = model.predict(test[["accommodates"]])
        test["predicted_price"] = labels
        mse = mean_squared_error(test["price"], test["predicted_price"])
        rmse = mse**(1/2)
        fold_rmses.append(rmse)
    return fold_rmses

rmses = train_and_validate(dc_listings, fold_ids)
print(rmses)
avg_rmse = np.mean(rmses)
print(avg_rmse)

[114.57347222474498, 141.79793229152017, 108.62683533473076, 164.02243191723201, 115.18211504034528]
128.840557362


## 6: Performing K-Fold Cross Validation Using Scikit-Learn

While the average RMSE value was approximately 128.84, the RMSE values ranged from 108.63 all the way to 164.02. This large amount of variability between the RMSE values means that we're either using a poor model or a poor evaluation criterion (or a bit of both!). By implementing your own k-fold cross-validation function, you hopefully acquired a good understanding of the inner workings of the technique. The function we wrote, however, has many limitations. If we now want to change the number of folds, we need to make the function more general so it can also handle randomizing the ordering of the rows in the dataframe and splitting them into folds.
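
One way to generalize the fold assignment is sketched below, assuming a hypothetical helper named `assign_folds` that is not part of the original mission. It labels rows 1 through `n_folds` as evenly as possible and shuffles the labels with a seed.

```python
import numpy as np

def assign_folds(n_rows, n_folds, seed=None):
    """Return an array of fold numbers (1..n_folds), one per row, in shuffled order."""
    rng = np.random.default_rng(seed)
    # Cycle through 1..n_folds until every row has a label, then shuffle the labels.
    labels = np.arange(n_rows) % n_folds + 1
    return rng.permutation(labels)

folds = assign_folds(3723, 5, seed=1)
values, counts = np.unique(folds, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))   # fold sizes differ by at most 1
```

With a helper like this, something along the lines of `dc_listings["fold"] = assign_folds(len(dc_listings), 5, seed=1)` could stand in for the manual slicing, and the existing `train_and_validate` function would work unchanged for any number of folds.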

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds. We've learned how scikit-learn makes it easy for us to quickly experiment with these different knobs when it comes to building a better model. Let's now dive into how we can use scikit-learn to handle cross-validation as well.

First, we instantiate an instance of the [KFold class](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html):

    kf = KFold(n_splits=5, shuffle=False, random_state=None)

where:

- n_splits is the number of folds you want to use,
- shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
- random_state is used to specify a seed value if shuffle is set to True.

You'll notice here that none of the parameters depend on the data set at all. This is because the KFold class only describes how to split the data: its split() method returns an iterator over train/test row indices, but it won't actually handle the training and testing of models. If we're primarily only interested in error metrics for each fold, we can use the `KFold` class in conjunction with the [cross_val_score function](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), which will handle training and testing of the models in each fold.
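
To see what the iterator produces, here is a small sketch using the `sklearn.model_selection` version of `KFold` on a ten-row toy array. The array `X` and the list `test_rows_seen` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)      # ten toy observations, one feature
kf = KFold(n_splits=5, shuffle=True, random_state=1)

test_rows_seen = []
for train_idx, test_idx in kf.split(X):
    # Each iteration yields two arrays of row positions; no model is fit here.
    print("train:", train_idx, "test:", test_idx)
    test_rows_seen.extend(test_idx.tolist())

# Every row should appear in exactly one test fold.
print(sorted(test_rows_seen))
```

The data only enters when `split()` is called, which is why the same `kf` object can be reused across data sets and passed to `cross_val_score` through its `cv` parameter.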

Here are the relevant parameters for the `cross_val_score` function:

    cross_val_score(estimator, X, y, scoring=None, cv=None)
    
where:

- estimator is a sklearn model that implements the fit method (e.g. instance of KNeighborsRegressor),
- X is the list or 2D array containing the features you want to train on,
- y is a list containing the values you want to predict (target column),
- scoring is a string describing the scoring criteria (the accepted values are listed in scikit-learn's model evaluation documentation),
- cv describes the number of folds. Here are some examples of accepted values:
    - an instance of the KFold class,
    - an integer representing the number of folds.

`cross_val_score` returns an array containing one score for each fold, computed with the scoring criteria you specify. Here's the general workflow for performing k-fold cross-validation using the classes we just described:

- instantiate the scikit-learn model class you want to fit,
- instantiate the KFold class and use the parameters to specify the k-fold cross-validation attributes you want,
- use the cross_val_score function to return the scoring metric you're interested in.

#### Instructions:
- Create a new instance of the KFold class with the following properties:
    - 5 folds (n_splits set to 5),
    - shuffle set to True,
    - random seed set to 1 (so we can answer check using the same seed),
    - assigned to the variable kf.
- Create a new instance of the KNeighborsRegressor class and assign to knn.
- Use the cross_val_score function to perform k-fold cross-validation:
    - using the KNeighborsRegressor instance knn,
    - using the accommodates column for training,
    - using the price column as the target column,
    - returning an array of MSE values (one value for each fold).
- Assign the resulting list of MSE values to mses, convert to RMSE values, and assign the average RMSE value to avg_rmse. 

In [77]:
# https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error
# The actual MSE is simply the positive version of the number you're getting.
# The unified scoring API always maximizes the score, 
# so scores which need to be minimized are negated in order for the unified scoring API to work correctly. 
# The score that is returned is therefore negated when it is a score that should be minimized 
# and left positive if it is a score that should be maximized.

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor()

mses = cross_val_score(estimator=knn, X=dc_listings[['accommodates']], y=dc_listings.price, cv=kf, scoring="neg_mean_squared_error")
mses = np.array([-x for x in mses])
rmses = mses ** (1/2)
avg_rmse = np.mean(rmses)
avg_rmse

133.13901478226748
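
The sign handling in the cell above can be checked on hand-picked numbers. The scores below are made up for illustration; they mimic the negated values that the `neg_mean_squared_error` scoring option returns.

```python
import numpy as np

# Hypothetical per-fold scores in "neg_mean_squared_error" form: negated MSEs.
neg_mses = np.array([-100.0, -144.0, -121.0])

mses = -neg_mses             # flip the sign back to ordinary MSE values
rmses = np.sqrt(mses)        # one RMSE per fold
avg_rmse = rmses.mean()      # average RMSE across the folds
print(rmses, avg_rmse)       # [10. 12. 11.] 11.0
```

Negating the array directly (`-neg_mses`) gives the same result as the list comprehension used in the cell above.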

## 7: Exploring Different K Values

Choosing the right k value when performing k-fold cross validation is more of an art and less of a science. As we discussed earlier in the mission, a k value of 2 is really just holdout validation. On the other end, setting k equal to n (the number of observations in the data set) is known as **leave-one-out cross validation**, or **LOOCV** for short. Through lots of trial and error, data scientists have converged on 10 as the standard k value.
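
To make the k = n extreme concrete, here is a minimal LOOCV sketch on six toy prices. The data and the placeholder model (predict the mean of the training rows) are assumptions for illustration. Since every test fold holds a single row, the sketch pools the squared errors before taking the root.

```python
import numpy as np

prices = np.array([100.0, 120.0, 80.0, 150.0, 90.0, 110.0])   # toy target values
n = len(prices)

squared_errors = []
for i in range(n):
    train = np.delete(prices, i)      # every observation except row i
    prediction = train.mean()         # placeholder model: mean of the training rows
    squared_errors.append((prices[i] - prediction) ** 2)

# One model per observation, so n models are trained in total.
loocv_rmse = float(np.sqrt(np.mean(squared_errors)))
print(n, "models trained; LOOCV RMSE:", loocv_rmse)
```

With the 3723-row data set used in this mission, the same scheme would train 3723 models, which is why LOOCV becomes expensive as the data grows.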

In the following code block, we display the results of varying k from 3 to 23. For each k value, we calculate and display the average RMSE value across all of the folds and the standard deviation of the RMSE values. Across the many different k values, the average RMSE stays in a fairly narrow band, roughly 124 to 140. The standard deviation of the RMSE, by contrast, trends upward (though not monotonically) as we increase the number of folds, from approximately 13.4 at 5 folds to 38.0 at 21 folds.

In [82]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = [np.sqrt(np.absolute(mse)) for mse in mses]
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  138.85777313 std RMSE:  20.0193887134
5 folds:  avg RMSE:  133.139014782 std RMSE:  13.3991210902
7 folds:  avg RMSE:  127.193884418 std RMSE:  20.6216196916
9 folds:  avg RMSE:  139.577510538 std RMSE:  20.740269619
10 folds:  avg RMSE:  128.590398382 std RMSE:  19.1916176131
11 folds:  avg RMSE:  132.666928406 std RMSE:  27.2292312882
13 folds:  avg RMSE:  130.238261586 std RMSE:  28.1478124572
15 folds:  avg RMSE:  126.968042879 std RMSE:  32.1766475996
17 folds:  avg RMSE:  127.734183511 std RMSE:  34.3787457388
19 folds:  avg RMSE:  132.09637121 std RMSE:  30.7093499837
21 folds:  avg RMSE:  123.765258223 std RMSE:  38.0429054656
23 folds:  avg RMSE:  127.969795335 std RMSE:  35.4328513116


## 8: Bias-Variance Tradeoff

So far, we've been working under the assumption that a lower RMSE always means that a model is more accurate. This isn't the complete picture, unfortunately. A model has **two sources of error**, **bias** and **variance**.

**Bias describes error that results from bad assumptions about the learning algorithm**. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

**Variance describes error that occurs because of the variability of a model's predicted values**. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.

**The standard deviation of the RMSE values can be a proxy for a model's** **variance** while the **average RMSE is a proxy for a model's** **bias**. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.

<img src="img/bias_variance.png">
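
As a worked example of the two proxies, the sketch below reuses the five per-fold RMSE values printed by `train_and_validate` earlier in this mission. Treating the mean and standard deviation this way is the rough heuristic described above, not a formal bias-variance decomposition.

```python
import numpy as np

# Per-fold RMSE values copied from the train_and_validate output earlier in this mission.
rmses = np.array([114.57347222474498, 141.79793229152017, 108.62683533473076,
                  164.02243191723201, 115.18211504034528])

avg_rmse = rmses.mean()   # rough proxy for the model's bias
std_rmse = rmses.std()    # rough proxy for the model's variance (population std)
print(avg_rmse, std_rmse)
```

Comparing these two numbers across candidate models gives a rough read on which model errs more consistently and which one swings more from fold to fold.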

While k-nearest neighbors can make predictions, it isn't a mathematical model. A mathematical model is usually an equation that can exist without the original data, which isn't true with k-nearest neighbors. In the next two courses, we'll learn about a mathematical model called linear regression. We'll explore the bias-variance tradeoff in greater depth in these next 2 courses because of its importance when working with mathematical models in particular.

## 9: Next Steps

In this mission, we explored more robust cross validation techniques like holdout validation and k-fold cross-validation. Next in this course is a guided project where you can practice what you've learned in this course on a different data set.