In the last project, we learned about train/test validation, a simple technique for testing a machine learning model's accuracy on new data that the model wasn't trained on. In this project, we'll focus on more robust techniques.

We'll focus on the holdout validation technique, which involves:

* splitting the full dataset into 2 partitions:
  * a training set,
  * a test set,
* training the model on the training set,
* using the trained model to predict labels on the test set,
* computing an error metric to understand the model's effectiveness,
* switching the training and test sets and repeating,
* averaging the errors.

In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in our model performance.

![image.png](attachment:image.png)

Let's start by splitting the data set into 2 nearly equal halves.

When splitting the data set, we shouldn't forget to make a copy of each half using .copy() to ensure we don't get any unexpected results later on. If we run the code locally in Jupyter Notebook or JupyterLab without .copy(), we may see what is known as a [SettingWithCopy Warning](https://www.dataquest.io/blog/settingwithcopywarning/) as soon as we assign to one of the halves. The warning won't stop our code from running, but it's letting us know that we're trying to set values on what may be a copy of a slice from a dataframe, so the assignment might not behave as we expect. To avoid the warning, we call .copy() whenever we take a slice of a dataframe that we plan to modify.
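As a minimal sketch of why the copy matters, the toy dataframe below (a hypothetical stand-in, not the Airbnb data) is sliced with .copy() and then modified. The names `listings` and `first_half` are assumptions made for this illustration only.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration.
listings = pd.DataFrame({"price": [100.0, 150.0, 200.0, 250.0]})

# .copy() gives the slice its own data, independent of `listings`.
first_half = listings.iloc[:2].copy()
first_half["price_doubled"] = first_half["price"] * 2

# The new column exists only on the copy; the original is untouched.
print(list(listings.columns))
print(list(first_half.columns))
```

Because `first_half` owns its data, adding a column to it is unambiguous, which is the situation the warning is trying to steer us towards.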

In [1]:
import numpy as np
import pandas as pd

dc_listings = pd.read_csv("dc_airbnb.csv")

In [2]:
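# regex=False treats "$" and "," as literal characters rather than regex patterns
dc_listings["price"] = dc_listings["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)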

In [3]:
np.random.seed(1)
shuffled_index = np.random.permutation(dc_listings.index)
shuffled_index

array([ 574, 1593, 3091, ..., 1096,  235, 1061], dtype=int64)

In [4]:
dc_listings = dc_listings.reindex(shuffled_index)

In [5]:
dc_listings.shape[0]

3723

In [6]:
split_one = dc_listings.iloc[:1862].copy()
split_two = dc_listings.iloc[1862:].copy()

Now that we've split our data set into 2 dataframes, let's:

* train a k-nearest neighbors model on the first half,
* test this model on the second half,
* train a k-nearest neighbors model on the second half,
* test this model on the first half.

In [7]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# k-nearest neighbors model using the default algorithm (auto) and the default number of neighbors (5)

# First half
feature = ["accommodates"]
model = KNeighborsRegressor()
model.fit(split_one[feature], split_one["price"])
prediction = model.predict(split_two[feature])
mse1 = mean_squared_error(split_two["price"],prediction)
rmse1 = np.sqrt(mse1)
rmse1

131.70294747240013

In [8]:
# Second half

feature = ["accommodates"]
model = KNeighborsRegressor()
model.fit(split_two[feature], split_two["price"])
prediction = model.predict(split_one[feature])
mse2 = mean_squared_error(split_one["price"],prediction)
rmse2 = np.sqrt(mse2)
rmse2

126.22214718656423

In [9]:
# average of the 2 RMSE values

avg_rmse = np.mean([rmse1, rmse2])
avg_rmse

128.96254732948216

If we average the two RMSE values, we get an RMSE value of approximately **128.96**. Holdout validation is actually a specific example of a larger class of validation techniques called **k-fold cross-validation**.

While holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, each of the two models is trained on only half of the available data. K-fold cross-validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.

Here's the algorithm for k-fold cross-validation:

* splitting the full dataset into k equal-length partitions (or as close to equal as possible):
  * selecting k-1 partitions as the training set,
  * selecting the remaining partition as the test set,
* training the model on the training set,
* using the trained model to predict labels on the test fold,
* computing the test fold's error metric,
* repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration,
* calculating the mean of the k error values.

Holdout validation is essentially a version of k-fold cross-validation where k is equal to 2. Generally, 5 or 10 folds are used for k-fold cross-validation.

![image.png](attachment:image.png)

As we increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases.
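To make the first half of that statement concrete, here is a small sketch that counts how many rows land in each fold for a few values of k. It assumes the 3,723 rows reported for dc_listings earlier and uses `np.array_split` as one convenient way to cut a range into near-equal pieces; the name `fold_sizes` is introduced here purely for illustration.

```python
import numpy as np

n_rows = 3723  # row count of dc_listings, as reported earlier

fold_sizes = {}
for k in [2, 5, 10]:
    # np.array_split copes with lengths that k does not divide evenly
    pieces = np.array_split(np.arange(n_rows), k)
    fold_sizes[k] = [len(piece) for piece in pieces]
    print(k, "folds -> smallest:", min(fold_sizes[k]), "largest:", max(fold_sizes[k]))
```

Each test fold shrinks as k grows, so each fold's error is estimated from fewer observations. This sketch only shows the fold sizes; it does not measure the spread of the errors.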

In [33]:
dc_listings.loc[dc_listings.index[0:745],"fold"] = 1.0 # manually partitioning the data set into 5 folds.
dc_listings.loc[dc_listings.index[745:1490],"fold"] = 2
dc_listings.loc[dc_listings.index[1490:2234],"fold"] = 3
dc_listings.loc[dc_listings.index[2234:2978],"fold"] = 4
dc_listings.loc[dc_listings.index[2978:],"fold"] = 5

In [34]:
dc_listings["fold"].value_counts()

5.0    745
2.0    745
1.0    745
4.0    744
3.0    744
Name: fold, dtype: int64

In [35]:
print("\n Num of missing values: ", dc_listings["fold"].isnull().sum())


 Num of missing values:  0


In [36]:
# Performing the first iteration of k-fold cross validation on a simple, univariate model.


train_iteration_one = dc_listings[dc_listings["fold"] != 1]
test_iteration_one = dc_listings[dc_listings["fold"] == 1].copy()
model = KNeighborsRegressor()
model.fit(train_iteration_one[["accommodates"]],
          train_iteration_one["price"])

# Predicting
labels = model.predict(test_iteration_one[["accommodates"]])
iteration_one_rmse = np.sqrt(mean_squared_error(test_iteration_one["price"], labels))
iteration_one_rmse

107.04609155929425

From the first iteration, we achieved an RMSE value of roughly **107.05**. Let's calculate the RMSE values for the remaining iterations. To make the iteration process easier, let's wrap the code we wrote in a function.

In [37]:
fold_ids = [1,2,3,4,5]
def train_and_validate(df, folds): # folds is a list of fold ids
    rmses = []
    for i in folds:
        # train on every fold except fold i, then test on fold i
        train = df[df["fold"] != i]
        test = df[df["fold"] == i]
        model = KNeighborsRegressor()
        model.fit(train[["accommodates"]], train["price"])
        prediction = model.predict(test[["accommodates"]])
        mse = mean_squared_error(test["price"], prediction)
        rmse = np.sqrt(mse)
        rmses.append(rmse)
    return rmses

In [38]:
def train_and_validate(df, folds):
    rmses = []
    for fold in folds:
        # hold out the current fold for testing and train on the rest
        train = df[df["fold"] != fold].copy()
        test = df[df["fold"] == fold].copy()
        knn = KNeighborsRegressor()
        knn.fit(train[["accommodates"]], train["price"])
        test["predicted_price"] = knn.predict(test[["accommodates"]])
        rmse = np.sqrt(mean_squared_error(test["price"], test["predicted_price"]))
        rmses.append(rmse)
    return rmses

rmses = train_and_validate(dc_listings, fold_ids) 
avg_rmse = np.mean(rmses)
print(rmses, avg_rmse)

[107.04609155929425, 136.62225078440179, 153.0273362676136, 107.39207160219395, 146.9242838376558] 130.20240681023188


While the average RMSE value was approximately 130, the RMSE values ranged from roughly 107 to 153. This large amount of variability between the RMSE values means that we're using either a poor model or a poor evaluation criterion (or a bit of both!).

The function we wrote, however, has many limitations. For example, it relies on a fold column that we assigned by hand. If we want to change the number of folds, we need to make the function more general so that it also handles randomizing the ordering of the rows in the dataframe and splitting them into folds.
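One possible way to generalize the fold assignment is sketched below. The helper `assign_folds`, its `seed` parameter, and the toy dataframe are all hypothetical names introduced for this illustration; the sketch shuffles the rows and labels them 1 through k, which is only one of several reasonable designs.

```python
import numpy as np
import pandas as pd

def assign_folds(df, k, seed=1):
    """Return a shuffled copy of df with a 'fold' column holding 1..k."""
    rng = np.random.RandomState(seed)
    shuffled = df.iloc[rng.permutation(len(df))].copy()
    fold_labels = np.empty(len(shuffled), dtype=int)
    # np.array_split yields k near-equal blocks of row positions
    blocks = np.array_split(np.arange(len(shuffled)), k)
    for fold_id, positions in enumerate(blocks, start=1):
        fold_labels[positions] = fold_id
    shuffled["fold"] = fold_labels
    return shuffled

# Hypothetical toy data standing in for dc_listings.
toy = pd.DataFrame({"accommodates": range(10), "price": range(100, 110)})
toy_folds = assign_folds(toy, k=3)
print(toy_folds["fold"].value_counts().sort_index())
```

A dataframe labelled this way could then be passed to train_and_validate() together with `list(range(1, k + 1))` as the fold ids.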

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds.

We've learned how scikit-learn makes it easy for us to quickly experiment with these different knobs when it comes to building a better model. Let's now dive into how we can use scikit-learn to handle cross-validation as well.

First, we create an instance of the [KFold class](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) from sklearn.model_selection:

from sklearn.model_selection import KFold

kf = KFold(n_splits, shuffle=False, random_state=None)

where:
* n_splits is the number of folds we want to use,
* shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
* random_state is used to specify the random seed value if shuffle is set to True.

We'll notice here that no parameters depend on the data set at all. This is because a KFold instance only describes how to split the data; the row indices for each fold are generated later, when the instance is used in conjunction with the cross_val_score() function, also from sklearn.model_selection. Together, the KFold class and the cross_val_score() function allow us to compactly train and test using k-fold cross-validation.
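To see what a KFold instance produces once it is given data, the sketch below calls its split() method directly on six hypothetical observations. The names `X_toy`, `toy_kf`, and `splits` are assumptions for this example; in normal use cross_val_score() performs this step for us.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(6).reshape(-1, 1)  # six hypothetical observations

toy_kf = KFold(n_splits=3)  # shuffle=False keeps the rows in their original order

# split() yields one (train indices, test indices) pair per fold
splits = [(train_idx.tolist(), test_idx.tolist())
          for train_idx, test_idx in toy_kf.split(X_toy)]
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each observation appears in exactly one test fold, and the remaining rows form that iteration's training set.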

Here are the relevant parameters for the cross_val_score function:

from sklearn.model_selection import cross_val_score

cross_val_score(estimator, X, y, scoring=None, cv=None)

where:
* estimator is a sklearn model that implements the fit method (e.g. instance of KNeighborsRegressor),
* X is the list or 2D array containing the features you want to train on,
* y is a list containing the values you want to predict (target column),
* scoring is a string describing the scoring criteria (list of accepted values [here](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)),
* cv describes the number of folds. Here are some examples of accepted values:
  * an instance of the KFold class,
  * an integer representing the number of folds.

Depending on the scoring criteria we specify, a single score is returned for each fold. Here's the general workflow for performing k-fold cross-validation using the classes we just described:

* instantiate the scikit-learn model class we want to fit,
* instantiate the KFold class, using its parameters to specify the k-fold cross-validation attributes we want,
* use the cross_val_score() function to return the scoring metric we're interested in.

In [42]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5,shuffle=True, random_state=1) # Instance of the KFold class
knn = KNeighborsRegressor() #  instance of the KNeighborsRegressor

In [43]:
#  cross_val_score() function to perform k-fold cross-validation

mses = cross_val_score(knn, dc_listings[["accommodates"]], dc_listings["price"],
                       scoring = "neg_mean_squared_error", cv = kf) # this scoring string corresponds to the negated metrics.mean_squared_error

mses

array([-18950.25567785, -22382.62416107, -17724.9870604 , -10639.59569892,
       -22429.3527957 ])

In [44]:
rmses = np.sqrt(np.abs(mses))
rmses

array([137.65992764, 149.60823561, 133.13522096, 103.14841588,
       149.76432418])
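The scores come back negated because scikit-learn treats higher scores as better, so we flip the sign before taking the square root. As a hedged aside, reasonably recent scikit-learn releases also accept the scoring string "neg_root_mean_squared_error", which skips the square-root step. The sketch below compares the two routes on synthetic data (an assumption made so the example is self-contained; it is not the Airbnb data set).

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data, not the Airbnb listings.
rng = np.random.RandomState(1)
X_demo = rng.uniform(1, 8, size=(200, 1))
y_demo = 50.0 * X_demo[:, 0] + rng.normal(0, 20, size=200)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1)
knn_demo = KNeighborsRegressor()

neg_mses = cross_val_score(knn_demo, X_demo, y_demo,
                           scoring="neg_mean_squared_error", cv=kf_demo)
neg_rmses = cross_val_score(knn_demo, X_demo, y_demo,
                            scoring="neg_root_mean_squared_error", cv=kf_demo)

rmses_from_mse = np.sqrt(-neg_mses)  # negate, then take the square root
rmses_direct = -neg_rmses            # negate only
print(np.allclose(rmses_from_mse, rmses_direct))
```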

Choosing the right k value when performing k-fold cross validation is more of an art and less of a science. As we discussed earlier in this project, a k value of 2 is really just **holdout validation**. On the other end, setting k equal to n (the number of observations in the data set) is known as **leave-one-out cross validation**, or LOOCV for short. Through lots of trial and error, data scientists have converged on 10 as the standard k value.
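As a small illustration of the leave-one-out end of that range, the sketch below compares KFold with k equal to n against scikit-learn's LeaveOneOut splitter on eight hypothetical observations. The variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_small = np.arange(8).reshape(-1, 1)  # eight hypothetical observations
n = len(X_small)

# Collect the test indices produced by each splitter.
kfold_tests = [test.tolist() for _, test in KFold(n_splits=n).split(X_small)]
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_small)]

print(kfold_tests == loo_tests)
print(len(loo_tests))
```

Every iteration holds out a single observation, so LOOCV fits n models, which can become expensive on larger data sets.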

In [47]:
from sklearn.model_selection import KFold, cross_val_score

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(n_splits=fold, shuffle=True, random_state=1)
    knn = KNeighborsRegressor()
    mses = cross_val_score(knn, dc_listings[["accommodates"]], dc_listings["price"], scoring = "neg_mean_squared_error", cv = kf)
    rmses = np.sqrt(np.abs(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  126.19223474162338 std RMSE:  1.1069911739514842
5 folds:  avg RMSE:  134.66322485283825 std RMSE:  17.06396645051883
7 folds:  avg RMSE:  128.76102940687468 std RMSE:  15.113036472055438
9 folds:  avg RMSE:  130.97403954965455 std RMSE:  17.552727811846307
10 folds:  avg RMSE:  129.38243003886737 std RMSE:  22.858028610253182
11 folds:  avg RMSE:  128.5209174821928 std RMSE:  21.039336959245166
13 folds:  avg RMSE:  128.66536927916962 std RMSE:  29.931738536000207
15 folds:  avg RMSE:  127.74903938013544 std RMSE:  30.22520525014004
17 folds:  avg RMSE:  125.08689980063376 std RMSE:  34.7037432777398
19 folds:  avg RMSE:  123.24952437277325 std RMSE:  37.93258646028018
21 folds:  avg RMSE:  129.74292153412665 std RMSE:  36.31090634328946
23 folds:  avg RMSE:  129.91038839859866 std RMSE:  37.34861558876879


In the above code block, we display the results of varying k from 3 to 23. For each k value, we calculate and display the average RMSE value across all of the folds and the standard deviation of the RMSE values. Across the many different k values, it seems like the average RMSE value is around 129. We'll notice that the standard deviation of the RMSE increases from approximately 1 to over 35 as we increase the number of folds.

So far, we've been working under the assumption that a lower RMSE always means that a model is more accurate. This isn't the complete picture, unfortunately. A model has two sources of error, **bias** and **variance**.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality, there's always a tradeoff.

The standard deviation of the RMSE values can be a proxy for a model's **variance** while the average RMSE is a proxy for a model's **bias**. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.
![image.png](attachment:image.png)
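As a quick worked example of those two proxies, the sketch below recomputes them from the five per-fold RMSE values printed by the 5-fold cross_val_score() run earlier in this project. The variable name `fold_rmses` is introduced here for illustration, and treating these numbers as proxies is a rough heuristic rather than a formal decomposition of the error.

```python
import numpy as np

# Per-fold RMSE values copied from the 5-fold cross_val_score() output earlier.
fold_rmses = np.array([137.65992764, 149.60823561, 133.13522096,
                       103.14841588, 149.76432418])

avg_rmse = fold_rmses.mean()  # rough proxy for bias
std_rmse = fold_rmses.std()   # rough proxy for variance (population std, as np.std computes)
print(round(avg_rmse, 2), round(std_rmse, 2))
```

These match the "5 folds" row of the table printed earlier, where the average RMSE is about 134.66 and the standard deviation is about 17.06.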

While k-nearest neighbors can make predictions, it isn't a mathematical model. A mathematical model is usually an equation that can exist without the original data, which isn't true with k-nearest neighbors. In the next two projects, we'll learn about a mathematical model called **linear regression**. We'll explore the bias-variance tradeoff in greater depth in these next 2 projects because of its importance when working with mathematical models in particular.