# Cross validation

## Introduction

+ Read the file paris_airbnb.csv in a dataFrame paris_listings.
+ Remove commas and dollars from the target column'price' and convert it to'float'.
+ Use the numpy.random.permutation() function to mix the order of the betting_listings lines (first we place on random seed 1).
+ Re-index the DataFrame according to this new order with the method DataFrame.reindex().
+ Select the first 4000 lines and assign them to the split_one variable.
+ Select the remaining 4000 lines and assign it to the split_two variable.


In [6]:
# .index
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
np.random.seed(1)

paris_listings = pd.read_csv('paris_airbnb.csv')
stripped_commas = paris_listings['price'].str.replace(',', '')
stripped_dollars = stripped_commas.str.replace('$','')
paris_listings['price'] = stripped_dollars.astype('float')

shuffled_index = np.random.permutation(paris_listings.index)
paris_listings = paris_listings.reindex(shuffled_index)

split_one = paris_listings.iloc[0:4000]
split_two = paris_listings.iloc[4000:]

In [7]:

def predict_price(new_listing):
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return(predicted_price)

In [9]:
#from sklearn.model_selection import KFold
#kf = KFold(n_splits=2)
#kf.get_n_splits(split_one)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

# première moitié
model = KNeighborsRegressor()
model.fit(train_one[['accommodates']], train_one['price'])
test_one['predicted_price'] = model.predict(test_one[['accommodates']])
iteration_one_rmse = mean_squared_error(test_one['price'], test_one['predicted_price'])**(1/2)

# seconde moitié
model.fit(train_two[['accommodates']], train_two['price'])
test_two['predicted_price'] = model.predict(test_two[['accommodates']])
iteration_two_rmse = mean_squared_error(test_two['price'], test_two['predicted_price'])**(1/2)

avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])
print(iteration_one_rmse, iteration_two_rmse, avg_rmse)

88.96592437557203 115.17976784140521 102.07284610848862


## Cross-validation Holdout

+ Train a model of the k closest neighbors using the default algorithm (auto) and the number of neighbors by default (5) in:
 - Using the'accommodates' column of train_one (first half of the dataset) for training and
 - And test it on test_one (second half of the dataset).
+ Assign the resulting RMSE value (square root mean square error) to the iteration_one_rmse variable.
+ Train a model of the k closest neighbors using the default algorithm (auto) and the number of neighbors by default (5):
 - Use the'accommodates' column of train_two (second half of the dataset this time) for training and
 - And test it on test_two (first half of the dataset).
+ Assign the resulting RMSE value to the iteration_two_rmse variable.
+ Use numpy.mean() to calculate the average of the 2 RMSE values and assign the result to the avg_rmse variable.
+ Display the result


In [11]:


kf = KFold(n_splits=5)
kf.get_n_splits(paris_listings['fold'].iloc[0:1600])

for train_index, test_index in kf.split(X):
 print(“TRAIN:”, train_index, “TEST:”, test_index)
 X_train, X_test = paris_listings['fold'].iloc[0:1600] , paris_listings['fold'].iloc[1600:3200]
 y_train, y_test = paris_listings['fold'].iloc[3200:4800], paris_listings['fold'].iloc[4800:6400]
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))

5

## Cross-validation of K-Fold

+ Add a new column to the DataFrame paris_listings called "fold" which contains the fold number of each row:
+ Fold 1 must have index lines 0 to 1600, including these 2 lines.
+ Fold 2 must have index lines 1600 to 3200, including these 2 lines.
+ Fold 3 must have index lines 3200 to 4800, including these 2 lines.
+ Fold 4 must have index lines 4800 to 6400, including these 2 lines.
+ Fold 5 must have index lines 6400 to 8000, including these 2 lines.
+ Display the number of values for each'fold' column to confirm that each fold contains approximately the same number of elements.


In [15]:
# value_counts()
# sklearn.metrics.mean_squared_error()
from sklearn.model_selection import KFold
import numpy as np
fold_ids = [1,2,3,4,5]

paris_listings['fold'] = 0
paris_listings['fold'].iloc[0:1600] = 1
paris_listings['fold'].iloc[1600:3200] = 2
paris_listings['fold'].iloc[3200:4800] = 3
paris_listings['fold'].iloc[4800:6400] = 4
paris_listings['fold'].iloc[6400:8000] = 5

def train_and_validate(df, folds):
    fold_rmses = []
    #training
    for fold in folds:
        model = KNeighborsRegressor()
        train = df[df['fold'] != fold]
        test = df[df['fold'] == fold]
        model.fit(train[['accommodates']],train['price'])
    
    #prediction
        labels = model.predict(test[['accommodates']])
        test['predicted_price'] = labels
        mse = mean_squared_error(test['price'], test['predicted_price'])
        rmse = mse**(1/2)
        fold_rmses.append(rmse)
    return (fold_rmses)

rmses = train_and_validate(paris_listings, fold_ids)
print(rmses)

avg_rmse = np.mean(rmses)
print(avg_rmse)

[81.94523308283405, 156.1902075995803, 72.58622217749041, 99.10605291807357, 83.16789539840478]
98.59912223527662


## First iteration

+ Train a model of the nearest k neighbours using the'accommodates' column as the only characteristic on the training set (folds 2 to 5 of the DataFrame paris_listings).
+ Use the model to make predictions on the test set (column'accommodates' of fold 1) and assign the predicted results to the labels variable.
+ Calculate the RMSE value by comparing the'price' column with the predicted label values.
+ Assign the RMSE value to the iteration_one_rmse variable.
+ Display the result

In [28]:
# != 1
# == 1

## Function to train models

+ Write a function that we will name train_and_validate that takes a dataframe as the first parameter (df) and a list of fold number values (1 to 5 in our case) as the second parameter (folds). This function should:
 - Train n models (where n is the fold number) and perform a cross validation of k-fold (using n folds). Use the default k value for the KNeighborsRegressor class.
 - Return the list of RMSE values, where the first element is when fold 1 is the test set, the second element is when fold 2 is the test set, and so on.
+ Use the train_and_validate function to return the list of RMSE values for the paris_listings dataframe and assign it to the rmses variable.
+ Calculate the average of these values and assign it to the avg_rmse variable.
+ Display rmses and avg_rmse.

In [32]:
# train = paris_listings[paris_listings['fold'] != fold]

## Perform a cross validation of K-Fold using Scikit-Learn


+ Create a new instance of the KFold class with the following properties:
 - 5 folds,
 - Set shuffle to True,
 - Set random_state to 1 (to get the same result as me),
 - Assigned to the variable kf.
+ Create a new instance of the class KNeighborsRegressor and assign it to the variable knn.
+ Use the cross_val_score() function to cross-validate k-fold in:
 - Using the KNeighborsRegressor instance knn,
 - Using the'accommodates' column for training,
 - Using the'price' column for the target column,
 - Returning an array of MSE values (one value for each fold).
+ Assign the resulting list of MSE values to the mses variable. Then, take the absolute value followed by the square root of each MSE value. Finally, calculate the average of the resulting RMSE values and assign the result to the avg_rmse variable.

In [27]:
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(5, shuffle=True, random_state=1)
knn = KNeighborsRegressor()
mses = cross_val_score(knn, paris_listings[['accommodates']], paris_listings['price'], scoring='neg_mean_squared_error', cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)

print(rmses)
print(avg_rmse)

[ 75.39017691  78.61860292  91.61952671  87.38039883 158.31198012]
98.26413709965395


## Explore different values of k

In [29]:
for fold in range(5, 30):
    
    kf = KFold(fold, shuffle=True, random_state=1)
    knn = KNeighborsRegressor()
    mses = cross_val_score(knn, paris_listings[['accommodates']], paris_listings['price'], scoring='neg_mean_squared_error', cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    
    print(str(fold),"folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))


5 folds:  avg RMSE:  98.26413709965395 std RMSE:  30.58599393612067
6 folds:  avg RMSE:  96.72094167843518 std RMSE:  31.86150896719774
7 folds:  avg RMSE:  100.5802680585613 std RMSE:  30.298978546243564
8 folds:  avg RMSE:  99.09090770943914 std RMSE:  32.01216994081181
9 folds:  avg RMSE:  100.65349476343783 std RMSE:  31.016383141381176
10 folds:  avg RMSE:  99.64732774449637 std RMSE:  32.80776719590842
11 folds:  avg RMSE:  98.01098681083695 std RMSE:  34.61336551901312
12 folds:  avg RMSE:  96.32608190568624 std RMSE:  36.84213484714486
13 folds:  avg RMSE:  96.33532504669681 std RMSE:  36.04164484994614
14 folds:  avg RMSE:  97.83887571975254 std RMSE:  37.486067259653595
15 folds:  avg RMSE:  95.58687573751473 std RMSE:  36.73027442785193
16 folds:  avg RMSE:  132.60431944488175 std RMSE:  145.34093862212308
17 folds:  avg RMSE:  98.2756484776724 std RMSE:  39.844277774194715
18 folds:  avg RMSE:  96.66674124116822 std RMSE:  41.06923127462802
19 folds:  avg RMSE:  94.81790717