# Holdout Validation

You're probably used to 80-20 train/test data splitting. But here we're going to do a 50-50 train/test split one way, and then train and test again with the two halves swapped. So we'll end up with two models and two error numbers, which we'll then average.

The ultimate goal DQ hasn't told me yet :)

In [21]:
import pandas as pd
from pandas.api.types import is_numeric_dtype
import numpy as np

In [4]:
dc_listings = pd.read_csv("tomslee_airbnb_washington_1433_2017-07-11.csv")
dc_listings.head()

Unnamed: 0,room_id,survey_id,host_id,room_type,country,city,borough,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,bathrooms,price,minstay,last_modified,latitude,longitude,location
0,3732219,1433,280636,Shared room,,Washington,,Columbia Heights,0,0.0,3,1.0,,129.0,,2017-07-11 08:53:56.381540,38.931081,-77.030618,0101000020E6100000D02A33A5F54153C0A77686A92D77...
1,15087225,1433,90860645,Shared room,,Washington,,Brentwood,6,5.0,4,1.0,,118.0,,2017-07-11 08:53:55.616987,38.908054,-77.003306,0101000020E61000005B785E2A364053C041800C1D3B74...
2,19634784,1433,138150306,Shared room,,Washington,,South West,1,0.0,4,1.0,,84.0,,2017-07-11 08:53:53.434225,38.884121,-77.019518,0101000020E6100000BCEB6CC83F4153C0795A7EE02A71...
3,18547685,1433,26180779,Shared room,,Washington,,Shaw,11,5.0,2,1.0,,74.0,,2017-07-11 08:53:49.654605,38.910593,-77.023461,0101000020E6100000D0EE9062804153C0B77BB94F8E74...
4,13878076,1433,2387207,Shared room,,Washington,,Cleveland Park,2,0.0,2,1.0,,50.0,,2017-07-11 08:53:48.721169,38.935485,-77.059807,0101000020E61000009A44BDE0D34353C00473F4F8BD77...


Putting this one here (though it's not necessary) just to show one way of cleaning a should-be-numeric column :

In [17]:
is_numeric_dtype( dc_listings['city'].dtype )

False

In [19]:
if not is_numeric_dtype( dc_listings['price'] ) :
    dc_listings['price'] = dc_listings['price'].str.replace('[ $,]', '', regex=True)  # DQ doesn't know yet that str.replace takes a regex (pandas >= 2.0 needs regex=True for it)
    try :
        dc_listings['price'] = dc_listings['price'].astype('float')
    except ValueError as err :
        print("Price needs pre-processing to remove bad chars : {}".format(err) )
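
As a side note, the same clean-up can be sketched without the try/except. This is only a sketch on a made-up `raw` Series (hypothetical values, not from the listings file); it assumes `pd.to_numeric` with `errors="coerce"` is acceptable, i.e. that unparseable prices should become NaN rather than stop the cell.

```python
import pandas as pd

# Hypothetical raw prices, not taken from the dataset above.
raw = pd.Series(["$1,200", " 84 ", "n/a", "50.5"])

# Strip spaces, dollar signs and thousands separators with an explicit regex.
stripped = raw.str.replace(r"[ $,]", "", regex=True)

# errors="coerce" turns anything unparseable into NaN instead of raising.
clean = pd.to_numeric(stripped, errors="coerce")
print(clean.tolist())
```

The NaN rows can then be inspected or dropped before fitting, which is a different trade-off from the try/except above, where a single bad value leaves the whole column unconverted.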

Remove any bias in the data by randomizing (shuffling)

In [23]:
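dc_listings = dc_listings.iloc[ np.random.permutation( len(dc_listings) ) ].reset_index( drop=True )  # reset the index so that later .loc[ row_number ] calls follow the shuffled order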

In [24]:
N_split = int( len( dc_listings )/2 )
train_df = dc_listings[ : N_split ]
test_df  = dc_listings[ N_split : ]
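
If the split needs to be repeatable between runs, one option is to shuffle with a fixed seed. A minimal sketch, assuming a tiny made-up frame in place of `dc_listings` (the `df` values and the seed `1` are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical stand-in for dc_listings.
df = pd.DataFrame({"accommodates": [1, 2, 3, 4, 5, 6],
                   "price": [50.0, 74.0, 84.0, 118.0, 129.0, 150.0]})

# sample(frac=1) returns every row in random order; random_state fixes that order.
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

# Positional 50-50 split of the shuffled rows.
n_split = len(shuffled) // 2
train_df = shuffled.iloc[:n_split]
test_df = shuffled.iloc[n_split:]
print(len(train_df), len(test_df))
```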

In [26]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [30]:
knn = KNeighborsRegressor( )
knn.fit( train_df[["accommodates"]], train_df['price'])
predictions = knn.predict( test_df[['accommodates']])
rmse1 = mean_squared_error( test_df['price'], predictions )**0.5
knn.fit( test_df[['accommodates']] , test_df['price'])
predictions = knn.predict( train_df[['accommodates']])
rmse2 = mean_squared_error( train_df['price'], predictions )**0.5
avg_rmse = np.mean([rmse1, rmse2])
avg_rmse

382.00638342493085

### K-Fold Cross Validation

Instead of just 50-50, which goes against the 80-20 rule, what if we (still no clue why) do the 80-20 split with the 20% test portion moving around in 5 ways (quintiles - you get the idea) and take the average of the resulting five RMSEs?

First off, fold the dataframe - an obvious way would be to split into two as we have done before, but another way is to stick a label onto a new column that shows which fold the row belongs to. It would be nice to have a **nice pythonic idiomatic way** to do it, but.. for now..

In [31]:
filler = 0
count = 0
N = len( dc_listings )
n_folds = 5
M = int( N / n_folds )
# dc_listings['fold'] = 0
while count < N :
    dc_listings.loc[count,'fold'] = filler
    count += 1
    if 0 == count % M and (filler < n_folds-1) :
        filler += 1
dc_listings['fold'].value_counts()


4.0    1649
1.0    1647
3.0    1647
2.0    1647
0.0    1647
Name: fold, dtype: int64

Okay, that obviously worked out.. Let's make a function we can call on anytime to fold a dataframe..

In [33]:
def fold_df( df, n_folds ) :
    filler = 0
    count = 0
    N = len( df )
    M = int( N / n_folds )
    while count < N :
        df.loc[count,'fold'] = filler
        count += 1
        if 0 == count % M and (filler < n_folds-1) :
            filler += 1

In [34]:
fold_df( dc_listings, 10 )
dc_listings['fold'].value_counts()

9.0    830
6.0    823
0.0    823
5.0    823
8.0    823
3.0    823
2.0    823
7.0    823
4.0    823
1.0    823
Name: fold, dtype: int64
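
The loop above can also be written without iterating over rows. A sketch of one possible idiomatic version, assuming the same rule as the loop (equal-sized folds, with the last fold taking the remainder) and at least as many rows as folds; `fold_labels` is a hypothetical helper name and the 23-row frame is made up:

```python
import numpy as np
import pandas as pd

def fold_labels(n_rows, n_folds):
    """Hypothetical helper: one fold label per row position.

    Rows are cut into blocks of n_rows // n_folds; the last fold
    absorbs the leftover rows, like the while-loop version.
    Assumes n_rows >= n_folds.
    """
    fold_size = n_rows // n_folds
    return np.minimum(np.arange(n_rows) // fold_size, n_folds - 1)

df = pd.DataFrame({"accommodates": range(23)})  # made-up 23-row frame
df["fold"] = fold_labels(len(df), 5)            # assigned by position, not by index label
print(df["fold"].value_counts().sort_index().tolist())
```

Because the array is assigned by position, this labels rows in their current order even when the index is not 0..N-1, and the labels come out as integers rather than floats.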

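Another way (5 folds) is to specify the split boundaries explicitly (the values below are hardcoded, so they need to match the length of the dataframe) and then :

```
splits = [0, 745, 1490, 2234, 2978, len(dc_listings) ]
for i,split in enumerate(splits[:-1]) :
    # .loc label slices include the end label, so stop one row short of the next boundary
    # (assumes the default 0..N-1 index; folds are numbered 1..5 here, not 0..4)
    dc_listings.loc[ split : splits[i+1] - 1, 'fold'] = i+1
```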

Along these lines, make a function that takes the DF and the fold labels and returns a list of the RMSEs :

In [35]:
def train_and_validate( df, folds ) :   # assumes that fold labels already exist in df
    rmses = []
    for fold in folds :
        is_test = df['fold'] == fold   # rows of the held-out fold
        knn = KNeighborsRegressor()
        # train on all other folds; price is passed as a 1-D Series so predict() returns a 1-D array
        knn.fit( df.loc[ ~is_test, ['accommodates'] ], df.loc[ ~is_test, 'price' ] )
        # note : this writes the predictions into a 'labels' column of df
        df.loc[ is_test, 'labels' ] = knn.predict( df.loc[ is_test, ['accommodates'] ] )
        rmses.append( mean_squared_error( df.loc[ is_test, 'price' ], df.loc[ is_test, 'labels' ] )**0.5 )
    return rmses

What does scikit-learn have that can help with such an approach?

```
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor

kf = KFold( 5 , shuffle=True, random_state=1 )
knn = KNeighborsRegressor()
mses = cross_val_score(knn , dc_listings[['accommodates']], dc_listings['price'], scoring='neg_mean_squared_error' , cv=kf)
rmses = [ (abs(x))**0.5 for x in mses]
avg_rmse = np.mean( rmses )
```
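
To see how the hand-rolled loop and `cross_val_score` relate, here is a sketch that runs both over the same `KFold` object. It uses synthetic data (made-up `accommodates` and `price` values, not the listings file) and assumes an integer `random_state`, so that `kf.split` hands out the same folds each time it is called:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the listings data (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({"accommodates": rng.integers(1, 9, size=200)})
df["price"] = 40.0 * df["accommodates"] + rng.normal(0, 25, size=200)

kf = KFold(5, shuffle=True, random_state=1)

# Manual loop: kf.split yields positional train/test indices for each fold.
manual_rmses = []
for train_idx, test_idx in kf.split(df):
    knn = KNeighborsRegressor()
    knn.fit(df.iloc[train_idx][["accommodates"]], df.iloc[train_idx]["price"])
    predictions = knn.predict(df.iloc[test_idx][["accommodates"]])
    manual_rmses.append(mean_squared_error(df.iloc[test_idx]["price"], predictions) ** 0.5)

# The same folds, handled by cross_val_score.
mses = cross_val_score(KNeighborsRegressor(), df[["accommodates"]], df["price"],
                       scoring="neg_mean_squared_error", cv=kf)
auto_rmses = np.sqrt(np.abs(mses))

print(np.allclose(manual_rmses, auto_rmses))
```

The scores come back negated because scikit-learn scorers follow a "higher is better" convention, which is why the absolute value is taken before the square root.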

Bottom line - be aware of **bias** (error from assumptions that are too simple, like putting all the weight on one or two features ("My neighbour must be happy because he has a pretty wife")) and **variance** (error from a model that depends on too many features, so that the impact of the features that really matter is diluted by the impact of the ones that really don't, and the predictions change a lot from one training sample to the next).

But, looking at the error results, how are you to know about bias and variance?
Answer : bias is seen in the average RMSE, and variance in the standard deviation of the RMSE across the folds (in the simple case above, we generate samples through different splits of train/test).
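
A short sketch of reading both numbers off one cross-validation run. The data here is synthetic (made-up `X` and `y`, not the listings), and treating the mean as a stand-in for bias and the standard deviation as a stand-in for variance is the rough rule of thumb from these notes rather than a formal decomposition:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic feature and target (values are made up).
rng = np.random.default_rng(0)
X = rng.integers(1, 9, size=(200, 1)).astype(float)
y = 40.0 * X[:, 0] + rng.normal(0, 25, size=200)

kf = KFold(5, shuffle=True, random_state=1)
mses = cross_val_score(KNeighborsRegressor(), X, y,
                       scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.abs(mses))   # one RMSE per fold

avg_rmse = np.mean(rmses)       # rough indicator of bias
std_rmse = np.std(rmses)        # rough indicator of variance
print(avg_rmse, std_rmse)
```

Comparing these two numbers across different models (say, different feature sets or different `n_neighbors` values) is one way to pick a model whose error is both low on average and stable from fold to fold.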