### Lab:  Model Validation With Decision Trees

Welcome to this evening's lab! It's going to be a fun one. In today's class, we're going to take a crack at model building in a holistic way.

Specifically, we're going to do three things:

 - Try out different versions of our data, and use our validation scores to see whether each change was an improvement
 - Adjust model parameters to help curb overfitting
 - Find the model parameters that maximize the score on our dataset
 
The idea is to do a mini-walkthrough of what it's like to build and validate a model, and then try to improve our results.

In [1]:
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline

In [4]:
df = pd.read_csv('/Users/ethanalter/Dropbox (Personal)/GA-4K-DataScience/gazelle-4K/data_master/master.csv', parse_dates = ['visit_date'])

In [5]:
df

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [6]:
# fillna returns a new DataFrame; df itself is unchanged unless the result is assigned back
df.fillna(0)

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
1,air_ba937bf13d40fb24,2016-01-14,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
2,air_ba937bf13d40fb24,2016-01-15,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
3,air_ba937bf13d40fb24,2016-01-16,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
4,air_ba937bf13d40fb24,2016-01-18,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [8]:
#pulling out some date stuff
df['month'] = df['visit_date'].dt.month
df['year'] = df['visit_date'].dt.year

In [9]:
# time: days elapsed since the earliest visit_date in the data
df['time'] = (df['visit_date']-df['visit_date'].min()).dt.days

In [20]:
# let's do some shifts! lagged visitor counts per restaurant
# (shift counts rows within each id, not calendar days)
df['yesterday'] = df.groupby('id')['visitors'].shift()
df['3daysago'] = df.groupby('id')['visitors'].shift(3)
df['week_ago'] = df.groupby('id')['visitors'].shift(7)
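
`shift` moves values by row position within each group, so the lag columns above only mean "yesterday" or "a week ago" when each restaurant's rows are in date order and no dates are missing. A minimal sketch on made-up data (the `toy` frame and its values are assumptions, not the lab dataset) shows sorting before a grouped shift:

```python
import pandas as pd

# Toy data (hypothetical): two restaurants, rows deliberately out of date order
toy = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "visit_date": pd.to_datetime([
        "2016-01-03", "2016-01-01", "2016-01-02",
        "2016-01-01", "2016-01-02", "2016-01-03",
    ]),
    "visitors": [30, 10, 20, 5, 6, 7],
})

# Sort within each restaurant by date so "previous row" means "previous visit"
toy = toy.sort_values(["id", "visit_date"]).reset_index(drop=True)

# Grouped shift: the first row of each restaurant gets NaN instead of
# inheriting the last value of the previous restaurant
toy["yesterday"] = toy.groupby("id")["visitors"].shift()

print(toy)
```

The first row of each `id` comes out as `NaN`, which is the expected result when a restaurant has no earlier visit to look back on.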


In [25]:
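# drop the misnamed 'weekago' column if an earlier cell created it; 'week_ago' is kept
df = df.drop(columns=['weekago'], errors='ignore')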

In [26]:
df

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,month,year,time,yesterday,3daysago,week_ago
0,air_ba937bf13d40fb24,2016-01-13,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,12,,,
1,air_ba937bf13d40fb24,2016-01-14,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,13,25.0,,
2,air_ba937bf13d40fb24,2016-01-15,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,14,32.0,,
3,air_ba937bf13d40fb24,2016-01-16,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,15,29.0,25.0,
4,air_ba937bf13d40fb24,2016-01-18,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,17,22.0,32.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0,4,2017,476,22.0,11.0,88.0
252104,air_a17f0778617c76e2,2017-04-22,60,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0,4,2017,477,49.0,25.0,61.0
252105,air_a17f0778617c76e2,2017-03-26,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0,3,2017,450,60.0,22.0,26.0
252106,air_a17f0778617c76e2,2017-03-20,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0,3,2017,444,69.0,49.0,19.0


In [29]:
df['rolling30']= df.groupby('id')['visitors'].rolling(30).mean().shift().values

In [33]:
#how do you resolve the nulls? 
df['rolling30'] = df['rolling30'].bfill()
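
One thing to watch in the two cells above: the `.shift()` after the grouped rolling mean and the `.bfill()` are not grouped, so values can cross from one restaurant into the next, and `.values` assumes the rows are already ordered by `id`. A possible group-safe variant, sketched on made-up data with a window of 2 instead of 30 (the `toy` frame, the `rolling2` column name, and the window size are assumptions):

```python
import pandas as pd

# Toy data (hypothetical): two restaurants with four visits each
toy = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "visitors": [10, 20, 30, 40, 1, 2, 3, 4],
})

# Shift first, then roll, all inside transform so nothing crosses restaurants.
# Each row sees the mean of the previous 2 visits of the same restaurant.
toy["rolling2"] = toy.groupby("id")["visitors"].transform(
    lambda s: s.shift().rolling(2).mean()
)

# Fill the leading gaps per restaurant rather than across the whole column
toy["rolling2"] = toy.groupby("id")["rolling2"].bfill()

print(toy)
```

Backfilling still borrows from later visits of the same restaurant, so dropping the rows with missing values is another option worth comparing on the validation score.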

In [34]:
df

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,month,year,time,yesterday,3daysago,week_ago,rolling30
0,air_ba937bf13d40fb24,2016-01-13,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,12,,,,26.466667
1,air_ba937bf13d40fb24,2016-01-14,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,13,25.0,,,26.466667
2,air_ba937bf13d40fb24,2016-01-15,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,14,32.0,,,26.466667
3,air_ba937bf13d40fb24,2016-01-16,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,15,29.0,25.0,,26.466667
4,air_ba937bf13d40fb24,2016-01-18,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,2016,17,22.0,32.0,,26.466667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0,4,2017,476,22.0,11.0,88.0,4.000000
252104,air_a17f0778617c76e2,2017-04-22,60,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0,4,2017,477,49.0,25.0,61.0,3.933333
252105,air_a17f0778617c76e2,2017-03-26,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0,3,2017,450,60.0,22.0,26.0,3.933333
252106,air_a17f0778617c76e2,2017-03-20,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0,3,2017,444,69.0,49.0,19.0,3.933333


In [None]:
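# hold out the last 15 rows of each restaurant as the test set;
# reset_index drops the group index so 'id' stays a plain column only
train = df.groupby(['id']).apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
test = df.groupby(['id']).apply(lambda x: x.iloc[-15:]).reset_index(drop = True)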

In [1]:
def get_val_scores(df, estimator):
    
    # the raw datetime column can't be passed to the model
    df = df.drop('visit_date', axis = 1)
    
    # create training and validation sets: last 15 rows per restaurant are held out
    train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
    val   = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)
    
    # split each set into features (X) and target (y)
    X_train, y_train = train.drop('visitors', axis = 1), train['visitors']
    X_val, y_val     = val.drop('visitors', axis = 1), val['visitors']
    
    estimator.fit(X_train, y_train)
    
    # score on the validation data
    return estimator.score(X_val, y_val)
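
To see what `get_val_scores` measures, here is a small self-contained sketch of the same idea: hold out the last 15 rows of every `id`, fit on the rest, and compare scores. The synthetic data is hypothetical, `groupby(...).tail(15)` is used as an alternative to the `apply` call, and scikit-learn's `OrdinalEncoder` stands in for whichever category encoder the lab pipeline uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the lab data (hypothetical): 3 restaurants, 60 days each
toy = pd.DataFrame({
    "id": np.repeat(["a", "b", "c"], 60),
    "time": np.tile(np.arange(60), 3),
})
base = toy["id"].map({"a": 10, "b": 30, "c": 50})
toy["visitors"] = base + rng.normal(0, 2, len(toy))

# Hold out the last 15 rows of every restaurant for validation
val = toy.groupby("id").tail(15)
train = toy.drop(val.index)

X_train, y_train = train.drop(columns="visitors"), train["visitors"]
X_val, y_val = val.drop(columns="visitors"), val["visitors"]

# OrdinalEncoder is an assumption here, standing in for the lab's category encoder
pipe = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["id"]), remainder="passthrough"),
    DecisionTreeRegressor(max_depth=6, random_state=0),
)
pipe.fit(X_train, y_train)

print("train R^2:", pipe.score(X_train, y_train))
print("val R^2:  ", pipe.score(X_val, y_val))
```

For a regressor, `score` returns R², so the two printed numbers are directly comparable as in-sample and out-of-sample results.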

**Step 1:** Using the suggestions from the earlier homework prompt, try adding 3-4 different features (columns) to your data, and use your validation score to determine whether they improved your results. For now, just stick with a tree that is 6 levels deep.

This is meant to be open-ended, and to give you a chance to rediscover material from previous labs.

In [None]:
# your code here

**Step 2:** Let's now try out different model parameters

The idea here is two-fold: see if you can narrow the gap between in-sample and out-of-sample results (using training & validation sets), while simultaneously **not** decreasing your model scores (or at least not by very much). The closer the training and validation scores are to each other, the more reliable your results are likely to be.

Some knobs you can turn:
 - `min_samples_leaf`: this name shows up in two places. In the category encoder, it determines the cutoff point for using the local vs. global average for a category. In the decision tree, it sets the minimum number of samples allowed on a leaf. (A decent rule of thumb is to have at least 10 samples on a leaf, but feel free to try different values.)
 - `max_features`: what portion of columns to use at each split. This parameter randomly samples columns at each split, which reduces the chance that random patterns within them will have a disproportionately large impact on your model. It should be either a fraction between 0 and 1 (a float) or the number of columns you want to include (an integer).
 - You can also try the following: remove any sort of `max_depth` on your tree, and just use `min_samples_leaf` as a way to prune unnecessary splits
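
As a rough illustration of the gap described above, the sketch below fits trees with no `max_depth` and varies only the tree's `min_samples_leaf`, printing the training score, validation score, and the difference. The synthetic data and the candidate values are assumptions chosen for the demo, not results from the lab dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): one informative column plus noise
X = rng.uniform(0, 10, size=(600, 3))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 3, size=600)

X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

results = {}
for leaf in [1, 10, 40]:
    # no max_depth: min_samples_leaf alone limits how far the tree grows
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0)
    tree.fit(X_train, y_train)
    train_score = tree.score(X_train, y_train)
    val_score = tree.score(X_val, y_val)
    results[leaf] = (train_score, val_score)
    print(f"min_samples_leaf={leaf:>2}: train={train_score:.3f}, "
          f"val={val_score:.3f}, gap={train_score - val_score:.3f}")
```

With `min_samples_leaf=1` the tree can memorize the training rows, so the training score says little about how the model will do on new data; larger values trade some training score for a smaller gap.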

In [None]:
# your code here

**Step 3:** Take the best version of your model & your data, and fit it on **all** of your training + validation data. Now that we've found the best version of what we have to work with, we want to give it as many training samples as possible.

In [None]:
# your code here

**Step 4:** Score your model on your test set.

Note how your validation and test scores compare to one another.
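
Steps 3 and 4 can be sketched as follows. This is one possible shape rather than the expected solution: the synthetic data, the split sizes, and the `best_params` values are all placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic, time-ordered data (hypothetical)
data = pd.DataFrame({"time": np.arange(300)})
data["dow"] = data["time"] % 7
data["visitors"] = 20 + 5 * (data["dow"] >= 5) + rng.normal(0, 1, len(data))

# Chronological split: oldest rows train, then validation, newest rows test
train, val, test = data.iloc[:200], data.iloc[200:250], data.iloc[250:]
features, target = ["time", "dow"], "visitors"

# Parameters chosen earlier using the validation set (placeholder values)
best_params = {"max_depth": 6, "min_samples_leaf": 10}

# Step 3: refit on training + validation data combined
train_val = pd.concat([train, val])
final_model = DecisionTreeRegressor(random_state=0, **best_params)
final_model.fit(train_val[features], train_val[target])

# Step 4: score once on the untouched test set
test_score = final_model.score(test[features], test[target])
print("test R^2:", test_score)
```

The test set is scored only once, after all tuning is finished, so that it stays an honest estimate of performance on unseen data.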

In [35]:
#testing different parameter values 
max_depth = [5, 6, 7]
# use 1.0 (float) for "all columns"; an integer 1 would mean a single column per split
max_features = [0.6, 0.8, 1.0]
min_samples_leaf = [5, 20, 40]
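
In scikit-learn, an integer `max_features` is read as a count of columns and a float as a fraction, so `1` and `1.0` behave very differently. A quick check on made-up data (the arrays here are assumptions for the demo):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# Integer 1: consider exactly one column at each split
as_int = DecisionTreeRegressor(max_features=1, random_state=0).fit(X, y)

# Float 1.0: consider 100% of the columns at each split
as_float = DecisionTreeRegressor(max_features=1.0, random_state=0).fit(X, y)

# max_features_ is the resolved number of columns considered per split
print(as_int.max_features_, as_float.max_features_)  # prints: 1 5
```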

In [None]:
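# NOTE: `pipe` is assumed to be a pipeline defined in an earlier cell,
# with the DecisionTreeRegressor as its last step
cv_scores = []

for depth in max_depth:
    for feature in max_features:
        for sample in min_samples_leaf:
            print(f"Getting validation score for: depth: {depth}, max_features: {feature}, leaf_samples: {sample}")
            # update only the final (tree) step of the pipeline
            pipe[-1].set_params(max_depth = depth, max_features = feature, min_samples_leaf = sample)
            val_score = get_val_scores(train, pipe)
            cv_scores.append((val_score, depth, feature, sample))

# best (score, depth, max_features, min_samples_leaf) combination
max(cv_scores)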