### Lab -- Data Prep

Welcome to today's lab! Today we're going to shift our attention to a more demanding dataset: the restaurants data. With a quarter million rows, dates, and categorical columns, it makes for a more interesting and realistic use case for boosting.

The point of today's lab will be to experiment with different encoding methods and model parameters.

In [33]:
import pandas as pd
import numpy as np 
import category_encoders as ce
from sklearn.tree import DecisionTreeRegressor, plot_tree

**Step 1:**  Load in your dataset, and declare `X` and `y`.

**Bonus:** If you would like, encode some of the time-based data we created in previous classes. For now, just try to extract different date parts such as month, day, and year. If you do not do this, you should drop the date columns before declaring `X` and `y`.

In [34]:
url = r"/Users/ethanalter/Dropbox (Personal)/GA-4K-DataScience/gazelle-4K/data_master/master.csv"
df = pd.read_csv(url,parse_dates = ['visit_date'])

In [35]:
df['quarter'] = df['visit_date'].dt.quarter
df['month'] = df['visit_date'].dt.month
df['year'] = df['visit_date'].dt.year

In [36]:
df = df.drop('visit_date', axis=1)

In [37]:
df

,id,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,quarter,month,year
0,air_ba937bf13d40fb24,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,1,2016
1,air_ba937bf13d40fb24,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,1,2016
2,air_ba937bf13d40fb24,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,1,2016
3,air_ba937bf13d40fb24,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,1,2016
4,air_ba937bf13d40fb24,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,1,1,2016
...,...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0,2,4,2017
252104,air_a17f0778617c76e2,60,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0,2,4,2017
252105,air_a17f0778617c76e2,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0,1,3,2017
252106,air_a17f0778617c76e2,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0,1,3,2017


**Step 2:** Experiment with different encoding methods

Let's do a quick check to see how different encoding methods work out of the box on our dataset.

You're going to repeat the same process for each of `OrdinalEncoder`, `TargetEncoder`, and `OneHotEncoder` and see which one gives you the best results on our data.
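To build intuition for what each encoder does to a categorical column, here is a minimal sketch that uses plain pandas on a made-up toy frame. It is not how `category_encoders` implements these encoders (for example, the library's `TargetEncoder` applies smoothing, which this sketch leaves out); it only illustrates the basic idea behind each encoding. The toy values and the `genre_*` output names are assumptions for illustration.

```python
import pandas as pd

# Toy data: column names mirror the lab's dataset, but the values are made up.
toy = pd.DataFrame({
    "genre": ["Dining bar", "Izakaya", "Dining bar", "Cafe", "Izakaya", "Dining bar"],
    "visitors": [25, 10, 35, 8, 14, 30],
})

# Ordinal: each category becomes an integer code, in order of first appearance.
codes, categories = pd.factorize(toy["genre"])
ordinal = pd.Series(codes + 1, index=toy.index, name="genre_ordinal")

# Target (unsmoothed): each category becomes the mean of the target for that category.
target = toy.groupby("genre")["visitors"].transform("mean").rename("genre_target")

# One-hot: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(toy["genre"], prefix="genre", dtype=int)

print(pd.concat([toy, ordinal, target, one_hot], axis=1))
```

Note that ordinal and target encoding keep a single column per feature, while one-hot encoding adds one column per category. That is one reason a high-cardinality column such as `id` can make one-hot encoding slow.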

**2a:** Use an `OrdinalEncoder` to transform your training set with the `fit_transform` method.

If you are confused about how the transformation is happening, try inspecting the `mapping` attribute of your fitted encoder to see which integer each category was assigned.

In [38]:
oe = ce.OrdinalEncoder()
oencoded = oe.fit_transform(df)

In [39]:
X = oencoded.drop('visitors', axis = 1)
y = oencoded['visitors']

In [43]:
X

,id,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,quarter,month,year
0,1,1,0,1,1,35.658068,139.751599,,1,1,2016
1,1,2,0,1,1,35.658068,139.751599,,1,1,2016
2,1,3,0,1,1,35.658068,139.751599,,1,1,2016
3,1,4,0,1,1,35.658068,139.751599,,1,1,2016
4,1,5,0,1,1,35.658068,139.751599,,1,1,2016
...,...,...,...,...,...,...,...,...,...,...,...
252103,829,3,0,4,10,34.695124,135.197852,6.0,2,4,2017
252104,829,4,0,4,10,34.695124,135.197852,37.0,2,4,2017
252105,829,7,0,4,10,34.695124,135.197852,35.0,1,3,2017
252106,829,5,1,4,10,34.695124,135.197852,3.0,1,3,2017


In [53]:
X.isna().sum()

id                       0
day_of_week              0
holiday                  0
genre                    0
area                     0
latitude                 0
longitude                0
reserve_visitors    143714
quarter                  0
month                    0
year                     0
dtype: int64

In [56]:
X = X.fillna(0)

**2b:** Initialize a `DecisionTreeRegressor` with `max_depth` set to 5, fit it on `X` and `y`, and then use the `score` method to see how it performed.

In [57]:
tree = DecisionTreeRegressor(max_depth = 5)

In [58]:
tree.fit(X,y)

DecisionTreeRegressor(max_depth=5)

In [61]:
tree.score(X,y)

0.0910418888861726
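For a scikit-learn regressor, `score` returns the coefficient of determination (R²), so the value above means the tree explains roughly 9% of the variance in `visitors` on the same rows it was trained on. The sketch below, on synthetic data with made-up names (`X_demo`, `y_demo`, `demo_tree`), shows that the score matches an R² computed by hand.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

demo_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_demo, y_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
pred = demo_tree.predict(X_demo)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, demo_tree.score(X_demo, y_demo))
```

Because this score is computed on the training rows, it is an in-sample figure and will usually be more optimistic than a score on held-out data.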

**2c:** Repeat these same steps for the `TargetEncoder` and the `OneHotEncoder`.

**Important:** The `OneHotEncoder` can take a while to fit. If nothing happens after about 4 minutes, cancel the process and try again later when you have more time.

In [None]:
# your code here

**Step 3:** Look at your most important features

Similar to the previous lab, take your model's most important features and load them into a dataframe to see what's driving your results.
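One possible way to do this is sketched below on synthetic data; the frame `X_demo`, the target `y_demo`, and the model `demo_tree` are made up for illustration, so swap in your own `X`, `y`, and fitted tree. It relies on the `feature_importances_` attribute that a fitted `DecisionTreeRegressor` exposes.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for an encoded feature matrix; only "genre" drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "genre": rng.integers(1, 5, size=300),
    "holiday": rng.integers(0, 2, size=300),
    "month": rng.integers(1, 13, size=300),
})
y_demo = 10 * X_demo["genre"] + rng.normal(size=300)

demo_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_demo, y_demo)

# Pair each column name with its importance and sort from most to least important.
importances = (
    pd.DataFrame({
        "feature": X_demo.columns,
        "importance": demo_tree.feature_importances_,
    })
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.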

In [None]:
# your code here

**Step 4:** Using the pipeline that was discussed in class, try to do the following:

 - Create a pipeline that combines the encoder that worked best in the previous step with a decision tree using the same parameters as before.

 - Split your dataset into in-sample and out-of-sample portions. The in-sample portion is every row for each restaurant except its last 15; the out-of-sample portion is the last 15 days for each restaurant. (This same task was completed in an earlier lab, so feel free to use that as a reference if you're not sure how to do this.)

 - Fit your model on the training set, then score it on both the training set and the test set. Note how the two values differ.
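The in-sample / out-of-sample split can be sketched as follows. This is a minimal example on a toy frame with two hypothetical restaurants (`air_a`, `air_b`), not the earlier lab's solution. It assumes a date column is still available to sort on; if you already dropped `visit_date`, the split instead depends on the rows being in date order within each restaurant, which is worth checking.

```python
import pandas as pd

# Toy frame: 2 hypothetical restaurants with 20 daily rows each.
dates = pd.date_range("2017-01-01", periods=20, freq="D")
demo = pd.DataFrame({
    "id": ["air_a"] * 20 + ["air_b"] * 20,
    "visit_date": list(dates) * 2,
    "visitors": range(40),
})

# Sort so that "last 15" means the 15 most recent dates per restaurant.
demo = demo.sort_values(["id", "visit_date"])

# Out-of-sample: the last 15 rows of each restaurant.
test = demo.groupby("id").tail(15)

# In-sample: every remaining row.
train = demo.drop(test.index)

print(len(train), len(test))
```

Splitting before fitting matters most for the `TargetEncoder`: fitting the pipeline on `train` only keeps the test rows' target values out of the encoding.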

In [1]:
# your code here