### Lab -- Data Prep & Gradient Boosting

Welcome to today's lab!  Today we're going to shift our attention to a more demanding dataset -- the restaurants data.  A quarter million rows, dates, and categorical data make this a more interesting, realistic use case of boosting.  

The point of today's lab will be to experiment with different encoding methods and model parameters.

**Step 1:**  Load in your dataset

In [3]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
import category_encoders as ce

In [2]:
df = pd.read_csv('/Users/cameronlefevre/Test-Repo/ClassMaterial/Unit3/data/restaurants.csv', parse_dates=['visit_date'])

In [11]:
def denote_null_values(df):
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df

In [12]:
df = denote_null_values(df)
# and fill in the missing ones
df = df.fillna(0)
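
To see why the indicator columns are created *before* filling, here is a small sketch on a made-up three-row frame (the values are invented for illustration and are not taken from the restaurants data). Once `fillna(0)` runs, a true zero and a missing value look identical, so the `_missing` flag is the only record of which was which.

```python
import numpy as np
import pandas as pd

# Toy frame: the values are invented purely for illustration.
toy = pd.DataFrame({
    "visitors": [10, 25, 7],
    "reserve_visitors": [3.0, np.nan, 5.0],
})

# Find the columns that contain at least one null ...
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

# ... and flag the missing rows BEFORE the nulls are overwritten.
for col in cols_with_nulls:
    toy[f"{col}_missing"] = toy[col].isnull()

# Only now is it safe to fill: the flag preserves the information.
toy = toy.fillna(0)

print(toy)
```

Only `reserve_visitors` gets a companion column here, because `visitors` has no nulls.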

**Step 2:** Create a training and test set.

Make the test set the **last 15 observations for each restaurant**.

Split the training and test sets into `X_train, y_train` and `X_test, y_test`, respectively.

**Hint:**  This harkens back to our grouping lab -- check this if you forget how to do it.

In [13]:
# your code here
# we'll sort the values
df.sort_values(by=['id', 'visit_date'], ascending=True, inplace=True)

# split into training & test
train = df.groupby('id').apply(lambda x: x.iloc[:-15])
test  = df.groupby('id').apply(lambda x: x.iloc[-15:])

# drop the date column -- no need for it
train.drop('visit_date', axis=1, inplace=True)
test.drop('visit_date', axis=1, inplace=True)

# and turn it into X & y
X_train, y_train = train.drop('visitors', axis=1), train['visitors']
X_test, y_test   = test.drop('visitors', axis=1), test['visitors']
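
As an alternative to the `apply`/`iloc` approach, the same "last N rows per group" split can be sketched with `groupby().tail()`, which keeps the original flat index. The toy frame and the holdout size of 2 below are assumptions made up for illustration; in the lab the holdout is 15.

```python
import pandas as pd

# Toy data: two restaurants with 5 and 4 visits (invented values).
toy = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 4,
    "visit_date": list(pd.date_range("2017-01-01", periods=5))
                  + list(pd.date_range("2017-01-01", periods=4)),
    "visitors": [5, 6, 7, 8, 9, 1, 2, 3, 4],
})

# Sort so that "last rows" really means "most recent rows".
toy = toy.sort_values(["id", "visit_date"])

n_holdout = 2  # the lab uses 15; 2 keeps the toy example readable

# tail() returns the last n rows of every group, with the original index ...
test = toy.groupby("id").tail(n_holdout)
# ... so the training set is simply everything else.
train = toy.drop(test.index)

print(test["visitors"].tolist())   # [8, 9, 3, 4]
print(train["visitors"].tolist())  # [5, 6, 7, 1, 2]
```

One thing to keep in mind with either approach: a restaurant with fewer rows than the holdout size ends up entirely in the test set.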

**Step 3:** Experiment with different encoding methods

Let's do a quick check to see how different encoding methods work out of the box on our dataset.

You're going to repeat the same process for each of `OrdinalEncoder`, `TargetEncoder`, and `OneHotEncoder` and see which one gives you the best results on our data.

**3a:** Use an `OrdinalEncoder` to transform your training set with the `fit_transform` method.  Then use the `transform` method to transform your test set.  

**Important:** The test set is transformed according to the values in your training set, not its own.

If you are confused about how the transformation is happening, try inspecting the `mapping` attribute on your fitted category encoder to get a feel for what's going on.
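
The idea of "fit on train, apply to test" can be sketched in plain Python. This is only an illustration of the concept, not the `category_encoders` implementation: the genre names are invented, and using `-1` for categories never seen in training is an assumption chosen for this sketch.

```python
# Invented category values, for illustration only.
train_genres = ["Izakaya", "Cafe", "Izakaya", "Bar"]
test_genres = ["Cafe", "Ramen"]  # "Ramen" never appears in training

# "Fit": learn one integer per category, in order of first appearance.
mapping = {}
for genre in train_genres:
    if genre not in mapping:
        mapping[genre] = len(mapping) + 1

UNKNOWN = -1  # assumed placeholder for categories unseen during fitting

# "Transform": both sets are encoded with the TRAINING mapping.
train_codes = [mapping[g] for g in train_genres]
test_codes = [mapping.get(g, UNKNOWN) for g in test_genres]

print(mapping)      # {'Izakaya': 1, 'Cafe': 2, 'Bar': 3}
print(train_codes)  # [1, 2, 1, 3]
print(test_codes)   # [2, -1]
```

Calling `fit_transform` on the test set instead would build a second, different mapping, and the same integer would then mean different categories in train and test.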

In [25]:
# initialize
encoder = ce.OrdinalEncoder()

# and use the fit_transform() method
X_train1 = encoder.fit_transform(X_train)
X_test1  = encoder.transform(X_test)

**3b:** Initialize a `GradientBoostingRegressor` with the default parameters, fit it on your training set, and score it on your test set.

In [26]:
gbm = GradientBoostingRegressor()
gbm.fit(X_train1, y_train)

KeyboardInterrupt: 

In [None]:
# your answer here
print(f"Train score: {gbm.score(X_train1, y_train)}, Test Score: {gbm.score(X_test1, y_test)}")

**3c:** Repeat these same steps for the `TargetEncoder` and the `OneHotEncoder`

**Important:** The `OneHotEncoder` can take a while to fit.  If nothing happens in around 4 minutes, just cancel the process and try it again later when you have more time.

In [27]:
te = ce.TargetEncoder()

X_train2 = te.fit_transform(X_train, y_train)
# transform the test set without its target, so test labels can't influence the encoding
X_test2  = te.transform(X_test)
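
For intuition, target encoding can be sketched by hand: each category is replaced by the mean of the target over the training rows in that category. This is a simplified illustration with invented values; as far as we know, `ce.TargetEncoder` additionally smooths each category mean toward the overall mean, which this sketch leaves out. The fallback to the global mean for unseen categories is likewise an assumption of the sketch.

```python
from collections import defaultdict

# Invented training data: one categorical column and the target.
train_area = ["Tokyo", "Tokyo", "Osaka", "Osaka", "Osaka"]
train_visitors = [10, 20, 30, 40, 50]

# Test rows: note that no test targets are used anywhere below.
test_area = ["Osaka", "Kyoto"]  # "Kyoto" is unseen in training

# "Fit": accumulate the target sum and row count per category.
totals = defaultdict(float)
counts = defaultdict(int)
for area, visitors in zip(train_area, train_visitors):
    totals[area] += visitors
    counts[area] += 1

category_means = {area: totals[area] / counts[area] for area in totals}
global_mean = sum(train_visitors) / len(train_visitors)

# "Transform": look up the training mean; unseen categories fall back
# to the global training mean (an assumption of this sketch).
test_encoded = [category_means.get(area, global_mean) for area in test_area]

print(category_means)  # {'Tokyo': 15.0, 'Osaka': 40.0}
print(test_encoded)    # [40.0, 30.0]
```

Because the encoding is built from `y_train` alone, the test set's own targets never leak into its features.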




In [28]:
gbm = GradientBoostingRegressor()
gbm.fit(X_train2, y_train)

GradientBoostingRegressor()

In [29]:
print(f"Train score: {gbm.score(X_train2, y_train)}, Test Score: {gbm.score(X_test2, y_test)}")

Train score: 0.48825479062698207, Test Score: 0.4130945639735636


In [23]:
# OneHot Encoding in category_encoders
ohe = ce.OneHotEncoder()
X_train3 = ohe.fit_transform(X_train)
X_test3  = ohe.transform(X_test)

In [24]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train3, y_train)

KeyboardInterrupt: 

In [None]:
print(f"Train score: {gbm.score(X_train3, y_train)}, Test Score: {gbm.score(X_test3, y_test)}")

**Step 4:** Look at your most important features

Similar to the previous lab, take your model's most important features and load them into a dataframe to see what's driving your results.

In [31]:
# your answer here
importances = pd.DataFrame({
    'Column': X_train2.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance', ascending=False)

importances

| | Column | Importance |
|---|---|---|
| 0 | id | 0.845864 |
| 1 | calendar_date | 0.136464 |
| 7 | longitude | 0.008536 |
| 6 | latitude | 0.003907 |
| 2 | day_of_week | 0.003297 |
| 5 | area | 0.001519 |
| 4 | genre | 0.000256 |
| 3 | holiday | 0.000113 |
| 8 | reserve_visitors | 4.4e-05 |
| 9 | reserve_visitors_missing | 0.0 |


**Step 5:** Can model parameters improve your score?  

Take the **best** version of your encoding method and try changing some parameters with your model to see if it improves your score.  

You won't have a ton of time to do this, but try some of the following:

 - Try increasing the number of trees your model uses -- 250, 500, or perhaps more trees if time permits
 - Try experimenting with differing values for tree depth -- the default is 3, but perhaps 4, 5 or 6 works better
 - Try improving fitting time by introducing some **randomness** into your data with the following two model parameters:
   - `subsample`: this dictates what proportion of your data will be used for each tree.  A value of `0.7` means 70% of your data will be used for a particular tree, chosen at random
   - `max_features`: this is the number of columns considered at each individual split.  If you enter an integer, the model randomly selects that many columns; if you enter a decimal, it randomly selects that fraction of the columns.
   - It can be very useful to find the sparsest model that still gives you comparable results.  For example, if a gbm with 500 trees and a `max_depth` of 4 gives you the best results, and you can match those results with a `subsample` value of 0.6 and a `max_features` value of 0.7, your model will fit roughly 50% faster.
   
This step is open-ended, so we will likely have to end class in the middle of it.
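
To make the `subsample` / `max_features` comparison concrete, here is a minimal sketch on synthetic data. The dataset from `make_regression`, the 100 trees, and the variable names are all assumptions for illustration; on the restaurants data you would use your encoded `X_train` / `X_test` instead, and the timings will look very different.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the restaurants dataset).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: every tree sees all rows, every split considers all columns.
full = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)

# Sparser variant: each tree is fit on a random 60% of the rows, and
# each split considers a random 70% of the columns.
sparse = GradientBoostingRegressor(
    n_estimators=100, max_depth=3,
    subsample=0.6, max_features=0.7,
    random_state=0,
)

for name, model in [("full", full), ("sparse", sparse)]:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    print(f"{name}: fit in {elapsed:.2f}s, test R^2 = {model.score(X_te, y_te):.3f}")
```

If the two test scores are close, the sparser settings are usually the better choice, since every extra tree you add afterwards costs less to fit.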

In [35]:
?GradientBoostingRegressor

In [37]:
# we'll try doing a parameter sweep
n_estimators  = [100, 250, 500]
learning_rate = [.05, .1]
tree_depth    = [3, 4, 5, 6]
cv_scores     = []

for estimators in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
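            print(f"Fitting model for: n_estimators={estimators}, learning_rate={rate}, max_depth={depth}")
            gbm.set_params(n_estimators=estimators, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train2, y_train)
            # score on the held-out test set; the training score only rewards overfitting
            cv_scores.append((gbm.score(X_test2, y_test), estimators, rate, depth))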

KeyboardInterrupt: 
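
Since the full sweep can take a long time on a quarter million rows, here is a scaled-down sketch of the same idea that finishes quickly. It uses `itertools.product` in place of the nested loops and picks the winner by held-out score. The synthetic data, the smaller grid, and the names `results` and `best` are assumptions for illustration only.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the restaurants dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# A deliberately small grid so the sweep runs in seconds.
n_estimators_grid = [50, 100]
learning_rate_grid = [0.05, 0.1]
max_depth_grid = [2, 3]

results = []
# product() yields every combination, replacing three nested for-loops.
for n_estimators, learning_rate, max_depth in product(
    n_estimators_grid, learning_rate_grid, max_depth_grid
):
    model = GradientBoostingRegressor(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    results.append({
        "n_estimators": n_estimators,
        "learning_rate": learning_rate,
        "max_depth": max_depth,
        # Score on held-out rows, never on the rows used for fitting.
        "val_score": model.score(X_val, y_val),
    })

# Pick the combination with the highest held-out R^2.
best = max(results, key=lambda r: r["val_score"])
print(best)
```

On the real data, starting with a coarse grid like this and only then refining around the best combination keeps the total fitting time manageable.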