### Lab -- Data Prep & Gradient Boosting

Welcome to today's lab!  Today we're going to shift our attention to a more demanding dataset -- the restaurants data.  A quarter million rows, dates, and categorical data make this a more interesting, realistic use case of boosting.  

The point of today's lab will be to experiment with different encoding methods and model parameters.

**Step 1:**  Load in your dataset, and declare `X` and `y`

In [2]:
# your code here
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import category_encoders as ce

#df = pd.read_csv('../data/restaurant_data/master.csv', parse_dates=['visit_date'])
df = pd.read_csv(r'C:\Users\samina\Desktop\GA_DATA_SCIENCE\Data_Sci\Homework\Unit3\data\insurance_premiums.csv')
X = df.drop(['charges'], axis=1)
y = df['charges']


**Step 2:** Experiment with different encoding methods

Let's do a quick check to see how different encoding methods work out of the box on our dataset.

You're going to repeat the same process for each of `OrdinalEncoder`, `TargetEncoder`, and `OneHotEncoder` and see which one gives you the best results on our data.

**2a:** Use an `OrdinalEncoder` to transform your training set with the `fit_transform` method.  

If you are confused about how the transformation is happening, try using the `mapping` attribute on your category encoder to get a hang of what's going on.
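
To make the idea of a category-to-integer mapping concrete, here is a minimal hand-rolled sketch in plain pandas. The `toy` dataframe and its values are hypothetical, and numbering categories from 1 in order of first appearance is an assumption made for illustration; check the encoder's own `mapping` attribute to see what it actually assigned on your data.

```python
import pandas as pd

# Hypothetical toy column, not taken from the lab's dataset
toy = pd.DataFrame({"genre": ["cafe", "bar", "cafe", "izakaya", "bar"]})

# Give each distinct category an integer, in order of first appearance,
# starting at 1 (an assumed convention for this illustration)
categories = pd.unique(toy["genre"])
mapping = {cat: code for code, cat in enumerate(categories, start=1)}

# Replace each category with its integer code
toy["genre_encoded"] = toy["genre"].map(mapping)

print(mapping)
print(toy)
```

The integers carry no real ordering here; tree-based models can still split on them, which is one reason ordinal encoding tends to work reasonably well with boosting.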

In [20]:
# your code here
ore = ce.OrdinalEncoder()
X1 = ore.fit_transform(X)

**2b:** Initialize a `GradientBoostingRegressor` with the default parameters, fit it on your training set, and score it on your test set.

In [4]:
# your code here
gbm = GradientBoostingRegressor()
gbm.fit(X1, y)
gbm.score(X1, y)

0.8999610643494181
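
The cell above fits and scores on the same rows, so the number it reports is a training score. A minimal sketch of scoring on held-out data might look like the following. It uses synthetic data from `make_regression` rather than the lab's dataset, and every name ending in `_demo`, along with the 75/25 split, is an assumption made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's data (hypothetical)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# Hold out 25% of the rows for scoring
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

gbm_demo = GradientBoostingRegressor(random_state=0)
gbm_demo.fit(X_train, y_train)

# R^2 on the rows the model saw vs. rows it did not
train_score = gbm_demo.score(X_train, y_train)
test_score = gbm_demo.score(X_test, y_test)
print(f"Train score: {train_score:.3f}")
print(f"Test score:  {test_score:.3f}")
```

With a category encoder in the mix, the same idea applies: call `fit_transform` on the training rows only, then `transform` on the test rows.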

**2c:** Repeat these same steps for the `TargetEncoder` and the `OneHotEncoder`

**Important:** The `OneHotEncoder` can take a while to fit.  If nothing happens in around 4 minutes, just cancel the process and try it again later on when you have more time.
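
If target encoding feels opaque, the core idea can be sketched in a few lines of pandas: replace each category with the mean of the target for that category. This is a simplified illustration on a hypothetical `toy` dataframe; the `TargetEncoder` in `category_encoders` also smooths these means toward the overall mean, so its numbers will generally differ from this sketch.

```python
import pandas as pd

# Hypothetical toy data, not taken from the lab's dataset
toy = pd.DataFrame({
    "area": ["north", "south", "north", "east", "south", "north"],
    "visitors": [10, 20, 30, 40, 60, 20],
})

# Mean of the target for each category (no smoothing in this sketch)
area_means = toy.groupby("area")["visitors"].mean()

# Replace each category with its mean target value
toy["area_encoded"] = toy["area"].map(area_means)

print(area_means)
print(toy)
```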

In [5]:
# for the target encoder
te = ce.TargetEncoder()
# do your transformations
X2 = te.fit_transform(X, y)


In [6]:
# and model fitting
gbm = GradientBoostingRegressor()
gbm.fit(X2, y)
gbm.score(X2, y)

0.8976952687334837

In [7]:
# and for onehot encoding
ohe = ce.OneHotEncoder()
X3 = ohe.fit_transform(X)


In [10]:
# and look at the model score
gbm = GradientBoostingRegressor()

gbm.fit(X3, y)
gbm.score(X3, y)

0.8987356739004363
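
One likely reason one-hot encoding is slower to fit is that it creates a new column for every distinct category, so high-cardinality columns make the dataframe much wider. The sketch below shows the effect with `pd.get_dummies` (swapped in here for `ce.OneHotEncoder` to keep the example small) on a hypothetical `toy` dataframe.

```python
import pandas as pd

# Hypothetical toy data: 50 distinct store ids and 5 distinct weekdays
toy = pd.DataFrame({
    "id": [f"store_{i}" for i in range(50)],
    "day_of_week": ["Mon", "Tue", "Wed", "Thu", "Fri"] * 10,
})

# One indicator column per distinct value in each categorical column
one_hot = pd.get_dummies(toy)

print(f"Before encoding: {toy.shape}")
print(f"After encoding:  {one_hot.shape}")
```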

**Step 4:** Look at your most important features

Similar to the previous lab, take your model's most important features and load them into a dataframe to see what's driving your results.  Use the version of your model that gave you the best results.

In [11]:
# for the target encoder
te = ce.TargetEncoder()
# do your transformations
X  = te.fit_transform(X, y)


In [12]:
# and model fitting
gbm = GradientBoostingRegressor()
gbm.fit(X, y)

GradientBoostingRegressor()

In [13]:
# let's look at feature scores
feats = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance', ascending=False)

In [56]:
# and let's take a look -- two features dominate
feats

|   | Feature | Importance |
|---|---|---|
| 0 | id | 0.872465 |
| 1 | day_of_week | 0.104502 |
| 6 | longitude | 0.00876 |
| 2 | holiday | 0.007499 |
| 5 | latitude | 0.004987 |
| 4 | area | 0.001253 |
| 3 | genre | 0.000535 |


**Step 5:** Can model parameters improve your score?  

Take the **best** version of your encoding method and try changing some parameters with your model to see if it improves your score.  

You won't have a ton of time to do this, but try some of the following:

 - Try increasing the number of trees your model uses -- 250, 500, or perhaps more trees if time permits
 - Try experimenting with differing values for tree depth -- the default is 3, but perhaps 4, 5 or 6 works better
 - Try improving fitting time by introducing some **randomness** into your data with the following two model parameters:
   - `subsample`: the proportion of rows used to fit each tree.  A value of `0.7` means each tree is fit on a random 70% of your data.
   - `max_features`: the number of columns considered at each individual split.  If you enter an integer, the model randomly selects that many columns; if you enter a decimal, it randomly selects that fraction of the columns.
   - It can be very useful to find the leanest model that still gives you comparable results.  For example, if you find that a gbm with 500 trees and a `max_depth` of 4 gives you the best results, it is very beneficial to get those same results with a `subsample` value of 0.6 and a `max_features` value of 0.7, because your model will fit roughly 50% faster.
   
This step is open ended, so we will likely have to end class in the middle of it.
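
To check whether `subsample` and `max_features` actually buy you faster fits, you can time a full model against a leaner one. The sketch below does this on synthetic data; the `settings` dictionary, the `_demo` names, and the specific parameter values are assumptions for illustration, and the timings will vary from machine to machine.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the lab's data (hypothetical)
X_demo, y_demo = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# Compare the default settings against a leaner configuration
settings = {
    "full": {},
    "lean": {"subsample": 0.6, "max_features": 0.7},
}

results = {}
for name, extra in settings.items():
    model = GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=0, **extra)
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    elapsed = time.perf_counter() - start
    # Score on the training rows, as the cells in this lab do
    results[name] = {"seconds": elapsed, "train_r2": model.score(X_demo, y_demo)}
    print(f"{name}: {elapsed:.2f}s, train score {results[name]['train_r2']:.4f}")
```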

In [14]:
# let's look at number of trees first
num_trees = [250, 500]

for tree in num_trees:
    print(f"Fitting model for {tree} estimators")
    gbm = GradientBoostingRegressor(n_estimators=tree)
    gbm.fit(X, y)
    print(f"Model score:  {gbm.score(X, y)}")

Fitting model for 250 estimators
Model score:  0.9254317507351258
Fitting model for 500 estimators
Model score:  0.9487882459597867


In [15]:
# and let's look at tree depth
tree_depth = [4, 5, 6]
# since there was not a huge difference in scores -- let's stick with 100 boosting rounds for now to keep fitting times down
for depth in tree_depth:
    print(f"Fitting model for max_depth of {depth}")
    gbm = GradientBoostingRegressor(max_depth=depth)
    gbm.fit(X, y)
    print(f"Model score:  {gbm.score(X, y)}")

Fitting model for max_depth of 4
Model score:  0.9223765285137515
Fitting model for max_depth of 5
Model score:  0.9497649960450486
Fitting model for max_depth of 6
Model score:  0.9734752030309293


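In both cases, the score rose modestly as we increased the parameter's value.  Keep in mind that these scores are measured on the same rows the model was fit on, so they will almost always rise as the model gets more complex.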

It would be interesting to see when out-of-sample scores begin to decrease -- sometimes you can keep increasing these values and keep seeing these piecemeal improvements until your scores get quite a bit higher.  In fact, many high-performing models simply take an existing architecture and apply a **lot** of horsepower.
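
One way to see where out-of-sample scores stop improving, without refitting for every tree count, is `staged_predict`, which yields the model's predictions after each boosting round. This is a minimal sketch on synthetic data; the `_demo` names, the 75/25 split, and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's data (hypothetical)
X_demo, y_demo = make_regression(n_samples=1000, n_features=10, noise=30.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

gbm_demo = GradientBoostingRegressor(n_estimators=300, max_depth=4, random_state=0)
gbm_demo.fit(X_train, y_train)

# Held-out R^2 after each boosting round (one entry per tree)
test_scores = [r2_score(y_test, pred) for pred in gbm_demo.staged_predict(X_test)]

# Round with the best held-out score (1-indexed)
best_round = int(np.argmax(test_scores)) + 1
print(f"Best held-out score {max(test_scores):.4f} at round {best_round}")
```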

For now, let's look at the combination of 500 estimators + `max_depth` of 6.

In [16]:
gbm = GradientBoostingRegressor(max_depth=6, n_estimators=500)
gbm.fit(X, y)
gbm.score(X, y)

0.9984162960866797

The jump isn't dramatic, but we raised our score by about 0.10 (from roughly 0.90 with the default parameters to 0.998) without a lot of additional effort.

Now, let's see if we can recreate similar results by improving fitting times.

In [18]:
# first, let's check some different values of subsample -- we'll start with 0.3 -- and go up from there
subsample_vals = [0.3, 0.4, 0.5]

for num_vals in subsample_vals:
    print(f"Fitting model with subsample value of {num_vals}")
    gbm = GradientBoostingRegressor(subsample=num_vals, n_estimators=500, max_depth=6)
    gbm.fit(X, y)
    print(f"Model score: {gbm.score(X, y)}")

Fitting model with subsample value of 0.3
Model score: 0.9941080390699263
Fitting model with subsample value of 0.4
Model score: 0.9962068158770071
Fitting model with subsample value of 0.5
Model score: 0.996945495839945
Model score: 0.996945495839945


The scores don't quite match what we got when every tree used all of the data -- but they're quite close, and getting there while fitting each tree on only about a third of the rows is pretty impressive.

In [19]:
# and let's try some different values for max_features
num_cols = [0.3, 0.4, 0.5, 0.6]

for num_col in num_cols:
    print(f"Fitting model with value of {num_col} for max_features")
    gbm = GradientBoostingRegressor(subsample=0.3, n_estimators=500, max_depth=6, max_features=num_col)
    gbm.fit(X, y)
    print(f"Model score: {gbm.score(X, y)}")

Fitting model with value of 0.3 for max_features
Model score: 0.9857741449962144
Fitting model with value of 0.4 for max_features
Model score: 0.9913552305732263
Fitting model with value of 0.5 for max_features
Model score: 0.9932716731959361
Fitting model with value of 0.6 for max_features
Model score: 0.9932018100696856


In a similar vein -- we find that reducing the number of columns considered at a particular split doesn't affect our final results too much.  This is useful because it means we can go with a version of our model where each split works with roughly 0.3 × 0.5 = 15% (less than 1/6) of the row-column combinations the original used, but still get very similar results.

This can be very useful for rapid prototyping, where long fitting times can be a drag.  

Going with some version of these parameters can be helpful for trying out different versions of our data to see if we can improve our score.