### Lab:  Grid Search & Hyperparameter Tuning

Welcome! Today's lab combines several of the concepts covered in Unit 3 into one cohesive whole:

 - Random Forests
 - Hyperparameter tuning models with a Grid Search
 - Using custom loss functions to keep track of how you're doing

#### Step 1a:  Load in the training and the test set

In [66]:
import pandas as pd
import numpy as np

train = pd.read_csv('/Users/aoifeduna/Documents/GitHub/aoiferepo/Lectures/Unit4/data/iowa_housing/train.csv')
test  = pd.read_csv('/Users/aoifeduna/Documents/GitHub/aoiferepo/Lectures/Unit4/data/iowa_housing/test.csv')


#### Step 1b: Create the `y` variable for `SalePrice`, remove it from the training set, and drop the indexes for both datasets.  Take the log of `SalePrice`.

In [67]:
y = np.log(train['SalePrice'])
train.drop('SalePrice', axis=1, inplace=True)
test_id = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

In [68]:
train.head()

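| | MSSubClass | MSZoning | LotArea | Neighborhood | OverallQual | OverallCond | YearBuilt | GrLivArea | 1stFlrSF | 2ndFlrSF | GrLivArea.1 | FullBath | HalfBath | GarageType | GarageYrBlt | GarageFinish | GarageCars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 8450 | CollgCr | 7 | 5 | 2003 | 1710 | 856 | 854 | 1710 | 2 | 1 | Attchd | 2003.0 | RFn | 2 |
| 1 | 20 | RL | 9600 | Veenker | 6 | 8 | 1976 | 1262 | 1262 | 0 | 1262 | 2 | 0 | Attchd | 1976.0 | RFn | 2 |
| 2 | 60 | RL | 11250 | CollgCr | 7 | 5 | 2001 | 1786 | 920 | 866 | 1786 | 2 | 1 | Attchd | 2001.0 | RFn | 2 |
| 3 | 70 | RL | 9550 | Crawfor | 7 | 5 | 1915 | 1717 | 961 | 756 | 1717 | 1 | 0 | Detchd | 1998.0 | Unf | 3 |
| 4 | 60 | RL | 14260 | NoRidge | 8 | 5 | 2000 | 2198 | 1145 | 1053 | 2198 | 2 | 1 | Attchd | 2000.0 | RFn | 3 |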


#### Step 2: Fill in the missing values (Completed For You)

The code is shown in full so you can see how it works. It assumes the training and test sets you loaded are stored in `train` and `test`. If you used different names, adjust the code and re-run it accordingly.

In [69]:
# just run this code
# columns in the training set that contain at least one missing value
train_empty = train.loc[:, train.isnull().sum() > 0]
cols = train_empty.columns.tolist()

# categorical garage columns: fill with a placeholder category ('NA' or 'Other' could also work)
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']]  = test[['GarageType', 'GarageFinish']].fillna('None')

# GarageYrBlt is a numeric column, so fill it with 0 instead
# (assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas)
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt']  = test['GarageYrBlt'].fillna(0)

# find the fill values in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

# and apply them to the test set
test['MSZoning']   = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)
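
The same fit-on-train, apply-to-test pattern can also be written with scikit-learn's `SimpleImputer`. The sketch below uses two tiny made-up frames (`toy_train` and `toy_test` are illustrative names, not the lab data) to show that the fill value is learned from the training set only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the lab data (hypothetical values, for illustration only)
toy_train = pd.DataFrame({'GarageCars': [1.0, 2.0, 3.0, 2.0]})
toy_test = pd.DataFrame({'GarageCars': [np.nan, 4.0]})

# Learn the fill value (the column mean) from the training set only...
imputer = SimpleImputer(strategy='mean')
imputer.fit(toy_train)

# ...then apply it to the test set
toy_test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(toy_test_filled)
```

The missing test value is filled with the training mean, and the test set's own values never influence the fill value.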

#### Step 3: Make A Pipeline For a Random Forest

Use the following steps:

  - OrdinalEncoder
  - OneHotEncoder
  - RandomForest
  
**Note:** Do you understand why we're not scaling our data?

In [70]:
from sklearn.pipeline import make_pipeline
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor


rf = RandomForestRegressor()
ore   = OrdinalEncoder()
ohe   = OneHotEncoder()

rf_pipe = make_pipeline(ore, ohe, rf)

#### Step 4: Import `mean_squared_error` and `make_scorer` from the metrics module, and wrap `mean_squared_error` in a loss function (scorer) that can be used in cross-validation.

**Hint:** Set the argument `greater_is_better` to `False` for the `make_scorer` function.

In [71]:
from sklearn.metrics import mean_squared_error, make_scorer
loss_function = make_scorer(mean_squared_error, greater_is_better=False)
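
With `greater_is_better=False`, `make_scorer` negates the metric so that a higher score always means a better model. That is why the grid search results later in this lab are negative numbers. A small sketch on made-up data (the `LinearRegression` model and the toy arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Toy data, for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.1, 0.9, 2.2, 2.8])

model = LinearRegression().fit(X_toy, y_toy)
scorer = make_scorer(mean_squared_error, greater_is_better=False)

mse = mean_squared_error(y_toy, model.predict(X_toy))
score = scorer(model, X_toy, y_toy)  # scorers are called as scorer(estimator, X, y)

print(mse, score)  # score is the negated MSE
```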

#### Step 5: Setup Your Grid Search

Do the following:

 - Create a dictionary of values to test the following parameters:
   - `min_samples_leaf`: 1, 5, 10, 25
   - `max_features`: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
   - `n_estimators`: 10, 50, 100
 - Initialize an instance of GridSearchCV with 5 folds, and the loss function from step 4

In [72]:
from sklearn.model_selection import GridSearchCV

In [73]:
params = {
    'randomforestregressor__min_samples_leaf': [1, 5, 10, 25],
    'randomforestregressor__max_features': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'randomforestregressor__n_estimators': [10, 50, 100]
}

In [74]:
grid = GridSearchCV(estimator=rf_pipe, param_grid=params, scoring=loss_function, cv=5)
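
Two details in this setup are easy to miss. `make_pipeline` names each step after its lowercased class name, which is where the `randomforestregressor__` prefix comes from, and the grid is the full cross product of the listed values. A short sketch (it uses `SimpleImputer` as a stand-in first step so that it runs with scikit-learn alone; `demo_pipe` and `demo_params` are illustrative names):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline

demo_pipe = make_pipeline(SimpleImputer(), RandomForestRegressor())
step_names = list(demo_pipe.named_steps)  # step names that prefix the grid keys
print(step_names)

demo_params = {
    'randomforestregressor__min_samples_leaf': [1, 5, 10, 25],
    'randomforestregressor__max_features': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'randomforestregressor__n_estimators': [10, 50, 100],
}

# Every key must be a valid parameter of the pipeline
keys_valid = all(key in demo_pipe.get_params() for key in demo_params)

n_candidates = len(ParameterGrid(demo_params))  # 4 * 6 * 3 combinations
n_fits = n_candidates * 5                       # one fit per candidate per fold (cv=5)
print(keys_valid, n_candidates, n_fits)
```

With `cv=5` that comes to 72 candidates and 360 fits, plus one final refit of the best candidate on the full training set when `refit=True` (the default).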

#### Step 6: Fit your grid search, which wraps the pipeline you created in Step 3, on the training data.

In [75]:
grid.fit(train, y)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('ordinalencoder',
                                        OrdinalEncoder(cols=None,
                                                       drop_invariant=False,
                                                       handle_missing='value',
                                                       handle_unknown='value',
                                                       mapping=None,
                                                       return_df=True,
                                                       verbose=0)),
                                       ('onehotencoder',
                                        OneHotEncoder(cols=None,
                                                      drop_invariant=False,
                                                      handle_missing='value',
                                                      handle

#### Step 7: What combination gave you the best results?

In [76]:
grid.best_params_

{'randomforestregressor__max_features': 0.4,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__n_estimators': 50}
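
Because the scorer is a negated MSE and `y` is the log of `SalePrice`, the best score can be turned into a more readable number. The sketch below hard-codes the rank-1 `mean_test_score` from this run (-0.021000, as reported in the `cv_results_` table in the Bonus section) rather than reading it from `grid`, so it runs on its own. On a fitted grid the same value should be available as `grid.best_score_`.

```python
import numpy as np

best_score = -0.021000  # rank-1 mean_test_score from this run (negated MSE on log prices)

rmse_log = np.sqrt(-best_score)    # undo the sign flip, then take the square root
typical_ratio = np.exp(rmse_log)   # log-scale error expressed as a multiplicative factor

print(round(rmse_log, 4), round(typical_ratio, 3))
```

An RMSE of about 0.145 on the log scale means predictions are typically off by a factor of roughly 1.16, that is, somewhere around 15 to 16 percent of the sale price.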

#### Bonus

**B1: Among the parameters that you searched over, which ones had the strongest association with better validation scores?**

In [77]:
grid_results = pd.DataFrame(grid.cv_results_)

In [78]:
grid_results.sort_values(by='rank_test_score')

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_randomforestregressor__max_features | param_randomforestregressor__min_samples_leaf | param_randomforestregressor__n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 0.156301 | 0.001753 | 0.009498 | 0.000419 | 0.4 | 1 | 50 | {'randomforestregressor__max_features': 0.4, '... | -0.019466 | -0.023654 | -0.022196 | -0.018172 | -0.021513 | -0.021000 | 0.001954 | 1 |
| 14 | 0.295475 | 0.001608 | 0.013222 | 0.000358 | 0.4 | 1 | 100 | {'randomforestregressor__max_features': 0.4, '... | -0.019887 | -0.024392 | -0.022357 | -0.018130 | -0.020926 | -0.021138 | 0.002132 | 2 |
| 2 | 0.274133 | 0.006556 | 0.013393 | 0.000684 | 0.3 | 1 | 100 | {'randomforestregressor__max_features': 0.3, '... | -0.018760 | -0.023940 | -0.023309 | -0.018957 | -0.020837 | -0.021160 | 0.002148 | 3 |
| 26 | 0.348284 | 0.005227 | 0.013493 | 0.000364 | 0.5 | 1 | 100 | {'randomforestregressor__max_features': 0.5, '... | -0.019057 | -0.025007 | -0.022827 | -0.018356 | -0.021851 | -0.021420 | 0.002450 | 4 |
| 1 | 0.143777 | 0.001576 | 0.009262 | 0.000235 | 0.3 | 1 | 50 | {'randomforestregressor__max_features': 0.3, '... | -0.018701 | -0.025352 | -0.023640 | -0.018668 | -0.021696 | -0.021611 | 0.002655 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33 | 0.037274 | 0.003268 | 0.005936 | 0.000664 | 0.5 | 25 | 10 | {'randomforestregressor__max_features': 0.5, '... | -0.027612 | -0.032649 | -0.029978 | -0.029672 | -0.030437 | -0.030070 | 0.001612 | 68 |
| 45 | 0.037845 | 0.001512 | 0.005692 | 0.000202 | 0.6 | 25 | 10 | {'randomforestregressor__max_features': 0.6, '... | -0.026727 | -0.030186 | -0.033881 | -0.029711 | -0.031427 | -0.030386 | 0.002331 | 69 |
| 69 | 0.045948 | 0.002772 | 0.007111 | 0.001630 | 0.8 | 25 | 10 | {'randomforestregressor__max_features': 0.8, '... | -0.029035 | -0.032193 | -0.033583 | -0.030141 | -0.030923 | -0.031175 | 0.001584 | 70 |
| 9 | 0.034162 | 0.002151 | 0.006417 | 0.001055 | 0.3 | 25 | 10 | {'randomforestregressor__max_features': 0.3, '... | -0.026880 | -0.033319 | -0.034348 | -0.029816 | -0.031577 | -0.031188 | 0.002650 | 71 |


In [79]:
grid_results.groupby('param_randomforestregressor__n_estimators')['mean_test_score'].mean()
# More trees give a better score, with diminishing returns from 50 to 100

param_randomforestregressor__n_estimators
10    -0.026447
50    -0.024967
100   -0.024689
Name: mean_test_score, dtype: float64

In [80]:
grid_results.groupby('param_randomforestregressor__max_features')['mean_test_score'].mean()
# No big changes here

param_randomforestregressor__max_features
0.3   -0.025561
0.4   -0.025403
0.5   -0.025240
0.6   -0.025255
0.7   -0.025281
0.8   -0.025464
Name: mean_test_score, dtype: float64

In [81]:
grid_results.groupby('param_randomforestregressor__min_samples_leaf')['mean_test_score'].mean()
# Smaller leaves did better: min_samples_leaf=1 had the best average score

param_randomforestregressor__min_samples_leaf
1    -0.022483
5    -0.023745
10   -0.025319
25   -0.029923
Name: mean_test_score, dtype: float64
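
One rough way to compare the three parameters is the spread between the best and worst group mean for each. The sketch below copies the group means from the three outputs above into plain dictionaries (so it does not need `grid_results`) and ranks the parameters by that spread. This is a simple heuristic, not a formal sensitivity analysis.

```python
# Group means of mean_test_score, copied from the three outputs above
group_means = {
    'n_estimators': {10: -0.026447, 50: -0.024967, 100: -0.024689},
    'max_features': {0.3: -0.025561, 0.4: -0.025403, 0.5: -0.025240,
                     0.6: -0.025255, 0.7: -0.025281, 0.8: -0.025464},
    'min_samples_leaf': {1: -0.022483, 5: -0.023745, 10: -0.025319, 25: -0.029923},
}

# Spread = best group mean minus worst group mean for each parameter
spread = {name: max(means.values()) - min(means.values())
          for name, means in group_means.items()}

# Print the parameters from largest to smallest spread
for name, value in sorted(spread.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {value:.6f}')
```

By this measure `min_samples_leaf` moves the score the most (a spread of about 0.0074), `n_estimators` is second (about 0.0018), and `max_features` barely matters over the range searched (about 0.0003).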

**B2: Which 5 variables were the most important in predicting the housing price?**

In [84]:
# Use the refit best pipeline from the grid search; rf_pipe itself is never fitted
grid.best_estimator_.steps[-1][1].feature_importances_
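
`GridSearchCV` fits clones of the estimator it is given, so the pipeline object passed in is never fitted itself. With the default `refit=True`, the fitted winner is stored in `best_estimator_`. A minimal sketch on synthetic data (the column names `f0` to `f5`, the tiny grid, and the `SimpleImputer` step are assumptions so that the example runs with scikit-learn alone) pulls the top five importances from there:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import NotFittedError
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Synthetic data: only f0 (strongly) and f1 (weakly) drive the target
X_demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f'f{i}' for i in range(6)])
y_demo = 3.0 * X_demo['f0'] + 0.5 * X_demo['f1'] + rng.normal(scale=0.1, size=200)

demo_pipe = make_pipeline(SimpleImputer(), RandomForestRegressor(random_state=0))
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={'randomforestregressor__n_estimators': [10, 20]},
    scoring='neg_mean_squared_error',
    cv=3,
).fit(X_demo, y_demo)

# The original pipeline is still unfitted, because the search worked on clones
try:
    demo_pipe.steps[-1][1].feature_importances_
    original_is_fitted = True
except NotFittedError:
    original_is_fitted = False

# The refit best pipeline is fitted; its last step holds the importances
best_rf = demo_grid.best_estimator_.steps[-1][1]
top5 = pd.Series(best_rf.feature_importances_, index=X_demo.columns).nlargest(5)
print(original_is_fitted)
print(top5)
```

In the lab pipeline the one-hot step expands the columns, so the names paired with the importances should come from the encoded data rather than from `train.columns`.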