## Class 5 - Bagging and Boosting

### Recap of lecture and introductory remarks
In yesterday's lecture, we introduced bagging and boosting as two techniques to reduce, respectively, the variance and the bias of decision trees. Bagging and boosting are not specific to decision trees, but we will see them in action with this kind of model.

We used bagging to train a set of decision trees on bootstrap samples of the training data, and we aggregate their individual predictions into a single prediction. The resulting model is an _ensemble_ model. We have seen that `Random Forests` are popular learning algorithms that combine bagging with random sampling of features, which increases diversity among the trees and adds further regularization.

Boosting, on the other hand, trains a sequence of decision trees. Each new tree is fitted on the residuals (or, more generally, on the gradients of the loss) of the ensemble built so far, so that it reduces the error left by the previous trees. We have focused specifically on `gradient boosting` and indicated `XGBoost` as a particularly powerful implementation of boosting + bagging.
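To make the idea of "fitting on the residuals" concrete, here is a minimal hand-rolled sketch of boosting with squared-error loss, for which the residuals coincide with the negative gradients. The synthetic data, the learning rate, and the number of rounds are illustrative assumptions. It is not how `XGBoost` is implemented, only the core loop.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression problem (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.3  # illustrative value
n_rounds = 20        # illustrative value

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y, y.mean())
train_mse = [mean_squared_error(y, prediction)]
trees = []

for _ in range(n_rounds):
    # For squared-error loss, residuals are the negative gradients
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a shrunken version of the new tree to the ensemble
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(mean_squared_error(y, prediction))

print(f"Training MSE before boosting: {train_mse[0]:.3f}")
print(f"Training MSE after {n_rounds} rounds: {train_mse[-1]:.3f}")
```

Each round only needs to correct what the previous rounds got wrong, which is why shallow trees are enough here.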

Today, we will go back to the bike data, and fit `RandomForest` and `XGBoost` models, comparing their performances to those of models fitted previously. We will implement Random Forests using `scikit-learn`, and XGBoost using the XGBoost package: https://xgboost.readthedocs.io/en/stable/ 

**Note**: As last week, under `nbs/class_05` you will find a notebook called `example.ipynb`, where I provide an example of how to run today's exercise on sample data.

### Operational remarks
Two suggestions on how to go about this, based on where you are at regarding exercises from previous weeks.

1. If you have done the exercises from classes 2 and 3, you will have one or two notebooks with baseline, linear, KNN, and regularized linear models, as well as records of their performance (which will be handy for comparison with our new methods). In this case, my suggestion would be to work on a new notebook where you only fit the new models, and load the performance of the old models for comparison.

2. If you have not done exercises from previous classes, you have two options:
- Work on a new notebook where you only fit the models we work on today (random forest and XGBoost). Optionally, you can "manually" compare the performance of your new models to plots from previous weeks
- Work on a new notebook and also implement a couple of models from previous weeks for comparison

### Today's exercise
Work in groups on the following tasks

1. Fit a `Random Forest` model to the data (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor), using cross-validation to select the best hyperparameter values
    - There are a number of parameters that should be passed to the estimator. Carefully read the documentation, and identify a few hyperparameters you might want to manipulate
    - Define a series of possible values for these hyperparameters, and store this information in a Python dictionary. For each hyperparameter, the dictionary should include the name of the hyperparameter (as a string) as `key`, and a list of possible values as `value`
    - Pass your estimator and the parameter grid to `GridSearchCV`: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html and fit this object to your training set. If you have defined *a lot* of possible values, you can consider using `RandomizedSearchCV`: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html. **Note**: you need to pass something appropriate as the value of the `scoring` argument
    - Try to answer the following questions:
        - What is `GridSearchCV` doing?
        - What is the difference between `RandomizedSearchCV` and `GridSearchCV`?
        - **Bonus question**: Given that we do have a validation set, could we do model selection without using cross-validation? Which parameter in `GridSearchCV` or `RandomizedSearchCV` would you have to change, and how, to do so?
    - Find out which hyperparameters gave the best result
        - **Hint**: look at the `.best_estimator_` attribute on a fitted `GridSearchCV`/`RandomizedSearchCV` and `.get_params()`
    - Compute the performance of this model on the training, validation, and test set
    - Compute and plot feature importances for the resulting model. You can look at the `.feature_importances_` attribute of the best estimator.
        - **Bonus question**: which method is used by default to compute feature importances? Is any other method available in `sklearn`?

2. Do the exact same things as 1., this time using `XGBRegressor` (https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)
    - Note: you will have to install `xgboost` (https://xgboost.readthedocs.io/en/stable/install.html) to run this (in short `pip install xgboost`)
    - You will have to define an appropriate `scoring` parameter
    - Parameters for grid/randomized search will be slightly different: look at the documentation for XGBRegressor, and make informed choices based on what we discussed in class

3. Plot the performance of the best Random Forest models and the best XGBoost models, against models you fitted previously
    - Which models perform best?
    - How does the performance profile of RandomForest compare to XGBoost? Why?

4. Compare feature importances across `RandomForest` and `XGBoost`: do they look similar/different?

5. Overall reflection on modeling process
    - Reflect back on your choices for previous models: should you have transformed any of the features before fitting Linear Regression, KNN, or regularized models?
    - Can you think of ways in which our predictive problem can be made more interesting from a business perspective?
    - Which aspect of the data are we *not* modeling, that we could/should model?
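The search step in task 1 can be sketched as follows. This is a minimal example on synthetic data (an assumption, not the bike data), with a deliberately tiny grid. It uses `neg_root_mean_squared_error` as one possible choice for `scoring`, and other regression metrics would work too.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hyperparameter names as keys, lists of candidate values as values
param_grid = {
    "n_estimators": [20, 50],
    "max_features": [2, 4, 8],
    "min_samples_split": [2, 4],
}

rf = RandomForestRegressor(random_state=0)

# Exhaustive search: every combination in the grid is cross-validated
grid = GridSearchCV(rf, param_grid, scoring="neg_root_mean_squared_error", cv=3)
grid.fit(X, y)

# Randomized search: only n_iter combinations are sampled and cross-validated
rand = RandomizedSearchCV(
    rf,
    param_grid,
    n_iter=4,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
rand.fit(X, y)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand.cv_results_["params"])
print(f"Grid search evaluated {n_grid} combinations")
print(f"Randomized search evaluated {n_rand} combinations")
print("Best parameters (grid):", grid.best_params_)
print("Best CV RMSE (grid):", -grid.best_score_)
```

The score is negated because `scikit-learn` search objects always maximize the scoring function, so the sign has to be flipped back to read it as an RMSE.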


### Extra tasks
- Estimate a `DecisionTreeRegressor` with cross-validation, using the same logic we applied above: how does the performance of the resulting model compare to `RandomForestRegressor` and `XGBoost` regressor?
- Go back to your fitted `GridSearchCV` or `RandomizedSearchCV`, and inspect its attributes. Can you plot performance against the values of each of the hyperparameters you are tuning? Is there any systematic pattern?
- Reflect on hyperparameters passed to `GridSearchCV` or `RandomizedSearchCV`: how do you expect that individual manipulations of these parameters would affect the bias/variance profile of your models?
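As a starting point for inspecting a fitted search object, the sketch below converts `cv_results_` into a `pandas` DataFrame and averages the cross-validated RMSE over each value of each hyperparameter. The data and the grid are synthetic assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

param_grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 8]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# One row per hyperparameter combination
results = pd.DataFrame(search.cv_results_)
results["mean_rmse"] = -results["mean_test_score"]

# Average RMSE for each value of each hyperparameter
summaries = {}
for name in param_grid:
    summaries[name] = results.groupby(f"param_{name}")["mean_rmse"].mean()
    print(summaries[name], "\n")
```

Each summary is a `pandas` Series, so calling `.plot(marker="o")` on it is one way to visualize performance against that hyperparameter.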



### Dependencies

In [6]:
import pandas as pd
import numpy as np
import json
import pickle

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

import matplotlib.pyplot as plt
import seaborn as sns

### 1 Random forest with CV

In [2]:
# load data
with open('../class_02/data/train_shuffle.pkl', 'rb') as file:
    X_train, y_train = pickle.load(file)

with open('../class_02/data/val_shuffle.pkl', 'rb') as file:
    X_val, y_val = pickle.load(file)

with open('../class_02/data/test_shuffle.pkl', 'rb') as file:
    X_test, y_test = pickle.load(file)

In [8]:
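# Parameter dictionary
# Note: "friedman_mse" is the correct spelling of the criterion, and
# oob_score=True is only valid with bootstrap=True, so bootstrap is fixed to True
parameter_grid = {"n_estimators": [50, 100, 150],
                  "criterion": ["friedman_mse", "poisson"],
                  "min_samples_split": [2, 4],
                  "max_features": [10, 20, 25],
                  "bootstrap": [True],
                  "oob_score": [True],
                  "n_jobs": [-1],
                  "random_state": [123]}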




In [9]:
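# GRID SEARCH
rfr = RandomForestRegressor()
# scoring is the negated RMSE: GridSearchCV maximizes the score, so higher is better
clf = GridSearchCV(rfr, parameter_grid, scoring="neg_root_mean_squared_error")
clf.fit(X_train, y_train)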

In [None]:
# save best model (the estimator refitted on the training set with the best hyperparameters)
best_rfr = clf.best_estimator_

file_path = "best_rf_model.pkl"
with open(file_path, "wb") as file:
    pickle.dump(best_rfr, file)

In [None]:
# evaluate model
performance_rfr = []

# evaluate
for x,y,nsplit in zip([X_train, X_val, X_test],
                    [y_train, y_val, y_test],
                    ['train', 'val', 'test']):

    preds_rfr = best_rfr.predict(x)
    r2 = r2_score(y, preds_rfr)
    rmse = np.sqrt(mean_squared_error(y, preds_rfr))

    # cast to plain floats so that rounding and JSON serialization always work
    performance_rfr.append({'model': 'random_forest',
                            'split': nsplit,
                            'rmse': round(float(rmse), 4),
                            'r2': round(float(r2), 4)})

performance_rfr

# save performances
file_path = 'performances_random_forest.txt'
with open(file_path, 'w') as file:
    json.dump(performance_rfr, file)

In [None]:
# load old performances and combine
with open('../class_02/performances.txt', 'r') as file:
    performances = json.load(file)

with open('../class_03/performances_lasso_ridge.txt', 'r') as file:
    performances_2 = json.load(file)

performances.extend(performances_2)
performances.extend(performance_rfr)

In [None]:
# plot performances
perf_df = pd.DataFrame(performances)
sns.set_style('whitegrid')
sns.scatterplot(data=perf_df, 
                y='model', 
                x='rmse', 
                marker='s', 
                hue='split', palette=['grey', 'darkorange', 'darkred'])
plt.show()
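
The notebook stops at the performance plot. The feature importance step of task 1 could be sketched along the following lines. The example uses synthetic data and hypothetical feature names (`feature_0`, `feature_1`, ...), and compares the impurity-based importances stored in `.feature_importances_` with permutation importances computed on held-out data via `sklearn.inspection.permutation_importance`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names (assumption, for illustration only)
X, y = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances, computed from the training data during fitting
impurity = pd.Series(model.feature_importances_, index=feature_names)

# Permutation importances: drop in score when a feature is shuffled on held-out data
perm = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=0)
permutation = pd.Series(perm.importances_mean, index=feature_names)

importances = pd.DataFrame(
    {"impurity": impurity, "permutation": permutation}
).sort_values("impurity", ascending=False)
print(importances)
```

Calling `importances.plot.barh()` is one way to turn this table into a plot. With the real data, `model` would be replaced by the best estimator from the search and the feature names by the column names of the training set.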