# Random Forests & Boosting Webinar

## Overview:

- ### Why do we use decision trees, random forests, & boosting
- ### Review of concepts:
    - ### Decision Trees
    - ### Bootstrapping
    - ### Bagging
    - ### Random Forests
    - ### Boosting
- ### Examples:
    - ### Random forests
    - ### Boosting algorithm

## Why do we use decision trees, random forests, & boosting?

- Decision trees make no assumptions about the distribution of the data (i.e. nonparametric)
- Decision trees handle both regression and classification problems well
- The resulting model is easy to interpret (e.g. important variables can be identified, and the tree can be easily visualized for a non-technical audience)
- Random forests are a simple, clear extension of decision trees that improves performance
    - Using the bootstrap is one way they improve over a single decision tree: averaging trees fit on bootstrap samples reduces variance, and the bootstrap often reduces bias in finite samples (i.e. asymptotic refinements)
- Empirical applications have shown random forests and boosted decision trees to be effective out-of-the-box algorithms, especially when the underlying relationship is non-linear
    - XGBoost is a very popular gradient boosted decision tree algorithm that has won many online data science competitions 

## Review of concepts: Decision Trees

![CarType](images/ex_tree_car_own.jpg)

![RegTree](images/reg_tree.png)

### How to build a tree?

Look at all the cut points of all the variables, and pick the split that improves the algorithm the most.

**Well, what is "improves the algorithm the most"?**

### Decision rules

- ### Classification error
- ### Entropy
- ### Information gain
- ### Gini impurity
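
The decision rules listed above can be made concrete with a short sketch. The helper names (`class_probs`, `classification_error`, `entropy`, `gini`, `best_split`) are hypothetical and written only for illustration, and the six-point data set is made up. Information gain is computed here as the parent's impurity minus the size-weighted impurity of the two children.

```python
import numpy as np

def class_probs(y):
    """Class proportions in the label array y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def classification_error(y):
    """Share of observations not in the majority class."""
    return float(1.0 - class_probs(y).max())

def entropy(y):
    """Shannon entropy in bits."""
    p = class_probs(y)
    return float(-(p * np.log2(p)).sum())

def gini(y):
    """Gini impurity: chance that two random draws have different labels."""
    p = class_probs(y)
    return float(1.0 - (p ** 2).sum())

def best_split(x, y, impurity=gini):
    """Scan every midpoint between sorted unique values of one feature
    and return the (threshold, gain) with the largest impurity reduction."""
    parent = impurity(y)
    values = np.unique(x)
    best_threshold, best_gain = None, 0.0
    for threshold in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        children = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        gain = parent - children
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Made-up toy data: the classes separate cleanly between 3 and 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

threshold, info_gain = best_split(x, y, impurity=entropy)
print("best threshold:", threshold, "information gain:", info_gain)
print("gini:", gini(y), "classification error:", classification_error(y))
```

A full tree builder would repeat this search over every variable and then recurse into each child node.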

### Why not grow a huge tree for minimal training error?

What's the answer in 95% of machine learning questions?

**AVOID OVER-FITTING!**


![Overfit](images/overfit.png)


There are loads of "regularization" methods that aim to minimize test error by not over-fitting to the training data:

- Tree depth
- Number of leaves
- Number of nodes
- Leaf size
- Require each split to reduce classification error by at least a minimum amount
- Pruning (i.e. a total cost formula that trades off tree complexity against training error)
- Among many others...
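
As a rough illustration, several of the controls above map onto `DecisionTreeClassifier` arguments in scikit-learn. The data below are synthetic and the parameter values are arbitrary choices, not tuned recommendations. Note that `min_impurity_decrease` thresholds the impurity reduction rather than classification error directly, and there is no direct cap on the total number of nodes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the webinar data sets)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# No limits: the tree keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

small_tree = DecisionTreeClassifier(
    max_depth=4,                  # tree depth
    max_leaf_nodes=12,            # number of leaves
    min_samples_leaf=5,           # leaf size
    min_impurity_decrease=0.001,  # minimum improvement required to split
    ccp_alpha=0.001,              # cost-complexity pruning strength
    random_state=0,
).fit(X_train, y_train)

for name, tree in [("unrestricted", full_tree), ("regularized", small_tree)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| train acc:", round(tree.score(X_train, y_train), 3),
          "| test acc:", round(tree.score(X_test, y_test), 3))
```

Comparing the training and test accuracy of the two trees is one quick way to see how much of the unrestricted tree's fit is over-fitting.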

## Review of concepts: Bootstrapping

**The bootstrap is resampling from our data to estimate the distribution of an estimator.**

![Bootstrap](images/bootstrap.png)

- I do not follow the math for why the bootstrap actually gives improved finite sample performance -- termed asymptotic refinements -- (it seems like magic), but intuitively it makes some sense:
    - given our random sample, each observation had an equal probability of arising
    - thus, when we take a random sample (with replacement) of our random sample, each observation still has an equal probability of ending up in the bootstrap sample
    - so it is as if we are able to conduct many experiments, even though we are working with a data set smaller than the population; the hope is that, because these data are the ones that were actually collected, they are representative of that population
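
A minimal sketch of the idea, assuming made-up exponential data and the median as the estimator of interest (both are illustrative choices, not part of the webinar material): resample the data with replacement many times, recompute the estimator each time, and use the spread of those values to approximate its sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample standing in for "our data"
sample = rng.exponential(scale=2.0, size=200)

n_boot = 2000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # Same size as the original sample, drawn with replacement
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[b] = np.median(resample)

# The bootstrap distribution approximates the estimator's sampling distribution
std_error = boot_medians.std(ddof=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print("sample median:", np.median(sample))
print("bootstrap standard error:", std_error)
print("95% percentile interval:", (ci_low, ci_high))
```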

## Review of concepts: Bagging

**Bagging is Bootstrap AGGregation where we create many bootstrap samples, fit models on all, and then combine.**

![Bagging](images/bagging.png)

- A good analogy is the wisdom of crowds: each model doesn't need to be great, but their average is highly predictive
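
Bagging can be sketched by hand in a few lines. The example below uses synthetic data and 51 trees (an odd number, chosen only to avoid tied votes); it is an illustration of the procedure, not the implementation scikit-learn uses internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 0/1 labels
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_models = 51
all_preds = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape (n_models, n_test)

# Aggregate: majority vote across the trees
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = (all_preds[0] == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print("first tree accuracy:", single_acc)
print("bagged accuracy:", bagged_acc)
```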

## Review of concepts: Random Forests

**Random Forest is an ensemble algorithm which creates many decision trees using bagging and, at each split, considers only a random subset of the variables.**

![RF_Algorithm](images/rf.png)


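We set m ≈ sqrt(d), where d is the total number of variables and m is the number of variables randomly selected as candidates at each split. This is an empirical result as opposed to a mathematical result.

The improvement over plain bagging comes from reducing the correlation between the trees.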
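
In scikit-learn, the number of candidate variables per split is the `max_features` argument of `RandomForestClassifier`. The sketch below, on synthetic data with an assumed d = 16, compares `max_features=None` (every variable is a candidate, which amounts to bagged trees) with m = sqrt(d). Passing `max_features="sqrt"` would be an equivalent way to request the latter.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

d = 16  # assumed number of variables for this illustration
X, y = make_classification(n_samples=500, n_features=d, n_informative=6, random_state=0)

m = round(math.sqrt(d))  # number of variables considered at each split

bagged_trees = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
random_forest = RandomForestClassifier(n_estimators=100, max_features=m, random_state=0)

for name, model in [("bagged trees (m = d)", bagged_trees),
                    ("random forest (m = sqrt(d))", random_forest)]:
    scores = cross_val_score(model, X, y, cv=4)
    print(name, "mean CV accuracy:", scores.mean())
```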

### What is Ensembling?

**Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model. For example, given predictions from several models we could:

- **Regression:** Take the average of the predictions.
- **Classification:** Take a vote and use the most common prediction.

For ensembling to work well, the models must be:

- **Accurate:** They outperform the null model.
- **Independent:** Their predictions are generated using different processes.

**The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when you average the models.
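
A small simulation can illustrate the big idea. It assumes an idealized setting that real models rarely meet exactly: 101 binary classifiers that are each right 60% of the time and make their mistakes fully independently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_cases, p_correct = 101, 10_000, 0.6

# correct[i, j] is True when model i gets case j right;
# the draws are independent across models by construction
correct = rng.random((n_models, n_cases)) < p_correct

individual_acc = correct.mean(axis=1)

# For a binary problem, the majority vote is right whenever
# more than half of the models are right
majority_correct = correct.sum(axis=0) > n_models / 2
ensemble_acc = majority_correct.mean()

print("best individual accuracy:", individual_acc.max())
print("majority-vote accuracy:", ensemble_acc)
```

When the models' errors are correlated, the gain from voting shrinks, which is why independence matters.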

There are two basic **methods for ensembling:**

- Manually ensembling your individual models.
- Using a model that ensembles for you.
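
The two methods might look like this in scikit-learn. This is a sketch on synthetic data; the choice of models and the simple probability averaging are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 0/1 labels
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Manual ensembling: fit different models, then average their predicted probabilities
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=4, random_state=0),
]
probas = [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]
manual_pred = (np.mean(probas, axis=0) >= 0.5).astype(int)

# A model that ensembles for you: a random forest combines its own trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("manual ensemble accuracy:", (manual_pred == y_test).mean())
print("random forest accuracy:", forest.score(X_test, y_test))
```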

### Comparing Manual Ensembling With a Single Model Approach

**Advantages of manual ensembling:**

- It increases predictive accuracy.
- It's easy to get started.

**Disadvantages of manual ensembling:**

- It decreases interpretability.
- It takes longer to train.
- It takes longer to predict.
- It is more complex to automate and maintain.
- Small gains in accuracy may not be worth the added complexity.

## Review of concepts: Boosting

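Boosting is an intelligent way of improving on the weaknesses of the model: models are fit sequentially, and each new fit focuses on the errors made by the models before it (for example, AdaBoost re-weights the misclassified observations).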

![boosted](images/boosting.png)
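
The re-weighting idea can be sketched as a bare-bones AdaBoost with decision stumps. This is a simplified illustration on synthetic data with labels coded as -1/+1; it is not the exact algorithm behind `AdaBoostClassifier`, and the helper `boosted_predict` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
y = np.where(y01 == 1, 1, -1)  # labels in {-1, +1}

n_rounds = 30
weights = np.full(len(y), 1 / len(y))  # start with equal weight on every observation
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner (depth-1 tree) to the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error, clipped to keep the log finite
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote strength

    # Increase the weight of misclassified observations, decrease the rest
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)

print("first stump training accuracy:", (stumps[0].predict(X) == y).mean())
print("boosted training accuracy:", (boosted_predict(X) == y).mean())
```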

## Examples: Random Forests

### Admissions data

Classify `admit` using the predictors `gre`, `gpa`, and `prestige`.

In [2]:
#!pip install pydotplus

In [3]:
# Imports
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
from IPython.display import Image  

import pydotplus
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

In [4]:
# Load admissions data
admit = pd.read_csv('datasets/admissions.csv')

In [5]:
# Drop the nulls instead of imputing right now, to save time
admit = admit.dropna()
admit.head()

|   | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


In [6]:
# Clean up the data set: recode the target from {0, 1} to {-1, 1}
y_class = [-1 if k == 0 else k for k in admit['admit']]
X_class = admit[['gpa','gre','prestige']]

In [7]:
# Create the train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class,
    test_size = .25, random_state = 42)

In [8]:
len(y_test)

100

In [9]:
len(y_train)

297

In [10]:
# Build tree
model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [11]:
model_dt_scores = cross_val_score(model_dt, X_train, y_train, cv=4)
print("Decision Tree Classifier max depth 2: ", model_dt_scores, 'Mean score:', np.mean(model_dt_scores))
model_dt.score(X_test,y_test)

Decision Tree Classifier max depth 2:  [0.65333333 0.72972973 0.75675676 0.71621622] Mean score: 0.7140090090090091


0.61
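
One way to inspect a fitted tree without extra plotting dependencies is scikit-learn's `export_text`. The sketch below is self-contained and uses randomly generated stand-in data that borrows the admissions column names; the labelling rule is made up, so the printed rules are illustrative only and say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300

# Random stand-in data with the same column names as the admissions set
toy = pd.DataFrame({
    "gpa": rng.uniform(2.0, 4.0, n).round(2),
    "gre": rng.integers(200, 801, n).astype(float),
    "prestige": rng.integers(1, 5, n).astype(float),
})

# Made-up labelling rule, coded as -1/1 like the cells above
toy_y = np.where((toy["gpa"] > 3.3) & (toy["gre"] > 550), 1, -1)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy, toy_y)

# Plain-text rendering of the split rules and leaf classes
rules = export_text(tree, feature_names=list(toy.columns))
print(rules)
```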

In [12]:
model_rf = RandomForestClassifier(n_estimators=20, max_features=2, max_depth=4, bootstrap=True)
model_rf.fit(X_train, y_train)
model_rf_scores = cross_val_score(model_rf, X_train, y_train, cv=4)
print("Random Forest Classifier max depth 4: ", model_rf_scores, 'Mean score:', np.mean(model_rf_scores))
model_rf.score(X_test,y_test)

Random Forest Classifier max depth 4:  [0.66666667 0.74324324 0.75675676 0.72972973] Mean score: 0.7240990990990992


0.66

### Let's use the college dataset to see if we can get a similar improvement

In [13]:
col = pd.read_csv('datasets/College.csv')
y   = col.Private.map(lambda x: 1 if x == 'Yes' else -1)
X   = col.iloc[:, 2:]

In [14]:
y.shape

(777,)

In [15]:
X.head()

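|   | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |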


In [16]:
X.shape

(777, 17)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = .25, random_state = 1)

In [18]:
model_dt = DecisionTreeClassifier(max_depth=5)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [19]:
model_dt_scores = cross_val_score(model_dt, X_train, y_train, cv=4)
print("Decision Tree Classifier max depth 5: ",model_dt_scores, 'Mean score:', np.mean(model_dt_scores))

Decision Tree Classifier max depth 5:  [0.88435374 0.91724138 0.91724138 0.91724138] Mean score: 0.9090194698569082


In [20]:
model_dt.score(X_test,y_test)

0.9128205128205128

In [21]:
max_depth_input = 5
model_rf = RandomForestClassifier(n_estimators=100, max_features=6, max_depth=max_depth_input, bootstrap=True)
model_rf.fit(X_train, y_train)
model_rf_scores = cross_val_score(model_rf, X_train, y_train, cv=4)
print("Random Forest Classifier max depth ",max_depth_input, ": ", model_rf_scores, 'Mean score:', np.mean(model_rf_scores))
model_rf.score(X_test,y_test)

Random Forest Classifier max depth  5 :  [0.93197279 0.93793103 0.91724138 0.94482759] Mean score: 0.9329931972789115


0.9384615384615385

## Examples: Boosting

In [22]:
#Import
from sklearn.ensemble import AdaBoostClassifier

In [23]:
col = pd.read_csv('datasets/College.csv')
y =col.Private.map(lambda x: 1 if x == 'Yes' else -1)
X = col.iloc[:, 2:]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = .25, random_state = 1)

In [24]:
model_bst = AdaBoostClassifier(
    # scikit-learn >= 1.2 names this argument `estimator` (formerly `base_estimator`)
    estimator=DecisionTreeClassifier(max_features=6, max_depth=5),
    n_estimators=200, learning_rate=0.01, random_state=1)
model_bst.fit(X_train, y_train)
model_bst.score(X_test,y_test)

0.958974358974359

# EXTRAS!

## Let's see how gradient boosting does

In [25]:
#Import
from sklearn.ensemble import GradientBoostingClassifier

In [26]:
col = pd.read_csv('datasets/College.csv')
y   = col.Private.map(lambda x: 1 if x == 'Yes' else -1)
X   = col.iloc[:, 2:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 1)

In [27]:
model_grad_bst = GradientBoostingClassifier(max_features=8, max_depth=5, n_estimators=400, learning_rate=0.05, random_state=1)
model_grad_bst.fit(X_train, y_train)
model_grad_bst.score(X_test,y_test)

0.9435897435897436

## What about the admissions data?

In [28]:
admit = pd.read_csv('datasets/admissions.csv')
admit = admit.dropna()
y = [-1 if k == 0 else k for k in admit['admit']]
X = admit[['gpa','gre','prestige']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 2)

In [29]:
model_grad_bst = GradientBoostingClassifier(max_features=2, max_depth=2, n_estimators=2750, learning_rate=0.001, random_state=1)
model_grad_bst.fit(X_train, y_train)
model_grad_bst_scores = cross_val_score(model_grad_bst, X_train, y_train, cv=4)
print("Grad Boost Classifier: ",model_grad_bst_scores, 'Mean score:', np.mean(model_grad_bst_scores))
model_grad_bst.score(X_test,y_test)

Grad Boost Classifier:  [0.74666667 0.70666667 0.66216216 0.73972603] Mean score: 0.7138053807231889


0.69

## Why do we care about balanced classes?

In [30]:
# let's get the college private/public data again, but this time balance the classes by sampling 200 of each
col = pd.read_csv('datasets/College.csv')
col_priv = col.loc[col['Private'] == 'Yes']
col_pub = col.loc[col['Private'] == 'No']

In [31]:
col_priv.shape

(565, 19)

In [32]:
col_pub.shape

(212, 19)

In [33]:
col_priv = col_priv.sample(200)
col_pub = col_pub.sample(200)
col = pd.concat([col_priv, col_pub])  # DataFrame.append was removed in pandas 2.0
col.shape

(400, 19)
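
To see why class balance matters when reading accuracy scores, it helps to work out the accuracy of a "null model" that always predicts the majority class. The arithmetic below uses only the class counts shown above (565 private, 212 public) and the 200/200 sample just drawn.

```python
# Class counts from the full college data set (see the shapes above)
n_private, n_public = 565, 212
total = n_private + n_public

# Always predicting the majority class ("Yes") on the full data
null_accuracy = max(n_private, n_public) / total
print(f"null accuracy, original data: {null_accuracy:.3f}")

# After sampling 200 of each class, the same baseline is a coin flip
balanced_null_accuracy = 200 / (200 + 200)
print(f"null accuracy, balanced sample: {balanced_null_accuracy:.3f}")
```

On the original data a model has to beat roughly 0.73 to be doing anything useful, whereas on the balanced sample the bar is 0.5, so the accuracies from the two settings are not directly comparable.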

In [34]:
y   = col.Private.map(lambda x: 1 if x == 'Yes' else -1)
X   = col.iloc[:, 2:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 1)

In [35]:
max_depth_input = 5
model_rf = RandomForestClassifier(n_estimators=100, max_features=6, max_depth=max_depth_input, bootstrap=True)
model_rf.fit(X_train, y_train)
model_rf_scores = cross_val_score(model_rf, X_train, y_train, cv=4)
print("Random Forest Classifier max depth ",max_depth_input, ": ", model_rf_scores, 'Mean score:', np.mean(model_rf_scores))
model_rf.score(X_test,y_test)

Random Forest Classifier max depth  5 :  [0.92105263 0.94736842 0.93243243 0.91891892] Mean score: 0.9299431009957325


0.87

In [36]:
model_grad_bst = GradientBoostingClassifier(max_features=2, max_depth=9, n_estimators=200, learning_rate=0.005, random_state=1)
model_grad_bst.fit(X_train, y_train)
model_grad_bst.score(X_test,y_test)

0.86

#### References:

- https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29
- https://en.wikipedia.org/wiki/Random_forest
- https://en.wikipedia.org/wiki/Gradient_boosting#Informal_introduction
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
- Joel Horowitz. 2001. The Bootstrap. Handbook of Econometrics. https://www.sciencedirect.com/science/article/pii/S157344120105005X?via%3Dihub
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. https://web.stanford.edu/~hastie/ElemStatLearn//