# Decision Trees, Random Forests, and Boosting

In [None]:
%matplotlib inline

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
#this import is new
from sklearn import tree
# this import is new
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.metrics import zero_one_loss, roc_curve, roc_auc_score
# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
import pandas as pd

Let's read in the "Hitters" dataset from ISLR that has information on baseball players, their stats, and their salaries.  Also, we'll drop any rows with missing values.

In [2]:
hitters = pd.read_csv("./hitters.csv")
hitters = hitters.dropna(inplace=False)
hitters.head()

We'll get rid of a few categorical columns rather than deal with converting them.  Then we'll create a binary variable for whether a player makes more than the median salary.

In [3]:
X_df = hitters.drop(["Salary", "League", "Division", "NewLeague"], axis=1)
X = np.array(X_df)
y = (hitters["Salary"] >= np.median(hitters["Salary"])).astype("int")

Let's make the usual train/test split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

Next, we'll train a scikit-learn classification [decision tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) using the "gini" splitting criterion.  We could also have selected the "entropy" criterion.  Here's the documentation for the scikit-learn regression [decision tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html), which splits based on MSE.

In [None]:
dt = tree.DecisionTreeClassifier(criterion="gini", min_samples_split=10)
dt.fit(X_train, y_train)

dt_test_preds = dt.predict_proba(X_test)[:, 1]
dt_test_preds
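To make the "gini" and "entropy" criteria mentioned above a bit more concrete, here is a minimal sketch of the two impurity measures for a single node. The helper names `gini` and `entropy` and the toy label arrays are made up for illustration; this is not how scikit-learn computes them internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy of a node in bits: sum_k p_k * log2(1 / p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1.0 / p))

pure_node = np.array([1, 1, 1, 1])    # hypothetical node with a single class
mixed_node = np.array([0, 0, 1, 1])   # hypothetical 50/50 node

print(gini(pure_node), entropy(pure_node))    # both are zero for a pure node
print(gini(mixed_node), entropy(mixed_node))  # both are at their maximum for a 50/50 binary node
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why they usually lead to similar trees.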

There are several different ways to control the tree structure in all of the scikit-learn tree classes.  Usually you would select just one of them:
- `max_depth`: the number of layers deep to grow the tree (decision tree and random forest go as deep as possible by default; gradient boosting goes 3 deep)
- `min_samples_split`: don't continue to split an internal node if there are fewer than this many examples in the node; the default is 2, i.e. always keep splitting
- `min_samples_leaf`: only allow a split if each resulting leaf keeps at least this many examples; default is 1
- `max_leaf_nodes`: maximum number of final leaf nodes; default is None, i.e. as many as we can get
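Here is a small sketch of how each of the controls listed above changes the size of the tree. It uses a synthetic dataset from `make_classification` rather than the Hitters data, and the specific parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (an assumption; not the Hitters dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

settings = {
    "default (fully grown)": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=40": {"min_samples_split": 40},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=5": {"max_leaf_nodes": 5},
}

tree_sizes = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=0, **params).fit(X_demo, y_demo)
    tree_sizes[name] = (clf.get_depth(), clf.get_n_leaves())
    print(f"{name:>24}: depth={clf.get_depth()}, leaves={clf.get_n_leaves()}")
```

Each restriction should give a smaller tree than the fully grown default, which is the point of using them to limit overfitting.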

In [None]:
fpr_decision_tree, tpr_decision_tree, thresholds_decision_tree = roc_curve(y_test, dt_test_preds)

In [None]:
# we want to draw the random baseline ROC line too
fpr_rand = tpr_rand = np.linspace(0, 1, 10)

plt.plot(fpr_decision_tree, tpr_decision_tree)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

In [None]:
roc_auc_score(y_test, dt_test_preds)
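As a sanity check on what the AUC number means, here is a toy example with four hand-made scores (not model output). It compares `roc_auc_score` with the fraction of (positive, negative) pairs where the positive example gets the higher score, which is one way to read the AUC when there are no tied scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# made-up labels and scores, purely for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# fraction of (positive, negative) pairs that are ranked correctly
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)  # both equal 0.75 here: 3 of the 4 pairs are ordered correctly
```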

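Note that older versions of scikit-learn don't prune decision trees the way R does, so you should read through the lab in the book to see the R code for doing that.  (Releases from 0.22 onward add minimal cost-complexity pruning through the `ccp_alpha` parameter.)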
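If you have scikit-learn 0.22 or newer, a rough sketch of cost-complexity pruning looks like the following. The synthetic data and the choice of a mid-range `alpha` are assumptions made only to keep the example short; in practice you would pick `ccp_alpha` by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (an assumption; not the Hitters dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# candidate alphas along the pruning path of the fully grown tree
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# take a mid-range alpha for illustration; clip at 0 to guard against float noise
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print("leaves:", full_tree.get_n_leaves(), "->", pruned_tree.get_n_leaves())
print("test accuracy:", full_tree.score(X_te, y_te), "->", pruned_tree.score(X_te, y_te))
```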

Let's visualize the tree that we've grown.  For this to work, you have to install [graphviz](http://www.graphviz.org/Download..php) and do `pip install pydot`.

In [None]:
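from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
import pydot

dot_data = StringIO()
tree.export_graphviz(dt, out_file=dot_data)
graphs = pydot.graph_from_dot_data(dot_data.getvalue())
# newer pydot versions return a list of graphs, older ones a single graph
graph = graphs[0] if isinstance(graphs, list) else graphs
graph.write_pdf("./hitters.pdf")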

In [None]:
X_df.columns[3]

## The Decision Boundary of a Decision Tree

Let's build a model from only two predictors, `Walks` and `Hits`, so that we can visualize the decision boundary, i.e. how the tree partitions the two-dimensional space.

In [None]:
X_2 = np.array(hitters[["Walks", "Hits"]])
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y, train_size=0.7)

dt.fit(X_2_train, y_2_train)

As we've done before with k-NN and logistic regression, we'll make a two-dimensional grid and get the model prediction at each point, and then color the plane by the prediction.

In [None]:
# step size of the mesh
h = 20
# range of the mesh
x_min, x_max = X_2[:, 0].min() - .5, X_2[:, 0].max() + .5
y_min, y_max = X_2[:, 1].min() - .5, X_2[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

In [None]:
# ravel is the same as reshape(-1)
all_preds = dt.predict(np.column_stack((xx.ravel(), yy.ravel())))
grid_preds = all_preds.reshape(xx.shape)
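If the `meshgrid` / `ravel` / `reshape` steps feel opaque, this tiny example traces the shapes on a 3-by-2 grid. The threshold rule standing in for `predict` is invented for illustration; any fitted model would slot in the same way.

```python
import numpy as np

xs = np.arange(0, 3)      # 3 values along the first predictor
ys = np.arange(10, 12)    # 2 values along the second predictor
xx_demo, yy_demo = np.meshgrid(xs, ys)   # both have shape (2, 3)

# one row per grid point, one column per predictor
grid_points = np.column_stack((xx_demo.ravel(), yy_demo.ravel()))

# stand-in for model.predict: label a point 1 when its first coordinate is >= 1
fake_preds = (grid_points[:, 0] >= 1).astype(int)

# back to the grid layout so that pcolormesh can color each cell
grid_demo = fake_preds.reshape(xx_demo.shape)

print(grid_points.shape)
print(grid_demo)
```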

In [None]:
plt.pcolormesh(xx, yy, grid_preds, cmap=plt.cm.Paired)

plt.scatter(X_2[:, 0], X_2[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Walks')
plt.ylabel('Hits')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

# Random Forests

scikit-learn also has an easy-to-use random forest [classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [regressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

The parameter `n_estimators` controls how many different trees we want to fit, each one on a bootstrap sampled version of the dataset.  The parameter `max_features` controls how many of the different predictors we consider splitting on at each node.  `n_jobs` means that we can split the training up over multiple cores to make it faster.  And `oob_score` tells it to save the out-of-bag scores for each example.

In [None]:
rf = RandomForestClassifier(n_estimators=1000, criterion='gini', max_depth=None, max_features='sqrt', oob_score=True,
                            n_jobs=1, verbose=0)
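To see where the "out-of-bag" examples come from, here is a small NumPy-only simulation of bootstrap sampling. The sample size and number of repetitions are arbitrary; it only illustrates the sampling step, not how the forest itself is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000   # hypothetical training set size
n_trees = 200      # hypothetical number of bootstrap samples

oob_fractions = []
for _ in range(n_trees):
    # a bootstrap sample draws n_samples row indices with replacement
    boot_idx = rng.integers(0, n_samples, size=n_samples)
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[boot_idx] = True
    # rows never drawn are "out of bag" for this tree
    oob_fractions.append(1.0 - in_bag.mean())

mean_oob = float(np.mean(oob_fractions))
print(mean_oob)  # close to (1 - 1/n)^n, which tends to exp(-1), roughly 0.368
```

So each tree leaves out roughly a third of the training examples, and those left-out examples can be used to evaluate it for free.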

In [None]:
rf.fit(X_train, y_train)

rf_test_preds = rf.predict_proba(X_test)[:, 1]
rf_test_preds

In [None]:
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, rf_test_preds)

In [None]:
plt.plot(fpr_decision_tree, tpr_decision_tree)
plt.plot(fpr_rf, tpr_rf)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

In [None]:
roc_auc_score(y_test, rf_test_preds)

Random forests and gradient boosting tree ensembles have a metric that tells us how important different features are.  This comes from adding up the decrease in the splitting criterion (e.g. Gini impurity) each time that predictor is used as a split.

In [None]:
rf_importances = rf.feature_importances_

In [None]:
# we get the indices in the order that would make the importances sorted
sorted_indices = np.argsort(rf_importances)

y_pos = range(len(rf_importances))
plt.barh(y_pos, rf_importances[sorted_indices], align='center')
plt.yticks(y_pos, X_df.columns[sorted_indices])
plt.show()
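A quick way to build some trust in these importances is to try them on data where we know which predictors matter. This sketch uses `make_classification` with `shuffle=False`, so the three informative features sit in the first three columns and the rest are noise; the dataset and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in columns 0-2; columns 3-7 are noise
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)
importances = rf_demo.feature_importances_

print(importances.round(3))
print("sum:", importances.sum())   # importances are normalized to sum to 1
print("features from most to least important:", np.argsort(importances)[::-1])
```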

We can get the accuracy on the out-of-bag examples:

In [None]:
rf.oob_score_

And the out-of-bag class probability estimates for each training example:

In [None]:
rf.oob_decision_function_
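The two out-of-bag attributes are closely related. As a sketch on synthetic data (an assumption, not the Hitters data), taking the most probable class from each row of `oob_decision_function_` and comparing it with the true labels should reproduce `oob_score_`, provided there are enough trees that every example is out of bag at least once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

# each row holds class probabilities averaged over the trees that did NOT train on that example
oob_probs = rf_demo.oob_decision_function_
oob_labels = rf_demo.classes_[np.argmax(oob_probs, axis=1)]

manual_oob_accuracy = np.mean(oob_labels == y_demo)
print(rf_demo.oob_score_, manual_oob_accuracy)
```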

As we saw last week, we can use `GridSearchCV` to scan over the tunable parameters of the random forest and find the best values.

In [None]:
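# 'auto' meant the same as 'sqrt' for classifiers and has been removed from recent releases;
# None considers every predictor at each split
param_grid = {"n_estimators": [1, 10, 100, 500], "max_depth": [1, 2, 5, None], "max_features": ['sqrt', None]}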

cv = GridSearchCV(rf, param_grid, cv=5, n_jobs=4, refit=True)
cv.fit(X_train, y_train)

In [None]:
cv.best_estimator_

The most important parameter is the number of trees:

In [None]:
# `cv_results_` replaces the `grid_scores_` attribute of older scikit-learn releases
cv.cv_results_
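With `sklearn.model_selection.GridSearchCV`, the per-combination scores live in the `cv_results_` dictionary. Here is a small sketch of how one might rank the parameter combinations by mean cross-validated score; the synthetic data and the deliberately tiny grid are assumptions to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"n_estimators": [5, 50], "max_depth": [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of parallel arrays, one entry per parameter combination
results = search.cv_results_
ranked = sorted(zip(results["mean_test_score"], results["params"]),
                key=lambda pair: pair[0], reverse=True)
for mean_score, params in ranked:
    print(f"{mean_score:.3f}  {params}")

print("best:", search.best_params_)
```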

In [None]:
# step size of the mesh
h = 20
# range of the mesh
x_min, x_max = X_2[:, 0].min() - .5, X_2[:, 0].max() + .5
y_min, y_max = X_2[:, 1].min() - .5, X_2[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

In [None]:
rf = RandomForestClassifier(n_estimators=1000, criterion='gini', max_depth=None, max_features='sqrt',
                            n_jobs=1, verbose=0)
rf.fit(X_2_train, y_2_train)

In [None]:
# ravel is the same as reshape(-1)
all_preds = rf.predict(np.column_stack((xx.ravel(), yy.ravel())))
grid_preds = all_preds.reshape(xx.shape)

In [None]:
plt.pcolormesh(xx, yy, grid_preds, cmap=plt.cm.Paired)

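# color by y_2_train: it comes from the same split as X_2_train (y_train belongs to a different split)
plt.scatter(X_2_train[:, 0], X_2_train[:, 1], c=y_2_train, edgecolors='k', cmap=plt.cm.Paired)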
plt.xlabel('Walks')
plt.ylabel('Hits')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

# Boosting

Next, let's look at the gradient boosting [classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) (there's also a gradient boosting [regressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)).

Like random forests, the `n_estimators` parameter controls how many trees we want to fit.  The `learning_rate` parameter shrinks the contribution of each tree, which controls how slowly the classifier learns.  Typically, the smaller the learning rate, the more trees you'll need.

In [None]:
gbt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, max_depth=3)
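To get a feel for the trade-off between `learning_rate` and `n_estimators`, here is a toy version of boosting for a regression problem with squared error, where each new tree is fit to the current residuals. This is a simplified sketch with a made-up `boost` helper and synthetic data, not scikit-learn's implementation (and it is regression rather than the classification setting used in this notebook).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=200)

def boost(X, y, n_estimators, learning_rate, max_depth=3):
    """Toy boosting for squared error: each tree fits the current residuals."""
    prediction = np.full(len(y), y.mean())     # start from a constant model
    train_mse = []
    for _ in range(n_estimators):
        residuals = y - prediction             # negative gradient of 0.5 * squared error
        small_tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # only take a fraction of the step the new tree suggests
        prediction = prediction + learning_rate * small_tree.predict(X)
        train_mse.append(np.mean((y - prediction) ** 2))
    return train_mse

fast = boost(X_demo, y_demo, n_estimators=50, learning_rate=1.0)
slow = boost(X_demo, y_demo, n_estimators=50, learning_rate=0.1)
print("training MSE after 5 trees: ", fast[4], slow[4])
print("training MSE after 50 trees:", fast[-1], slow[-1])
```

With the smaller learning rate the training error comes down more gradually, so more trees are needed to reach the same fit. Keep in mind that this only tracks training error; a fast drop on the training set is not necessarily good for test error.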

In [None]:
gbt.fit(X_train, y_train)

gbt_test_preds = gbt.predict_proba(X_test)[:, 1]
gbt_test_preds

In [None]:
fpr_gbt, tpr_gbt, thresholds_gbt = roc_curve(y_test, gbt_test_preds)

In [None]:
plt.plot(fpr_decision_tree, tpr_decision_tree)
plt.plot(fpr_rf, tpr_rf)
plt.plot(fpr_gbt, tpr_gbt)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

In [None]:
roc_auc_score(y_test, gbt_test_preds)

In [None]:
gbt_importances = gbt.feature_importances_
sorted_indices = np.argsort(gbt_importances)

y_pos = range(len(gbt_importances))
plt.barh(y_pos, gbt_importances[sorted_indices], align='center')
plt.yticks(y_pos, X_df.columns[sorted_indices])
plt.show()

GBT has a `staged_predict` method that shows you what prediction it would make after each tree is added to the ensemble.

In [None]:
train_errs = []
for y_pred in gbt.staged_predict(X_train):
    train_errs.append(zero_one_loss(y_train, y_pred))

test_errs = []
for y_pred in gbt.staged_predict(X_test):
    test_errs.append(zero_one_loss(y_test, y_pred))

# the i-th stage (0-based) uses i + 1 trees
num_trees = list(range(1, len(train_errs) + 1))

In [None]:
plt.plot(num_trees, train_errs, label="train")
plt.plot(num_trees, test_errs, label="test")
plt.xlabel("Number of Trees")
plt.ylabel("Error")
plt.legend()
plt.show()

[This](http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html) is probably a better example of what this plot would look like on a bigger dataset.

I don't have time to go into these, but **[partial dependence plots](http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html)** are a great way to visualize what's happening in a regression problem with tree ensembles.
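For a rough idea of what a partial dependence curve computes, here is a hand-rolled sketch: force one predictor to a fixed value for every row, average the model's predictions, and repeat over a grid of values. The helper `manual_partial_dependence` and the synthetic regression data are assumptions for illustration; scikit-learn ships its own partial dependence tools, which are not used here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=5, n_informative=3,
                                 random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

def manual_partial_dependence(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_forced = X.copy()
        X_forced[:, feature] = value       # overwrite one predictor for every row
        averages.append(model.predict(X_forced).mean())
    return np.array(averages)

feature = 0
grid = np.linspace(X_demo[:, feature].min(), X_demo[:, feature].max(), 20)
pdp_curve = manual_partial_dependence(gbr, X_demo, feature, grid)

# first column: grid value of the predictor, second column: average prediction
print(np.column_stack((grid, pdp_curve)).round(2))
```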

## Extra Random Trees

Amazingly, trying random split thresholds on a random subset of predictors and keeping the best of those random candidates, rather than searching for the optimal threshold, does a great job.  That's what the extra random trees [classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) and [regressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) do.  Why would you ever want to do this?  It's blazingly fast because you don't have to search over every possible threshold for each predictor.  You just try random ones!

In [None]:
et = ExtraTreesClassifier(n_estimators=500, n_jobs=1)

In [None]:
et.fit(X_train, y_train)

et_test_preds = et.predict_proba(X_test)[:, 1]
et_test_preds
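To illustrate the idea behind a single extra-trees split, here is a simplified sketch: draw one random threshold for each of a few randomly chosen predictors and keep whichever candidate gives the lowest weighted Gini impurity. The `random_split` helper and the toy data are invented for illustration and leave out most of what the real implementation does.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_split(X, y, n_candidate_features, rng):
    """Choose a split from one random threshold per randomly chosen feature."""
    features = rng.choice(X.shape[1], size=n_candidate_features, replace=False)
    best = None
    for j in features:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue                         # constant feature, nothing to split on
        threshold = rng.uniform(lo, hi)      # drawn at random, not optimized
        left = X[:, j] <= threshold
        if left.all() or not left.any():
            continue                         # skip degenerate splits
        # weighted impurity of the two children; lower is better
        score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
        if best is None or score < best[0]:
            best = (float(score), int(j), float(threshold))
    return best

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 2] > 0).astype(int)      # only column 2 carries signal
print(random_split(X_demo, y_demo, n_candidate_features=3, rng=rng))
```

The returned tuple is (weighted impurity, feature index, threshold). No sorting or exhaustive threshold search is needed, which is where the speed comes from.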

In [None]:
fpr_et, tpr_et, thresholds_et = roc_curve(y_test, et_test_preds)

In [None]:
plt.plot(fpr_decision_tree, tpr_decision_tree)
plt.plot(fpr_rf, tpr_rf)
plt.plot(fpr_gbt, tpr_gbt)
plt.plot(fpr_et, tpr_et)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

In [None]:
roc_auc_score(y_test, et_test_preds)

In [None]:
et = ExtraTreesClassifier(n_estimators=1000, n_jobs=1)
et.fit(X_2_train, y_2_train)

In [None]:
# ravel is the same as reshape(-1)
all_preds = et.predict(np.column_stack((xx.ravel(), yy.ravel())))
grid_preds = all_preds.reshape(xx.shape)

In [None]:
plt.pcolormesh(xx, yy, grid_preds, cmap=plt.cm.Paired)

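# color by y_2_train: it comes from the same split as X_2_train (y_train belongs to a different split)
plt.scatter(X_2_train[:, 0], X_2_train[:, 1], c=y_2_train, edgecolors='k', cmap=plt.cm.Paired)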
plt.xlabel('Walks')
plt.ylabel('Hits')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()