# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do so on a few datasets, starting with the ones offered by scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y.
2. Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
3. Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How can you judge that?

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
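# cross_val_score and GridSearchCV live in sklearn.model_selection;
# the older sklearn.cross_validation and sklearn.grid_search modules have been removed
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn import preprocessing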

In [2]:
# Load the data and set out features and target
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

In [3]:
# Initialize a Decision Tree Classifier
dtc = DecisionTreeClassifier()
# Use cross_val_score to evaluate DTC performance
print('DTC score: %f' % cross_val_score(dtc,X,y,cv=5,n_jobs=-1).mean())

DTC score: 0.919169


In [4]:
# Wrap a Bagging Classifier around the DTC
bagging = BaggingClassifier(DecisionTreeClassifier())
# Use cross_val_score to evaluate performance with bagging
print('Bagging score: %f' % cross_val_score(bagging,X,y,cv=5,n_jobs=-1).mean())

Bagging score: 0.954475


In [5]:
# The mean bagging score (0.954) is higher than the single tree's (0.919).
# Calling the gap significant requires comparing it with the fold-to-fold spread of the scores, not just the two means.
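
One way to approach the "how can you judge that?" question is to keep the individual fold scores instead of only their mean. The sketch below is an illustration, not part of the original solution: it assumes a shuffled `StratifiedKFold` with fixed seeds so that both models see the same folds, and it reports the spread of each model and the per-fold differences. With only five folds this is a rough indication, not a formal test.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Use the same folds for both models so the per-fold scores are paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
    X, y, cv=cv,
)

# Per-fold improvement of bagging over the single tree.
diff = bag_scores - tree_scores

print('DTC:     %.3f +/- %.3f' % (tree_scores.mean(), tree_scores.std()))
print('Bagging: %.3f +/- %.3f' % (bag_scores.mean(), bag_scores.std()))
print('Per-fold differences:', diff.round(3))
```

If the mean difference is small compared with the standard deviations, or the differences change sign from fold to fold, the two models are hard to tell apart on this data.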

### 1.b Scale (normalize) data
As you may have noticed, the features are not normalized. Does the score improve with normalization?

1. Normalize the predictors.
2. Build a decision tree classifier and bagging decision tree classifier.
3. Are scores different from non-scaled data?


In [6]:
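# Normalize the predictors
# Note: preprocessing.normalize rescales each sample (row) to unit L1 norm; it does not standardize each feature column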
Xs = preprocessing.normalize(X,norm='l1')
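
The difference between row-wise normalization and per-feature scaling is easy to miss, so here is a small sketch on a made-up 2x3 array (the array `A` is purely illustrative). It contrasts `preprocessing.normalize`, as used above, with `preprocessing.StandardScaler`, which is the more common meaning of "scaling the features".

```python
import numpy as np
from sklearn import preprocessing

# Toy data: two samples (rows), three features (columns) on very different scales.
A = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0]])

# normalize works row by row: each sample is divided by the sum of its absolute values.
row_scaled = preprocessing.normalize(A, norm='l1')

# StandardScaler works column by column: each feature gets zero mean and unit variance.
col_scaled = preprocessing.StandardScaler().fit_transform(A)

print(row_scaled.sum(axis=1))   # every row sums to 1
print(col_scaled.mean(axis=0))  # every column has mean 0
print(col_scaled)
```

In this toy case the two rows become identical after `normalize`, because the second row is just twice the first. The features stay on very different scales, which is worth keeping in mind when interpreting the scores in this section.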

In [7]:
# Build a DTC and Bagging Classifier and check the cross val scores
dtc = DecisionTreeClassifier()
print('DTC score: %f' % cross_val_score(dtc,Xs,y,cv=5,n_jobs=-1).mean())
bagging = BaggingClassifier(DecisionTreeClassifier())
print('Bagging score: %f' % cross_val_score(bagging,Xs,y,cv=5,n_jobs=-1).mean())

DTC score: 0.936745
Bagging score: 0.957830


In [8]:
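# The scores are close to those on the non-normalized data (DTC 0.937 vs 0.919, bagging 0.958 vs 0.954).
# No random_state is set, so differences of this size may partly reflect run-to-run variation.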

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier.
2. Search over a few parameter values to try to improve the score of the classifier.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the Grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator (see example).
    - Note that there are also additional parameters to change (see example).
    - Note that you may end up with a grid space too large to search in a short time. Choose smaller ranges of parameters!
    - Make use of the n_jobs parameter to speed up your grid search (-1 uses all cores).
7. Does the score improve for the Grid-searched Bagging Classifier?
8. Which score is better? Are the scores significantly different? How could/would you judge that?
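
The parameter names for the wrapped tree follow the `<outer parameter>__<inner parameter>` convention. Rather than guessing the prefix, you can list the names the estimator actually exposes with `get_params()`. This is a small sketch; the prefix it prints depends on the installed scikit-learn version (`base_estimator__...` in older releases, `estimator__...` in newer ones).

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(DecisionTreeClassifier())

# Names containing "__" address parameters of the wrapped decision tree.
nested = sorted(name for name in bagging.get_params() if '__' in name)
# The remaining names are parameters of the bagging wrapper itself.
top_level = sorted(name for name in bagging.get_params() if '__' not in name)

print(nested)
print(top_level)
```

Any key used in a GridSearchCV parameter grid must appear in one of these two lists.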

In [9]:
# Set the parameters for our first grid search
params = {'max_features': range(1,31),
          'max_depth': range(1,21),
          'min_samples_split': [2,5,7],
          'min_samples_leaf': [1,3,5,7,10]
         }
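
Before launching a search it helps to know how many fits it will trigger. As a quick check, `ParameterGrid` from `sklearn.model_selection` enumerates the same combinations that GridSearchCV will try, so its length gives the number of candidates. The grid below repeats the one defined in this cell.

```python
from sklearn.model_selection import ParameterGrid

params = {'max_features': list(range(1, 31)),
          'max_depth': list(range(1, 21)),
          'min_samples_split': [2, 5, 7],
          'min_samples_leaf': [1, 3, 5, 7, 10]}

# 30 * 20 * 3 * 5 combinations
n_candidates = len(ParameterGrid(params))
# each candidate is fit once per cross-validation fold
n_fits = n_candidates * 5

print(n_candidates)  # 9000
print(n_fits)        # 45000
```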

In [10]:
# Perform a grid search on a Decision Tree Classifier
grid = GridSearchCV(DecisionTreeClassifier(),params,cv=5,n_jobs=-1,verbose=1)
grid.fit(X,y)

Fitting 5 folds for each of 9000 candidates, totalling 45000 fits


[Parallel(n_jobs=-1)]: Done 912 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 5112 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done 12112 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 21912 tasks      | elapsed:   23.3s
[Parallel(n_jobs=-1)]: Done 34512 tasks      | elapsed:   37.1s
[Parallel(n_jobs=-1)]: Done 45000 out of 45000 | elapsed:   48.9s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'min_samples_split': [2, 5, 7], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 'min_samples_leaf': [1, 3, 5, 7, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [11]:
print('Best score using GridSearch on DTC: %f' % grid.best_score_)

Best score using GridSearch on DTC: 0.956063


In [12]:
# Set the parameters for our grid search using bagging
params = {"base_estimator__max_depth": [3,5,10,20],
          "base_estimator__max_features": [None,"auto"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }
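
This grid has 8640 combinations, which is slow to search exhaustively. One alternative, not used in this lab, is `RandomizedSearchCV`, which samples a fixed number of combinations from the same kind of grid. The sketch below is illustrative only: the reduced grid, `n_iter=10`, and the seeds are assumptions, and the nested prefix is looked up at runtime because it differs between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# "base_estimator" in older scikit-learn releases, "estimator" in newer ones.
prefix = 'estimator' if 'estimator' in bagging.get_params() else 'base_estimator'

param_distributions = {
    prefix + '__max_depth': [3, 5, 10, 20],
    prefix + '__min_samples_leaf': [1, 3, 5, 7, 10],
    'max_samples': [0.5, 0.7, 1.0],
    'n_estimators': [5, 10, 20],
}

# Try only 10 randomly chosen combinations instead of all 180.
search = RandomizedSearchCV(bagging, param_distributions, n_iter=10,
                            cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)
print('Best score: %f' % search.best_score_)
```

The trade-off is that a random search may miss the best combination, but its cost is set by `n_iter` rather than by the size of the grid.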

In [13]:
# Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
bdtc = BaggingClassifier(DecisionTreeClassifier())
# Perform the grid search
grid = GridSearchCV(bdtc,params,cv=5,n_jobs=-1,verbose=1)
grid.fit(X,y)

Fitting 5 folds for each of 8640 candidates, totalling 43200 fits


[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 3736 tasks      | elapsed:   23.4s
[Parallel(n_jobs=-1)]: Done 8736 tasks      | elapsed:   49.9s
[Parallel(n_jobs=-1)]: Done 15736 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 24736 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 35736 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 43200 out of 43200 | elapsed:  4.3min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [2, 5, 10, 20], 'max_samples': [0.5, 0.7, 1.0], 'base_estimator__min_samples_split': [2, 5, 7], 'base_estimator__max_depth': [3, 5, 10, 20], 'bootstrap_features': [False, True], 'max_features': [0.5, 0.7, 1.0], 'base_estimator__min_samples_leaf': [1, 3, 5, 7, 10], 'base_estimator__max_features': [None, 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [14]:
print('Best score using GridSearch on Bagging DTC: %f' % grid.best_score_)

Best score using GridSearch on Bagging DTC: 0.971880


In [15]:
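# Grid searching a Bagging DTC gives the best score: 0.972, against 0.956 for the grid-searched DTC.
# Judged by eye, a change of .002 would be negligible, while the gap of about .016 seen here looks meaningful.
# Note that best_score_ is the maximum over thousands of candidates, so it is optimistically biased;
# a stricter check would compare per-fold scores on data that was not used for tuning.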
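
A common way to get a less optimistic estimate than `best_score_` is nested cross-validation: the grid search runs inside each outer training fold, and the selected model is scored on the outer held-out fold. The sketch below is an assumption-laden illustration, not the lab's method; it uses a deliberately small grid (`small_grid` is a made-up name) so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid: 3 * 3 = 9 candidates.
small_grid = {'max_depth': [3, 5, 10], 'min_samples_leaf': [1, 5, 10]}
inner_search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5)

# Outer loop: each fold refits the whole grid search on the training part
# and scores the selected model on the part it has never seen.
outer_scores = cross_val_score(inner_search, X, y, cv=5)

print('Nested CV score: %.3f +/- %.3f' % (outer_scores.mean(), outer_scores.std()))
```

The same wrapper works for the bagging search, at a correspondingly higher cost.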

## 2. Diabetes and Regression

Scikit-learn includes a dataset of diabetes patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging Regressor instead of classifiers.

### 2.a Simple comparison
1. Load the data and create X and y.
2. Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. What does the score mean? (Look at the documentation!)
3. Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How could/would you judge that?

In [16]:
# Load the data and set the features and target
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

In [17]:
# Initialize a DTR and evaluate its performance using cross_val_score
dtr = DecisionTreeRegressor()
print('DTR score: %f' % cross_val_score(dtr,X,y,cv=5,n_jobs=-1,scoring='r2').mean())

DTR score: -0.164144


In [18]:
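# A negative r2 score means the model predicts worse than a constant model that always outputs the mean of y.
# It is not a correlation coefficient, so it does not indicate a negative correlation.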
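
To make the meaning of the score concrete: r2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of y from its mean. The toy numbers below are made up for illustration and use `r2_score` from `sklearn.metrics`.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

perfect_pred = [1.0, 2.0, 3.0, 4.0]    # SS_res = 0
mean_pred = [2.5, 2.5, 2.5, 2.5]       # SS_res = SS_tot = 5
reversed_pred = [4.0, 3.0, 2.0, 1.0]   # SS_res = 20, SS_tot = 5

r2_perfect = r2_score(y_true, perfect_pred)
r2_mean = r2_score(y_true, mean_pred)
r2_reversed = r2_score(y_true, reversed_pred)

print(r2_perfect)   # 1.0
print(r2_mean)      # 0.0
print(r2_reversed)  # -3.0
```

A score of 0 corresponds to always predicting the mean, and anything below 0 is worse than that baseline.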

In [19]:
# Initialize a Bagging DTR and evaluate its performance using cross_val_score
bdtr = BaggingRegressor(DecisionTreeRegressor())
print('Bagging score: %f' % cross_val_score(bdtr,X,y,cv=5,n_jobs=-1).mean())

Bagging score: 0.367130


In [None]:
# The r2 score of the bagging regressor (0.367) is much better than that of the single DTR (-0.164).
# Even so, 0.37 is still not a very good r2 score.
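
To see where the improvement comes from, bagging can be written out by hand: fit several trees on bootstrap resamples of the training data and average their predictions. The sketch below is a simplified stand-in for what `BaggingRegressor` does, under assumed settings (a single train/test split, 25 trees, fixed seeds); it is not the library's implementation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rng = np.random.RandomState(0)
n_trees = 25
predictions = []
for _ in range(n_trees):
    # Bootstrap resample: draw training rows with replacement.
    idx = rng.randint(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=0).fit(X_train[idx], y_train[idx])
    predictions.append(tree.predict(X_test))
predictions = np.array(predictions)

# Average r2 of the individual trees vs r2 of their averaged prediction.
mean_single_r2 = np.mean([r2_score(y_test, p) for p in predictions])
bagged_r2 = r2_score(y_test, predictions.mean(axis=0))

print('Mean r2 of single trees: %.3f' % mean_single_r2)
print('r2 of averaged trees:    %.3f' % bagged_r2)
```

Averaging cannot do worse than the typical single tree in squared-error terms, and fully grown trees vary a lot between resamples, which is why the gain tends to be large for this kind of model.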

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor.
2. Search over a few values of the parameters in order to improve the score of the regressor.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the Grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
7. Does the score improve for the Grid-searched Bagging Regressor?
8. Which score is better? Are the scores significantly different? How could/would you judge that?
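
As a warm-up for these steps, here is a minimal sketch of a regression grid search on a deliberately small grid (`small_grid` and its values are illustrative choices, not the lab's solution). It sets `scoring='r2'` explicitly and reads back both `best_params_` and `best_score_`.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# 4 * 3 = 12 candidates, 60 fits with 5-fold cross-validation.
small_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_leaf': [1, 5, 10]}

search = GridSearchCV(DecisionTreeRegressor(random_state=0), small_grid,
                      cv=5, scoring='r2')
search.fit(X, y)

print(search.best_params_)
print('Best mean r2: %f' % search.best_score_)
```

Looking at `best_params_` alongside the score shows which settings the search preferred, which helps when deciding how to narrow or extend the ranges.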


In [20]:
# Set the parameters for our first grid search
params = {'max_features': range(1,11),
          'max_depth': range(1,11),
          'min_samples_split': [2,5,7],
          'min_samples_leaf': [1,3,5,7,10]
         }

In [21]:
# Perform a grid search on a Decision Tree Regressor
grid = GridSearchCV(DecisionTreeRegressor(),params,cv=5,n_jobs=-1,verbose=1)
grid.fit(X,y)

Fitting 5 folds for each of 1500 candidates, totalling 7500 fits


[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 4540 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 7500 out of 7500 | elapsed:    7.3s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'min_samples_split': [2, 5, 7], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'min_samples_leaf': [1, 3, 5, 7, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [22]:
print('Best score using GridSearch on DTR: %f' % grid.best_score_)

Best score using GridSearch on DTR: 0.378855


In [23]:
# The grid-searched DTR (0.379) scores slightly better than the untuned Bagging DTR (0.367).

In [24]:
# Set the parameters for our grid search using bagging
params = {"base_estimator__max_depth": [3,5,10,20],
          "base_estimator__max_features": [None,"auto"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }

In [25]:
# Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
bdtr = BaggingRegressor(DecisionTreeRegressor())
# Perform the grid search
grid = GridSearchCV(bdtr,params,cv=5,n_jobs=-1,verbose=1)
grid.fit(X,y)

Fitting 5 folds for each of 8640 candidates, totalling 43200 fits


[Parallel(n_jobs=-1)]: Done 432 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 2832 tasks      | elapsed:   11.6s
[Parallel(n_jobs=-1)]: Done 6832 tasks      | elapsed:   29.6s
[Parallel(n_jobs=-1)]: Done 12432 tasks      | elapsed:   56.2s
[Parallel(n_jobs=-1)]: Done 19632 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 28432 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 38832 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 43200 out of 43200 | elapsed:  3.5min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [2, 5, 10, 20], 'max_samples': [0.5, 0.7, 1.0], 'base_estimator__min_samples_split': [2, 5, 7], 'base_estimator__max_depth': [3, 5, 10, 20], 'bootstrap_features': [False, True], 'max_features': [0.5, 0.7, 1.0], 'base_estimator__min_samples_leaf': [1, 3, 5, 7, 10], 'base_estimator__max_features': [None, 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [26]:
print('Best score using GridSearch on Bagging DTR: %f' % grid.best_score_)

Best score using GridSearch on Bagging DTR: 0.471675


In [None]:
# Using GridSearch on our Bagging DTR gives the best score of all the tested methods: 0.472, against 0.379 for the grid-searched DTR.