# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with a Bagging classifier. We will do this on a few datasets, starting with the ones that ship with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import load_breast_cancer

In [3]:
data = load_breast_cancer()

In [4]:
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

In [5]:
y.describe()

count    569.000000
mean       0.627417
std        0.483918
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64

In [6]:
y.value_counts()/y.count()

1    0.627417
0    0.372583
dtype: float64

In [7]:
from sklearn import tree
from sklearn import ensemble
from sklearn import model_selection

In [8]:
dt = tree.DecisionTreeClassifier()

In [9]:
# cv is left at the library default here; pass cv=5 to get the 5 folds the exercise asks for
scores = model_selection.cross_val_score(dt, X, y)
print(scores.mean(), '+/-', scores.std())

0.91383087348 +/- 0.0261388655449


In [10]:
en = ensemble.BaggingClassifier(dt)
scores = model_selection.cross_val_score(en, X, y)
print(scores.mean(), '+/-', scores.std())

0.934948482317 +/- 0.0163903633322
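
One way to judge whether the two means above really differ is to score both models on exactly the same folds and look at the per-fold differences. The following is a minimal sketch, not part of the original solution: the `StratifiedKFold` splitter, the fixed `random_state` values and the names `tree_scores`, `bag_scores` and `diff` are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# One fixed splitter, so both models are scored on exactly the same 5 folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
    X, y, cv=cv,
)

# Per-fold differences: a mean difference that is small compared with its
# spread suggests the gap may be fold-to-fold noise rather than a real gain.
diff = bag_scores - tree_scores
print("tree    %.3f +/- %.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagging %.3f +/- %.3f" % (bag_scores.mean(), bag_scores.std()))
print("paired difference %.3f +/- %.3f" % (diff.mean(), diff.std()))
```

With only five folds this is a rough check rather than a formal test; repeating the comparison with several splitter seeds gives a better feel for the noise.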


### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Does the score improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?

In [19]:
from sklearn import pipeline
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import tree

In [12]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

In [13]:
p1 = pipeline.Pipeline([
        ('standard_scaler', preprocessing.StandardScaler()),
        ('decision_tree', tree.DecisionTreeClassifier())
    ])
p2 = pipeline.make_pipeline(preprocessing.RobustScaler(), ensemble.BaggingClassifier(tree.DecisionTreeClassifier()))

In [14]:
p1.fit(X_train, y_train).score(X_test, y_test)

0.95104895104895104

In [15]:
p2.fit(X_train, y_train).score(X_test, y_test)

0.94405594405594406
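
A single train/test split yields one noisy number per pipeline, and the two pipelines above also use different scalers. The sketch below is one possible way to make the scaled versus non-scaled question more comparable: it assumes `StandardScaler` for both models, 5-fold cross-validation and fixed seeds, none of which come from the original cells. Since trees split on per-feature thresholds, rescaling would not be expected to change the scores much.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
}

results = {}
for name, model in models.items():
    # Same model and same folds, once on raw features and once behind a scaler.
    raw = cross_val_score(model, X, y, cv=cv)
    scaled = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=cv)
    results[name] = (raw.mean(), scaled.mean())
    print("%-8s raw %.3f  scaled %.3f" % (name, raw.mean(), scaled.mean()))
```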

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?
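
The steps above could be sketched roughly as follows. The grid values are arbitrary and kept small so the search finishes quickly; they are assumptions, not recommended settings. The name of the wrapped-tree parameter depends on the installed scikit-learn release (`base_estimator` in older ones, `estimator` in recent ones), so the sketch looks it up instead of hard-coding it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Plain tree: parameter names are used as-is.
tree_gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]},
    cv=5,
)
tree_gs.fit(X, y)

bag = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)
# Pick whichever name this install uses for the wrapped tree.
inner = "estimator" if "estimator" in bag.get_params() else "base_estimator"

bag_gs = GridSearchCV(
    bag,
    {
        inner + "__max_depth": [2, None],  # tree parameter, reached through the wrapper
        "n_estimators": [5, 15],           # bagging's own parameters
        "max_samples": [0.5, 1.0],
    },
    cv=5,
)
bag_gs.fit(X, y)

print("tree   ", tree_gs.best_score_, tree_gs.best_params_)
print("bagging", bag_gs.best_score_, bag_gs.best_params_)
```

For larger grids, passing `n_jobs=-1` to `GridSearchCV` spreads the fits across the available cores.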

In [27]:
params = {
    'n_estimators': np.arange(1,20,2),
    'max_samples': np.arange(0.1, 1, .1)
}
gs = model_selection.GridSearchCV(ensemble.BaggingClassifier(tree.DecisionTreeClassifier()), params, n_jobs=-1, verbose=2)

In [28]:
gs.fit(X_train, y_train)

Fitting 3 folds for each of 90 candidates, totalling 270 fits
[CV] n_estimators=1, max_samples=0.1 .................................
[CV] n_estimators=1, max_samples=0.1 .................................
[CV] n_estimators=1, max_samples=0.1 .................................
[CV] n_estimators=3, max_samples=0.1 .................................
[CV] .................. n_estimators=1, max_samples=0.1, total=   0.0s
[CV] .................. n_estimators=1, max_samples=0.1, total=   0.0s
[CV] .................. n_estimators=1, max_samples=0.1, total=   0.0s
[CV] n_estimators=3, max_samples=0.1 .................................
[CV] n_estimators=5, max_samples=0.1 .................................
[CV] .................. n_estimators=3, max_samples=0.1, total=   0.1s
[CV] n_estimators=3, max_samples=0.1 .................................
[CV] n_estimators=5, max_samples=0.1 .................................
[CV] .................. n_estimators=3, max_samples=0.1, total=   0.1s
[CV] ..........

[Parallel(n_jobs=-1)]: Done  58 tasks      | elapsed:    3.0s


[CV] .................. n_estimators=5, max_samples=0.3, total=   0.1s
[CV] .................. n_estimators=5, max_samples=0.3, total=   0.1s
[CV] n_estimators=9, max_samples=0.3 .................................
[CV] n_estimators=7, max_samples=0.3 .................................
[CV] .................. n_estimators=7, max_samples=0.3, total=   0.1s
[CV] n_estimators=7, max_samples=0.3 .................................
[CV] ................. n_estimators=19, max_samples=0.2, total=   0.3s
[CV] .................. n_estimators=7, max_samples=0.3, total=   0.1s
[CV] n_estimators=9, max_samples=0.3 .................................
[CV] .................. n_estimators=7, max_samples=0.3, total=   0.1s
[CV] n_estimators=11, max_samples=0.3 ................................
[CV] .................. n_estimators=9, max_samples=0.3, total=   0.1s
[CV] n_estimators=9, max_samples=0.3 .................................
[CV] n_estimators=13, max_samples=0.3 ................................
[CV] .

[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:   14.0s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
        ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]), 'max_samples': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

In [29]:
gs.best_params_

{'max_samples': 0.80000000000000004, 'n_estimators': 17}

In [None]:
gs.best_estimator_

## 2. Diabetes and Regression

Scikit Learn has a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [45]:
from sklearn.datasets import load_diabetes
data = load_diabetes()

In [48]:
X = pd.DataFrame(data.data, columns = ['age', 'sex', 'bmi', 'avg_bp', 'serum1', 'serum2', 'serum3', 'serum4', 'serum5', 'serum6'])

In [49]:
y = data.target

In [51]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

In [59]:
params = {
    'n_estimators': np.arange(1,20,2),
    'max_samples': np.arange(0.1, 1, .1)
}

# Grid-search a bagged regression tree and keep the refit best model.
# GridSearchCV takes the estimator as its first argument, and
# best_estimator_ is an attribute, not a method.
d_pip = model_selection.GridSearchCV(
    ensemble.BaggingRegressor(tree.DecisionTreeRegressor()),
    param_grid=params,
).fit(X_train, y_train).best_estimator_

In [64]:
params = {
    'n_estimators': np.arange(1,20,2),
    'max_samples': np.arange(0.1, 1, .1)
}
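# Caution: BaggingClassifier treats every distinct target value as a separate class and
# scores by accuracy; ensemble.BaggingRegressor is the wrapper meant for a continuous target.
ens = ensemble.BaggingClassifier(tree.DecisionTreeRegressor())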

In [76]:
bs = model_selection.GridSearchCV(ens, param_grid=params, verbose=True, cv=5)
bs.fit(X, y)

Fitting 5 folds for each of 90 candidates, totalling 450 fits


[Parallel(n_jobs=1)]: Done 450 out of 450 | elapsed:   39.8s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]), 'max_samples': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)

In [77]:
bs.best_estimator_

BaggingClassifier(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=0.20000000000000001, n_estimators=11, n_jobs=1,
         oob_score=False, random_state=None, verbose=0, warm_start=False)

In [78]:
best_ens = ensemble.BaggingClassifier(base_estimator=tree.DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=0.2, n_estimators=11, n_jobs=1,
         oob_score=False, random_state=None, verbose=0, warm_start=False)

In [79]:
best_ens.fit(X_train, y_train).score(X_test, y_test)

0.0090090090090090089
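
A score this close to zero is what one would expect when a classifier wrapper is scored by accuracy on a continuous target. As a hedged sketch of the comparison the exercise asks for, the block below uses the regression counterparts and reports both R² and mean squared error. The shuffled `KFold` splitter and the fixed seeds are assumptions, and no particular result is implied.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0),
}

r2 = {}
for name, model in models.items():
    # R^2 is the default score for regressors; MSE is exposed negated so that
    # "greater is better" holds for every scorer, hence the sign flip.
    r2[name] = cross_val_score(model, X, y, cv=cv, scoring="r2")
    mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print("%-8s R^2 %.3f +/- %.3f   MSE %.0f"
          % (name, r2[name].mean(), r2[name].std(), mse.mean()))
```

R² can be negative when a model does worse than predicting the mean, so an unpruned single tree scoring below zero is not necessarily a coding error.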

In [None]:
d_pip

In [None]:
ens = ensemble.BaggingRegressor(tree.DecisionTreeRegressor())
model_selection.GridSearchCV(ens, param_grid=params)

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?
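
One possible way to compare the two grid-searched regressors is to read the fold-to-fold spread of the winning configuration from `cv_results_` alongside `best_score_`. This is only a sketch under assumed settings: the grids, the shuffled `KFold` splitter, the seeds and the `summary` dictionary are all illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

searches = {
    "tree": GridSearchCV(
        DecisionTreeRegressor(random_state=0),
        {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 10, 20]},
        cv=cv, scoring="r2",
    ),
    "bagging": GridSearchCV(
        BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0),
        {"n_estimators": [10, 30], "max_samples": [0.5, 1.0]},
        cv=cv, scoring="r2",
    ),
}

summary = {}
for name, gs in searches.items():
    gs.fit(X, y)
    # std_test_score holds the spread across folds for each candidate;
    # best_index_ selects the row that produced best_score_.
    spread = gs.cv_results_["std_test_score"][gs.best_index_]
    summary[name] = (gs.best_score_, spread)
    print("%-8s R^2 %.3f +/- %.3f  %s" % (name, gs.best_score_, spread, gs.best_params_))
```

If the two best scores differ by less than these spreads, the grid search alone does not give strong grounds to prefer one model.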


In [117]:
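# Caution: a classifier with 'gini'/'entropy' treats each distinct disease-progression value as a class,
# so best_score_ below is accuracy; tree.DecisionTreeRegressor is the regression counterpart.
dtree = tree.DecisionTreeClassifier()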
ts = model_selection.GridSearchCV(dtree, {
        'criterion':['gini', 'entropy'],
        'max_depth':np.arange(1,20,2),
    }, cv=5)
ts.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'criterion': ['gini', 'entropy'], 'max_depth': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [118]:
best_tree = ts.best_estimator_.fit(X, y)

In [119]:
ts.best_score_

0.018099547511312219

In [120]:
ensem = ensemble.BaggingClassifier(best_tree)
ensem.fit(X_train, y_train).score(X_test, y_test)

0.0

In [121]:
p1 = pipeline.Pipeline([
        ('standard_scaler', preprocessing.StandardScaler()),
        ('decision_tree', tree.DecisionTreeRegressor())
    ])
p2 = pipeline.make_pipeline(preprocessing.RobustScaler(), ensemble.BaggingRegressor(tree.DecisionTreeRegressor()))

In [132]:
p1.fit(X_train, y_train).score(X_test, y_test)

-0.40335486081572713

In [133]:
p2.fit(X_train, y_train).score(X_test, y_test)

0.31377426269744169

## Bonus: Project 6 data

Repeat the analysis for the Project 6 Dataset