# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do this on a few datasets, starting with the ones that ship with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from scipy.stats import ttest_ind
from sklearn.pipeline import make_pipeline, make_union

%matplotlib inline


In [63]:
# Load the data and create X and y
X, y = datasets.load_breast_cancer(return_X_y=True)

In [64]:
# Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cv to 5 folds.

# Train/test split (must run before the fit below, which uses X_train and y_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create and fit the model
model_dtc = DecisionTreeClassifier(random_state=1)
model_dtc.fit(X_train, y_train)

# Cross-validated scores (cross_val_score clones and refits the estimator on each fold)
scores = cross_val_score(model_dtc, X_train, y_train, cv=5)

print(np.mean(scores))


0.902836637047


In [65]:
# Wrap a Bagging Classifier around the Decision Tree Classifier and
# use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.

# Bag a fresh, unfitted decision tree with the same settings as the single tree
model_bagger = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1)
model_bagger.fit(X_train, y_train)
scores_bagger = cross_val_score(model_bagger, X_train, y_train, cv=5)

print(np.mean(scores_bagger))

0.936910457963


In [66]:
print("Which score is better? Are the scores significantly different? How can you judge that?")
print("The bagger score is better by about .03.")

Which score is better? Are the scores significantly different? How can you judge that?
The bagger score is better by about .03.
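
A difference in mean score does not by itself say whether the two models really differ. One way to judge it is to compare the per-fold scores with a t-test, using the `ttest_ind` function already imported in this lab. The cell below is a minimal sketch under some assumptions: it scores on the full `X, y` rather than on the training split, and the names `tree`, `bagged_tree`, `tree_scores` and `bagged_scores` are illustrative, not part of the original solution. With only 5 folds the test has little power, so treat the p-value as a rough indication.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Single tree and a bagged ensemble of the same kind of tree
tree = DecisionTreeClassifier(random_state=1)
bagged_tree = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1)

# Keep the individual fold scores instead of only their mean
tree_scores = cross_val_score(tree, X, y, cv=5)
bagged_scores = cross_val_score(bagged_tree, X, y, cv=5)

# Two-sample t-test on the fold scores
t_stat, p_value = ttest_ind(bagged_scores, tree_scores)

print("tree:   mean=%.3f std=%.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagged: mean=%.3f std=%.3f" % (bagged_scores.mean(), bagged_scores.std()))
print("t=%.3f p=%.3f" % (t_stat, p_value))
```

Because both models are scored on the same folds, a paired test such as `scipy.stats.ttest_rel` is arguably a better fit; `ttest_ind` is used here only because the lab imports it.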


### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?

In [67]:
# Which score is better? Are the scores significantly different? How can you judge that?
# Are the scores different from the non-scaled data?

# Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.

pipe_no_bagger = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

model_no_bagger = pipe_no_bagger.fit(X_train, y_train)

# Bag a plain decision tree rather than the fitted pipeline, so the data is scaled only once
pipe_bagger = make_pipeline(StandardScaler(), BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1))

model_bagger = pipe_bagger.fit(X_train, y_train)

print(np.mean(cross_val_score(model_no_bagger, X_train, y_train, cv=5)))
print(np.mean(cross_val_score(model_bagger, X_train, y_train, cv=5)))

print('The scores are the same as without scaling')

0.902836637047
0.936910457963
The scores are the same as without scaling
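
The scaled and unscaled scores match because a decision tree splits on thresholds within one feature at a time, so a monotonic rescaling of a feature moves the thresholds without changing how the samples are partitioned. The cell below is a small sketch of that idea, not part of the original solution; the names `raw_tree`, `scaled_tree` and `agreement` are illustrative. It fits the same tree on raw and standardized features and measures how often the two agree on held-out data. Tiny floating-point effects can in principle change a split, so agreement is expected to be very high but is not guaranteed to be exact.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)

# Same tree settings, once on raw features and once on standardized features
raw_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(random_state=1).fit(scaler.transform(X_train), y_train)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

# Fraction of test samples on which the two trees give the same label
agreement = np.mean(raw_pred == scaled_pred)
print("prediction agreement: %.3f" % agreement)
```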


### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?
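
For the Bagging search, parameters of the wrapped tree are addressed with a double-underscore prefix, while parameters such as `n_estimators` and `max_samples` belong to the bagger itself. The cell below is a hedged sketch of how that could look. The grid values and the names `bagger`, `prefix`, `bagging_param_grid` and `bagging_gs` are assumptions chosen for illustration. The wrapped estimator is called `base_estimator` in older scikit-learn releases and `estimator` in newer ones, so the sketch looks the name up instead of hard-coding it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagger = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1)

# Name of the wrapped-estimator parameter depends on the scikit-learn version
prefix = "estimator" if "estimator" in bagger.get_params() else "base_estimator"

# A deliberately small grid (illustrative values) so the search stays fast
bagging_param_grid = {
    prefix + "__max_depth": [None, 5],        # parameter of the wrapped tree
    prefix + "__min_samples_leaf": [1, 5],    # parameter of the wrapped tree
    "n_estimators": [10, 50],                 # parameter of the bagger
    "max_samples": [0.5, 1.0],                # parameter of the bagger
}

# n_jobs=-1 runs the candidate fits in parallel on all available cores
bagging_gs = GridSearchCV(bagger, param_grid=bagging_param_grid, cv=5, n_jobs=-1)
bagging_gs.fit(X, y)

print(bagging_gs.best_params_)
print(bagging_gs.best_score_)
```

The grid has 16 combinations, each fitted 5 times; every extra parameter multiplies that count, which is why the grid can quickly become too large to search.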

In [68]:
# Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
from sklearn.model_selection import GridSearchCV

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20],
              }

# Search over the decision tree defined earlier (model_dtc)
model_gs = GridSearchCV(model_dtc, param_grid=param_grid, cv=5)

#Use the whole X, y dataset for your test
model_gs.fit(X,y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [2, 10, 20], 'max_leaf_nodes': [None, 5, 10, 20], 'criterion': ['gini', 'entropy'], 'max_depth': [None, 2, 5, 10], 'min_samples_leaf': [1, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [56]:
print("Check the best_score_ once you've trained it. Is it better than before?")
print(model_gs.best_score_)
print('Yes, barely better than bagging')

Check the best_score_ once you've trained it. Is it better than before?
0.942003514938
Yes, barely better than bagging


## 2. Diabetes and Regression

Scikit Learn has a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [72]:
# Load the data and create X and y
X, y = datasets.load_diabetes(return_X_y=True)

# Initialize a Decision Tree Regressor model
model_dtr = DecisionTreeRegressor(random_state=1)

# Train, Test, Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create and Fit Model
model_dtr.fit(X_train, y_train)

# Use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
# With no scoring argument, cross_val_score uses the regressor's default score, which is R^2.
scores = cross_val_score(model_dtr, X_train, y_train, cv=5)
print(np.mean(scores))


-0.030236250409
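
The value above is an R^2 score, which is what `cross_val_score` reports for a regressor when no `scoring` argument is given. A negative R^2 means the model does worse than always predicting the mean of the target. To answer "Which score will you use?" with an error in the units of the target, the `scoring` argument can be set explicitly. The cell below is a sketch under the assumption that the full `X, y` is scored; the names `tree`, `r2_scores`, `neg_mse_scores` and `rmse` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
tree = DecisionTreeRegressor(random_state=1)

# Default scoring for a regressor: R^2 (higher is better, can be negative)
r2_scores = cross_val_score(tree, X, y, cv=5)

# Negated mean squared error: scikit-learn negates errors so that higher is still better
neg_mse_scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")

# Convert to root mean squared error, in the same units as the target
rmse = np.sqrt(-neg_mse_scores)

print("mean R^2:  %.3f" % r2_scores.mean())
print("mean RMSE: %.3f" % rmse.mean())
```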


In [74]:
# Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance.
# Set cross-validation to 5 folds.

model_dtr_bagger = BaggingRegressor(model_dtr, random_state=1)
model_dtr_bagger.fit(X_train, y_train)
scores_bagged = cross_val_score(model_dtr_bagger, X_train, y_train, cv=5)

print(np.mean(scores_bagged))

print("Score is much, much higher than without bagging.")

0.351810840327
Score is much, much higher than without bagging.


### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?


In [79]:
# Note: scikit-learn 1.0 and later name these criteria "squared_error" and "absolute_error"
param_grid = {"criterion": ["mse", "mae"],
              "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20],
              }

model_dtr_gs = GridSearchCV(model_dtr, param_grid=param_grid, cv=5)

model_dtr_gs.fit(X,y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=1,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [2, 10, 20], 'max_leaf_nodes': [None, 5, 10, 20], 'criterion': ['mse', 'mae'], 'max_depth': [None, 2, 5, 10], 'min_samples_leaf': [1, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [82]:
# Check the best_score_ once you've trained it. Is it better than before?
print(model_dtr_gs.best_score_)
print('The score is not as good as bagging')

0.348315687965
The score is not as good as bagging


## Bonus: Project 6 data

Repeat the analysis for the Project 6 Dataset