# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do so on a few datasets, starting with the ones bundled with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and Y.
2. Initialize a Decision Tree Classifier and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
3. Wrap a Bagging Classifier around the Decision Tree Classifier and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How can you judge that?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [35]:
import sklearn.datasets as dataloader

bc = dataloader.load_breast_cancer()

In [36]:
X = bc.data
Y = bc.target

In [37]:
# load required packages
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

dtc = DecisionTreeClassifier(max_depth = 5)
scores = cross_val_score(dtc, X, Y, verbose = 1, cv = 5)

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [38]:
print(scores)
print(np.mean(scores))

[ 0.90434783  0.93913043  0.92035398  0.94690265  0.90265487]
0.922677953059


In [39]:
bag = BaggingClassifier(dtc, n_estimators = 7, max_samples = 0.5, max_features = 0.33)
bag_scores = cross_val_score(bag, X, Y, verbose = 1, cv = 5)

print(bag_scores)
print(np.mean(bag_scores))

[ 0.93913043  0.93043478  0.96460177  0.94690265  0.97345133]
0.950904193921


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


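The Bagging Classifier scores higher, with a mean accuracy of 95.1% against 92.3% for the single Decision Tree. The per-fold scores overlap, however (0.90 to 0.95 for the tree, 0.93 to 0.97 for bagging), so five folds alone do not show that the difference is significant.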
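One rough way to approach the significance question is to compare the mean gap with the fold-to-fold variation. The sketch below hard-codes the fold scores printed above and computes a paired t-statistic on the per-fold differences. The variable names are illustrative, and because cross-validation folds share training data the statistic is only a heuristic, not a rigorous test.

```python
import numpy as np

# Fold scores copied from the cross_val_score outputs above.
dt_scores = np.array([0.90434783, 0.93913043, 0.92035398, 0.94690265, 0.90265487])
bag_scores = np.array([0.93913043, 0.93043478, 0.96460177, 0.94690265, 0.97345133])

for name, s in [("decision tree", dt_scores), ("bagging", bag_scores)]:
    # ddof=1 gives the sample standard deviation across folds
    print("{}: mean={:.3f}, std={:.3f}".format(name, s.mean(), s.std(ddof=1)))

# Paired comparison: both models were scored on the same five folds.
diff = bag_scores - dt_scores
mean_diff = diff.mean()
sem_diff = diff.std(ddof=1) / np.sqrt(len(diff))  # standard error of the mean difference
t_stat = mean_diff / sem_diff

print("mean difference={:.3f}, t statistic={:.2f}".format(mean_diff, t_stat))
```

With 5 folds there are 4 degrees of freedom, for which the two-sided 5% critical value of the t-distribution is about 2.78. A statistic below that suggests the gap is not clearly distinguishable from fold-to-fold noise.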

### 1.b Scale (normalize) data
As you may have noticed, the features are not normalized. Do the scores improve with normalization?

1. Normalize the predictors.
2. Build a decision tree classifier and bagging decision tree classifier.
3. Are scores different from non-scaled data?


In [40]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_norm = scaler.fit_transform(X)

In [41]:
dtc_norm = DecisionTreeClassifier(max_depth = 5)
# evaluate the newly created dtc_norm (the original call passed dtc by mistake)
scores_norm = cross_val_score(dtc_norm, X_norm, Y, verbose = 1, cv = 5)

print(scores_norm)
print(np.mean(scores_norm))

[ 0.89565217  0.92173913  0.92920354  0.92035398  0.90265487]
0.913920738746


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [42]:
bag_norm = BaggingClassifier(dtc_norm, n_estimators = 7, max_samples = 0.5, max_features = 0.33)
# evaluate the newly created bag_norm (the original call passed bag by mistake)
bag_scores_norm = cross_val_score(bag_norm, X_norm, Y, verbose = 1, cv = 5)

print(bag_scores_norm)
print(np.mean(bag_scores_norm))

[ 0.93043478  0.91304348  0.95575221  0.9380531   0.97345133]
0.942146979608


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


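The scores differ only slightly from the non-scaled runs: 0.914 against 0.923 for the Decision Tree and 0.942 against 0.951 for the Bagging Classifier. Tree splits are thresholds on single features, so standardizing the features should not change which splits are available. The small differences most likely come from randomness in tree building and bootstrap sampling, since no `random_state` is fixed.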
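To separate the effect of scaling from run-to-run randomness, the comparison can be repeated with a fixed seed. The following is a minimal sketch, not the notebook's original code: it assumes `random_state=0` for the tree and puts the scaler in a pipeline (via `make_pipeline`) so that it is fitted on each training fold only. Under these assumptions the two sets of scores are expected to agree closely.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same tree settings and the same seed for both runs.
raw_tree = DecisionTreeClassifier(max_depth=5, random_state=0)
scaled_tree = make_pipeline(
    StandardScaler(),  # fitted inside each training fold, never on test folds
    DecisionTreeClassifier(max_depth=5, random_state=0),
)

raw_scores = cross_val_score(raw_tree, X, y, cv=5)
scaled_scores = cross_val_score(scaled_tree, X, y, cv=5)

print("raw   :", raw_scores, raw_scores.mean())
print("scaled:", scaled_scores, scaled_scores.mean())
```

If the two means still differ noticeably with the seed fixed, the difference would be worth investigating; otherwise scaling can be treated as optional for tree-based models here.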

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier.
2. Search over a few parameter values to try to improve the score of the classifier.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the Grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator (see example).
    - Note that there are also additional parameters to change (see example).
    - Note that you may end up with a grid space too large to search in a short time. Choose smaller ranges of parameters!
    - Make use of the n_jobs parameter to speed up your grid search (-1 uses all cores).
7. Does the score improve for the Grid-searched Bagging Classifier?
8. Which score is better? Are the scores significantly different? How could/would you judge that?

---

**EXAMPLE**
```python
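from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# The "base_estimator__" prefix reaches into the wrapped tree. Recent
# scikit-learn releases name that parameter "estimator", so the prefix there
# is "estimator__"; check bagged_decision_trees.get_params() if unsure.
params = {"base_estimator__max_depth": [3, 5, 10, 20],
          # "sqrt" is what "auto" selected for classification trees
          "base_estimator__max_features": [None, "sqrt"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }

bagged_decision_trees = BaggingClassifier(DecisionTreeClassifier())

# n_jobs=-1 uses all cores; cv=5 requests 5-fold cross validation
gsbdt = GridSearchCV(bagged_decision_trees, params, n_jobs=-1, cv=5)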
```
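Because the name of the wrapped-estimator parameter depends on the installed scikit-learn version, it can be safer to read the prefix from `get_params()` than to hard-code it. The sketch below does that on a deliberately small grid; the grid values, the `prefix` variable and the fixed `random_state` are illustrative assumptions rather than part of the lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pass the tree positionally so the code does not depend on the keyword name.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# Nested parameters are addressed as "<parameter>__<sub-parameter>".
# Pick whichever name this scikit-learn version exposes for the wrapped tree.
prefix = "estimator" if "estimator" in bagged.get_params() else "base_estimator"

param_grid = {
    prefix + "__max_depth": [3, 5],   # parameter of the wrapped tree
    "n_estimators": [5, 10],          # parameters of the bagging wrapper
    "max_features": [0.5, 1.0],
}

# 2 * 2 * 2 = 8 candidates, each evaluated with 5-fold cross validation.
search = GridSearchCV(bagged, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Keeping the grid this small makes the cost easy to reason about: the number of fits is the product of the list lengths times the number of folds, which is why the larger example grid above benefits from `n_jobs=-1`.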

In [43]:
# sklearn.grid_search was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import GridSearchCV

In [70]:
dtc_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
    'class_weight': ['balanced', None]
    }

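# cv is not set here, so GridSearchCV uses its default number of folds;
# pass cv=5 to match the 5-fold instruction above
est = GridSearchCV(dtc, dtc_params)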

In [71]:
grid = est.fit(X_norm, Y)

In [72]:
grid.best_score_

0.92794376098418274

In [73]:
grid.best_params_

{'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 4}

In [79]:
bag_params = {
    'max_features': [0.5, 0.7, 1.0],
    'max_samples': [0.5, 0.7, 1.0],
    'n_estimators': [5, 10, 20],
    }

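# cv is not set here, so GridSearchCV uses its default number of folds;
# pass cv=5 to match the 5-fold instruction above
est_bag = GridSearchCV(bag, bag_params)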

In [80]:
grid_bag = est_bag.fit(X_norm, Y)

In [81]:
grid_bag.best_score_

0.95957820738137078

In [82]:
grid_bag.best_params_

{'max_features': 0.5, 'max_samples': 1.0, 'n_estimators': 10}

## 2. Diabetes and Regression

scikit-learn includes a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging Regressor instead of classifiers.

### 2.a Simple comparison
1. Load the data and create X and Y.
2. Initialize a Decision Tree Regressor and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds. What does the score mean? (Look at the documentation!)
3. Wrap a Bagging Regressor around the Decision Tree Regressor and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How could/would you judge that?

In [124]:
diabetes = dataloader.load_diabetes()

In [125]:
X = diabetes.data
Y = diabetes.target

In [131]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(max_depth = 3)

scores = cross_val_score(dtr, X, Y, cv = 5)
print(scores)
print(np.mean(scores))

[ 0.22259211  0.40957228  0.38868188  0.20381595  0.36922476]
0.318777395721


In [132]:
bag_r = BaggingRegressor(dtr, n_estimators = 7, max_samples = 0.5, max_features = 0.33)

scores_bag_r = cross_val_score(bag_r, X, Y, cv = 5)
print(scores_bag_r)
print(np.mean(scores_bag_r))

[ 0.10833403  0.47223579  0.46288241  0.39879971  0.3671781 ]
0.361886008056
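On the question of what the score means: when no `scoring` argument is given, `cross_val_score` uses the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R². The sketch below recomputes R² by hand on a held-out split and also requests an error-based metric. The helper `r2_manual`, the train/test split and the fixed seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
tree = DecisionTreeRegressor(max_depth=3, random_state=0)

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hand-computed R^2 should match both r2_score and the estimator's score method.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree.fit(X_train, y_train)
pred = tree.predict(X_test)
manual, library, method = r2_manual(y_test, pred), r2_score(y_test, pred), tree.score(X_test, y_test)
print(manual, library, method)

# The same cross validation with an error metric; scikit-learn negates MSE so
# that larger is always better, hence the sign flip before the square root.
neg_mse = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)
print(rmse)
```

An R² of 1 is a perfect fit, 0 matches a model that always predicts the mean of the targets, and negative values are worse than that baseline. RMSE is expressed in the units of the target, which can make it easier to interpret alongside R².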


### 2.b Grid Search

Repeat the grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor.
2. Search over a few parameter values to try to improve the score of the regressor.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the Grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
7. Does the score improve for the Grid-searched Bagging Regressor?
8. Which score is better? Are the scores significantly different? How could/would you judge that?


In [154]:
dtr_params = {
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
    'max_features': ['sqrt', 'log2']
    }

est_r = GridSearchCV(dtr, dtr_params, cv = 5)

In [155]:
grid_r = est_r.fit(X, Y)

In [156]:
grid_r.best_score_

0.32535994274034141

In [157]:
grid_r.best_params_

{'max_depth': 2, 'max_features': 'log2'}

In [158]:
bag_r_params = {
    'max_features': [0.5, 0.7, 1.0],
    'max_samples': [0.5, 0.7, 1.0],
    'n_estimators': [5, 10, 20],
    }

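# cv is not set here, so GridSearchCV uses its default number of folds;
# pass cv=5 to match the 5-fold instruction above
est_bag_r = GridSearchCV(bag_r, bag_r_params)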

In [161]:
grid_bag_r = est_bag_r.fit(X, Y)

In [162]:
grid_bag_r.best_score_

0.4602677529336151

In [163]:
grid_bag_r.best_params_

{'max_features': 0.7, 'max_samples': 1.0, 'n_estimators': 10}
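When comparing the two grid searches, `best_score_` alone hides how much the score varies between folds. A fitted `GridSearchCV` also exposes `cv_results_`, which holds the per-candidate mean and standard deviation of the test scores. The sketch below reads both for the best candidate of each search; the small grids, the `summary` dictionary and the fixed seeds are illustrative assumptions, not the grids used above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

searches = {
    "tree": GridSearchCV(
        DecisionTreeRegressor(random_state=0),
        {"max_depth": [2, 3, 5]},
        cv=5,
    ),
    "bagging": GridSearchCV(
        BaggingRegressor(DecisionTreeRegressor(max_depth=3, random_state=0),
                         random_state=0),
        {"n_estimators": [5, 10, 20], "max_samples": [0.5, 1.0]},
        cv=5,
    ),
}

summary = {}
for name, search in searches.items():
    search.fit(X, y)
    best = search.best_index_  # row of cv_results_ for the best candidate
    mean = search.cv_results_["mean_test_score"][best]
    std = search.cv_results_["std_test_score"][best]
    summary[name] = (mean, std)
    print("{}: R^2 = {:.3f} +/- {:.3f} with {}".format(
        name, mean, std, search.best_params_))
```

If the gap between the two means is small relative to the reported standard deviations, the ranking of the two models should be treated as tentative.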

## [BONUS]: Project 5 data

Repeat the appropriate analysis (classification/regression) for the Project 5 Dataset.