# data: past present future
## lab 10b: trees and forests

### supervised learning

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# plt.style.use(...) needs a style name to have any effect; none was given, so the default style is used
plt.rcParams['figure.figsize'] = (15, 5)

This lab illustrates the standard scikit-learn pipeline for training models.

Our dataset:

https://archive.ics.uci.edu/ml/datasets/Hepatitis

The attributes are:

1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female
4. STEROID: no, yes
5. ANTIVIRALS: no, yes
6. FATIGUE: no, yes
7. MALAISE: no, yes
8. ANOREXIA: no, yes
9. LIVER BIG: no, yes
10. LIVER FIRM: no, yes
11. SPLEEN PALPABLE: no, yes
12. SPIDERS: no, yes
13. ASCITES: no, yes
14. VARICES: no, yes
15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
17. SGOT: 13, 100, 200, 300, 400, 500
18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
20. HISTOLOGY: no, yes 


In [2]:
names=['CLASS','AGE','SEX','STEROID','ANTIVIRALS','FATIGUE','MALAISE','ANOREXIA','LIVER_BIG','LIVER_FIRM','SPLEEN_PALPABLE','SPIDERS','ASCITES','VARICES','BILIRUBIN','ALK_PHOSPHATE','SGOT','ALBUMIN','PROTIME','HISTOLOGY']

In [3]:
hep_data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data', sep=',', header=None, names=names, na_values="?")

In [4]:
# works better if we extract a NumPy array from the pandas DataFrame
# separate the existing classification (the diagnosis) from the features tested
hep_data_array=hep_data.values
y = hep_data_array[:,0]   #diagnosis (column 0, CLASS)
X = hep_data_array[:,1:19]  #features (columns 1-18; note that this slice leaves out column 19, HISTOLOGY)

This data has a problem: lots of question marks, which were imported as `NaN`s. scikit-learn is no friend of NaNs.
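Before patching the holes, it helps to see how many there are. Here is a minimal sketch of counting missing values with pandas. The frame `toy` and its values are made up for illustration; in the lab you would call the same methods on `hep_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for hep_data; the values are made up.
toy = pd.DataFrame({
    "AGE": [30, 50, 78, 31],
    "BILIRUBIN": [1.0, np.nan, 0.7, 1.2],
    "PROTIME": [np.nan, np.nan, 46, 80],
})

# Number of missing entries per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Fraction of rows that have at least one missing value
rows_with_missing = toy.isna().any(axis=1).mean()
print(rows_with_missing)
```

If a column is mostly missing, filling it in tells the model very little, which is worth knowing before choosing what to do about the gaps.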

In [5]:
# dodgy data munging!
# impute missing values using the mean of each column (each test result). This may be a total garbage move. Is it?
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X_imputed=imp.transform(X)
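One way to probe whether mean imputation is a garbage move is to try it on a tiny array where the answer is easy to see. This sketch assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`. The array `toy_X` is made up: the first column mimics a yes/no attribute coded as 1/2 (an assumption about the coding), the second a continuous lab value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: column 0 is a binary attribute coded 1/2, column 1 is continuous.
# The last row is missing both values.
toy_X = np.array([[1.0, 1.0],
                  [2.0, 0.7],
                  [2.0, 4.0],
                  [np.nan, np.nan]])

# Mean imputation fills each gap with the column average
toy_mean = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(toy_X)

# Most-frequent imputation fills each gap with the column mode
toy_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(toy_X)

print(toy_mean[3])  # the binary column gets a value that is neither 1 nor 2
print(toy_mode[3])  # the binary column gets one of its real categories
```

For a binary attribute the mean lands between the two categories, which is one reason to be suspicious of applying a single strategy to every column.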

In [6]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split

We need to pick out a subset of our data to "train" the model, and a subset to "test" the model. 

There are loads of ways to do this, but `sklearn` offers a simple function to do so.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.45, random_state=42)
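With a small, unbalanced dataset, a purely random split can leave very few examples of the rarer class on one side. A hedged sketch of one option, the `stratify=` argument of `train_test_split`, on made-up data (the arrays `toy_X` and `toy_y` are hypothetical, with class 1 as the rare class):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, 4 samples of class 1 and 16 of class 2
toy_X = np.arange(40, dtype=float).reshape(20, 2)
toy_y = np.array([1] * 4 + [2] * 16)

# stratify=toy_y keeps the class proportions roughly equal in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.45, random_state=42, stratify=toy_y
)

print(toy_X_train.shape, toy_X_test.shape)
print(np.unique(toy_y_train, return_counts=True))
print(np.unique(toy_y_test, return_counts=True))
```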

Now we set up our classifier and fit the data using it.

In [None]:
dt = DecisionTreeClassifier() #set up classifier, with all **default** values
clf=dt.fit(X_train, y_train) #fit on the training data only


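We can do a little bit of testing to see how well the classifier predicts, using cross-validation. Note that `cross_val_score` refits a fresh copy of the classifier on each fold of whatever data it is given, so the two calls below cross-validate within the training set and within the test set separately.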

In [None]:
from sklearn.model_selection import cross_val_score
scores_train = cross_val_score(clf, X_train, y_train)
scores_test = cross_val_score(clf, X_test, y_test)

In [None]:
scores_train

In [None]:
scores_test
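A complementary check is to score the tree that was fitted on the training split directly against the held-out split. The sketch below does this on synthetic data from `make_classification`, so it runs without the hepatitis file; all `toy_` names are hypothetical. In the lab the equivalent call would be `clf.score(X_test, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data
toy_X, toy_y = make_classification(n_samples=200, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.45, random_state=42
)

# Fit once on the training split, with default settings
toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X_train, toy_y_train)

# Accuracy on data the tree has seen versus data it has not
train_accuracy = toy_tree.score(toy_X_train, toy_y_train)
test_accuracy = toy_tree.score(toy_X_test, toy_y_test)
print(train_accuracy, test_accuracy)
```

An unconstrained tree typically memorizes its training data, so the gap between the two numbers is a first hint of overfitting.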

Decision trees are cool in part because they "rate an A+ on interpretability" according to Breiman.

Let's see why.

In [None]:
#this may take FOREVAH in class

!conda install -y graphviz

In [None]:
!conda install -y pydotplus

In [None]:
# show us the graph of the trees

from IPython.display import Image 
import pydotplus 
from sklearn import tree
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=names[1:19],  
                     class_names=['die', 'live'],  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

What is gini, I hear you cry. 

It's the *splitting criterion* that the decision tree algorithm defaults to.

[Wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity)

"Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset."

To compute Gini impurity for a set of items with $J$ classes, suppose $i \in \{1, 2, ...,J\}$, and let $f_i$ be the fraction of items labeled with class $i$ in the set.

$I_{G}(f) = \sum_{i=1}^{J} f_i (1-f_i) = \sum_{i=1}^{J} (f_i - f_i^2) = \sum_{i=1}^{J} f_i - \sum_{i=1}^{J} f_i^2 = 1 - \sum_{i=1}^{J} f_i^2 = \sum_{i\neq k} f_i f_k$
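The form $1 - \sum_i f_i^2$ is easy to compute by hand. A minimal sketch follows; the function name `gini_impurity` is ours, not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_i f_i^2 for a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    # f_i is the fraction of items carrying label i
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity(["die", "die", "live", "live"]))  # evenly mixed node
print(gini_impurity(["live"] * 4))                    # pure node
print(gini_impurity(["die", "live", "live", "live"])) # mostly one class
```

A pure node scores 0, and with two classes the worst case is 0.5, reached by an even mix.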

Exercise for the reader:

Trees are very prone to *overfitting*.

![overfitting](http://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png)


Among the many techniques for dealing with this:

- constrain the depth of the trees using `max_depth=` 
- reduce the number of features used in training


Take five minutes to change the default settings--can you make a more predictive decision tree?

In [None]:
DecisionTreeClassifier??
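One way to approach the exercise is to sweep a setting and compare cross-validated scores. This is a sketch on synthetic data, not a recommended configuration; the depth values and all `toy_` names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data: 10 features, only 3 of them informative
toy_X, toy_y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# None means "grow until the leaves are pure", which is the default
depths = [1, 2, 3, 5, None]
mean_scores = {}
for depth in depths:
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    mean_scores[depth] = cross_val_score(tree_clf, toy_X, toy_y, cv=5).mean()

for depth, score in mean_scores.items():
    print(depth, round(score, 3))
```

The same loop works for other settings, such as `max_features` or `min_samples_leaf`.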

## Random forests

Random forests dramatically increase predictive power at the cost of interpretability.

They combine the predictions of many trees:

![forests](https://dimensionless.in/wp-content/uploads/RandomForest_blog_files/figure-html/voting.png)
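The voting idea in the picture can be sketched in a few lines. The votes below are made up, and `plurality_vote` is a hypothetical helper. Note also that, as far as we know, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this shows the concept rather than the library's internals.

```python
from collections import Counter

def plurality_vote(votes):
    """Return the label that the largest number of trees voted for."""
    # most_common(1) gives [(label, count)] for the top label;
    # ties go to whichever label was seen first
    return Counter(votes).most_common(1)[0][0]

# Made-up predictions from five trees for a single patient
tree_votes = ["live", "die", "live", "live", "die"]
forest_prediction = plurality_vote(tree_votes)
print(forest_prediction)
```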

In [None]:
# Build a forest and compute the feature importances
from sklearn.ensemble import RandomForestClassifier
seed = 7
num_trees = 100
max_features = 3
#kfold = model_selection.KFold(n_splits=10, random_state=seed)
forest = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
clf=forest.fit(X_train, y_train)

In [None]:
scores = cross_val_score(clf, X_test, y_test)

In [None]:
scores
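A fitted forest also exposes `feature_importances_`, which gives a partial view of what the ensemble relies on. Below is a sketch on synthetic data; the feature names are placeholders. In the lab you would read `clf.feature_importances_` and pair it with the matching slice of `names`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 6 features, of which 3 carry signal
toy_X, toy_y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
toy_names = [f"feature_{i}" for i in range(toy_X.shape[1])]  # placeholder names

toy_forest = RandomForestClassifier(
    n_estimators=100, max_features=3, random_state=7
).fit(toy_X, toy_y)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = toy_forest.feature_importances_

# Print the features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(toy_names[idx], round(importances[idx], 3))
```

These numbers rank features, but they do not explain any individual prediction.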

Breiman:

>...forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability they rate an F.