# Decision Trees   

## Representation   
---
Ask a series of questions, moving down a tree (starting at top/root), until you get to some output (answer).    

### Logical structure containing:      
- Nodes
	- decision nodes
	- pick a particular attribute and ask a question about it (raining?)
- Edges
	- represent values from decision node (yes/no)
- Leaves
	- answer
	- end states representing the output
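
The structure above can be sketched in a few lines of Python. This is only an illustration: the nested-dict layout, the `windy` attribute, and the answer strings are assumptions made up for the example (only the "raining?" question comes from the notes).

```python
# A decision node is a dict: the attribute it asks about, plus one subtree per edge.
# A leaf is any non-dict value (the answer).
toy_tree = {
    "attribute": "raining",
    "branches": {
        "yes": "stay in",
        "no": {
            "attribute": "windy",  # hypothetical second question
            "branches": {"yes": "fly a kite", "no": "go for a walk"},
        },
    },
}


def classify(node, example):
    """Walk from the root to a leaf, following the edge that matches each answer."""
    while isinstance(node, dict):
        answer = example[node["attribute"]]
        node = node["branches"][answer]
    return node


print(classify(toy_tree, {"raining": "no", "windy": "yes"}))  # fly a kite
```

Each loop iteration asks one question and follows one edge, so the number of questions asked equals the depth of the leaf that is reached.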

### Goal   
Ask questions to keep narrowing down possibilities until you get to an answer.   
Usefulness of a question depends on the answer from the previous question and how much the question narrows the space of possibilities.    

### Example    
This would be a simple candidate concept for classifying animals.     
<img src="../images/simpledt.png" align="left"/>    

## Expressiveness  
---
Decision trees can be used to represent logic conditions such as AND, OR or XOR.   

<img src="../images/dt_and.png" width=200 align="left"/>   
**AND** (any)
- Has linear complexity when mapped to a decision tree  
- Easy because any single attribute (A or B) being False makes the result False (e.g., 2 attributes need only 2 nodes)  
- For $n$ attributes, the tree needs $n$ decision nodes, $O(n)$

<img src="../images/dt_or.png" width=200 align="left"/>   
**OR** (any)
- Has linear complexity when mapped to a decision tree  
- Easy because any single attribute (A or B) being True makes the result True (e.g., 2 attributes need only 2 nodes)  
- For $n$ attributes, the tree needs $n$ decision nodes, $O(n)$  

<img src="../images/dt_xor.png" width=200 align="left"/>   
**XOR** (parity)
- Has exponential complexity when mapped to a decision tree  
- Hard because every attribute has to be evaluated on every path (parity) (e.g., 2 attributes need 3 nodes)  
- For $n$ attributes, the tree needs up to $2^n - 1$ decision nodes, $O(2^n)$   

In general in ML, we hope to encounter problems that are more like *any* than *parity*.
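
The node counts above can be checked with a small sketch. It assumes a tree that tests attributes in a fixed order and stops as soon as the output is determined; the helper names (`count_decision_nodes`, `n_and`, `n_or`, `n_xor`) are made up for this illustration.

```python
from itertools import product


def count_decision_nodes(f, n, assigned=()):
    """Count decision nodes in a tree that tests attributes in a fixed order
    and becomes a leaf once the output no longer depends on the rest."""
    remaining = n - len(assigned)
    outputs = {f(assigned + rest) for rest in product((False, True), repeat=remaining)}
    if len(outputs) == 1:  # output already determined: this is a leaf
        return 0
    return 1 + sum(count_decision_nodes(f, n, assigned + (v,)) for v in (False, True))


def n_and(bits):
    return all(bits)


def n_or(bits):
    return any(bits)


def n_xor(bits):
    return sum(bits) % 2 == 1  # parity: True when an odd number of bits are set


for n in (2, 3, 4):
    print(n, count_decision_nodes(n_and, n), count_decision_nodes(n_or, n),
          count_decision_nodes(n_xor, n))
# 2 2 2 3
# 3 3 3 7
# 4 4 4 15
```

AND and OR grow by one node per attribute, while parity doubles (plus one) each time, matching $n$ versus $2^n - 1$.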

### How expressive is a decision tree?   
If we have to search over all possible decision trees to find the best one, how many decision trees do we need to worry about?   

**XOR** example   
- $n$ boolean attributes  
- output is boolean   

How many trees?  
Looking at the truth table  
- All combinations of attribute values need to be evaluated  
- $2^n$ rows, where $n$ is the number of attributes   
- $2^m$ possible output columns, where $m = 2^n$ is the number of rows (each row's output can be True or False)  
- number of distinct boolean functions, and so of distinct trees to consider, is $2^{2^n}$   

For $n=6$, the number of trees is $2^{64} \approx 1.844674407 \times 10^{19}$.   
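
The count can be checked directly with Python's arbitrary-precision integers. The function name `count_boolean_functions` is just a label chosen for this sketch.

```python
def count_boolean_functions(n):
    """Number of distinct boolean functions of n boolean attributes."""
    rows = 2 ** n      # rows in the truth table
    return 2 ** rows   # each row's output is independently True or False


for n in range(1, 7):
    print(n, count_boolean_functions(n))
# the last line printed is: 6 18446744073709551616
```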

Decision trees are extremely expressive, which means the hypothesis space is extremely large. We need algorithms that search it cleverly rather than exhaustively.   
<img src="../images/dt_xor_exp.png" align="left"/>   

## Algorithm   
---

### Simple algorithm  
1. Pick best attribute (best splits the data)
2. Ask question   
3. Follow path of answer   
4. Go to 1 (repeat until you get answer)  

### Example best attribute   
Best attribute and question to ask.  Perfectly splits the data.  
<img src="../images/dt_best_attribute.png" align="left"/>   

### ID3 algorithm   
Top down learning algorithm

#### Pseudocode  
Loop:

 - $A$ $\leftarrow$ best attribute
 - Assign $A$ as decision attribute for *node*
 - For each value of $A$ create a descendant of *node*.
 - Sort training examples to *leaves*.
 - IF Examples perfectly classified $\rightarrow$ **STOP**
 - ELSE Iterate over leaves.
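
A minimal recursive sketch of the loop above, assuming discrete attributes, examples stored as dicts, and information gain (defined under *Best Attribute*) as the measure of "best". The function names, the nested-dict tree layout, and the toy data are assumptions for illustration, not a reference implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    if len(set(labels)) == 1:   # examples perfectly classified: stop
        return labels[0]
    if not attributes:          # nothing left to ask: fall back to majority vote
        return Counter(labels).most_common(1)[0][0]
    # A <- best attribute, assigned as the decision attribute for this node
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # One descendant per value of A; sort the training examples into them
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = id3([rows[i] for i in keep],
                                      [labels[i] for i in keep], remaining)
    return node


# Hypothetical toy data: "raining" alone determines the label.
toy_rows = [
    {"raining": "yes", "windy": "no"},
    {"raining": "yes", "windy": "yes"},
    {"raining": "no", "windy": "no"},
    {"raining": "no", "windy": "yes"},
]
toy_labels = ["in", "in", "out", "out"]
print(id3(toy_rows, toy_labels, ["raining", "windy"]))
# {'attribute': 'raining', 'branches': {'no': 'out', 'yes': 'in'}}
```

The recursion plays the role of "iterate over leaves": each impure leaf becomes a new node with its own best attribute.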

#### Best Attribute
*Information Gain*   
One way to define the best attribute is information gain: the amount of information gained by picking a particular attribute.   

Mathematically, information gain quantifies the reduction in randomness of the labels in a set of data, given that you know the value of a particular attribute.  

It is defined as:   
$$\textrm{Gain}(S, A) = \textrm{Entropy}(S) - \sum_v \frac{\left|S_v\right|}{\left|S\right|} \cdot \textrm{Entropy}(S_v) $$ where:   

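- $S$ is the set of training examples    
- $A$ is an attribute    
- $S_v$ is the subset of $S$ for which $A$ has value $v$    
    
Entropy is defined as:  
$$\textrm{Entropy}(S) = - \sum_v p(v)\log_2 p(v) $$ where here $v$ ranges over the label values and $p(v)$ is the fraction of examples in $S$ with label $v$. With log base 2, entropy is measured in bits.      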
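
A small worked example of both formulas, written in terms of label counts. The counts (9 positive and 5 negative examples, split by a hypothetical two-valued attribute into 6/2 and 3/3) are assumptions chosen only to have concrete numbers, and the function names are illustrative.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a label distribution given as per-label counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_from_counts(parent_counts, child_counts):
    """Entropy of the parent minus the size-weighted entropy of each subset S_v."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy_from_counts(child)
                    for child in child_counts)
    return entropy_from_counts(parent_counts) - remainder


print(round(entropy_from_counts([9, 5]), 3))                  # 0.94
print(round(gain_from_counts([9, 5], [[6, 2], [3, 3]]), 3))   # 0.048
```

A gain this close to zero says the attribute barely reduces uncertainty about the label, so it would be a poor choice for the split.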

#### ID3's biases:    
1. Good Splits at the top of the tree
2. Correct over Incorrect
3. Prefers shorter trees vs. longer (due to bias #1).

### Continuous attributes
- Check certain ranges   
- With continuous attributes, it can make sense to ask about the same attribute more than once along a path in a decision tree, as long as each question is different (e.g., is age > 50, then later is age < 20).   
- For discrete attributes it does not make sense to ask about the same attribute twice on one path. 
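
A sketch of a path that asks about `age` twice with different thresholds. The tuple layout and the leaf labels are assumptions made up for this illustration.

```python
# Each decision node: (attribute, threshold, subtree_if_leq, subtree_if_greater).
# Anything that is not a tuple is a leaf.
age_tree = (
    "age", 50,
    ("age", 20, "young", "middle"),  # only reached when age <= 50; asks about age again
    "older",
)


def classify_threshold(node, example):
    """Follow threshold tests from the root until a leaf is reached."""
    while isinstance(node, tuple):
        attribute, threshold, if_leq, if_greater = node
        node = if_leq if example[attribute] <= threshold else if_greater
    return node


for age in (15, 35, 70):
    print(age, classify_threshold(age_tree, {"age": age}))
# 15 young
# 35 middle
# 70 older
```

The second test is informative only because the first one left a range (0 to 50) that can still be narrowed, which is not possible after testing a discrete attribute.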

### When to stop  
- When everything is classified correctly
- When we've run out of attributes
- When overfitting starts to occur (tree is too big, too complicated)  
    - Use cross validation to stop expanding tree when error is low enough   
    - Prune to reduce decision tree size (vote)  
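
One way to act on the cross-validation idea is to compare trees of different maximum depth and keep the depth with the best mean validation score. The sketch below uses scikit-learn's `cross_val_score` and the `max_depth` parameter of `DecisionTreeClassifier`; the synthetic, noisy dataset from `make_classification` and the candidate depths are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (an assumption; not the bank-note data).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=3,
                                     flip_y=0.2, random_state=0)

depths = [1, 2, 3, 5, 8, None]  # None lets the tree grow until its leaves are pure
mean_scores = {}
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    mean_scores[depth] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_depth = max(mean_scores, key=mean_scores.get)
for depth, score in mean_scores.items():
    print(depth, round(score, 3))
print("best max_depth:", best_depth)
```

Limiting depth is a form of early stopping. For pruning after the tree is grown, scikit-learn also exposes a `ccp_alpha` parameter that could be tuned in the same way.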

### Regression
- Splitting: variance?
- Output: average, local linear fit  
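
A rough sketch of the two bullets above for a single numeric feature, assuming the split is chosen to minimise the size-weighted variance of the two sides and each leaf outputs the average of its targets. The function name `best_variance_split` and the toy numbers are made up for this example.

```python
from statistics import mean, pvariance


def best_variance_split(xs, ys):
    """Return (threshold, weighted_variance) for the midpoint split that
    minimises the size-weighted variance of the targets on each side."""
    pairs = sorted(zip(xs, ys))
    best = (None, pvariance(ys))  # baseline: no split at all
    for (lo, _), (hi, _) in zip(pairs, pairs[1:]):
        if lo == hi:
            continue
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, _ = best_variance_split(xs, ys)
left_mean = mean(y for x, y in zip(xs, ys) if x <= threshold)
right_mean = mean(y for x, y in zip(xs, ys) if x > threshold)
print(threshold, left_mean, right_mean)  # 6.5 6.0 21.0
```

Here variance plays the role that entropy plays in classification, and the leaf average replaces the class label.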

TODO more details on decision tree regression

In [43]:
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Resources   
https://github.com/xbno/Projects/blob/master/Models_Scratch/Decision%20Trees%20from%20scratch.ipynb    
https://github.com/xbno/Projects/blob/master/Models_Scratch/Random%20Forest%20from%20scratch.ipynb   

### Read in the datasets

In [2]:
file_path = '~/data/bank-note/'
file_name = 'data_banknote_authentication.txt'
file = file_path + file_name
df = pd.read_csv(file, names=['variance','skewness','kurtosis','entropy','class'])
df.shape

(1372, 5)

In [3]:
df.head()

| | variance | skewness | kurtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [4]:
def normalize(data):
    """
    @summary: normalize data to a value between 0 and 1 
    @param data: np.ndarray 
    @returns: normalized np.ndarray 
    """
    # norm = (x - x_min) / (x_max - x_min)
    # Note: min and max are taken over the whole array, not per column,
    # so all features share a single scale.
    x_max = data.max()
    x_min = data.min()
    return (data - x_min) / (x_max - x_min)

In [60]:
X = np.array(df.iloc[:, :-1])
y = np.array(df.iloc[:, -1])
X

array([[  3.6216 ,   8.6661 ,  -2.8073 ,  -0.44699],
       [  4.5459 ,   8.1674 ,  -2.4586 ,  -1.4621 ],
       [  3.866  ,  -2.6383 ,   1.9242 ,   0.10645],
       ...,
       [ -3.7503 , -13.4586 ,  17.5932 ,  -2.7771 ],
       [ -3.5637 ,  -8.3827 ,  12.393  ,  -1.2823 ],
       [ -2.5419 ,  -0.65804,   2.6842 ,   1.1952 ]])

In [29]:
X_norm = normalize(X)
X_norm

array([[0.54872005, 0.70785003, 0.34591883, 0.42037539],
       [0.57787732, 0.69211842, 0.35691866, 0.3883535 ],
       [0.55642971, 0.35124998, 0.49517515, 0.43783379],
       ...,
       [0.31617167, 0.00992098, 0.98945758, 0.3468715 ],
       [0.32205801, 0.17004148, 0.825416  , 0.39402533],
       [0.35429094, 0.41371776, 0.51914954, 0.47217867]])
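
If the intent is to scale each feature to its own 0 to 1 range, a per-column variant would look roughly like the sketch below. `normalize_columns` is a hypothetical helper, not part of the notebook, and it assumes no column is constant (otherwise the division would be by zero).

```python
import numpy as np


def normalize_columns(data):
    """Min-max scale each column independently to the range [0, 1]."""
    col_min = data.min(axis=0)  # one minimum per feature
    col_max = data.max(axis=0)  # one maximum per feature
    return (data - col_min) / (col_max - col_min)


demo = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 200.0]])
print(normalize_columns(demo))
```

Every column of the result then spans exactly 0 to 1, whereas the global version keeps the relative scale between features.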

In [52]:
kfold = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=1, random_state=42, test_size=.2)

dt = tree.DecisionTreeClassifier()
print("KFold")
for train_index, test_index in kfold.split(X, y):
    # print("TRAIN:", train_index, "TEST:", test_index)
    # Train the model
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy:      ', accuracy)
    
dt = tree.DecisionTreeClassifier()
print("Shuffle Split")
for train_index, test_index in shufflesplit.split(X, y):
    # print("TRAIN:", train_index, "TEST:", test_index)
    # Train the model
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy:      ', accuracy)
    
dt = tree.DecisionTreeClassifier()
print("train test split")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:      ', accuracy)

KFold
Accuracy:       0.9680232558139535
Accuracy:       0.9912790697674418
Accuracy:       0.9766081871345029
Accuracy:       0.9736842105263158
Shuffle Split
Accuracy:       0.9927272727272727
train test split
Accuracy:       0.9818181818181818
