# Decision Trees  

## [Dr Kieu's Classification Lecture Notes](http://120.108.116.237/~ktduc/DA/Lecs/Topic01%20Classification%20Basics%20Jiawei%20Han.pdf) 
See slides 10 to 43 for in-depth explanations of the theory behind Decision Trees and Attribute Selection Methods.

***
## Inducing Decision Trees with [`sklearn.tree.DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

### Module Imports
- We use `numpy` for its arrays
- `pandas` for dataframes and easy handling of `.csv` files
- `sklearn` for the classifier, the train/test splitting utilities and the accuracy metrics

In [8]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, KFold
from sklearn.tree import DecisionTreeClassifier

### Reading data from `diabetes.csv`


In [9]:
data = pd.read_csv('diabetes.csv')
print(data)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
5              5      116             74              0        0  25.6   
6              3       78             50             32       88  31.0   
7             10      115              0              0        0  35.3   
8              2      197             70             45      543  30.5   
9              8      125             96              0        0   0.0   
10             4      110             92              0        0  37.6   
11            10      168             74              0        0  38.0   
12            10      139             

[`data.loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) accesses a group of rows and columns by label(s) or a boolean array. Label slices include both endpoints.

In [10]:
inputs = data.loc[:, :'Age']   # returns all rows, and all columns up to and including Age
print(inputs)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
5              5      116             74              0        0  25.6   
6              3       78             50             32       88  31.0   
7             10      115              0              0        0  35.3   
8              2      197             70             45      543  30.5   
9              8      125             96              0        0   0.0   
10             4      110             92              0        0  37.6   
11            10      168             74              0        0  38.0   
12            10      139             
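
As a small illustration of label-based selection with `.loc`, the sketch below uses a hypothetical three-row frame (`toy`, with made-up values) instead of `diabetes.csv`. The column names are borrowed from the dataset only for familiarity.

```python
import pandas as pd

# Hypothetical toy frame with made-up values (not rows from diabetes.csv).
toy = pd.DataFrame({
    "Glucose": [150, 90, 180],
    "BMI": [33.0, 27.0, 23.0],
    "Age": [45, 30, 35],
    "Outcome": [1, 0, 1],
})

# A label slice in .loc includes BOTH endpoints, so 'Age' is kept
# and only 'Outcome' is left out.
features = toy.loc[:, :"Age"]
print(list(features.columns))

# A boolean array selects rows; here, the rows where Outcome equals 1.
positives = toy.loc[toy["Outcome"] == 1, ["Glucose", "Age"]]
print(positives)
```

Positional slicing with `.iloc` excludes the end point, which is why `.loc[:, :'Age']` is convenient when the last feature column is known by name.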

In [11]:
target = data['Outcome']       # selects the Outcome column (the class label) for all rows
print(target)

0      1
1      0
2      1
3      0
4      1
5      0
6      1
7      0
8      1
9      1
10     0
11     1
12     0
13     1
14     1
15     1
16     1
17     1
18     0
19     1
20     0
21     0
22     1
23     1
24     1
25     1
26     1
27     0
28     0
29     0
      ..
738    0
739    1
740    1
741    0
742    0
743    1
744    0
745    0
746    1
747    0
748    1
749    1
750    1
751    0
752    0
753    1
754    1
755    1
756    0
757    1
758    0
759    1
760    0
761    1
762    0
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


### Making training and test data with [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- `X` is the dataset minus the class label attribute
- `y` is the class label attribute
- `train_test_split(X, y)` randomly splits the dataset; with the default settings this gives 75% training data and 25% testing data:
    - `x_train` training data minus class label attribute
    - `x_test` testing data minus class label attribute
    - `y_train` training data class label attributes
    - `y_test` testing data class label attributes

The test labels `y_test` are compared later on with the predictions `y_pred`.

In [12]:
X = inputs.values
y = target.values
x_train, x_test, y_train, y_test = train_test_split(X, y) 
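
Because the split is random, each run of the cell above produces a different partition. A minimal sketch of a repeatable split follows, assuming stand-in arrays (`X_demo`, `y_demo`) with the same number of rows as the dataset; `random_state=0` is an arbitrary choice, and `test_size=0.25` simply spells out the default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 768 rows, the same length as the dataset.
X_demo = np.arange(768 * 2).reshape(768, 2)
y_demo = np.arange(768) % 2

# random_state fixes the shuffle so the same split is produced every run;
# test_size=0.25 makes the default 75/25 proportion explicit.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(x_tr.shape, x_te.shape)   # (576, 2) (192, 2)
```

With 768 rows, the test portion holds 192 samples, which matches the total of the confusion matrix entries reported later.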

### Executing [`sklearn.tree.DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
We **train** the model `dt` with `x_train` and `y_train`.  
Recall that Decision Trees are an example of **supervised learning**, so we need to know the class label attribute `y_train` during model training.

In [13]:
dt = DecisionTreeClassifier()  #instance of classifier (model) object
dt.fit(x_train, y_train)       #training the model (classifier)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
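
To see what a fitted tree has learned, it can help to train one on data small enough to check by hand. The sketch below uses a made-up one-feature dataset and assumes a scikit-learn version that provides `export_text` and `get_depth` (0.21 or later, which may be newer than the version that produced the output above).

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: one feature, classes separable between 2.0 and 3.0.
X_toy = [[1.0], [2.0], [3.0], [4.0]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
toy_tree.fit(X_toy, y_toy)

# A single split is enough to separate the two classes.
print(toy_tree.get_depth())
# Text rendering of the learned split rules.
print(export_text(toy_tree, feature_names=["x"]))
```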

### A quick Prediction and Accuracy Test
[**`confusion_matrix`**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)  
By definition a confusion matrix $C$ is such that $C_{i,j}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$.

Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$:

$$
C = \begin{bmatrix}
    TN & FP \\
    FN & TP \\
\end{bmatrix}
$$
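
The layout of the matrix can be checked on a handful of labels. This is a minimal sketch with made-up labels (`y_true_toy`, `y_pred_toy`); `ravel()` flattens the 2x2 matrix row by row, giving the four counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()   # row-major order: C[0,0], C[0,1], C[1,0], C[1,1]
print(cm)
print(tn, fp, fn, tp)         # 2 1 1 3
```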
  
  
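[**`accuracy_score`**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)  
Accuracy is the fraction of test samples classified correctly, where $P = TP + FN$ and $N = TN + FP$ are the numbers of actual positive and actual negative samples:
\begin{equation}
    \text{accuracy} = \frac{TP+TN}{P+N}
\end{equation}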

In [28]:
y_pred = dt.predict(x_test) #predicts using the 25% remaining test data  

print('Confusion Matrix: \n{}\n'.format(confusion_matrix(y_test, y_pred)))
print('A quick accuracy test: \n{}'.format(accuracy_score(y_test, y_pred)))

Confusion Matrix: 
[[101  27]
 [ 28  36]]

A quick accuracy test: 
0.7135416666666666
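
The reported accuracy can be recovered from the confusion matrix in the output above: the diagonal holds the correct predictions and the sum of all entries is the test set size. A short check, with the matrix copied by hand into a hypothetical `cm_reported` array:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm_reported = np.array([[101, 27],
                        [28, 36]])

correct = np.trace(cm_reported)   # TN + TP, the diagonal entries
total = cm_reported.sum()         # P + N, all test samples
print(correct, total, correct / total)   # 137 192 0.7135416666666666
```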


### Rapid accuracy testing with [`sklearn.model_selection.KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
`KFold` splits the data into $k$ folds (in this case $k=5$). Each fold serves once as the test set while the remaining folds are used for training, which allows quick accuracy testing on **multiple models**.  
  
  
**NOTE:** `KFold` returns arrays of indices that refer to the dataset, whereas `train_test_split` returns copies of the actual data from the set.  
  
  

- `get_n_splits([X, y, groups])` returns the number of splitting iterations in the cross-validator.
- `split(X[, y, groups])` generates `np.array`s of indices that split the data into a training set and a test set.
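
To make the index-based behaviour concrete, the sketch below runs `KFold` on ten made-up samples (`X_small`). Without `shuffle=True`, the test folds are consecutive blocks of the data.

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(10).reshape(10, 1)   # ten made-up samples

# Each item is a (train indices, test indices) pair of integer arrays.
folds = list(KFold(n_splits=5).split(X_small))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The index arrays are then used to slice the data, exactly as `X[train_idx]` and `X[test_idx]` do in the cell that follows.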

In [27]:
k_fold = KFold(n_splits=5)

k = k_fold.get_n_splits(X)   # not actually necessary!
print('Number of splitting iterations: {}\n'.format(k))

accuracies = []
dt2 = DecisionTreeClassifier()
for train_idx, test_idx in k_fold.split(X):
    train_X, test_X = X[train_idx], X[test_idx]
    train_y, test_y = y[train_idx], y[test_idx]
    dt2.fit(train_X, train_y)
    predictions = dt2.predict(test_X)
    accuracy = accuracy_score(test_y, predictions)
    accuracies.append(accuracy)

print('The mean accuracy is {}'.format(np.mean(accuracies)))
print(accuracies)

Number of splitting iterations: 5

The mean accuracy is 0.7019692725575079
[0.6688311688311688, 0.5974025974025974, 0.7272727272727273, 0.7843137254901961, 0.7320261437908496]
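
The manual loop above can also be written with scikit-learn's `cross_val_score` helper, which fits a fresh copy of the estimator on each fold. The following is a sketch of that alternative, not the approach used in this notebook. It runs on synthetic stand-in data (`X_demo`, `y_demo`) so that it does not depend on `diabetes.csv`, and the resulting scores are therefore not comparable to the figures above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 100 samples, 3 features, label from feature 0.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# One accuracy value per fold, using the same 5-fold scheme as above.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    cv=KFold(n_splits=5),
    scoring="accuracy",
)
print(scores, scores.mean())
```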
