Decision Trees: An Example
===

In this notebook we are going to use decision trees to help us classify flowers!

In particular we are going to classify each iris into one of three sub-types:

* Iris setosa
* Iris virginica
* Iris versicolor

Each observation of an iris records 4 features:

1. Sepal Length (cm)
2. Sepal Width  (cm)
3. Petal Length (cm)
4. Petal Width  (cm)

This is a very famous dataset that has been used widely throughout the history of Machine Learning. If you'd like to know more about the dataset and its history, the Wikipedia page is quite useful: [https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The SciKit Learn library ships this dataset as part of its installation, so we don't have to go hunting for it. The version of the dataset that SciKit Learn provides is described on the documentation site: [https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import sklearn.model_selection as ms
import sklearn.metrics as met

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The `load_iris` function
---

The dataset is provided via the `load_iris` function. The function accepts a few options (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)), but the default call, which we will use, returns a [Bunch](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch) object containing the dataset along with some metadata.

In [2]:
iris_bunch = load_iris()
iris_bunch.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
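As a small aside, a Bunch behaves like a dictionary whose keys can also be read as attributes. The sketch below (the variable names `bunch` and `bunch_df` are ours) checks that both access styles give back the same array, and tries the `as_frame=True` option described in the documentation linked above, which should return pandas objects instead of NumPy arrays.

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# Key access and attribute access refer to the same underlying array.
same_object = bunch["data"] is bunch.data
print(same_object)
print(bunch.data.shape)  # (number of rows, number of columns)

# With as_frame=True the features come back as a DataFrame
# and the target as a Series (this needs pandas installed).
bunch_df = load_iris(as_frame=True)
print(type(bunch_df.data).__name__)
print(type(bunch_df.target).__name__)
```

In the rest of this notebook we stick with the default call and build the DataFrame ourselves.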

In [4]:
# looking at the description of the dataset
print(iris_bunch["DESCR"])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

Representing the Features
---


Because we want to train a _classifier_ we've got to separate out the input features from the output category.

Luckily the input features are already provided for us by SKLearn. If we had compiled the dataset ourselves, we'd separate out the columns from the dataset that we believe are the independent variables.

In [5]:
# Create a dataframe from the data provided by sklearn.
# The column names are also nicely provided by sklearn,
# so no need for us to make sure they match
ind = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
print(ind.size)
ind.head()

600


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
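The `600` printed above is `DataFrame.size`, which counts every cell in the frame rather than the number of observations. A quick sketch (with our own variable names) that reads the rows and columns separately through `shape`:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# shape gives (rows, columns); size is their product.
n_rows, n_cols = features.shape
print(n_rows, n_cols)
print(features.size == n_rows * n_cols)
```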


Representing the output categories
---

Unlike in our linear regression examples, the output for the decision tree is a _category_ or _class_ (hence "classifier"). So we need to properly represent the classes so that we know what the prediction is telling us.

In [6]:
# First let's look at the target classifications, as provided by sklearn:
iris_bunch.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [7]:
iris_bunch.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [8]:
# We can use Pandas `Categorical` to help us make more sense of this:
dep = pd.Categorical.from_codes(iris_bunch.target, iris_bunch.target_names)
dep

['setosa', 'setosa', 'setosa', 'setosa', 'setosa', ..., 'virginica', 'virginica', 'virginica', 'virginica', 'virginica']
Length: 150
Categories (3, object): ['setosa', 'versicolor', 'virginica']

This makes it more obvious that we're dealing with categorical data, but now we face a different problem: we have to encode the categories in a way that can be represented numerically!

You may think, "But we had that! Why'd you change it in the first place?" Well, what we had was `0`, `1`, `2`. A machine learning algorithm doesn't necessarily know that those are categories; for all it knows, they are values of a continuous variable, which would be bad!

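One common way to do this is to represent the categories as a _set_ of 0/1 columns, called 'dummy variables'. (The classifier we train later is fit on `dep` directly, since `DecisionTreeClassifier` accepts class labels as its target; `dep_split` is here to show what the encoding looks like.)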

In [9]:
dep_split = pd.get_dummies(dep)
dep_split

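|   | setosa | versicolor | virginica |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 145 | 0 | 0 | 1 |
| 146 | 0 | 0 | 1 |
| 147 | 0 | 0 | 1 |
| 148 | 0 | 0 | 1 |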
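To see what the dummy encoding does on something small enough to check by eye, here is a sketch on four made-up labels (the `codes` list is hypothetical, not taken from the iris data). Each row should have exactly one indicator switched on, and the original label can be read back as the name of that column.

```python
import pandas as pd

# Hypothetical toy labels, chosen only for illustration.
codes = [0, 2, 1, 0]
names = ["setosa", "versicolor", "virginica"]
labels = pd.Categorical.from_codes(codes, names)

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(labels, dtype=int)
print(dummies.sum(axis=1).tolist())  # one "on" column per row

# The label is recovered as the column that holds the 1 in each row.
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())
```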


Train + Test
---

Now we can split the input features and output categories into training/testing sets. SKLearn provides _a lot_ of goodies in the 'Model Selection' module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

We're going to use the `train_test_split` function, which splits the dataset into _random_ testing and training sets:

In [19]:
# Figures in our paper used random_state=X
ind_train, ind_test, dep_train, dep_test = ms.train_test_split(ind, dep, random_state=42)

In [20]:
print(ind_train.size, ind_test.size)

448 152
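Those sizes correspond to 112 training rows and 38 testing rows (each with 4 features), which is what the default `test_size` of 0.25 gives for 150 observations. Because the split is random, the three classes are not guaranteed to be equally represented in the test set. The sketch below shows one possible way to even that out using the `stratify` argument; the `random_state=0` value and the variable names are our own choices for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for class proportions to be preserved in both splits.
# random_state=0 is an arbitrary seed, used only to make the sketch repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)

# Count how many test observations fall into each class.
counts = Counter(y_test.tolist())
print(counts)
```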


Training the Decision Tree
---

Now we have everything ready to train our decision tree (DT) classifier; we just have to put our data in the right place:

In [21]:
dt = DecisionTreeClassifier()
dt.fit(ind_train, dep_train)

DecisionTreeClassifier()

In [22]:
predicted = dt.predict(ind_test)
predicted

array(['versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'setosa', 'versicolor', 'virginica', 'versicolor', 'versicolor',
       'virginica', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'virginica', 'setosa',
       'virginica', 'setosa', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'setosa', 'setosa', 'setosa', 'setosa',
       'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor',
       'setosa'], dtype=object)
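If you are curious about what the tree has actually learned, `sklearn.tree.export_text` prints the fitted rules as plain text. The following is a sketch rather than a reproduction of the model above: it refits a separate, deliberately shallow tree (`max_depth=2` and `random_state=0` are our own illustrative choices) so that the printout stays short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# A shallow tree keeps the rule listing readable; these settings are
# illustrative and differ from the default tree trained above.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(X_train, y_train)

# Each indented line is a threshold test on one feature;
# the leaves report the predicted class.
rules = export_text(shallow_tree, feature_names=list(iris.feature_names))
print(rules)
```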

Seeing how we did
---

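SKLearn provides a module called `metrics`, which has a lot of useful functions for evaluating predictions. Here we use `confusion_matrix`, where each row corresponds to an actual class and each column to a predicted class.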

In [23]:
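# Convert the held-out labels to a plain array so they line up with the predictions
actual = np.array(dep_test)
predictions = np.array(predicted)
# Correct predictions are counted on the diagonal
met.confusion_matrix(actual, predictions)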

array([[15,  0,  0],
       [ 0, 11,  0],
       [ 0,  0, 12]])
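A confusion matrix can be boiled down to a few summary numbers. As a small worked sketch, we copy the matrix printed above into an array by hand (the name `cm` is ours) and compute the overall accuracy along with the fraction of each actual class that was predicted correctly.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes.
cm = np.array([[15, 0, 0],
               [0, 11, 0],
               [0, 0, 12]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal entry divided by its row total.
recall = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall)
```

Here every off-diagonal entry is zero, so all 38 test observations were classified correctly for this particular split.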

In [17]:
orig = np.array([[ 8,  0,  0],
                 [ 0, 15,  1],
                 [ 0,  0, 14]])
orig

array([[ 8,  0,  0],
       [ 0, 15,  1],
       [ 0,  0, 14]])