# Introduction

In this tutorial, you will learn about decision tree classification, attribute selection measures, and how to build and optimise a decision tree classifier using the Python scikit-learn package. It is based on the following article: https://www.datacamp.com/community/tutorials/decision-tree-classification-python

## Decision Trees

A decision tree is a flowchart-like tree structure in which each internal node represents a test on a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, and it does so recursively, a process imaginatively called recursive partitioning. The learned tree can be visualised as a flowchart and is hopefully easy to interpret.

### The Algorithm

The basic idea behind any decision tree algorithm is as follows:

1. Place the best attribute, selected according to an Attribute Selection Measure (ASM), of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
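
The three steps above can be sketched in plain Python. This is only a minimal illustration, assuming categorical attributes and information gain as the ASM; the helper names (`build_tree`, `predict`) and the toy rows are made up for this example. scikit-learn's trees work differently in detail (they use binary threshold splits on numeric features), so treat this as a picture of the recursion rather than of the library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on every value of `attr`."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Recursively partition the data; returns a nested dict or a leaf label."""
    # Leaf: every label agrees, or there are no attributes left to split on
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: place the best attribute (highest information gain) at this node
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    # Step 2: one subset per value of the chosen attribute
    for value in sorted(set(row[best] for row in rows)):
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [lab for _, lab in pairs]
        # Step 3: repeat on the subset
        tree[best][value] = build_tree(sub_rows, sub_labels, remaining)
    return tree

def predict(tree, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Hypothetical toy data: the label depends only on 'glucose'
rows = [
    {'glucose': 'high', 'bmi': 'high'},
    {'glucose': 'high', 'bmi': 'low'},
    {'glucose': 'low', 'bmi': 'high'},
    {'glucose': 'low', 'bmi': 'low'},
]
labels = [1, 1, 0, 0]

tree = build_tree(rows, labels, ['glucose', 'bmi'])
print(tree)  # {'glucose': {'high': 1, 'low': 0}}
print(predict(tree, {'glucose': 'high', 'bmi': 'low'}))  # 1
```

Because `glucose` separates the two classes perfectly, it is chosen at the root and both branches become leaves straight away.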

Various ASMs can be used, such as information gain, gain ratio and the Gini index. These may give subtly different results, so it's worth trying a few of them.
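
To make the measures a little more concrete, here is a small sketch that scores one candidate split with each of them. The function names and the toy labels are assumptions made for illustration, and the formulas are the usual textbook definitions rather than scikit-learn's internal code.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def weighted_impurity(groups, measure):
    """Average impurity of the child groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups)

def gini_decrease(parent, groups):
    return gini(parent) - weighted_impurity(groups, gini)

def information_gain(parent, groups):
    return entropy(parent) - weighted_impurity(groups, entropy)

def gain_ratio(parent, groups):
    """Information gain divided by the entropy of the group sizes."""
    total = len(parent)
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups if g)
    return information_gain(parent, groups) / split_info if split_info > 0 else 0.0

# Hypothetical labels before and after a candidate split
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print(round(gini_decrease(parent, [left, right]), 3))     # 0.125
print(round(information_gain(parent, [left, right]), 3))  # 0.189
print(round(gain_ratio(parent, [left, right]), 3))        # 0.189
```

In scikit-learn the measure is selected through the `criterion` argument of `DecisionTreeClassifier`, which accepts `'gini'` and `'entropy'`.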

Ok, as with all machine learning, let's start by loading some data. We'll use the Pima Indians Diabetes data (public domain), obtained from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database

The data is stored as a CSV, which we can load using the pandas library for Python:

In [None]:
import urllib.request
import zipfile

import pandas as pd

# Uncomment to download the archive if it is not already in the working directory
# urllib.request.urlretrieve('https://github.com/ecs-vlc/ml_workshop_2019/raw/master/scikit-learn/pima-indians-diabetes-database.zip', 'pima-indians-diabetes-database.zip')

# Extract the CSV from the archive
with zipfile.ZipFile('pima-indians-diabetes-database.zip', 'r') as zip_ref:
    zip_ref.extractall()

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# Load the dataset; header=0 marks the first row as the header so that
# col_names replaces it (header=1 would silently drop the first data row)
pima = pd.read_csv('diabetes.csv', header=0, names=col_names)

Now that we have some data, we can peek at the top few rows using the pandas `head` method:

In [None]:
pima.head()

We first split the data into features and targets. Once we have these, we can use scikit-learn to create a train and test split:

In [None]:
from sklearn.model_selection import train_test_split

feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
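
For intuition, a shuffled hold-out split can be sketched in a few lines of plain Python. This is a simplified stand-in, not scikit-learn's implementation: `holdout_split` is a hypothetical helper, the rounding of the test size is an assumption, and the same seed will not reproduce the rows chosen by `train_test_split`.

```python
import random

def holdout_split(rows, labels, test_size=0.3, seed=1):
    """Shuffle the row indices, then cut off the first `test_size` share as test data."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical toy data: ten rows, label is the parity of the first feature
rows = [[i, i * i] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = holdout_split(rows, labels)
print(len(X_tr), len(X_te))  # 7 3
```

The important property is that every row ends up in exactly one of the two sets and keeps its own label, so the test set stays unseen during training.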

### Creating a Decision Tree Classifier

Ok, we now have some data that has been split, so we can think about training a model. We begin with a default `DecisionTreeClassifier` from scikit-learn and evaluate it with a simple accuracy score:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Create decision tree classifier object
clf = DecisionTreeClassifier()

# Train decision tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

So, we have a trained model and an accuracy. It's a binary classification task (there are only two outcomes), so how do we know whether this is good or not? Well, we can never be certain, but it helps to see what we would get if we just predicted that no one had diabetes.

In [None]:
# Fraction of all examples whose label is 0 (no diabetes)
print('Fraction without diabetes:', 1 - (y.sum() / y.count()))

Ok, our accuracy looks less impressive now: it is usually just a few percent above what you get by predicting the same thing each time. Our model could still have learned something useful though, so maybe we should visualise it...
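
The same comparison can be written as a small reusable check. The sketch below is an illustration only: `majority_baseline_accuracy` is a hypothetical helper and the label lists are made up. It picks the most common class in the training labels and measures how often that single guess is right on the test labels, which is the idea behind scikit-learn's `DummyClassifier` with `strategy='most_frequent'`.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training class."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_class)
    return majority_class, hits / len(test_labels)

# Hypothetical labels: class 0 is the more common one
train_labels = [0] * 13 + [1] * 7
test_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

baseline_class, baseline_acc = majority_baseline_accuracy(train_labels, test_labels)
print(baseline_class, baseline_acc)  # 0 0.7
```

A model is only worth keeping if its test accuracy clearly beats this number.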

### Visualising a Decision Tree Classifier

First, we'll need to install some dependencies:

In [None]:
!pip install graphviz
!pip install pydotplus

With that out of the way, we can use a few lines of code to save an image of our tree and load it back in to the notebook (note, this probably won't work outside of Colab):

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())

As you can hopefully now see, the above tree looks significantly more complicated than we would like, especially if we are going to try to understand what it has learned. This leads us to one of the most common tasks in machine learning: hyper-parameter optimisation.

### Hyper-parameter Optimisation

Hyper-parameters are the settings of a model that we choose ourselves rather than learn from the data. We have briefly discussed one already, the ASM. scikit-learn has a few others; the full list can be found in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Here's an example of a tree with better accuracy:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Create decision tree classifier object, limited to five levels
clf = DecisionTreeClassifier(max_depth=5)

# Train decision tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

That's a much nicer accuracy for this problem. When the maximum depth of the tree is high, we run into a common problem in machine learning: overfitting. Overfitting happens when you have a powerful model and not enough data, so the model just learns to reproduce the training data. That's why we always have a test set: it helps us check that we're not just overfitting. In the case of trees, limiting the depth is an immediate solution that helps the model to generalise better (generalising means that our model performs well on the test data as well as the training data).

#### Interpretability

Even though our model performs better, we'd still like something that we can interpret. Let's see how our new tree looks...

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes_smaller.png')
Image(graph.create_png())

Hmm, it's a little better but still not great. Use the space below (or go back and re-run the cells) to try limiting the depth a bit more (perhaps 3) and visualise the tree again:

## Random Forests

Suppose now that accuracy is our primary concern. Can we do any better? The random forest classifier uses a collection of decision trees (each of which learns different patterns) to classify the data. Specifically, we train a whole forest, with each tree fitted to a different random sample of the data, and then, at test time, combine the trees' predictions to choose the class that the forest as a whole favours.
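
The idea can be sketched in plain Python. Everything here is an assumption made for illustration: each "tree" is reduced to a single-threshold stump, each stump sees a bootstrap sample (rows drawn with replacement) and one randomly chosen feature, and the forest predicts by majority vote. scikit-learn's `RandomForestClassifier` grows full trees and, per its documentation, averages the trees' predicted probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def fit_stump(rows, labels, feature):
    """Best single-threshold rule on one feature (a stand-in for a full tree)."""
    best = None
    for threshold in sorted(set(r[feature] for r in rows)):
        for above in (0, 1):  # class predicted when the value is >= threshold
            preds = [above if r[feature] >= threshold else 1 - above for r in rows]
            acc = sum(p == lab for p, lab in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, threshold, above)
    _, threshold, above = best
    return lambda row: above if row[feature] >= threshold else 1 - above

def fit_forest(rows, labels, n_trees=25, seed=1):
    """Train each stump on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over the individual predictions."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the label is 1 when the first feature is at least 5
rows = [(i, 2 * i) for i in range(10)]
labels = [0] * 5 + [1] * 5

forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (9, 18)))
```

Individual stumps disagree near the class boundary because their samples differ, and the vote is what smooths those disagreements out.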

We can do this easily with scikit-learn using the `RandomForestClassifier`: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Create random forest classifier object
clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)

# Train random forest classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

After some hyper-parameter tuning we get something slightly better. The cost is interpretability: we can no longer draw a single decision tree. However, we can at least print out the importance of each input feature:

In [None]:
print(dict(zip(feature_cols, clf.feature_importances_)))

Perhaps unsurprisingly, the top three features are `glucose`, `bmi` and `age`.

## Other Data

The techniques you have just learned can be applied to any data in this format (features with numeric values and a zero-or-one target). See if you can load some other data and perform the classification task using a tree or random forest. A good example dataset would be the Heart Disease UCI data on Kaggle (you may need an account to access it): https://www.kaggle.com/ronitf/heart-disease-uci