# Tree-based Models for Classification
© Explore Data Science Academy

## Learning Objectives

In this train you will learn how to:

- Build Decision Tree and Random Forest classification models using `sklearn`;
- Understand how tree-based models work in the classification setting.

## Outline

This train is structured as follows:

- Load and Preprocess the Iris dataset;
- Train a Decision Tree Classifier;
- Train a Random Forest Classifier.

## Introduction

You will have covered __decision trees__ and __random forests__ during the regression sprint. In this train we discuss how these tree-based models are used for classification. If you need a refresher on these models, be sure to check out the previous trains. Here are the links to the relevant videos:

- [Decision Trees for Regression](https://youtu.be/6UwBOkKOUGk);
- [Random Forest for Regression](https://youtu.be/UbUDwk0BjuI);
- [Ensemble Methods for Regression](https://youtu.be/3uHrFDDs_RE), which typically make use of decision trees as the base model.

## Decision Trees

In a previous train we covered how to build a decision tree regression model. In this train we will look at how to build a decision tree classification model. Let's refresh our memories about decision trees:

### What is a Decision Tree?

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. It is one way to display an algorithm that only contains conditional control statements.

Decision trees are an extremely intuitive way to classify objects or predict continuous values: you simply ask a series of questions designed to zero in on the classification/prediction. For example, if you wanted to build a decision tree to classify an animal you come across while on a hike, you might construct the one shown here:

<img src="https://cocalc.com/share/raw/8b892baf91f98d0cf6172b872c8ad6694d0f7204/PythonDataScienceHandbook/notebooks/figures/05.08-decision-tree.png">

The binary splitting makes this extremely efficient: in a well-constructed tree, each question will cut the number of options by approximately half, very quickly narrowing the options even among a large number of classes.
The trick, of course, comes in deciding which questions to ask at each step.
In machine learning implementations of decision trees, the questions generally take the form of axis-aligned splits in the data, that is, each node in the tree splits the data into two groups using a cutoff value within one of the features.
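
One common way to score a candidate question is Gini impurity, which is the default `criterion` of `sklearn`'s `DecisionTreeClassifier`. The sketch below is a simplified illustration of the idea, not `sklearn`'s actual implementation; the helper names `gini` and `split_impurity` and the toy data are made up for this example. A lower weighted impurity means the cutoff separates the classes more cleanly.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a group of class labels: 1 - sum(p_k ** 2)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_impurity(feature, labels, cutoff):
    """Weighted Gini impurity of the two groups created by `feature <= cutoff`."""
    left = labels[feature <= cutoff]
    right = labels[feature > cutoff]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data (hypothetical): a single feature and two classes
petal_length = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 5.1])
species = np.array(['setosa', 'setosa', 'setosa',
                    'versicolor', 'versicolor', 'versicolor'])

# A cutoff that separates the classes perfectly versus one that does not
good_split = split_impurity(petal_length, species, cutoff=2.5)
poor_split = split_impurity(petal_length, species, cutoff=1.45)
print(good_split, poor_split)
```

A tree-building algorithm would evaluate many such cutoffs across all features and keep the one with the lowest impurity at each node.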

The prediction the tree makes for a new observation is the _mode_ of the class labels of the training observations that fall into the same group (leaf). This is different to regression, where the predictions were the _means_ of the response values in each group.

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/sketch-classification-tree-mode.png" alt="sketch-classification-tree-mode" style="width: 400px;"/>
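
As a minimal sketch of this idea, the mode of the labels in a single leaf can be computed with Python's `collections.Counter`. The function name `leaf_prediction` and the example labels are hypothetical and only serve to illustrate the point.

```python
from collections import Counter

def leaf_prediction(leaf_labels):
    """Return the most common class label among the training samples in a leaf."""
    return Counter(leaf_labels).most_common(1)[0][0]

# Hypothetical training labels that ended up in the same leaf
leaf = ['Iris-versicolor', 'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor']

prediction = leaf_prediction(leaf)
print(prediction)
```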

### Building a Decision Tree Classification Model

Let's work through an example of how to create a decision tree classifier using `sklearn`. 

#### Imports
Here we import all the packages we will need.

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#### Data
In this train the dataset we will be using is the Iris dataset, which is a multivariate dataset where each class refers to a type of Iris plant. This dataset is free and is publicly available at the UCI Machine Learning Repository.

This dataset contains a set of 150 records with five attributes - Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. Species is the type of Iris plant we will be classifying.

Let's import the data to see what we are dealing with.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/iris.csv')
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


#### Pre-Processing
We will start by pre-processing the data so that we can run it through the algorithm. This involves:

- Splitting the data into features and labels;
- Standardising the data using `sklearn`'s `StandardScaler`;
- Splitting the data into training and testing data.

In [19]:
y = df['species']
X = df.drop('species', axis=1)

In [20]:
# Standardise the data
standard_scaler = StandardScaler()
X_transformed = standard_scaler.fit_transform(X)

In [21]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.30, random_state=50)
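
The cells above scale the full dataset before splitting it. A common alternative is to split first and fit the scaler on the training set only, so that no information from the test set influences the preprocessing. The sketch below shows that ordering. It is an assumption-laden variant rather than the approach used in this train: it loads `sklearn`'s bundled copy of Iris via `load_iris` instead of the CSV, and the `_alt` variable names are introduced only for this example. Since trees split on thresholds, they are generally insensitive to this kind of scaling anyway.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: use sklearn's bundled Iris data instead of the CSV file
X_alt, y_alt = load_iris(return_X_y=True)

# Split first ...
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    X_alt, y_alt, test_size=0.30, random_state=50)

# ... then learn the scaling from the training set only
scaler_alt = StandardScaler()
X_train_alt = scaler_alt.fit_transform(X_train_alt)

# Apply the same (training-set) scaling to the test set
X_test_alt = scaler_alt.transform(X_test_alt)

print(X_train_alt.shape, X_test_alt.shape)
```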

#### Training
We will now fit a Decision Tree Classification model to our data by using `sklearn`'s `DecisionTreeClassifier` with default parameters and a random state of 42 (because 42 is the answer to life, the universe, and everything).

In [22]:
tree = DecisionTreeClassifier(random_state=42)

In [23]:
tree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=42)
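
To see the questions a fitted tree actually asks, `sklearn` provides `export_text`, which prints the learned splits as indented rules. The sketch below is self-contained and, as an assumption, fits a separate tree (`demo_tree`) on `sklearn`'s bundled Iris data rather than on the `X_train` used above, so its rules will not match our model exactly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: bundled Iris data, unscaled, used purely for illustration
iris = load_iris()

demo_tree = DecisionTreeClassifier(random_state=42)
demo_tree.fit(iris.data, iris.target)

# Each line is either a threshold test on one feature or a leaf's predicted class
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each `<=` line in the printout is one axis-aligned split, and each `class:` line is a leaf.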

#### Testing

Now let's predict the labels for our test set and examine the performance of our model using a confusion matrix.

In [24]:
y_pred = tree.predict(X_test)

I'm curious to see how many of each class we have in this test set. Let's print that off before we print out the confusion matrix.

In [25]:
y_test.value_counts()

Iris-versicolor    17
Iris-setosa        14
Iris-virginica     14
Name: species, dtype: int64

In [26]:
labels = ['Iris-setosa', 'Iris-versicolor','Iris-virginica']

pd.DataFrame(data=confusion_matrix(y_test, y_pred), index=labels, columns=labels)

|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 14 | 0 | 0 |
| Iris-versicolor | 0 | 16 | 1 |
| Iris-virginica | 0 | 1 | 13 |


Our model does extremely well! Let's also take a look at the classification report for our predicted values.

In [13]:
print(classification_report(y_test, y_pred, target_names=['Iris-setosa', 'Iris-versicolor','Iris-virginica']))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       0.94      0.94      0.94        17
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        45
      macro avg       0.96      0.96      0.96        45
   weighted avg       0.96      0.96      0.96        45
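
The figures in this report can be recovered by hand from the confusion matrix above, where rows are the true classes and columns are the predicted classes. The sketch below simply re-types the matrix as a NumPy array (`cm` is a name chosen for this example) and applies the standard definitions of accuracy, precision, recall and f1-score.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[14, 0, 0],
               [0, 16, 1],
               [0, 1, 13]])

correct = np.diag(cm)                 # correctly classified samples per class

accuracy = correct.sum() / cm.sum()   # 43 correct out of 45
precision = correct / cm.sum(axis=0)  # column sums: samples predicted as each class
recall = correct / cm.sum(axis=1)     # row sums: true samples of each class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))
print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
```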



Even though our model does really well, you can use this classification report to gain some insight into how to improve it. You can see here that `Iris-virginica` has the lowest f1-score (0.93). This could be because the class has few test samples (14), so each error moves the score noticeably. If you were the researcher involved in creating this dataset, you might use this insight as a reason to collect more data on `Iris-virginica`.

### Tuning parameters to improve the model

For the decision tree algorithm we can tune parameters to improve the model. The most commonly tuned parameters are:

- `max_depth`: maximum depth of the tree;
- `min_samples_leaf`: minimum number of samples required to be at a leaf node.

Tuning the parameters is left as an exercise to the reader.
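
As a starting point for that exercise, one possible approach is a cross-validated grid search with `sklearn`'s `GridSearchCV`. The sketch below is only one way to do it: the grid values are arbitrary choices, the names `param_grid` and `search` are ours, and it uses `sklearn`'s bundled Iris data so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data, split the same way as in this train
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=50)

# Candidate values are illustrative, not recommendations
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_leaf': [1, 2, 5, 10],
}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy of the best combination
print(search.score(X_te, y_te))  # accuracy of the refitted best model on the test set
```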

### Decision trees and overfitting

Overfitting turns out to be a general property of decision trees: it is very easy to go too deep in the tree, and thus to fit details of the particular data rather than the overall properties of the distributions they are drawn from. This issue can be addressed by using **random forests**.
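
One way to see this effect is to compare training and test accuracy as the tree is allowed to grow deeper. The sketch below uses a synthetic, deliberately noisy dataset from `sklearn`'s `make_classification` (an assumption made for illustration, since Iris is too clean to show the gap clearly); all dataset settings and variable names are our own choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), chosen only to make overfitting visible
X_noisy, y_noisy = make_classification(n_samples=500, n_features=10,
                                       n_informative=3, flip_y=0.2,
                                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y_noisy, test_size=0.30, random_state=0)

results = {}
for depth in [2, 4, 8, None]:  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    results[depth] = {'train': model.score(X_tr, y_tr),
                      'test': model.score(X_te, y_te)}
    print(depth, results[depth])
```

A large gap between the training and test accuracy of the unrestricted tree is the typical signature of overfitting.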

## Random Forest

A random forest is a powerful non-parametric algorithm and an example of an **ensemble** method built on decision trees, meaning that it relies on aggregating the results of an ensemble of decision trees. The trees in the ensemble are randomized, and the output is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression).

<img src="https://www.researchgate.net/profile/Evaldas_Vaiciukynas/publication/301638643/figure/fig1/AS:355471899807744@1461762513154/Architecture-of-the-random-forest-model.png">

The somewhat surprising result with such ensemble methods is that the sum can be greater than the parts: that is, a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting!
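
A small simulation can make this less surprising. The sketch below assumes a binary problem and a set of voters that are each right 60% of the time and, crucially, make their mistakes independently of one another. Real trees in a forest are correlated, so this is an idealised illustration rather than a model of how a random forest behaves; all names and numbers are our own.

```python
import numpy as np

rng = np.random.default_rng(42)

n_trials = 10_000   # number of simulated observations to classify
n_voters = 101      # odd number of voters, so there are no ties
p_correct = 0.6     # assumed accuracy of each individual voter

# correct[i, j] is True when voter j classifies observation i correctly
correct = rng.random((n_trials, n_voters)) < p_correct

# Average accuracy of a single voter
individual_accuracy = correct.mean()

# The majority vote is right when more than half of the voters are right
majority_accuracy = (correct.sum(axis=1) > n_voters / 2).mean()

print(individual_accuracy, majority_accuracy)
```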

### Building a Random Forest Classification Model

We will reuse the data prepared for the Decision Tree classifier above to train a Random Forest classifier.

#### Imports
First, we need to import `sklearn`'s `RandomForestClassifier`. All other imports needed were declared above.

In [15]:
from sklearn.ensemble import RandomForestClassifier

#### Training
We will now fit a Random Forest Classification model to our data by using `sklearn`'s `RandomForestClassifier` with default parameters, a random state of 42, and the number of trees set to 100.

In [16]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

#### Testing

As we did with the Decision Tree model, let's predict the labels for our test set and examine the performance of our model using a confusion matrix.

In [17]:
pred_forest = forest.predict(X_test)

In [18]:
labels = ['Iris-setosa', 'Iris-versicolor','Iris-virginica']

pd.DataFrame(data=confusion_matrix(y_test, pred_forest), index=labels, columns=labels)

|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 14 | 0 | 0 |
| Iris-versicolor | 0 | 16 | 1 |
| Iris-virginica | 0 | 1 | 13 |


Let's also take a look at the classification report for our predicted values.

In [27]:
print(classification_report(y_test, pred_forest, target_names=['Iris-setosa', 'Iris-versicolor','Iris-virginica']))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       0.94      0.94      0.94        17
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        45
      macro avg       0.96      0.96      0.96        45
   weighted avg       0.96      0.96      0.96        45



### Tuning parameters to improve the model

For the Random Forest algorithm we can tune parameters to improve the model. The most commonly tuned parameters are:

- `n_estimators`: number of trees to include in forest;
- `max_depth`: maximum depth of the tree;
- `min_samples_leaf`: minimum number of samples required to be at a leaf node.

Tuning the parameters is left as an exercise to the reader.
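
As a possible starting point, a randomized search samples a fixed number of parameter combinations instead of trying them all, which can be cheaper when each forest takes a while to train. The sketch below uses `sklearn`'s `RandomizedSearchCV`; the candidate values, `n_iter=5` and `cv=3` are arbitrary choices for illustration, and the bundled Iris data is used so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumption: bundled Iris data, split the same way as in this train
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=50)

# Candidate values are illustrative, not recommendations
param_distributions = {
    'n_estimators': [25, 50, 100],
    'max_depth': [2, 3, 5, None],
    'min_samples_leaf': [1, 2, 5],
}

# Try 5 randomly sampled combinations, each scored with 3-fold cross-validation
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions, n_iter=5, cv=3,
                            random_state=42)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the refitted best model on the test set
```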

## Conclusion

The Random Forest model performs similarly to the Decision Tree model; on this test set their confusion matrices are identical. This is likely because our dataset is small and rather uncomplicated. Practise your model-building skills with other datasets and try to build an intuition about which models work best for different types of tasks/datasets.

In this train we covered building Decision Tree and Random Forest Classification models using `sklearn`.