<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Examples: Tree-based models for classification
 
© ExploreAI Academy

In this train, we review tree-based models, specifically decision trees and random forests, examining their application in classification and their implementation using `sklearn`.

## Learning objectives

By the end of this train, you should be able to:

* Understand how tree-based models work in the classification setting.
* Build decision tree and random forest classification models using `sklearn`.

## Introduction

__Decision trees__ and __random forests__ are models frequently used to solve regression problems. In this train, we will discuss how these tree-based models can be used in classification. 


## 1. Decision trees

In this train, we will look at how to build a decision tree classification model. 

First, let's refresh our memories about decision trees:

### 1.1 What is a decision tree?

A decision tree is a decision support tool that uses a **tree-like graph** or model of decisions and their possible consequences. It is one way to display an algorithm that only contains conditional control statements.

Decision trees are extremely intuitive ways to classify objects or predict continuous values. You simply ask a series of questions designed to zero in on the classification/prediction. 

For example, if you wanted to build a **decision tree to classify an animal** you come across while on a hike, you might construct the one shown here:

<img src="https://cocalc.com/share/raw/8b892baf91f98d0cf6172b872c8ad6694d0f7204/PythonDataScienceHandbook/notebooks/figures/05.08-decision-tree.png">

The binary splitting makes this extremely efficient: in a well-constructed tree, each question cuts the number of options approximately in half, very quickly narrowing the options even among a large number of classes.

The trick, of course, comes in deciding **which questions to ask at each step**.

In machine learning implementations of decision trees, the questions generally take the form of axis-aligned splits in the data. That is, each node in the tree splits the data into two groups using a cut-off value within one of the features.
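To make the idea of an axis-aligned split concrete, here is a minimal sketch that scores one candidate cut-off using Gini impurity, which is the default splitting criterion of `sklearn`'s `DecisionTreeClassifier`. The helper names `gini` and `split_impurity` and the toy data are hypothetical and only for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two groups created by feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data (hypothetical): one feature and two classes
petal_length = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 5.1])
species = np.array(["setosa", "setosa", "setosa",
                    "versicolor", "versicolor", "versicolor"])

impurity_before = gini(species)
impurity_after = split_impurity(petal_length, species, threshold=2.5)
print(impurity_before, impurity_after)
```

A lower weighted impurity after the split means the two groups are purer than the original group. A tree-growing algorithm would typically compare many candidate cut-offs across all features and keep the best one.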

The prediction made by the tree for each group (leaf) is the **mode** of the class labels of the training observations that fall into that group. This is different from regression, where the prediction is the **mean** of the response values in each group.

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/sketch-classification-tree-mode.png" alt="sketch-classification-tree-mode" style="width: 400px;"/>
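As a small illustration of predicting with the mode, the sketch below takes some made-up training observations that are assumed to have landed in two leaves (the `leaf` column and its values are hypothetical) and computes the most frequent class in each.

```python
import pandas as pd

# Hypothetical training observations, already assigned to two leaves of a tree
training = pd.DataFrame({
    "leaf": ["A", "A", "A", "B", "B", "B", "B"],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-virginica", "Iris-virginica", "Iris-versicolor",
                "Iris-virginica"],
})

# The prediction for each leaf is the most frequent class label in that leaf
leaf_predictions = training.groupby("leaf")["species"].agg(lambda s: s.mode().iloc[0])
print(leaf_predictions)
```

Any new observation routed to leaf `A` would be classified as `Iris-setosa`, and any routed to leaf `B` as `Iris-virginica`.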

### 1.2 Building a decision tree classification model

Let's work through an example of how to create a decision tree classifier using `sklearn`. 

#### Imports

Here, we import all the packages we will need.

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#### Data
In this train, we will use the Iris dataset, a multivariate dataset in which each class refers to a type of Iris plant. The dataset is free and publicly available from the UCI Machine Learning Repository.

This dataset contains a set of 150 records with **five attributes**:
- Sepal length.
- Sepal width.
- Petal length.
- Petal width.
- Species – the type of Iris plant we will be classifying.

Let's import the data to see what we are dealing with.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/iris.csv')
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


#### Preprocessing

We will start by preprocessing the data so that we can run it through the model algorithm. This involves:

- Splitting the data into features and labels.
- Standardising the data using `sklearn`'s `StandardScaler`.
- Splitting the data into training and testing data.

In [3]:
# Separate into features and target
y = df['species']
X = df.drop('species', axis=1)

In [4]:
# Standardise the data
standard_scaler = StandardScaler()
X_transformed = standard_scaler.fit_transform(X)

In [5]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.30, random_state=50)

#### Training

We will now fit a **decision tree classification** model to our data using `sklearn`'s `DecisionTreeClassifier` with default parameters and a random state of 42.

In [6]:
tree = DecisionTreeClassifier(random_state=42)

In [7]:
tree.fit(X_train, y_train)

#### Testing

Now let's predict the labels for our test set and examine the performance of our model using a **confusion matrix**.

In [8]:
y_pred = tree.predict(X_test)

Let's first see how many of each class we have in this test set:

In [9]:
y_test.value_counts()

species
Iris-versicolor    17
Iris-setosa        14
Iris-virginica     14
Name: count, dtype: int64

In [10]:
labels = ['Iris-setosa', 'Iris-versicolor','Iris-virginica']

pd.DataFrame(data=confusion_matrix(y_test, y_pred), index=labels, columns=labels)

|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 14 | 0 | 0 |
| Iris-versicolor | 0 | 16 | 1 |
| Iris-virginica | 0 | 1 | 13 |


Our model does extremely well! Let's also take a look at the **classification report** for our predicted values.

In [11]:
print(classification_report(y_test, y_pred, target_names=['Iris-setosa', 'Iris-versicolor','Iris-virginica']))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       0.94      0.94      0.94        17
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        45
      macro avg       0.96      0.96      0.96        45
   weighted avg       0.96      0.96      0.96        45



Even though our model does really well, we can use this classification report to gain some insight into how to improve it. 

We can see here that `Iris-virginica` has the **lowest F1 score** (0.93, against 0.94 for `Iris-versicolor`). It is worth asking whether there is a reason for this. One factor to consider is whether the sample we trained our model on was large enough to expect accurate predictions; with only 14 `Iris-virginica` observations in the test set, a single misclassification is enough to move the score. If you were the researcher involved in creating this dataset, you might use this insight as a reason to collect more data on `Iris-virginica`.
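To see where the numbers in the classification report come from, the sketch below recomputes precision, recall and F1 for each class directly from the confusion matrix shown earlier. Only `numpy` is used, and the variable names are illustrative.

```python
import numpy as np

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are the true classes, columns are the predicted classes
# (values copied from the decision tree's confusion matrix)
cm = np.array([[14, 0, 0],
               [0, 16, 1],
               [0, 1, 13]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums: how often each class was predicted
recall = true_positives / cm.sum(axis=1)     # row sums: how often each class actually occurs
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:16s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The F1 score is the harmonic mean of precision and recall, so here it equals both of them for every class because each class has the same number of false positives as false negatives.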

### 1.3 Tuning parameters to improve the model

For the decision tree algorithm, we can tune parameters to improve the model. The most commonly tuned parameters are:

- `max_depth`: the maximum depth of the tree.
- `min_samples_leaf`: the minimum number of samples required to be at a leaf node.

We encourage you to explore tuning the model's parameters.
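One possible way to explore these parameters is a cross-validated grid search with `sklearn`'s `GridSearchCV`. The sketch below is only a starting point: it assumes `sklearn`'s bundled copy of the Iris data (`load_iris`) as a stand-in for the CSV used above, and the grid values are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled Iris data stands in for the CSV loaded earlier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=50
)

# Hypothetical grid: the candidate values are illustrative only
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 2, 5],
}

# 5-fold cross-validation on the training set for every parameter combination
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refitted best model on the test set
```

Because the search only ever sees the training set, the test set remains a fair check of the chosen parameters.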

### 1.4 Decision trees and overfitting

Decision trees are prone to overfitting. It is very easy to grow a tree too deep, so that it fits details of the particular training data rather than the overall properties of the distributions the data are drawn from. This issue can be mitigated using **random forests**.
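The effect is easiest to see on noisy data, so the sketch below uses a synthetic dataset from `make_classification` with 20% of the labels flipped (an assumption made purely for illustration; it is not the Iris data) and compares training and test accuracy as the tree is allowed to grow deeper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (hypothetical, for illustration only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

scores = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(depth, scores[depth])
```

The fully grown tree memorises the training set, noise included, so its training accuracy reaches 1.0 while its test accuracy stays well below that.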

## 2. Random forest

A random forest is a powerful non-parametric algorithm that is an example of an **ensemble** method built on decision trees, meaning that it relies on aggregating the results of an ensemble of decision trees. 

The ensemble of trees is randomised and the output is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

<img src="https://www.researchgate.net/profile/Evaldas_Vaiciukynas/publication/301638643/figure/fig1/AS:355471899807744@1461762513154/Architecture-of-the-random-forest-model.png">

The somewhat surprising result with such ensemble methods is that the whole can be greater than the sum of its parts. That is, a majority vote among a number of estimators can end up being better at predicting outcomes than any of the individual estimators used in the voting.
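A quick simulation can make this plausible. The sketch below assumes an idealised binary problem in which 101 voters are each right 60% of the time and, crucially, make their errors independently of one another; all of these numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_voters, p_correct = 10_000, 101, 0.6

# correct[i, j] is True when voter j gets case i right (independent draws)
correct = rng.random((n_cases, n_voters)) < p_correct

individual_accuracy = correct.mean()
# In a binary problem the majority is right when more than half the voters are right
majority_accuracy = (correct.sum(axis=1) > n_voters / 2).mean()

print(individual_accuracy, majority_accuracy)
```

The trees in a real random forest are trained on overlapping data, so their errors are correlated and the gain is smaller than in this idealised setting. Randomising the trees is what reduces that correlation.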

### 2.1 Building a random forest classification model

We will use the same data used in the decision tree classifier above to train a **random forest classifier**. 

#### Imports

First, we need to import `sklearn`'s `RandomForestClassifier`. All other imports needed were declared above.

In [12]:
from sklearn.ensemble import RandomForestClassifier

#### Training

We now fit a random forest classification model to our data using `sklearn`'s `RandomForestClassifier` with default parameters, a **random state** of `42`, and the **number of trees** set to `100`.

In [13]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

#### Testing

As we did with the decision tree model, let's predict the labels for our test set and examine the performance of our model using a confusion matrix.

In [14]:
pred_forest = forest.predict(X_test)

In [15]:
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

pd.DataFrame(data=confusion_matrix(y_test, pred_forest), index=labels, columns=labels)

|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 14 | 0 | 0 |
| Iris-versicolor | 0 | 16 | 1 |
| Iris-virginica | 0 | 1 | 13 |


Let's also take a look at the classification report for our predicted values.

In [17]:
print(classification_report(y_test, pred_forest, target_names=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       0.94      0.94      0.94        17
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        45
      macro avg       0.96      0.96      0.96        45
   weighted avg       0.96      0.96      0.96        45



### 2.2 Tuning parameters to improve the model

For the random forest algorithm, we can tune parameters to improve the model. The most commonly tuned parameters are:

- `n_estimators`: the number of trees to include in a forest.
- `max_depth`: the maximum depth of the tree.
- `min_samples_leaf`: the minimum number of samples required to be at a leaf node.

We encourage you to explore tuning the model's parameters.
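As one way to begin, the sketch below varies `n_estimators` and compares the mean 5-fold cross-validated accuracy using `cross_val_score`. It again assumes `sklearn`'s bundled Iris data as a stand-in for the CSV, and the fixed `max_depth=3` and the candidate forest sizes are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: sklearn's bundled Iris data stands in for the CSV loaded earlier
X, y = load_iris(return_X_y=True)

results = {}
for n_trees in [10, 50, 100]:  # hypothetical candidate forest sizes
    forest = RandomForestClassifier(
        n_estimators=n_trees, max_depth=3, random_state=42
    )
    # Mean accuracy over 5 cross-validation folds
    results[n_trees] = cross_val_score(forest, X, y, cv=5).mean()
    print(n_trees, round(results[n_trees], 3))
```

On a dataset this small the scores are likely to be very close to one another, so differences between settings should be read with caution.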

## Conclusion

In this train, we covered building decision tree and random forest classification models using `sklearn`.

The random forest model produces the same confusion matrix and classification report as the decision tree model on this test set. This is likely because our dataset is small and rather uncomplicated.

Practise your model-building skills with other datasets and try to build an intuition for which models work best for different types of tasks/datasets.


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>