# Managing the Quality Metric of Global Ecological Footprint

> Managing the Quality Metric of Global Ecological Footprint

- author: Victor Omondi
- toc: true
- comments: true
- categories: [classification, machine-learning]
- image: images/mqmgef-shield.png

# Overview

![image.png](datasets/images/poster.png "poster.png")

## Machine Learning: Classification - Managing the Quality Metric of Global Ecological Footprint


The dataset used  was obtained from the National Footprint and Biocapacity Accounts. It provides Ecological Footprint per capita data for years 1961-2016 in global hectares (gha). The National Footprint and Biocapacity Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2016. The calculations in the National Footprint and Biocapacity Accounts are primarily based on United Nations data sets.

We will use the data to classify and predict the quality metrics (qascore) of the ecological footprint data for the different countries. This data includes total and per capita national biocapacity, the ecological footprint of consumption, the ecological footprint of production and total area in hectares.

Data Source: https://data.world/footprint/nfa-2019-edition

# Libraries

In [2]:
import warnings
warnings.filterwarnings("ignore")


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("ggplot")

from sklearn.utils import shuffle
from sklearn.preprocessing import (LabelEncoder, 
                                   MinMaxScaler)
from sklearn.model_selection import (cross_val_score, 
                                     KFold, 
                                     LeaveOneOut, 
                                     StratifiedKFold, 
                                     train_test_split)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score,
                             confusion_matrix, 
                             f1_score, 
                             precision_score, 
                             recall_score)

from imblearn.over_sampling import SMOTE

## Libraries Setup

In [14]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

# Linear Classification and Logistic Regression

We will Explore linear classification.

In machine learning, classification is a supervised method of segmenting data points into various labels or classes. Unlike regression, the target variable in a classification problem is discrete. Each data point used in training classification models must have a corresponding label in order for the characteristics and patterns in the classes to be learnt appropriately. Classification can either be binary - identifying that a given email is spam or not or, multi-class - classifying a fruit as orange, mango or banana.

## Introduction

Every year people demand more from nature than it can regenerate. Individuals, communities and government leaders use ecological footprint data to better manage limited resources, reduce economic risk, and improve well-being. The Dataset provides Ecological Footprint per capita data for years 1961-2016 in global hectares (gha). Ecological Footprint is a measure of how much area of biologically productive land and water an individual, population, or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. The Ecological Footprint is measured in global hectares. Since trade is global, an individual or country's Footprint tracks area from all over the world. 

Apart from predicting numeric values, another important supervised machine learning method is classification and it involves predicting classes (either binary or multinomial classes). In this section, we will cover how to measure performances of class prediction, linear classification methods and non-linear/tree-based methods. We’ll also focus on strategies for applying a successful classification model like interpretability-accuracy trade-off, class and imbalance.

The National Footprint and Biocapacity Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2016. The calculations in the National Footprint and Biocapacity Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency. In this project, we will use this data to classify and predict the quality metrics (qascore) of the ecological footprint data for the different countries. This data includes total and per capita national biocapacity, the ecological footprint of consumption, the ecological footprint of production and total area in hectares.

Data Source: https://data.world/footprint/nfa-2019-edition

## Linear Classification and Logistic Regression

In machine learning, classification is a supervised method of segmenting data points into various labels or classes. Unlike regression, the target variable in a classification problem is discrete. Each data point used in training classification models must have a corresponding label in order for the characteristics and patterns in the classes to be learnt appropriately. Classification can either be binary - identifying that a given email is spam or not or, multi-class - classifying a fruit as orange, mango or banana.

### Linear classifiers and the importance of class probabilities

For simplicity, we define a linear classifier as a binary classifier that separates two classes (positive and negative class) using a linear separator by computing a linear combination of the features and comparing against a set threshold.

### Logistic Regression: Sigmoid, logit and the log-likelihood

Logistic regression is a linear algorithm that can be used for binary or multiclass classification. It is a discriminative classifier that estimates the probability that an instance belongs to a class using an s-shape function curve called the sigmoid function. The predicted values obtained after using a linear equation on the predictors by applying logistic regression can fall in the range of negative infinity to positive infinity. The sigmoid maps these results by shrinking the value to fall between 0 and 1.  We can say that we use the sigmoid function to transform linear regression into logistic regression.

$$
sigmoid\ \sigma \ (x) = \frac{1}{1+e^{-x}} 
$$

![image.png](datasets/images/sigmoid-curve.png "sigmoid-curve.png")

The sigmoid function can be applied to a linear equation,

$$
z = \beta_0 + \beta_{1}x
$$

to obtain values h between 0 and 1 such that

$$
h = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-{\beta_0 + \beta_{1}x}}}
$$

For a binary classification task with classes A and B, if a threshold is set for 0.5 and the probability of an instance belonging to a class is $p$, we can say that if $p < 0.5$ the instance if of class A while it is of class B is $p > 0.5$. 

Also known as the log of odds, logit is the logarithm of odds ratio where the odds ratio is the probability that an event occurs divided by the probability that the event does not occur. Logit is the inverse of the sigmoid such that it maps values from negative infinity to positive infinity.

$$
\log{it}(p) = \log(\frac{p}{1 - p})
$$

> Note: Recall that in linear regression, we minimized the sum of squared errors SSE; in logistic regression, the log-likelihood is maximized.

In [15]:
df = pd.read_csv("datasets/raw/NFA 2019 public_data.csv")
df.head()

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Armenia,1992,1,AreaPerCap,0.140292,0.199546,0.097188051,0.036888,0.02932,0.0,0.5032351,3A
1,Armenia,1992,1,AreaTotHA,483000.0,687000.0,334600.0,127000.0,100943.0008,0.0,1732543.0,3A
2,Armenia,1992,1,BiocapPerCap,0.159804,0.135261,0.084003213,0.013742,0.033398,0.0,0.4262086,3A
3,Armenia,1992,1,BiocapTotGHA,550176.2427,465677.9722,289207.1078,47311.55172,114982.2793,0.0,1467355.0,3A
4,Armenia,1992,1,EFConsPerCap,0.38751,0.189462,1.26e-06,0.004165,0.033398,1.114093,1.728629,3A


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72186 entries, 0 to 72185
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         72186 non-null  object 
 1   year            72186 non-null  int64  
 2   country_code    72186 non-null  int64  
 3   record          72186 non-null  object 
 4   crop_land       51714 non-null  float64
 5   grazing_land    51714 non-null  float64
 6   forest_land     51714 non-null  object 
 7   fishing_ground  51713 non-null  float64
 8   built_up_land   51713 non-null  float64
 9   carbon          51713 non-null  float64
 10  total           72177 non-null  float64
 11  QScore          72185 non-null  object 
dtypes: float64(6), int64(2), object(4)
memory usage: 6.6+ MB


In [17]:
df.isnull().sum()

country               0
year                  0
country_code          0
record                0
crop_land         20472
grazing_land      20472
forest_land       20472
fishing_ground    20473
built_up_land     20473
carbon            20473
total                 9
QScore                1
dtype: int64

The dataset has a lot of missing values from `crop_land:carbon` columns

### distribution of target variable

In [18]:
df.QScore.value_counts()

3A    51481
2A    10576
2B    10096
1A       16
1B       16
Name: QScore, dtype: int64

### Handling Missing Values

For simplicity, we will drop the rows with missing values.

In [19]:
df = df.dropna()
df.isnull().sum()

country           0
year              0
country_code      0
record            0
crop_land         0
grazing_land      0
forest_land       0
fishing_ground    0
built_up_land     0
carbon            0
total             0
QScore            0
dtype: int64

In [20]:
df.QScore.value_counts()

3A    51473
2A      224
1A       16
Name: QScore, dtype: int64

An obvious change in our target variable after removing the missing values is that there are only three classes left. From the distribution of the 3 classes, we can see that there is an obvious imbalance between the classes. There are methods that can be applied to handle this imbalance such as oversampling and undersampling.

- Oversampling involves increasing the number of instances in the class with fewer instances
- Undersampling involves reducing the data points in the class with more instances.
For now, we will convert this to a binary classification problem by combining class '2A' and '1A'.

In [21]:
df['QScore'] = df.QScore.replace(['1A'], '2A')
df.QScore.value_counts()

3A    51473
2A      240
Name: QScore, dtype: int64

In [22]:
df_2A = df[df.QScore=='2A']
df_3A = df[df.QScore=='3A'].sample(350)
data_df = df_2A.append(df_3A)
data_df.sample(10)

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
33160,Kyrgyzstan,2016,113,AreaPerCap,0.2290231,1.540869,0.105612578,0.136843,0.04267672,0.0,2.055024,2A
40653,Mauritania,2016,136,EFConsTotGHA,1481184.0,5432897.0,852452.4511,270421.0,195135.9,1727426.0,9959516.0,2A
39126,Mali,1993,133,EFProdPerCap,0.4922205,0.6718759,0.212944288,0.02300339,0.07137923,0.0171919,1.488615,3A
29675,Iraq,2016,103,BiocapTotGHA,4605966.0,626123.3,1541760.681,79779.87,1046806.0,0.0,7900435.0,2A
69328,"Congo, Democratic Republic of",1970,250,EFProdPerCap,0.3008255,0.02181854,0.53706,0.01872915,0.03828945,0.04684737,0.9635702,3A
6308,Bolivia,2016,19,EFConsPerCap,0.4825594,1.662504,0.171887765,0.01046143,0.06755527,0.7894736,3.184442,3A
20520,Finland,2016,67,AreaPerCap,0.5116943,0.08317771,4.458854285,1.977021,0.09190805,0.0,7.122655,2A
20523,Finland,2016,67,BiocapTotGHA,4145792.0,271659.0,51643421.75,12697160.0,744647.2,0.0,69502680.0,2A
45343,Niger,1995,158,EFConsTotGHA,4330783.0,5913366.0,2940934.525,10467.4,114038.4,559558.9,13869150.0,3A
28853,Indonesia,2004,101,EFConsTotGHA,94555590.0,4746897.0,52946760.84,37237240.0,11740590.0,117857400.0,319084500.0,3A


In [23]:
data_df = shuffle(data_df)
data_df = data_df.reset_index(drop=True)
data_df.shape

(590, 12)

In [24]:
data_df.QScore.value_counts()

3A    350
2A    240
Name: QScore, dtype: int64

### More Data Preprocessing

In [25]:
data_df = data_df.drop(columns=['country_code', 'country', 'year'])
X = data_df.drop(columns='QScore')
y = data_df['QScore']

### split the data into training and testing sets

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
y_train.value_counts()

3A    235
2A    178
Name: QScore, dtype: int64

There is still an imbalance in the class distribution. For this, we use SMOTE only on the training data to handle this.


### encode categorical variable

In [27]:
encoder = LabelEncoder()
X_train['record'] = encoder.fit_transform(X_train.record)
X_test['record'] = encoder.fit_transform(X_test.record)

In [28]:
smote = SMOTE(random_state=1)
X_train_balanced, y_balanced = smote.fit_sample(X_train, y_train)

In [29]:
scaler = MinMaxScaler()
normalised_train_df = scaler.fit_transform(X_train_balanced.drop(columns=['record']))
normalised_train_df = pd.DataFrame(normalised_train_df, columns=X_train_balanced.drop(columns=['record']).columns)
normalised_train_df['record'] = X_train_balanced.record
normalised_train_df.head()

Unnamed: 0,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,record
0,0.03786112,0.01029289,0.01735369,0.001715168,0.08648018,0.009998,0.02040938,5
1,0.01635984,0.002123552,0.008275276,0.001411935,0.01441415,0.0,0.005827019,1
2,0.04175654,0.1158301,0.002504287,0.0,0.03059039,0.0,0.02614894,1
3,0.02631355,0.02363502,0.001216972,0.0,0.01710721,0.0,0.007952276,3
4,1.173502e-09,5.496204e-10,2.695632e-10,9.039378e-11,1.915074e-09,0.0,3.537642e-10,2


In [30]:
X_test = X_test.reset_index(drop=True)
normalised_test_df = scaler.fit_transform(X_test.drop(columns=['record']))
normalised_test_df = pd.DataFrame(normalised_test_df, columns=X_test.drop(columns=['record']).columns)
normalised_test_df['record'] = X_test.record
normalised_test_df.head()

Unnamed: 0,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,record
0,0.001011359,0.0001217512,0.0001505777,0.0001230732,0.00745996,0.0,0.00029112,1
1,0.00131406,0.0004147361,0.0005087191,7.624855e-05,0.005392793,0.01120927,0.0006872237,7
2,0.06765315,0.001535051,0.01270009,0.01275176,0.1417544,0.627561,0.02736108,5
3,3.291722e-10,5.867514e-10,3.142206e-12,4.058977e-11,9.098758e-10,0.0,1.959085e-10,2
4,1.467227e-10,5.549262e-12,6.66331e-11,7.212987e-12,5.536216e-10,4.220273e-10,3.85343e-11,6


### Logistic Regression

In [31]:
log_reg = LogisticRegression()
log_reg.fit(normalised_train_df, y_balanced)

LogisticRegression()

# Measuring Classification Performance {% fn 1 %}

We will explore cross validation techniques used by data scientist to avoid overfitting and enable generalization.

## Cross-validation and accuracy

Cross Validation (CV) is a well known and trusted method applied to avoid overfitting and enable generalization. Although there are different techniques used in performing cross validation, the fundamental concept involves partitioning the dataset into a number of subsets, holding out a set for evaluation then training the model on the other sets. This gives a more reliable estimate of how the model performs across different training sets because it provides an average score across different training samples used. The only drawback with cross validation is that it takes more time and computational resources however, the gain obtained in having a better model is very well worth this cost. **K-Fold cross validation**, **Stratified K-Fold cross validation** and **Leave One Out Cross Validation (LOOCV)** are some cross validation techniques.



In [32]:
scores = cross_val_score(log_reg, normalised_train_df, y_balanced, cv=5, scoring='f1_macro')
scores

array([0.54525074, 0.48843537, 0.52306785, 0.52078849, 0.51617647])

### K-Fold Cross Validation

This technique is called K-Fold because the data is split into K equal groups.  If $k = 5$ a 5-fold cross validation can be performed such that the data is split into $k_1$, $k_2$, $k_3$, $k_4$ and $k_5$. The model is trained on $k_2 - k_5$ and evaluated on $k_1$ then repeated $k$ times until every group is used to train and test the model. 

![image.png](datasets/images/kfold.png "kfold.png")

In [33]:
kf = KFold(n_splits=5)
kf.split(normalised_train_df)
f1_scores = []

# run for every split
for train_index, test_index in kf.split(normalised_train_df):
    X_train_k, X_test_k = normalised_train_df.iloc[train_index], normalised_train_df.iloc[test_index]
    y_train_k, y_test_k = y_balanced[train_index], y_balanced[test_index]
    model = LogisticRegression().fit(X_train_k, y_train_k)
    f1_scores.append(
        f1_score(y_true=y_test_k, y_pred=model.predict(X_test_k), pos_label='2A')*100
    )
f1_scores

[54.71698113207547,
 54.99999999999999,
 57.731958762886606,
 62.18487394957983,
 0.0]

### Stratified K-Fold Cross Validation

Stratified K-Fold cross validation ensures that in every fold, there is an equal proportion of each target class to obtain a good representation of the data and avoid imbalance and biased results. For example, if there are two target classes $t_1$ and $t_2$ with equal distribution in the data, it is best to ensure that the folds also have the same distribution.

In [34]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
f1_scores_skf = []
for train_index, test_index in skf.split(normalised_train_df, y_balanced):
    X_train_skf, X_test_skf = np.array(normalised_train_df)[train_index], np.array(normalised_train_df)[test_index]
    y_train_skf, y_test_skf = y_balanced[train_index], y_balanced[test_index]
    model = LogisticRegression().fit(X_train_skf, y_train_skf)
    f1_scores_skf.append(
        f1_score(y_true=y_test_skf, y_pred=model.predict(X_test_skf), pos_label='2A')
    )
f1_scores_skf

[0.5892857142857143,
 0.6363636363636364,
 0.6666666666666666,
 0.5535714285714286,
 0.6285714285714287]

### Leave One Out Cross Validation (LOOCV)

In this method, one instance is left out and used as the test set while the model is trained on $N-1$ data points where $N$ is the number of data points. This means that the number of instances and folds are equal.

In [35]:
loo = LeaveOneOut()
scores_loo = cross_val_score(
    LogisticRegression(), normalised_train_df, y_balanced, cv=loo, scoring='f1_macro'
)
average_score_loo = scores_loo.mean()
average_score_loo

0.4829787234042553

## Confusion Matrix, Precision-Recall, ROC curve and the F1-score

Accuracy, precision, recall, F1-score and many others are evaluation metrics used in measuring the performance of classification models. We will discuss these metrics.

### Confusion Matrix

It is an $N$ x $N$ matrix that gives a summary of the correct and incorrect predicted classification results for the N target classes. The values in the diagonal of the matrix represent the number of correctly predicted classes while every other cell in the matrix indicates the misclassified classes. This means that the more predicted values that fall in the diagonal, the better the model. True positive, false positive, true negative and false negative are terms used when interpreting a confusion matrix.

![image.png](datasets/images/confusion-matrix.png "confusion-matrix.png")

#### True Positive (TP): 
This is a correct classification where the predicted value is the same as the actual value. Using the table above, this means that actual value was positive and the predicted value was also positive.

#### True Negative (TN): 
The predicted value also matches the actual value. In this case, it is for the negative class. The actual value is negative and the predicted value is negative.

#### False Positive (FP): 
Also called a Type I error, this is a misclassification such that the model predicted a positive class while the actual class is negative. Telling a man that he is pregnant is definitely a false positive.

#### False Negative (FN): 
Also another misclassification where the predicted value is negative and the actual value is positive. Another example will be telling a pregnant woman that she is not pregnant. FN is known as a Type II error.

In [36]:
new_prediction = log_reg.predict(normalised_test_df)
new_prediction[:5]

array(['2A', '3A', '3A', '2A', '2A'], dtype=object)

In [37]:
cnf_mat = confusion_matrix(y_true=y_test, y_pred=new_prediction, labels=['2A', '3A'])
cnf_mat

array([[ 56,   6],
       [102,  13]], dtype=int64)

### Accuracy

This is the ratio of the number of correctly predicted instances to the total number of instances. It is a commonly used metric suitable when the target classes are not imbalanced. A high accuracy does not necessarily mean that the model has high predicting power. Hence, depending on the task, it is important to not use only the accuracy metric because it does not provide enough information about the model.

$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$

In [38]:
accuracy = accuracy_score(y_true=y_test, y_pred=new_prediction)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.39


### Precision

The ratio of correctly predicted instances of a class to the total number of items predicted by the model to be in that class is referred to as precision (known as Positive Predicted Value - PPV). This translates to the total percentage of the results obtained that are relevant. For the positive class, it is the ratio of true positives to the sum of true positives and false positives

$$
Precision = \frac{TP}{TP + FP}
$$

In [39]:
precision = precision_score(y_true=y_test, y_pred=new_prediction, pos_label='2A')
print(f'precision: {precision:.2f}')

precision: 0.35


### Recall

Known as the sensitivity of the model, recall gives a percentage of total relevant results correctly predicted by the model. It is the ratio of the true positives to the actual number of positives (true positives and false negatives).

$$
Recall = \frac{TP}{TP + FN}
$$

there is also a trade-off between precision and recall. It is impossible to maximise both metrics simultaneously because an increase in recall decreases precision. Identify which metric is important based on your task and optimise.

In [40]:
recall = recall_score(y_true=y_test, y_pred=new_prediction, pos_label='2A')
print(f"Recall: {recall:.2f}")

Recall: 0.90


### F1-Score

This metric is the harmonic mean of precision and recall that aims to have an optimal balance of both. The F1-Score is quite easy to use and can be focused on to maximize as opposed to maximizing precision and recall.

$$
F_1 = 2 * \frac{precision * recall}{precision + recall}
$$

In [41]:
f1 = f1_score(y_true=y_test, y_pred=new_prediction, pos_label='2A')
print(f"F1: {f1:.2f}")

F1: 0.51


### ROC Curve

The Receiver Operating Characteristics (ROC) curve is a probability curve that measures the performance of a classification model at different set thresholds. Recall also known as the True Positive Rate (TPR) is plotted on the y-axis against the False Positive Rate (FPR) on the x-axis.

The code examples above are not the optimal results that can be obtained with the model. Hyperparameter tuning can be performed to improve the model.

# Multiclass Classification

We will explore how to deal with more than two classes where an instance is classified into a single class.

## Multilabel and Multiclass classification

Multiclass classification deals with more than two classes where an instance is classified into a single class. For example, given a dataset with a set of features that describe the weather such that the classes are sunny, rainy and windy, a multiclass classification task will only give a single class as the result. In contrast, multilabel classification classifies an instance into a set of target labels. Articles and movies are examples where this can apply. An article can discuss a single topic but can also be about politics, religion, education and many more while movies are commonly tagged to multiple genres such as comedy, adventure, action.

## The Sigmoid and the Softmax function

The softmax function is quite similar to the sigmoid explained earlier. It is used for multiclass classification because it can obtain the probabilities for various classes such that the probabilities of each class sum to 1. This means that an increase in the probability of a class causes a decrease in the probability of at least one of the other classes. It can also be referred to as a generalization of logistic regression or the sigmoid function and can be used for multi-class classification while the sigmoid function is used in multi-label classification. The softmax function is popularly used in the output layers of neural networks. Although the sum of the outputs of the softmax must be 1, this is not the same for the sigmoid function. 

# Tree-Based Methods and The Support Vector Machine {% fn 3 %}

We will explore Support Vector Machine (SVM), a supervised machine learning algorithm that is used to solve both classification and regression tasks.

## Linear and non-linear Support Vector Machine

Support Vector Machine (SVM)  is a supervised machine learning algorithm that is used to solve both classification and regression tasks. In classification, the algorithm uses a line or hyperplane to separate classes by using data points close to the boundary (support vector)  for each class and a hyperplane that maximizes the distance between the classes. 


> Important: For clarity, a hyperplane is a line that linearly separates data points. Although there can be several hyperplanes between classes, the optimal hyperplane which has the maximum distance or margin between itself and the support vectors is chosen.

As we know, data is not always linearly separable such that a straight line might not be able to adequately segregate classes. Although SVM is a linear classifier, it can be used to classify a non-linear dataset by transforming the dataset to a higher dimensional feature space where it can be linearly separable. This is done using the kernel trick such that a kernel function is applied on each data point to map to a higher dimensional space. 

## Decision Trees and CART algorithm 

The decision tree is a widely used non-parametric supervised machine learning approach that splits instances in a dataset based on different decision rules inferred from the features in the dataset. It is a tree-based algorithm with nodes that represent a specific attribute or decision rule such that for an instance, a question is asked at a node and possible answers to the question found on both edges. This is a sequential process that involves recursive partitioning of nodes for several features until the leaves for the tree provides the final output or class for that instance. Decision trees can also be used to solve regression problems.

ID3 - Iterative Dichotomiser 3, CART - Classification and Regression Trees, and C4.5 are some examples of decision tree algorithms. In this section, we only discuss the CART algorithm. The CART predictive model generates decision rules that have a binary tree representation such that each non-terminal node has two child nodes as opposed to some other tree-based methods that have more child nodes. It supports numerical target variables. At every node, the best split is chosen such that the splitting criterion is maximised. Gini impurity index is used as the splitting criterion in CART.

**Gini Impurity**: this is a measure of the chance that a randomly selected instance will be wrongly classified when selected. For different classes in a dataset, with $p(i)$ as the probability that the chosen instance belongs to class $i$, the gini impurity index for all classes $G$, can be calculated such that:

Gini impurity index values range between 0 and 1 such that 0 translates to a pure classification where all instances belong to the same class while 1 means that there is a random distribution of the instances across different classes. To select the best split, the gini gain is calculated by taking a weighted sum of the gini impurity index then subtracting from the original impurity. Higher gini gain leads to better splits simply put, the lower the gini impurity, the better the split.

## Overfitting in Decision Trees, Early Stopping and Pruning

The recursive partitioning of nodes until the final subsets are obtained in decision trees makes it prone to overfitting. The deeper the tree, the higher the chances of the overfitting. This can be prevented using a stopping criterion such as early stopping and pruning. Early stopping or pre-pruning involves stopping the tree-building process before the tree becomes too complex and the training data is perfectly classified. An early stopping condition like the maximum depth can be set to avoid deep trees such that the tree stops growing after reaching the set maximum depth for the tree. Another early stopping criterion that can be used is the classification error. At every splitting stage, the error is checked. If there is no significant decrease in the error, there is no need to make the tree more complex. When there are fewer data points than a set threshold value, early stopping can also take place. Early stopping may also produce underfit models if it stops too early. Post-pruning, on the other hand, allows the tree to be fully built before simplifying by removing sections of the tree at different levels by calculating the error rate.

In [43]:
dtc = DecisionTreeClassifier()
dtc.fit(normalised_train_df, y_balanced)
dtc_pred = dtc.predict(normalised_test_df)
dtc_pred[:5]

array(['2A', '2A', '3A', '3A', '3A'], dtype=object)

# Ensemble Methods

We will explore how to combine several classifiers to obtain an optimal model with better performance as opposed to just a single classifier.

## Beyond decision trees and ensemble classifiers

Ensembling in machine learning involves the combination of several classifiers to obtain an optimal model with better performance as opposed to just a single classifier. These classifiers can be of different algorithms and hyperparameters. Bagging, boosting, stacking and blending are methods classifiers can be combined.

### Bagging

Bootstrap Aggregation or Bagging is a parallel ensembling technique that randomly bootstraps or samples the dataset with replacement to create subsets from the original. Multiple models are then trained using these subsets and the predicted results from these models aggregated to return final predictions. Bagging results in a final model that has less variance than its base classifiers.

#### Bagging: Random Forests

When bagging is applied to decision trees, it results in random forests which is a supervised learning algorithm that has a large number of decision trees. For an instance in the dataset, each tree returns a prediction for the class the instance belongs to then, the class with the most votes becomes the final class for that instance. In random forests, it is assumed that a group of uncorrelated trees will do better than an individual tree. While some of the trees might be wrong in their predictions, many others will be correct.

#### Boosting: AdaBoost, Gradient Boosting and XGBoost

### Boosting

Boosting is a sequential process where every phase attempts to correct the errors made by the previous model. The main principle is to fit multiple weak learners which are slightly better than just random guessing. In contrast to bagging, boosting attempts to reduce both variance and bias. AdaBoost, Gradient Boosting and XGBoost are examples of boosting algorithms.

#### AdaBoost: 

Adaptive Boosting is the first boosting algorithm. It is a very popular method for boosting that can be used on any classifier to present a more accurate model and improve its performance.  It can be described with the following steps: create a subset from the entire dataset, assign equal weights to the data points, create a base model using this subset, predict using this model, calculate errors from the predicted results, assign higher weights to misclassified instances to increase their chances of being selected, create another model that tries to correct these mistakes and make new predictions then repeat until the maximum number of models specified are created. The final model is the weighted average of all the weak learners created. AdaBoost is very sensitive to noisy data and outliers so it is important to remove these when using AdaBoost.

#### Gradient Boosting: 

This is another boosting algorithm that improves model performance where each model in the ensemble minimizes a loss function using gradient descent. The loss function which is used to obtain an estimate of how the model is performing, a weak learner - a model only slightly better than random guessing typically decision stumps (a decision tree with a single split - one level) and an additive model that combines the weak learners to make the final model are three important components in gradient boosting. 

#### XGBoost: 
Extreme Gradient Boosting is a supervised learning algorithm that implements gradient boosting by building trees parallely while applying regularization. It is well known for its scalability and fast execution. XGBoost can automatically identify missing values in data and it builds very deep trees before pruning for optimisation.

## Additional Reading Resources {% fn 4 %}

# Index

{{'The codes can also be found [here: Linear Classification and Logistic Regression](https://gist.github.com/HamoyeHQ/94d52ad113d1eac80d073a4affb0a490)' | fndetail: 1}}
{{'The codes can also be found [here: Measuring Classification Performance](https://gist.github.com/HamoyeHQ/bf8f7062e2acbaa48dc94993e8487b3d)' | fndetail: 2}}
{{'The codes can also be found [here: Tree-Based Methods and The Support Vector Machine](https://gist.github.com/HamoyeHQ/fb9265ee0d668480918466583d143f2f)' | fndetail: 3}}
{{'Additional Reading List and Links
[Ensemble Learning: Bagging and Boosting](https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e)
[Feature Engineering by Wale Akinfaderin.](https://www.youtube.com/watch?v=ZQ5wF7z01I0)
[Scikit-Learn Classification.](https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/)
[Gentle Introduction to XGBoost.](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)
[Learning from Imbalanced Class.](https://www.jeremyjordan.me/imbalanced-data/)
[Hands-on Machine Learning - NUMBER ONE GUIDE'](https://www.lpsm.paris/pageperso/has/source/Hand-on-ML.pdf) | fndetail: 4}}