Bagging and boosting are "meta-algorithms" that combine several machine learning models into a single predictive model in order to reduce variance or bias and improve overall performance.

Both methods work similarly and consist of 2 main steps:

1) Build different simple machine learning models on subsets of the original data.

2) Produce a new model by assembling the previous ones.






Boosting is a modeling technique that improves the performance of simple machine learning models by iteratively combining them into a more robust and powerful model.
Unlike methods that train independent models, boosting builds a sequence of weak models, each focusing on the observations mispredicted by the previous one.

The boosting process runs as follows:

1) Create a weak model, usually a simple decision rule.

2) At each iteration, build a new weak model that places more emphasis on the observations mispredicted by the previous models.

3) Combine these models into a single, more powerful model, weighting each one according to its accuracy.

Here, "weak" means a decision rule whose error rate is only slightly better than that of a purely random rule.

Each estimator is an improved version of the previous one: it gives more weight to the observations that were poorly fitted or mispredicted.
At each iteration, the evaluation of the current estimator drives a resampling of the data in which mispredicted observations receive greater weight.
The estimator built at step i thus concentrates on the observations poorly fitted by the estimator at step i − 1.
Finally, the classifiers are combined, each weighted by a coefficient reflecting its predictive performance.
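
The resampling idea described above can be illustrated with a minimal sketch. The toy data, the choice of which observations were mispredicted, and the up-weighting factor of 4 are all assumptions made for illustration; they do not correspond to the update rule of any specific boosting algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setting: 10 observations, of which the last two
# were mispredicted by the previous weak model.
n = 10
mispredicted = np.zeros(n, dtype=bool)
mispredicted[[8, 9]] = True

# Start from uniform weights, then increase the weight of mispredicted points.
weights = np.full(n, 1.0 / n)
weights[mispredicted] *= 4.0   # illustrative factor (assumption)
weights /= weights.sum()       # renormalise so that the weights sum to 1

# Draw the training sample of the next weak model, with replacement,
# with probabilities proportional to the weights.
sample_idx = rng.choice(n, size=n, replace=True, p=weights)
print(weights)
print(sample_idx)
```

Mispredicted observations are now four times more likely to be drawn than the others, so the next weak model sees them more often.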




There are many boosting algorithms.
The most popular is the AdaBoost algorithm (for Adaptive Boosting) developed by Freund & Schapire (1997).
Its operation is as follows:

1) Choose a "weak" classification rule. The idea is to apply this rule several times, judiciously assigning a different weight to the observations at each iteration.

2) Initialize the weight of each observation to $\frac{1}{n}$ ($n$ being the number of observations) for the estimation of the first model.

3) Update the weights at each iteration. The weight of an observation remains unchanged if the observation is correctly classified; otherwise, it increases by a factor that grows with the measured fitting quality of the model.

4) Aggregate the estimators obtained, each weighted by the fitting quality of its model.
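
The steps above can be sketched from scratch in a few lines. This is a simplified illustration, not the scikit-learn implementation: it assumes a binary problem with labels in {-1, +1}, uses depth-1 decision trees as the weak rule, and runs on synthetic data. The function names `adaboost_fit` and `adaboost_predict` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=30):
    """Discrete AdaBoost sketch for labels in {-1, +1}, with depth-1 trees."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # initial weights: 1/n each
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.sum(w[miss])               # weighted error (w sums to 1)
        if err >= 0.5:                      # no better than chance: stop
            break
        safe_err = max(err, 1e-10)          # avoid dividing by zero
        alpha = np.log((1.0 - safe_err) / safe_err)  # fitting quality
        models.append(stump)
        alphas.append(alpha)
        if err == 0:                        # perfect rule: nothing left to reweight
            break
        w[miss] *= np.exp(alpha)            # only misclassified weights increase
        w /= w.sum()                        # renormalise
    return models, alphas


def adaboost_predict(models, alphas, X):
    """Weighted vote of the weak rules."""
    scores = np.zeros(X.shape[0])
    for model, alpha in zip(models, alphas):
        scores += alpha * model.predict(X)
    return np.where(scores >= 0, 1, -1)


# Synthetic binary data (assumption), with labels recoded to {-1, +1}
X_toy, y01 = make_classification(n_samples=500, n_features=10,
                                 n_informative=5, random_state=0)
y_toy = 2 * y01 - 1

models, alphas = adaboost_fit(X_toy, y_toy)
train_acc = np.mean(adaboost_predict(models, alphas, X_toy) == y_toy)
print(f"{len(models)} weak rules, training accuracy = {train_acc:.3f}")
```

Each coefficient `alpha` is positive as long as the weak rule beats chance, and it is larger for more accurate rules, which is what gives them more say in the final vote.
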
The sklearn.ensemble package implements AdaBoost for multi-class classification, notably through the AdaBoostClassifier class, which by default uses a simple decision tree (of depth 1) as its base classification rule.

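In the following exercise, we will use the Letter Recognition dataset ('letter-recognition.csv'), which contains features extracted from images, each representing one of the 26 capital letters of the Latin alphabet, as well as the 'letter' column containing the respective letter. In the code below, the dataset is fetched from the UCI Machine Learning Repository with the ucimlrepo package.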

In [1]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import pandas as pd
import numpy as np

In [3]:
from ucimlrepo import fetch_ucirepo

# Fetch the Letter Recognition dataset (id=59) from the UCI repository
letter_recognition = fetch_ucirepo(id=59)

# Features and targets (as pandas DataFrames)
X = letter_recognition.data.features
y = letter_recognition.data.targets

# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


In [5]:
# Baseline: a single shallow decision tree
dtc = DecisionTreeClassifier(max_depth=5)
dtc.fit(X_train, y_train)

In [6]:
dtc.score(X_test, y_test)

0.36083333333333334

In [11]:
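# Flatten the target DataFrame into a 1-D array, as scikit-learn expects
y = y.values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Boost 400 copies of the depth-5 tree.
# Note: depending on the scikit-learn version, the 'algorithm' argument may be deprecated.
ac = AdaBoostClassifier(estimator=dtc, n_estimators=400, algorithm='SAMME')
ac.fit(X_train, y_train)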

In [12]:
y_pred = ac.predict(X_test)
pd.crosstab(y_test, y_pred)

Confusion matrix (rows: true letters in y_test, columns: predicted letters; output truncated):
row_0 \ col_0,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
A,225,0,0,0,0,0,1,0,0,1,...,0,1,1,1,0,0,0,0,2,0
B,0,201,0,2,1,0,0,1,0,0,...,0,8,3,0,0,6,0,1,0,0
C,0,0,200,0,6,0,10,0,0,0,...,0,2,0,5,1,0,0,0,0,0
D,0,6,0,226,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
E,0,1,3,0,211,0,5,0,0,0,...,0,0,3,1,0,0,0,12,0,0
F,0,3,0,0,2,212,1,2,0,0,...,0,0,2,2,0,1,0,0,0,0
G,0,1,13,4,4,1,178,1,0,0,...,1,0,1,1,0,5,0,0,0,0
H,0,3,0,15,0,0,1,154,0,0,...,1,9,1,0,0,2,0,1,4,0
I,0,0,0,1,0,3,0,0,163,8,...,2,0,4,0,0,0,0,43,0,0
J,0,4,0,1,0,9,0,2,10,201,...,0,0,0,0,0,0,0,1,0,0
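
A confusion matrix of this size is hard to read at a glance, so it can help to summarise it with a couple of scalar metrics. The sketch below is self-contained and therefore uses scikit-learn's bundled digits dataset as a stand-in (an assumption; it is not the letter dataset used above), with a smaller ensemble to keep it fast. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 8x8 digit images, 10 classes
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Weak rule: a shallow tree, boosted 50 times
weak = DecisionTreeClassifier(max_depth=3, random_state=0)
boosted = AdaBoostClassifier(weak, n_estimators=50, random_state=0)
boosted.fit(Xtr, ytr)
pred = boosted.predict(Xte)

# Overall accuracy and class-averaged F1 score
acc = accuracy_score(yte, pred)
macro_f1 = f1_score(yte, pred, average="macro")
print(f"accuracy = {acc:.3f}, macro-F1 = {macro_f1:.3f}")
```

The same two calls can be applied to the `y_test` and `y_pred` arrays of the notebook to obtain comparable figures for the letter dataset.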


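This technique is suitable for multi-class problems. Because it concentrates on the observations that are hardest to classify, it can fit atypical observations well, but this same property makes it sensitive to noisy data and outliers.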

The term Bagging is a contraction of Bootstrap Aggregating. It covers a set of methods introduced by Leo Breiman (1996) that aim to reduce the variance and increase the stability of Machine Learning algorithms used for classification or regression.

The Bagging method trains a model on different subsets of the data, each the same size as the original sample, drawn with the Bootstrap technique, that is, by random sampling with replacement. This yields a set of estimators trained independently of one another, which are then aggregated, or "bagged", into a meta-model using majority voting for classification or averaging for regression.
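
The procedure just described can be sketched by hand in a few lines. This is a minimal illustration under stated assumptions (synthetic binary data, fully grown decision trees, 25 models), not the BaggingClassifier implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary data with labels 0/1 (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)
n = len(y_toy)

n_models = 25   # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: n draws with replacement from the original sample
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Aggregate by majority vote: predict 1 when more than half the trees say 1
votes = np.stack([tree.predict(X_toy) for tree in trees])   # shape (n_models, n)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

train_acc = np.mean(bagged_pred == y_toy)
print(f"training accuracy of the bagged trees = {train_acc:.3f}")
```

For a regression problem, the vote would simply be replaced by the mean of the individual predictions.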

Unlike Boosting, choosing a large number of estimators in Bagging does not incur additional risk of overfitting.

Indeed, as the number of estimators grows, the bias of the final model stays close to the average of the individual biases, while the variance decreases, and decreases all the more when the aggregated estimators are weakly correlated. It is therefore advisable to use as many estimators as the time you can allocate to training permits.
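
A standard way to make this precise is the variance of an average of correlated estimators. The symbols below are introduced here for illustration and rest on a simplifying assumption: the $B$ estimators $\hat f_b$ are identically distributed, each with variance $\sigma^2$, and any two of them have correlation $\rho$.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term vanishes as $B$ grows, which is why adding estimators does not hurt, while the first term is a floor that only gets lower when the estimators are less correlated. Under the same assumption, the expectation of the average equals the expectation of a single estimator, so the bias is left unchanged.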

The prediction error generally reported for Bagging methods is the so-called Out-Of-Bag (OOB) error: for each observation, we compute the average error over all the models trained on a bootstrap sample that does not contain it. Because each observation is evaluated only by models that never saw it, this estimates the generalization error without a separate validation set and helps detect overfitting.
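
To make the OOB idea concrete, here is a hand-written sketch on synthetic binary data (an assumption). It uses a common variant of the definition above: rather than averaging per-model errors, it aggregates the votes of the models that did not see an observation and scores the resulting prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic binary data with labels 0/1 (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=1)
n, n_models = len(y_toy), 50

# oob_votes[i, c] counts the votes for class c on observation i,
# cast only by trees whose bootstrap sample did not contain i.
oob_votes = np.zeros((n, 2))
for _ in range(n_models):
    idx = rng.integers(0, n, size=n)            # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)       # observations left out of it
    tree = DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx])
    oob_votes[oob, tree.predict(X_toy[oob])] += 1

# Score only the observations that were out of bag at least once
has_vote = oob_votes.sum(axis=1) > 0
oob_pred = oob_votes.argmax(axis=1)
oob_accuracy = np.mean(oob_pred[has_vote] == y_toy[has_vote])
print(f"OOB accuracy = {oob_accuracy:.3f}")
```

With a few dozen models, almost every observation is out of bag for at least one tree, so nearly the whole sample contributes to the estimate.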

The BaggingClassifier class from the sklearn.ensemble package creates a classifier based on the Bagging algorithm, using decision trees as base estimators by default.

In [13]:
from sklearn.ensemble import BaggingClassifier

# Bag 1000 decision trees (the default base estimator) and compute the OOB score
bc = BaggingClassifier(n_estimators=1000, oob_score=True)
bc.fit(X_train, y_train)
bc.oob_score_


0.9474285714285714

In [14]:
bc.score(X_test, y_test)


0.942

In [15]:
y_pred = bc.predict(X_test)
pd.crosstab(y_test, y_pred)


Confusion matrix (rows: true letters in y_test, columns: predicted letters; output truncated):
row_0 \ col_0,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
A,227,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
B,0,212,0,0,1,1,2,1,0,0,...,0,3,1,0,0,2,0,2,0,0
C,0,0,213,0,4,0,2,0,0,0,...,1,1,1,2,1,0,0,0,0,0
D,0,2,0,234,0,0,0,1,0,0,...,1,0,0,0,0,0,0,2,0,0
E,0,1,1,0,220,2,2,0,0,0,...,5,0,5,0,0,0,0,1,0,4
F,0,2,0,1,0,230,0,1,1,0,...,0,0,0,1,0,0,1,0,0,0
G,0,3,1,3,4,0,200,0,0,0,...,0,0,1,0,0,1,1,0,0,0
H,2,3,0,3,0,0,1,188,0,0,...,0,7,1,0,1,1,0,0,0,1
I,0,2,0,1,1,4,0,0,206,4,...,3,0,2,0,0,0,0,1,0,2
J,0,0,0,0,1,3,0,2,13,207,...,0,0,1,0,0,0,0,0,0,1


The results obtained with Bagging (a test accuracy of about 0.94) are significantly better than those of the earlier models, in particular the single decision tree (about 0.36).

Bagging is a simple and robust ensemble method that reduces variance when predictors are unstable. Its Out-Of-Bag error estimate, built from the Bootstrap samples, provides a check against overfitting without requiring a separate validation set.

Bagging is well suited to high-variance algorithms, notably neural networks and decision trees, which it stabilizes.
However, it can degrade the performance of more stable algorithms, such as the k-nearest neighbors method or linear regression.



In conclusion, Bagging and Boosting are similar in essence but quite different in form.
Indeed, Bagging and Boosting:

- are both ensemble methods that produce N estimators to obtain a single one, but while the estimators are independent for Bagging, Boosting creates models that iteratively improve by focusing on where previous models have failed.
- both generate different datasets through resampling, but while resampling is completely random for Bagging, Boosting calculates weights to select the observations that are most difficult to predict at each step.
- both determine the final decision by majority voting or averaging over the N estimators, but the estimators are equally weighted for Bagging, and weighted by coefficients reflecting their performance for Boosting.
- are both effective at reducing variance and providing greater stability, but only Boosting attempts to reduce bias, while Bagging is better at avoiding the overfitting that Boosting can sometimes create.

In the next notebook, we will explore Random Forests in scikit-learn!