# Ensemble Models

Ensemble models are a class of machine learning techniques that combine multiple individual models (often called "weak learners") to create a stronger model. This technique can improve the performance of predictive models by reducing bias and variance. There are 2 types of Ensemble learning algorithms:
## 1. Bagging
These are variance reduction models commonly used for overfitting models. Bagging, or **Bootstrap Aggregation** takes out random samples from the data and builds each model taking a bootstrap sample of the data as training example. Prediction is done by taking the mode of the labels from each model in case of classification or average of the target values in case of regression models.

## 2. Boosting
This is a Bias reduction model that combines weak learners to get a reliable learning algorithm. It works by sequentially training models, where each subsequent model corrects the errors made by the previous one. In each iteration, the algorithm assigns a weight to the data points that were misclassified by the previous model, ensuring that the new model now focuses on those instances more, hece fitting the data better.

Mathematically, boosting creates a final model $H(x)$ as a weighted sum of the individual weak models $h_t(x)$:

$$
H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)
$$
where $T$ is the number of models and $\alpha_t$ is the weight of each model $h_t(x)$.

## AdaBoost

AdaBoost, or **Adaptive Boosting** is one of the most popular boosting algorithms. It changes weights of the datapoints after training each tree, giving more weight to instances that were misclassified by previous models. In AdaBoost, each model is assigned a weight based on its accuracy, and the final prediction is made by taking a weighted majority vote (for classification) or a weighted sum (for regression).

### Algorithm

#### 1. Initialize weights
Start with equal weights for all training samples: 
   
   $$ w_1 = w_2 = \dots = w_m = \frac{1}{m} $$

   where $m$ is the total number of training examples.

#### 2. Train weak model
Train a weak model $h_t(x)$ using the weighted datapoints. (For decision trees, they are made weak learners by forcing the model to make only one split.)

#### 3. Calculate error
Compute the error rate $err_t$ of the model $h_t(x)$ on the weighted dataset:

   $$ err_t = \frac{\sum_{i=1}^{m} w_i \cdot I(y_i \neq h_t(x_i))}{\sum_{i=1}^{m} w_i} $$

   Where $I$ is the indicator function.
   
#### 4. Compute model weight
Calculate the weight $\alpha_t$ of the model $h_t(x)$:

   $$ \alpha_t = \frac{1}{2} \ln\left(\frac{1-err_t}{err_t}\right) $$

#### 5. Update sample weights

   $$ w_{i} := w_i \cdot \exp\left(\alpha_t \cdot I(y_i \neq h_t(x_i))\right) $$

#### 6. Prediction

   $$ H(x) = \text{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) $$


## Implementation

The following is an implementation of the Adaboost Algorithm on the Breast Cancer dataset.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
pd.set_option('future.no_silent_downcasting', True)
import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import mode
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import *

### Creating Weighted Decision Tree

For a weighted binary decision tree, the gini impurity is:
$$
G(S) = 1 - \sum_{c \in \{S_l,S_r\}} \left( \frac{\sum_{s \in S_c} w_s}{\sum_{s \in S} w_s} \right)^2,
$$
The information Gain:
$$
IG(S) = G(S) - \left( \frac{\sum_{s \in S_l} w_s}{\sum_{s \in S} w_s} G(S_l) + \frac{\sum_{s \in S_r} w_s}{\sum_{s \in S} w_s} G(S_r) \right)
$$

In [2]:
class Node:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, label=None):
        self.j = feature_index
        self.t = threshold
        self.l = left
        self.r = right
        self.label = label

class Weighted_DT:
    def __init__(self, max_depth = 5):
        self.max_d = max_depth
        
    def gini(self, y, mask = True):
        classes = np.unique(y)
        w = []
        for c in classes:
            w.append(np.sum(self.w[mask][y[mask]==c])/np.sum(self.w[mask]))
        w = np.array(w)
        return 1-np.sum(w**2)
        
    def info_gain(self, y, lm, rm):
        return self.gini(y)-((np.sum(self.w[lm]))/np.sum(self.w) * self.gini(y,lm) + (np.sum(self.w[rm]))/np.sum(self.w) * self.gini(y,rm))
    
    def split(self,X,j,t):
        l_mask = X[:,j] <= t
        return l_mask, ~l_mask
    
    def best_split(self,X,y):
        best_j, best_t = None, None
        max_ig = 0
        
        for j in range(X.shape[1]):
            possible_t = X[:,j]
            
            for t in possible_t:
                lm,rm = self.split(X,j,t)
                l, r = y[lm],y[rm]
                if len(l)==0 or len(r)==0:
                    continue
                
                ig = self.info_gain(y, lm, rm)
                
                if ig > max_ig:
                    max_ig = ig
                    best_j = j
                    best_t = t
        return best_j, best_t

    def make_tree(self,X,y, curr_d=0):
        if len(np.unique(y)) == 1 or curr_d >= self.max_d or len(y) < 5:
            majority_class = np.bincount(y).argmax()
            return Node(label=majority_class)
            
        j,t = self.best_split(X,y)
        
        if j==None:
            majority_class = np.bincount(y).argmax()
            return Node(label=majority_class)
        
        lm,rm = self.split(X,j,t)
        l = self.make_tree(X[lm],y[lm],curr_d+1)
        r = self.make_tree(X[rm],y[rm],curr_d+1)
        return Node(j,t,l,r)

    def train(self, X, y, w=None):
        if w is None:
            self.w = np.full((X.shape[0],),1/X.shape[0])
        else:
            self.w = w
        self.root = self.make_tree(X,y)
    
    def predict_single(self, x, node):
        if node.label is not None:
            return node.label
        if x[node.j] <= node.t:
            return self.predict_single(x,node.l)
        else:
            return self.predict_single(x,node.r)

    def predict(self, X_test):
        return np.array([self.predict_single(x, self.root) for x in X_test])

### Implementing AdaBoost

In [3]:
class Adaboost:
    def __init__(self, tree_ct=10):
        self.T = tree_ct
        self.trees = []
        self.alphas = []


    def train(self,X,y):
        self.w = np.full((X.shape[0],),1/X.shape[0])
        for i in range(self.T):
            tree = Weighted_DT(max_depth=1)
            tree.train(X,y,self.w)

            pred = tree.predict(X)
            err = np.sum(self.w * (y!=pred))/np.sum(self.w)
            alpha = np.log((1 - err)/err)
            
            self.alphas.append(alpha)
            self.trees.append(tree)
            
            self.w = self.w * np.exp(alpha * (y!=pred))

    def predict(self,X):
        predictions = np.empty(X.shape[0])
        for tree, alpha in zip(self.trees, self.alphas):
            predictions += tree.predict(X) * alpha
        return predictions > 0

### Loading Data

In [4]:
data = load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [5]:
dat = df.to_numpy()

np.random.shuffle(dat)
X = dat[:450,:-1]
y = dat[:450,-1].astype(int)
y_test = dat[450:,-1].astype(int)
X_test = dat[450:,:-1]

### Training the Model

In [6]:
ada = Adaboost(tree_ct=40)
ada.train(X,y)

### Prediction and Testing

In [7]:
predictions = ada.predict(X_test)

print("Confusion Matrix:\n",confusion_matrix(y_test, predictions), end = "\n\n")

accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Confusion Matrix:
 [[21 27]
 [ 0 71]]

Accuracy: 0.773109243697479
Precision: 0.7244897959183674
Recall: 1.0
F1 Score: 0.8402366863905325
