# Introduction

Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. In boosting, a random sample of data is selected and fitted with a model, and the models are trained sequentially; that is, each model tries to compensate for the weaknesses of its predecessor. With each iteration, the weak rules from the individual classifiers are combined to form one strong prediction rule.

If the weak learners are instead trained independently and in parallel, the method is called bagging. Bagging and boosting are the two main types of ensemble learning methods.
AdaBoost, short for Adaptive Boosting, is a prominent boosting algorithm widely used for classification tasks.



The steps in AdaBoost are:

**1- Initialize the observation weights:**
 Begin by assigning equal weights, such that w_i = 1/N for i = 1, 2, ..., N.

 At the beginning, all points have equal weights. Note that the sample weights always sum to 1, so each individual weight always lies between 0 and 1.
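
As a minimal sketch of this initialization, assuming a hypothetical training set of N = 5 samples, NumPy can build the uniform weight vector directly:

```python
import numpy as np

N = 5  # hypothetical number of training samples
w = np.full(N, 1.0 / N)  # every sample starts with weight 1/N

print(w)  # five equal weights of 0.2
print(np.isclose(w.sum(), 1.0))  # the weights sum to 1 (up to rounding)
```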


**2- Iteration:**


 **2.1- Fitting the Classifier:** Train a classifier using the current weights w_i

 **2.2- Computing the Error:** Calculate the error rate ε as the weighted sum of incorrectly classified samples divided by the sum of observation weights:
 ε = Σ(w_i * I(sample i is misclassified)) / Σ(w_i)

 **2.3- Determining Significance (α):** α is the importance (influence) of this model in the final classification. It decreases as the error grows: a model with ε close to 0 gets a large positive α, while a model no better than chance (ε = 0.5) gets α = 0.
 α = 0.5 * ln((1 - ε) / ε)


 **2.4- Updating Weights:** Reduce the weights of correctly classified training samples and increase the weights of misclassified ones, then renormalize so that the weights again sum to 1. This makes the algorithm focus on the previously misclassified points in the next round.
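
To make steps 2.2 to 2.4 concrete, here is a small worked round on hypothetical toy labels. The update `w_i * exp(-α * y_i * G(x_i))` is the standard exponential form of the rule, which the text above describes only in words; labels are assumed to be in {-1, +1}.

```python
import numpy as np

# Hypothetical toy round: 5 samples, the weak learner misses sample 1
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])
w = np.full(5, 0.2)  # uniform starting weights

# Step 2.2: weighted error rate (about 0.2 here)
incorrect = (y_pred != y_true)
eps = np.sum(w * incorrect) / np.sum(w)

# Step 2.3: significance (0.5 * ln(4), about 0.693)
alpha = 0.5 * np.log((1 - eps) / eps)

# Step 2.4: shrink correct samples, grow misclassified ones, renormalize
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()

print(w)  # approximately [0.125, 0.5, 0.125, 0.125, 0.125]
```

After one round the single misclassified sample carries half of the total weight, so the next weak learner is pushed hard to get it right.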


**3- Generating output:**
 The final prediction is the sign of the sum of all models' predictions, each multiplied by its significance:
 G(x) = sign[Σ_m α_m * G_m(x)]
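
The weighted vote can be sketched in a few lines. The predictions and significances below are hypothetical values chosen for illustration, not results from this project.

```python
import numpy as np

# Hypothetical predictions of 3 weak learners (rows) on 4 samples (columns)
preds = np.array([
    [ 1,  1, -1, -1],
    [ 1, -1, -1,  1],
    [-1,  1, -1,  1],
])
alphas = np.array([0.9, 0.4, 0.3])  # hypothetical significances

# Weighted sum per sample, then the sign gives the final label
G = np.sign(alphas @ preds)
print(G)  # [ 1.  1. -1. -1.]
```

On the last sample, two learners vote +1 but the first learner's significance (0.9) outweighs their combined 0.7, so the final label is -1.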


This is how the combination of error, alpha, and weights is used to create iteratively better learners.
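
Putting the steps together, the following is a minimal sketch of the whole loop on a hypothetical one-dimensional toy problem. It uses threshold "stumps" as weak learners; `best_stump` is an illustrative helper, not part of any library and not the implementation used later in this notebook.

```python
import numpy as np

def best_stump(x, y, w):
    """Return (error, threshold, polarity) of the lowest weighted-error stump."""
    best = (np.inf, None, None)
    for t in np.arange(x.min() - 0.5, x.max() + 1.0, 1.0):
        for polarity in (1, -1):
            # Predict `polarity` left of the threshold, `-polarity` right of it
            pred = np.where(x < t, polarity, -polarity)
            err = np.sum(w * (pred != y))
            if err < best[0]:
                best = (err, t, polarity)
    return best

# Hypothetical toy data: no single threshold separates the classes
x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

w = np.full(len(x), 1.0 / len(x))  # step 1: uniform weights
stumps = []
for _ in range(3):  # step 2: three boosting rounds
    err, t, polarity = best_stump(x, y, w)
    alpha = 0.5 * np.log((1 - err) / err)
    pred = np.where(x < t, polarity, -polarity)
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()
    stumps.append((alpha, t, polarity))

# Step 3: sign of the significance-weighted vote
score = sum(a * np.where(x < t, p, -p) for a, t, p in stumps)
train_accuracy = np.mean(np.sign(score) == y)
print(train_accuracy)
```

On this toy set the best single stump gets 7 of 10 points right, while the three-stump vote classifies all ten training points correctly.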


The goal of this project is to build an AdaBoost implementation in Python from scratch, using Decision Trees, SVMs (Support Vector Machines), and Logistic Regression as base learners. We will then compare its performance, measured by accuracy, against both a standalone Decision Tree and the AdaBoost implementation available in scikit-learn.

---







### References
https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning

https://www.ibm.com/topics/boosting

https://statchaitya.github.io/adaboostclassifier/

https://blog.paperspace.com/adaboost-optimizer/#:~:text=Alpha%20is%20how%20much%20influence,ranging%20from%200%20to%201.


https://xavierbourretsicotte.github.io/AdaBoost.html

# Data set

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The "target" field refers to the presence of heart disease in the patient. It is integer valued: 0 = no disease and 1 = disease.

Columns
1- age

2- sex

3- chest pain type (4 values)

4- resting blood pressure

5- serum cholesterol in mg/dl

6- fasting blood sugar > 120 mg/dl

7- resting electrocardiographic results (values 0,1,2)

8- maximum heart rate achieved

9- exercise induced angina

10- oldpeak = ST depression induced by exercise relative to rest

11- the slope of the peak exercise ST segment

12- number of major vessels (0-3) colored by fluoroscopy

13- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.


### Reference
https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

# Code

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Replace with the path to your CSV file
csv_path = '/content/drive/MyDrive/Colab Notebooks/heart.csv'

# Read CSV file into a DataFrame
data = pd.read_csv(csv_path)

In [4]:
data.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
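
Several of these columns, such as `cp` (chest pain type), hold category codes rather than quantities. As a rough sketch of how such a column can be one-hot encoded with `pd.get_dummies`, and how a 0/1 target can be mapped to -1/+1, consider this hypothetical miniature frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature frame with one categorical column and a 0/1 target
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "cp": ["0", "1", "2", "1"],  # category codes stored as strings
    "target": [1, 0, 1, 0],
})

# One indicator column per category; drop_first=True omits the first level ("0")
encoded = pd.get_dummies(df, drop_first=True)
print(list(encoded.columns))  # ['age', 'target', 'cp_1', 'cp_2']

# Map the 0/1 target to -1/+1
y = 2 * encoded["target"].values - 1
print(y)  # [ 1 -1  1 -1]
```

Dropping the first level loses no information: a row with `cp_1` and `cp_2` both false must belong to category "0".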

In [9]:


file_path = '/content/drive/MyDrive/Colab Notebooks/heart.csv'

class Adaboost:

    def __init__(self,file_name):
        self.file_name = file_name

    def read_and_organize_data(self):

        # Prefer the path passed to the constructor; fall back to the
        # notebook-level file_path if that file does not exist.
        full_path = self.file_name if os.path.exists(self.file_name) else file_path
        dt = pd.read_csv(full_path)

        # Convert the categorical columns to non-numeric dtypes so that get_dummies encodes them.
        # The lines below show four equivalent ways to achieve this:
        # astype('category'), astype(str), apply(str) and map(str).
        dt['cp'] = dt['cp'].astype('category')
        dt['fbs'] = dt['fbs'].astype(str)
        dt['restecg'] = dt['restecg'].apply(str)
        dt['exang'] = dt['exang'].map(str)
        dt['slope'] = dt['slope'].astype(str)
        dt['thal'] = dt['thal'].astype(str)

        # get_dummies converts categorical variables into dummy/indicator variables.
        # It creates a new DataFrame with binary columns for each category in the specified column(s);
        # drop_first=True drops the first level of each variable, which is redundant.
        dt = pd.get_dummies(dt, drop_first=True)


        # Extract the target values and convert them from {0, 1} to {-1, +1}.
        # This is binary classification; with -1/+1 labels a sample is misclassified
        # exactly when label * prediction is negative, and the final vote is a simple sign.
        y_vals = dt['target'].values
        y_vals = 2*y_vals-1
        data = dt.drop('target',axis=1).values

        self.x_train, self.x_test, self.y_train, self.y_test = \
            train_test_split(data,y_vals , test_size = .25)
        return self.x_train, self.x_test, self.y_train, self.y_test


    def fit(self,num_estimators=100,base_model = DecisionTreeClassifier,**params):

        # read_and_organize_data() must be called first so that
        # self.x_train and self.y_train exist.
        num_samples = len(self.x_train)

        self.estimator_list, self.estimator_weight_list = [], []

        # Initialize a weight vector with equal weight for each sample in the training set.
        sample_weight = np.ones(num_samples) / num_samples

        for i in range(num_estimators):

            estimator = base_model(**params)

            estimator.fit(self.x_train, self.y_train, sample_weight=sample_weight)
            y_predict = estimator.predict(self.x_train)

            # Boolean mask of misclassified training samples
            incorrect = (y_predict != self.y_train)

            # Weighted error rate of this estimator
            estimator_error =  np.average(incorrect, weights=sample_weight, axis=0)
            # Guard against log(0) and division by zero when the error is exactly 0 or 1
            estimator_error = np.clip(estimator_error, 1e-10, 1 - 1e-10)

            # Significance. This omits the 0.5 factor from the introduction and pairs with the
            # one-sided update below: the normalized sample weights come out the same, and every
            # significance is doubled, which leaves the sign of the final vote unchanged.
            significance =  np.log((1. - estimator_error) / estimator_error)

            # Update weights: multiply the weights of misclassified samples by exp(significance);
            # correctly classified samples are multiplied by exp(0) = 1.
            sample_weight *= np.exp(significance * incorrect)
            # Normalize the sample weights so they sum to 1
            sample_weight /= sample_weight.sum()

            # Append the current estimator and its significance
            self.estimator_list.append(estimator)
            self.estimator_weight_list.append(significance)


        # Reshape the list of estimator weights into a column vector of shape (num_estimators, 1).
        self.estimator_weight_list = np.array(self.estimator_weight_list).reshape(-1,1)

    def predict(self):

        # Predictions of every estimator on the test set, shape (num_estimators, num_test_samples)
        y_test_pred_list = [model.predict(self.x_test) for model in self.estimator_list]
        y_test_pred_list = np.asarray(y_test_pred_list)

        # Transpose so that each row holds one sample's predictions, multiply by the column of
        # estimator weights, and take the sign to get a final label of +1 or -1.
        # ravel() flattens the (num_test_samples, 1) result into a 1D label array.
        preds = np.sign(y_test_pred_list.T@self.estimator_weight_list).ravel()

        # Calculate accuracy; accuracy_score expects (y_true, y_pred)
        accuracy = accuracy_score(self.y_test,preds)
        return preds,accuracy
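
The `fit` method above computes the significance as ln((1 - ε) / ε) and rescales only the misclassified samples, whereas the introduction uses α = 0.5 * ln((1 - ε) / ε) with an update that also shrinks the correct samples. A quick numerical check on hypothetical toy values suggests the two forms yield the same normalized sample weights:

```python
import numpy as np

# Hypothetical toy round: labels in {-1, +1}, two of six samples misclassified
y_true = np.array([1, -1, 1, 1, -1, -1])
y_pred = np.array([1, -1, -1, 1, 1, -1])
w = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1])

incorrect = (y_pred != y_true)
eps = np.average(incorrect, weights=w)

# Form used in the introduction: symmetric update with the 0.5 factor
alpha = 0.5 * np.log((1 - eps) / eps)
w_intro = w * np.exp(-alpha * y_true * y_pred)
w_intro /= w_intro.sum()

# Form used in fit(): no 0.5 factor, only misclassified samples are rescaled
significance = np.log((1 - eps) / eps)
w_code = w * np.exp(significance * incorrect)
w_code /= w_code.sum()

print(np.allclose(w_intro, w_code))  # True
```

In both forms the ratio between the factor applied to a misclassified sample and the factor applied to a correct one is (1 - ε) / ε, so normalization removes the difference. The significances differ by a constant factor of 2, which does not change the sign of the final weighted vote.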




In [11]:
import warnings
warnings.filterwarnings("ignore")

Running predictions for all the different models

In [20]:
if __name__ == '__main__':

    file_name = file_path  # reuse the CSV path defined above instead of an unrelated 'heart.xls'
    adaboost_obj = Adaboost(file_name)
    x_train, x_test, y_train, y_test = adaboost_obj.read_and_organize_data()

    my_dct = DecisionTreeClassifier(max_depth=6)
    my_dct.fit(x_train,y_train)
    dct_score = my_dct.score(x_test,y_test)
    print('Decision tree accuracy: ',dct_score)

    my_adaboost = AdaBoostClassifier()
    my_adaboost.fit(x_train,y_train)
    adaboost_score = my_adaboost.score(x_test,y_test)
    print('Scikit learn adaboost accuracy: ',adaboost_score)

    adaboost_obj.fit(50,DecisionTreeClassifier,max_depth=1)
    preds,adaboost_score = adaboost_obj.predict()
    print('Adaboost accuracy DecisionTree: ',adaboost_score)

    adaboost_obj.fit(100,SVC,kernel='linear')
    preds,adaboost_score_with_svm = adaboost_obj.predict()
    print('Adaboost accuracy with SVC: ',adaboost_score_with_svm)

    adaboost_obj.fit(100,LogisticRegression,C=1000)
    preds,adaboost_score_with_logistic = adaboost_obj.predict()
    print('Adaboost accuracy with logistic: ',adaboost_score_with_logistic)






Decision tree accuracy:  0.6447368421052632
Scikit learn adaboost accuracy:  0.7236842105263158
Adaboost accuracy DecisionTree:  0.8026315789473685
Adaboost accuracy with SVC:  0.7894736842105263
Adaboost accuracy with logistic:  0.8026315789473685


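Compared to a single Decision Tree (0.64), AdaBoost gives higher accuracy on this test split: 0.72 with scikit-learn's AdaBoostClassifier and 0.79 to 0.80 with the from-scratch implementation. Because train_test_split is called without a fixed random_state, these figures come from one random split and will vary between runs; on this run, the algorithm developed from scratch did at least as well as the scikit-learn version.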
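
Accuracy measured on a single random split of a small data set can shift noticeably from run to run, so one way to make such a comparison more stable is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the heart-disease data), so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 300 samples, 13 features
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=6, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Fixing `random_state` also makes the comparison reproducible, which the single-split run above is not.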