# Boosting Exercise

In this exercise, you will learn about boosting, an ensemble method used primarily to reduce bias, and also variance, in supervised learning. It combines multiple weak learners into a single strong learner. The learners are trained sequentially, each one trying to correct the errors of its predecessor.
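
To make the "each learner corrects its predecessor" idea concrete, here is a minimal from-scratch sketch of AdaBoost, one common boosting algorithm. It assumes labels in {-1, +1}, a single numeric feature, and threshold "stumps" as weak learners. The toy data and the helper names (`best_stump`, `adaboost`, `predict`) are illustrative and are not part of the exercise.

```python
import math

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best threshold stump."""
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        for polarity in (1, -1):
            # The stump predicts `polarity` left of the threshold, `-polarity` right of it.
            preds = [polarity if x < threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote strength
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x < threshold else -polarity for x in xs]
        # Misclassified samples gain weight, so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (pol if x < thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, n_rounds=3)
training_errors = [
    sum(predict(ensemble[:m], x) != y for x, y in zip(xs, ys))
    for m in range(1, len(ensemble) + 1)
]
print(training_errors)  # prints [3, 3, 0]
```

No single stump gets fewer than three of the ten points wrong, but the weighted vote of three stumps classifies all of them correctly.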

## Dataset
We will use the Breast Cancer dataset for this exercise. This dataset contains features computed from digitized images of breast masses and is used to predict whether a mass is malignant or benign. **Feel free to use another dataset!**
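
If the Kaggle CLI is not available, scikit-learn ships a bundled copy of the Wisconsin diagnostic breast cancer data. The sketch below assumes that copy is an acceptable substitute for the Kaggle CSV. Note two differences: the column names are spelled differently (for example `mean radius` rather than `radius_mean`), and the target is encoded as 0 = malignant, 1 = benign, which is the opposite of the `diagnosis_M` label used later in this notebook. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays.
bunch = load_breast_cancer(as_frame=True)
X_alt = bunch.data      # numeric feature columns
y_alt = bunch.target    # 0 = malignant, 1 = benign

# Re-encode so that 1 means malignant, matching a diagnosis_M-style label.
y_malignant = (y_alt == 0).astype(int)

print(X_alt.shape)
print(list(bunch.target_names))
print(int(y_malignant.sum()), "malignant samples")
```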

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement boosting models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.

## AdaBoost Tutorial


### Step 1: Import Required Libraries
First, import the necessary libraries for data manipulation, model training, and evaluation.

In [39]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Step 2: Load and Preprocess the Dataset
Load the dataset and preprocess it. Here, preprocessing means dropping the empty `Unnamed: 32` column and one-hot encoding the categorical `diagnosis` column; the split into features and target follows in Step 3.

In [1]:
! kaggle datasets download -d uciml/breast-cancer-wisconsin-data

Dataset URL: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
License(s): CC-BY-NC-SA-4.0
Downloading breast-cancer-wisconsin-data.zip to /content
  0% 0.00/48.6k [00:00<?, ?B/s]
100% 48.6k/48.6k [00:00<00:00, 50.9MB/s]


In [2]:
! unzip  /content/breast-cancer-wisconsin-data.zip

Archive:  /content/breast-cancer-wisconsin-data.zip
  inflating: data.csv                


In [30]:
data = pd.read_csv("/content/data.csv")

In [8]:
data.head()

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [31]:
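# "Unnamed: 32" is an empty trailing column in the CSV.
data = data.drop(columns=["Unnamed: 32"])
# One-hot encode "diagnosis"; drop_first=True keeps only the "diagnosis_M" indicator.
data = pd.get_dummies(data, drop_first=True)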
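
To see what `pd.get_dummies(..., drop_first=True)` does to the `diagnosis` column, here is a small sketch on a hypothetical three-row stand-in for the real CSV. It also shows one optional step that the cells in this notebook do not take: dropping the `id` column, which is a record identifier rather than a measurement. The accuracies reported in this notebook were obtained with `id` left in the features, so treat the last two lines as a suggestion to experiment with, not as the exercise's method.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 11.42, 20.57],
})

# drop_first=True drops the alphabetically first category ("B"),
# leaving a single indicator column named diagnosis_M.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))

# Optional: remove the identifier so a model cannot fit on it.
features = encoded.drop(columns=["id", "diagnosis_M"])
target = encoded["diagnosis_M"].astype(int)
print(list(features.columns), target.tolist())
```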

In [32]:
print(data)

           id  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302        17.99         10.38          122.80     1001.0   
1      842517        20.57         17.77          132.90     1326.0   
2    84300903        19.69         21.25          130.00     1203.0   
3    84348301        11.42         20.38           77.58      386.1   
4    84358402        20.29         14.34          135.10     1297.0   
..        ...          ...           ...             ...        ...   
564    926424        21.56         22.39          142.00     1479.0   
565    926682        20.13         28.25          131.20     1261.0   
566    926954        16.60         28.08          108.30      858.1   
567    927241        20.60         29.33          140.10     1265.0   
568     92751         7.76         24.54           47.92      181.0   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0            0.11840           0.27760         0.30010              

### Step 3: Split the Dataset
Split the dataset into training and testing sets to evaluate the performance of the models.

In [43]:
X = data.drop(columns=["diagnosis_M"])
y = data["diagnosis_M"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
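
The split above is purely random. Because the classes are imbalanced (roughly 63% benign, 37% malignant), one option is to pass `stratify=` so that both subsets keep about the same class ratio. The sketch below uses scikit-learn's bundled copy of the dataset so that it runs without the Kaggle download; that substitution and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of the data; here the target is 1 = benign, 0 = malignant.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# stratify=y_demo keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
# Fraction of benign samples overall, in the training set, and in the test set.
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```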

In [24]:
X.head()

|  | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [35]:
y.head()

|  | diagnosis_M |
|---|---|
| 0 | True |
| 1 | True |
| 2 | True |
| 3 | True |
| 4 | True |


### Step 4: Initialize and Train the AdaBoost Classifier
Initialize a Decision Tree classifier and use it as the base estimator for the AdaBoost classifier.

In [45]:
base_est = DecisionTreeClassifier()

adaboost_cls = AdaBoostClassifier(estimator=base_est, n_estimators=50)

adaboost_cls.fit(X_train, y_train)

ada_y_pred = adaboost_cls.predict(X_test)

accuracy = accuracy_score(y_test, ada_y_pred)

print(f"Adaboost classifier accuracy {accuracy * 100:.2f}%")


Adaboost classifier accuracy 94.74%
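
The cell above boosts an unconstrained `DecisionTreeClassifier`. A fully grown tree can fit the training data almost perfectly, which may leave little for later boosting rounds to correct. Boosting is more commonly described with weak learners such as depth-1 trees ("stumps"), which is also what scikit-learn uses when no estimator is passed. The sketch below tries that variant on scikit-learn's bundled copy of the dataset, so its score is not directly comparable to the 94.74% above. Also note that the `estimator=` keyword used above requires scikit-learn 1.2 or later; older releases call it `base_estimator`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled copy of the data (an assumption made so the sketch is self-contained).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# With no estimator argument, scikit-learn boosts depth-1 decision trees (stumps).
stump_boost = AdaBoostClassifier(n_estimators=50, random_state=42)
stump_boost.fit(X_tr, y_tr)

stump_accuracy = accuracy_score(y_te, stump_boost.predict(X_te))
print(f"AdaBoost with stumps accuracy {stump_accuracy * 100:.2f}%")
print(len(stump_boost.estimators_), "stumps fitted")
```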


## XGBoost Tutorial


### Step 1: Import Required Libraries
First, import the necessary libraries for data manipulation, model training, and evaluation.

In [46]:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Step 2: Load and Preprocess the Dataset
Load the dataset and preprocess it as in the AdaBoost section: drop the empty `Unnamed: 32` column and one-hot encode the categorical `diagnosis` column. The split into features and target follows in Step 3.

In [56]:
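# The archive was already downloaded and unzipped in the AdaBoost section.
# ! kaggle datasets download -d uciml/breast-cancer-wisconsin-data
# ! unzip  /content/breast-cancer-wisconsin-data.zip

# Reload the raw CSV so this cell does not depend on the earlier preprocessing;
# otherwise "Unnamed: 32" is already gone and drop() raises a KeyError.
data = pd.read_csv("/content/data.csv")
data = data.drop(columns=["Unnamed: 32"])
data = pd.get_dummies(data, drop_first=True)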

### Step 3: Split the Dataset
Split the dataset into training and testing sets to evaluate the performance of the models.

In [54]:
X = data.drop(columns=["diagnosis_M"])
y = data["diagnosis_M"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [57]:
print(X.head())
print(y.head())

         id  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302        17.99         10.38          122.80     1001.0   
1    842517        20.57         17.77          132.90     1326.0   
2  84300903        19.69         21.25          130.00     1203.0   
3  84348301        11.42         20.38           77.58      386.1   
4  84358402        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   symmetry_mean  ...  radius_worst  texture_worst  perimeter_worst  \
0         0.2419  ...         25.38          17.33 

### Step 4: Initialize and Train the XGBoost Classifier
Initialize and train the XGBoost classifier.

In [60]:
# Named xgb_cls rather than xgboost to avoid confusion with the package name.
xgb_cls = XGBClassifier()

xgb_cls.fit(X_train, y_train)

xgboost_y_pred = xgb_cls.predict(X_test)

accuracy = accuracy_score(y_test, xgboost_y_pred)

print(f"xgboost classifier accuracy {accuracy * 100:.2f}%")


xgboost classifier accuracy 95.61%
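
Accuracy is the only metric reported so far. For a malignancy classifier, a false negative (a malignant mass predicted as benign) is typically the more costly error, and accuracy alone does not show how many of those occur. The sketch below uses hypothetical labels for eight samples, not this notebook's predictions, to show how a confusion matrix, recall, and precision can be read.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for eight test samples: 1 = malignant, 0 = benign.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, ordered as in `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

demo_recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
demo_precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)

print(cm)
print(f"false negatives (missed malignant cases): {fn}")
print(f"recall {demo_recall:.2f}, precision {demo_precision:.2f}")
```

The same calls can be applied to `y_test` and any of the `*_y_pred` arrays in this notebook.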


## Gradient Boosting Tutorial


### Step 1: Import Required Libraries
First, import the necessary libraries for data manipulation, model training, and evaluation.

In [61]:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Step 2: Load and Preprocess the Dataset
Load the dataset and preprocess it as in the AdaBoost section: drop the empty `Unnamed: 32` column and one-hot encode the categorical `diagnosis` column. The split into features and target follows in Step 3.

In [62]:
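# The archive was already downloaded and unzipped in the AdaBoost section.
# ! kaggle datasets download -d uciml/breast-cancer-wisconsin-data
# ! unzip  /content/breast-cancer-wisconsin-data.zip

# Reload the raw CSV so this cell does not depend on the earlier preprocessing;
# otherwise "Unnamed: 32" is already gone and drop() raises a KeyError.
data = pd.read_csv("/content/data.csv")
data = data.drop(columns=["Unnamed: 32"])
data = pd.get_dummies(data, drop_first=True)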

### Step 3: Split the Dataset
Split the dataset into training and testing sets to evaluate the performance of the models.

In [63]:
X = data.drop(columns=["diagnosis_M"])
y = data["diagnosis_M"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 4: Initialize and Train the Gradient Boosting Classifier
Initialize and train the Gradient Boosting classifier.

In [75]:
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.7)

gbc.fit(X_train, y_train)

gbc_y_pred = gbc.predict(X_test)

accuracy = accuracy_score(y_test, gbc_y_pred)

print(f"Gradient Boosting Classifier accuracy {accuracy * 100:.2f}%")

Gradient Boosting Classifier accuracy 96.49%
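
Gradient boosting adds one tree per round, each scaled by the learning rate, so it can be informative to look at accuracy after every round rather than only at the end. The sketch below uses `staged_predict` for that, with the same `n_estimators` and `learning_rate` as above but on scikit-learn's bundled copy of the dataset (an assumption made so the example is self-contained; its numbers will differ from the 96.49% above). In practice the number of trees should be chosen on a separate validation split, not on the test set as done here for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

gbc_demo = GradientBoostingClassifier(n_estimators=50, learning_rate=0.7, random_state=42)
gbc_demo.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., n_estimators trees.
staged_accuracy = [accuracy_score(y_te, pred) for pred in gbc_demo.staged_predict(X_te)]

# 1-based index of the first round that reaches the highest test accuracy.
best_round = max(range(len(staged_accuracy)), key=staged_accuracy.__getitem__) + 1
print(f"best test accuracy {max(staged_accuracy) * 100:.2f}% after {best_round} trees")
print(f"final test accuracy {staged_accuracy[-1] * 100:.2f}% after {len(staged_accuracy)} trees")
```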




### Conclusion

In this notebook, we compared three boosting algorithms (AdaBoost, XGBoost, and Gradient Boosting) for classifying breast masses as malignant or benign.

- **AdaBoost**: The AdaBoost model achieved an accuracy of **94.74%**.

- **XGBoost**: The XGBoost model delivered an accuracy of **95.61%**, slightly higher than AdaBoost.

- **Gradient Boosting**: With 50 estimators and a learning rate of **0.7**, the Gradient Boosting model yielded the highest accuracy of **96.49%**.
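
The three accuracies are close, so it may help to translate them into counts of test samples. This is a rough sketch, assuming the 569-row dataset and the `test_size=0.2` split used above, and treating each test prediction as an independent trial (a simplification).

```python
import math

n_samples, test_size = 569, 0.2
# For a fractional test_size, train_test_split rounds the test set size up.
n_test = math.ceil(test_size * n_samples)

# Accuracies as reported in this notebook.
reported = {"AdaBoost": 0.9474, "XGBoost": 0.9561, "Gradient Boosting": 0.9649}
# Number of correctly classified test samples implied by each accuracy.
correct = {name: round(acc * n_test) for name, acc in reported.items()}

for name, acc in reported.items():
    # Binomial standard error of an accuracy estimated from n_test samples.
    std_err = math.sqrt(acc * (1 - acc) / n_test)
    print(f"{name}: {correct[name]}/{n_test} correct, standard error ~{std_err * 100:.1f} points")
```

Under these assumptions the models differ by one or two test samples, which is within about one standard error, so the ranking above is suggestive rather than conclusive. Cross-validation or a larger test set would give a firmer comparison.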