# Ensemble Learning



### A. Implement a Random Forest Classifier model to predict the safety of the car.
Dataset link: https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

In [2]:
data = pd.read_csv('./datasets/car_evaluation.csv', header=None)
data.shape

(1728, 7)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1728 non-null   object
 1   1       1728 non-null   object
 2   2       1728 non-null   object
 3   3       1728 non-null   object
 4   4       1728 non-null   object
 5   5       1728 non-null   object
 6   6       1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


### Step 2: Add headers

In [4]:
data.columns = ['buying_price', 'maintenance_cost', 'number_of_doors', 'number_of_persons', 'lug_boot', 'safety', 'decision']

In [5]:
data.head()

|   | buying_price | maintenance_cost | number_of_doors | number_of_persons | lug_boot | safety | decision |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |


In [6]:
data.describe()

|   | buying_price | maintenance_cost | number_of_doors | number_of_persons | lug_boot | safety | decision |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | vhigh | vhigh | 2 | 2 | small | low | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |


In [7]:
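# Encode every categorical column as integer codes.
# LabelEncoder assigns codes in sorted (alphabetical) order of the values, and
# because the same encoder is refitted for each column, `le.classes_` only
# holds the mapping of the last column ('decision') after the loop.
le = LabelEncoder()
for column in data.columns:
    data[column] = le.fit_transform(data[column])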
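
The features in this dataset are ordered categories (for example `low < med < high`), while `LabelEncoder` assigns codes alphabetically. One possible alternative, shown here only as a sketch, is scikit-learn's `OrdinalEncoder` with an explicit order per column. The toy frame and the two order lists below are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows using the same vocabulary as two of the car-evaluation features
# (illustrative data, not read from the CSV).
toy = pd.DataFrame({
    "buying_price": ["vhigh", "low", "med", "high"],
    "safety": ["low", "high", "med", "low"],
})

# Explicit order per column, lowest to highest (an assumed ranking).
price_order = ["low", "med", "high", "vhigh"]
safety_order = ["low", "med", "high"]

encoder = OrdinalEncoder(categories=[price_order, safety_order])
encoded = encoder.fit_transform(toy)

print(encoded)              # one float code per cell, following the given order
print(encoder.categories_)  # the mapping is kept for every column
```

Tree-based models such as random forests can still split on alphabetically assigned codes, so this mainly helps interpretability and keeps every column's mapping available for decoding.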

In [8]:
X = data.drop('decision', axis=1)  # Features
y = data['decision']  # Target (Encoded)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
rf_classifier = RandomForestClassifier(n_estimators=10000, random_state=42)

rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.89      0.94        83
           1       0.58      1.00      0.73        11
           2       1.00      1.00      1.00       235
           3       1.00      0.94      0.97        17

    accuracy                           0.97       346
   macro avg       0.89      0.96      0.91       346
weighted avg       0.98      0.97      0.97       346
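
Class 1 stands out in the report with a precision of 0.58 despite a recall of 1.00. Assuming the usual `decision` labels (`acc`, `good`, `unacc`, `vgood`), the alphabetical encoding makes class 1 the `good` class. The arithmetic below is a sketch of how such a row can arise: the count of 8 false positives is inferred from the rounded report values and is not an output of the notebook.

```python
# Counts for class 1, inferred from the rounded report (an assumption).
support = 11          # true class-1 samples in the test set (from the report)
true_positives = 11   # recall 1.00 means all 11 were found
false_positives = 8   # hypothetical count that reproduces precision 0.58

precision = true_positives / (true_positives + false_positives)
recall = true_positives / support
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With only 11 true samples in the class, a handful of extra predictions is enough to pull precision down sharply, which is why the macro average (0.89) sits well below the weighted average (0.98).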



## **B: Use different voting mechanisms and apply AdaBoost (Adaptive Boosting), Gradient Tree Boosting (GBM), and XGBoost classification on the Iris dataset, then compare the performance of the three models using different evaluation measures.**
Dataset Link: https://www.kaggle.com/datasets/uciml/iris

### 1. Import Necessary Libraries

In [10]:
# Importing libraries for data manipulation and visualization
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Importing XGBoost
from xgboost import XGBClassifier

# Importing Label Encoder
from sklearn.preprocessing import LabelEncoder

In [11]:
iris_data = pd.read_csv("./datasets/Iris.csv")
iris_data.head()

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [12]:
# Check for missing values
iris_data.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [13]:
iris_data['Species'] = LabelEncoder().fit_transform(iris_data['Species'])
iris_data['Species'].unique()

array([0, 1, 2])

In [14]:
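# Note: only 'Species' is dropped here, so the 'Id' column stays in X.
# In this CSV the rows are ordered by species, so 'Id' alone can separate the
# classes; the scores reported in this notebook were obtained with it included.
# Use iris_data.drop(['Id', 'Species'], axis=1) to evaluate on the four
# measurements only.
X = iris_data.drop('Species', axis=1)
y = iris_data['Species']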

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

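- **Stratification**: We pass `stratify=y` so that the training and test sets keep the same class proportions as the full dataset. Iris is balanced (50 samples per class), so here it simply guarantees 10 test samples per class, which matches the `support` column in the reports. On imbalanced data it also keeps rare classes from being under-represented in the test set.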
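
The effect of `stratify` can be checked by counting the classes on each side of the split. The sketch below assumes scikit-learn's bundled `load_iris` as a stand-in for `./datasets/Iris.csv`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of Iris, used here instead of the CSV (an assumption).
X_demo, y_demo = load_iris(return_X_y=True)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Number of samples per class on each side of the split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts)
print("test: ", test_counts)
```

Without `stratify`, the per-class counts in the test set would depend on the random shuffle and would generally not be equal.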

#### AdaBoost, Gradient Boosting, and XGBoost Classifiers

In [15]:
# Initialize the three boosting classifiers with the same number of estimators.
# `use_label_encoder` is omitted: recent XGBoost versions ignore it and only
# emit a "Parameters ... are not used" warning.
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
xgb_model = XGBClassifier(n_estimators=100, eval_metric='mlogloss', random_state=42)

# Train each model on the training data
ada_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

# Predict on the test data
y_pred_ada = ada_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)



---

#### Observations:
- **AdaBoost**: Trains weak learners sequentially and increases the weights of misclassified instances after each round, so later learners focus on hard-to-classify examples.
- **Gradient Boosting**: Adds trees sequentially, fitting each new tree to the residual errors (more generally, the negative gradient of the loss) of the current ensemble.
- **XGBoost**: An optimized implementation of gradient boosting that adds regularization terms to the objective to limit overfitting and uses engineering optimizations to speed up training.
- All three classifiers are trained on the same training set with 100 estimators each.
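
The reweighting idea behind AdaBoost can be sketched for a single round. This is a simplified illustration of the classic binary formulation, not scikit-learn's multi-class implementation, and the pattern of mistakes below is hypothetical.

```python
import numpy as np

# Five training points, all starting with equal weight.
weights = np.full(5, 1 / 5)

# Hypothetical outcome of one weak learner: only the last point is wrong.
misclassified = np.array([False, False, False, False, True])

# Weighted error of the learner and the resulting vote weight (alpha).
error = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weight of mistakes, lower the weight of correct points,
# then renormalize so the weights sum to 1 again.
weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
weights = weights / weights.sum()

print("error:", error, "alpha:", round(float(alpha), 3))
print("new weights:", weights)
```

After this round the single misclassified point carries half of the total weight, so the next weak learner is pushed to classify it correctly.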

---


### 7. Evaluate the Models

We will evaluate the performance of each model using common classification metrics like **accuracy**, **confusion matrix**, and **classification report** (which includes precision, recall, and F1-score).


In [16]:
def evaluate_model(y_true, y_pred):
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print(f'Accuracy: {accuracy_score(y_true, y_pred) * 100:.2f}%\n')


print("AdaBoost Performance:")
evaluate_model(y_test, y_pred_ada)

print("\n\n\n\nGradient Boosting Performance:")
evaluate_model(y_test, y_pred_gb)

print("\n\n\n\nXGBoost Performance:")
evaluate_model(y_test, y_pred_xgb)

AdaBoost Performance:
Confusion Matrix:
 [[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Accuracy: 100.00%





Gradient Boosting Performance:
Confusion Matrix:
 [[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
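
For a side-by-side comparison, the per-model metrics can also be collected into one table. The following is a minimal sketch under a few assumptions: it uses scikit-learn's bundled `load_iris` (four measurement columns, no `Id`) rather than the CSV, macro-averaged scores, and only the two scikit-learn boosters so that it runs without extra dependencies. An `XGBClassifier` could be added to the `models` dictionary in the same way.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Bundled Iris data as a stand-in for the CSV (an assumption).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "precision_macro": precision_score(y_te, pred, average="macro"),
        "recall_macro": recall_score(y_te, pred, average="macro"),
        "f1_macro": f1_score(y_te, pred, average="macro"),
    })

# One row per model, one column per metric.
summary = pd.DataFrame(rows).set_index("model")
print(summary.round(3))
```

Because this sketch excludes the `Id` column, its scores need not match the reports printed by the notebook.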



#### Observations:
- In the output shown, **AdaBoost** and **Gradient Boosting** both classify all 30 test samples correctly, so this test set does not separate them. **XGBoost** is often expected to perform at least as well because of its regularization and optimizations, but a test set of 30 samples is too small to confirm a ranking.
- Each model has its own strengths and trade-offs. AdaBoost is simple and often fast on small datasets, Gradient Boosting is a strong general-purpose choice for structured (tabular) data, and XGBoost tends to scale better to larger, more complex datasets.
- **Based on the results**, the choice should come down to the problem's needs (accuracy, computational efficiency, or the ability to handle large datasets) rather than to accuracy on this test set alone.
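
The task title also mentions voting mechanisms, and `VotingClassifier` is imported earlier in the notebook but not used. One way to compare hard voting (majority of predicted labels) with soft voting (average of predicted probabilities) is sketched below. The choice of base estimators, the bundled `load_iris` data, and the helper name `make_estimators` are assumptions made for illustration; a random forest stands in for XGBoost to avoid the extra dependency.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled Iris data as a stand-in for the CSV (an assumption).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def make_estimators():
    """Return fresh (name, estimator) pairs for the voting ensemble."""
    return [
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ]


voting_scores = {}
for mode in ("hard", "soft"):
    # "hard" takes the majority label; "soft" averages predict_proba outputs.
    clf = VotingClassifier(estimators=make_estimators(), voting=mode)
    clf.fit(X_tr, y_tr)
    voting_scores[mode] = accuracy_score(y_te, clf.predict(X_te))

print(voting_scores)
```

Soft voting requires every base estimator to implement `predict_proba`, which holds for the three used here.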

### Conclusion:

1. We imported and preprocessed the **Iris dataset**, splitting it into training and testing sets.
2. We trained three different boosting algorithms: **AdaBoost**, **Gradient Boosting**, and **XGBoost**.
3. We evaluated the models on accuracy, classification reports, and confusion matrices.
4. We compared the results of each model. The scores shown for this test set are identical, so they do not single out a best model; **XGBoost** remains a reasonable choice when regularization and training speed matter.
