You should compare AdaBoost to at least one of the following: a bagging model, a stacking model.
Based on the visualizations seen at the links above you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the AdaBoost function arguments on the algorithm's performance.
You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
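# Read both penguin CSV files; only the "size" file is used below.
# The first variable is named lter so it does not shadow Python's built-in iter().
lter = pd.read_csv("/content/penguins_lter.csv")
size = pd.read_csv("/content/penguins_size.csv")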

In [11]:
dataset=size
dataset.head()

   species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0   Adelie  Torgersen              39.1             18.7              181.0       3750.0    MALE
1   Adelie  Torgersen              39.5             17.4              186.0       3800.0  FEMALE
2   Adelie  Torgersen              40.3             18.0              195.0       3250.0  FEMALE
3   Adelie  Torgersen               NaN              NaN                NaN          NaN     NaN
4   Adelie  Torgersen              36.7             19.3              193.0       3450.0  FEMALE


In [16]:
dataset = dataset.dropna()

# Optional: reset index after dropping rows
dataset = dataset.reset_index(drop=True)

Bagging model

In [21]:
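from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Separate the features from the target (species)
X = dataset.drop(['species'],axis=1)
y = dataset['species']

# Decision trees need numeric inputs, so encode any text columns
# (island, sex) as integers; this loop does nothing if they are already numeric
for col in X.select_dtypes(exclude='number').columns:
    X[col] = LabelEncoder().fit_transform(X[col])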

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the base estimator
base_model = DecisionTreeClassifier()

# Create Bagging Classifier
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Fit model
bagging_model.fit(X_train, y_train)

# Make predictions
y_pred = bagging_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Bagging Classifier Accuracy: {accuracy:.2f}')


Bagging Classifier Accuracy: 0.95
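
A single 70/30 split gives one accuracy number, and that number depends on which rows happen to land in the test set. The sketch below shows one way to compare bagging and AdaBoost more evenly, using 5-fold cross-validation. It is only an illustration: the data comes from `make_classification` as a stand-in (an assumption, not the penguin data), and the names `X_demo`, `y_demo`, `candidates` and `cv_results` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in data (assumption: NOT the penguin dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# The base tree is passed positionally so the call works whether the
# installed scikit-learn names that argument estimator or base_estimator
candidates = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# Score each model on 5 different train/test partitions of the same data
cv_results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over five folds makes it less likely that a small difference between two models is just an artifact of one lucky or unlucky split.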


AdaBoost

In [22]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
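# Encode every column as integers. LabelEncoder also replaces the numeric
# measurements with their rank order, which keeps their ordering intact.
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

X = dataset.drop(['species'],axis=1)
Y = dataset['species']

# Alternative: pass an explicit base estimator
#model = DecisionTreeClassifier(criterion='entropy',max_depth=1)
#AdaBoost = AdaBoostClassifier(estimator= model,n_estimators=400,learning_rate=1)

AdaBoost = AdaBoostClassifier(n_estimators=400,learning_rate=1,algorithm='SAMME')

AdaBoost.fit(X,Y)

# Note: the model is scored on the same rows it was trained on,
# so this is training accuracy, not accuracy on unseen data
train_accuracy = AdaBoost.score(X,Y)

print('The accuracy is: ',train_accuracy*100,'%')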



The accuracy is:  98.83720930232558 %
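
The score above is computed on the same rows the model was fit on, so it measures how well the model memorised the training data rather than how well it predicts new penguins. The sketch below prints training and test accuracy side by side to make that difference visible. It uses synthetic data from `make_classification` as a stand-in (an assumption), and `demo_model`, `train_acc` and `test_acc` are names invented for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in data (assumption: NOT the penguin dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Hold out 30% of the rows so the model never sees them during training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)
demo_model.fit(X_tr, y_tr)

# Accuracy on rows the model has seen versus rows it has not
train_acc = accuracy_score(y_tr, demo_model.predict(X_tr))
test_acc = accuracy_score(y_te, demo_model.predict(X_te))
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

The test figure is the one to report when describing how the model is likely to perform on data it has not seen.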


In [62]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Encode all columns as integers (the loop covers the numeric columns too)
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

# Define features and target
X = dataset.drop(['species'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Initialize AdaBoost Classifier
AdaBoost = AdaBoostClassifier(n_estimators=400, learning_rate=1, algorithm='SAMME')

# Train model
AdaBoost.fit(X_train, y_train)

# Predict on test set
y_pred = AdaBoost.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')

# Get feature importances
importances = AdaBoost.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("\nTop Important Features:")
print(feature_importance_df)




Test set accuracy is: 96.15384615384616 %

Top Important Features:
             Feature  Importance
0             island    0.943058
1   culmen_length_mm    0.021172
4        body_mass_g    0.012698
3  flipper_length_mm    0.012345
2    culmen_depth_mm    0.010727
5                sex    0.000000
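
The encoding loop in the cells above passes every column through `LabelEncoder`, including the numeric measurements. A possible alternative is to encode only the text columns and leave the measurements as they are. The sketch below shows that idea on a tiny hand-made table; the `toy` DataFrame and its values are invented for illustration and are not rows from the penguin files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up table (assumption: illustrative values only)
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "culmen_length_mm": [39.1, 46.5, 40.3, 50.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

encoded = toy.copy()

# Pick out only the non-numeric columns and encode those
text_columns = encoded.select_dtypes(exclude="number").columns
for col in text_columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

`LabelEncoder` assigns integers in sorted order of the labels, so the numbers it produces for `island` imply an ordering that has no real meaning; tree-based models usually cope with this, but it is worth keeping in mind when reading feature importances.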


In [68]:
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

# Define features and target, this time leaving out island and body_mass_g
X = dataset.drop(['species', 'island','body_mass_g'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Initialize AdaBoost Classifier
AdaBoost = AdaBoostClassifier(n_estimators=400, learning_rate=1, algorithm='SAMME')

# Train model
AdaBoost.fit(X_train, y_train)

# Predict on test set
y_pred = AdaBoost.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')

# Get feature importances
importances = AdaBoost.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("\nTop Important Features:")
print(feature_importance_df)




Test set accuracy is: 93.26923076923077 %

Top Important Features:
             Feature  Importance
2  flipper_length_mm    0.492463
1    culmen_depth_mm    0.489828
0   culmen_length_mm    0.017709
3                sex    0.000000


In [50]:
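from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Encode all columns as integers (the loop covers the numeric columns too)
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

X = dataset.drop(['species'],axis=1)
Y = dataset['species']

# Use a bagging ensemble of 50 decision trees as the estimator that AdaBoost boosts
base_model = DecisionTreeClassifier()
model = BaggingClassifier(estimator=base_model, n_estimators=50, random_state=42)

AdaBoost = AdaBoostClassifier(estimator= model,n_estimators=400,learning_rate=1)

AdaBoost.fit(X,Y)

# Note: scored on the training rows themselves, so this is training accuracy
train_accuracy = AdaBoost.score(X,Y)

print('The accuracy is: ',train_accuracy*100,'%')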

The accuracy is:  100.0 %


In [60]:
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Encode all columns as integers (the loop covers the numeric columns too)
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

# Define features and target
X = dataset.drop(['species'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

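# Build the estimator that AdaBoost will boost: a bagging ensemble of 75 decision trees
base_model = DecisionTreeClassifier()

model = BaggingClassifier(estimator=base_model, n_estimators=75, random_state=42)

# Initialize AdaBoost Classifier with the bagging model as its estimator
AdaBoost = AdaBoostClassifier(estimator= model,n_estimators=400,learning_rate=.1)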
# Train model
AdaBoost.fit(X_train, y_train)

# Predict on test set
y_pred = AdaBoost.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')


Test set accuracy is: 95.1923076923077 %


In [61]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Encode all columns as integers (the loop covers the numeric columns too)
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

# Define features and target
X = dataset.drop(['species'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Initialize AdaBoost Classifier
AdaBoost = AdaBoostClassifier(n_estimators=200, learning_rate=.01, algorithm='SAMME')

# Train model
AdaBoost.fit(X_train, y_train)

# Predict on test set
y_pred = AdaBoost.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')




Test set accuracy is: 79.8076923076923 %


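I can see that when I decrease the number of estimators and the learning rate together (from 400 estimators at a learning rate of 1 to 200 estimators at a learning rate of 0.01), test accuracy falls from about 96% to about 80%. Because both inputs changed at once, this run alone does not show which of the two matters more. I also see that using a bagging model as the estimator inside the boosting model raises accuracy on the training data from about 98.8% to 100%, while test accuracy stays at about 95%. So even with only one boosting estimator, I would expect accuracy to stay high as long as the model inside it has around 100 estimators of its own.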
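
One way to see what each input does on its own is to vary `n_estimators` and `learning_rate` over a small grid, changing one while holding the other fixed. The sketch below does this on synthetic data from `make_classification` (an assumption, not the penguin data); the grid values and the names `sweep_results`, `n_est` and `rate` are chosen only for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in data (assumption: NOT the penguin dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Try every combination of the two inputs and record test accuracy
sweep_results = []
for n_est, rate in product([10, 50, 200], [0.01, 0.1, 1.0]):
    clf = AdaBoostClassifier(n_estimators=n_est, learning_rate=rate, random_state=42)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    sweep_results.append((n_est, rate, acc))
    print(f"n_estimators={n_est:>3}, learning_rate={rate:<4}: test accuracy {acc:.3f}")
```

Reading the printed rows for a fixed `learning_rate` shows the effect of adding estimators, and reading them for a fixed `n_estimators` shows the effect of the learning rate. The two inputs tend to trade off, since a smaller learning rate usually needs more estimators to reach the same accuracy.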