Classifying Penguins

Please review the following site for information on our dataset of interest: https://allisonhorst.github.io/palmerpenguins

You can find the CSV file here: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data

This is a very nice, simple dataset with which to apply clustering techniques, classification techniques, or play around with different visualization methods. Your goal is to use the other variables in the dataset (the body measurements, island, and sex) to predict (classify) species.

Assignment Specs

You should compare XGBoost or Gradient Boosting to the results of your previous AdaBoost activity.
Based on the visualizations seen at the links above, you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the XGBoost (or Gradient Boosting) function arguments on the algorithm's performance.
You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from xgboost import XGBClassifier

In [28]:
penguins = pd.read_csv("penguins_size.csv")
penguins.dropna(inplace=True)

# Features (X) are every column except species; the target (y) is species
X = penguins.drop("species", axis=1)
y = penguins["species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [27]:
ct = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), make_column_selector(dtype_include=object)),
        ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
    ],
    remainder="passthrough"
).set_output(transform="pandas")

ada_pipeline = Pipeline([
    ('preprocess', ct),
    ('clf', AdaBoostClassifier(n_estimators=100, random_state=42))
])

ada_pipeline.fit(X_train, y_train)
y_pred_ada = ada_pipeline.predict(X_test)

print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))


AdaBoost Accuracy: 1.0
              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        31
   Chinstrap       1.00      1.00      1.00        13
      Gentoo       1.00      1.00      1.00        23

    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67



# Gradient Boosting

In [30]:
gb_pipeline = Pipeline([
    ('preprocess', ct),
    ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
])

gb_pipeline.fit(X_train, y_train)
y_pred_gb = gb_pipeline.predict(X_test)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))

Gradient Boosting Accuracy: 0.9801980198019802
              precision    recall  f1-score   support

      Adelie       0.98      0.98      0.98        49
   Chinstrap       0.94      0.94      0.94        18
      Gentoo       1.00      1.00      1.00        34

    accuracy                           0.98       101
   macro avg       0.97      0.97      0.97       101
weighted avg       0.98      0.98      0.98       101
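The assignment asks for at least three sets of Gradient Boosting settings that actually change the results. The sketch below is one possible way to organize that comparison. The setting names and values in `SETTINGS` and the helper `compare_settings` are hypothetical, chosen only to be very different from one another, and the demo runs on synthetic stand-in data from `make_classification` rather than the penguin data, so its numbers say nothing about penguins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical settings for illustration; they were not tuned on the penguin data.
SETTINGS = {
    "default-like": {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3},
    "few weak trees": {"n_estimators": 5, "learning_rate": 0.01, "max_depth": 1},
    "many deep trees": {"n_estimators": 300, "learning_rate": 0.5, "max_depth": 6},
}


def compare_settings(X_train, X_test, y_train, y_test, settings=SETTINGS):
    """Fit one GradientBoostingClassifier per setting and record train/test accuracy."""
    results = {}
    for name, params in settings.items():
        model = GradientBoostingClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        results[name] = {
            "train": accuracy_score(y_train, model.predict(X_train)),
            "test": accuracy_score(y_test, model.predict(X_test)),
        }
    return results


# Synthetic three-class stand-in data (an assumption, not the penguin dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_results = compare_settings(Xtr, Xte, ytr, yte)
for name, scores in demo_results.items():
    print(f"{name}: train={scores['train']:.3f}, test={scores['test']:.3f}")
```

Reporting training accuracy next to test accuracy helps a non-expert reader see when a setting merely memorizes the training data. In the notebook itself, the same idea could be applied to the existing pipeline with something like `gb_pipeline.set_params(clf__learning_rate=0.01)` before refitting.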



# AdaBoost

In [36]:
# Reuse the ColumnTransformer `ct` defined in the earlier AdaBoost cell

# Create the AdaBoost model with DecisionTreeClassifier as the base estimator
model1 = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    algorithm='SAMME'
)

# Create the pipeline
pipeline = Pipeline([
    ('preprocess', ct),
    ('clf', model1)
])

# Fit the model and make predictions
pipeline.fit(X_train, y_train)
pred1 = pipeline.predict(X_test)

# Calculate accuracy
acc1 = accuracy_score(y_test, pred1)

print(f"Model 1 Accuracy (50 estimators, LR=1.0): {acc1 * 100:.2f}%")
print(classification_report(y_test, pred1))

Model 1 Accuracy (50 estimators, LR=1.0): 99.01%
              precision    recall  f1-score   support

      Adelie       0.98      1.00      0.99        49
   Chinstrap       1.00      0.94      0.97        18
      Gentoo       1.00      1.00      1.00        34

    accuracy                           0.99       101
   macro avg       0.99      0.98      0.99       101
weighted avg       0.99      0.99      0.99       101
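One way to see how the number of estimators affects AdaBoost is to look at accuracy after every boosting round instead of only at the end. The sketch below uses `staged_score`, which scikit-learn's `AdaBoostClassifier` provides for this purpose. The helper name `accuracy_by_round` is hypothetical, and the demo uses synthetic stand-in data, not the penguin data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split


def accuracy_by_round(fitted_model, X_test, y_test):
    """Return test accuracy after each boosting round of a fitted AdaBoost model."""
    return list(fitted_model.staged_score(X_test, y_test))


# Synthetic three-class stand-in data (an assumption, not the penguin dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

ada_demo = AdaBoostClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)
staged = accuracy_by_round(ada_demo, Xte, yte)

# Print every tenth round to keep the output short.
for round_number, acc in enumerate(staged, start=1):
    if round_number % 10 == 0:
        print(f"after {round_number} rounds: accuracy={acc:.3f}")
```

With the notebook's pipeline, the test features would need to be preprocessed first, for example `pipeline[-1].staged_score(pipeline[:-1].transform(X_test), y_test)`. If accuracy levels off after a handful of rounds, adding more estimators is unlikely to change the result.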





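The first AdaBoost model's 100% accuracy deserves a closer look before we trust it. A perfect score can be a sign of overfitting, but here there is a simpler explanation to rule out first: that report covers 67 test penguins, while the Gradient Boosting and second AdaBoost reports cover 101. The perfect score therefore appears to come from an earlier run with a different train/test split, so it is not directly comparable to the other two. Rerunning the whole notebook from a fresh kernel would put all three models on the same split. On the 101-penguin test set, Gradient Boosting scored 98% (2 misclassified penguins) and the second AdaBoost model scored 99% (1 misclassified penguin), so the gap between them is a single penguin. Note also that depth-1 decision trees are already AdaBoost's default base estimator in scikit-learn, so the two AdaBoost models differ mainly in the number of estimators (100 versus 50), not in the kind of tree they use. With so few errors, the fairest reading is that all of these models classify the species about equally well.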
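A single train/test split can make one model look better than another by luck. Cross-validation is a common way to check this: every model is scored on the same set of several splits and the scores are averaged. The sketch below is a minimal illustration; the helper `cv_compare` is hypothetical, and the demo uses synthetic stand-in data, so the printed numbers are not penguin results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_compare(models, X, y, n_splits=5, seed=42):
    """Score each model on the same stratified folds; return (mean, std) accuracy."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    summary = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        summary[name] = (scores.mean(), scores.std())
    return summary


# Synthetic three-class stand-in data (an assumption, not the penguin dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=1
)

demo_models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_summary = cv_compare(demo_models, X_demo, y_demo)
for name, (mean_acc, std_acc) in cv_summary.items():
    print(f"{name}: {mean_acc:.3f} +/- {std_acc:.3f}")
```

In the notebook, the full pipelines (for example `ada_pipeline` and `gb_pipeline`) could be passed in place of the bare classifiers, together with the unsplit `X` and `y`, so that preprocessing is refit inside each fold.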