Classifying Penguins

Please review the following site for information on our dataset of interest: https://allisonhorst.github.io/palmerpenguins

You can find the CSV file here: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data

This is a nice, simple dataset for applying clustering or classification techniques, or for experimenting with different visualization methods. Your goal is to use the other variables in the dataset to predict (classify) species.

Assignment Specs

You should compare XGBoost or Gradient Boosting to the results of your previous AdaBoost activity.
Based on the visualizations at the links above, you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the XGBoost (or Gradient Boosting) function arguments on the algorithm's performance.
You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!
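
One way to approach the settings exploration is sketched below. It is a minimal example rather than the required solution: it uses a synthetic three-class dataset from `make_classification` as a stand-in for the penguin data, and the three settings (`small`, `default-like`, `large`) are illustrative assumptions chosen to differ in `n_estimators`, `learning_rate`, and `max_depth`. Comparing train and test accuracy for each setting is one way to see whether an input actually changes the model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the penguin data (assumption): 3 classes, 6 features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three illustrative settings (hypothetical values, not recommendations).
settings = {
    "small": {"n_estimators": 2, "learning_rate": 0.01, "max_depth": 1},
    "default-like": {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3},
    "large": {"n_estimators": 200, "learning_rate": 0.5, "max_depth": 6},
}

results = {}
for name, params in settings.items():
    model = GradientBoostingClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    results[name] = {
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, model.predict(X_test)),
    }

# Print train and test accuracy side by side for each setting.
for name, scores in results.items():
    print(f"{name:>12}: train={scores['train_acc']:.3f}  test={scores['test_acc']:.3f}")
```

A large gap between train and test accuracy suggests overfitting, while low accuracy on both suggests the model is too constrained.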

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

ModuleNotFoundError: No module named 'pandas'

In [9]:
penguins = pd.read_csv("penguins_size.csv")
penguins.dropna(inplace=True)

In [10]:
# Encode target variable
le = LabelEncoder()
penguins['species'] = le.fit_transform(penguins['species'])  # 0 = Adelie, 1 = Chinstrap, 2 = Gentoo

# Encode categorical features with their own encoders, so that `le` stays
# fitted on species and le.classes_ holds the species names used in the reports
penguins['sex'] = LabelEncoder().fit_transform(penguins['sex'])
penguins['island'] = LabelEncoder().fit_transform(penguins['island'])

# Define X and y
X = penguins.drop("species", axis=1)
y = penguins["species"]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Gradient Boosting

In [11]:
# Train model
gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)

# Evaluate
y_pred_gb = gb_clf.predict(X_test)
print("Gradient Boosting Classifier Report:")
print(classification_report(y_test, y_pred_gb, target_names=le.classes_))

Gradient Boosting Classifier Report:
              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        31
   Chinstrap       1.00      1.00      1.00        13
      Gentoo       1.00      1.00      1.00        23

    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67
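
A perfect report on 67 test rows may partly reflect one favorable split. A hedged sketch of a cross-validation check follows. It uses scikit-learn's bundled iris data as a stand-in for the penguins (an assumption, so the example runs without `penguins_size.csv`), and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in dataset (assumption): iris, another small three-class problem.
X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold, each from a model trained on the other folds.
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

print("Accuracy per fold:", [round(s, 3) for s in scores])
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

If the per-fold scores stay close to the single-split score, the result is less likely to be an accident of how the rows were divided.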



# XGBoost

In [None]:
# Train model
xgb_clf = XGBClassifier(eval_metric='mlogloss')
xgb_clf.fit(X_train, y_train)

# Evaluate
y_pred_xgb = xgb_clf.predict(X_test)
print("XGBoost Classifier Report:")
print(classification_report(y_test, y_pred_xgb, target_names=le.classes_))
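
To compare the two boosting libraries on equal footing, it can help to collect the same metrics for each model in one place. The sketch below is one possible layout, not the assignment's prescribed method: it again uses the bundled iris data as a stand-in (an assumption), the `summary` dictionary is a hypothetical helper structure, and XGBoost is included only if it can be imported.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier
except Exception:  # xgboost is missing or its native library failed to load
    XGBClassifier = None

# Stand-in dataset (assumption): iris instead of penguins_size.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Give both models comparable settings so differences come from the library.
shared = {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3}
models = {"Gradient Boosting": GradientBoostingClassifier(random_state=0, **shared)}
if XGBClassifier is not None:
    models["XGBoost"] = XGBClassifier(random_state=0, **shared)

summary = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    summary[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

for name, metrics in summary.items():
    print(f"{name:>18}: accuracy={metrics['accuracy']:.3f}  macro F1={metrics['macro_f1']:.3f}")
```

Macro F1 averages the per-class F1 scores equally, so a weak result on a small class is not hidden by the larger classes.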
