Assignment Specs

You should compare XGBoost or Gradient Boosting to the results of your previous AdaBoost activity.
Based on the visualizations at the links above, you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the XGBoost (or Gradient Boosting) function arguments on the algorithm's performance.
You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!


In [5]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from xgboost import XGBClassifier


In [3]:

# Load the penguin measurements
size = pd.read_csv("/content/penguins_size.csv")
dataset = size

# Drop rows with missing values
dataset = dataset.dropna()

# Optional: reset index after dropping rows
dataset = dataset.reset_index(drop=True)

Gradient Boosting

In [4]:


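# Encode every column as integer labels.
# Note: this loop also replaces the numeric measurements with the rank of each
# value among the column's unique values. Tree-based models split on order, so
# the splits are largely preserved, but encoding only the text columns
# (species, island, sex) would be cleaner.
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])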

# Define features and target
X = dataset.drop(['species'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Initialize Gradient Boosting Classifier
gbc = GradientBoostingClassifier(n_estimators=400, learning_rate=1.0, max_depth=3, random_state=42)

# Train model
gbc.fit(X_train, y_train)

# Predict on test set
y_pred = gbc.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')

# Get feature importances
importances = gbc.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("\nTop Important Features:")
print(feature_importance_df)


Test set accuracy is: 99.00990099009901 %

Top Important Features:
             Feature    Importance
1   culmen_length_mm  4.938379e-01
3  flipper_length_mm  3.327490e-01
0             island  1.588311e-01
2    culmen_depth_mm  1.273897e-02
4        body_mass_g  1.842956e-03
5                sex  6.166081e-09


XGBoost

In [6]:
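# The columns were already integer-encoded in the Gradient Boosting cell,
# so re-running this loop leaves the values unchanged.
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])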

# Define features and target
X = dataset.drop(['species'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

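# Initialize XGBoost Classifier with the same settings as Gradient Boosting above.
# Note: use_label_encoder is no longer used by recent XGBoost releases, which is
# what the warning in the output refers to; it can be omitted.
xgb = XGBClassifier(n_estimators=400, learning_rate=1.0, max_depth=3, use_label_encoder=False, eval_metric='mlogloss', random_state=42)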

# Train model
xgb.fit(X_train, y_train)

# Predict on test set
y_pred = xgb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')


Parameters: { "use_label_encoder" } are not used.



Test set accuracy is: 99.00990099009901 %


In [7]:
importances = xgb.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("\nTop Important Features:")
print(feature_importance_df)


Top Important Features:
             Feature  Importance
3  flipper_length_mm    0.564384
0             island    0.224471
1   culmen_length_mm    0.187995
2    culmen_depth_mm    0.015092
5                sex    0.004768
4        body_mass_g    0.003289


Now trying deliberately extreme values for the function arguments

In [21]:
# Re-encoding is a no-op here (the columns are already integer-coded).
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

# Define features and target
X = dataset.drop(['species'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Initialize XGBoost Classifier with extreme settings:
# a single depth-1 tree (a "stump") and a huge learning rate
xgb = XGBClassifier(n_estimators=1, learning_rate=10000000, max_depth=1, eval_metric='mlogloss', random_state=4)

# Train model
xgb.fit(X_train, y_train)

# Predict on test set
y_pred = xgb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')

Test set accuracy is: 94.05940594059405 %


In [24]:
# Re-encoding is a no-op here (the columns are already integer-coded).
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit_transform(dataset[label])

# Define features and target (this run also drops the 'island' feature)
X = dataset.drop(['species', 'island'], axis=1)
Y = dataset['species']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Initialize XGBoost Classifier with 10 trees, a very large depth limit,
# and a very large learning rate
xgb = XGBClassifier(n_estimators=10, learning_rate=1000, max_depth=1000, eval_metric='mlogloss', random_state=4)

# Train model
xgb.fit(X_train, y_train)

# Predict on test set
y_pred = xgb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Test set accuracy is:', accuracy * 100, '%')

Test set accuracy is: 97.02970297029702 %


I experimented with max_depth, n_estimators, and learning_rate to see how they affect model performance and the balance between underfitting and overfitting.

max_depth controls the complexity of each individual decision tree.
n_estimators defines how many boosting rounds (trees) the model builds.
learning_rate determines the contribution of each tree to the final prediction.
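
To make the roles of n_estimators and learning_rate more concrete, here is a minimal toy sketch of boosting written from scratch. It is not how scikit-learn or XGBoost are implemented internally: it does regression rather than classification, uses only depth-1 trees (so max_depth is effectively fixed at 1), and the data points and the helper names fit_stump, predict_stump, and boost are made up for illustration. Each round fits a small tree to the errors left over by the previous rounds, then adds that tree's prediction scaled by learning_rate.

```python
# Toy gradient boosting for regression with squared error and depth-1 trees.
# Illustrative sketch only; the data and helper names are hypothetical.

def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    """Predict with one stump: the mean residual on whichever side x falls."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators, learning_rate):
    """Return the training mean squared error after each boosting round."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(n_estimators):
        # Each new tree is fitted to what the model still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # learning_rate scales how much of the new tree's output is added.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history


# Made-up data: a rough staircase pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

for n_estimators, learning_rate in [(1, 0.1), (50, 0.1), (50, 3.0)]:
    final_error = boost(xs, ys, n_estimators, learning_rate)[-1]
    print(f"n_estimators={n_estimators:>2}, learning_rate={learning_rate}: "
          f"training MSE = {final_error:.4f}")
```

In this toy setting, a small learning rate needs many rounds to bring the training error down, while a learning rate that is too large makes each tree overcorrect, so the error grows from round to round instead of shrinking. One possible reading of the single-tree run above is that with only one boosting round there are no later trees to overcorrect, so a huge learning_rate has little chance to do damage; that is an interpretation, not something the runs in this notebook verify.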

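Last time, using AdaBoost, I got an accuracy of 96%, so both Gradient Boosting (99%) and XGBoost (99%) did better than AdaBoost. The deliberately extreme XGBoost settings still stayed close: 94% with a single depth-1 tree, and 97% with 10 trees, max_depth=1000, and the island feature removed.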