# Activity Directions
### Classifying Penguins

Please review the following site for information on our dataset of interest: https://allisonhorst.github.io/palmerpenguins

You can find the CSV file here: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data

This is a nice, simple dataset for applying clustering techniques, classification techniques, or experimenting with different visualization methods. Your goal is to use the other variables in the dataset, such as the measurement variables, to predict (classify) species.

### Assignment Specs

- You should compare XGBoost or Gradient Boosting to the results of your previous AdaBoost activity.
- Based on the visualizations at the links above, you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the XGBoost (or Gradient Boosting) function arguments on the algorithm's performance.
- You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
- Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.

# Notes
### Gradient Boosting
- After the initial model is fit, a loss function is evaluated and each new tree is fit to reduce the remaining error (instead of updating the observation weights as we did in AdaBoost)
- Gradient Boosting gets its name from Gradient Descent: each new tree is fit to the negative gradient of the loss function, so adding trees moves the model's predictions toward a lower loss
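
As a rough illustration of the idea in these notes, here is a minimal sketch of gradient boosting for a regression problem with squared-error loss, written in plain Python. The toy data and the helper names (`fit_stump`, `predict_stump`, `gradient_boost`) are made up for this example; it is not how scikit-learn or XGBoost implement the method internally. For squared error, the negative gradient of the loss is simply the residual (actual minus predicted), so each new one-split tree is fit to the current residuals.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_estimators=10, learning_rate=0.5):
    base = sum(y) / len(y)          # initial model: predict the mean
    preds = [base] * len(y)
    stumps, losses = [], []
    for _ in range(n_estimators):
        # Negative gradient of 0.5 * (y - pred)^2 with respect to pred.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Take a step of size learning_rate in the direction the stump suggests.
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, losses


# Hypothetical toy data: two flat groups with a jump between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, losses = gradient_boost(x, y)
print([round(loss, 4) for loss in losses])  # mean squared error after each tree
```

The printed losses shrink as trees are added, which is the "descent" part of the name: every tree nudges the predictions a little further toward lower loss, and `learning_rate` controls how big each nudge is.
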
### XGBoost
- Direct application of Gradient Boosting for decision trees with the following advantages:
1. Easy to use
2. Computational Efficiency
3. Model Accuracy
4. Feasibility – easy to tune parameters and modify objectives

# Process
## Import Data

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
peng = pd.read_csv(url)

peng = peng.dropna()
peng.head()

   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    MALE
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  FEMALE
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  FEMALE
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  FEMALE
5   Adelie  Torgersen            39.3           20.6              190.0       3650.0    MALE


# Preprocessing

In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Encode categorical variables
peng_encoded = peng.copy()
peng_encoded['species'] = LabelEncoder().fit_transform(peng['species'])
peng_encoded['sex'] = LabelEncoder().fit_transform(peng['sex'])
peng_encoded['island'] = LabelEncoder().fit_transform(peng['island'])

# Define features and target
X = peng_encoded.drop('species', axis=1)
y = peng_encoded['species']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Models
## Gradient Boosting

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Gradient Boosting model
params = [
    {"n_estimators": 1, "learning_rate": 0.5},
    {"n_estimators": 1, "learning_rate": 1.0},
    {"n_estimators": 1, "learning_rate": 1.5},
    {"n_estimators": 10, "learning_rate": 0.5},
    {"n_estimators": 10, "learning_rate": 1.0},
    {"n_estimators": 10, "learning_rate": 1.5},
    {"n_estimators": 25, "learning_rate": 0.5},
    {"n_estimators": 25, "learning_rate": 1.0},
    {"n_estimators": 25, "learning_rate": 1.5},
    {"n_estimators": 100, "learning_rate": 0.5},
    {"n_estimators": 100, "learning_rate": 1.0},
    {"n_estimators": 100, "learning_rate": 1.5}
]

gb_results = []
for p in params:
    model = GradientBoostingClassifier(**p)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    gb_results.append((p["n_estimators"], p["learning_rate"], acc))

# Show results
gb_results_df = pd.DataFrame(gb_results, columns=["n_estimators", "learning_rate", "accuracy"])
print("Gradient Boosting Results:")
print(gb_results_df)


Gradient Boosting Results:
    n_estimators  learning_rate  accuracy
0              1            0.5  0.985075
1              1            1.0  1.000000
2              1            1.5  1.000000
3             10            0.5  1.000000
4             10            1.0  1.000000
5             10            1.5  0.970149
6             25            0.5  1.000000
7             25            1.0  1.000000
8             25            1.5  0.970149
9            100            0.5  1.000000
10           100            1.0  1.000000
11           100            1.5  0.970149
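
The twelve settings above were typed out by hand. One alternative, sketched here with the standard-library `itertools.product`, builds the same list from the two sets of values. The names `n_estimators_values` and `learning_rate_values` are just illustrative and do not appear elsewhere in this notebook.

```python
from itertools import product

# Values to try for each input (same ones used in the hand-written list).
n_estimators_values = [1, 10, 25, 100]
learning_rate_values = [0.5, 1.0, 1.5]

# Every combination, with n_estimators changing slowest.
params = [
    {"n_estimators": n, "learning_rate": lr}
    for n, lr in product(n_estimators_values, learning_rate_values)
]

print(len(params))
print(params[0])
print(params[-1])
```

Written this way, trying another value only means adding it to one of the two lists.
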


## XGBoost

In [8]:
from xgboost import XGBClassifier

# XGBoost model
params = [
    {"n_estimators": 1, "learning_rate": 0.5},
    {"n_estimators": 1, "learning_rate": 1.0},
    {"n_estimators": 1, "learning_rate": 1.5},
    {"n_estimators": 10, "learning_rate": 0.5},
    {"n_estimators": 10, "learning_rate": 1.0},
    {"n_estimators": 10, "learning_rate": 1.5},
    {"n_estimators": 25, "learning_rate": 0.5},
    {"n_estimators": 25, "learning_rate": 1.0},
    {"n_estimators": 25, "learning_rate": 1.5},
    {"n_estimators": 100, "learning_rate": 0.5},
    {"n_estimators": 100, "learning_rate": 1.0},
    {"n_estimators": 100, "learning_rate": 1.5}
]

xgb_results = []
for p in params:
    model = XGBClassifier(eval_metric='mlogloss', **p)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    xgb_results.append((p["n_estimators"], p["learning_rate"], acc))

# Show results
xgb_results_df = pd.DataFrame(xgb_results, columns=["n_estimators", "learning_rate", "accuracy"])
print("XGBoost Results:")
print(xgb_results_df)


XGBoost Results:
    n_estimators  learning_rate  accuracy
0              1            0.5       1.0
1              1            1.0       1.0
2              1            1.5       1.0
3             10            0.5       1.0
4             10            1.0       1.0
5             10            1.5       1.0
6             25            0.5       1.0
7             25            1.0       1.0
8             25            1.5       1.0
9            100            0.5       1.0
10           100            1.0       1.0
11           100            1.5       1.0


Both methods reach very high accuracy on the penguins dataset. XGBoost scores 1.0 for every setting tried, and Gradient Boosting drops below 1.0 only with a single tree at a learning rate of 0.5 and with a learning rate of 1.5 at 10 or more trees. This is similar to what we saw with AdaBoost and is likely a result of an easily classified dataset.
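
To put the accuracy scores in perspective, the short calculation below converts them to counts of misclassified penguins. It assumes a test set of 67 penguins. That number is not printed anywhere in this notebook; it is inferred from the reported scores (0.985075 is 66/67) and from a 20% split of the complete rows.

```python
test_size = 67  # assumed size of the test set (inferred, not printed above)

# Accuracy values that appear in the results tables.
reported = [1.000000, 0.985075, 0.970149]

misclassified = []
for accuracy in reported:
    errors = round((1 - accuracy) * test_size)
    # Check that this count reproduces the reported score to 6 decimals.
    matches = round((test_size - errors) / test_size, 6) == accuracy
    misclassified.append((accuracy, errors, matches))
    print(f"accuracy {accuracy:.6f} -> {errors} misclassified of {test_size}")
```

Under that assumption, the gap between the best and worst settings is only a penguin or two, so small differences in these tables should be read cautiously.
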