# Practice Activity: Gradient and XGBoost

Predicting penguin species from the Palmer Penguins data with gradient boosting and XGBoost.

## The Data

In [12]:
import pandas as pd

df = pd.read_csv("penguins_size.csv")

In [13]:
df.sample(5)

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 177 | Chinstrap | Dream | 52.0 | 19.0 | 197.0 | 4150.0 | MALE |
| 224 | Gentoo | Biscoe | 47.6 | 14.5 | 215.0 | 5400.0 | MALE |
| 215 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | MALE |
| 163 | Chinstrap | Dream | 51.7 | 20.3 | 194.0 | 3775.0 | MALE |
| 11 | Adelie | Torgersen | 37.8 | 17.3 | 180.0 | 3700.0 | NaN |


In [14]:
df.dtypes

species               object
island                object
culmen_length_mm     float64
culmen_depth_mm      float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

In [15]:
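# One-hot encode the categorical columns. A missing 'island' or 'sex' value
# becomes an all-zero dummy row rather than NaN, so the dropna() below only
# removes rows with a missing value in the remaining columns.
df = pd.get_dummies(df, columns=['island', 'sex'])
df.dropna(inplace=True)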

# Separate features and target variable
X = df.drop(['species'], axis=1)
y = df['species']

from sklearn.preprocessing import LabelEncoder

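# Encode the species names as integer class labels;
# le.classes_ lists the original names in encoded order.
le = LabelEncoder()
y = le.fit_transform(y)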
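
The order of `get_dummies` and `dropna` in the cell above matters. As a minimal sketch, the toy frame below (hypothetical rows that only borrow the notebook's column names) suggests what happens to a row with a missing `sex` value: it keeps all-zero dummy columns and survives `dropna`, while a row with a missing measurement is removed.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "body_mass_g": [3700.0, 5400.0, np.nan],
    "sex": ["MALE", np.nan, "FEMALE"],
})

# Same order as the notebook: encode first, then drop rows that still contain NaN.
encoded = pd.get_dummies(toy, columns=["sex"])
kept = encoded.dropna()

# Row 1 (missing sex) has zeros in both dummy columns and is kept;
# row 2 (missing body mass) is dropped.
print(encoded)
print(kept.index.tolist())
```

If rows with a missing `sex` should be excluded instead, calling `dropna` before `get_dummies` would do that, at the cost of changing the sample the models are scored on.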

## Accuracy Levels From Previous Activity

The best AdaBoost model we were able to put together in the last activity had an accuracy of 0.9912. That is the benchmark to beat with gradient boosting and XGBoost.

## Gradient Boosting

In [20]:
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier

# define the model
model = GradientBoostingClassifier()
# define the evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model on the dataset
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.985 (0.020)


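With no tuning, this model's mean accuracy is 0.985, slightly below the 0.991 we got with AdaBoost. The gap of about 0.006 is well within the fold-to-fold standard deviation of 0.020, so the two models are hard to tell apart on this evidence.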
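
The mean and standard deviation reported above summarize one score per fold. As a rough sketch of what `RepeatedStratifiedKFold(n_splits=10, n_repeats=3)` produces, the snippet below uses toy labels (an assumption: three balanced classes, not the penguin data) to count the folds and check that each test fold preserves the class balance.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy stand-in labels (assumption): three classes with 30 samples each.
y_demo = np.repeat([0, 1, 2], 30)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the splits

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# 10 splits repeated 3 times gives 30 train/test pairs, hence 30 scores.
n_folds = cv_demo.get_n_splits(X_demo, y_demo)

# Per-class counts in each test fold; stratification keeps them balanced.
fold_class_counts = [
    np.bincount(y_demo[test_idx], minlength=3).tolist()
    for _, test_idx in cv_demo.split(X_demo, y_demo)
]

print(n_folds)
print(fold_class_counts[0])
```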

## XGBoost

In [21]:
from xgboost import XGBClassifier

# define the model
model = XGBClassifier()
# define the evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model on the dataset
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.987 (0.021)


In [24]:
from sklearn.model_selection import GridSearchCV
import multiprocessing

if __name__ == "__main__":
    print("Parallel Parameter optimization")
    xgb_model = XGBClassifier(n_jobs=multiprocessing.cpu_count() // 2)
    clf = GridSearchCV(xgb_model, {'max_depth': [1, 2, 4, 6],
                                   'n_estimators': [20, 50, 100]}, verbose=1,
                       n_jobs=-1, cv=5)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)

Parallel Parameter optimization
Fitting 5 folds for each of 12 candidates, totalling 60 fits
0.994160272804774
{'max_depth': 1, 'n_estimators': 100}


The code above runs a grid search over `max_depth` (values 1, 2, 4, and 6) and `n_estimators`, the number of trees (values 20, 50, and 100), scoring each combination with 5-fold cross-validation. The best-performing combination is a max depth of 1 with 100 trees.
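
The "12 candidates, totalling 60 fits" line in the output follows directly from the grid. A small sketch using scikit-learn's `ParameterGrid` to enumerate the same combinations (the variable names here are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {'max_depth': [1, 2, 4, 6], 'n_estimators': [20, 50, 100]}

# Every combination of the two parameters: 4 depths x 3 tree counts.
candidates = list(ParameterGrid(param_grid))

# Each candidate is fit once per cross-validation fold (cv=5 in the search).
n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 12 60
```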

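The CV accuracy for this model is 0.994, which outperforms the best AdaBoost model from the previous activity as well as every other model in this activity. This score comes from 5-fold CV and is the best of 12 candidates, while the untuned scores above used repeated 10-fold CV, so the comparison is not quite like-for-like. Even so, XGBoost reigns supreme.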