# EXERCISES Hyperparameter tuning

### PREDICT PENGUIN SPECIES WITH HYPERPARAMETER TUNING USING CROSS-VALIDATION

We want to build a model that predicts the penguin species from observable penguin characteristics. The labeled dataset <strong>'penguins'</strong> is one of the Seaborn built-in datasets. Use a decision tree and experiment with the following hyperparameters to find the best solution: maximum tree depth ranging from 3 to 10, and split criterion equal to 'gini' or 'entropy'. Derive the best model over this set of hyperparameter values using 3-fold cross-validation, with accuracy as the validation measure for the hyperparameter tuning. Use <strong>species</strong> as the target variable and all other variables except <strong>island</strong> and <strong>sex</strong> as predictors.
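The setup described in the statement can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's own solution (the cells in this notebook tune a random forest on island and sex instead). The helper names `make_tree_search` and `load_penguin_xy` are hypothetical, the depth range is read as 3 up to and including 10, and dropping rows with missing values and fixing `random_state` are assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: the four numeric columns of seaborn's 'penguins' dataset,
# i.e. everything except species (target), island and sex.
FEATURES = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]


def make_tree_search():
    """Grid search over the hyperparameters named in the exercise."""
    param_grid = {
        "max_depth": list(range(3, 11)),   # 3 up to and including 10
        "criterion": ["gini", "entropy"],
    }
    return GridSearchCV(
        DecisionTreeClassifier(random_state=42),  # fixed seed: an assumption, for reproducibility
        param_grid,
        cv=3,                 # 3-fold cross-validation
        scoring="accuracy",   # validation measure
    )


def load_penguin_xy():
    """Load the penguins data; rows with missing values are dropped (an assumption)."""
    # Imported here because load_dataset needs network access or a local cache.
    import seaborn as sns

    df = sns.load_dataset("penguins").dropna(subset=FEATURES + ["species"])
    return df[FEATURES], df["species"]
```

A possible use: `X, y = load_penguin_xy()`, then `search = make_tree_search().fit(X, y)`, after which `search.best_params_` and `search.best_score_` hold the selected hyperparameters and their mean cross-validated accuracy.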

In [1]:
# IMPORTS AND DATA PREPARATION
import seaborn as sns                                           # theory: loading data (slides: Python basics)
from sklearn.model_selection import GridSearchCV, train_test_split  # theory: cross-validation and train/test split (slides: Hyperparameters)
from sklearn.pipeline import Pipeline                           # theory: pipelines for the workflow (slides: Hyperparameters)
from sklearn.preprocessing import OneHotEncoder                 # theory: categorical encoding (slides: Data visualisation)
from sklearn.compose import ColumnTransformer                   # theory: feature preprocessing (slides: Hyperparameters)
from sklearn.ensemble import RandomForestClassifier             # theory: classifier to tune (slides: Decision Trees)
from sklearn.metrics import classification_report               # theory: evaluation metrics (slides: Hyperparameters)

In [2]:
# 1) Load the data
df = sns.load_dataset('penguins')                               # load the penguins dataset
df = df.dropna(subset=['island', 'sex', 'species'])             # drop rows with missing values in the columns used

In [3]:
# 2) Split features and target
X = df[['island', 'sex']]                                       # predictors: island and sex
y = df['species']                                               # target: species

In [4]:
# 3) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y            # stratify keeps the class proportions equal in train and test
)
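The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical, deliberately imbalanced labels (not the penguin data) and counts the classes on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 / 30 / 10 samples of three classes.
y = np.repeat([0, 1, 2], [60, 30, 10])
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how often each class occurs on each side of the split.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)
```

Both sides keep the 6:3:1 ratio of the full label vector. Stratifying does not balance the classes; it only preserves their original proportions.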

In [5]:
# 4) Define the preprocessing
categorical_features = ['island', 'sex']
categorical_transformer = OneHotEncoder(drop='first')           # drop='first' to avoid collinearity

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [6]:
# 5) Set up the pipeline and the parameter grid
pipeline = Pipeline(steps=[
    ('prep', preprocessor),                                     # encode first
    ('clf', RandomForestClassifier(random_state=42))            # then classify
])

param_grid = {
    'clf__n_estimators': [50, 100, 200],                        # number of trees
    'clf__max_depth': [None, 5, 10],                            # maximum tree depth
    'clf__min_samples_split': [2, 5]                            # minimum samples required to split a node
}
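The size of this grid determines how much work the search does. A small sketch using scikit-learn's `ParameterGrid` to enumerate the combinations; the fold count of 5 is taken from the `cv=5` setting used in this notebook's grid search.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "clf__n_estimators": [50, 100, 200],
    "clf__max_depth": [None, 5, 10],
    "clf__min_samples_split": [2, 5],
}

# Every combination of the listed values is one candidate model.
combinations = list(ParameterGrid(param_grid))
n_folds = 5
n_fits = len(combinations) * n_folds  # one fit per candidate per fold
print(len(combinations), "combinations,", n_fits, "fits during the search")
```

With the default `refit=True`, the search fits the best candidate one more time on the full training set.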

In [7]:
# 6) GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,                                                        # 5-fold CV (slides: Hyperparameters)
    scoring='accuracy',                                          # evaluation criterion
    n_jobs=-1                                                    # use all cores
)
grid_search.fit(X_train, y_train)

In [8]:
# 7) Inspect the results
print("Best parameters:", grid_search.best_params_)             # shows the best hyperparameters
print("CV accuracy:", grid_search.best_score_)                  # mean CV score

Best parameters: {'clf__max_depth': None, 'clf__min_samples_split': 2, 'clf__n_estimators': 50}
CV accuracy: 0.6881201956673654


In [9]:
# 8) Final evaluation on the test set
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Adelie       1.00      0.28      0.43        29
   Chinstrap       0.58      1.00      0.74        14
      Gentoo       0.69      1.00      0.81        24

    accuracy                           0.69        67
   macro avg       0.76      0.76      0.66        67
weighted avg       0.80      0.69      0.63        67
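The rounded figures in this report can be reproduced from a single confusion matrix. The matrix below is a reconstruction inferred from the report (an assumption, not an output of the notebook); the sketch expands it back into label vectors and recomputes the per-class metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

labels = ["Adelie", "Chinstrap", "Gentoo"]

# Reconstructed (not taken from the notebook): rows are true species,
# columns are predicted species.
confusion = np.array([
    [8, 10, 11],   # Adelie: 8 correct, 21 predicted as another species
    [0, 14, 0],    # Chinstrap: all correct
    [0, 0, 24],    # Gentoo: all correct
])

# Expand the matrix back into true and predicted label vectors.
y_true, y_pred = [], []
for i, true_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        y_true += [true_label] * int(confusion[i, j])
        y_pred += [pred_label] * int(confusion[i, j])

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)
accuracy = accuracy_score(y_true, y_pred)

for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{name:>10}  {p:.2f}  {r:.2f}  {f:.2f}  {s}")
print(f"accuracy {accuracy:.2f}")
```

Read this way, every error in the report is an Adelie penguin predicted as another species: Chinstrap and Gentoo have recall 1.00, and every Adelie prediction is correct (precision 1.00).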

