**Section 6: Classification - Hyperparameters**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.0, June 2 2025

In order to use the relevant packages we need the following import statements: 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from sklearn import tree
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


# Introduction

In this assignment you will see code that uses pipelines and cross validation and that searches the hyperparameter space.

- use the internet to understand the code and make notes wherever suitable
- select some hyperparameters and search the parameter space
- create a decision tree for the best parameter set
- use the test data set to evaluate the tree

# 1. Loading the Data

In [None]:
penguins=sns.load_dataset('penguins')

# 2. Getting to Know the Data

In [None]:
penguins.info()

`species` is the target variable / the class. 
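
The output of `info()` also lists the non-null count per column, which is a quick way to spot missing values. The sketch below shows how missing entries could be counted explicitly. It uses a small hypothetical frame (`toy`, with made-up rows) instead of the real data, so the numbers do not describe the penguins data set. Depending on the scikit-learn version, missing values may have to be handled before a tree can be trained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few penguins columns (not the real data)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, np.nan, 46.5, 40.3],
    "sex": ["Male", "Female", None, "Female"],
})

# number of missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# one simple option: keep only the complete rows
toy_complete = toy.dropna()
print(len(toy), "rows before,", len(toy_complete), "rows after dropna()")
```

Dropping rows is only one possible strategy; imputing values is another one.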

# 3. Preparing the Data

- create an `X` data frame with the features and a `y` series with the class
- encode the categorical values using `get_dummies()`
- split the data into a training and a test data set

In [None]:
X=penguins.copy()
y=X.pop('species')

In [None]:
X=pd.get_dummies(X)
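
To see what `get_dummies()` does, here is a minimal sketch on a hypothetical toy frame (the rows are made up for illustration). Numerical columns are kept as they are, and every category of a categorical column becomes its own indicator column. A missing category value sets none of the indicators.

```python
import pandas as pd

# Hypothetical toy frame (not the real penguins data)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0],
    "sex": ["Male", "Female", None],
})

encoded = pd.get_dummies(toy)
print(encoded)

# the numerical column is kept, each category gets its own indicator column
print(list(encoded.columns))

# the row with the missing value has no indicator set
print(encoded.loc[2, ["sex_Female", "sex_Male"]].sum())
```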

In [None]:
X.info()

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33, random_state=10)
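
The split above is purely random. As a possible variation (not part of the original assignment), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. The sketch uses the iris data that ships with scikit-learn as a stand-in, so all variable names ending in `_demo`, `_tr` and `_te` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# stand-in data set (assumption: used only for illustration)
X_demo, y_demo = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=10, stratify=y_demo)

# class counts in the training and the test part
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```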

# 4. Learn the Decision Tree and Determine its Accuracy

- use a scaler (e.g. the standard scaler, which applies a z-transform) for the numerical data: train the scaler on the training data set and apply it to the training and the test data set
- train the tree using the training data
- display the tree
- use the test data to determine its accuracy

In [None]:
theScaler=StandardScaler()
X_train_scaled=theScaler.fit_transform(X_train)
X_test_scaled=theScaler.transform(X_test)

In [None]:
classif=tree.DecisionTreeClassifier()
classif.fit(X_train_scaled,y_train)


In [None]:
print("accuracy classif:",classif.score(X_test_scaled, y_test))
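
The accuracy is a single number. A confusion matrix shows in addition which classes get mixed up. The following sketch is an optional extension, again on the iris stand-in data and without scaling to keep it short; `clf`, `cm` and the other names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in data set (assumption: used only for illustration)
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=10)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)

# rows: true class, columns: predicted class
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)

# accuracy = correctly classified samples (the diagonal) / all samples
print(accuracy, np.trace(cm) / cm.sum())
```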

In [None]:
y.unique()

In [None]:
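import os
os.makedirs('plots', exist_ok=True) # make sure the output folder exists
fig=plt.figure(dpi=1600)
# plot the tree; classif.classes_ holds the class labels in the order the tree uses them
tree.plot_tree(classif,feature_names=list(X_train.columns),class_names=list(classif.classes_), filled=True)
fig.savefig('plots/tree_penguins.png') # and here we save it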

In [None]:
theTree=tree.export_text(classif,feature_names=list(X_test.columns)) # translate the tree to a set of rules (a plain list of names works across scikit-learn versions)
print(theTree)

# 5. Pipelines

Information about [Pipelines](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators).

In [None]:
scaledClassifier = [('scaler', StandardScaler()), 
                    ('tree', tree.DecisionTreeClassifier())]

pipe = Pipeline(scaledClassifier)

In [None]:
print(pipe)

In [None]:
pipe.fit(X_train,y_train)

In [None]:
pipe.score(X_test,y_test)
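
A pipeline bundles the steps from section 4: `fit` trains the scaler on the training data, transforms it and trains the tree, while `predict` and `score` apply the already trained scaler before calling the tree. The sketch below checks this on the iris stand-in data by comparing a manual version with a pipeline. The fixed `random_state=0` is an assumption added here so that both trees are built identically.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# stand-in data set (assumption: used only for illustration)
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=10)

# manual version: scale, then train and apply the tree
scaler = StandardScaler()
manual_tree = DecisionTreeClassifier(random_state=0)
manual_tree.fit(scaler.fit_transform(X_tr), y_tr)
manual_pred = manual_tree.predict(scaler.transform(X_te))

# pipeline version: the same two steps behind one fit/predict interface
demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("tree", DecisionTreeClassifier(random_state=0))])
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

print("same predictions:", np.array_equal(manual_pred, pipe_pred))

# the individual steps stay accessible by name
print(demo_pipe.named_steps["tree"].get_depth())
```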

# 6. Cross Validation

Please check out the slides from the lecture as well as the documentation of [Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score).

In [None]:
scores = cross_val_score(pipe, X_train, y_train, cv=5)
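
What happens behind `cv=5` can be written out by hand. For a classifier and an integer `cv`, scikit-learn uses stratified folds; each fold serves once as validation data while a fresh copy of the model is trained on the remaining folds. The sketch below reproduces this on the iris stand-in data. The names `demo_pipe`, `auto_scores` and `manual_scores` as well as the fixed `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# stand-in data set (assumption: used only for illustration)
X_demo, y_demo = load_iris(return_X_y=True)

demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("tree", DecisionTreeClassifier(random_state=0))])

# the short way
auto_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)

# the same procedure written as an explicit loop over the folds
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(demo_pipe)  # untrained copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

print(auto_scores)
print(np.array(manual_scores))
```

Because the scaler is part of the pipeline, it is trained anew on the training folds in every round and never sees the validation fold.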

In [None]:
print(f"{scores.mean()*100:0.2f}% accuracy with",
      f"a standard deviation of {scores.std()*100:0.2f} percentage points")


# 7. Grid Search: Searching the Hyperparameter Space

Please check out the documentation of [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In [None]:
parameters={"max_depth":[4,5],"min_samples_split":[3,4]}
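
A grid search tries every combination of the listed values. The sketch below lists the combinations for the dictionary above with `ParameterGrid` and estimates the number of models that are trained, assuming the default 5-fold cross validation of `GridSearchCV`. The names `grid` and `n_fits` are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

parameters = {"max_depth": [4, 5], "min_samples_split": [3, 4]}

# every combination of the listed values
grid = ParameterGrid(parameters)
for combo in grid:
    print(combo)

# each combination is trained once per fold (assumption: cv=5, the default)
n_fits = len(grid) * 5
print(len(grid), "combinations,", n_fits, "fits during the search")
```

The number of fits grows multiplicatively with every additional hyperparameter and value, so large grids quickly become expensive.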

In [None]:
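# the scaler is fitted once on X_train; the grid search (5-fold cross validation by default)
# then runs as the last pipeline step on the scaled data
gridPipeline = Pipeline([('scaler', StandardScaler()),
                         ('tree', GridSearchCV(tree.DecisionTreeClassifier(), parameters, scoring="accuracy"))])
gridPipeline.fit(X_train,y_train)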

In [None]:
gridPipeline["tree"].cv_results_

In [None]:
theResults=pd.DataFrame(gridPipeline["tree"].cv_results_)

In [None]:
theResults
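
An alternative arrangement, sketched below as one possible approach, wraps the whole pipeline in the grid search instead of placing the grid search inside the pipeline. The scaler is then trained again within every cross validation fold. Parameters of a pipeline step are addressed as `<step name>__<parameter>`. The sketch uses the iris stand-in data; `demo_pipe`, `param_grid`, `search` and the fixed `random_state=0` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# stand-in data set (assumption: used only for illustration)
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=10)

demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("tree", DecisionTreeClassifier(random_state=0))])

# "tree__" addresses the pipeline step named "tree"
param_grid = {"tree__max_depth": [4, 5], "tree__min_samples_split": [3, 4]}

search = GridSearchCV(demo_pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

# one row per parameter combination
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "std_test_score", "rank_test_score"]])

# best combination and its mean cross validation accuracy
print(search.best_params_, search.best_score_)

# the search refits the best pipeline on all training data, so it can be scored directly
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```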

# 8. Searching the Parameter Space

In [None]:
# Your code

*--- End of notebook ---*

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.