
### Exercises: Decision trees, random forests and K-means

In [1]:
# Import necessary libraries here
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
import seaborn as sns
import matplotlib.pyplot as plt

This exercise uses the wine dataset included in the sklearn.datasets module.
In the Jupyter tab explore the 'wine' object. It stores all kinds of data about the wine dataset. You will probably need data, feature_names, target and target_names. In the following cell the wine dataset is loaded.

In [27]:
data = load_wine()  # base dataset; holds the features (data) and the target variable (target)

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
 

## DATA PREPARATION
a. Create an X and y dataset with the features and the target variable. Check the first rows of the datasets.


In [9]:
X = pd.DataFrame(data.data,                   # feature matrix
                 columns=data.feature_names)
y = pd.Series(data.target, name='target')     # target variable

### Explore data

 a. Investigate the target variable. Which and how many different values are there?\
 b. Do you think these values are nominal, ordinal, interval or ratio? Why?\
 c. Investigate the features (independent variables).\
 d. Are they nominal, ordinal, interval or ratio? Why?

In [11]:
# b) Nominal, ordinal, interval or ratio?
print("\nTarget classification: nominal variable (no order).")

# c-d) Inspect the features: type and measurement level
print("\nFeature overview:\n", X.dtypes)
print("\nFeatures are continuous (ratio scale), e.g. alcohol content has an absolute zero point and ratios are meaningful.")


Target classification: nominal variable (no order).

Feature overview:
 alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280/od315_of_diluted_wines    float64
proline                         float64
dtype: object

Features are continuous (ratio scale), e.g. alcohol content has an absolute zero point and ratios are meaningful.


### Prepare data for validation
To validate the model, split the dataset into a training set and a test set. Use 70% of the data for training and 30% for testing.

In [12]:
from sklearn.model_selection import train_test_split  # import train_test_split

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X,                       # feature matrix
    y,                       # target variable
    test_size=0.3,           # 30% of the data for testing
    random_state=42,         # seed for reproducibility
    stratify=y               # keep the class proportions in train and test
)
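
The effect of `stratify=y` can be checked by comparing class proportions before and after the split. The following is a minimal sketch of that check; the names `wine` and `proportions` are introduced here for illustration and are not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target, name="target")

# Same split settings as in the exercise cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Share of each class in the full data, the training set and the test set.
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).sort_index()

print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(proportions.round(3))
```

With stratification the three columns should be close to each other; without it, a small dataset like this one can end up with noticeably different class mixes in train and test.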


## Model selection and hyperparameter selection
### 1. Creating a Decision Tree Classifier
a. Based on the type of data in the features and target variable, would you use a classification or regression model?

In [13]:
# a) Choose a classification model because the target variable is categorical (nominal)
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion='gini',    # split criterion: Gini impurity
    random_state=42      # for reproducibility
)



b. First fit a "DecisionTreeClassifier". Use random_state=42 to make sure that the results are reproducible. In this first run all hyperparameters can be kept at their default values.\
c. Validate the model by using the correct classification metrics

In [14]:
# b) Instantiate and train the Decision Tree Classifier
model = DecisionTreeClassifier(random_state=42)  # default hyperparameters + reproducible seed
model.fit(X_train, y_train)                      # fit the model on the training data

In [15]:
# c) Validate on the test set
y_pred = model.predict(X_test)                   # make predictions
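
One way to look at these predictions in more detail is a confusion matrix, which shows which classes get mistaken for which. The sketch below rebuilds the same split and default tree so that it runs on its own; `wine`, `tree`, `y_pred_tree` and `cm` are illustrative names, not part of the exercise.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42, stratify=wine.target
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred_tree, labels=[0, 1, 2])
print(cm)

# The diagonal holds the correct predictions, so its share of the total is the accuracy.
print(f"accuracy from matrix: {cm.trace() / cm.sum():.3f}")
```

`ConfusionMatrixDisplay`, already imported at the top of the notebook, can draw the same matrix as a plot if a visual is preferred.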

Following this first model, you are going to create extra models in the next steps. To make it easy to evaluate each model, create a function evaluate(y_test, y_test_pred) that prints the evaluation results as you did in the previous step.

In [17]:
from sklearn.metrics import (
    accuracy_score,      # overall accuracy
    precision_score,     # precision
    recall_score,        # recall
    f1_score,            # F1 score
    classification_report
)

def evaluate(y_true, y_pred):
    """Print the main classification metrics for y_true vs. y_pred."""
    # Compute the metrics (weighted average over the classes)
    acc  = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average='weighted')
    rec  = recall_score(y_true, y_pred, average='weighted')
    f1   = f1_score(y_true, y_pred, average='weighted')

    # Show the summary
    print(f'ACC  : {acc:.3f} - PREC : {prec:.3f} - REC  : {rec:.3f} - F1   : {f1:.3f}')

    # Show the detailed report per class
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred))

y_pred = model.predict(X_test)
evaluate(y_test, y_pred)


ACC  : 0.963 - PREC : 0.966 - REC  : 0.963 - F1   : 0.963

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.94      0.97        18
           1       0.91      1.00      0.95        21
           2       1.00      0.93      0.97        15

    accuracy                           0.96        54
   macro avg       0.97      0.96      0.96        54
weighted avg       0.97      0.96      0.96        54



### 2. Adapting Hyperparameters
a. Create a new model, but now adjust the hyperparameter min_samples_leaf to 11.\
b. What is the effect on the tree?\
c. What is the effect on the metrics?

In [19]:
# a) New model with min_samples_leaf=11 (prevents small, overfitted leaves)
model2 = DecisionTreeClassifier(
    random_state=42,     # for reproducibility
    min_samples_leaf=11  # every leaf contains at least 11 samples
)

In [20]:
# Train the model on the training set
model2.fit(X_train, y_train)  # fit the model with the given hyperparameter

# b) Effect on the tree: fewer splits, simpler structure
#    The number of leaves can be read with:
n_leaves = model2.get_n_leaves()  # number of leaf nodes
print(f"Number of leaves with min_samples_leaf=11: {n_leaves}")

Number of leaves with min_samples_leaf=11: 6


In [21]:
# c) Effect on the metrics: training accuracy usually drops somewhat (less overfitting);
#    test accuracy can improve or drop depending on the data.
#    In this run test accuracy drops from 0.963 (default tree) to 0.907.
#    Use the evaluation function defined earlier:
y_pred2 = model2.predict(X_test)                  # make predictions
evaluate(y_test, y_pred2)                         # compare with the test labels

ACC  : 0.907 - PREC : 0.925 - REC  : 0.907 - F1   : 0.905

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.72      0.84        18
           1       0.81      1.00      0.89        21
           2       1.00      1.00      1.00        15

    accuracy                           0.91        54
   macro avg       0.94      0.91      0.91        54
weighted avg       0.93      0.91      0.90        54



### 3. Creating and validating a random forest model
With the same dataset we are now going to predict the wine classes using a random forest model. This model uses multiple trees and aggregates the results.\
 a. Create a RandomForestClassifier with the n_estimators hyperparameter set to 100; also use random_state=42.\
 b. Fit the model to the training set and evaluate on the test set.\
 c. Based on the metrics, does this model outperform the previous model?\
 d. Lower the hyperparameter n_estimators until the model performs worse than with n_estimators=100. At how many estimators does performance drop?


In [22]:
# a) Create a RandomForestClassifier with 100 trees
rf_model = RandomForestClassifier(
    n_estimators=100,    # number of decision trees in the ensemble
    random_state=42      # for reproducibility
)

In [23]:
# b) Train the model and evaluate on the test set
rf_model.fit(X_train, y_train)              # fit on the training data
y_pred_rf = rf_model.predict(X_test)        # predictions on the test data
evaluate(y_test, y_pred_rf)                 # use the evaluation function defined earlier

ACC  : 1.000 - PREC : 1.000 - REC  : 1.000 - F1   : 1.000

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      1.00      1.00        21
           2       1.00      1.00      1.00        15

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54
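
The statement that a random forest "aggregates the results" of its trees can be made concrete. In scikit-learn the forest's class probabilities are the average of the per-tree probabilities, and the sketch below checks this on the same split; the names `forest`, `per_tree` and `avg_proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42, stratify=wine.target
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([t.predict_proba(X_test) for t in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities.
avg_proba = per_tree.mean(axis=0)
print(np.allclose(avg_proba, forest.predict_proba(X_test)))

# The predicted class is the one with the highest averaged probability.
print((forest.classes_[avg_proba.argmax(axis=1)] == forest.predict(X_test)).all())
```

Averaging many trees trained on different bootstrap samples is what smooths out the mistakes of any single tree.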



In [25]:
# c) Compare the metrics printed above with those of the Decision Tree model to determine
#    whether the Random Forest performs better.

# d) Find the n_estimators value at which accuracy drops below the baseline (100 trees)
baseline_acc = accuracy_score(y_test, y_pred_rf)  # baseline accuracy with 100 trees

for n in [50, 20, 10, 5]:
    tmp_model = RandomForestClassifier(n_estimators=n, random_state=42)
    tmp_model.fit(X_train, y_train)                 # train with fewer trees
    acc_tmp = accuracy_score(y_test, tmp_model.predict(X_test))
    print(f"n_estimators={n}: ACC={acc_tmp:.3f}")
# Inspect the output to see at which n_estimators the accuracy falls below baseline_acc

n_estimators=50: ACC=1.000
n_estimators=20: ACC=0.981
n_estimators=10: ACC=0.963
n_estimators=5: ACC=0.963
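
On a 54-row test set a single misclassified wine shifts accuracy by about 0.019 (1/54), so the differences above amount to one or two samples. A hedged way to get a steadier comparison is cross-validation. The sketch below uses 5-fold stratified cross-validation on the full dataset purely as an illustration; the fold count and the `cv_results` name are assumptions, not part of the exercise.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

wine = load_wine()

# Shuffled, stratified folds; the seed keeps the folds identical for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for n in [100, 50, 20, 10, 5]:
    forest = RandomForestClassifier(n_estimators=n, random_state=42)
    scores = cross_val_score(forest, wine.data, wine.target, cv=cv, scoring="accuracy")
    cv_results[n] = (scores.mean(), scores.std())
    print(f"n_estimators={n:>3}: mean ACC={scores.mean():.3f} (std {scores.std():.3f})")
```

If the means differ by less than the spread across folds, the ranking of the settings should be read with caution.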


### 4. Using another classification model
In the previous exercises you used a Decision Tree and a Random Forest model to predict the wine classes. Now, let's try another classification model.\
a. Go to https://scikit-learn.org/stable/machine_learning_map.html. Begin at the start node and follow the flowchart until you reach the first model to use. Which model did you find?\
b. Create the model and perform the validation.\
c. What are your findings?

In [26]:
# Import the Stochastic Gradient Descent classifier from sklearn
from sklearn.linear_model import SGDClassifier

# a) Instantiate the SGDClassifier.
#    Note: on the scikit-learn map, datasets with fewer than 100K samples lead to
#    LinearSVC first; SGDClassifier is the branch for larger datasets.
model_sgd = SGDClassifier(
    random_state=42       # seed for reproducibility
)

# b) Train the model on the training set
model_sgd.fit(X_train, y_train)           # fit on X_train and y_train

# Make predictions on the test set
y_pred_sgd = model_sgd.predict(X_test)    # predictions on X_test

# c) Evaluate with the evaluate() function defined earlier
evaluate(y_test, y_pred_sgd)              # shows accuracy, precision, recall and F1 score


ACC  : 0.500 - PREC : 0.522 - REC  : 0.500 - F1   : 0.424

Classification Report:
              precision    recall  f1-score   support

           0       0.40      1.00      0.57        18
           1       1.00      0.43      0.60        21
           2       0.00      0.00      0.00        15

    accuracy                           0.50        54
   macro avg       0.47      0.48      0.39        54
weighted avg       0.52      0.50      0.42        54
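
A likely reason for this weak result is feature scale: linear models trained with stochastic gradient descent are sensitive to it, and `proline` takes values around 1000 while `hue` is around 1. The sketch below compares the unscaled model with a `Pipeline` that standardises the features first, using the `StandardScaler` and `Pipeline` classes already imported at the top of the notebook. The names `raw_sgd` and `scaled_sgd` are illustrative, and the exact scores may vary between scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42, stratify=wine.target
)

# SGD on the raw features, as in the cell above.
raw_sgd = SGDClassifier(random_state=42).fit(X_train, y_train)

# Same model, but each feature is scaled to zero mean and unit variance first.
# The scaler is fitted on the training data only, which avoids leaking test data.
scaled_sgd = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDClassifier(random_state=42)),
]).fit(X_train, y_train)

acc_raw = raw_sgd.score(X_test, y_test)
acc_scaled = scaled_sgd.score(X_test, y_test)
print(f"SGD without scaling: ACC={acc_raw:.3f}")
print(f"SGD with scaling   : ACC={acc_scaled:.3f}")
```

Tree-based models split on one feature at a time and are therefore unaffected by scaling, which is why the decision tree and random forest did not need this step.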





In [28]:
# Look at a few tables to gain information about the data
import pandas as pd
from sklearn.datasets import load_wine

# Load the data
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)  # features as a DataFrame
y = pd.Series(data.target, name='target')               # target as a Series

# 1) Look at the first few rows
print("First 5 rows of X:")
print(X.head())                                         # shows the first 5 rows

print("\nDistribution of y:")
print(y.value_counts())                                 # shows how many samples each class has

# 2) Inspect structure and types
print("\nInformation about X (dtype, non-null count):")
print(X.info())                                         # gives column names, types, missing values

# 3) Statistical summary
print("\nStatistical description of the numeric features:")
print(X.describe())                                     # mean, std, min/max, quartiles



First 5 rows of X:
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline