# EXERCISES MACHINE LEARNING

---

## scikit-learn machine learning pipeline with validation

---

(concrete compressive strength dataset)

### Task 1 : Import libraries

Import the necessary libraries (pandas, NumPy, Matplotlib, Seaborn and scikit-learn).

In [None]:
# SOLUTION
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

### Task 2 : Set Seaborn style

Choose a seaborn style for plots.

In [None]:
# SOLUTION
sns.set()

### Task 3 : Load the data

Load the 'Concrete_Data.csv' from the data directory and display the first rows.

In [None]:
# SOLUTION
df = pd.read_csv('../datasets/Concrete_Data.csv')
# Display the first few rows to understand the structure of the dataset
df.head()

### Task 4 : Key statistics and missing values

Understand the dataset by displaying key statistics and checking for missing values.

In [None]:
#SOLUTION
# Key statistics
display(df.describe())
# Count the number of missing values per column
print(df.isnull().sum())

### Task 5 : Heatmap with correlations

Plot a heatmap of the correlation matrix to understand the relationships between the target variable 'csMPa' and all other variables (predictors).

In [None]:
#SOLUTION
plt.figure(figsize=(10, 8))  # Larger canvas so the annotated values stay readable
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
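
The heatmap can be hard to read at a glance. One option is to take the target's column of the correlation matrix and rank the predictors by absolute correlation. A minimal sketch on a small made-up frame (the column names mimic the dataset, but the values and the names `toy`, `target_corr` and `ranked` are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only; not the real concrete dataset
toy = pd.DataFrame({
    'cement': [100, 200, 300, 400, 500],
    'water':  [200, 190, 185, 170, 160],
    'age':    [28, 7, 90, 3, 28],
    'csMPa':  [12, 25, 33, 41, 55],
})

# Correlation of every predictor with the target (the target itself is dropped)
target_corr = toy.corr()['csMPa'].drop('csMPa')

# Rank by absolute value: a strong negative correlation is as useful as a positive one
ranked = target_corr.sort_values(key=abs, ascending=False)
print(ranked)
```

On the real data, the same two lines applied to `df` would list the candidate features for the following tasks in order of correlation strength.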


### Task 6 : Simple linear regression

Perform a simple linear regression with 'csMPa' as target and one feature.

Which feature seems to be the best candidate for the job?

Use a standard test setup (training set and test set) and make the predictions 
for the observations in the test set.

In [None]:
#SOLUTION

# Select the feature (X) and the target variable (y)
# 'cement' has the highest correlation with 'csMPa' and hence is the best 
# candidate.
X = df[['cement']] # Results in a dataframe
y = df['csMPa']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a linear regression model
lin_reg_one = LinearRegression()

# Train the model
lin_reg_one.fit(X_train, y_train)

# Predict the test set
y_pred_one = lin_reg_one.predict(X_test)

### Task 7 : Validation

Calculate and display the R-squared and Mean Squared Error (MSE) for the simple linear regression.

In [None]:
#SOLUTION
print(f'R-squared (Simple Linear Regression): {r2_score(y_test, y_pred_one):.3f}')
print(f'Mean Squared Error (Simple Linear Regression): {mean_squared_error(y_test, y_pred_one):.3f}')
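
To see what these two metrics measure, they can be computed by hand and compared with the scikit-learn results. A minimal sketch with made-up numbers (the arrays `y_true` and `y_hat` are invented for illustration and are not taken from the concrete dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)              # average squared error
ss_res = np.sum(residuals ** 2)                   # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot                   # share of variation explained

print(f'MSE       : {mse_manual:.3f} (sklearn: {mean_squared_error(y_true, y_hat):.3f})')
print(f'R-squared : {r2_manual:.3f} (sklearn: {r2_score(y_true, y_hat):.3f})')
```

MSE is expressed in squared units of the target (here MPa squared), while R-squared is unitless, which is why both are usually reported together.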

### Task 8 : Regression line

Plot the regression line along with the test data points and print the intercept and coefficients.

In [None]:
#SOLUTION
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual values')
plt.plot(X_test, y_pred_one, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Cement')
plt.ylabel('Compressive Strength (MPa)')
plt.legend()
plt.title('Simple Linear Regression: Cement vs Strength')
plt.show()

In [None]:
#SOLUTION
print(f'Intercept    : {lin_reg_one.intercept_:.3f}')
print(f'Coefficients : {lin_reg_one.coef_}')

### Task 9 : Linear regression with more predictors

As we can see, the single feature does not explain the compressive strength too well. We continue our search for a better model
by including more than one variable in the regression model. Which two other features are good candidates?
Make predictions for the test set and calculate the Mean Squared Error and R-squared. Print the coefficients and the intercept.

In [None]:
#SOLUTION

# Let's include 'cement', 'water' and 'age' as features based on their correlation with the target variable.

# Select multiple features (X) and the target variable (y)
X = df[['cement', 'water', 'age']]
y = df['csMPa']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
lin_regML = LinearRegression()

# Train the model
lin_regML.fit(X_train, y_train)

# Predict on the test set
y_predML = lin_regML.predict(X_test)

print(f'R-squared          : {r2_score(y_test, y_predML):.3f}')
print(f'Mean Squared Error : {mean_squared_error(y_test, y_predML):.3f}')
print(f'Intercept          : {lin_regML.intercept_:.3f}')
print(f'Coefficients       : {lin_regML.coef_}')
#SOLUTION_END
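
Note that Task 6 splits the data with `random_state=0` and this task with `random_state=42`, so the two models are scored on different test sets. For a strict comparison, both feature sets would be evaluated on the same split. A minimal sketch on synthetic data (the data and the names `X_syn`, `y_syn` and `fit_and_score` are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on three features plus a little noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = 3 * X_syn[:, 0] + 2 * X_syn[:, 1] + X_syn[:, 2] + rng.normal(scale=0.1, size=200)

# One split, reused for every candidate model
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

def fit_and_score(columns):
    """Fit a linear regression on the given columns and return the test R-squared."""
    model = LinearRegression().fit(Xs_tr[:, columns], ys_tr)
    return r2_score(ys_te, model.predict(Xs_te[:, columns]))

r2_single = fit_and_score([0])        # one predictor
r2_multi = fit_and_score([0, 1, 2])   # three predictors, same train/test rows
print(f'one feature: {r2_single:.3f}  three features: {r2_multi:.3f}')
```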

---

## scikit-learn machine learning pipeline with model selection and hyperparameter tuning using cross-validation

---

### 1. PREDICT PENGUIN SPECIES WITH HYPERPARAMETER TUNING USING CROSS-VALIDATION

We want to build a model that predicts the penguin species from observable penguin characteristics. We have a labeled dataset <strong>'penguins'</strong> that is part of the Seaborn built-in datasets. We want to use a decision tree and experiment with the following hyperparameters to find the best solution: maximum tree depth ranging from 3 to 10, and split criterion equal to 'gini' or 'entropy'. Derive the best model for this set of hyperparameter values using 3-fold cross-validation, with recall as the validation measure for the hyperparameter tuning. Use <strong>species</strong> as the target variable and all other variables except <strong>island</strong> and <strong>sex</strong> as predictors.

In [None]:
# DATA PREPARATION

import pandas as pd
pd.options.display.max_rows = None
import seaborn as sns 
df = sns.load_dataset('penguins')
y = df['species']              # Target feature to predict
X = df.copy().drop(['species','island', 'sex'], axis=1) # Predictors

print(type(df), df.shape)
print(type(X), X.shape)
print(type(y), y.shape)

display(X.head(5))
display(y.head(5))
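
The penguins dataset contains a few rows with missing measurements. As far as we know, recent scikit-learn versions (1.3 and later) let decision trees handle NaN values, while older versions raise an error when fitting. If that happens, one option is to drop the incomplete rows before separating predictors and target. A minimal sketch on a tiny stand-in frame (the values and the names `toy`, `predictors`, `X_toy` and `y_toy` are invented for illustration):

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with penguins-like columns (values invented)
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.1, np.nan, 46.1, 46.5],
    'body_mass_g': [3750.0, np.nan, 4500.0, 3500.0],
})

predictors = ['bill_length_mm', 'body_mass_g']
print(toy[predictors].isna().sum())   # number of missing values per predictor

# Drop incomplete rows BEFORE separating X and y, so both keep the same rows
toy_clean = toy.dropna(subset=predictors)
X_toy = toy_clean[predictors]
y_toy = toy_clean['species']
print(X_toy.shape, y_toy.shape)
```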

In [None]:
# Explore data
display(X.sample(10, random_state=0))
display(y.sample(10, random_state=0))
# Both samples use the same random_state, so the same rows (indexes) are drawn
# from X and from y. Without a shared random_state, the two samples would
# generally contain different rows.
display(X.describe())

In [None]:
# SPLIT LABELED DATA INTO TRAIN/VALIDATE - TEST SAMPLE

from sklearn.model_selection import train_test_split
# Split the data randomly into 80% training set and 20% test set
X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, random_state=0, train_size=0.8)
# (use random_state to be sure that every time the same random sample is drawn)

print(type(X_tr), X_tr.shape)
print(type(X_tst), X_tst.shape)
print(type(y_tr), y_tr.shape)
print(type(y_tst), y_tst.shape)

In [None]:
# MODEL SELECTION AND HYPERPARAMETER TUNING (REPEAT THIS STEP FOR MULTIPLE MODELING TECHNIQUES)

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Define parameter grid (model specific)
grid_param = {'criterion' : ['gini', 'entropy'],
              'max_depth' : list(range(3,11))}
display(grid_param)

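# Set up grid search with 3-fold cross-validation, using recall as the
# validation measure. This is a multiclass problem, so recall is averaged over
# the classes ('recall_macro'); 'recall_weighted' would equal plain accuracy.
grid_search = GridSearchCV(model, grid_param, cv=3, scoring='recall_macro')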

# Execute full grid search
grid_search.fit(X_tr, y_tr)

# Display best hyperparameter values and matching validation score
print(f'Best parameters : {grid_search.best_params_}')
print(f'Best score      : {grid_search.best_score_:.3f}')
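
Besides the best combination, the grid search keeps the validation score of every combination in `cv_results_`, which helps to judge how much the hyperparameters actually matter. A minimal sketch, using the iris data bundled with scikit-learn as a stand-in so that it runs without the penguins dataset (the names `demo_search`, `results` and `cols` are invented for illustration):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)   # stand-in data, not the penguins

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'criterion': ['gini', 'entropy'], 'max_depth': list(range(3, 11))},
    cv=3,
    scoring='recall_macro',   # recall averaged over the classes
)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination (2 criteria x 8 depths = 16 rows)
results = pd.DataFrame(demo_search.cv_results_)
cols = ['param_criterion', 'param_max_depth',
        'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())
```

If many combinations share nearly the same mean score, the simplest of them (the smallest `max_depth`) is usually a reasonable choice.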


In [None]:
# DERIVE MODEL FROM TRAINING DATA USING BEST HYPERPARAMETER VALUES (TRAIN MODEL/FIT MODEL)

model.set_params(**grid_search.best_params_)
# List all selected hyperparameters
print(model.get_params(deep=True))

model.fit(X_tr,y_tr)
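
As an alternative to setting the parameters and refitting by hand: with the default `refit=True`, `GridSearchCV` already refits the best model on the complete training set and exposes it as `best_estimator_`. A minimal sketch on the iris data as a stand-in (the names `demo_search` and `best_model` and the reduced grid are invented for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)   # stand-in data, not the penguins
Xd_tr, Xd_tst, yd_tr, yd_tst = train_test_split(X_demo, y_demo, random_state=0, train_size=0.8)

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           {'max_depth': [3, 5]}, cv=3)
demo_search.fit(Xd_tr, yd_tr)

# Already fitted on all of Xd_tr with the best hyperparameter values
best_model = demo_search.best_estimator_
print(best_model.get_params()['max_depth'])
print(f'Test accuracy : {best_model.score(Xd_tst, yd_tst):.3f}')
```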

In [None]:
# DISPLAY MODEL (MODEL SPECIFIC)

from sklearn.tree import plot_tree
plot_tree(model)
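
Without feature names, the plotted tree refers to the predictors by position only. A readable alternative is a text dump of the split rules with `export_text`. A minimal sketch on the iris data as a stand-in (the names `demo_tree` and `rules` are invented for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()   # stand-in data, not the penguins
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the split rules with readable feature names instead of column positions
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

`plot_tree` accepts `feature_names` as well (plus `class_names` and `filled=True`), so the graphical version can be labeled in the same way.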

In [None]:
# VALIDATE MODEL USING TEST DATA

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score, precision_recall_fscore_support, classification_report
import matplotlib.pyplot as plt

# Predict target feature for the test data
y_tst_pred = pd.Series(model.predict(X_tst), name='y_tst_pred')

# Flag the test observations where the predicted value differs from the real value
err = pd.Series(y_tst_pred.reset_index(drop=True)!=y_tst.reset_index(drop=True), name='err').astype(int)
# Show real value, predicted value and error flag side by side
display(pd.concat([y_tst.reset_index(drop=True), y_tst_pred.reset_index(drop=True), err], axis=1))

In [None]:
# Confusion matrix
# Display as text (console output)
class_labels = sorted(list(pd.concat([y_tst,y_tst_pred], axis=0).unique()))
# Alternative : model.classes_
cm = confusion_matrix(y_true = y_tst, y_pred = y_tst_pred) 
print('Predicted label')
print(class_labels)
print(cm)
# Display as heatmap (nicer output in Jupyter)
disp = sns.heatmap(cm, square=True, annot=True, cbar=True, cmap='Greys', xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel('Predicted label')
plt.ylabel('True label')
disp.xaxis.tick_top()                # Put x-axis tickers on top
disp.xaxis.set_label_position('top') # Put x-axis label on top

In [None]:
# Metrics
acc = accuracy_score(y_true=y_tst, y_pred=y_tst_pred)
prec = precision_score(y_true=y_tst, y_pred=y_tst_pred, average='weighted')
rec = recall_score(y_true=y_tst, y_pred=y_tst_pred, average='weighted')
f1 = f1_score(y_true=y_tst, y_pred=y_tst_pred, average='weighted')
# Mind this is a multiclass classification problem, so precision, recall and F1 
# are calculated by class and averaged.
print(f'ACC : {acc:.3f} - PREC : {prec:.3f} - REC : {rec:.3f} - F1 : {f1:.3f}')
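
The choice of `average` matters when the classes are unbalanced. With `'weighted'`, every class counts in proportion to its number of observations, which makes weighted recall identical to accuracy. With `'macro'`, every class counts equally. A minimal sketch with made-up labels (the lists `y_true_demo` and `y_pred_demo` are invented for illustration):

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: class 'a' is frequent, class 'b' is rare
y_true_demo = ['a', 'a', 'a', 'a', 'b', 'b']
y_pred_demo = ['a', 'a', 'a', 'a', 'a', 'b']

# Recall per class: 'a' = 4/4 = 1.0, 'b' = 1/2 = 0.5
rec_macro = recall_score(y_true_demo, y_pred_demo, average='macro')        # (1.0 + 0.5) / 2
rec_weighted = recall_score(y_true_demo, y_pred_demo, average='weighted')  # (4*1.0 + 2*0.5) / 6
acc_demo = accuracy_score(y_true_demo, y_pred_demo)                        # 5 correct out of 6

print(f'macro : {rec_macro:.3f} - weighted : {rec_weighted:.3f} - accuracy : {acc_demo:.3f}')
```

The macro average is lower here because the poor recall on the rare class is not diluted by the frequent class.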

In [None]:
# The easiest way to get results by class is to use precision_recall_fscore_support
class_labels = sorted(list(pd.concat([y_tst,y_tst_pred], axis=0).unique()))
# Alternative : model.classes_
# Display precision/recall/fscore/support table as text (console output)
print(class_labels)
display(precision_recall_fscore_support(y_true=y_tst, y_pred=y_tst_pred))
# Display precision/recall/fscore/support as pandas dataframe (nicer output in Jupyter)
display(pd.DataFrame(precision_recall_fscore_support(y_true=y_tst, y_pred=y_tst_pred), index=['prec','rec','fscore','sup'], columns=class_labels))

In [None]:
# Or use classification_report
print(classification_report(y_true=y_tst, y_pred=y_tst_pred, target_names=class_labels))

### 2. PREDICT PENGUIN SPECIES WITH HYPERPARAMETER TUNING USING CROSS-VALIDATION (MANUALLY)

If everything went well, you used specific scikit-learn features for the cross validation and grid search (GridSearchCV) that do all the work (as seen in the lecture). 

In this exercise, we want you to program the hyperparameter tuning with cross-validation yourself, without using the cross-validation and grid search functions of scikit-learn (so do not use functions like GridSearchCV).

Use the same data (target variable and predictors) as in the previous exercise. Find the best model using 3-fold cross-validation. To keep it simple, limit the hyperparameters to be checked to split criterion equal to 'entropy' and maximum tree depth equal to 3, 5, 8 and 10.

The only scikit-learn functions you can use are train_test_split, the functions to derive, plot and apply a decision tree (DecisionTreeClassifier and its methods fit and predict, plus plot_tree) and the functions for the calculation of validation metrics (confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score, precision_recall_fscore_support, classification_report).

So be sure to understand the procedure for hyperparameter tuning with cross-validation (how is the data split, which iterations are needed, how are decisions on the best model taken, ...) and develop a Python program accordingly.

In [None]:
# We do not publish a solution here because it is essential that you develop this yourself instead of just looking at the solution. You can check whether your solution is OK by comparing its results with those of the previous exercise.