# **Notebook 4**
## **Modelling and Tuning**


### Introduction
This notebook marks the official transition to the core Machine Learning phase. Our primary objective is to engage in rigorous model evaluation and refinement to identify the best classification algorithm for predicting the binary quality of the Pastéis de Nata.

This involves a strict, comparative analysis using the clean, anti-leakage data partitions (Train and Validation) prepared in Notebook 3.   
We will follow these steps:
- **Establish Baseline Performance:** We train a diverse portfolio of models with default settings to see how each performs before any tuning.
- **Diagnose Overfitting:** By comparing the same metrics on the Training and Validation sets, we measure how well each model generalizes; a large gap between the two indicates **overfitting**.
- **Systematic Optimization:** We select the most promising models and tune their hyperparameters with a cross-validated search (**GridSearchCV** or **RandomizedSearchCV**) combined with the **Stratified K-Fold Cross-Validation (SKF)** scheme prepared in the previous step.

In [1]:
import pandas as pd
import numpy as np
import pickle, os

In [2]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

##### **4.1 Load transformed data and partitions**

Loads the master dictionary saved in Notebook 3, which contains all pre-processed and standardized data partitions (`X_train`, `X_val`, `X_test`) and their corresponding target variables (`y_train`, `y_val`, `y_test`).

In [3]:
import pickle

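# Load train/val/test split data from notebook3
# (a forward slash works on Windows, Linux and macOS; the raw string with a doubled backslash did not)
with open('Nata_Files/train_test_split_fixed.pkl', 'rb') as f: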
    notebook3_data = pickle.load(f)


X_train = notebook3_data['X_train']
X_val = notebook3_data['X_val']
X_test = notebook3_data['X_test']
y_train = notebook3_data['y_train']
y_val = notebook3_data['y_val']
y_test = notebook3_data['y_test']
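A quick sanity check after loading can catch a stale or incomplete pickle early. The sketch below is only an illustration: `check_partitions` is a hypothetical helper, not part of the notebook, and it runs on a toy dictionary with random values rather than on `notebook3_data`.

```python
import numpy as np

# Keys the notebook reads from the pickled dictionary.
REQUIRED_KEYS = ["X_train", "X_val", "X_test", "y_train", "y_val", "y_test"]


def check_partitions(data):
    """Hypothetical helper: verify that all keys exist and X/y row counts agree."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"Missing partitions: {missing}")

    sizes = {}
    for split in ("train", "val", "test"):
        n_rows = len(data[f"X_{split}"])
        n_labels = len(data[f"y_{split}"])
        if n_rows != n_labels:
            raise ValueError(f"{split}: {n_rows} rows but {n_labels} labels")
        sizes[split] = n_rows
    return sizes


# Toy stand-in for notebook3_data (random values, for illustration only).
rng = np.random.default_rng(0)
toy_data = {
    "X_train": rng.normal(size=(6, 3)), "y_train": rng.integers(0, 2, size=6),
    "X_val": rng.normal(size=(2, 3)), "y_val": rng.integers(0, 2, size=2),
    "X_test": rng.normal(size=(2, 3)), "y_test": rng.integers(0, 2, size=2),
}
sizes = check_partitions(toy_data)
print(sizes)
```

In the notebook itself the call would presumably be `check_partitions(notebook3_data)`.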


In [14]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV




Every classification model below goes through the same three operations, which keeps the evaluation code uniform: training the model (`fit`), extracting continuous scores (`predict_proba`), and obtaining the final binary prediction (`predict`).
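One possible way to centralize the fit / `predict_proba` / `predict` pattern is a set of three thin wrappers. This is a minimal sketch: the names `train_model`, `get_scores` and `get_labels` are hypothetical, and the data is synthetic (`make_classification`), not the Nata partitions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_model(model, X, y):
    """Fit the estimator and return it."""
    return model.fit(X, y)


def get_scores(model, X):
    """Return the predicted probability of the positive class."""
    return model.predict_proba(X)[:, 1]


def get_labels(model, X):
    """Return the hard class predictions."""
    return model.predict(X)


# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = train_model(LogisticRegression(max_iter=1000), X_tr, y_tr)
val_scores = get_scores(model, X_va)
val_labels = get_labels(model, X_va)
print(val_scores[:3], val_labels[:3])
```

Because every scikit-learn style classifier exposes the same three methods, the wrappers work unchanged for any of the five models trained in section 4.2.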

#### **4.2 Model Selection**

In [10]:
logr = LogisticRegression(random_state=42)
logr.fit(X_train, y_train)
logr_proba = logr.predict_proba(X_val)[:,1]
logr_pred = logr.predict(X_val) 
logr_proba_tr = logr.predict_proba(X_train)[:,1]
logr_pred_tr = logr.predict(X_train)

dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
dtc_proba = dtc.predict_proba(X_val)[:,1]
dtc_pred = dtc.predict(X_val)
dtc_proba_tr = dtc.predict_proba(X_train)[:,1]
dtc_pred_tr = dtc.predict(X_train)


rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_proba = rf.predict_proba(X_val)[:,1]
rf_pred = rf.predict(X_val)
rf_proba_tr = rf.predict_proba(X_train)[:,1]
rf_pred_tr = rf.predict(X_train)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_proba = knn.predict_proba(X_val)[:,1]
knn_pred = knn.predict(X_val)
knn_proba_tr = knn.predict_proba(X_train)[:,1]
knn_pred_tr = knn.predict(X_train)

lgb = LGBMClassifier(random_state=42)
lgb.fit(X_train, y_train)
lgb_proba = lgb.predict_proba(X_val)[:,1]
lgb_pred = lgb.predict(X_val)
lgb_proba_tr = lgb.predict_proba(X_train)[:,1]
lgb_pred_tr = lgb.predict(X_train)

[LightGBM] [Info] Number of positive: 1981, number of negative: 1138
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001105 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1722
[LightGBM] [Info] Number of data points in the train set: 3119, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.635139 -> initscore=0.554329
[LightGBM] [Info] Start training from score 0.554329


#### **4.3 Define the metrics function**

In [11]:
def get_metrics(y_val, y_proba, y_pred, model, dataset):
    return {
        "Model" : model,
        "Set" : dataset,
        "AUC": roc_auc_score(y_val, y_proba),
        "Accuracy": accuracy_score(y_val, y_pred),
    }

**Inputs:** the true target values (`y_val`; despite its name, this parameter also receives `y_train` when scoring the training set), the predicted probabilities (`y_proba`), the final predicted classes (`y_pred`), the model name, and the dataset name ("Train" or "Validation").  
**Output:** a dictionary containing the **AUC** (area under the ROC curve) and **Accuracy** metrics. AUC is computed from the probability scores (`y_proba`), while Accuracy is computed from the binary class predictions (`y_pred`).
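The notebook imports `precision_score`, `recall_score` and `f1_score` but the function above does not use them. A hedged sketch of how the same dictionary could be extended follows; `get_metrics_extended` is a hypothetical name, and the four-sample input is a hand-made toy example, not notebook data.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def get_metrics_extended(y_true, y_proba, y_pred, model, dataset):
    """Hypothetical variant of get_metrics with three extra metrics."""
    return {
        "Model": model,
        "Set": dataset,
        "AUC": roc_auc_score(y_true, y_proba),      # needs continuous scores
        "Accuracy": accuracy_score(y_true, y_pred),  # needs hard labels
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }


# Toy example: one positive sample is missed by the hard predictions.
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.35, 0.8]
y_pred = [0, 0, 0, 1]

metrics = get_metrics_extended(y_true, y_proba, y_pred, "Toy model", "Validation")
print(metrics)
```

With the positive class being the majority here (about 64% of the training rows according to the LightGBM log), precision and recall help show whether accuracy is inflated by always predicting the majority class.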

#### **4.4 Model Evaluation and Metrics Collection**

This crucial step executes the evaluation function (`get_metrics`) for all five baseline models across both the Training and Validation sets.

In [12]:
models_metrics = []

models_metrics.append(get_metrics(y_train, logr_proba_tr, logr_pred_tr, "Logistic Regression", "Train"))
models_metrics.append(get_metrics(y_train, dtc_proba_tr, dtc_pred_tr, "DTClassifier", "Train"))
models_metrics.append(get_metrics(y_train, rf_proba_tr, rf_pred_tr, "Random Forest", "Train"))
models_metrics.append(get_metrics(y_train, knn_proba_tr, knn_pred_tr, "KNClassifier", "Train"))
models_metrics.append(get_metrics(y_train, lgb_proba_tr, lgb_pred_tr, "LightGBM", "Train"))

models_metrics.append(get_metrics(y_val, logr_proba, logr_pred, "Logistic Regression", "Validation"))
models_metrics.append(get_metrics(y_val, dtc_proba, dtc_pred, "DTClassifier", "Validation"))
models_metrics.append(get_metrics(y_val, rf_proba, rf_pred, "Random Forest", "Validation"))
models_metrics.append(get_metrics(y_val, knn_proba, knn_pred, "KNClassifier", "Validation"))
models_metrics.append(get_metrics(y_val, lgb_proba, lgb_pred, "LightGBM", "Validation"))

The metrics are first collected on the training set to measure how well each model fits the data it has seen. They are then collected on the validation set to measure how well it generalizes to unseen data.

In [13]:
df_models_metrics = pd.DataFrame(models_metrics)
df_models_metrics = df_models_metrics.pivot_table(index=["Model", "Set"], values=["AUC", "Accuracy"])

df_models_metrics

| Model | Set | AUC | Accuracy |
|---|---|---|---|
| DTClassifier | Train | 1.0 | 1.0 |
| DTClassifier | Validation | 0.665391 | 0.686538 |
| KNClassifier | Train | 0.894342 | 0.819493 |
| KNClassifier | Validation | 0.757484 | 0.701923 |
| LightGBM | Train | 0.997112 | 0.972748 |
| LightGBM | Validation | 0.799442 | 0.742308 |
| Logistic Regression | Train | 0.80007 | 0.741263 |
| Logistic Regression | Validation | 0.790423 | 0.7375 |
| Random Forest | Train | 1.0 | 1.0 |
| Random Forest | Validation | 0.817711 | 0.759615 |
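To make the overfitting diagnosis explicit, the train/validation gap can be computed per model. The sketch below re-enters the AUC values from the table above by hand; in the notebook the same numbers could presumably be taken from `df_models_metrics` instead.

```python
import pandas as pd

# AUC values copied from the results table (Train, Validation).
rows = [
    ("DTClassifier", 1.0, 0.665391),
    ("KNClassifier", 0.894342, 0.757484),
    ("LightGBM", 0.997112, 0.799442),
    ("Logistic Regression", 0.80007, 0.790423),
    ("Random Forest", 1.0, 0.817711),
]
auc = pd.DataFrame(rows, columns=["Model", "Train", "Validation"]).set_index("Model")

# A large positive gap means the model fits the training set much better
# than it generalizes to the validation set.
auc["Gap"] = auc["Train"] - auc["Validation"]
gap = auc["Gap"].sort_values(ascending=False)
print(gap.round(3))
```

The decision tree shows the largest gap and Logistic Regression the smallest, while Random Forest and LightGBM combine a sizeable gap with the highest validation AUC, which suggests they have room to benefit from regularization.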


## Choosing the best model
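- Random Forest and LightGBM achieved the highest validation AUC (0.818 and 0.799), so we use them as the base models of a stacking ensemble, with Logistic Regression, the model with the smallest train/validation gap, as the meta-model.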
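A stacking setup of this kind could be sketched with scikit-learn's `StackingClassifier`. Several things here are assumptions rather than the notebook's actual configuration: the data is synthetic, the fold count and estimator sizes are arbitrary, and `GradientBoostingClassifier` is used as a stand-in for `LGBMClassifier` so that the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data with 16 features, for illustration only.
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        # Stand-in for LGBMClassifier(random_state=42).
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
    # The meta-model is trained on out-of-fold base-model probabilities,
    # which avoids leaking training labels into the second level.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)

stack_proba = stack.predict_proba(X_va)[:, 1]
stack_auc = roc_auc_score(y_va, stack_proba)
print(f"Stacking validation AUC: {stack_auc:.3f}")
```

Swapping the stand-in for the tuned LightGBM model would only require replacing the `"gb"` entry in `estimators`.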

### **LightGBM**

#### Hyperparameter tuning

In [None]:
param_dist = {
    'num_leaves': [15, 31, 63],
    'max_depth': [5, 10, -1],  # -1 means no limit
    'learning_rate': [0.1, 0.03, 0.01],
    'n_estimators': [100, 300, 500],
    'subsample': [0.8, 1.0],  # only takes effect when subsample_freq > 0 (default is 0)
    'colsample_bytree': [0.8, 1.0],
}

lgb_clf = LGBMClassifier(random_state=42)
# Score with ROC AUC so that best_score_ matches the "Best CV AUC" reported below
rs = RandomizedSearchCV(lgb_clf, param_dist, n_iter=20, cv=3,
                        scoring='roc_auc', random_state=42, n_jobs=-1)
rs.fit(X_train, y_train)

print("Best params:", rs.best_params_)
print("Best CV AUC:", rs.best_score_)
best_lgb = rs.best_estimator_

[LightGBM] [Info] Number of positive: 1981, number of negative: 1138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000536 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1722
[LightGBM] [Info] Number of data points in the train set: 3119, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.635139 -> initscore=0.554329
[LightGBM] [Info] Start training from score 0.554329
Best params: {'subsample': 0.8, 'num_leaves': 15, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
Best CV AUC: 0.825499972981131
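The introduction mentions Stratified K-Fold cross-validation, whereas the search above passes `cv=3`. A hedged sketch of passing an explicit, shuffled `StratifiedKFold` object to `RandomizedSearchCV` follows. It uses synthetic data with roughly the class balance reported in the LightGBM log, a `RandomForestClassifier` as a lightweight stand-in estimator, and an illustrative parameter grid; none of these are the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data with about 64% positives (approximating the training set).
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6,
    weights=[0.36, 0.64], random_state=42,
)

# Illustrative search space, not the notebook's grid.
param_dist = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Each fold keeps the class proportions of the full training set.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=4,
    cv=skf,
    scoring="roc_auc",  # best_score_ is then a mean cross-validated AUC
    random_state=42,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV AUC:", search.best_score_)
```

Passing the splitter object rather than an integer makes the fold assignment explicit and reproducible, and lets the same folds be reused when tuning the other candidate models.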
