# **Notebook 4**
## **Modelling and Tuning**


### Introduction
This notebook marks the official transition to the core Machine Learning phase. Our primary objective is to engage in rigorous model evaluation and refinement to identify the best classification algorithm for predicting the binary quality of the Pastéis de Nata.  

This involves a strict, comparative analysis using the clean, leakage-free data partitions (Train and Validation) prepared in Notebook 3.   
We will follow these steps:  
- **Establish Baseline Performance:** We will train a diverse portfolio of models with default settings to establish a performance baseline and gauge each model's potential.
- **Diagnose Overfitting:** By comparing performance metrics on the Training and Validation sets, we will identify which models generalize and which are **overfitting**.
- **Systematic Optimization:** We will select the most promising models and tune their hyperparameters using **RandomizedSearchCV** and **GridSearchCV**, combined with **Stratified K-Fold Cross-Validation (SKF)**.
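
To illustrate the stratified cross-validation mentioned in the last step, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made-up stand-ins, not the notebook's partitions; the point is only that each held-out fold keeps roughly the same share of positives as the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 100 samples, 64% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 64 + [0] * 36)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_positive_rates = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Share of positive labels in the held-out fold
    fold_positive_rates.append(float(y_demo[val_idx].mean()))

print(fold_positive_rates)
```

Every fold's positive rate should stay close to the overall 0.64, which is what keeps fold-to-fold accuracy comparable on an imbalanced target.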

In [59]:
import pandas as pd
import numpy as np
import pickle, os

In [60]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

#### **4.1 Load transformed data and partitions**

This cell loads the master dictionary saved in Notebook 3, which contains all pre-processed and standardized data partitions (`X_train`, `X_val`, `X_test`) and their corresponding target variables (`y_train`, `y_val`, `y_test`), together with the final prediction set (`X_predict_final`) and its IDs (`id_predict`).

In [61]:
import pickle

# Load train/val/test split data from notebook3
with open(os.path.join('Nata_Files', 'train_test_split_fixed.pkl'), 'rb') as f:
    notebook3_data = pickle.load(f)


X_train = notebook3_data['X_train']
X_val = notebook3_data['X_val']
X_test = notebook3_data['X_test']
y_train = notebook3_data['y_train']
y_val = notebook3_data['y_val']
y_test = notebook3_data['y_test']
X_predict = notebook3_data['X_predict_final']
id_predict = notebook3_data['id_predict']


In [62]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import StackingClassifier



Every classification model below goes through the same three operations: training the model (`fit`), extracting continuous scores (`predict_proba`), and obtaining the final binary prediction (`predict`). Applying them identically to each model keeps the evaluation code uniform.
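
The three operations described above could be wrapped in a single helper. The sketch below is one possible shape, not the notebook's own code: `fit_and_predict` is a hypothetical name, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_predict(model, X_tr, y_tr, X_ev):
    """Fit `model`, then return (positive-class probabilities, class predictions) for X_ev."""
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_ev)[:, 1]  # continuous score for class 1
    pred = model.predict(X_ev)               # final binary prediction
    return proba, pred


# Synthetic stand-in data (assumption), split 75/25 with stratification.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

demo_proba, demo_pred = fit_and_predict(LogisticRegression(), X_tr, y_tr, X_ev)
print(demo_proba[:3], demo_pred[:3])
```

Calling the helper once per model and once per evaluation set would replace the repeated `fit` / `predict_proba` / `predict` lines.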

#### **4.2 Model Selection**

In [63]:
logr = LogisticRegression(random_state=42)
logr.fit(X_train, y_train)
logr_proba = logr.predict_proba(X_val)[:,1]
logr_pred = logr.predict(X_val) 
logr_proba_tr = logr.predict_proba(X_train)[:,1]
logr_pred_tr = logr.predict(X_train)

dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
dtc_proba = dtc.predict_proba(X_val)[:,1]
dtc_pred = dtc.predict(X_val)
dtc_proba_tr = dtc.predict_proba(X_train)[:,1]
dtc_pred_tr = dtc.predict(X_train)


rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_proba = rf.predict_proba(X_val)[:,1]
rf_pred = rf.predict(X_val)
rf_proba_tr = rf.predict_proba(X_train)[:,1]
rf_pred_tr = rf.predict(X_train)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_proba = knn.predict_proba(X_val)[:,1]
knn_pred = knn.predict(X_val)
knn_proba_tr = knn.predict_proba(X_train)[:,1]
knn_pred_tr = knn.predict(X_train)

lgb = LGBMClassifier(random_state=42)
lgb.fit(X_train, y_train)
lgb_proba = lgb.predict_proba(X_val)[:,1]
lgb_pred = lgb.predict(X_val)
lgb_proba_tr = lgb.predict_proba(X_train)[:,1]
lgb_pred_tr = lgb.predict(X_train)

[LightGBM] [Info] Number of positive: 2311, number of negative: 1328
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000784 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2221
[LightGBM] [Info] Number of data points in the train set: 3639, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.635065 -> initscore=0.554006
[LightGBM] [Info] Start training from score 0.554006


#### **4.3 Define the metrics function**

In [64]:
def get_metrics(y_val, y_proba, y_pred, model, dataset):
    return {
        "Model" : model,
        "Set" : dataset,
        "Accuracy": accuracy_score(y_val, y_pred),
    }

Inputs: It takes the true target values (`y_val`), the predicted probabilities (`y_proba`), the final predicted classes (`y_pred`), the model name, and the dataset name ("Train" or "Validation").  
Output: It returns a dictionary containing the model name, the dataset name, and the **Accuracy**, which is computed from the binary class predictions (`y_pred`). The probability scores (`y_proba`) are accepted but not used by the current implementation; they would be needed for a threshold-free metric such as **AUC** (Area Under the Curve).
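
A variant that computes AUC from `y_proba` alongside Accuracy could look like the sketch below. `get_metrics_extended` is a hypothetical name and the four-sample arrays are toy values chosen for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def get_metrics_extended(y_true, y_proba, y_pred, model, dataset):
    """Return a metrics row; AUC uses probabilities, the other metrics use class labels."""
    return {
        "Model": model,
        "Set": dataset,
        "AUC": roc_auc_score(y_true, y_proba),
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }


# Toy values (assumption) to show the call signature.
y_true_demo = np.array([0, 0, 1, 1])
y_proba_demo = np.array([0.1, 0.4, 0.35, 0.8])
y_pred_demo = (y_proba_demo >= 0.5).astype(int)

demo_row = get_metrics_extended(y_true_demo, y_proba_demo, y_pred_demo, "Demo", "Validation")
print(demo_row)
```

If this variant were adopted, the `pivot_table` call further down would also need the extra column names in `values`.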

#### **4.4 Model Evaluation and Metrics Collection**

This crucial step executes the evaluation function (`get_metrics`) for all five baseline models across both the Training and Validation sets.

In [65]:
models_metrics = []

models_metrics.append(get_metrics(y_train, logr_proba_tr, logr_pred_tr, "Logistic Regression", "Train"))
models_metrics.append(get_metrics(y_train, dtc_proba_tr, dtc_pred_tr, "DTClassifier", "Train"))
models_metrics.append(get_metrics(y_train, rf_proba_tr, rf_pred_tr, "Random Forest", "Train"))
models_metrics.append(get_metrics(y_train, knn_proba_tr, knn_pred_tr, "KNClassifier", "Train"))
models_metrics.append(get_metrics(y_train, lgb_proba_tr, lgb_pred_tr, "LightGBM", "Train"))

models_metrics.append(get_metrics(y_val, logr_proba, logr_pred, "Logistic Regression", "Validation"))
models_metrics.append(get_metrics(y_val, dtc_proba, dtc_pred, "DTClassifier", "Validation"))
models_metrics.append(get_metrics(y_val, rf_proba, rf_pred, "Random Forest", "Validation"))
models_metrics.append(get_metrics(y_val, knn_proba, knn_pred, "KNClassifier", "Validation"))
models_metrics.append(get_metrics(y_val, lgb_proba, lgb_pred, "LightGBM", "Validation"))

The metrics are first collected on the training set to measure how well each model fits the data it has seen. They are then collected on the validation set to measure how well each model generalizes.

In [66]:
df_models_metrics = pd.DataFrame(models_metrics)
df_models_metrics = df_models_metrics.pivot_table(index=["Model", "Set"], values=["Accuracy"])

df_models_metrics

| Model | Set | Accuracy |
|---|---|---|
| DTClassifier | Train | 1.0 |
| DTClassifier | Validation | 0.689744 |
| KNClassifier | Train | 0.829074 |
| KNClassifier | Validation | 0.755128 |
| LightGBM | Train | 0.956032 |
| LightGBM | Validation | 0.780769 |
| Logistic Regression | Train | 0.75213 |
| Logistic Regression | Validation | 0.732051 |
| Random Forest | Train | 1.0 |
| Random Forest | Validation | 0.794872 |
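
One way to read the table above is to compute the gap between Train and Validation accuracy for each model. The sketch below re-enters the reported values by hand; the column name `Gap` is just an illustrative choice.

```python
import pandas as pd

# Accuracy values copied from the table above.
scores = pd.DataFrame(
    {
        "Train": [1.0, 0.829074, 0.956032, 0.75213, 1.0],
        "Validation": [0.689744, 0.755128, 0.780769, 0.732051, 0.794872],
    },
    index=["DTClassifier", "KNClassifier", "LightGBM", "Logistic Regression", "Random Forest"],
)

# A large positive gap suggests the model memorizes the training data.
scores["Gap"] = scores["Train"] - scores["Validation"]
gap_ranking = scores.sort_values("Gap", ascending=False)
print(gap_ranking.round(3))
```

The Decision Tree shows the largest gap (about 0.31) and Logistic Regression the smallest (about 0.02), while Random Forest has the best validation accuracy despite fitting the training set perfectly.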


## Choosing the best model
- We decided to tune Random Forest, LightGBM, and KNN as the base learners of a stacking ensemble, with Logistic Regression as the meta-model.

In [67]:
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


### **LightGBM**

In [68]:
lgb_clf = LGBMClassifier(random_state=42)


#### Hyperparameter tuning

In [69]:

param_space = {
    'num_leaves': [10, 15, 20],
    'max_depth': [5, 7, 10], 
    'learning_rate': [0.1, 0.03, 0.01],
    'n_estimators': [200, 300],
    'min_child_samples': [20, 30, 40],
    'reg_lambda': [0.1, 1, 10],
}
lgb_clf_rs = RandomizedSearchCV(lgb_clf, param_space, n_iter=20, cv= cv_strategy, scoring='accuracy', random_state=42, n_jobs=-1)
lgb_clf_rs.fit(X_train, y_train)

print("Best params:", lgb_clf_rs.best_params_)
best_lgb = lgb_clf_rs.best_estimator_

[LightGBM] [Info] Number of positive: 2311, number of negative: 1328
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000938 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2221
[LightGBM] [Info] Number of data points in the train set: 3639, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.635065 -> initscore=0.554006
[LightGBM] [Info] Start training from score 0.554006
Best params: {'reg_lambda': 1, 'num_leaves': 15, 'n_estimators': 300, 'min_child_samples': 30, 'max_depth': 5, 'learning_rate': 0.03}
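
For context on how much of the search space a randomized search with `n_iter=20` covers, the grid size can be counted with `ParameterGrid`. The sketch below copies the LightGBM search space from the cell above into a separate variable (`lgb_space_demo`, a hypothetical name) so nothing in the notebook is overwritten.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the LightGBM search space used above.
lgb_space_demo = {
    'num_leaves': [10, 15, 20],
    'max_depth': [5, 7, 10],
    'learning_rate': [0.1, 0.03, 0.01],
    'n_estimators': [200, 300],
    'min_child_samples': [20, 30, 40],
    'reg_lambda': [0.1, 1, 10],
}

n_candidates = len(ParameterGrid(lgb_space_demo))  # full grid size
n_sampled = 20                                      # n_iter used in the search
coverage = n_sampled / n_candidates

print(f"{n_sampled} of {n_candidates} combinations sampled ({coverage:.1%})")
```

The full grid has 486 combinations, so the search evaluates roughly 4% of them; the reported best parameters are the best among those sampled, not necessarily the best in the grid.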


In [70]:
best_lgb_proba = best_lgb.predict_proba(X_val)[:,1]
best_lgb_pred = best_lgb.predict(X_val)

best_lgb_pred_tr = best_lgb.predict(X_train)


print(f"Accuracy: {accuracy_score(y_val, best_lgb_pred):.3f}")
print(f"Accuracy on train: {accuracy_score(y_train, best_lgb_pred_tr):.3f}")

Accuracy: 0.764
Accuracy on train: 0.851


### **Random Forest**

#### Hyperparameter tuning

In [71]:
rf_clf = RandomForestClassifier(random_state=42)

In [72]:
param_range = {
    'n_estimators': [100, 200, 300],
    'max_depth': [7, 10,12], 
    'criterion' : ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [10, 15, 20],
    'min_samples_leaf': [5, 10, 15],
    'bootstrap': [True]
}

rf_clf_rs = RandomizedSearchCV(rf_clf, param_range, n_iter=20, cv=cv_strategy, scoring='accuracy', random_state=42, n_jobs=-1)
rf_clf_rs.fit(X_train, y_train)

print("Best params:", rf_clf_rs.best_params_)
best_rf = rf_clf_rs.best_estimator_

Best params: {'n_estimators': 300, 'min_samples_split': 15, 'min_samples_leaf': 5, 'max_features': 'log2', 'max_depth': 12, 'criterion': 'entropy', 'bootstrap': True}


In [73]:
best_rf_proba = best_rf.predict_proba(X_val)[:,1]
best_rf_pred = best_rf.predict(X_val)

best_rf_pred_tr = best_rf.predict(X_train)

print(f"Accuracy: {accuracy_score(y_val, best_rf_pred):.3f}")

print(f"Accuracy on train: {accuracy_score(y_train, best_rf_pred_tr):.3f}")

Accuracy: 0.782
Accuracy on train: 0.889


### **KNClassifier**

In [74]:
knn_clf = KNeighborsClassifier()

#### Hyperparameter tuning

In [75]:
param_set = {
    'n_neighbors': [15, 20, 25, 30],
    'weights': ['uniform'],
    'leaf_size': [20, 30, 40],
    'metric': ['euclidean', 'manhattan']
}

knn_clf_rs = GridSearchCV(knn_clf, param_set, cv=cv_strategy, scoring='accuracy', n_jobs=-1)
knn_clf_rs.fit(X_train, y_train)

print("Best params:", knn_clf_rs.best_params_)
best_knn = knn_clf_rs.best_estimator_

Best params: {'leaf_size': 20, 'metric': 'manhattan', 'n_neighbors': 30, 'weights': 'uniform'}


In [None]:
"""
Optional: plot the mean CV accuracy of each grid candidate.

import matplotlib.pyplot as plt

results = knn_clf_rs.cv_results_['mean_test_score']
plt.plot(range(1, len(results) + 1), results, marker='o')
plt.xlabel('Grid candidate')
plt.ylabel('CV Accuracy')
plt.show()
"""

In [76]:
best_knn_proba = best_knn.predict_proba(X_val)[:,1]
best_knn_pred = best_knn.predict(X_val)

best_knn_pred_tr = best_knn.predict(X_train)

print(f"Accuracy: {accuracy_score(y_val, best_knn_pred):.3f}")

print(f"Accuracy on train: {accuracy_score(y_train, best_knn_pred_tr):.3f}")

Accuracy: 0.776
Accuracy on train: 0.771


### **Logistic Regression**

In [77]:
logr_clf = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)

#### Hyperparameter tuning

In [78]:
param_grid = {
    'penalty': ['l1', 'l2'], #L1 = lasso L2 = ridge
    'C': [0.01, 0.1, 1, 10],
}

logr_clf_rs = RandomizedSearchCV(logr_clf, param_grid, cv=cv_strategy, scoring='accuracy', n_jobs=-1)
logr_clf_rs.fit(X_train, y_train)

print("Best params:", logr_clf_rs.best_params_)
best_logr = logr_clf_rs.best_estimator_



Best params: {'penalty': 'l1', 'C': 1}


In [79]:
best_logr_proba = best_logr.predict_proba(X_val)[:,1]
best_logr_pred = best_logr.predict(X_val)

print(f"Accuracy: {accuracy_score(y_val, best_logr_pred):.3f}")

Accuracy: 0.731


### **Stacking**

In [80]:
estimators = [
    ('rf', best_rf),
    ('lgb', best_lgb),
    ('knn', best_knn),
]

In [81]:
print("--- Individual Model Performance ---")
for name, model in estimators:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: {acc:.3f}")

--- Individual Model Performance ---
rf: 0.782
[LightGBM] [Info] Number of positive: 2311, number of negative: 1328
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001437 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2221
[LightGBM] [Info] Number of data points in the train set: 3639, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.635065 -> initscore=0.554006
[LightGBM] [Info] Start training from score 0.554006
lgb: 0.764
knn: 0.776


In [85]:
stc = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=cv_strategy, n_jobs=-1).fit(X_train, y_train)

In [86]:
stc_proba = stc.predict_proba(X_val)[:,1]
stc_pred = stc.predict(X_val)
stc_pred_tr = stc.predict(X_train)

print(f"Accuracy: {accuracy_score(y_val, stc_pred):.3f}")

print(f"Accuracy on train: {accuracy_score(y_train, stc_pred_tr):.3f}")

Accuracy: 0.783
Accuracy on train: 0.865
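
Conceptually, `StackingClassifier` trains its final estimator on out-of-fold predictions from the base models, so the meta-model never sees predictions made on data a base model was fitted on. The sketch below reproduces that idea by hand with `cross_val_predict`. It is a simplification under stated assumptions: the data is synthetic, and the two small base models are stand-ins rather than the tuned `best_rf`, `best_lgb`, and `best_knn`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    KNeighborsClassifier(n_neighbors=15),
]

# One column per base model: out-of-fold probability of the positive class.
meta_features = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model learns how to weight the base models' probabilities.
meta_model = LogisticRegression().fit(meta_features, y_demo)
print(meta_features.shape, meta_model.coef_)
```

The fitted coefficients give a rough sense of how much weight the meta-model places on each base learner.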


In [87]:
stc_proba_test = stc.predict_proba(X_test)[:,1]
stc_pred_test = stc.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, stc_pred_test):.3f}")


Accuracy: 0.778


In [88]:
# 1. Concatenate Train, Validation, and Test sets
X_full = pd.concat([X_train, X_val, X_test], axis=0)
y_full = pd.concat([y_train, y_val, y_test], axis=0)

# 2. Re-train your BEST model on the full data
# (Assuming 'best_rf' or 'stc' was your best estimator)
final_model = stc  # or best_rf, best_lgb, etc.
final_model.fit(X_full, y_full)

# 3. Predict on the Kaggle data
final_predictions = final_model.predict(X_predict)

# 4. Save
submission = pd.DataFrame({'id': id_predict, 'Quality_class': final_predictions})
submission['Quality_class'] = submission['Quality_class'].map({0: 'KO', 1: 'OK'})
submission.to_csv('submission.csv', index=False)