# **Notebook 4**
## **Modelling and Tuning**


### Introduction
This notebook marks the official transition to the core Machine Learning phase. Our primary objective is to engage in rigorous model evaluation and refinement to identify the best classification algorithm for predicting the binary quality of the Pastéis de Nata.  

This involves a strict, comparative analysis using the clean, anti-leakage data partitions (Train and Validation) prepared in Notebook 3.   
We will follow this steps:  
- **Establish Baseline Performance:** We will train a diverse portfolio of models using default settings to establish a baseline performance and potential.
- **Diagnose Overfitting:** By comparing performance metrics across the Training and Validation sets, we will precisely diagnose model generalization ability versus **overfitting**.
- **Systematic Optimization:** We will select the most promising models and optimize their complexity and performance using **GridSearchCV**  combined with the robust **Stratified K-Fold Cross-Validation (SKF)**  loaded from the previous step.

In [23]:
import pandas as pd
import numpy as np
import pickle, os

import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

In [44]:
# Load train/val/test split data from notebook3
with open(r'Nata_Files\\notebook3newwwww.pkl', 'rb') as f:
    notebook3newww_data = pickle.load(f)


X_train_final = notebook3newww_data['X_train_final']
X_val_final = notebook3newww_data['X_val_final']
X_test_final = notebook3newww_data['X_test_final']
y_train = notebook3newww_data['y_train']
y_val = notebook3newww_data['y_val']
y_test = notebook3newww_data['y_test']
X_predict_final = notebook3newww_data['X_predict_final']
id_predict = notebook3newww_data['id_predict']

In [3]:
models = {
    "Logistic Regression": LogisticRegression(), 
    "Decision Tree":       DecisionTreeClassifier(),
    "Random Forest":       RandomForestClassifier(),
    "XGBoost":             XGBClassifier(eval_metric='logloss'),
    "LightGBM":            LGBMClassifier(verbose=-1),
    "MLP Classifier":      MLPClassifier(max_iter=1000, early_stopping=True),
    "Gradient Boosting":   GradientBoostingClassifier(),
    "KNClassifier": KNeighborsClassifier()
}

In [4]:
results_list = []

for name, model in models.items():
    # 1. Fit on Train
    model.fit(X_train_final, y_train)
    
    # 2. Predict on Train AND Validation
    train_preds = model.predict(X_train_final)
    val_preds = model.predict(X_val_final)
    
    # 3. Calculate Scores
    train_acc = accuracy_score(y_train, train_preds)
    val_acc = accuracy_score(y_val, val_preds)
    
    
    # Print formatted row
    print(f"{name:<20} | {train_acc:.4f}     | {val_acc:.4f}")
    
    # Store for later analysis if needed
    results_list.append({
        "Model": name,
        "Train Acc": train_acc,
        "Val Acc": val_acc,
    })

# Optional: View as a sorted DataFrame
print("\n--- Sorted by Validation Accuracy ---")
df_results = pd.DataFrame(results_list).sort_values(by="Val Acc", ascending=False)
display(df_results)

Logistic Regression  | 0.7461     | 0.7256
Decision Tree        | 1.0000     | 0.7051
Random Forest        | 1.0000     | 0.7859
XGBoost              | 0.9989     | 0.7526
LightGBM             | 0.9472     | 0.7615
MLP Classifier       | 0.7571     | 0.7359
Gradient Boosting    | 0.8104     | 0.7821
KNClassifier         | 0.8214     | 0.7346

--- Sorted by Validation Accuracy ---


Unnamed: 0,Model,Train Acc,Val Acc
2,Random Forest,1.0,0.785897
6,Gradient Boosting,0.810387,0.782051
4,LightGBM,0.947238,0.761538
3,XGBoost,0.998901,0.752564
5,MLP Classifier,0.757076,0.735897
7,KNClassifier,0.821379,0.734615
0,Logistic Regression,0.746084,0.725641
1,Decision Tree,1.0,0.705128


In [5]:
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


In [47]:
parameters_Gradient_Boosting_Classifier= {'loss': ['log_loss', 'exponential'], 'learning_rate': [0.1, 0.05, 0.01], 'n_estimators': [100, 150, 200], 'max_depth': [3, 6, 9], 'max_leaf_nodes': [3, 6, 9]}


Best_Gradient_Boosting_Classifier = RandomizedSearchCV(estimator = GradientBoostingClassifier(), param_distributions = parameters_Gradient_Boosting_Classifier, scoring = 'accuracy', verbose = True).fit(X_train_final, y_train)
display(Best_Gradient_Boosting_Classifier.best_params_)
display(Best_Gradient_Boosting_Classifier.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'n_estimators': 100,
 'max_leaf_nodes': 9,
 'max_depth': 6,
 'loss': 'log_loss',
 'learning_rate': 0.1}

np.float64(0.7639509046661729)

In [7]:
parameters_Decision_Tree_Classifier =  {'criterion':['log_loss', 'gini', 'entropy'],
                                        'max_depth':np.arange(1,21).tolist()[0::2],
                                        'min_samples_split':np.arange(2,11).tolist()[0::2],
                                        'max_leaf_nodes':np.arange(3,26).tolist()[0::2]}


Best_Decision_Tree_Classifier = RandomizedSearchCV(estimator = DecisionTreeClassifier(), param_distributions = parameters_Decision_Tree_Classifier, scoring = 'accuracy', verbose = True).fit(X_train_final, y_train)
display(Best_Decision_Tree_Classifier.best_params_)
display(Best_Decision_Tree_Classifier.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'min_samples_split': 6,
 'max_leaf_nodes': 3,
 'max_depth': 5,
 'criterion': 'gini'}

np.float64(0.7337171425548317)

In [8]:
parameters_Random_Forest_Classifier = {'n_estimators': [25, 50, 100, 150], 
                                       'max_features': ['sqrt', 'log2', None], 
                                       'max_depth': [3, 6, 9], 
                                       'max_leaf_nodes': [3, 6, 9], }


Best_Random_Forest_Classifier = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = parameters_Random_Forest_Classifier, scoring = 'accuracy', cv = cv_strategy, verbose = True).fit(X_train_final, y_train)
display(Best_Random_Forest_Classifier.best_params_)
display(Best_Random_Forest_Classifier.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'n_estimators': 25,
 'max_leaf_nodes': 9,
 'max_features': 'log2',
 'max_depth': 9}

np.float64(0.7353590700908446)

In [9]:
parameters_XGB_Classifier = {'n_estimators': [25, 50, 100, 150], 'max_depth': [3, 6, 9], 'learning_rate': [0.1, 0.05, 0.01], 'gamma': [0, 0.1, 0.2, 0.3], 'colsample_bytree': [0.3, 0.4, 0.5, 0.7]}


Best_XGB_Classifier = RandomizedSearchCV(estimator = XGBClassifier(), param_distributions = parameters_XGB_Classifier, scoring = 'accuracy', cv = cv_strategy, verbose = True).fit(X_train_final, y_train)
display(Best_XGB_Classifier.best_params_)
display(Best_XGB_Classifier.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'n_estimators': 150,
 'max_depth': 6,
 'learning_rate': 0.05,
 'gamma': 0.1,
 'colsample_bytree': 0.7}

np.float64(0.7628414226763607)

In [10]:
parameters_LightGBM_Classifier = {
    'num_leaves': [10, 15, 20],
    'max_depth': [5, 7, 10], 
    'learning_rate': [0.1, 0.03, 0.01],
    'n_estimators': [200, 300],
    'min_child_samples': [20, 30, 40],
    'reg_lambda': [0.1, 1, 10],
}

Best_LightGBM_Classifier = RandomizedSearchCV(estimator = LGBMClassifier(), param_distributions = parameters_LightGBM_Classifier, scoring = 'accuracy', cv = cv_strategy, verbose = True).fit(X_train_final, y_train)
display(Best_LightGBM_Classifier.best_params_)
display(Best_LightGBM_Classifier.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'reg_lambda': 0.1,
 'num_leaves': 10,
 'n_estimators': 300,
 'min_child_samples': 20,
 'max_depth': 5,
 'learning_rate': 0.03}

np.float64(0.7642158048279093)

In [14]:
labels = ['Gradient Boosting', 'Random Forest', 'Decision Tree', 'XGBoost', 'LightGBM']
# Prediction results on training data
def predict_and_results_train(list_of_models):

    f1, accuracy, precision, recall = {}, {}, {}, {}

    for model, label in zip(list_of_models, labels):
        predictions_train = model.predict(X_train_final)

        f1[label] = f1_score(y_train, predictions_train)

        accuracy[label] = accuracy_score(y_train, predictions_train)

        precision[label] = precision_score(y_train, predictions_train)

        recall[label] = recall_score(y_train, predictions_train)

        results = pd.DataFrame.from_dict({'F1 Score': f1, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})
    return results

# Prediction results on validation data
def predict_and_results_val(list_of_models):

    f1, accuracy, precision, recall = {}, {}, {}, {}

    for model, label in zip(list_of_models, labels):
        predictions_val = model.predict(X_val_final)

        f1[label] = f1_score(y_val, predictions_val)

        accuracy[label] = accuracy_score(y_val, predictions_val)

        precision[label] = precision_score(y_val, predictions_val)

        recall[label] = recall_score(y_val, predictions_val)

        results = pd.DataFrame.from_dict({'F1 Score': f1, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})
    return results

print('Prediction Results on Training Data')
display(predict_and_results_train([Best_Gradient_Boosting_Classifier, Best_Random_Forest_Classifier, Best_Decision_Tree_Classifier, Best_XGB_Classifier, Best_LightGBM_Classifier]))
print('Prediction Results on Validation Data')
display(predict_and_results_val([Best_Gradient_Boosting_Classifier, Best_Random_Forest_Classifier, Best_Decision_Tree_Classifier, Best_XGB_Classifier, Best_LightGBM_Classifier]))

Prediction Results on Training Data


Unnamed: 0,F1 Score,Accuracy,Precision,Recall
Gradient Boosting,0.878274,0.840341,0.85134,0.906967
Random Forest,0.819582,0.748832,0.753539,0.898312
Decision Tree,0.789394,0.733718,0.793013,0.785807
XGBoost,0.926044,0.904095,0.907392,0.945478
LightGBM,0.869932,0.830723,0.849485,0.891389


Prediction Results on Validation Data


Unnamed: 0,F1 Score,Accuracy,Precision,Recall
Gradient Boosting,0.828986,0.773077,0.795918,0.864919
Random Forest,0.814747,0.742308,0.750424,0.891129
Decision Tree,0.789318,0.726923,0.774757,0.804435
XGBoost,0.825121,0.767949,0.792208,0.860887
LightGBM,0.826255,0.769231,0.792593,0.862903


## Fitting the Model

In [48]:
final_model = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=9, max_depth=6, loss = 'log_loss', learning_rate=0.1)
final_model.fit(X_train_final, y_train)

0,1,2
,loss,'log_loss'
,learning_rate,0.1
,n_estimators,100
,subsample,1.0
,criterion,'friedman_mse'
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_depth,6
,min_impurity_decrease,0.0


In [49]:
# Apply the K Fold and the Repeated K Fold in the model

kf = KFold(n_splits = 10,random_state = 42, shuffle = True)
rkf = RepeatedKFold(n_splits = 5, n_repeats = 3,random_state = 42)

def eval_model_clf(X, y, model):
    y_pred = model.predict(X)
    return accuracy_score(y, y_pred)
def run_model(X, y, model):
    return model.fit(X,y)

def avg_score_clf(method, X, y, model):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        train_model = run_model(X_train, y_train, model)
        value_train = eval_model_clf(X_train, y_train, train_model)
        value_test = eval_model_clf(X_test, y_test, train_model)

        score_train.append(value_train)
        score_test.append(value_test)

    mean_train_score = np.mean(score_train)
    mean_test_score = np.mean(score_test)

    result_df = pd.DataFrame({'Train accuracy': [mean_train_score], 'Test accuracy': [mean_test_score]})
    result_df.index = [f'{str(model)} with {str(method)}']

    return result_df

In [50]:
final_model_train_acc = eval_model_clf(X_train_final, y_train, final_model)
final_model_test_acc = eval_model_clf(X_val_final, y_val, final_model)
final_model_df = pd.DataFrame({'Train accuracy': [final_model_train_acc], 'Test accuracy': [final_model_test_acc]})
final_model_df.index = [f"{str(final_model)} with Simple Data Split"]

final_model_kf_df = avg_score_clf(kf, X_train_final, y_train,final_model)
final_model_rkf_df = avg_score_clf(rkf, X_train_final, y_train,final_model)

df_final_model = pd.concat([final_model_df, final_model_kf_df, final_model_rkf_df])

In [51]:
df_final_model

Unnamed: 0,Train accuracy,Test accuracy
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with Simple Data Split",0.835394,0.769231
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with KFold(n_splits=10, random_state=42, shuffle=True)",0.844249,0.763124
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with RepeatedKFold(n_repeats=3, n_splits=5, random_state=42)",0.851493,0.763668


In [40]:
def predict_and_results_metrics(final_model):
    predictions_train = final_model.predict(X_train_final)
    predictions_validation = final_model.predict(X_val_final)

    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, predictions_train))
    print('Confusion Matrix:\n', confusion_matrix(y_train, predictions_train))

    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, predictions_validation))
    print('Confusion Matrix:\n', confusion_matrix(y_val, predictions_validation))

In [41]:
display(predict_and_results_metrics(final_model))

___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.82      0.73      0.77      1328
           1       0.85      0.91      0.88      2311

    accuracy                           0.85      3639
   macro avg       0.84      0.82      0.83      3639
weighted avg       0.84      0.85      0.84      3639

Confusion Matrix:
 [[ 970  358]
 [ 206 2105]]
___________________________________________________________________________________________________________
                                                VALIDATION                                                 
---------------------------------------------------------------------------------------

None

In [46]:
X_full = pd.concat([X_train_final, X_val_final, X_test_final], axis=0)
y_full = pd.concat([y_train, y_val, y_test], axis=0)


final_model.fit(X_full, y_full)
# 3. Predict on the Kaggle data
final_predictions = final_model.predict(X_predict_final)

# 4. Save
submission = pd.DataFrame({'id': id_predict, 'Quality_class': final_predictions})
submission['Quality_class'] = submission['Quality_class'].map({0: 'KO', 1: 'OK'})
submission.to_csv('submission.csv', index=False)