# **Modelling and tuning**
* [Introdcution](#4.0)
* [4.1 Model Choice](#4.1)
* [4.2 Fitting the Model](#4.2)
* [4.3 Making Predictions](#4.3)


<a id="4.0"></a>     


### Introduction
This notebook marks the official transition to the core Machine Learning phase. Our primary objective is to engage in rigorous model evaluation and refinement to identify the best classification algorithm for predicting the binary quality of the Pastéis de Nata.  

This involves a strict, comparative analysis using the clean, anti-leakage data partitions (Train and Validation) prepared in Notebook 3.   
We will follow this steps:  
- **Establish Baseline Performance:** We will train a diverse portfolio of models using default settings to establish a baseline performance and potential.
- **Diagnose Overfitting:** By comparing performance metrics across the Training and Validation sets, we will precisely diagnose model generalization ability versus **overfitting**.
- **Systematic Optimization:** We will select the most promising models and optimize their complexity and performance using **GridSearchCV**  combined with the robust **Stratified K-Fold Cross-Validation (SKF)**  loaded from the previous step.

In [2]:
"""
Importing the necessary libraries
"""


import pandas as pd
import numpy as np
import pickle, os
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier


In order to load the variables created in notebook 3 we utilized pickle files, so we can transitate the variables from one notebook to another notebook

In [3]:
"""
Loading preprocessed data from notebook3 through a pickle file.

In this part we mostly load the training, validation and test data scaled and ready to be used in models.
Also, the final predict data is laoded   PS: KAGGLE É PARA O NOTEBOOK 9
"""


with open(r'Nata_Files\\notebook3newwwww.pkl', 'rb') as f:
    notebook3newww_data = pickle.load(f)


X_train_final = notebook3newww_data['X_train_final']
X_val_final = notebook3newww_data['X_val_final']
X_test_final = notebook3newww_data['X_test_final']
y_train = notebook3newww_data['y_train']
y_val = notebook3newww_data['y_val']
y_test = notebook3newww_data['y_test']
X_predict_final = notebook3newww_data['X_predict_final']
id_predict = notebook3newww_data['id_predict']

<a id="4.1"></a>     

## **4.1. Model Choice**

Here, we establish the models that are going to be evaluated.

In [4]:
"""
We define a dictionary containing all the models we want to use for classification.
This will help us to iterate through them later on for training and evaluation.
"""


models = {
    "Logistic Regression": LogisticRegression(), 
    "Decision Tree":       DecisionTreeClassifier(),
    "Random Forest":       RandomForestClassifier(),
    "XGBoost":             XGBClassifier(eval_metric='logloss'),
    "LightGBM":            LGBMClassifier(verbose=-1),
    "MLP Classifier":      MLPClassifier(max_iter=1000, early_stopping=True),
    "Gradient Boosting":   GradientBoostingClassifier(),
    "KNClassifier": KNeighborsClassifier()
}

In [5]:
results_list = []

for name, model in models.items():
    # 1. Fit on Train
    model.fit(X_train_final, y_train)
    
    # 2. Predict on Train AND Validation
    train_pred = model.predict(X_train_final)
    val_pred = model.predict(X_val_final)
    
    # 3. Calculate Scores
    train_acc = accuracy_score(y_train, train_pred)
    val_acc = accuracy_score(y_val, val_pred)
    
    
    # Store for later analysis if needed
    results_list.append({
        "Model": name,
        "Train Acc": train_acc,
        "Val Acc": val_acc,
    })

# Optional: View as a sorted DataFrame
print("\n--- Dataframe containing the models sorted by validation accuracy ---")
df_results = pd.DataFrame(results_list).sort_values(by="Val Acc", ascending=False)
display(df_results)


--- Dataframe containing the models sorted by validation accuracy ---


Unnamed: 0,Model,Train Acc,Val Acc
2,Random Forest,1.0,0.796154
4,LightGBM,0.950811,0.775641
6,Gradient Boosting,0.810113,0.775641
3,XGBoost,0.996977,0.75641
7,KNClassifier,0.82715,0.742308
5,MLP Classifier,0.761473,0.737179
0,Logistic Regression,0.746359,0.726923
1,Decision Tree,1.0,0.708974


This dataframe created, contains the accuracy in training and validation of the baseline models using default parameters.

From this dataframe we can observe that some models perform better than other and some models are overfitting significantly!

Models overfitting: 
-  Random Forest
- LightGBM
- XGBoost
- Decision Tree Classifier

Note: This overfitting is expected when we don't specify parameter to combat the overfitting

Besides this, we decided to choose these models to optimize their complexity and performance using **RandomizedSearchCV**:
- Gradient Boosting Classifier
- Decision Tree Classifer (talvez tirar)
- Random Forest Classifier
- XGBoost Classifier
- LightGBM Classifier

PÔR AQUI COMO FUNCIONA A RANDOMIZEDSEARCHCV

In [6]:
"""
PERCEBER PORQUE É QUE FAZEMOS ISTO
"""
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


#### **Hyperparameter tuning for Gradient Boosting Classifier.**

In [None]:
"""
We define a range for some parameters of the model to be used by RandomizedSearchCV, specifying that we want to optimize for accuracy score
"""


parameters_Gradient_Boosting_Classifier= {'loss': ['log_loss', 'exponential'], 
                                          'learning_rate': [0.1, 0.05, 0.01], 
                                          'n_estimators': [100, 150], 
                                          'max_depth': [3, 6, 9], 
                                          'max_leaf_nodes': [3, 6, 9]}


Best_Gradient_Boosting_Classifier = RandomizedSearchCV(estimator = GradientBoostingClassifier(), param_distributions = parameters_Gradient_Boosting_Classifier, scoring = 'accuracy', verbose = True).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_Gradient_Boosting_Classifier.best_params_}")
print(f"Best accuracy score: {Best_Gradient_Boosting_Classifier.best_score_:.3f}")  #PERCEBER SE É PARA TRAINING OU VALIDATION

Fitting 5 folds for each of 24 candidates, totalling 120 fits


KeyboardInterrupt: 

**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- n_estimators = 150
- max_leaf_nodes = 6
- max_depth = 6
- loss = 'exponential'
- learning_rate = 0.05

And we can also see the best accuracy score on validation??, being 0.761.


#### **Hyperparameter tuning for Decision Tree Classifier.**

In [8]:
parameters_Decision_Tree_Classifier =  {'criterion':['log_loss', 'gini', 'entropy'],
                                        'max_depth':np.arange(1,21).tolist()[0::2],
                                        'min_samples_split':np.arange(2,11).tolist()[0::2],
                                        'max_leaf_nodes':np.arange(3,26).tolist()[0::2]}


Best_Decision_Tree_Classifier = RandomizedSearchCV(estimator = DecisionTreeClassifier(), param_distributions = parameters_Decision_Tree_Classifier, scoring = 'accuracy', verbose = True).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_Decision_Tree_Classifier.best_params_}")
print(f"Best accuracy score: {Best_Decision_Tree_Classifier.best_score_:.3f}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best combination of parameters: {'min_samples_split': 6, 'max_leaf_nodes': 5, 'max_depth': 5, 'criterion': 'entropy'}
Best accuracy score: 0.734


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- min_samples_split = 4
- max_leaf_nodes = 5
- max_depth = 13
- criterion = 'entropy'

And we can also see the best accuracy score on validation, being 0.733.

#### **Hyperparameter tuning for Random Forest Classifier.**

In [9]:
parameters_Random_Forest_Classifier = {'n_estimators': [25, 50, 100, 150], 
                                       'max_features': ['sqrt', 'log2', None], 
                                       'max_depth': [3, 6, 9], 
                                       'max_leaf_nodes': [3, 6, 9], }


Best_Random_Forest_Classifier = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = parameters_Random_Forest_Classifier, scoring = 'accuracy', cv = cv_strategy, verbose = True).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_Random_Forest_Classifier.best_params_}")
print(f"Best accuracy score: {Best_Random_Forest_Classifier.best_score_:.3f}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best combination of parameters: {'n_estimators': 100, 'max_leaf_nodes': 9, 'max_features': None, 'max_depth': 6}
Best accuracy score: 0.738


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- n_estimators = 150
- max_leaf_nodes = 9
- max_features = 'None'
- max_depth = 6

And we can also see the best accuracy score on validation, being 0.738.

#### **Hyperparameter tuning for XGBoost Classifier.**

In [10]:
parameters_XGB_Classifier = {'n_estimators': [25, 50, 100, 150], 'max_depth': [3, 6, 9], 'learning_rate': [0.1, 0.05, 0.01], 'gamma': [0, 0.1, 0.2, 0.3], 'colsample_bytree': [0.3, 0.4, 0.5, 0.7]}


Best_XGB_Classifier = RandomizedSearchCV(estimator = XGBClassifier(), param_distributions = parameters_XGB_Classifier, scoring = 'accuracy', cv = cv_strategy, verbose = True).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_XGB_Classifier.best_params_}")
print(f"Best accuracy score: {Best_XGB_Classifier.best_score_:.3f}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best combination of parameters: {'n_estimators': 100, 'max_depth': 9, 'learning_rate': 0.05, 'gamma': 0, 'colsample_bytree': 0.7}
Best accuracy score: 0.767


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- n_estimators = 100
- max_depth = 9
- learning_rate = 0.1
- gamma = 0.3
- colsample_bytree = 0.3

And we can also see the best accuracy score on validation, being 0.768.

#### **Hyperparameter tuning for LightGBM Classifier.**

In [11]:
parameters_LightGBM_Classifier = {
    'num_leaves': [10, 15, 20],
    'max_depth': [5, 7, 10], 
    'learning_rate': [0.1, 0.03, 0.01],
    'n_estimators': [200, 300],
    'min_child_samples': [20, 30, 40],
    'reg_lambda': [0.1, 1, 10],
}

Best_LightGBM_Classifier = RandomizedSearchCV(estimator = LGBMClassifier(), param_distributions = parameters_LightGBM_Classifier, scoring = 'accuracy', cv = cv_strategy, verbose = True).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_LightGBM_Classifier.best_params_}")
print(f"Best accuracy score: {Best_LightGBM_Classifier.best_score_:.3f}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best combination of parameters: {'reg_lambda': 0.1, 'num_leaves': 15, 'n_estimators': 300, 'min_child_samples': 20, 'max_depth': 10, 'learning_rate': 0.03}
Best accuracy score: 0.764


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- reg_lambda = 1
- num_leaves = 10
- n_estimators = 200
- min_child_samples = 20
- max_depth = 7
- learning_rate = 0.1

And we can also see the best accuracy score on validation, being 0.761.

In [12]:
labels = ['Gradient Boosting', 'Random Forest', 'Decision Tree', 'XGBoost', 'LightGBM']
# Prediction results on training data
def predict_and_results_train(list_of_models):

    """
    Creating a datframe that will contain the evaluation metrics (F1 Score, Accuracy, Precision, Recall) for each model in the list_of_models for the training data
    """

    f1, accuracy, precision, recall = {}, {}, {}, {}

    for model, label in zip(list_of_models, labels): # Iterate through each model and its corresponding label... zip combines the two lists into pairs
        predictions_train = model.predict(X_train_final)

        """
        This part calculates the evaluation metrics for each model and stores them in each dictionary of the metrics
        """
        f1[label] = f1_score(y_train, predictions_train)
        accuracy[label] = accuracy_score(y_train, predictions_train)
        precision[label] = precision_score(y_train, predictions_train)
        recall[label] = recall_score(y_train, predictions_train)

        results = pd.DataFrame.from_dict({'F1 Score': f1, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})

    return results # this is a dataframe where the columns are the metrics and the rows are the models' scores for train

# Prediction results on validation data
def predict_and_results_val(list_of_models):

    """
    Creating a datframe that will contain the evaluation metrics (F1 Score, Accuracy, Precision, Recall) for each model in the list_of_models for the validation data
    """

    f1, accuracy, precision, recall = {}, {}, {}, {}

    for model, label in zip(list_of_models, labels):
        predictions_val = model.predict(X_val_final)

        f1[label] = f1_score(y_val, predictions_val)
        accuracy[label] = accuracy_score(y_val, predictions_val)
        precision[label] = precision_score(y_val, predictions_val)
        recall[label] = recall_score(y_val, predictions_val)

        results = pd.DataFrame.from_dict({'F1 Score': f1, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})

    return results

print('Prediction Results on Training Data')
display(predict_and_results_train([Best_Gradient_Boosting_Classifier, Best_Random_Forest_Classifier, Best_Decision_Tree_Classifier, Best_XGB_Classifier, Best_LightGBM_Classifier]))
print('Prediction Results on Validation Data')
display(predict_and_results_val([Best_Gradient_Boosting_Classifier, Best_Random_Forest_Classifier, Best_Decision_Tree_Classifier, Best_XGB_Classifier, Best_LightGBM_Classifier]))

Prediction Results on Training Data


Unnamed: 0,F1 Score,Accuracy,Precision,Recall
Gradient Boosting,0.879832,0.842814,0.855043,0.906101
Random Forest,0.8,0.743336,0.791861,0.808308
Decision Tree,0.789394,0.733718,0.793013,0.785807
XGBoost,0.988185,0.984886,0.981229,0.99524
LightGBM,0.895833,0.865348,0.880485,0.911727


Prediction Results on Validation Data


Unnamed: 0,F1 Score,Accuracy,Precision,Recall
Gradient Boosting,0.826462,0.767949,0.787934,0.868952
Random Forest,0.797665,0.733333,0.770677,0.826613
Decision Tree,0.789318,0.726923,0.774757,0.804435
XGBoost,0.828322,0.773077,0.798131,0.860887
LightGBM,0.826255,0.769231,0.792593,0.862903


<a id="4.2"></a> 

## **4.2 Choosing Final Model**

Considering the results above, we can conclude that the best model after the hyperparameter tuning is the Gradient Boosting Classifier

O que está em baixo disto vai para o notebook 9

In [None]:
"""
Training the final model with the best combination of parameters
"""

final_model = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=9, max_depth=6, loss='log_loss', learning_rate=0.1)
final_model.fit(X_train_final, y_train)

0,1,2
,loss,'log_loss'
,learning_rate,0.1
,n_estimators,100
,subsample,1.0
,criterion,'friedman_mse'
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_depth,6
,min_impurity_decrease,0.0


In [14]:
# Apply the K Fold and the Repeated K Fold in the model

kf = KFold(n_splits = 10,random_state = 42, shuffle = True)
rkf = RepeatedKFold(n_splits = 5, n_repeats = 3,random_state = 42)

def eval_model_clf(X, y, model):
    y_pred = model.predict(X)
    return accuracy_score(y, y_pred)
def run_model(X, y, model):
    return model.fit(X,y)

def avg_score_clf(method, X, y, model):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        train_model = run_model(X_train, y_train, model)
        value_train = eval_model_clf(X_train, y_train, train_model)
        value_test = eval_model_clf(X_test, y_test, train_model)

        score_train.append(value_train)
        score_test.append(value_test)

    mean_train_score = np.mean(score_train)
    mean_test_score = np.mean(score_test)

    result_df = pd.DataFrame({'Train accuracy': [mean_train_score], 'Test accuracy': [mean_test_score]})
    result_df.index = [f'{str(model)} with {str(method)}']

    return result_df

In [15]:
final_model_train_acc = eval_model_clf(X_train_final, y_train, final_model)
final_model_test_acc = eval_model_clf(X_val_final, y_val, final_model)
final_model_df = pd.DataFrame({'Train accuracy': [final_model_train_acc], 'Test accuracy': [final_model_test_acc]})
final_model_df.index = [f"{str(final_model)} with Simple Data Split"]

final_model_kf_df = avg_score_clf(kf, X_train_final, y_train,final_model)
final_model_rkf_df = avg_score_clf(rkf, X_train_final, y_train,final_model)

df_final_model = pd.concat([final_model_df, final_model_kf_df, final_model_rkf_df])

In [16]:
df_final_model

Unnamed: 0,Train accuracy,Test accuracy
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with Simple Data Split",0.836494,0.780769
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with KFold(n_splits=10, random_state=42, shuffle=True)",0.843669,0.762851
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with RepeatedKFold(n_repeats=3, n_splits=5, random_state=42)",0.848997,0.76367


In [17]:
def predict_and_results_metrics(final_model):
    predictions_train = final_model.predict(X_train_final)
    predictions_validation = final_model.predict(X_val_final)

    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, predictions_train))
    print('Confusion Matrix:\n', confusion_matrix(y_train, predictions_train))

    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, predictions_validation))
    print('Confusion Matrix:\n', confusion_matrix(y_val, predictions_validation))

In [18]:
display(predict_and_results_metrics(final_model))

___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.81      0.72      0.76      1328
           1       0.85      0.90      0.88      2311

    accuracy                           0.84      3639
   macro avg       0.83      0.81      0.82      3639
weighted avg       0.84      0.84      0.83      3639

Confusion Matrix:
 [[ 956  372]
 [ 222 2089]]
___________________________________________________________________________________________________________
                                                VALIDATION                                                 
---------------------------------------------------------------------------------------

None

In [19]:
X_full = pd.concat([X_train_final, X_val_final, X_test_final], axis=0)
y_full = pd.concat([y_train, y_val, y_test], axis=0)


final_model.fit(X_full, y_full)
# 3. Predict on the Kaggle data
final_predictions = final_model.predict(X_predict_final)

# 4. Save
submission = pd.DataFrame({'id': id_predict, 'Quality_class': final_predictions})
submission['Quality_class'] = submission['Quality_class'].map({0: 'KO', 1: 'OK'})
submission.to_csv('submission.csv', index=False)