# **Notebook 4**

## **Table of Contents**

* [1. Introduction](#1-introduction)

* [2. Importing Section](#2-importing-section)
    * [2.1 Importing Libraries](#21-importing-libraries)
    * [2.2 Importing the Data](#22-importing-the-data-from-the-previous-notebooks-stored-through-pickle-file)

* [3. Establish baseline performance and analysis](#3-establish-baseline-performance-and-analysis)
    * [3.1. Defining baseline models](#31-defining-baseline-models.)
    * [3.2. Baseline models' analysis and comparison](#32-baseline-models'-analysis-and-comparison)

* [4. Models' optimization](#4-models'-optimization)
    * [4.1. Optimizing models' performance](#41-optimizing-models'-performance)
    * [4.2. Tuned models' analysis and comparison](#41-tuned-models'-analysis-and-comparison)

* [5. Final model selection](#5-final-model-selection)
    
* [6. Conclusion](#6-conclusion)



In [None]:
"""
- acabar o índice
- melhorar a introdução
- meter markdowns mais extensos e explicativos
"""

# **Notebook 2: Modelling and Tuning**

## **1. Introduction**

Following the data preprocessing and the feature selection done in Notebook 2 and 3 respectively, where we ensure the data was cleaned and ready for the modelling phase and dropped irrelevant and redudant features that didn't add predictive value, this notebook focuses on the **Modelling and Tuning** phase.

### **Objectives & Our Workflow**

1.  **Establish baseline performance and analysis:**

    1.1. **Establish baseline models:**
    In this first phase we train a diverse portfolio of models using default parameters to establish a baseline performance. The metric used to evaluate this performance will be the accuracy `accuracy_score`.

    1.2. **Baseline Models' analysis and comparison:**
    As previously said, we evaluate the models based on the accuracy metric, showing the accuracy of the models in **both training and validation sets**, which allows us not only to observe the models that generalize better in **unseen data**, but also diagnose models that are **overfitting**.

3.  **Selecting the top 4 models to optimize performance**
    After analyzing the models with the default parameters, we select the top 4 most promising models for optimization.
    For that we decided to perform Hyperparameter tuning using GridSearchCV. We combined this optimization technique with **Stratified K-Fold Cross-Validation** to ensure that our tuning is robust and the selected parameters are the best for predicting **both target classes (OK and KO)**.

2.  **Final model selection**
    After the models' optimization we compare the **tuned models** and select the one that has **best score in both train and validation set.**




## **2. Importing Section**

### **2.1. Importing Libraries**

In [1]:
"""
Importing the necessary libraries
"""


import pandas as pd
import numpy as np
import pickle, os
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier


### **2.2. Importing the data from the previous notebooks stored through pickle file**

In order to **load the variables created in notebook 2 and 3** we utilized pickle file, so we can transitate the variables from one notebook to another notebook.

In [None]:
"""
Loading preprocessed data from notebook3 through a pickle file.

In this part we mostly load the training, validation scaled and ready to be used in models.
Also, the final predict data is laoded   PS: KAGGLE É PARA O NOTEBOOK 9
"""


with open(r'Nata_Files\\notebook3newwwww.pkl', 'rb') as f:
    notebook3newww_data = pickle.load(f)


X_train_final = notebook3newww_data['X_train_final']
X_val_final = notebook3newww_data['X_val_final']
X_test_final = notebook3newww_data['X_test_final']
y_train = notebook3newww_data['y_train']
y_val = notebook3newww_data['y_val']
y_test = notebook3newww_data['y_test']
X_predict_final = notebook3newww_data['X_predict_final']
id_predict = notebook3newww_data['id_predict']

## **3. Establish baseline performance and analysis**

### **3.1. Defining baseline models**

Here we **establish the models that are going to be evaluated.** 

In [None]:
"""
We define a dictionary containing all the models we want to use for classification.
This will help us to iterate through them later on for training and evaluation.
"""


models = {
    "Logistic Regression": LogisticRegression(), 
    "Decision Tree":       DecisionTreeClassifier(),
    "Random Forest":       RandomForestClassifier(),
    "XGBoost":             XGBClassifier(eval_metric='logloss'),
    "LightGBM":            LGBMClassifier(verbose=-1),
    "MLP Classifier":      MLPClassifier(max_iter=1000, early_stopping=True),
    "Gradient Boosting":   GradientBoostingClassifier(),
    "KNClassifier":        KNeighborsClassifier()
}

### **3.2. Baseline Models' analysis and comparison**

In [4]:
results_list = []

for name, model in models.items():
    # 1. Fit on Train
    model.fit(X_train_final, y_train)
    
    # 2. Predict on Train AND Validation
    train_pred = model.predict(X_train_final)
    val_pred = model.predict(X_val_final)
    
    # 3. Calculate Scores
    train_acc = accuracy_score(y_train, train_pred)
    val_acc = accuracy_score(y_val, val_pred)
    
    
    # Store for later analysis if needed
    results_list.append({
        "Model": name,
        "Train Acc": train_acc,
        "Val Acc": val_acc,
    })

# Optional: View as a sorted DataFrame
print("\n--- Dataframe containing the models sorted by validation accuracy ---")
df_results = pd.DataFrame(results_list).sort_values(by="Val Acc", ascending=False)
display(df_results)


--- Dataframe containing the models sorted by validation accuracy ---


Unnamed: 0,Model,Train Acc,Val Acc
2,Random Forest,1.0,0.792308
3,XGBoost,0.992031,0.775641
6,Gradient Boosting,0.80654,0.770513
4,LightGBM,0.928552,0.765385
7,KNClassifier,0.818631,0.744872
5,MLP Classifier,0.757351,0.739744
0,Logistic Regression,0.738939,0.725641
1,Decision Tree,1.0,0.70641


This dataframe created, contains the **accuracy** in training and validation of the baseline models using default parameters.

From this dataframe we can observe that some models perform better than other and some models are overfitting significantly!

**Models overfitting**: 
-  Random Forest
- LightGBM
- XGBoost
- Decision Tree Classifier

Note: This overfitting is **expected** for these models when we don't specify parameters to combat the overfitting!

Besides this, we decided to choose these models to optimize their complexity and performance using **GridSearchCV**:
- Gradient Boosting Classifier
- ( talvez adicionar aqui um modelo non-tree)
- Random Forest Classifier
- XGBoost Classifier
- LightGBM Classifier

## **4. Models' optimization**

**Firstly**, GridSearchCV is a ...

In [None]:
"""
PERCEBER PORQUE É QUE FAZEMOS ISTO - fazer (tiago)
"""
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


### **4.1 Optimizing models' performance**

#### **Hyperparameter tuning for Gradient Boosting Classifier.**

EXPLICAR A IMPORTANCIA DOS PARAMETERS USADOS E O QUE FAZEM

In [None]:
"""
We define a range for some parameters of the model to be used by RandomizedSearchCV, specifying that we want to optimize for accuracy score
"""


parameters_Gradient_Boosting_Classifier= {'learning_rate': [0.1, 0.05, 0.01], 
                                          'n_estimators': [100, 150], 
                                          'max_depth': [3, 6, 9], 
                                          'max_leaf_nodes': [3, 6, 9]}


Best_Gradient_Boosting_Classifier = GridSearchCV(estimator = GradientBoostingClassifier(), param_grid = parameters_Gradient_Boosting_Classifier, cv = cv_strategy, scoring = 'accuracy').fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_Gradient_Boosting_Classifier.best_params_}")
print(f"Best accuracy score: {Best_Gradient_Boosting_Classifier.best_score_:.3f}")  

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best combination of parameters: {'learning_rate': 0.1, 'max_depth': 3, 'max_leaf_nodes': 9, 'n_estimators': 100}
Best accuracy score: 0.760


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- n_estimators = 150
- max_leaf_nodes = 6
- max_depth = 6
- loss = 'exponential'
- learning_rate = 0.05

**Now we're calculating the validation score (X_val_final):**


In [19]:
Best_Gradient_Boosting_Classifier.score(X_val_final, y_val)

0.7705128205128206

#### **Hyperparameter tuning for Random Forest Classifier.**

EXPLICAR A IMPORTANCIA DOS PARAMETERS USADOS E O QUE FAZEM

In [None]:
parameters_Random_Forest_Classifier = {'n_estimators': [25, 50, 100, 150], 
                                       'max_features': ['sqrt', 'log2', None], 
                                       'max_depth': [3, 6, 9], 
                                       'max_leaf_nodes': [3, 6, 9]} #ver se adicionar mais parâmetros


Best_Random_Forest_Classifier = GridSearchCV(estimator = RandomForestClassifier(), param_grid = parameters_Random_Forest_Classifier, scoring = 'accuracy', cv = cv_strategy).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_Random_Forest_Classifier.best_params_}")
print(f"Best cross-validation score (in train set): {Best_Random_Forest_Classifier.best_score_:.3f}")

Best combination of parameters: {'max_depth': 9, 'max_features': 'sqrt', 'max_leaf_nodes': 9, 'n_estimators': 100}
Best accuracy score: 0.743


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- n_estimators = 150
- max_leaf_nodes = 9
- max_features = 'None'
- max_depth = 6

**Now we're calculating the validation score (X_val_final):**


In [25]:
Best_Random_Forest_Classifier.score(X_val_final, y_val)

0.7487179487179487

#### **Hyperparameter tuning for XGBoost Classifier.**

EXPLICAR A IMPORTANCIA DOS PARAMETERS USADOS E O QUE FAZEM

In [26]:
parameters_XGB_Classifier = {'n_estimators': [25, 50, 100, 150], 
                             'max_depth': [3, 6, 9], 
                             'learning_rate': [0.1, 0.05, 0.01], 
                             'gamma': [0, 0.1, 0.2], 
                             'colsample_bytree': [0.3, 0.4, 0.5]}


Best_XGB_Classifier = GridSearchCV(estimator = XGBClassifier(), param_grid = parameters_XGB_Classifier, scoring = 'accuracy', cv = cv_strategy).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_XGB_Classifier.best_params_}")
print(f"Best accuracy score: {Best_XGB_Classifier.best_score_:.3f}")

Best combination of parameters: {'colsample_bytree': 0.4, 'gamma': 0, 'learning_rate': 0.05, 'max_depth': 9, 'n_estimators': 150}
Best accuracy score: 0.766


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- n_estimators = 100
- max_depth = 9
- learning_rate = 0.1
- gamma = 0.3
- colsample_bytree = 0.3

**Now we're calculating the validation score (X_val_final):**


In [None]:
Best_XGB_Classifier.score(X_val_final, y_val)

#### **Hyperparameter tuning for LightGBM Classifier.**

EXPLICAR A IMPORTANCIA DOS PARAMETERS USADOS E O QUE FAZEM

In [None]:
parameters_LightGBM_Classifier = {
    'num_leaves': [10, 15, 20],
    'max_depth': [5, 10], 
    'learning_rate': [0.1, 0.03, 0.01],
    'n_estimators': [200, 300],
    'min_child_samples': [20, 30, 40],
}

Best_LightGBM_Classifier = GridSearchCV(estimator = LGBMClassifier(), param_grid = parameters_LightGBM_Classifier, scoring = 'accuracy', cv = cv_strategy).fit(X_train_final, y_train)
print(f"Best combination of parameters: {Best_LightGBM_Classifier.best_params_}")
print(f"Best accuracy score: {Best_LightGBM_Classifier.best_score_:.3f}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best combination of parameters: {'reg_lambda': 0.1, 'num_leaves': 15, 'n_estimators': 300, 'min_child_samples': 30, 'max_depth': 10, 'learning_rate': 0.1}
Best accuracy score: 0.763


**Analyzing the results:**

From the code above, we can observe that the best combination of parameteres is:
- reg_lambda = 1
- num_leaves = 10
- n_estimators = 200
- min_child_samples = 20
- max_depth = 7
- learning_rate = 0.1

**Now we're calculating the validation score (X_val_final):**


In [None]:
Best_LightGBM_Classifier.score(X_val_final, y_val)

### **4.2. Tuned models' analysis and comparison**

UM TEXTINHO A DIZER QUE ANALISAMOS O SCORE e outras metrics

In [None]:
labels = ['Gradient Boosting', 'Random Forest', 'Decision Tree', 'XGBoost', 'LightGBM']
# Prediction results on training data
def predict_and_results_train(list_of_models):

    """
    Creating a datframe that will contain the evaluation metrics (F1 Score, Accuracy, Precision, Recall) for each model in the list_of_models for the training data
    """

    f1, accuracy, precision, recall = {}, {}, {}, {}

    for model, label in zip(list_of_models, labels): # Iterate through each model and its corresponding label... zip combines the two lists into pairs
        predictions_train = model.predict(X_train_final)

        """
        This part calculates the evaluation metrics for each model and stores them in each dictionary of the metrics
        """
        f1[label] = f1_score(y_train, predictions_train)
        accuracy[label] = accuracy_score(y_train, predictions_train)
        precision[label] = precision_score(y_train, predictions_train)
        recall[label] = recall_score(y_train, predictions_train)

        results = pd.DataFrame.from_dict({'F1 Score': f1, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})

    return results # this is a dataframe where the columns are the metrics and the rows are the models' scores for train

# Prediction results on validation data
def predict_and_results_val(list_of_models):

    """
    Creating a datframe that will contain the evaluation metrics (F1 Score, Accuracy, Precision, Recall) for each model in the list_of_models for the validation data
    """

    f1, accuracy, precision, recall = {}, {}, {}, {}

    for model, label in zip(list_of_models, labels):
        predictions_val = model.predict(X_val_final)

        f1[label] = f1_score(y_val, predictions_val)
        accuracy[label] = accuracy_score(y_val, predictions_val)
        precision[label] = precision_score(y_val, predictions_val)
        recall[label] = recall_score(y_val, predictions_val)

        results = pd.DataFrame.from_dict({'F1 Score': f1, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})

    return results

print('Prediction Results on Training Data')
display(predict_and_results_train([Best_Gradient_Boosting_Classifier, Best_Random_Forest_Classifier, Best_Decision_Tree_Classifier, Best_XGB_Classifier, Best_LightGBM_Classifier]))
print('Prediction Results on Validation Data')
display(predict_and_results_val([Best_Gradient_Boosting_Classifier, Best_Random_Forest_Classifier, Best_Decision_Tree_Classifier, Best_XGB_Classifier, Best_LightGBM_Classifier]))

Prediction Results on Training Data


Unnamed: 0,F1 Score,Accuracy,Precision,Recall
Gradient Boosting,0.852349,0.80654,0.827025,0.879273
Random Forest,0.815394,0.753504,0.777473,0.857205
Decision Tree,0.789394,0.733718,0.793013,0.785807
XGBoost,0.994825,0.993405,0.991405,0.998269
LightGBM,0.95736,0.945315,0.948217,0.966681


Prediction Results on Validation Data


Unnamed: 0,F1 Score,Accuracy,Precision,Recall
Gradient Boosting,0.829035,0.770513,0.787659,0.875
Random Forest,0.814745,0.748718,0.766904,0.868952
Decision Tree,0.789318,0.726923,0.774757,0.804435
XGBoost,0.837255,0.787179,0.814885,0.860887
LightGBM,0.828654,0.773077,0.79702,0.862903


TEXTINHO A ANALISAR AS DATAFRAMES

## **5. Final model selection**

Considering the results above, we can conclude that the best model after the hyperparameter tuning is the Gradient Boosting Classifier

## **6. Conclusion**

A PARTIR DAQUI É PARA O NOTEBOOK 9

O que está em baixo disto vai para o notebook 9

In [None]:
"""
Training the final model with the best combination of parameters
"""

final_model = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=9, max_depth=6, loss='log_loss', learning_rate=0.1)
final_model.fit(X_train_final, y_train)

0,1,2
,loss,'log_loss'
,learning_rate,0.1
,n_estimators,100
,subsample,1.0
,criterion,'friedman_mse'
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_depth,6
,min_impurity_decrease,0.0


In [None]:
# Apply the K Fold and the Repeated K Fold in the model

kf = KFold(n_splits = 10,random_state = 42, shuffle = True)
rkf = RepeatedKFold(n_splits = 5, n_repeats = 3,random_state = 42)

def eval_model_clf(X, y, model):
    y_pred = model.predict(X)
    return accuracy_score(y, y_pred)
def run_model(X, y, model):
    return model.fit(X,y)

def avg_score_clf(method, X, y, model):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        train_model = run_model(X_train, y_train, model)
        value_train = eval_model_clf(X_train, y_train, train_model)
        value_test = eval_model_clf(X_test, y_test, train_model)

        score_train.append(value_train)
        score_test.append(value_test)

    mean_train_score = np.mean(score_train)
    mean_test_score = np.mean(score_test)

    result_df = pd.DataFrame({'Train accuracy': [mean_train_score], 'Test accuracy': [mean_test_score]})
    result_df.index = [f'{str(model)} with {str(method)}']

    return result_df

In [None]:
final_model_train_acc = eval_model_clf(X_train_final, y_train, final_model)
final_model_test_acc = eval_model_clf(X_val_final, y_val, final_model)
final_model_df = pd.DataFrame({'Train accuracy': [final_model_train_acc], 'Test accuracy': [final_model_test_acc]})
final_model_df.index = [f"{str(final_model)} with Simple Data Split"]

final_model_kf_df = avg_score_clf(kf, X_train_final, y_train,final_model)
final_model_rkf_df = avg_score_clf(rkf, X_train_final, y_train,final_model)

df_final_model = pd.concat([final_model_df, final_model_kf_df, final_model_rkf_df])

In [None]:
df_final_model

Unnamed: 0,Train accuracy,Test accuracy
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with Simple Data Split",0.824952,0.769231
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with KFold(n_splits=10, random_state=42, shuffle=True)",0.829898,0.752679
"GradientBoostingClassifier(max_depth=6, max_leaf_nodes=9) with RepeatedKFold(n_repeats=3, n_splits=5, random_state=42)",0.83773,0.760005


In [None]:
def predict_and_results_metrics(final_model):
    predictions_train = final_model.predict(X_train_final)
    predictions_validation = final_model.predict(X_val_final)

    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, predictions_train))
    print('Confusion Matrix:\n', confusion_matrix(y_train, predictions_train))

    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, predictions_validation))
    print('Confusion Matrix:\n', confusion_matrix(y_val, predictions_validation))

In [None]:
display(predict_and_results_metrics(final_model))

___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.80      0.70      0.74      1328
           1       0.84      0.90      0.87      2311

    accuracy                           0.82      3639
   macro avg       0.82      0.80      0.80      3639
weighted avg       0.82      0.82      0.82      3639

Confusion Matrix:
 [[ 926  402]
 [ 237 2074]]
___________________________________________________________________________________________________________
                                                VALIDATION                                                 
---------------------------------------------------------------------------------------

None

In [None]:
X_full = pd.concat([X_train_final, X_val_final, X_test_final], axis=0)
y_full = pd.concat([y_train, y_val, y_test], axis=0)


final_model.fit(X_full, y_full)
# 3. Predict on the Kaggle data
final_predictions = final_model.predict(X_predict_final)

# 4. Save
submission = pd.DataFrame({'id': id_predict, 'Quality_class': final_predictions})
submission['Quality_class'] = submission['Quality_class'].map({0: 'KO', 1: 'OK'})
submission.to_csv('submission.csv', index=False)

NameError: name 'pd' is not defined