# **Notebook 9**

## **Table of Contents**

* [1. Introduction](#1-introduction)

* [2. Importing Section](#2-importing-section)
    * [2.1 Importing Libraries](#21-importing-libraries)
    * [2.2 Importing Datasets](#22-importing-datasets)

* [3. Preprocessing and Feature Management](#3-establish-baseline-performance-and-analysis)

    
* [4. Final model training and Kaggle prediction](#4-final-model-training-and-kaggle-prediction)



# **Notebook 9: Complete final model and Kaggle predictions**

## **1. Introduction**

Following the data preprocessing, feature selection and model selection and evaluation done in the **previous notebook**, where we **cleaned the data, dropped irrelevant and redudant features that didn't add predictive value and lastly tested a bunch of models**, getting a final model. This notebook focuses on the **Final model predictions** phase.

### **Objectives & Our Workflow**

1.  **Import the dataset and do all the transformations applied in the previous notebook**:
    In this final notebook, all the transformations applied to the dataset in the previous notebook are **applied to the original dataset** `"learn.csv"` **imported in this notebook!**

2.  **Full dataset training and evaluation**
    We train our final model on the full dataset `"learn.csv"` so it **learns as much as possible before making Kaggle predictions.** We perform a **Repeated Stratified K-Fold Cross-Validation** on the complete dataset. This allows us to verify the final model stability, trained on all the dataset, while ensuring that the **amount of OK vs KO is the same in every split.**
    

3.  **Kaggle predictions**
    Finally, we use the the final model trained on the full dataset to genreate the predictions (classification labels, "OK" or "KO") for the unseen data in `"predict.csv"` and lastly we export the results to a csv called `"submission.csv"`



## **2. Importing Section**

###  **2.1. Importing Libraries**


In [1]:
"""
Importing the necessary libraries
"""


import pandas as pd
import numpy as np
import pickle, os
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder

###  **2.2. Importing Datasets**


In [2]:
dataset_original = pd.read_csv('Nata_Files/learn.csv', index_col = 0)
dataset_original

Unnamed: 0_level_0,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,notes_baker,origin,oven_temperature,pastry_type,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1,54.0,24.0,26.0,100.4,52.0,11.0,309.0,3.2,,Lisboa,,Pastel Nata,207.0,42.74,22.8,5.7,KO
2,66.0,37.0,34.0,98.0,46.0,10.0,317.0,3.3,,Lisboa,306.0,,245.0,41.73,11.6,4.0,KO
3,41.0,30.0,19.0,99.3,53.0,10.0,130.0,3.4,,Porto,121.0,,186.0,75.10,20.3,7.5,OK
4,62.0,24.0,48.0,98.0,115.0,9.0,354.0,3.3,,Lisboa,357.0,Pastel de Nata,186.0,46.41,73.3,4.2,OK
5,55.0,21.0,34.0,100.1,48.0,9.0,211.0,3.0,,Lisboa,202.0,Pastel de nata,218.0,56.52,80.1,6.0,KO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5196,60.0,18.0,35.0,96.0,72.0,11.0,215.0,3.3,,Lisboa,222.0,,177.0,34.42,58.9,5.7,OK
5197,61.0,25.0,40.0,96.4,99.0,9.0,367.0,3.2,,Lisboa,366.0,Pastel De Nata,224.0,46.18,141.4,6.5,KO
5198,69.0,18.0,36.0,97.7,90.0,11.0,206.0,3.6,,Lisboa,203.0,Pastel de nata,158.0,28.46,10.0,6.0,OK
5199,70.0,25.0,40.0,101.2,139.0,9.0,414.0,3.1,,Lisboa,391.0,,196.0,56.92,188.9,5.7,KO


In [3]:
'''Loading the prediction dataset'''
predict_data = pd.read_csv('Nata_Files/predict.csv', index_col = 0)
display(predict_data.shape)
display(predict_data)

id_predict = predict_data.index

(1300, 16)

Unnamed: 0_level_0,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,notes_baker,origin,oven_temperature,pastry_type,preheating_time,salt_ratio,sugar_content,vanilla_extract
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
5201,79.0,22,40,98.6,79.0,9.0,259.0,3.2,,Lisboa,268.0,,208.0,49.63,182.6,4.0
5202,49.0,26,32,101.9,105.0,9.0,287.0,3.2,,Lisboa,287.0,Pastel de nata,189.0,182.54,76.2,4.8
5203,80.0,28,24,96.6,20.0,10.0,64.0,3.4,,Porto,74.0,Pastel Nata,201.0,100.41,23.5,6.1
5204,74.0,21,37,97.2,81.0,9.0,314.0,3.0,,Lisboa,317.0,,220.0,46.66,143.2,4.9
5205,41.0,19,41,97.3,104.0,10.0,246.0,3.2,,Lisboa,243.0,Pastel Nata,191.0,39.45,143.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6496,73.0,24,23,98.9,58.0,10.0,208.0,3.1,,Lisboa,200.0,Pastel De Nata,189.0,37.80,142.6,4.0
6497,66.0,21,34,100.6,62.0,10.0,277.0,3.1,,Lisboa,264.0,Pastel de nata,235.0,33.78,102.2,4.8
6498,41.0,49,23,98.9,20.0,11.0,57.0,3.4,,Porto,69.0,Pastel de nata,243.0,86.73,21.3,7.4
6499,42.0,26,54,98.0,44.0,12.0,135.0,3.4,,Porto,115.0,Pastel de Nata,333.0,135.35,18.0,7.6


# **3. Preprocessing and Feature Management**

**Handling inconsistencies as we saw in the previous notebook.**

In [4]:
'''
In the descriptive statistics of notebook 1, we verified that there were no negative number in the dataset. 
Therefore, we will not check for impossible values for that constraint.
'''
# 1st Constraint: Percentages (0 to 100)
impossible_humidityL = dataset_original[(dataset_original['ambient_humidity'] > 100) | (dataset_original['ambient_humidity'] < 0)]
impossible_fatL = dataset_original[(dataset_original['cream_fat_content'] > 100) | (dataset_original['cream_fat_content'] < 0)]

impossible_humidityP = predict_data[(predict_data['ambient_humidity'] > 100) | (predict_data['ambient_humidity'] < 0)]
impossible_fatP = predict_data[(predict_data['cream_fat_content'] > 100) | (predict_data['cream_fat_content'] < 0)]

# 2nd Constraint: pH Scale (0 to 14)
impossible_phL = dataset_original[(dataset_original['lemon_zest_ph'] > 14) | (dataset_original['lemon_zest_ph'] < 0)]
impossible_phP = predict_data[(predict_data['lemon_zest_ph'] > 14) | (predict_data['lemon_zest_ph'] < 0)]

print(f"Impossible Humidity rows in dataset_learn: {len(impossible_humidityL)}")
print(f"Impossible Humidity rows in predict_data: {len(impossible_humidityP)}")

print(f"Impossible Fat rows in dataset_learn: {len(impossible_fatL)}")
print(f"Impossible Fat rows in predict_data: {len(impossible_fatP)}")

print(f"Impossible pH rows in dataset_learn: {len(impossible_phL)}")
print(f"Impossible pH rows in predict_data: {len(impossible_phP)}")

Impossible Humidity rows in dataset_learn: 0
Impossible Humidity rows in predict_data: 0
Impossible Fat rows in dataset_learn: 1099
Impossible Fat rows in predict_data: 281
Impossible pH rows in dataset_learn: 0
Impossible pH rows in predict_data: 0


In [5]:
'''
We will just replace the values in cream fat content as it is the only feature with impossible values found (in both datasets).
'''
dataset_original.loc[impossible_fatL.index, 'cream_fat_content'] = np.nan
predict_data.loc[impossible_fatP.index, 'cream_fat_content'] = np.nan

In [6]:
print(dataset_original['origin'].value_counts())

origin
Lisboa     3486
Porto      1167
LISBOA      119
Lisboa       88
lisboa       83
PORTO        33
Porto        25
 Lisboa      20
porto        15
 Porto        3
Name: count, dtype: int64


In [7]:
dataset_original['origin'] = dataset_original['origin'].str.strip().str.lower().str.title() 

predict_data['origin'] = predict_data['origin'].str.strip().str.lower().str.title()

print(dataset_original['origin'].unique())


['Lisboa' 'Porto' nan]


**From the previous notebook, we know that columns `notes_baker` and `pastry_type` are not useful for the model.**

In [8]:
dataset_original = dataset_original.drop(columns=['notes_baker', 'pastry_type'])
predict_data = predict_data.drop(columns=['notes_baker', 'pastry_type'])
dataset_original.info()


<class 'pandas.core.frame.DataFrame'>
Index: 5200 entries, 1 to 5200
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ambient_humidity   5182 non-null   float64
 1   baking_duration    5199 non-null   float64
 2   cooling_period     5199 non-null   float64
 3   cream_fat_content  4077 non-null   float64
 4   egg_temperature    5176 non-null   float64
 5   egg_yolk_count     5176 non-null   float64
 6   final_temperature  5175 non-null   float64
 7   lemon_zest_ph      5174 non-null   float64
 8   origin             5039 non-null   object 
 9   oven_temperature   5179 non-null   float64
 10  preheating_time    5181 non-null   float64
 11  salt_ratio         5187 non-null   float64
 12  sugar_content      5178 non-null   float64
 13  vanilla_extract    5182 non-null   float64
 14  quality_class      5199 non-null   object 
dtypes: float64(13), object(2)
memory usage: 779.0+ KB


In [9]:
print(dataset_original.isna().sum())

ambient_humidity       18
baking_duration         1
cooling_period          1
cream_fat_content    1123
egg_temperature        24
egg_yolk_count         24
final_temperature      25
lemon_zest_ph          26
origin                161
oven_temperature       21
preheating_time        19
salt_ratio             13
sugar_content          22
vanilla_extract        18
quality_class           1
dtype: int64


In [10]:
dataset_original = dataset_original.dropna(subset=['quality_class'])

In [11]:
"""
just changing the data type of a column
"""

dataset_original.loc[:, 'egg_yolk_count'] = dataset_original['egg_yolk_count'].astype('Int64')
dataset_original.head(1) #just to check if the changes were applied

[11, 10, 10,  9,  9,  9,  9, 13,  9, 11,
 ...
  9, 11, 11, 13, 12, 11,  9, 11,  9, 10]
Length: 5199, dtype: Int64' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  dataset_original.loc[:, 'egg_yolk_count'] = dataset_original['egg_yolk_count'].astype('Int64')


Unnamed: 0_level_0,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1,54.0,24.0,26.0,,52.0,11,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,KO


The target variable `quality_class` is categorical ('OK' or 'KO'). To better prepare the data for the binary classification models (which is the case), it is necessary to transform it into a binary variable, which means either '0' or '1'.  

We decided to attribute:
- **1** for "OK", the Pastel de Nata is in a good state.
- **0** for "KO", you should not eat the Pastel de Nata.

The "OK" class is positive and is the one that will be predicted.

In [12]:
dataset_original['quality_class'].replace({'OK': 1, 'KO': 0}, inplace= True)
dataset_original.head() #just to check if the changes were applied

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset_original['quality_class'].replace({'OK': 1, 'KO': 0}, inplace= True)
  dataset_original['quality_class'].replace({'OK': 1, 'KO': 0}, inplace= True)


Unnamed: 0_level_0,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1,54.0,24.0,26.0,,52.0,11,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,0
2,66.0,37.0,34.0,98.0,46.0,10,317.0,3.3,Lisboa,306.0,245.0,41.73,11.6,4.0,0
3,41.0,30.0,19.0,99.3,53.0,10,130.0,3.4,Porto,121.0,186.0,75.1,20.3,7.5,1
4,62.0,24.0,48.0,98.0,115.0,9,354.0,3.3,Lisboa,357.0,186.0,46.41,73.3,4.2,1
5,55.0,21.0,34.0,,48.0,9,211.0,3.0,Lisboa,202.0,218.0,56.52,80.1,6.0,0


&nbsp;&nbsp;&nbsp;&nbsp;To better capture the interaction between ingredients and the baking process, we decided to **create two new features**.

* **`sugar_fat_ratio`**:
    This captures the relative proportion of two key ingredients: `sugar_content` and `cream_fat_content`. A small constant ($1e^{-6}$) was added to prevent errors due to divison by zero in cases where fat content might be zero or missing. 

    
* **`baking_intensity`**:
    By multiplying `baking_duration` by `oven_temperature`, we are able to quantify the overall heat exposure the product underwent during the baking process. This allows the model to dstinguish between a 'short but high' heat and a 'long but low' heat.



In [13]:
#Original Dataset
dataset_original['sugar_fat_ratio'] = dataset_original['sugar_content'] / (dataset_original['cream_fat_content'] + 1e-6)  # the small constant to avoid division by zero
dataset_original['baking_intensity'] = dataset_original['baking_duration'] * dataset_original['oven_temperature']

#Predict Dataset
predict_data['sugar_fat_ratio'] = predict_data['sugar_content'] / (predict_data['cream_fat_content'] + 1e-6)
predict_data['baking_intensity'] = predict_data['baking_duration'] * predict_data['oven_temperature']

### In this last notebook, we will **train the final model on the Full data** as we want to make predictions in another dataset, so there is no training, validation and test sets

In [14]:
X_origin = dataset_original.drop('quality_class', axis = 1)
y_origin = dataset_original['quality_class']

#### **Separate Numerical and Categorical**

In [15]:
numerical_features = X_origin.select_dtypes(include = np.number).columns.tolist()
categorical_features = X_origin.select_dtypes(exclude = np.number).columns.tolist()



num_features_origin = X_origin.select_dtypes(include = np.number).columns
cat_features_origin = X_origin.select_dtypes(exclude = np.number).columns

num_features_predict = predict_data.select_dtypes(include = np.number).columns
cat_features_predict = predict_data.select_dtypes(exclude = np.number).columns

dataset_original.dtypes

ambient_humidity     float64
baking_duration      float64
cooling_period       float64
cream_fat_content    float64
egg_temperature      float64
egg_yolk_count         Int64
final_temperature    float64
lemon_zest_ph        float64
origin                object
oven_temperature     float64
preheating_time      float64
salt_ratio           float64
sugar_content        float64
vanilla_extract      float64
quality_class          int64
sugar_fat_ratio      float64
baking_intensity     float64
dtype: object

In [16]:
print("Numerical Features:", num_features_origin)
print("numerical_features:", num_features_predict)

X_origin_num = X_origin[num_features_origin]
X_origin_cat = X_origin[cat_features_origin]

X_predict_num = predict_data[num_features_predict]
X_predict_cat = predict_data[cat_features_predict]

Numerical Features: Index(['ambient_humidity', 'baking_duration', 'cooling_period',
       'cream_fat_content', 'egg_temperature', 'egg_yolk_count',
       'final_temperature', 'lemon_zest_ph', 'oven_temperature',
       'preheating_time', 'salt_ratio', 'sugar_content', 'vanilla_extract',
       'sugar_fat_ratio', 'baking_intensity'],
      dtype='object')
numerical_features: Index(['ambient_humidity', 'baking_duration', 'cooling_period',
       'cream_fat_content', 'egg_temperature', 'egg_yolk_count',
       'final_temperature', 'lemon_zest_ph', 'oven_temperature',
       'preheating_time', 'salt_ratio', 'sugar_content', 'vanilla_extract',
       'sugar_fat_ratio', 'baking_intensity'],
      dtype='object')


### **Imputation of Missing Values**

In [17]:
imputer = SimpleImputer(strategy='median')
imputer.fit(X_origin_num)

0,1,2
,missing_values,
,strategy,'median'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False


In [18]:
X_origin_num = imputer.transform(X_origin_num)
X_predict_num = imputer.transform(X_predict_num)

### **Encoding Categorical Features**

We utilized the `OneHotEncoder` to transform **categorical data into a machine readable format**. We created binary columns for each category (e.g. `origin_Lisboa`, `origin_Porto`)


In [19]:
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) 

encoder.fit(X_origin_cat) #Now fitting to the full data

#Full Data
X_origin_cat_encoded = encoder.transform(X_origin_cat)
X_origin_cat_encoded = pd.DataFrame(X_origin_cat_encoded, columns = encoder.get_feature_names_out(categorical_features), index = X_origin_cat.index)


#Prediction Data
X_predict_cat_encoded = encoder.transform(X_predict_cat)
X_predict_cat_encoded = pd.DataFrame(X_predict_cat_encoded, columns = encoder.get_feature_names_out(categorical_features), index = X_predict_cat.index)

### **Feature Scaling**

Since our dataset has outliers, we chose **RobustScaler** which ensures that the majority of our data is scaled to a standard range without being distorted by the few extreme values

In [20]:
scaler = RobustScaler()
#Fitting the scaler on the full data
scaler.fit(X_origin_num)

#Transforming on full data
X_origin_num_scaled = scaler.transform(X_origin_num)
X_origin_num_scaled = pd.DataFrame(X_origin_num_scaled, columns = numerical_features, index = X_origin.index)

#Transforming prediction data
X_predict_num_scaled = scaler.transform(X_predict_num)
X_predict_num_scaled = pd.DataFrame(X_predict_num_scaled, columns = numerical_features, index = predict_data.index)



To prepare for **Feature Selection and Modeling**, we merged the processed numerical and categorical features into a unified dataset. 

In the Feature Selection part we must have a view of how **ALL** the variables are related to each other and how important to the classification they are. Models like Lasso and RFE do not choose the 'best numerical' or 'best categorical' in isolation but instead in general.

In [21]:
features_to_drop = ['ambient_humidity', 'sugar_fat_ratio', 'lemon_zest_ph']

X_origin_final = pd.concat([X_origin_num_scaled, X_origin_cat_encoded], axis=1)
X_predict_final = pd.concat([X_predict_num_scaled, X_predict_cat_encoded], axis = 1)

X_origin_final.drop(columns = features_to_drop, inplace= True)
X_predict_final.drop(columns = features_to_drop, inplace = True)

### **Feature Selection**


**In the previous notebook, from analyzing the correlation results we know which features to drop.**

In [22]:
#we drop immediately because we are SURE these features are not useful
X_origin_final = X_origin_final.drop(columns=['final_temperature', 'sugar_content', 'origin_Porto'], errors='ignore')
X_predict_final = X_predict_final.drop(columns=['final_temperature', 'sugar_content', 'origin_Porto'], errors='ignore')


|Feature Data|RFE|Lasso|Decision Tree|Relevance (Decision)|
|---|---|---|---|---|
|Egg Yolk Count|Keep|Keep|Keep|**Keep**|
|Egg Temperature|Keep|Keep|Keep|**Keep**|
|Baking Intensity|Keep|Keep|Discard|**Keep**|
|Baking Duration|Discard|Keep|Keep|**Keep**|
|Sugar Fat Ratio|Discard|Keep|Keep|**Keep**|
|Salt Ratio|Discard|Keep|Keep|**Keep**|
|Vanilla Extract|Discard|Keep|Keep|**Keep**|
|Lemon PH|Discard|Keep|Keep|**Keep**|
|Cooling Period|Discard|Keep|Keep|**Keep**|
|Oven Temperature|Keep|Discard|Discard|**Discard**|
|Preheating Time|Discard|Discard|Keep|**Discard**|
|Ambient Humidity|Discard|Discard|Keep|**Discard**|
|Cream Fat Content|Discard|Discard|Discard|**Discard**|
|Origin (Lisboa)|Discard|Discard|Discard|**Discard**|


In [23]:
features_to_drop = ['ambient_humidity', 'preheating_time', 'oven_temperature', 'cream_fat_content', 'origin_lisboa']
X_origin_final = X_origin_final.drop(columns=features_to_drop, errors='ignore')
X_predict_final = X_predict_final.drop(columns=features_to_drop, errors='ignore')

**From the analysis in the previous notebook, we know that there is no more features that are worth dropping.**

**After this, we have now applied all the necessary preprocessing to have the dataset fully ready for the final model.**

# **4. Final model training and Kaggle predictions**

In [24]:
"""
Instatiating the final model with the best combination of parameters
"""

final_model = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=9, max_depth=6, loss='log_loss', learning_rate=0.1)

To confirm the **good generalization** of the final model we evaluate its performance using **Repeated Stratified K-Fold Cross-Validation on the full dataset.**

This technique ensures that the accuracy score is reliable and that the model is stable (low variance) before generating the final predictions.

In [25]:
rskf = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 3, random_state = 42)

rskf_cv_results = cross_val_score(final_model, X_origin_final, y_origin, cv=rskf, scoring='accuracy')

print(f"Final Model Performance (on full data): {rskf_cv_results.mean():.4f} ± {rskf_cv_results.std():.4f}")

Final Model Performance (on full data): 0.7643 ± 0.0102


**For Kaggle prediction**

In [26]:
final_model.fit(X_origin_final, y_origin)
# 3. Predict on the Kaggle data
final_predictions = final_model.predict(X_predict_final)

# 4. Save
submission = pd.DataFrame({'id': id_predict, 'Quality_class': final_predictions})
submission['Quality_class'] = submission['Quality_class'].map({0: 'KO', 1: 'OK'})
submission.to_csv('submission.csv', index=False)