# **Notebook 2**

## **Table of Contents**

* [1. Introduction](#1-introduction)

* [2. Importing Section](#2-importing-section)
    * [2.1 Importing Libraries](#21-importing-libraries)
    * [2.2 Importing the Datasets](#22-importing-the-learn-and-predict-dataset)

* [3. Data Cleaning and Handling Inconsistencies](#3-data-cleaning-and-handling-inconsistencies)
    * [3.1. Handling Impossible Values](#31-handling-impossible-values)
    * [3.2. Categorical Consistency](#32-categorical-consistency)
    * [3.3. Feature Dropping](#33-feature-dropping)
    * [3.4. Data Type Correction](#34-data-type-correction)
    * [3.5. Target Variable Encoding](#35-target-variable-encoding)
* [4. Feature Engineering](#4-feature-engineering)
* [5. Data Partitioning](#5-data-partitioning)
    * [5.1. Separate Numerical and Categorical](#51-separate-numerical-and-categorical)
    * [5.2. Imputation of Missing Values](#52-imputation-of-missing-values)
    * [5.3. Encoding Categorical Features](#53-encoding-categorical-features)
    * [5.4. Feature Scaling](#54-feature-scaling)
* [6. Preparation for Next Steps](#6-prepare-for-feature-selection-and-modelling)
* [7. Conclusion](#7-conclusion)



# **Notebook 2: Data Preprocessing & Feature Engineering**

## **1. Introduction**

Following the Exploratory Data Analysis in Notebook 1, where we identified the distribution, relationships, and quality issues within the dataset, this notebook focuses on the **Data Preprocessing** phase. 

Raw data is rarely ready for Machine Learning algorithms; it requires **cleaning, transformation, and structuring** before a model can perform well.

### **Objectives & Our Workflow**

1.  **Data Cleaning & Handling Inconsistencies:**
    Before any other manipulation, we handle physical inconsistencies (like impossible percentages), standardize categorical text (unifying "LISBOA " vs. "lisboa"), and remove irrelevant or constant features.

2.  **Feature Engineering:**
    We create new features such as `baking_intensity` and `sugar_fat_ratio` to try to improve predictive power.

3.  **Strict Data Partitioning:**
    We split the data into **Train (70%)**, **Validation (15%)**, and **Test (15%)** sets using a stratified approach. This ensures that the class balance (OK vs. KO) is preserved across all partitions.

By the end of this notebook, both the labeled dataset (`learn.csv`) and the unlabelled prediction dataset (`predict.csv`) will be fully processed and consolidated, ready for the Feature Selection phase in the next notebook.

## **2. Importing Section**

### **2.1. Importing Libraries**

In [1]:
#Importing necessary libraries
import pandas as pd
import numpy as np
import pickle, os
import matplotlib.pyplot as plt
import seaborn as sns

# Data partitioning
from sklearn.model_selection import train_test_split, StratifiedKFold 

# Imputation
from sklearn.impute import SimpleImputer

# Encoding
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Scaling methods
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

#Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import StackingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

#Model Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

#Feature Selection
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV
import scipy.stats as stats

#Suppress warnings
import warnings
warnings.filterwarnings('ignore')

### **2.2. Importing the Learn and Predict Dataset**

In [2]:
dataset_learn = pd.read_csv('Nata_Files/learn.csv', index_col = 0) # index_col = 0 makes the first column of the dataset the index
dataset_learn

id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,notes_baker,origin,oven_temperature,pastry_type,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
1,54.0,24.0,26.0,100.4,52.0,11.0,309.0,3.2,,Lisboa,,Pastel Nata,207.0,42.74,22.8,5.7,KO
2,66.0,37.0,34.0,98.0,46.0,10.0,317.0,3.3,,Lisboa,306.0,,245.0,41.73,11.6,4.0,KO
3,41.0,30.0,19.0,99.3,53.0,10.0,130.0,3.4,,Porto,121.0,,186.0,75.10,20.3,7.5,OK
4,62.0,24.0,48.0,98.0,115.0,9.0,354.0,3.3,,Lisboa,357.0,Pastel de Nata,186.0,46.41,73.3,4.2,OK
5,55.0,21.0,34.0,100.1,48.0,9.0,211.0,3.0,,Lisboa,202.0,Pastel de nata,218.0,56.52,80.1,6.0,KO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5196,60.0,18.0,35.0,96.0,72.0,11.0,215.0,3.3,,Lisboa,222.0,,177.0,34.42,58.9,5.7,OK
5197,61.0,25.0,40.0,96.4,99.0,9.0,367.0,3.2,,Lisboa,366.0,Pastel De Nata,224.0,46.18,141.4,6.5,KO
5198,69.0,18.0,36.0,97.7,90.0,11.0,206.0,3.6,,Lisboa,203.0,Pastel de nata,158.0,28.46,10.0,6.0,OK
5199,70.0,25.0,40.0,101.2,139.0,9.0,414.0,3.1,,Lisboa,391.0,,196.0,56.92,188.9,5.7,KO


&nbsp;&nbsp;&nbsp;&nbsp; The `predict.csv` dataset (unseen data) must go through the same transformations that are applied to the training data. That way, we **guarantee consistency** between all datasets.
The Machine Learning model must receive every dataset in the **same format**: the same columns, on the same scales, and with the same categorical encodings used during the training phase.


In [4]:
'''
Loading the prediction dataset
'''
predict_data = pd.read_csv('Nata_Files/predict.csv', index_col = 0)
display(predict_data.shape)
predict_data

(1300, 16)

id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,notes_baker,origin,oven_temperature,pastry_type,preheating_time,salt_ratio,sugar_content,vanilla_extract
5201,79.0,22,40,98.6,79.0,9.0,259.0,3.2,,Lisboa,268.0,,208.0,49.63,182.6,4.0
5202,49.0,26,32,101.9,105.0,9.0,287.0,3.2,,Lisboa,287.0,Pastel de nata,189.0,182.54,76.2,4.8
5203,80.0,28,24,96.6,20.0,10.0,64.0,3.4,,Porto,74.0,Pastel Nata,201.0,100.41,23.5,6.1
5204,74.0,21,37,97.2,81.0,9.0,314.0,3.0,,Lisboa,317.0,,220.0,46.66,143.2,4.9
5205,41.0,19,41,97.3,104.0,10.0,246.0,3.2,,Lisboa,243.0,Pastel Nata,191.0,39.45,143.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6496,73.0,24,23,98.9,58.0,10.0,208.0,3.1,,Lisboa,200.0,Pastel De Nata,189.0,37.80,142.6,4.0
6497,66.0,21,34,100.6,62.0,10.0,277.0,3.1,,Lisboa,264.0,Pastel de nata,235.0,33.78,102.2,4.8
6498,41.0,49,23,98.9,20.0,11.0,57.0,3.4,,Porto,69.0,Pastel de nata,243.0,86.73,21.3,7.4
6499,42.0,26,54,98.0,44.0,12.0,135.0,3.4,,Porto,115.0,Pastel de Nata,333.0,135.35,18.0,7.6


## **3. Data Cleaning and Handling Inconsistencies**

### **3.1. Handling Impossible Values**

While analysing the dataset, we found some **physically impossible values**, which may be data entry errors or sensor malfunctions.

Therefore, we defined the following domain constraints:
1.  **Percentages (Humidity & Fat):** Must be between 0% and 100%.
2.  **pH Scale:** Must be between 0 and 14.
3.  **Physical Dimensions (Time/Temp):** Must be non-negative.

**Treatment Strategy:**

Instead of dropping these rows, as we would lose valuable information, we **convert these 'errors' to `NaN`** (Missing Values). These will be handled later on.

**Note:** In cases where values exceeded the constrained limits, capping them at those thresholds would introduce a larger error compared to median imputation. Additionally, capping would create an artificial 'spike' in the data at those thresholds, distorting the feature's distribution.
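
The treatment strategy can be sketched on a toy frame. The helper name `mask_out_of_range` and the sample values are hypothetical and only illustrate the idea; the notebook itself performs the replacement with `.loc` on the real data.

```python
import numpy as np
import pandas as pd

def mask_out_of_range(series, lower, upper):
    """Return a copy where values outside [lower, upper] become NaN."""
    # Series.where keeps values where the condition holds and puts NaN elsewhere.
    return series.where(series.between(lower, upper))

# Hypothetical sample values, not taken from the real dataset.
toy = pd.DataFrame({
    "cream_fat_content": [98.0, 100.4, 101.2, 96.4],  # percentages: 0 to 100
    "lemon_zest_ph": [3.2, 3.3, 15.0, 3.1],           # pH scale: 0 to 14
})

toy["cream_fat_content"] = mask_out_of_range(toy["cream_fat_content"], 0, 100)
toy["lemon_zest_ph"] = mask_out_of_range(toy["lemon_zest_ph"], 0, 14)

# Number of values that were turned into NaN, per column.
print(toy.isna().sum())
```

Because the rows are kept, every other measurement in them remains available to the model; only the offending cell is left for the imputation step.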

In [None]:
'''
In the descriptive statistics of Notebook 1, we verified that there are no negative numbers in the dataset.
Therefore, we do not check the non-negativity constraint here.
'''
# 1st Constraint: Percentages (0 to 100)
impossible_humidityL = dataset_learn[(dataset_learn['ambient_humidity'] > 100) | (dataset_learn['ambient_humidity'] < 0)]
impossible_fatL = dataset_learn[(dataset_learn['cream_fat_content'] > 100) | (dataset_learn['cream_fat_content'] < 0)]

impossible_humidityP = predict_data[(predict_data['ambient_humidity'] > 100) | (predict_data['ambient_humidity'] < 0)]
impossible_fatP = predict_data[(predict_data['cream_fat_content'] > 100) | (predict_data['cream_fat_content'] < 0)]

# 2nd Constraint: pH Scale (0 to 14)
impossible_phL = dataset_learn[(dataset_learn['lemon_zest_ph'] > 14) | (dataset_learn['lemon_zest_ph'] < 0)]
impossible_phP = predict_data[(predict_data['lemon_zest_ph'] > 14) | (predict_data['lemon_zest_ph'] < 0)]

print(f"Impossible Humidity rows in dataset_learn: {len(impossible_humidityL)}")
print(f"Impossible Humidity rows in predict_data: {len(impossible_humidityP)}")

print(f"Impossible Fat rows in dataset_learn: {len(impossible_fatL)}")
print(f"Impossible Fat rows in predict_data: {len(impossible_fatP)}")

print(f"Impossible pH rows in dataset_learn: {len(impossible_phL)}")
print(f"Impossible pH rows in predict_data: {len(impossible_phP)}")

Impossible Humidity rows in dataset_learn: 0
Impossible Humidity rows in predict_data: 0
Impossible Fat rows in dataset_learn: 1099
Impossible Fat rows in predict_data: 281
Impossible pH rows in dataset_learn: 0
Impossible pH rows in predict_data: 0


From the results obtained we can observe that:
- **1099 rows in dataset_learn** and **281 rows in predict_data** have a `cream_fat_content` value **above 100%**, which is impossible.

As stated above, all these values are converted to `NaN` and imputed later, in the imputation phase.

In [None]:
'''
We will just replace the values in cream fat content as it is the only feature with impossible values found (in both datasets).
'''
dataset_learn.loc[impossible_fatL.index, 'cream_fat_content'] = np.nan
predict_data.loc[impossible_fatP.index, 'cream_fat_content'] = np.nan

### **3.2. Categorical Consistency**
We enforce categorical consistency because it ensures that all text values within a **categorical feature column** are **uniform, consistent, and correctly represented.**

**As we saw in the first notebook**, the column `origin`, which records where the bakery is located (Lisbon or Porto), has many inconsistencies in the spelling of those cities. 'Lisboa' and 'Porto' are written in many different ways, so we start Notebook 2 by removing those differences and replacing every value with either 'Lisboa' or 'Porto', written exactly like that.

We have the same problem with the column `pastry_type`, where the same pastry type is written in several different ways.

The first step is to inspect which distinct values a column actually contains. To see the different values (and how many times each one appears) we use `value_counts()`.

In [None]:
print(dataset_learn['origin'].value_counts())

origin
Lisboa     3486
Porto      1167
LISBOA      119
Lisboa       88
lisboa       83
PORTO        33
Porto        25
 Lisboa      20
porto        15
 Porto        3
Name: count, dtype: int64


In [None]:
# strip() removes stray whitespace; title() yields 'Lisboa' / 'Porto' whatever the original casing
dataset_learn['origin'] = dataset_learn['origin'].str.strip().str.title()

predict_data['origin'] = predict_data['origin'].str.strip().str.title()

print(dataset_learn['origin'].unique())


['Lisboa' 'Porto' nan]


### **3.3. Feature Dropping**
We drop columns that are internal identifiers or free-text notes, as they are not useful for the model.  

The ID has no physical or chemical relationship to the quality of the Pastel de Nata, so it is kept only as the DataFrame index (`index_col = 0`) and never used as a feature.  
The column `notes_baker` has 5200 missing values, i.e. it is empty in every row, so it carries no information and we remove it.  
Additionally, the column `pastry_type` is constant. It adds no predictive value to our project, so, after checking that there are no values other than 'Pastel de Nata' written in different ways, we also remove it.
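
A check of this kind could look like the sketch below. The sample spellings are hypothetical stand-ins for the variants seen in the raw data, and treating "Pastel Nata" and "Pastel de Nata" as the same pastry is an assumption made here for illustration.

```python
import pandas as pd

# Hypothetical sample of spelling variants, including one missing entry.
pastry = pd.Series(["Pastel Nata", "Pastel de Nata", "Pastel de nata",
                    "Pastel De Nata", None])

# Normalise whitespace and case so that spelling variants collapse together.
# Assumption: the optional " de " does not denote a different product.
normalised = (pastry.str.strip()
                    .str.lower()
                    .str.replace(" de ", " ", regex=False))

# dropna=False keeps missing entries visible in the count.
print(normalised.value_counts(dropna=False))

# A single distinct non-missing value means the column is constant.
n_distinct = normalised.nunique(dropna=True)
print("Distinct non-missing values after normalisation:", n_distinct)
```

If more than one distinct value remained after normalisation, the column would not be constant and dropping it would need a different justification.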


In [6]:
# TODO: weren't these features supposed to be dropped during feature selection instead?


In [4]:
dataset_learn = dataset_learn.drop(columns=['notes_baker', 'pastry_type'])
predict_data = predict_data.drop(columns=['notes_baker', 'pastry_type'])
dataset_learn.info()


<class 'pandas.core.frame.DataFrame'>
Index: 5200 entries, 1 to 5200
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ambient_humidity   5182 non-null   float64
 1   baking_duration    5199 non-null   float64
 2   cooling_period     5199 non-null   float64
 3   cream_fat_content  5176 non-null   float64
 4   egg_temperature    5176 non-null   float64
 5   egg_yolk_count     5176 non-null   float64
 6   final_temperature  5175 non-null   float64
 7   lemon_zest_ph      5174 non-null   float64
 8   origin             5039 non-null   object 
 9   oven_temperature   5179 non-null   float64
 10  preheating_time    5181 non-null   float64
 11  salt_ratio         5187 non-null   float64
 12  sugar_content      5178 non-null   float64
 13  vanilla_extract    5182 non-null   float64
 14  quality_class      5199 non-null   object 
dtypes: float64(13), object(2)
memory usage: 650.0+ KB


We also drop rows where the target variable is missing, for two reasons. The technical reason is that `stratify=y` cannot process missing labels. The logical reason is that a Pastel de Nata is either OK or KO; a row without a label cannot be used for supervised training.

In [5]:
print(dataset_learn.isna().sum())

ambient_humidity      18
baking_duration        1
cooling_period         1
cream_fat_content     24
egg_temperature       24
egg_yolk_count        24
final_temperature     25
lemon_zest_ph         26
origin               161
oven_temperature      21
preheating_time       19
salt_ratio            13
sugar_content         22
vanilla_extract       18
quality_class          1
dtype: int64


In [6]:
dataset_learn = dataset_learn.dropna(subset=['quality_class'])

### **3.4. Data Type Correction**

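`egg_yolk_count` is a count and should therefore be an integer. It is loaded as a float because the column contains NaNs, which a standard integer column cannot hold. We convert it to the nullable integer type `Int64`, which supports missing values.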
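
The effect of missing values on the column type can be reproduced on a small series. The values below are made up; only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical yolk counts with one missing entry.
yolks = pd.Series([11.0, np.nan, 9.0], name="egg_yolk_count")
print(yolks.dtype)             # float64, because NaN forces a float column

# The nullable integer dtype "Int64" (capital I) stores the gap as <NA>.
yolks_int = yolks.astype("Int64")
print(yolks_int.dtype)         # Int64
print(yolks_int.isna().sum())  # the missing entry is preserved
```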

In [9]:
# Plain column assignment replaces the column, so the new dtype is kept
dataset_learn['egg_yolk_count'] = dataset_learn['egg_yolk_count'].astype('Int64')
dataset_learn.head(1) #just to check if the changes were applied

id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
1,54.0,24.0,26.0,100.4,52.0,11,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,KO


### **3.5. Target Variable Encoding**

The target variable `quality_class` is categorical ('OK' or 'KO'). Since this is a binary classification problem, we transform it into a binary variable that takes the values 0 and 1.  

We decided to attribute:
- **1** for "OK", the Pastel de Nata is in a good state.
- **0** for "KO", you should not eat the Pastel de Nata.

"OK" is therefore the positive class (label 1).

In [10]:
dataset_learn['quality_class_binary'] = dataset_learn['quality_class'].map({'OK': 1, 'KO': 0})
dataset_learn = dataset_learn.drop(columns=['quality_class'])
dataset_learn.head(1) #just to check if the changes were applied

id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class_binary
1,54.0,24.0,26.0,100.4,52.0,11,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,0


## **4. Feature Engineering**

&nbsp;&nbsp;&nbsp;&nbsp;To better capture the interaction between ingredients and the baking process, we decided to **create two new features**.

* **`sugar_fat_ratio`**:
    This captures the relative proportion of two key ingredients: `sugar_content` and `cream_fat_content`. A small constant ($10^{-6}$) is added to the denominator to prevent division by zero when the fat content is zero. If the fat content is missing, the ratio is also missing (`NaN`) and is filled in during the imputation step.

* **`baking_intensity`**:
    By multiplying `baking_duration` by `oven_temperature`, we quantify the overall heat exposure the product underwent during the baking process. Combined with the original duration and temperature columns, this helps the model tell a 'short but high' heat from a 'long but low' heat.
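
The behaviour of both features at the edge cases (zero and missing fat content) can be checked on a toy frame. The numbers are made up for illustration; the column names and the constant are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

EPS = 1e-6  # small constant that guards against division by zero

# Hypothetical rows: a regular one, one with zero fat, one with missing fat.
toy = pd.DataFrame({
    "sugar_content": [20.0, 80.0, 50.0],
    "cream_fat_content": [100.0, 0.0, np.nan],
    "baking_duration": [24.0, 37.0, 30.0],
    "oven_temperature": [306.0, 121.0, 200.0],
})

toy["sugar_fat_ratio"] = toy["sugar_content"] / (toy["cream_fat_content"] + EPS)
toy["baking_intensity"] = toy["baking_duration"] * toy["oven_temperature"]

print(toy[["sugar_fat_ratio", "baking_intensity"]])
```

The zero-fat row yields a very large but finite ratio, while the missing-fat row stays `NaN`, so the constant only protects against zeros and not against gaps.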



In [26]:
'''These transformations were applied consistently to both the training and prediction datasets, for the reasons already stated.'''

#Original Dataset
dataset_learn['sugar_fat_ratio'] = dataset_learn['sugar_content'] / (dataset_learn['cream_fat_content'] + 1e-6)  # the small constant to avoid division by zero
dataset_learn['baking_intensity'] = dataset_learn['baking_duration'] * dataset_learn['oven_temperature']

#Predict Dataset
predict_data['sugar_fat_ratio'] = predict_data['sugar_content'] / (predict_data['cream_fat_content'] + 1e-6)
predict_data['baking_intensity'] = predict_data['baking_duration'] * predict_data['oven_temperature']

## **5. Data Partitioning**

In this section, we separate the **features (X)** from the **target variable (y)** and split the dataset into **3 different subsets**. We opted for a Train-Validation-Test split rather than a simple Train-Test split to avoid **Data Leakage** during model optimization: models are tuned against the validation set, so the test set stays untouched until the final evaluation.

Since `train_test_split` in `scikit-learn` only produces two partitions per call, we performed two consecutive splits to achieve a **70% / 15% / 15%** distribution:
1. **First Split:** We divided the data into **Training (70%)** and a temporary "Rest" set (30%).
2. **Second Split:** We split the "Rest" set equally (50/50) to create the **Validation (15%)** and **Test (15%)** sets.

**Parameters Used:**
* `stratify=y`: Ensures the proportion of 'OK' and 'KO' Pastéis de Nata is identical across all three sets.
* `random_state=42`: Guarantees reproducibility of the split.
* `shuffle=True`: Shuffles the rows randomly before splitting.

In [14]:
X = dataset_learn.drop('quality_class_binary',axis = 1) #features
y = dataset_learn['quality_class_binary'] #target

In [15]:
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size = 0.7,shuffle = True, random_state=42, stratify=y) #70% train, 30% rest

X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, shuffle = True, random_state=42, stratify=y_rest) #15% val, 15% test

In [16]:
# '{:.0%}' formats a fraction such as 0.7 as '70%'
print('train:{:.0%} | validation:{:.0%} | test:{:.0%}'.format(len(y_train)/len(y),
                                                              len(y_val)/len(y),
                                                              len(y_test)/len(y)
                                                             ))

train:70% | validation:15% | test:15%
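
Besides the partition sizes, the class balance promised by `stratify` can be verified as well. The sketch below uses a synthetic target (1,000 labels, 60% of them equal to 1), which is an assumption for illustration and not the real class distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 600 positives and 400 negatives (made up).
y = np.array([1] * 600 + [0] * 400)
X = np.arange(len(y)).reshape(-1, 1)

# Same two-stage scheme as in the notebook: 70% train, then 50/50 of the rest.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, shuffle=True, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=True, random_state=42, stratify=y_rest)

for name, part in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, share of class 1 = {part.mean():.3f}")
```

Running the same loop on the real `y_train`, `y_val`, and `y_test` would confirm that the OK/KO proportion is preserved in each partition.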


### **5.1. Separate Numerical and Categorical**

As different data types require different, specific transformations, we separated the features into **Numeric** and **Categorical**. We applied this separation consistently across all partitions (Train, Validation, Test) and also to the unlabelled Prediction set (`predict_data`).

In [17]:
numerical_features = X.select_dtypes(include = np.number).columns.tolist()
categorical_features = X.select_dtypes(exclude = np.number).columns.tolist()

In [18]:
print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)

Numerical Features: ['ambient_humidity', 'baking_duration', 'cooling_period', 'cream_fat_content', 'egg_temperature', 'egg_yolk_count', 'final_temperature', 'lemon_zest_ph', 'oven_temperature', 'preheating_time', 'salt_ratio', 'sugar_content', 'vanilla_extract', 'sugar_fat_ratio', 'baking_intensity']
Categorical Features: ['origin']


In [19]:
#NUMERICAL FEATURES
X_train_num = X_train.select_dtypes(include = np.number)
X_val_num = X_val.select_dtypes(include = np.number)
X_test_num = X_test.select_dtypes(include = np.number)

X_predict_num = predict_data.select_dtypes(include = np.number)

In [20]:
#CATEGORICAL FEATURES
X_train_cat = X_train.select_dtypes(exclude = np.number)
X_val_cat = X_val.select_dtypes(exclude = np.number)
X_test_cat = X_test.select_dtypes(exclude = np.number)

X_predict_cat = predict_data.select_dtypes(exclude = np.number)

### **5.2. Imputation of Missing Values**

To address missing values (including the "impossible" values converted to `NaN` during the cleaning phase), we implemented a **Median Imputation** strategy for the numerical features and a **Mode Imputation** strategy for the categorical ones. We chose the **median over the mean** because it is **robust to outliers** and therefore gives a more representative central value.
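
The difference between the two statistics, and the idea of learning the fill value from the training data only, can be illustrated with plain Python. All values below are hypothetical.

```python
from statistics import mean, median

# Hypothetical training values with one extreme outlier and one missing entry.
train = [96.0, 97.5, 98.0, 98.5, 400.0, None]
validation = [97.0, None, 99.0]

observed = [v for v in train if v is not None]
print("mean:  ", mean(observed))    # 158.0, pulled upwards by the outlier
print("median:", median(observed))  # 98.0, stays with the bulk of the data

# "Fit" on the training values only, then reuse the statistic everywhere.
fill_value = median(observed)
train_imputed = [fill_value if v is None else v for v in train]
validation_imputed = [fill_value if v is None else v for v in validation]
print(validation_imputed)           # [97.0, 98.0, 99.0]
```

`SimpleImputer` follows the same pattern: `fit` stores the statistic of each training column, and `transform` applies it to any other partition.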

In [21]:
'''
The imputer was fitted only on the training set to prevent Data Leakage
This strictly ensures that no information from the evaluation sets leaks into the training process.
'''

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train_num)

imputer2 = SimpleImputer(strategy='most_frequent')
imputer2.fit(X_train_cat)

SimpleImputer(strategy='most_frequent')


In [22]:
def apply_imputer(fitted_imputer, frame):
    # Transform with an already fitted imputer, keeping the original index and column names
    return pd.DataFrame(fitted_imputer.transform(frame), columns=frame.columns, index=frame.index)

#NUMERICAL FEATURES
#  Transform all splits using the median learned from the training data
X_train_num = apply_imputer(imputer, X_train_num)
X_val_num = apply_imputer(imputer, X_val_num)
X_test_num = apply_imputer(imputer, X_test_num)

X_predict_num = apply_imputer(imputer, X_predict_num)

#CATEGORICAL FEATURES
#  Transform all splits using the most frequent value learned from the training data
X_train_cat = apply_imputer(imputer2, X_train_cat)
X_val_cat = apply_imputer(imputer2, X_val_cat)
X_test_cat = apply_imputer(imputer2, X_test_cat)

X_predict_cat = apply_imputer(imputer2, X_predict_cat)

### **5.3. Encoding Categorical Features**

We used the `OneHotEncoder` to transform **categorical data into a machine-readable format**. It creates one binary column per category (e.g. `origin_Lisboa`, `origin_Porto`).

**Parameters Used:**
- `handle_unknown='ignore'`: If the Validation, Test, or Prediction set contains a category **never seen** during training, the encoder outputs **zeros** in all columns of that feature instead of raising an error.
- `sparse_output=False`: Makes the encoder return a dense NumPy array, which can easily be converted into a DataFrame.

In [23]:
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) 
#if an unseen category appears in the test or prediction data, the encoder will ignore it.

encoder.fit(X_train_cat) #Once again, only fitting on training data to avoid Data Leakage

#Training Data
X_train_cat_encoded = encoder.transform(X_train_cat)
X_train_cat_encoded = pd.DataFrame(X_train_cat_encoded, columns = encoder.get_feature_names_out(categorical_features), index = X_train_cat.index)

#Validation Data
X_val_cat_encoded = encoder.transform(X_val_cat)
X_val_cat_encoded = pd.DataFrame(X_val_cat_encoded, columns = encoder.get_feature_names_out(categorical_features), index = X_val_cat.index)

#Testing Data
X_test_cat_encoded = encoder.transform(X_test_cat)  
X_test_cat_encoded = pd.DataFrame(X_test_cat_encoded, columns = encoder.get_feature_names_out(categorical_features), index = X_test_cat.index)

#Prediction Data
X_predict_cat_encoded = encoder.transform(X_predict_cat)
X_predict_cat_encoded = pd.DataFrame(X_predict_cat_encoded, columns = encoder.get_feature_names_out(categorical_features), index = X_predict_cat.index)

### **5.4. Feature Scaling**

Since our dataset has outliers, we chose **RobustScaler**, which scales each feature using its median and interquartile range, so that the bulk of the data ends up on a comparable scale without being distorted by the few extreme values.

In [27]:
"""
The RobustScaler scales the data using the Median and the Interquartile Range (IQR). 
It subtracts the median and divides by the IQR (75th percentile - 25th percentile). 
This centers the data and scales it based on the bulk of the data points, effectively ignoring the influence of extreme outliers.
"""

scaler = RobustScaler()


#Fitting the scaler ONLY on the training data
scaler.fit(X_train_num)

#Transforming training data
X_train_scaled = scaler.transform(X_train_num)
X_train_scaled = pd.DataFrame(X_train_scaled, columns = numerical_features, index = X_train_num.index)

#Transforming validation data
X_val_scaled = scaler.transform(X_val_num)
X_val_scaled = pd.DataFrame(X_val_scaled, columns = numerical_features, index = X_val_num.index)

#Transforming testing data
X_test_scaled = scaler.transform(X_test_num)
X_test_scaled = pd.DataFrame(X_test_scaled, columns = numerical_features, index = X_test_num.index)

#Transforming prediction data
X_predict_scaled = scaler.transform(X_predict_num[numerical_features])
X_predict_scaled = pd.DataFrame(X_predict_scaled, columns = numerical_features, index = X_predict_num.index)
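
The median and IQR arithmetic described in the cell above can be reproduced by hand on a small list. The values are hypothetical, and the quartiles are computed with the standard library using linear interpolation between data points, the same convention as NumPy's default percentile calculation.

```python
from statistics import median, quantiles

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one outlier

# method="inclusive" interpolates linearly between the observed data points.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
center = median(values)  # 3.0
iqr = q3 - q1            # 4.0 - 2.0 = 2.0

# Subtract the median and divide by the IQR.
scaled = [(v - center) / iqr for v in values]
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The four regular values land in a narrow band around zero, while the outlier stays clearly visible at 48.5 instead of compressing everything else.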

## **6. Prepare for Feature Selection and Modelling and Saving Data**

To prepare for **Feature Selection and Modeling**, we merged the processed numerical and categorical features into a unified dataset. 

In the Feature Selection part, we need a view of how **ALL** the variables relate to each other and how important each one is for the classification. Methods like Lasso and RFE do not pick the 'best numerical' or 'best categorical' feature in isolation; they evaluate all features together.

In [25]:
"""
Here we concatenate the scaled numerical features with the encoded categorical features to get the final sets
"""

#TRAIN
X_train_final = pd.concat([X_train_scaled, X_train_cat_encoded], axis=1)

#VALIDATION
X_val_final = pd.concat([X_val_scaled, X_val_cat_encoded], axis=1)

#TEST
X_test_final = pd.concat([X_test_scaled, X_test_cat_encoded], axis=1)

#PREDICTION
X_predict_final = pd.concat([X_predict_scaled, X_predict_cat_encoded], axis=1)

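Finally, we store all processed partitions in a single dictionary and serialize it with `pickle`. Unlike a CSV export, a pickle file stores the Python objects themselves, so the DataFrames keep their index, column names, and data types when they are loaded in the next notebook. The IDs of the prediction set are stored as well, since they are needed for the final submission.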
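
A minimal round trip shows how such a file is written and read back. The dictionary below is a small stand-in with made-up values, and the temporary path is only used for this illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the processed partitions (the notebook stores DataFrames instead).
processed_data = {"X_train_final": [[0.1, 1.0], [-0.3, 0.0]], "y_train": [1, 0]}

path = os.path.join(tempfile.mkdtemp(), "processed_data.pkl")

with open(path, "wb") as f:  # pickle needs binary write mode
    pickle.dump(processed_data, f)

with open(path, "rb") as f:  # and binary read mode when loading
    restored = pickle.load(f)

print(restored == processed_data)  # True
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.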

In [5]:
# 6.2 Saving the Processed Datasets
import pickle

# Dictionary containing all the arrays/dataframes we need for the next steps
processed_data = {
    'X_train_final': X_train_final,
    'y_train': y_train,
    'X_val_final': X_val_final,
    'y_val': y_val,
    'X_test_final': X_test_final,
    'y_test': y_test,
    'X_predict_final': X_predict_final,
    'id_predict': predict_data.index # Keeping IDs for the final submission
}

# Save to a pickle file (preserves formatting better than CSV)
with open('Nata_Files/processed_data.pkl', 'wb') as f:
    pickle.dump(processed_data, f)

print("Data successfully saved to 'Nata_Files/processed_data.pkl'")


## **7. Conclusion**

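In this notebook, we turned the raw `learn.csv` and `predict.csv` files into model-ready datasets. We replaced impossible values with `NaN`, standardized the `origin` labels, dropped `notes_baker` and `pastry_type`, encoded the target, and created the `sugar_fat_ratio` and `baking_intensity` features. The labeled data was split into stratified Train (70%), Validation (15%), and Test (15%) sets, and the imputers, the encoder, and the scaler were fitted on the training set only before being applied to every partition and to the prediction set. The resulting datasets are saved in `Nata_Files/processed_data.pkl`, ready for the Feature Selection phase in the next notebook.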