# **Notebook 2**
## **Preprocessing**
### Introduction
The primary goal of this stage is to refine the raw data, transforming it from its initial state into a structurally sound format suitable for feature engineering and modeling.  
The focus here is strictly on structural, row-by-row cleaning. This includes:
- Handling Irrelevant Features
- Target Encoding
- Data Type Standardization
- Categorical Consistency

To ensure no data leakage occurs into the validation or test sets, this notebook avoids any transformation that relies on calculating statistics from the entire dataset.


In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [2]:
dataset_learn = pd.read_csv('Nata_Files/learn.csv', index_col = 0)
dataset_learn

id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,notes_baker,origin,oven_temperature,pastry_type,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
1,54.0,24.0,26.0,100.4,52.0,11.0,309.0,3.2,,Lisboa,,Pastel Nata,207.0,42.74,22.8,5.7,KO
2,66.0,37.0,34.0,98.0,46.0,10.0,317.0,3.3,,Lisboa,306.0,,245.0,41.73,11.6,4.0,KO
3,41.0,30.0,19.0,99.3,53.0,10.0,130.0,3.4,,Porto,121.0,,186.0,75.10,20.3,7.5,OK
4,62.0,24.0,48.0,98.0,115.0,9.0,354.0,3.3,,Lisboa,357.0,Pastel de Nata,186.0,46.41,73.3,4.2,OK
5,55.0,21.0,34.0,100.1,48.0,9.0,211.0,3.0,,Lisboa,202.0,Pastel de nata,218.0,56.52,80.1,6.0,KO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5196,60.0,18.0,35.0,96.0,72.0,11.0,215.0,3.3,,Lisboa,222.0,,177.0,34.42,58.9,5.7,OK
5197,61.0,25.0,40.0,96.4,99.0,9.0,367.0,3.2,,Lisboa,366.0,Pastel De Nata,224.0,46.18,141.4,6.5,KO
5198,69.0,18.0,36.0,97.7,90.0,11.0,206.0,3.6,,Lisboa,203.0,Pastel de nata,158.0,28.46,10.0,6.0,OK
5199,70.0,25.0,40.0,101.2,139.0,9.0,414.0,3.1,,Lisboa,391.0,,196.0,56.92,188.9,5.7,KO


#### **2.1 Feature Dropping**
Drop columns that carry no information for the model, such as empty text notes and constant features.  

The `id` has no physical or chemical relationship to the quality of the Pastel de Nata, so it is kept only as the DataFrame index (`index_col=0`) and is never used as a feature.  
The column `notes_baker` is entirely empty (5200 missing values), so it gives us no useful information and we remove it.  
Additionally, the column `pastry_type` is effectively constant and adds no predictive value to our project. After checking that every non-missing value is just 'Pastel de Nata' written in a different way, we remove it as well.


In [3]:
print(dataset_learn['pastry_type'].value_counts(dropna=False))
# Confirm that every non-null value in pastry_type is a spelling variant of the same pastry

pastry_type
NaN               1789
Pastel Nata        879
Pastel de Nata     859
Pastel de nata     840
Pastel De Nata     833
Name: count, dtype: int64
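
The check above is done by eye. As a minimal sketch of how it could be automated, the snippet below normalizes a toy sample that mirrors the four spellings printed above and counts the distinct non-null values. The toy series and the normalization rule (lowercase, then drop the optional "de") are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy sample mirroring the spellings printed above (not the real data).
pastry = pd.Series(
    ["Pastel Nata", "Pastel de Nata", "Pastel de nata", "Pastel De Nata", None]
)

# Hypothetical normalization: lowercase, remove the optional "de",
# then collapse any repeated whitespace left behind.
normalized = (
    pastry.str.lower()
    .str.replace(r"\bde\b", " ", regex=True)
    .str.split()
    .str.join(" ")
)

# A single distinct non-null value means the column is constant.
n_distinct = normalized.nunique(dropna=True)
print(normalized.dropna().unique(), n_distinct)
```

If `n_distinct` is 1, the column carries no signal beyond its missing-value pattern, which supports dropping it.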


In [4]:
dataset_learn = dataset_learn.drop(columns=['notes_baker','pastry_type'])
dataset_learn.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5200 entries, 1 to 5200
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ambient_humidity   5182 non-null   float64
 1   baking_duration    5199 non-null   float64
 2   cooling_period     5199 non-null   float64
 3   cream_fat_content  5176 non-null   float64
 4   egg_temperature    5176 non-null   float64
 5   egg_yolk_count     5176 non-null   float64
 6   final_temperature  5175 non-null   float64
 7   lemon_zest_ph      5174 non-null   float64
 8   origin             5039 non-null   object 
 9   oven_temperature   5179 non-null   float64
 10  preheating_time    5181 non-null   float64
 11  salt_ratio         5187 non-null   float64
 12  sugar_content      5178 non-null   float64
 13  vanilla_extract    5182 non-null   float64
 14  quality_class      5199 non-null   object 
dtypes: float64(13), object(2)
memory usage: 650.0+ KB


Next, we drop rows where the target variable is null, for two reasons. The practical one is that `stratify=y` in `train_test_split` cannot handle missing labels. The logical one is that a Pastel de Nata is either OK or not OK; it cannot be NaN. We first count the missing values per column:

In [5]:
dataset_learn.isna().sum()

ambient_humidity      18
baking_duration        1
cooling_period         1
cream_fat_content     24
egg_temperature       24
egg_yolk_count        24
final_temperature     25
lemon_zest_ph         26
origin               161
oven_temperature      21
preheating_time       19
salt_ratio            13
sugar_content         22
vanilla_extract       18
quality_class          1
dtype: int64

In [6]:
dataset_learn = dataset_learn.dropna(subset=['quality_class'])
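
As a small illustration of what `dropna(subset=...)` does here, the sketch below uses a hypothetical four-row frame (the values are made up, only the column names come from the dataset). It shows that only rows with a missing target are removed, while missing feature values are left for later imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing feature value, one missing target.
toy = pd.DataFrame(
    {
        "sugar_content": [22.8, np.nan, 20.3, 73.3],
        "quality_class": ["KO", "OK", np.nan, "OK"],
    }
)

# Only the row with a missing target is removed.
toy_clean = toy.dropna(subset=["quality_class"])

print(len(toy), "->", len(toy_clean))
print(toy_clean.isna().sum())
```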

#### **2.2 Categorical Consistency and Standardization**
This step ensures that all text values within a categorical feature column are uniform, consistent, and correctly represented.

As we saw in the first notebook, the column `origin`, which records where the bakery is located (Lisboa or Porto), has many inconsistencies in how those city names are written. We therefore remove those differences, replacing every value with exactly 'Lisboa' or 'Porto'.

# **CHECK WHETHER THIS SHOULD BE ENCODED AS 1 AND 0**

In [7]:
dataset_learn['origin'] = dataset_learn['origin'].str.strip().str.lower().str.title()
dataset_learn['origin'].unique()


array(['Lisboa', 'Porto', nan], dtype=object)
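
To make the effect of the string normalization concrete, here is a minimal sketch on a toy series. The misspelled variants are hypothetical examples of whitespace and capitalization noise, not the actual values found in the data.

```python
import pandas as pd

# Hypothetical variants with stray whitespace and mixed capitalization.
origin = pd.Series([" lisboa", "LISBOA ", "Porto", "pORTO", None])

# strip() removes surrounding whitespace; title() capitalizes the first
# letter of each word and lowercases the rest.
cleaned = origin.str.strip().str.title()

print(cleaned.unique())
```

Because `str.title()` already lowercases the remaining letters, the sketch leaves out the extra `.str.lower()` call. Missing values pass through unchanged and stay NaN.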

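#### **2.3 Data Type Correction**

`egg_yolk_count` is a count, so it should be an integer, but pandas loads it as `float64` because the column contains NaNs. We cast it to the nullable `Int64` dtype, which stores integers while keeping the missing values.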

In [8]:
# Assign the column directly: setting it through .loc[:, ...] tries to keep the existing float64 dtype and raises a FutureWarning
dataset_learn['egg_yolk_count'] = dataset_learn['egg_yolk_count'].astype('Int64')
dataset_learn.head(1)  # check that the change was applied


id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class
1,54.0,24.0,26.0,100.4,52.0,11,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,KO
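
The difference between NumPy's `int64` and the pandas nullable `Int64` dtype can be seen in a short sketch. The toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value, as pandas would load it.
yolks = pd.Series([11.0, 10.0, np.nan, 9.0])

# The nullable extension dtype keeps the gap as <NA>.
yolks_int = yolks.astype("Int64")
print(yolks_int.dtype, int(yolks_int.isna().sum()))

# The plain NumPy integer dtype cannot represent a missing value.
try:
    yolks.astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True
print(plain_cast_failed)
```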


#### **2.4 Target Variable Encoding**

The target variable `quality_class` is categorical ('OK' or 'KO'). Since this is a binary classification problem, we transform it into a binary variable with the values 0 and 1.  

We decided to attribute:
- **1** for "OK", the Pastel de Nata is in a good state.
- **0** for "KO", you should not eat the Pastel de Nata.

The "OK" class is positive and is the one that will be predicted.

In [9]:
# .map avoids the FutureWarning that .replace raises when it downcasts object to int;
# any label other than 'OK' or 'KO' would become NaN
dataset_learn['quality_class_binary'] = dataset_learn['quality_class'].map({'OK': 1, 'KO': 0})
dataset_learn = dataset_learn.drop(columns=['quality_class'])
dataset_learn.head(1)  # check that the change was applied


id,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class_binary
1,54.0,24.0,26.0,100.4,52.0,11,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,0
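
A quick sanity check on the encoding could look like the sketch below. It uses a toy label series (values made up) and `Series.map`; counting NaNs after the mapping is one possible way to detect labels that fall outside the expected 'OK'/'KO' pair.

```python
import pandas as pd

# Toy labels, for illustration only.
labels = pd.Series(["KO", "KO", "OK", "OK", "KO"])

# Same convention as above: OK is the positive class.
mapping = {"OK": 1, "KO": 0}
encoded = labels.map(mapping)

# Any label missing from the mapping would show up as NaN here.
unmapped = int(encoded.isna().sum())

print(encoded.tolist(), unmapped)
```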


#### **2.5 Export Cleaned Data**

We now export the structurally cleaned dataset so that we can work with it in the next notebook (feature engineering).  
It still contains NaNs and unscaled numerical features.

In [10]:
dataset_learn.to_csv('datasetlearn_cleaned.csv')
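
One thing to keep in mind when reloading the file: CSV does not store dtypes, so a nullable integer column that contains missing values comes back as `float64` unless the dtype is requested again. The sketch below illustrates this with a toy frame and an in-memory buffer instead of a file; the three-row frame is hypothetical.

```python
import io

import pandas as pd

# Toy frame with a nullable integer column and an "id" index.
toy = pd.DataFrame(
    {"egg_yolk_count": pd.array([11, None, 9], dtype="Int64")},
    index=pd.Index([1, 2, 3], name="id"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Default read: the column with a gap is inferred as float64.
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# Requesting the dtype restores the nullable integer column.
buffer.seek(0)
typed = pd.read_csv(buffer, index_col=0, dtype={"egg_yolk_count": "Int64"})

print(plain["egg_yolk_count"].dtype, typed["egg_yolk_count"].dtype)
```

Whether the next notebook needs the integer dtype at all is a design choice; if it does, passing `dtype=` to `pd.read_csv` is one option.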

#### **Conclusion and Structural Cleaning Insights**

The structural cleaning phase is complete, resulting in the dataset `datasetlearn_cleaned.csv`.

Insights and Preparation for Modeling

- **Focus on Key Variables:** We removed the empty `notes_baker` column and the uninformative `pastry_type` column, and kept `id` only as the index, leaving the 14 essential recipe and process characteristics (e.g., `sugar_content`, `oven_temperature`, and `origin`) for modeling.
- **Categorical Consistency:** We standardized the `origin` column to two consistent, clean values (`Lisboa` and `Porto`) by stripping whitespace and normalizing capitalization. Missing values in this column remain NaN.
- **Binary Target:** The target variable, `quality_class`, was converted into a binary format (`OK`=1, `KO`=0) as required for our classification model. 
- **Anti-Leakage Confirmed (Crucial Step):** The exported dataset intentionally still contains missing values (`NaNs`) and unscaled numerical features. This is critical: all statistical transformations (Imputation, Scaling, and Encoding) have been **deferred** to **Notebook 3** where the individual Scikit-learn transformers (e.g., `SimpleImputer`, `StandardScaler`, `OneHotEncoder`) will be **manually fitted exclusively on the training data** and then applied to the validation and test sets. This manual process guarantees we completely avoid **data leakage**, fulfilling a core requirement of the project's evaluation criteria.
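
The fit-on-train-only idea from the last bullet can be sketched as follows. This is only an illustration on synthetic data, not the Notebook 3 implementation: the random features, the 80/20 split, and the median imputation strategy are all assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two numeric features, some gaps in the first one.
rng = np.random.default_rng(0)
X = rng.normal(loc=200.0, scale=50.0, size=(100, 2))
X[::10, 0] = np.nan
y = np.array([0, 1] * 50)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

imputer = SimpleImputer(strategy="median")  # assumed strategy
scaler = StandardScaler()

# Statistics (medians, means, standard deviations) come from the
# training split only ...
X_train_ready = scaler.fit_transform(imputer.fit_transform(X_train))

# ... and are merely applied to the validation split.
X_val_ready = scaler.transform(imputer.transform(X_val))

print(imputer.statistics_, scaler.mean_)
```

Calling `fit` or `fit_transform` on the validation or test data would let their statistics influence the preprocessing, which is exactly the leakage this notebook avoids.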

#### **Next Step**

The file `datasetlearn_cleaned.csv` is now the clean input for **Notebook 3 (Feature Management)**, where we will perform the necessary data split (Train/Validation/Test) and define the non-leaking preprocessing Pipeline.