## **Notebook 3**
## **Feature Management**

### Introduction
This notebook executes the core **Feature Engineering** phase by performing two critical operations: splitting the data and applying statistical transformations in a controlled, non-leaking sequence. Since we are **not using a Scikit-learn Pipeline**, all transformation steps (Imputation, Scaling, Encoding) must be manually fitted and applied.

### Anti-Leakage Protocol: Split First
1.  **Split Data First:** The structurally cleaned dataset is first divided into dedicated Training, Validation, and Test sets.
2.  **Fit on Training Only:** All statistical transformers (`SimpleImputer`, `StandardScaler`, `OneHotEncoder`) are then **fitted exclusively on the Training set**.
3.  **Transform All:** The parameters learned solely from the Training set are applied to transform the Training, Validation, and Test sets.

This explicit, manual sequencing guarantees that no information from the Test or Validation set contaminates the Training process.
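
The three steps above can be illustrated with a minimal sketch on toy data. The arrays `train` and `test` below are hypothetical and are not part of the project dataset; the point is only that the fitted statistics come from the training split alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature, already split.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

# Step 2: fit on the training split only, so mean and std come from `train`.
scaler = StandardScaler().fit(train)

# Step 3: transform every split with the training statistics.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# The learned mean ignores the extreme test value entirely.
train_mean = scaler.mean_[0]
```

Had the scaler been fitted on `train` and `test` together, the test value would have shifted the learned mean, which is the leakage this protocol avoids.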

### Objectives
The objectives are:
* **Data Splitting:** Divide the dataset into 60% Training, 20% Validation, and 20% Test sets using a stratified approach to maintain class balance (`OK`/`KO`) in all partitions.
* **Imputation:** Manually fit and transform the data, filling missing values in numerical features using the median calculated only from the Training data.
* **Scaling:** Manually fit and transform the numerical features using the `StandardScaler`, with its mean ($\mu$) and standard deviation ($\sigma$) calculated only from the Training data.
* **Encoding:** Manually fit and transform the categorical feature (`origin`) using `OneHotEncoder` based only on the unique categories present in the Training data.
* **Export:** Save the fully transformed data splits and the fitted transformer objects for direct use by the **Modelling (NB4)** and **Final Model (NB9)** notebooks.

In [67]:
import pandas as pd
import numpy as np
import pickle 

# data partition
from sklearn.model_selection import train_test_split, StratifiedKFold 

# imputation
from sklearn.impute import SimpleImputer  # required for manual imputation

# scaling methods
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# encoding
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import scipy.stats as stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings('ignore')

##### **3.1 Load Structurally Cleaned Data from NB2**

In [68]:
learn_clean = pd.read_csv('datasetlearn_structurally_cleaned.csv', sep = ',')
display(learn_clean.shape)
learn_clean.head()

(5200, 15)

Unnamed: 0,ambient_humidity,baking_duration,cooling_period,cream_fat_content,egg_temperature,egg_yolk_count,final_temperature,lemon_zest_ph,origin,oven_temperature,preheating_time,salt_ratio,sugar_content,vanilla_extract,quality_class_binary
0,54.0,24.0,26.0,100.4,52.0,11.0,309.0,3.2,Lisboa,,207.0,42.74,22.8,5.7,0.0
1,66.0,37.0,34.0,98.0,46.0,10.0,317.0,3.3,Lisboa,306.0,245.0,41.73,11.6,4.0,0.0
2,41.0,30.0,19.0,99.3,53.0,10.0,130.0,3.4,Porto,121.0,186.0,75.1,20.3,7.5,1.0
3,62.0,24.0,48.0,98.0,115.0,9.0,354.0,3.3,Lisboa,357.0,186.0,46.41,73.3,4.2,1.0
4,55.0,21.0,34.0,100.1,48.0,9.0,211.0,3.0,Lisboa,202.0,218.0,56.52,80.1,6.0,0.0


Drop rows where the target variable is null, for two reasons. The technical reason is that `stratify=y` cannot process missing labels. The logical reason is that a Pastel de Nata is either OK or not OK; its class cannot be NaN.

In [69]:
learn_clean = learn_clean.dropna(subset=['quality_class_binary'])

Separate features (X) and the binary target (y)

In [70]:
X = learn_clean.drop('quality_class_binary',axis = 1)
y = learn_clean['quality_class_binary']

In [71]:
numeric_features = [
    'ambient_humidity', 'baking_duration', 'cooling_period', 'cream_fat_content',
    'egg_temperature', 'egg_yolk_count', 'final_temperature', 'lemon_zest_ph',
    'oven_temperature', 'preheating_time', 'salt_ratio', 'sugar_content', 
    'vanilla_extract'
]
categorical_features = ['origin']

#### **3.2 Data Splitting**

In [72]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

This creates two datasets: one holding 80% of the data (used later for training and validation) and one for test (20% of the data). <br>
`shuffle` (enabled by default) randomizes the order of the observations, and `stratify` ensures that every dataset resulting from the split has the same proportion of each label of the dependent variable.
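
To see the effect of `stratify` in isolation, here is a small sketch on hypothetical, deliberately imbalanced labels (`X_demo` and `y_demo` are illustrative names, not project variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 observations of class 0 and 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of class 1 in each partition; both match the 20% share of the full data.
share_a = y_a.mean()
share_b = y_b.mean()
```

Without `stratify`, the two shares would generally differ from split to split, which matters most when one class is rare.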

In [73]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size= 0.25, random_state= 15, shuffle= True, stratify= y_train_val)

Creating three datasets (train, validation and test) with `train_test_split` requires calling the function twice. <br>
The first call, above, separated the test set (`X_test` and `y_test`) from the combined training and validation data (`X_train_val` and `y_train_val`). The second call splits that remainder with `test_size=0.25`: 25% of the remaining 80% is 20% of the full dataset, which leaves 60% for training.

Run the cell below to check the proportion of data for each dataset.

In [74]:
# Format the fractions as percentages (0.6 -> 60%)
print('train:{:.0%} | validation:{:.0%} | test:{:.0%}'.format(len(y_train)/len(y),
                                                              len(y_val)/len(y),
                                                              len(y_test)/len(y)
                                                             ))

train:60% | validation:20% | test:20%


#### **3.3 Preprocessing** 
This section applies three statistical transformations to the numerical features (imputation, outlier capping, and scaling), strictly adhering to the "Fit on Train, Transform All" principle to prevent data leakage.

In [75]:
X_train_num = X_train[numeric_features].copy()
X_val_num = X_val[numeric_features].copy()
X_test_num = X_test[numeric_features].copy()

##### **Imputation (Handling NaNs)**

In [76]:
imputer = SimpleImputer(strategy='median')

The `SimpleImputer` with a `median` strategy is used to fill missing values (`NaNs`). The median is chosen because it is robust to outliers, which helps avoid distorting the feature distributions.
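
A quick sketch of why the median is the safer fill value, using a hypothetical column with one extreme observation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three typical values, one outlier, one missing value.
col = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

# `statistics_` holds the fill value learned for each column during fit.
median_fill = SimpleImputer(strategy="median").fit(col).statistics_[0]
mean_fill = SimpleImputer(strategy="mean").fit(col).statistics_[0]
```

In this toy case the median fill is 2.5, close to the typical values, while the mean fill is 251.5 because the single outlier drags it upward.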

In [77]:
imputer.fit(X_train_num)

SimpleImputer(strategy='median')


The `imputer` is **fitted ONLY on the `X_train_num`** data. This calculates the median for each numerical column exclusively using the training set's knowledge.

In [78]:
# Transform all splits using the median learned from the training data
X_train_num.loc[:, :] = imputer.transform(X_train_num) 
X_val_num.loc[:, :] = imputer.transform(X_val_num)
X_test_num.loc[:, :] = imputer.transform(X_test_num)

The learned medians are then applied (`.transform()`) to `X_train_num`, `X_val_num`, and `X_test_num`, ensuring that the Val/Test sets are treated as unseen data.

##### **Dealing with outliers**

In [79]:
for col in X_train_num.columns:
    Q1 = X_train_num[col].quantile(0.25)
    Q3 = X_train_num[col].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    lower_bound = Q1 - 1.5 * IQR

Calculate the statistical thresholds used to identify and potentially cap outliers in the numerical features. This calculation is performed only on the training data.  

By performing these calculations exclusively on `X_train_num`, we ensure that the limits defined for treating outliers (the upper_bound and lower_bound) are derived solely from the information available during the model's training phase.
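
The bounds can also be computed for all columns at once and kept in a small table. The sketch below uses a hypothetical two-column frame (`train_demo`) and is only one possible way to organize the calculation.

```python
import pandas as pd

# Hypothetical training data with two numeric columns.
train_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Quartiles per column, computed from the training data only.
q1 = train_demo.quantile(0.25)
q3 = train_demo.quantile(0.75)
iqr = q3 - q1

# One row per feature, holding that feature's own limits.
bounds = pd.DataFrame({"lower": q1 - 1.5 * iqr, "upper": q3 + 1.5 * iqr})
```

For column `a` this gives limits of -1 and 7, so the value 100 would be capped; column `b` gets limits of -10 and 70 and is left unchanged. Keeping one pair of limits per feature matters because each column has its own scale.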

In [80]:
# Cap every numeric column. The bounds are recomputed inside the loop so that
# each column uses its own limits, derived from the training split only
# (outside a loop, only the last column would be capped).
for col in numeric_features:
    Q1 = X_train_num[col].quantile(0.25)
    Q3 = X_train_num[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Apply the same training-derived limits to all three splits
    X_train_num[col] = X_train_num[col].clip(lower_bound, upper_bound)
    X_val_num[col] = X_val_num[col].clip(lower_bound, upper_bound)
    X_test_num[col] = X_test_num[col].clip(lower_bound, upper_bound)

Capping (Transform All): The calculated `upper_bound` and `lower_bound` are then applied to cap the extreme values in the Training, Validation, and Test sets. This guarantees that the validation and test data are treated using rules derived only from the training distribution.

##### **Scaling**

In [81]:
scaler = StandardScaler()

We use the `StandardScaler` to standardize the numerical features. This process transforms the data such that it has a mean ($\mu$) of zero and a standard deviation ($\sigma$) of one. This is crucial for models that rely on distance metrics (like K-Nearest Neighbors) or gradient descent optimization (like Neural Networks).
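
As a worked check of the formula $z = (x - \mu) / \sigma$, the sketch below compares a manual calculation with `StandardScaler` on hypothetical values. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training values and one new observation.
x_train = np.array([[2.0], [4.0], [6.0]])
x_new = np.array([[8.0]])

scaler = StandardScaler().fit(x_train)

# Manual standardization with the training mean and population std.
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)  # ddof=0, matching StandardScaler
manual = (x_new - mu) / sigma

# Both routes give the same standardized value.
same = np.allclose(manual, scaler.transform(x_new))
```

The new observation is expressed in units of training standard deviations from the training mean, which is exactly how the Validation and Test sets are treated.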

In [82]:
scaler.fit(X_train_num)

StandardScaler()


The `StandardScaler` is fitted only on the `X_train_num`. The mean ($\mu$) and standard deviation ($\sigma$) used for standardization are calculated solely from the training data.

In [None]:
X_train_num.loc[:, :] = scaler.transform(X_train_num)
X_val_num.loc[:, :] = scaler.transform(X_val_num)
X_test_num.loc[:, :] = scaler.transform(X_test_num)

Scaling complete: Numerical variables standardized based on Train set statistics.


#### **Processing the Prediction Data (`predict.csv`)**

In [None]:
df_predict = pd.read_csv('predict.csv')
id_predict = df_predict['id']

Apply the structural cleaning from NB2 (it must match the cleaning applied to the training data).

In [None]:
df_predict = df_predict.drop(columns=['id', 'notes_baker', 'pastry_type'])
df_predict['origin'] = df_predict['origin'].replace({
    'lisboa': 'Lisboa', ' lisboa': 'Lisboa', 'lisboa ': 'Lisboa', 'porto': 'Porto', 
    ' porto': 'Porto', 'porto ': 'Porto', 'LISBOA': 'Lisboa', 'PORTO': 'Porto'    
})

Prepare numerical and categorical columns

In [None]:
X_predict_num = df_predict[numeric_features].copy()
X_predict_cat = df_predict[categorical_features].copy()

Apply the transformers fitted on the Training set (transform only; do not refit).