# **Notebook 3**
## **Feature Management**

### Introduction
This notebook executes the core **Feature Engineering** phase by performing two critical operations: splitting the data and applying statistical transformations in a controlled, non-leaking sequence. Since we are **not using a Scikit-learn Pipeline**, all transformation steps (Imputation, Scaling, Encoding) must be manually fitted and applied.

### Anti-Leakage Protocol: Split First
1.  **Split Data First:** The structurally cleaned dataset is first divided into dedicated Training, Validation, and Test sets.
2.  **Fit on Training Only:** All statistical transformers (`SimpleImputer`, `StandardScaler`, `OneHotEncoder`) are then **fitted exclusively on the Training set**.
3.  **Transform All:** The parameters learned solely from the Training set are applied to transform the Training, Validation, and Test sets.

This explicit, manual sequencing guarantees that no information from the Test or Validation set contaminates the Training process.
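
The three steps above can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's actual cells: the array `X_toy`, the target `y_toy` and the name `toy_scaler` are hypothetical, and only `StandardScaler` is shown, although the same pattern applies to the imputer and the encoder.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 100 rows, 2 numerical features, balanced binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_toy = np.array([0, 1] * 50)

# 1. Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# 2. Fit on the training split only
toy_scaler = StandardScaler().fit(X_tr)

# 3. Transform every split with the parameters learned in step 2
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

# The training split is centred on zero by construction; the test split is
# only approximately centred, because its own mean was never used.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

A test-split mean that is close to, but not exactly, zero is the expected signature of a leak-free transformation.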

### Objectives
The objectives are:
* **Data Splitting:** Divide the dataset into 60% Training, 20% Validation, and 20% Test sets using a stratified approach to maintain class balance (`OK`/`KO`) in all partitions.
* **Imputation:** Manually fit and transform the data, filling missing values in numerical features using the median calculated only from the Training data.
* **Scaling:** Manually fit and transform the numerical features using the `StandardScaler`, with its mean ($\mu$) and standard deviation ($\sigma$) calculated only from the Training data.
* **Encoding:** Manually fit and transform the categorical feature (`origin`) using `OneHotEncoder` based only on the unique categories present in the Training data.
* **Export:** Save the fully transformed data splits and the fitted transformer objects for direct use by the **Modelling (NB4)** and **Final Model (NB9)** notebooks.

In [184]:
import pandas as pd
import numpy as np
import pickle, os

# data partition
from sklearn.model_selection import train_test_split, StratifiedKFold 

# imputation
from sklearn.impute import SimpleImputer  # essential for manual imputation

# scaling methods
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# encoding
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import scipy.stats as stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings('ignore')

##### **3.1 Load Structurally Cleaned Data from NB2**

In [185]:
learn_clean = pd.read_csv('datasetlearn_structurally_cleaned.csv', sep = ',')
display(learn_clean.shape)
learn_clean.head()

(5200, 15)

| | ambient_humidity | baking_duration | cooling_period | cream_fat_content | egg_temperature | egg_yolk_count | final_temperature | lemon_zest_ph | origin | oven_temperature | preheating_time | salt_ratio | sugar_content | vanilla_extract | quality_class_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54.0 | 24.0 | 26.0 | 100.4 | 52.0 | 11.0 | 309.0 | 3.2 | Lisboa | NaN | 207.0 | 42.74 | 22.8 | 5.7 | 0.0 |
| 1 | 66.0 | 37.0 | 34.0 | 98.0 | 46.0 | 10.0 | 317.0 | 3.3 | Lisboa | 306.0 | 245.0 | 41.73 | 11.6 | 4.0 | 0.0 |
| 2 | 41.0 | 30.0 | 19.0 | 99.3 | 53.0 | 10.0 | 130.0 | 3.4 | Porto | 121.0 | 186.0 | 75.1 | 20.3 | 7.5 | 1.0 |
| 3 | 62.0 | 24.0 | 48.0 | 98.0 | 115.0 | 9.0 | 354.0 | 3.3 | Lisboa | 357.0 | 186.0 | 46.41 | 73.3 | 4.2 | 1.0 |
| 4 | 55.0 | 21.0 | 34.0 | 100.1 | 48.0 | 9.0 | 211.0 | 3.0 | Lisboa | 202.0 | 218.0 | 56.52 | 80.1 | 6.0 | 0.0 |


Drop the rows where the target variable is null, for two reasons. The technical one is that `stratify=y` cannot process missing labels. The logical one is that a Pastel de Nata is either OK or not OK; its class cannot be NaN.

In [186]:
learn_clean = learn_clean.dropna(subset=['quality_class_binary'])

Separate features (X) and the binary target (y)

In [187]:
X = learn_clean.drop('quality_class_binary',axis = 1)
y = learn_clean['quality_class_binary']

In [188]:
numeric_features = [
    'ambient_humidity', 'baking_duration', 'cooling_period', 'cream_fat_content',
    'egg_temperature', 'egg_yolk_count', 'final_temperature', 'lemon_zest_ph',
    'oven_temperature', 'preheating_time', 'salt_ratio', 'sugar_content', 
    'vanilla_extract'
]
categorical_features = ['origin']

#### **3.2 Data Splitting**

In [189]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

This first call creates two datasets: a combined training-and-validation set (80% of the data) and a test set (20% of the data). <br>
`shuffle` (enabled by default) randomizes the order of the observations before splitting, and `stratify=y` ensures that every dataset resulting from the split keeps the same proportion of each label of the dependent variable.

In [190]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size= 0.25, random_state= 15, shuffle= True, stratify= y_train_val)

`train_test_split` produces only two partitions per call, so it has to be called twice to create three datasets (train, validation and test). <br>
The first call set aside the test data (`X_test` and `y_test`) and kept the rest for training and validation (`X_train_val` and `y_train_val`). The second call, above, takes 25% of that remainder as the validation set, which is 20% of the full dataset (0.25 × 0.80 = 0.20) and leaves 60% for training.
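
The two-step split can be checked on toy data. The sketch below uses hypothetical arrays (`X_toy`, `y_toy`) with a 30% positive rate and reuses the same `test_size` values as the cells above, to show that both the 60/20/20 proportions and the class balance are preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, 30% positive class
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 300 + [0] * 700)

# Step 1: hold out 20% of the full data as the test set
X_tv, X_te, y_tv, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Step 2: 25% of the remaining 80% equals 20% of the full data
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=15, stratify=y_tv
)

for name, part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    share = len(part) / len(y_toy)
    print(f"{name}: {share:.0%} of rows, positive rate {part.mean():.2f}")
```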

Run the cell below to check the proportion of data for each dataset.

In [191]:
print('train:{:.0%} | validation:{:.0%} | test:{:.0%}'.format(len(y_train)/len(y),
                                                              len(y_val)/len(y),
                                                              len(y_test)/len(y)
                                                             ))

train:60% | validation:20% | test:20%


#### **3.3 Load and Structural Clean-up of predict.csv**

In [192]:
predict_data = pd.read_csv('Nata_Files/predict.csv', sep = ',')
id_predict = predict_data['id']  # keep the ids for export; the column itself is dropped from the features below
predict_data.head()

| | id | ambient_humidity | baking_duration | cooling_period | cream_fat_content | egg_temperature | egg_yolk_count | final_temperature | lemon_zest_ph | notes_baker | origin | oven_temperature | pastry_type | preheating_time | salt_ratio | sugar_content | vanilla_extract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5201 | 79.0 | 22 | 40 | 98.6 | 79.0 | 9.0 | 259.0 | 3.2 | NaN | Lisboa | 268.0 | NaN | 208.0 | 49.63 | 182.6 | 4.0 |
| 1 | 5202 | 49.0 | 26 | 32 | 101.9 | 105.0 | 9.0 | 287.0 | 3.2 | NaN | Lisboa | 287.0 | Pastel de nata | 189.0 | 182.54 | 76.2 | 4.8 |
| 2 | 5203 | 80.0 | 28 | 24 | 96.6 | 20.0 | 10.0 | 64.0 | 3.4 | NaN | Porto | 74.0 | Pastel Nata | 201.0 | 100.41 | 23.5 | 6.1 |
| 3 | 5204 | 74.0 | 21 | 37 | 97.2 | 81.0 | 9.0 | 314.0 | 3.0 | NaN | Lisboa | 317.0 | NaN | 220.0 | 46.66 | 143.2 | 4.9 |
| 4 | 5205 | 41.0 | 19 | 41 | 97.3 | 104.0 | 10.0 | 246.0 | 3.2 | NaN | Lisboa | 243.0 | Pastel Nata | 191.0 | 39.45 | 143.0 | 7.0 |


The `predict.csv` data must undergo the same structural cleaning as the training data (NB2) before statistical transformations are applied. This includes saving the required `id` and cleaning categorical text formats.

In [193]:
predict_data = predict_data.drop(columns=['id', 'notes_baker', 'pastry_type'])
predict_data['origin'] = predict_data['origin'].replace({
    'lisboa': 'Lisboa', ' lisboa': 'Lisboa', 'lisboa ': 'Lisboa', 'porto': 'Porto', 
    ' porto': 'Porto', 'porto ': 'Porto', 'LISBOA': 'Lisboa', 'PORTO': 'Porto'    
})
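
As an alternative to listing every variant by hand, the same normalization could be sketched with pandas string methods. This assumes that the only inconsistencies are surrounding whitespace and letter casing, which is an assumption about the data and not something verified here. Whichever rule is used should match the cleaning applied in NB2. The series `origin_raw` is hypothetical.

```python
import pandas as pd

# Toy values (hypothetical) with the kinds of variants handled by the mapping above
origin_raw = pd.Series([" lisboa", "LISBOA", "Porto ", "porto", None])

# Strip surrounding whitespace, then keep only the first letter in upper case
origin_clean = origin_raw.str.strip().str.capitalize()

# Missing values stay missing; they are not turned into strings
print(origin_clean.tolist())
```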

#### **3.4 Prepare numerical and categorical columns**
Prepare sub-DataFrames for all sets (Train, Val, Test, Predict).  
Splitting each set into numerical and categorical columns lets the statistical transformations in the following sections (imputation, capping and scaling for the numerical features, encoding for the categorical one) be applied separately, strictly adhering to the "Fit on Train, Transform All" principle to prevent data leakage.

Numerical Sub-DataFrames

In [194]:
X_train_num = X_train[numeric_features].copy()
X_val_num = X_val[numeric_features].copy()
X_test_num = X_test[numeric_features].copy()

X_predict_num = predict_data[numeric_features].copy()

Categorical Sub-DataFrames

In [195]:
X_train_cat = X_train[categorical_features].copy()
X_val_cat = X_val[categorical_features].copy()
X_test_cat = X_test[categorical_features].copy()

X_predict_cat = predict_data[categorical_features].copy()

#### **3.5 Fit and transform numerical features (Imputation, Capping, Scaling)**

This block executes the three core statistical transformations on numerical features in a controlled, sequential manner. The goal is to calculate the necessary parameters (median, IQR bounds, $\mu$, $\sigma$) **only** from the training data and then apply them to **all** datasets (Train, Val, Test, Predict).

##### **Imputation (Handling NaNs)**

In [196]:
imputer = SimpleImputer(strategy='median')

The `SimpleImputer` with a `median` strategy is used to fill missing values (`NaNs`). The median is chosen because it is robust to outliers, which helps avoid distorting the feature distributions.

In [197]:
imputer.fit(X_train_num)

SimpleImputer(strategy='median')


The `imputer` is **fitted ONLY on the `X_train_num`** data. This calculates the median for each numerical column exclusively using the training set's knowledge.

In [198]:
# Transform all splits using the median learned from the training data
X_train_num.loc[:, :] = imputer.transform(X_train_num) 
X_val_num.loc[:, :] = imputer.transform(X_val_num)
X_test_num.loc[:, :] = imputer.transform(X_test_num)

X_predict_num.loc[:, :] = imputer.transform(X_predict_num)

These training medians are then applied (`.transform()`) to `X_train_num`, `X_val_num`, `X_test_num` and `X_predict_num`, ensuring that the Val/Test/Predict sets are treated as unseen data.
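
A small example makes the behaviour concrete. The values below are hypothetical; `statistics_` is the attribute where a fitted `SimpleImputer` stores the per-column fill values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy splits (hypothetical values)
train_toy = pd.DataFrame({"oven_temperature": [200.0, 210.0, np.nan, 400.0]})
val_toy = pd.DataFrame({"oven_temperature": [np.nan, 1000.0]})

toy_imputer = SimpleImputer(strategy="median").fit(train_toy)

# Median of the three observed training values (200, 210, 400) is 210
print(toy_imputer.statistics_)

# The validation NaN is filled with the training median (210),
# not with anything computed from the validation values
val_filled = toy_imputer.transform(val_toy)
print(val_filled.ravel())
```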

##### **Dealing with outliers**

In [199]:
outlier_bounds = {}

for col in X_train_num.columns:
    # Calculate Bounds (FIT ONLY ON TRAIN)
    Q1 = X_train_num[col].quantile(0.25)
    Q3 = X_train_num[col].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    lower_bound = Q1 - 1.5 * IQR
    outlier_bounds[col] = {'upper': upper_bound, 'lower': lower_bound}

Calculate the statistical thresholds used to identify and potentially cap outliers in the numerical features. This calculation is performed only on the training data.  

By performing these calculations exclusively on `X_train_num`, we ensure that the limits defined for treating outliers (the upper_bound and lower_bound) are derived solely from the information available during the model's training phase.

In [200]:
# Cap every numerical column in every split, each with its own bounds learned from the training data
for data_num in [X_train_num, X_val_num, X_test_num, X_predict_num]:
    for col, bounds in outlier_bounds.items():
        data_num[col] = data_num[col].clip(lower=bounds['lower'], upper=bounds['upper'])

Capping (Transform All): The `upper_bound` and `lower_bound` calculated for each column are then applied to cap the extreme values in the Training, Validation, Test and Predict sets. This guarantees that the validation, test and predict data are treated using rules derived only from the training distribution.
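
The fit-then-apply logic for capping can also be packaged as two small helpers, sketched below. The function names `fit_iqr_bounds` and `apply_iqr_caps` and the toy frames are hypothetical. The point of the sketch is that each column is capped with its own stored bounds.

```python
import pandas as pd


def fit_iqr_bounds(train_df: pd.DataFrame, k: float = 1.5) -> dict:
    """Learn per-column capping bounds from the training data only."""
    bounds = {}
    for col in train_df.columns:
        q1 = train_df[col].quantile(0.25)
        q3 = train_df[col].quantile(0.75)
        iqr = q3 - q1
        bounds[col] = {"lower": q1 - k * iqr, "upper": q3 + k * iqr}
    return bounds


def apply_iqr_caps(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Cap each column with its own stored bounds and return a new DataFrame."""
    capped = df.copy()
    for col, b in bounds.items():
        capped[col] = capped[col].clip(lower=b["lower"], upper=b["upper"])
    return capped


# Toy data (hypothetical values)
train_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                          "b": [10.0, 20.0, 30.0, 40.0, 50.0]})
new_toy = pd.DataFrame({"a": [-100.0, 3.0], "b": [35.0, 999.0]})

toy_bounds = fit_iqr_bounds(train_toy)   # learned from training data only
new_capped = apply_iqr_caps(new_toy, toy_bounds)
print(toy_bounds)
print(new_capped)
```

For column `a` the quartiles are 2 and 4, so the bounds are -1 and 7; for column `b` they are -10 and 70. Only the values outside those ranges are changed.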

##### **Scaling**

In [201]:
scaler = StandardScaler()

We use the `StandardScaler` to standardize the numerical features. This process transforms the data such that it has a mean ($\mu$) of zero and a standard deviation ($\sigma$) of one. This is crucial for models that rely on distance metrics (like K-Nearest Neighbors) or gradient descent optimization (like Neural Networks).

In [202]:
scaler.fit(X_train_num)

StandardScaler()


The `StandardScaler` is fitted only on the `X_train_num`. The mean ($\mu$) and standard deviation ($\sigma$) used for standardization are calculated solely from the training data.

In [203]:
X_train_num.loc[:, :] = scaler.transform(X_train_num)
X_val_num.loc[:, :] = scaler.transform(X_val_num)
X_test_num.loc[:, :] = scaler.transform(X_test_num)

X_predict_num.loc[:, :] = scaler.transform(X_predict_num)
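
Standardization applies $z = (x - \mu) / \sigma$ with the training-set $\mu$ and $\sigma$. The sketch below, on hypothetical toy arrays, checks that `scaler.transform` matches this formula when computed by hand from the fitted attributes `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy splits (hypothetical values)
train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
val_toy = np.array([[10.0]])

toy_scaler = StandardScaler().fit(train_toy)

mu = toy_scaler.mean_      # training mean
sigma = toy_scaler.scale_  # training standard deviation (population form, ddof=0)

# Manual standardization of the validation value with the training parameters
manual = (val_toy - mu) / sigma
print(mu, sigma, manual.ravel())

# Same result as the library call
print(np.allclose(manual, toy_scaler.transform(val_toy)))
```

Because the validation value lies far from the training mean, its standardized value is far from zero, which is the intended behaviour: unseen data is measured against the training distribution.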

##### **Encoding (OneHotEncoder)**

This section handles the categorical feature by converting it into a numerical format suitable for Machine Learning algorithms.

In [204]:
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

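Setting `handle_unknown='ignore'` is vital. If a category that was not seen during fitting appears in the Validation, Test or Prediction data, the encoder does not raise an error; it encodes that row as all zeros across the one-hot columns, so the column structure stays identical to the training set.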

In [205]:
encoder.fit(X_train_cat)

OneHotEncoder(handle_unknown='ignore', sparse_output=False)


The encoder is fitted only on `X_train_cat`.

In [206]:
feature_names = list(encoder.get_feature_names_out(categorical_features))

The line above matters because `OneHotEncoder.transform` returns a plain NumPy array with no column headers. Calling `get_feature_names_out` after fitting captures the descriptive column names created by the encoder (for example `origin_Lisboa` and `origin_Porto`), so that the final DataFrames (`X_train_final` and `X_predict_final`) keep the same readable headers, in the same order, across every split.

Transform all sets (Train, Val, Test, Predict)

In [207]:
X_train_cat_encoded = encoder.transform(X_train_cat)
X_val_cat_encoded = encoder.transform(X_val_cat)
X_test_cat_encoded = encoder.transform(X_test_cat)

X_predict_cat_encoded = encoder.transform(X_predict_cat)

##### Convert arrays back to DataFrames

In [208]:
X_train_cat_df = pd.DataFrame(X_train_cat_encoded, index=X_train.index, columns=feature_names)
X_val_cat_df = pd.DataFrame(X_val_cat_encoded, index=X_val.index, columns=feature_names)
X_test_cat_df = pd.DataFrame(X_test_cat_encoded, index=X_test.index, columns=feature_names)
X_predict_cat_df = pd.DataFrame(X_predict_cat_encoded, index=X_predict_cat.index, columns=feature_names)
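
Passing `index=...` here is what keeps the later concatenation correct, because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses hypothetical toy frames to show what happens when the original index is not reused.

```python
import numpy as np
import pandas as pd

# Toy numerical block with a shuffled index, as produced by train_test_split
num_toy = pd.DataFrame({"x": [0.1, 0.2, 0.3]}, index=[42, 7, 19])
encoded_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
names = ["origin_Lisboa", "origin_Porto"]

# Without index=..., the encoded frame is labelled 0..2, so concat produces
# the union of both indexes and pads the gaps with NaN
misaligned = pd.concat(
    [num_toy, pd.DataFrame(encoded_toy, columns=names)], axis=1
)
print(misaligned.shape)

# Reusing the original index keeps each encoded row next to its own numerical row
aligned = pd.concat(
    [num_toy, pd.DataFrame(encoded_toy, index=num_toy.index, columns=names)],
    axis=1,
)
print(aligned.shape)
```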

#### **3.6 Final merging and export**

Concatenate transformed features

In [209]:
X_train_final = pd.concat([X_train_num, X_train_cat_df], axis=1)
X_val_final = pd.concat([X_val_num, X_val_cat_df], axis=1)
X_test_final = pd.concat([X_test_num, X_test_cat_df], axis=1)
X_predict_final = pd.concat([X_predict_num, X_predict_cat_df], axis=1)

print(f" X_train_final shape: {X_train_final.shape}")

 X_train_final shape: (3119, 16)


The Python `pickle` library is used to save the entire dictionary (`notebook3_data_fixed`) to disk as a `.pkl` file.

In [210]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) 

# Build the master dictionary with all the data and the anti-leakage keys
notebook3_data_fixed = {
    # Transformed DataFrames
    'X_train': X_train_final, 'y_train': y_train,
    'X_val': X_val_final, 'y_val': y_val,
    'X_test': X_test_final, 'y_test': y_test,
    # Prediction data, ready to use
    'X_predict_final': X_predict_final,
    'id_predict': id_predict,
    # Save the FITTED transformers (anti-leakage keys)
    'imputer': imputer, 'scaler': scaler, 'encoder': encoder, 
    'outlier_bounds': outlier_bounds, 
    'skf': skf, 'numeric_features': numeric_features, 'categorical_features': categorical_features
}

# os.path.join builds a path that works on any operating system
with open(os.path.join('Nata_Files', 'train_test_split_fixed.pkl'), 'wb') as f:
    pickle.dump(notebook3_data_fixed, f)
    
print("Processing and export complete. Notebook 4 can be started.")

Processing and export complete. Notebook 4 can be started.


This dictionary contains the **fitted transformers** (`imputer`, `scaler`, `encoder`), the `outlier_bounds` learned from the training data, and the cross-validation splitter `skf`. Saving the transformers and bounds is necessary because they hold the statistical parameters learned **exclusively** from the training set.
By saving these "keys," we ensure that when the final prediction is made (Notebook 9), the raw `predict.csv` data can be transformed using the exact same rules derived from the training set, strictly enforcing the anti-leakage protocol throughout the entire project lifecycle.
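
A minimal sketch of how a later notebook might reload and reuse the saved objects is shown below. It is illustrative only: the file name `toy_artifacts.pkl`, the toy data and the variable names are hypothetical, and the capping step that sits between imputation and scaling in this notebook is omitted for brevity.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical values)
train_toy = pd.DataFrame({"sugar_content": [10.0, 20.0, np.nan, 40.0]})
toy_imputer = SimpleImputer(strategy="median").fit(train_toy)
toy_scaler = StandardScaler().fit(toy_imputer.transform(train_toy))

# Save the fitted objects in a single dictionary
path = os.path.join(tempfile.mkdtemp(), "toy_artifacts.pkl")  # hypothetical file
with open(path, "wb") as f:
    pickle.dump({"imputer": toy_imputer, "scaler": toy_scaler}, f)

# Later, for example in another notebook: reload and apply in the same order
with open(path, "rb") as f:
    artifacts = pickle.load(f)

new_toy = pd.DataFrame({"sugar_content": [np.nan, 30.0]})
new_imputed = artifacts["imputer"].transform(new_toy)   # NaN -> training median
new_ready = artifacts["scaler"].transform(new_imputed)  # training mean and std
print(new_ready.ravel())
```

The transformers must be applied in the order in which they were fitted, since each one was fitted on the output of the previous step. Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.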