# Feature Engineering – NutriClass

This notebook transforms cleaned nutrition data into a model-compatible,
inference-ready feature set.

Key objectives:
- Encode categorical and boolean features
- Transform skewed numerical variables
- Scale numerical features
- Prepare data for both supervised and unsupervised learning
- Keep feature engineering independent of modeling logic

Feature engineering is performed independently of model training.
This separation ensures:

- Reusability of features across multiple models
- Consistency during inference
- Clean distinction between data engineering and modeling phases


In [1]:
import pandas as pd
import numpy as np

# Only StandardScaler is used in this notebook.
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the cleaned data
df = pd.read_csv('../data/processed/clean_food_data.csv')
df.shape

(31387, 16)

In [3]:
df.columns

Index(['Calories', 'Protein', 'Fat', 'Carbs', 'Sugar', 'Fiber', 'Sodium',
       'Cholesterol', 'Glycemic_Index', 'Water_Content', 'Serving_Size',
       'Meal_Type', 'Preparation_Method', 'Is_Vegan', 'Is_Gluten_Free',
       'Food_Name'],
      dtype='object')

In [5]:
# Define column groups ('Food_Name' is text and belongs to none of them)
numeric_cols = [
    'Calories','Protein','Fat','Carbs','Sugar','Fiber',
    'Sodium','Cholesterol','Glycemic_Index',
    'Water_Content','Serving_Size'
]

categorical_cols = ['Meal_Type','Preparation_Method']
bool_cols = ['Is_Vegan','Is_Gluten_Free']

In [6]:
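# One-hot encode the categorical variables.
# drop_first=True drops one indicator per feature, since it is implied by the others.
df_encoded = pd.get_dummies(
    df,
    columns=categorical_cols,
    drop_first=True
)

# Cast the boolean flags to 0/1 integers
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)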
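
`pd.get_dummies` derives its output columns from the categories present in the frame it receives, so a small batch at inference time can produce a different column set than the training data did. The following is a minimal sketch of one way to keep the columns consistent, not the notebook's own method: it records the training columns and reindexes new data to them. The category values (`breakfast`, `baked`, and so on) and the names `train`, `new_batch`, and `feature_columns` are illustrative assumptions.

```python
import pandas as pd

cat_cols = ["Meal_Type", "Preparation_Method"]

# Toy frames; the category values are illustrative, not taken from the dataset.
train = pd.DataFrame({
    "Meal_Type": ["breakfast", "dinner", "lunch"],
    "Preparation_Method": ["baked", "fried", "raw"],
})
new_batch = pd.DataFrame({
    "Meal_Type": ["dinner"],
    "Preparation_Method": ["raw"],
})

# Training time: encode as in the notebook and remember the resulting columns.
train_enc = pd.get_dummies(train, columns=cat_cols, drop_first=True)
feature_columns = train_enc.columns.tolist()

# Inference time: encode WITHOUT drop_first (with a single category present,
# drop_first would remove its only indicator), then align to the training columns.
new_enc = pd.get_dummies(new_batch, columns=cat_cols)
new_enc = new_enc.reindex(columns=feature_columns, fill_value=0).astype(int)

print(new_enc)
```

Indicators that are missing from the new batch are filled with 0, and any indicator the training data never produced is dropped by the reindex.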



In [7]:
# Reduce right skew with log1p, i.e. log(1 + x), which is defined at x = 0.
# Assumes these columns contain no negative values.
for col in ['Calories','Sugar','Sodium']:
    df_encoded[col] = np.log1p(df_encoded[col])
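
To see what the transform does to a skewed column, the sketch below compares sample skewness before and after `np.log1p`. It uses synthetic log-normal values as a stand-in for a column such as `Sodium`; the data and the variable names are assumptions for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative values (a stand-in, not real data).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

skew_before = values.skew()
transformed = np.log1p(values)
skew_after = transformed.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# np.expm1 inverts np.log1p, so original units can be recovered when needed.
restored = np.expm1(transformed)
```

Running the same comparison with `df[col].skew()` on the real columns is a quick way to confirm which features actually benefit from the transform.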


In [None]:
# Standardize numeric features to zero mean and unit variance.
# The scaler is fit on the full dataset loaded above.
scaler = StandardScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
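
For inference to stay consistent, new rows need to be scaled with the means and variances learned here, not with freshly fitted ones. One possible approach, sketched below on toy data, is to persist the fitted scaler with `joblib` and reload it later. The file name `scaler.joblib`, the temporary directory, and the shortened column list are hypothetical choices for this example.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_cols = ["Calories", "Protein"]  # shortened list for illustration

# Toy training data (illustrative values only).
train = pd.DataFrame({
    "Calories": [100.0, 200.0, 300.0],
    "Protein": [5.0, 10.0, 15.0],
})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[numeric_cols])

# Persist the fitted scaler; the path is a hypothetical location.
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(scaler, scaler_path)

# Later, at inference time: reload and call transform only (never fit again).
loaded = joblib.load(scaler_path)
new_rows = pd.DataFrame({"Calories": [200.0], "Protein": [10.0]})
new_scaled = loaded.transform(new_rows[numeric_cols])

print(new_scaled)
```

In this notebook's flow, new data would also need the same `np.log1p` step applied to the same columns before the reloaded scaler is used, since the scaler was fit on the log-transformed values.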

In [9]:
df_encoded[numeric_cols]

index,Calories,Protein,Fat,Carbs,Sugar,Fiber,Sodium,Cholesterol,Glycemic_Index,Water_Content,Serving_Size
0,0.349183,0.679606,0.239530,0.565314,-0.251722,-0.360123,0.859420,-0.061040,0.765305,-0.154973,0.985719
1,-0.427461,-0.723920,-0.090477,-0.474229,1.135807,-1.085584,-0.404926,0.351636,0.247214,0.070727,-0.952531
2,0.672409,0.677776,0.969042,0.065938,0.026193,-0.670615,0.872898,0.374468,-0.052381,-0.453525,1.015068
3,-0.603870,0.010943,-0.891782,0.291895,-0.735400,-0.098313,0.426065,0.094996,0.211382,0.478902,-0.293313
4,0.208342,-0.447756,0.513781,0.089880,1.207453,-0.251257,0.417680,-0.485938,0.284290,-0.828986,-1.589693
...,...,...,...,...,...,...,...,...,...,...,...
31382,-2.204460,-1.111064,-1.546816,-0.071014,1.019869,-0.020279,-2.126979,-1.168601,-0.433308,1.792601,-0.612181
31383,-0.063952,0.122787,0.103185,0.664542,-0.077232,-0.185095,0.816449,-0.245010,0.747163,-0.365727,0.871836
31384,0.429639,0.418566,-0.104657,0.439471,-0.067298,0.645550,0.803034,0.114725,0.488845,-0.569552,0.980511
31385,0.829058,-0.027662,-0.518264,0.910047,-1.215664,-1.149650,-1.065252,-0.833545,0.409558,-0.005774,0.044929


In [10]:
# Save the processed feature set
df_encoded.to_csv(
    "../data/processed/X_features_inference_ready.csv",
    index=False
)


Outcome:
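- All engineered columns are numeric or indicator-valued, and the columns in `numeric_cols` are standardized; `Food_Name` is carried through unchanged as text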
- Ready for clustering (unsupervised learning)
- Ready for classification (supervised learning)
- Feature engineering is now locked
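
Before passing the saved file to a clustering or classification routine, it can help to confirm which columns are still non-numeric. The check below is a small sketch on a toy frame standing in for `df_encoded`; the row values and the names `frame`, `non_numeric`, and `X` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for df_encoded (values are illustrative only).
frame = pd.DataFrame({
    "Calories": [0.35, -0.43],
    "Is_Vegan": [1, 0],
    "Meal_Type_dinner": [True, False],
    "Food_Name": ["apple", "bread"],
})

# Columns that are neither numeric nor boolean cannot be fed to most estimators.
non_numeric = frame.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print("Non-numeric columns:", non_numeric)

# Set those columns aside and cast the rest to float to get a model-ready matrix.
X = frame.drop(columns=non_numeric).astype(float)
print(X.dtypes)
```

Whether a column such as `Food_Name` should be dropped, kept as an identifier, or used as a label depends on the downstream task, so the sketch only separates it out without deciding its role.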
