## Data Cleaning

This notebook addresses data quality issues identified during the data understanding phase.
The following steps are performed:

- Load raw dataset
- Enforce correct data types
- Handle missing values
- Remove duplicate records
- Save cleaned dataset for further analysis

In [1]:
import pandas as pd

df = pd.read_csv('../data/raw/synthetic_food_dataset_imbalanced.csv')
df.shape

(31700, 16)

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31700 entries, 0 to 31699
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Calories            31325 non-null  float64
 1   Protein             31325 non-null  float64
 2   Fat                 31325 non-null  float64
 3   Carbs               31325 non-null  float64
 4   Sugar               31325 non-null  float64
 5   Fiber               31325 non-null  float64
 6   Sodium              31325 non-null  float64
 7   Cholesterol         31325 non-null  float64
 8   Glycemic_Index      31325 non-null  float64
 9   Water_Content       31325 non-null  float64
 10  Serving_Size        31325 non-null  float64
 11  Meal_Type           31700 non-null  object 
 12  Preparation_Method  31700 non-null  object 
 13  Is_Vegan            31700 non-null  bool   
 14  Is_Gluten_Free      31700 non-null  bool   
 15  Food_Name           31700 non-null  object 
dtypes: bool(2), float64(11), object(3)

In [4]:
numeric_cols = [
    'Calories','Protein','Fat','Carbs','Sugar','Fiber',
    'Sodium','Cholesterol','Glycemic_Index',
    'Water_Content','Serving_Size'
]

categorical_cols = ['Meal_Type', 'Preparation_Method']
boolean_cols = ['Is_Vegan', 'Is_Gluten_Free']
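
The `categorical_cols` and `boolean_cols` lists are defined above but not used in the cells shown here. If the "enforce correct data types" step is meant to cover them too, one possible approach is sketched below. This is an assumption rather than part of the original pipeline, and the toy rows and category values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the category values are made up for this sketch.
toy = pd.DataFrame({
    "Meal_Type": ["breakfast", "dinner", "breakfast"],
    "Preparation_Method": ["fried", "baked", "fried"],
    "Is_Vegan": [True, False, True],
    "Is_Gluten_Free": [False, False, True],
})

categorical_cols = ["Meal_Type", "Preparation_Method"]
boolean_cols = ["Is_Vegan", "Is_Gluten_Free"]

# Build one column -> dtype mapping and cast everything in a single call.
dtype_map = {
    **{col: "category" for col in categorical_cols},
    **{col: "bool" for col in boolean_cols},
}
toy = toy.astype(dtype_map)

print(toy.dtypes)
```

Note that a CSV file does not store pandas dtypes, so a `category` cast would have to be reapplied after reloading the saved file.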


In [5]:
# Force every numeric column to a numeric dtype; unparseable entries become NaN.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
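
To illustrate what `errors='coerce'` does, here is a minimal sketch on a made-up column (the values are hypothetical, not taken from the dataset). It also counts how many entries were turned into `NaN` by the coercion, which may be worth checking before imputing.

```python
import pandas as pd

# Toy column mixing a numeric string, an unparseable entry, a number, and a missing value.
raw = pd.Series(["120.5", "n/a", 87, None], name="Calories")

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but are NaN after coercion.
newly_missing = int(coerced.isna().sum() - raw.isna().sum())

print(coerced.tolist())
print(newly_missing)
```

In this notebook the numeric columns are already `float64` in the `df.info()` output, so the coercion should leave them unchanged and mainly serves as a safeguard.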

In [6]:
from sklearn.impute import SimpleImputer

# Fill missing numeric values with each column's median.
imputer = SimpleImputer(strategy='median')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
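
As a rough illustration of median imputation, the sketch below uses plain pandas on invented values instead of `SimpleImputer`. For numeric columns, `fillna` with the column medians should give the same fill values, but treat it as an equivalent sketch, not the notebook's method.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Calories": [100.0, None, 300.0, 1000.0],
    "Protein": [5.0, 7.0, None, 9.0],
})

# Record how many values are missing per column before filling them.
missing_before = toy.isna().sum()

# DataFrame.median skips NaN by default; fillna fills each column with its own median.
medians = toy.median()
filled = toy.fillna(medians)

print(missing_before.to_dict())
print(filled)
```

The median is less sensitive to extreme values than the mean: the outlying `1000.0` in the toy `Calories` column does not pull the fill value upward.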

In [7]:
# Drop rows that exactly duplicate an earlier row (the first occurrence is kept).
df = df.drop_duplicates()
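
It can help to count duplicates before dropping them, so the number of removed rows is documented. A minimal sketch on invented rows:

```python
import pandas as pd

# Hypothetical toy rows; the second row exactly repeats the first.
toy = pd.DataFrame({
    "Food_Name": ["Pizza", "Pizza", "Salad", "Pizza"],
    "Calories": [285.0, 285.0, 150.0, 300.0],
})

rows_before = len(toy)

# duplicated() flags rows identical to an earlier row across all columns.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
rows_removed = rows_before - len(deduped)

print(n_duplicates, rows_removed)
```

Only rows that match in every column count as duplicates, which is why the last `Pizza` row (different `Calories`) is kept.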

In [8]:
# Sanity check: remaining missing values, remaining duplicates, and final shape.
df.isna().sum(), df.duplicated().sum(), df.shape


(Calories              0
 Protein               0
 Fat                   0
 Carbs                 0
 Sugar                 0
 Fiber                 0
 Sodium                0
 Cholesterol           0
 Glycemic_Index        0
 Water_Content         0
 Serving_Size          0
 Meal_Type             0
 Preparation_Method    0
 Is_Vegan              0
 Is_Gluten_Free        0
 Food_Name             0
 dtype: int64,
 np.int64(0),
 (31387, 16))

In [9]:
df.to_csv('../data/processed/clean_food_data.csv', index=False)
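
A quick round-trip check after saving can confirm that the file reloads with the expected shape and columns. The sketch below writes a small invented frame to a temporary directory instead of the project path, so every value in it is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
clean = pd.DataFrame({
    "Calories": [285.0, 150.0],
    "Meal_Type": ["dinner", "lunch"],
    "Is_Vegan": [False, True],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean_food_data.csv"
    # index=False keeps the row index out of the file, as in the cell above.
    clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == clean.shape
same_columns = list(reloaded.columns) == list(clean.columns)

print(same_shape, same_columns)
```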


## Data Cleaning Summary

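- The 11 numerical features were passed through `pd.to_numeric(errors='coerce')`. They already appear as `float64` in the `df.info()` output, so this step mainly acts as a safeguard.
- Missing values in numerical columns (375 per column: 31,325 non-null out of 31,700 rows) were filled using median imputation. The categorical, boolean, and `Food_Name` columns had no missing values.
- Duplicate rows were removed after imputation, reducing the dataset from 31,700 to 31,387 rows (313 rows dropped).
- The cleaned dataset was saved to `../data/processed/clean_food_data.csv` for downstream EDA and modeling.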