# Training a Healthy Meal Classifier

This notebook trains a simple classifier that predicts whether a meal is
"healthy" or "unhealthy" based on its nutritional values.

- Input data: `dataset/healthy_eating_dataset.csv`
- Target column: `is_healthy` (0 = unhealthy, 1 = healthy)
- Features used:
  - calories
  - protein_g
  - carbs_g
  - fat_g
  - fiber_g
  - sugar_g
  - sodium_mg

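We will train a logistic regression model with `class_weight="balanced"` to
compensate for the class imbalance (healthy meals are the minority class),
evaluate it, tune the decision threshold, and save the trained pipeline,
together with the feature list and the chosen threshold, to
`models/health_classifier.joblib`.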

This model will later be used inside the agent as a tool.


# Imports

In [38]:
import os

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    f1_score,
)

import joblib

# Notebooks do not define __file__, so the project root is derived from the
# current working directory. This assumes the notebook is run from a folder
# that sits directly under the project root.
BASE_DIR = os.path.dirname(os.getcwd())
DATA_DIR = os.path.join(BASE_DIR, "dataset")
MODELS_DIR = os.path.join(BASE_DIR, "models")

os.makedirs(MODELS_DIR, exist_ok=True)

print("Base dir:", BASE_DIR)
print("Data dir:", DATA_DIR)
print("Models dir:", MODELS_DIR)


Base dir: c:\AI LLM\AgentAI
Data dir: c:\AI LLM\AgentAI\dataset
Models dir: c:\AI LLM\AgentAI\models
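The path setup above only works when the notebook is started from a folder one level below the project root. As a rough alternative, the sketch below walks upward until it finds a folder that contains a `dataset` directory. The helper name `find_project_root` and the choice of `dataset` as a marker are assumptions made for illustration. The demonstration runs on a temporary directory tree, not on the real project.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "dataset") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demonstration on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "dataset").mkdir()
    notebook_dir = root / "notebooks" / "experiments"
    notebook_dir.mkdir(parents=True)
    root_found = find_project_root(notebook_dir) == root

print("Project root located correctly:", root_found)
```

In the notebook this could be called as `find_project_root(Path.cwd())`, which would no longer depend on how deep the notebook folder is nested.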


# Load dataset

In [40]:
healthy_csv_path = os.path.join(DATA_DIR, "healthy_eating_dataset.csv")

print("Healthy dataset path:", healthy_csv_path)

df = pd.read_csv(healthy_csv_path)
print("Healthy dataset shape:", df.shape)
df.head()


Healthy dataset path: c:\AI LLM\AgentAI\dataset\healthy_eating_dataset.csv
Healthy dataset shape: (2000, 20)


Unnamed: 0,meal_id,meal_name,cuisine,meal_type,diet_type,calories,protein_g,carbs_g,fat_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,serving_size_g,cooking_method,prep_time_min,cook_time_min,rating,is_healthy,image_url
0,1,Kid Pasta,Indian,Lunch,Keto,737,52.4,43.9,34.3,16.8,42.9,2079,91,206,Grilled,47,56,4.4,0,https://example.com/images/meal_1.jpg
1,2,Husband Rice,Mexican,Lunch,Paleo,182,74.7,144.4,0.1,22.3,38.6,423,7,317,Roasted,51,34,2.4,0,https://example.com/images/meal_2.jpg
2,3,Activity Rice,Indian,Snack,Paleo,881,52.9,97.3,18.8,20.0,37.5,2383,209,395,Boiled,58,29,4.3,0,https://example.com/images/meal_3.jpg
3,4,Another Salad,Mexican,Snack,Keto,427,17.5,73.1,7.6,9.8,41.7,846,107,499,Grilled,14,81,4.6,0,https://example.com/images/meal_4.jpg
4,5,Quite Stew,Thai,Lunch,Vegan,210,51.6,104.3,26.3,24.8,18.2,1460,42,486,Raw,47,105,4.3,0,https://example.com/images/meal_5.jpg


## General info

In [41]:
df.info()

print("\nClass distribution for is_healthy:")
print(df["is_healthy"].value_counts())

print("\nClass proportion:")
print(df["is_healthy"].value_counts(normalize=True))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   meal_id         2000 non-null   int64  
 1   meal_name       2000 non-null   object 
 2   cuisine         2000 non-null   object 
 3   meal_type       2000 non-null   object 
 4   diet_type       2000 non-null   object 
 5   calories        2000 non-null   int64  
 6   protein_g       2000 non-null   float64
 7   carbs_g         2000 non-null   float64
 8   fat_g           2000 non-null   float64
 9   fiber_g         2000 non-null   float64
 10  sugar_g         2000 non-null   float64
 11  sodium_mg       2000 non-null   int64  
 12  cholesterol_mg  2000 non-null   int64  
 13  serving_size_g  2000 non-null   int64  
 14  cooking_method  2000 non-null   object 
 15  prep_time_min   2000 non-null   int64  
 16  cook_time_min   2000 non-null   int64  
 17  rating          2000 non-null   f

## Cleaning and sanity check

In [43]:
FEATURE_COLUMNS = [
    "calories",
    "protein_g",
    "carbs_g",
    "fat_g",
    "fiber_g",
    "sugar_g",
    "sodium_mg",
]

TARGET_COLUMN = "is_healthy"

# Remove exact duplicates
before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"Removed {before - after} duplicate rows.")

# Drop rows with missing values in features or target
df = df.dropna(subset=FEATURE_COLUMNS + [TARGET_COLUMN])

# Simple numerical sanity filters: calories must be positive and every other
# nutrient must be non-negative.
df = df[
    (df["calories"] > 0)
    & (df["protein_g"] >= 0)
    & (df["carbs_g"] >= 0)
    & (df["fat_g"] >= 0)
    & (df["fiber_g"] >= 0)
    & (df["sugar_g"] >= 0)
    & (df["sodium_mg"] >= 0)
]

print("\nDataset shape after basic cleaning:", df.shape)

df[FEATURE_COLUMNS].describe()


Removed 0 duplicate rows.

Dataset shape after basic cleaning: (2000, 20)


Unnamed: 0,calories,protein_g,carbs_g,fat_g,fiber_g,sugar_g,sodium_mg
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,650.0615,42.86375,75.92425,30.0653,15.2458,24.6023,1257.316
std,315.419877,21.992887,42.749671,17.573243,8.754933,14.48074,709.587762
min,100.0,5.0,0.0,0.0,0.0,0.0,50.0
25%,372.0,23.6,39.2,14.8,7.6,12.0,647.5
50%,648.0,43.6,75.95,30.3,15.15,24.75,1273.0
75%,914.5,61.9,113.025,45.2,23.2,37.2,1854.5
max,1200.0,79.9,150.0,60.0,30.0,50.0,2499.0
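Because the model will later receive meals one at a time, the same sanity rules can be applied to a single record before prediction. The following is a minimal sketch under that assumption. The function name `meal_problems` and the dictionary input format are hypothetical and do not appear in the notebook.

```python
import math

FEATURE_COLUMNS = [
    "calories", "protein_g", "carbs_g", "fat_g",
    "fiber_g", "sugar_g", "sodium_mg",
]


def meal_problems(meal):
    """Return a list of rule violations, mirroring the DataFrame filters above."""
    problems = []
    for name in FEATURE_COLUMNS:
        value = meal.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"{name}: missing")
        elif name == "calories" and value <= 0:
            problems.append("calories: must be > 0")
        elif name != "calories" and value < 0:
            problems.append(f"{name}: must be >= 0")
    return problems


# Values taken from the first row shown by df.head().
good_meal = {
    "calories": 737, "protein_g": 52.4, "carbs_g": 43.9, "fat_g": 34.3,
    "fiber_g": 16.8, "sugar_g": 42.9, "sodium_mg": 2079,
}
bad_meal = dict(good_meal, calories=0, sugar_g=-1.0)

print(meal_problems(good_meal))
print(meal_problems(bad_meal))
```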


# Train/test split

In [44]:
X = df[FEATURE_COLUMNS].values
y = df[TARGET_COLUMN].values

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

print("Train size:", X_train.shape[0])
print("Test size:", X_test.shape[0])


Train size: 1600
Test size: 400
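Passing `stratify=y` keeps the share of healthy meals roughly equal in both splits, which matters when one class is rare. The sketch below illustrates this on synthetic labels. The count of 185 positives out of 2000 is an assumed figure chosen only to give a prevalence of about 9%; it is not read from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 185 positives among 2000 samples.
y_demo = np.array([1] * 185 + [0] * 1815)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.4f}")
print(f"Positive rate in test:  {test_rate:.4f}")
```

Without `stratify`, the number of positives in a 400-row test split would vary from seed to seed, which makes metrics for the minority class noisier.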


# Pipeline

In [45]:
log_reg = LogisticRegression(
    class_weight="balanced",  # handle class imbalance
    max_iter=500,
    solver="liblinear",
)

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("clf", log_reg),
    ]
)

pipeline


Pipeline
parameter,value
steps,"[('scaler', ...), ('clf', ...)]"
transform_input,None
memory,None
verbose,False

StandardScaler
parameter,value
copy,True
with_mean,True
with_std,True

LogisticRegression
parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,'balanced'
random_state,None
solver,'liblinear'
max_iter,500
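With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives a proportionally larger weight. The sketch below reproduces that arithmetic in plain Python. The counts 363 and 37 are the class supports of the test split and are used here only as an illustration; the real weights are computed from the training labels.

```python
def balanced_class_weights(class_counts):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Illustrative counts (assumption): 363 unhealthy and 37 healthy meals.
weights = balanced_class_weights({0: 363, 1: 37})
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these counts a misclassified healthy meal costs roughly ten times as much as a misclassified unhealthy one, which pushes the model toward high recall on the healthy class.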


# Training

In [46]:
pipeline.fit(X_train, y_train)
print("Training completed.")


Training completed.


# Evaluate

In [47]:
y_proba = pipeline.predict_proba(X_test)[:, 1]  # probability of class 1 (healthy)
y_pred_05 = (y_proba >= 0.5).astype(int)

print("=== Evaluation with default threshold 0.5 ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_05):.3f}")
print(f"ROC-AUC:  {roc_auc_score(y_test, y_proba):.3f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred_05))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_05))


=== Evaluation with default threshold 0.5 ===
Accuracy: 0.897
ROC-AUC:  0.967

Classification report:
              precision    recall  f1-score   support

           0       0.99      0.89      0.94       363
           1       0.47      0.95      0.63        37

    accuracy                           0.90       400
   macro avg       0.73      0.92      0.79       400
weighted avg       0.95      0.90      0.91       400

Confusion matrix:
[[324  39]
 [  2  35]]
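The figures in the classification report can be recovered from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the matrix above reads as TN = 324, FP = 39, FN = 2, TP = 35. The short calculation below works through the healthy-class metrics.

```python
# Confusion matrix entries copied from the output above.
tn, fp = 324, 39
fn, tp = 2, 35

precision = tp / (tp + fp)          # share of predicted-healthy meals that are healthy
recall = tp / (tp + fn)             # share of healthy meals that were found
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1        = {f1:.3f}")
print(f"accuracy  = {accuracy:.4f}")
```

The low precision comes from the 39 false positives: at threshold 0.5 the model catches almost every healthy meal but labels many unhealthy ones as healthy.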


## Threshold for healthy class

In [48]:
thresholds = np.linspace(0.1, 0.9, 17)
best_thr = 0.5
best_f1 = -1.0

print("\n=== Threshold sweep (optimizing F1 for healthy class = 1) ===")
for thr in thresholds:
    y_pred_thr = (y_proba >= thr).astype(int)
    f1 = f1_score(y_test, y_pred_thr, pos_label=1)
    print(f"Threshold {thr:.2f} -> F1 (healthy class) = {f1:.3f}")
    if f1 > best_f1:
        best_f1 = f1
        best_thr = thr

print(f"\nBest threshold based on F1 for class 1: {best_thr:.2f} (F1 = {best_f1:.3f})")

# Evaluate at this best threshold. It was selected on the same test split,
# so the scores below are optimistic estimates.
y_pred_best = (y_proba >= best_thr).astype(int)
print("\n=== Evaluation with best threshold ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.3f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred_best))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_best))



=== Threshold sweep (optimizing F1 for healthy class = 1) ===
Threshold 0.10 -> F1 (healthy class) = 0.451
Threshold 0.15 -> F1 (healthy class) = 0.493
Threshold 0.20 -> F1 (healthy class) = 0.532
Threshold 0.25 -> F1 (healthy class) = 0.569
Threshold 0.30 -> F1 (healthy class) = 0.592
Threshold 0.35 -> F1 (healthy class) = 0.612
Threshold 0.40 -> F1 (healthy class) = 0.627
Threshold 0.45 -> F1 (healthy class) = 0.626
Threshold 0.50 -> F1 (healthy class) = 0.631
Threshold 0.55 -> F1 (healthy class) = 0.630
Threshold 0.60 -> F1 (healthy class) = 0.629
Threshold 0.65 -> F1 (healthy class) = 0.621
Threshold 0.70 -> F1 (healthy class) = 0.646
Threshold 0.75 -> F1 (healthy class) = 0.659
Threshold 0.80 -> F1 (healthy class) = 0.651
Threshold 0.85 -> F1 (healthy class) = 0.649
Threshold 0.90 -> F1 (healthy class) = 0.667

Best threshold based on F1 for class 1: 0.90 (F1 = 0.667)

=== Evaluation with best threshold ===
Accuracy: 0.943

Classification report:
              precision    recall
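The sweep in this section picks the threshold and reports its score on the same test predictions. One common alternative is to choose the threshold on a separate validation split and keep the test split for the final report. The sketch below shows the selection step in plain Python, with a small F1 helper written out instead of `f1_score` so that it has no dependencies. The labels and probabilities are made-up toy values, and `pick_threshold` is a hypothetical helper name.

```python
def f1_for_positive(y_true, y_pred):
    """F1 score for class 1, returning 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(y_true, y_proba, thresholds):
    """Return (threshold, f1) with the highest F1; ties keep the lowest threshold."""
    best_thr, best_f1 = None, -1.0
    for thr in thresholds:
        y_pred = [int(p >= thr) for p in y_proba]
        f1 = f1_for_positive(y_true, y_pred)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Toy validation data (made up for illustration).
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.22, 0.41, 0.62, 0.55, 0.33, 0.81, 0.95]
thresholds = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

best_thr, best_f1 = pick_threshold(y_val, p_val, thresholds)
print(f"Chosen threshold: {best_thr:.1f} (F1 = {best_f1:.3f})")
```

If the selected threshold lands on the edge of the sweep range, as it does in the notebook's sweep, widening the range is a reasonable next step.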

# Save the model

In [50]:
model_path = os.path.join(MODELS_DIR, "health_classifier.joblib")
model_bundle = {
    "pipeline": pipeline,
    "feature_columns": FEATURE_COLUMNS,
    "decision_threshold": float(best_thr),
}

joblib.dump(model_bundle, model_path)

print(f"Saved health classifier to: {model_path}")
print("Decision threshold stored:", best_thr)


Saved health classifier to: c:\AI LLM\AgentAI\models\health_classifier.joblib
Decision threshold stored: 0.9


# Sanity check

In [51]:
loaded = joblib.load(model_path)
loaded_pipeline = loaded["pipeline"]
loaded_features = loaded["feature_columns"]
loaded_thr = loaded["decision_threshold"]

print("Loaded feature columns:", loaded_features)
print("Loaded decision threshold:", loaded_thr)

sample = X_test[0:1]
proba = loaded_pipeline.predict_proba(sample)[0, 1]
pred = int(proba >= loaded_thr)

print("Example probability of being healthy:", proba)
print("Example prediction with threshold:", pred)


Loaded feature columns: ['calories', 'protein_g', 'carbs_g', 'fat_g', 'fiber_g', 'sugar_g', 'sodium_mg']
Loaded decision threshold: 0.9
Example probability of being healthy: 0.34862010549688166
Example prediction with threshold: 0
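Since the bundle is meant to be used as an agent tool, a thin wrapper that accepts a meal as a dictionary can keep the feature order and threshold consistent with training. The sketch below is one possible shape for such a wrapper. The function name `classify_meal`, the return format, and the `_StubPipeline` class are assumptions; the stub stands in for the fitted pipeline so that the example runs without the saved file.

```python
class _StubPipeline:
    """Stand-in for the fitted pipeline; returns fixed probabilities."""

    def predict_proba(self, rows):
        # One [P(unhealthy), P(healthy)] pair per input row.
        return [[0.65, 0.35] for _ in rows]


def classify_meal(bundle, meal):
    """Apply the stored pipeline and decision threshold to a single meal dict."""
    # Build one row with the features in the order used during training.
    row = [[float(meal[name]) for name in bundle["feature_columns"]]]
    proba = float(bundle["pipeline"].predict_proba(row)[0][1])
    return {
        "probability_healthy": proba,
        "is_healthy": proba >= bundle["decision_threshold"],
    }


# In the notebook the bundle would come from joblib.load(model_path).
demo_bundle = {
    "pipeline": _StubPipeline(),
    "feature_columns": [
        "calories", "protein_g", "carbs_g", "fat_g",
        "fiber_g", "sugar_g", "sodium_mg",
    ],
    "decision_threshold": 0.9,
}
demo_meal = {
    "calories": 737, "protein_g": 52.4, "carbs_g": 43.9, "fat_g": 34.3,
    "fiber_g": 16.8, "sugar_g": 42.9, "sodium_mg": 2079,
}

result = classify_meal(demo_bundle, demo_meal)
print(result)
```

Reading the feature order from the bundle, instead of hard-coding it in the tool, avoids silent column mismatches if the feature list changes in a later training run.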
