<a href="https://colab.research.google.com/github/aayushis1203/dietcheck/blob/main/01_task1_Baselines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DietCheck Task 1: Dietary Classification - Baseline Models

**Purpose:** This notebook implements the baseline models for Task 1 (Dietary Classification), as required for the A-grade contract:
- ✅ Rule-Based Classifier (FDA thresholds)
- ✅ TF-IDF + Logistic Regression (text-only baseline)
- ✅ BERT Fine-tuned Model (text-only neural baseline)
- ✅ Multimodal BERT (text + numeric features)
- ✅ Comprehensive Evaluation & Error Analysis

**Dataset:** 139 packaged food products with ingredient lists and nutrition facts

**Labels:** Multi-label classification for 4 dietary categories:
- `keto_compliant`: ≤5g net carbs per serving (project-defined threshold; the FDA does not define "keto")
- `high_protein`: ≥10g protein per serving (20% DV)
- `low_sodium`: ≤140mg sodium per serving (FDA standard)
- `low_fat`: ≤3g fat per serving (FDA standard)
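
The four thresholds above can be written directly as a labeling function. The following is a minimal sketch, not the notebook's actual rule-based classifier; the field names `net_carbs_g`, `protein_g`, `sodium_mg`, and `fat_g` are hypothetical and may not match the dataset's column names.

```python
from typing import Dict

# Thresholds copied from the label definitions above.
# Each entry: (hypothetical nutrition field, comparison kind, limit per serving).
THRESHOLDS = {
    "keto_compliant": ("net_carbs_g", "max", 5.0),
    "high_protein": ("protein_g", "min", 10.0),
    "low_sodium": ("sodium_mg", "max", 140.0),
    "low_fat": ("fat_g", "max", 3.0),
}


def derive_labels(nutrition: Dict[str, float]) -> Dict[str, int]:
    """Return a 0/1 label for each dietary category for one product."""
    labels = {}
    for label, (field, kind, limit) in THRESHOLDS.items():
        value = nutrition[field]
        if kind == "max":
            labels[label] = int(value <= limit)  # "at most" thresholds
        else:
            labels[label] = int(value >= limit)  # "at least" thresholds
    return labels


# Illustrative values only, not taken from the dataset.
example = {"net_carbs_g": 2.0, "protein_g": 12.0, "sodium_mg": 300.0, "fat_g": 8.0}
print(derive_labels(example))
# {'keto_compliant': 1, 'high_protein': 1, 'low_sodium': 0, 'low_fat': 0}
```

Because the labels are independent, one product can satisfy several categories at once, which is why the task is framed as multi-label rather than multi-class.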

In [2]:
# ============================================================================
# DietCheck Task 1: Dietary Classification - Baseline Models
# ============================================================================
#
# This notebook implements four baseline models for automatic dietary
# classification of packaged food products:
#   1. Rule-Based Classifier (FDA threshold-based)
#   2. TF-IDF + Logistic Regression (text features only)
#   3. BERT Fine-tuned Model (transformer-based text classification)
#   4. Multimodal BERT (combining text embeddings + numeric features)

# Standard library imports
import os
import sys
import warnings
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import json
from pathlib import Path

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning - classical methods
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report,
    precision_recall_fscore_support,
    hamming_loss,
    f1_score,
    multilabel_confusion_matrix
)

# Deep learning
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW

# Transformers
from transformers import (
    BertTokenizer,
    BertModel,
    BertForSequenceClassification,
    get_linear_schedule_with_warmup
)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# Configure plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("✅ All libraries imported successfully")
print(f"🔧 PyTorch version: {torch.__version__}")
print(f"🔧 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🔧 CUDA device: {torch.cuda.get_device_name(0)}")
print(f"🔧 Using device: {device}")

✅ All libraries imported successfully
🔧 PyTorch version: 2.9.0+cu126
🔧 CUDA available: False
🔧 Using device: cpu


## 📦 Section 1: Setup and Data Loading

This section covers:
1. **Library imports** - Loading the Python packages needed for ML/NLP (done in the cell above)
2. **Workspace setup** - Locating the cloned GitHub repository (no Google Drive needed)
3. **Data loading** - Reading `train.csv` and `test.csv`, which were created in notebook 00
4. **Environment verification** - Checking GPU availability and file paths
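
Before loading the splits, it can help to confirm that both files exist and share the expected columns. The sketch below is one possible check using only the standard library. It assumes the label names from the notebook header appear verbatim as CSV column names, which the source does not confirm; `check_splits` and `read_header` are hypothetical helpers.

```python
import csv
from pathlib import Path
from typing import List

# Label names taken from the notebook header; that they are CSV column
# names is an assumption.
LABEL_COLUMNS = ["keto_compliant", "high_protein", "low_sodium", "low_fat"]


def read_header(csv_path: Path) -> List[str]:
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def check_splits(data_dir: Path) -> List[str]:
    """Return a list of problems found in train.csv / test.csv (empty if none)."""
    problems = []
    headers = {}
    for name in ("train.csv", "test.csv"):
        path = data_dir / name
        if not path.exists():
            problems.append(f"{name} is missing")
            continue
        headers[name] = read_header(path)
        missing = [col for col in LABEL_COLUMNS if col not in headers[name]]
        if missing:
            problems.append(f"{name} lacks label columns: {missing}")
    # Only compare headers when both files were actually read.
    if len(headers) == 2 and headers["train.csv"] != headers["test.csv"]:
        problems.append("train.csv and test.csv have different columns")
    return problems
```

Calling `check_splits(DATA_DIR)` after the workspace setup cell would return an empty list when both splits look consistent, and a list of readable messages otherwise.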

In [3]:
# ============================================================================
# Workspace Setup - GitHub Repository Integration
# ============================================================================
#
# This cell sets up paths to your GitHub repository where data is stored.
# Unlike notebook 00, we DON'T clone the repo here - we just connect to it
# if it already exists, or guide you to clone it manually.
#
# Expected structure after running notebook 00:
# dietcheck/
#   ├── data/
#   │   ├── products.csv
#   │   ├── train.csv
#   │   └── test.csv
#   ├── results/
#   └── models/

# GitHub repository URL (update this to YOUR repository)
GITHUB_REPO = "https://github.com/aayushis1203/dietcheck.git"
REPO_NAME = GITHUB_REPO.split('/')[-1].replace('.git', '')

def find_repo_root():
    """
    Locate the repository root directory.
    Searches current directory and parent directories for the repo.
    """
    current = Path.cwd()

    # Check current directory first
    if (current / REPO_NAME).exists():
        return current / REPO_NAME

    # Check if we're already inside the repo
    if current.name == REPO_NAME:
        return current

    # Check parent directories
    for parent in current.parents:
        if (parent / REPO_NAME).exists():
            return parent / REPO_NAME

    return None

def setup_workspace():
    """
    Set up workspace paths and verify data availability.
    Returns paths to repo root, data directory, and results directory.
    """
    repo_root = find_repo_root()

    if repo_root is None:
        print("❌ Repository not found!")
        print(f"\n📋 To set up your workspace:")
        print(f"1. Run this command in Colab:")
        print(f"   !git clone {GITHUB_REPO}")
        print(f"2. Re-run this cell")
        raise FileNotFoundError(f"Repository '{REPO_NAME}' not found")

    # Set up directory paths
    data_dir = repo_root / "data"
    results_dir = repo_root / "results"
    models_dir = repo_root / "models"

    # Create directories if they don't exist
    results_dir.mkdir(exist_ok=True)
    models_dir.mkdir(exist_ok=True)

    # Verify critical files exist
    required_files = {
        "products.csv": data_dir / "products.csv",
        "train.csv": data_dir / "train.csv",
        "test.csv": data_dir / "test.csv"
    }

    missing_files = []
    for name, path in required_files.items():
        if not path.exists():
            missing_files.append(name)

    if missing_files:
        print(f"❌ Missing required files: {', '.join(missing_files)}")
        print(f"\n💡 These files should have been created by notebook 00.")
        print(f"   Please run notebook 00 first to generate train/test splits.")
        raise FileNotFoundError(f"Missing files: {missing_files}")

    print(f"✅ Repository found: {repo_root}")
    print(f"✅ Data directory: {data_dir}")
    print(f"✅ Results directory: {results_dir}")
    print(f"✅ All required files present")

    return repo_root, data_dir, results_dir

# Execute setup
REPO_ROOT, DATA_DIR, RESULTS_DIR = setup_workspace()

print(f"\n📊 Workspace ready!")
print(f"   REPO_ROOT = {REPO_ROOT}")
print(f"   DATA_DIR = {DATA_DIR}")
print(f"   RESULTS_DIR = {RESULTS_DIR}")

❌ Repository not found!

📋 To set up your workspace:
1. Run this command in Colab:
   !git clone https://github.com/aayushis1203/dietcheck.git
2. Re-run this cell


FileNotFoundError: Repository 'dietcheck' not found
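
The failure above occurs because the repository has not been cloned into the runtime yet. One possible way to make the setup recover on its own is sketched below. It assumes `git` is available on the PATH (as it typically is in Colab); `ensure_repo` is a hypothetical helper and not part of the notebook.

```python
import subprocess
from pathlib import Path

GITHUB_REPO = "https://github.com/aayushis1203/dietcheck.git"


def ensure_repo(repo_url: str, base_dir: Path) -> Path:
    """Return the local repo path, cloning into base_dir only if it is absent."""
    repo_name = repo_url.rstrip("/").split("/")[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    repo_path = base_dir / repo_name
    if repo_path.is_dir():
        return repo_path  # already present; nothing to clone
    # check=True raises CalledProcessError if the clone fails.
    subprocess.run(["git", "clone", repo_url, str(repo_path)], check=True)
    return repo_path
```

Running `ensure_repo(GITHUB_REPO, Path.cwd())` before `setup_workspace()` would clone the repository on a fresh runtime and do nothing on later runs. The data files still need to come from notebook 00 if they are not committed to the repository.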