<a href="https://colab.research.google.com/github/aayushis1203/dietcheck/blob/main/01_task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Task 1: Dietary Classification

## 📋 What This Notebook Does

This is a **complete Task 1 implementation** covering all B+ grade requirements:

1. ✅ Apply FDA labels to full dataset (279 products)
2. ✅ Cohen's Kappa analysis (25 products, 2 annotators)
3. ✅ Implement 3 baseline models (Rule-based, TF-IDF, BERT)
4. ✅ Evaluation (Macro-F1, Micro-F1, per-label metrics)
5. ✅ Error analysis (20 misclassified products)
6. ✅ Generate documentation for final report

---

## 🚀 How to Use

### **Upload 3 files to Colab:**
1. `products.csv` (your full dataset)
2. `task1_annotation_AAYUSHI.csv` (Aayushi's 25 annotations)
3. `task1_annotation_RAHUL.csv` (Rahul's 25 annotations)

### **Then:**
- Click "Runtime" → "Run all"
- Wait ~60-90 minutes (BERT training takes time)
- Download all output files

---

## 📦 Outputs You'll Get

**Data files:**
- `products_with_task1_labels.csv` - Full dataset with labels
- `task1_annotations_consensus.csv` - Consensus from kappa analysis
- `task1_disagreements.csv` - Products where annotators disagreed

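**Model files:**
- `rule_based_predictions.csv`
- `tfidf_model.pkl`, `tfidf_predictions.csv`
- `bert_predictions.csv` (the fine-tuned BERT weights are not written to disk; training runs with `save_strategy="no"`)
- `model_comparison.csv`

**Reports:**
- `TASK1_COHENS_KAPPA_REPORT.md`
- `TASK1_MODEL_COMPARISON.md`
- `TASK1_ERROR_ANALYSIS.md`
- `error_analysis_20products.csv`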

---

## ⚙️ FDA Thresholds Used

- **Keto-compliant:** Net carbs ≤ 5g per serving (a project convention; the FDA does not define a "keto" claim)
- **High protein:** Protein ≥ 10g per serving (20% DV)
- **Low sodium:** Sodium ≤ 140mg per serving
- **Low fat:** Fat ≤ 3g per serving
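The thresholds above can be sketched as a small pure-Python function. This is only an illustration, not the notebook's implementation (Cell 3 uses vectorized pandas comparisons). The helper name `label_product` and the dict-based input are assumptions made for this example, and a missing value is treated as not meeting the threshold, which mirrors how a NaN comparison evaluates to False in pandas.

```python
# Hypothetical helper: applies the four per-serving thresholds to one product.
KETO_THRESHOLD = 5.0           # net carbs <= 5 g
HIGH_PROTEIN_THRESHOLD = 10.0  # protein >= 10 g
LOW_SODIUM_THRESHOLD = 140.0   # sodium <= 140 mg
LOW_FAT_THRESHOLD = 3.0        # fat <= 3 g


def label_product(nutrition):
    """Return 0/1 labels for a dict of per-serving nutrition values."""
    def at_most(key, limit):
        value = nutrition.get(key)
        return int(value is not None and value <= limit)

    def at_least(key, limit):
        value = nutrition.get(key)
        return int(value is not None and value >= limit)

    return {
        "keto_compliant": at_most("net_carbs_per_serving", KETO_THRESHOLD),
        "high_protein": at_least("protein_per_serving", HIGH_PROTEIN_THRESHOLD),
        "low_sodium": at_most("sodium_per_serving", LOW_SODIUM_THRESHOLD),
        "low_fat": at_most("fat_per_serving", LOW_FAT_THRESHOLD),
    }


# Boundary values are inclusive; a missing fat value yields label 0.
example = {
    "net_carbs_per_serving": 5.0,
    "protein_per_serving": 9.9,
    "sodium_per_serving": 140.0,
    "fat_per_serving": None,
}
print(label_product(example))
```

The example product sits exactly on the keto and sodium limits, so both labels come out as 1, while 9.9g of protein falls just short of the high-protein cutoff.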


In [4]:
# ======================================================================
# Cell 1: Setup & Install Dependencies
# ======================================================================

print("Installing required packages...\n")

!pip install -q transformers torch scikit-learn pandas numpy

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("✅ Setup complete")
print(f"Random seed: {RANDOM_SEED}")

Installing required packages...

✅ Setup complete
Random seed: 42
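Cell 1 seeds only NumPy, while the BERT classifier head built in Cell 10 is initialized from PyTorch's random number generator. A minimal sketch of seeding the other generators as well is shown below. The helper name `seed_everything` is an assumption for this example, and the NumPy and PyTorch calls are skipped when those packages are not installed.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value reproduces the same draw.
seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Seeding does not by itself guarantee bit-identical GPU training runs, since some CUDA kernels are non-deterministic.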


---
# Part 1: Apply Labels to Full Dataset
---

In [5]:
# ======================================================================
# Cell 2: Load Products Dataset
# ======================================================================

print("üì• LOADING PRODUCTS DATASET\n")
print("="*70)

try:
    products = pd.read_csv("products.csv")
    print(f"‚úÖ Loaded products.csv")
    print(f"   Total products: {len(products)}")
    print(f"   Columns: {len(products.columns)}")

except FileNotFoundError:
    print("‚ùå ERROR: products.csv not found!")
    print("\n‚ö†Ô∏è  Please upload products.csv to Colab before running.")
    raise

print("\n" + "="*70)

📥 LOADING PRODUCTS DATASET

✅ Loaded products.csv
   Total products: 279
   Columns: 28



In [6]:
# ======================================================================
# Cell 3: Apply FDA Thresholds to All Products
# ======================================================================

print("⮕ APPLYING FDA LABELS TO ALL PRODUCTS\n")
print("="*70)

# FDA Thresholds
KETO_THRESHOLD = 5.0  # Net carbs ≤ 5g
HIGH_PROTEIN_THRESHOLD = 10.0  # Protein ≥ 10g (20% DV)
LOW_SODIUM_THRESHOLD = 140.0  # Sodium ≤ 140mg
LOW_FAT_THRESHOLD = 3.0  # Fat ≤ 3g

print("FDA Thresholds:")
print(f"  Keto: net_carbs ≤ {KETO_THRESHOLD}g")
print(f"  High protein: protein ≥ {HIGH_PROTEIN_THRESHOLD}g")
print(f"  Low sodium: sodium ≤ {LOW_SODIUM_THRESHOLD}mg")
print(f"  Low fat: fat ≤ {LOW_FAT_THRESHOLD}g")

# Apply thresholds
products['keto_compliant'] = (
    products['net_carbs_per_serving'] <= KETO_THRESHOLD
).fillna(False).astype(int)

products['high_protein'] = (
    products['protein_per_serving'] >= HIGH_PROTEIN_THRESHOLD
).fillna(False).astype(int)

products['low_sodium'] = (
    products['sodium_per_serving'] <= LOW_SODIUM_THRESHOLD
).fillna(False).astype(int)

products['low_fat'] = (
    products['fat_per_serving'] <= LOW_FAT_THRESHOLD
).fillna(False).astype(int)

print("\nüìä Label Distribution:")
print(f"  Keto-compliant: {products['keto_compliant'].sum()}/{len(products)} ({products['keto_compliant'].sum()/len(products)*100:.1f}%)")
print(f"  High protein: {products['high_protein'].sum()}/{len(products)} ({products['high_protein'].sum()/len(products)*100:.1f}%)")
print(f"  Low sodium: {products['low_sodium'].sum()}/{len(products)} ({products['low_sodium'].sum()/len(products)*100:.1f}%)")
print(f"  Low fat: {products['low_fat'].sum()}/{len(products)} ({products['low_fat'].sum()/len(products)*100:.1f}%)")

# Save labeled dataset
products.to_csv("products_with_task1_labels.csv", index=False)
print("\n✅ Saved: products_with_task1_labels.csv")

print("\n" + "="*70)

⮕ APPLYING FDA LABELS TO ALL PRODUCTS

FDA Thresholds:
  Keto: net_carbs ≤ 5.0g
  High protein: protein ≥ 10.0g
  Low sodium: sodium ≤ 140.0mg
  Low fat: fat ≤ 3.0g

📊 Label Distribution:
  Keto-compliant: 90/279 (32.3%)
  High protein: 105/279 (37.6%)
  Low sodium: 124/279 (44.4%)
  Low fat: 103/279 (36.9%)

✅ Saved: products_with_task1_labels.csv



---
# Part 2: Cohen's Kappa Analysis
---

In [7]:
# ======================================================================
# Cell 4: Load Annotator Files
# ======================================================================

from sklearn.metrics import cohen_kappa_score, confusion_matrix

print("üì• LOADING ANNOTATOR FILES\n")
print("="*70)

try:
    aayushi = pd.read_csv("task1_annotation_AAYUSHI.csv")
    rahul = pd.read_csv("task1_annotation_RAHUL.csv")

    print(f"✅ Loaded Aayushi's annotations: {len(aayushi)} products")
    print(f"✅ Loaded Rahul's annotations: {len(rahul)} products")

    if len(aayushi) != len(rahul):
        print("\n⚠️  WARNING: Different number of products!")

except FileNotFoundError:
    print("❌ ERROR: Annotation files not found!")
    print("\n⚠️  Please upload both:")
    print("   • task1_annotation_AAYUSHI.csv")
    print("   • task1_annotation_RAHUL.csv")
    raise

print("\n" + "="*70)

📥 LOADING ANNOTATOR FILES

✅ Loaded Aayushi's annotations: 25 products
✅ Loaded Rahul's annotations: 25 products



In [8]:
# ======================================================================
# Cell 5: Calculate Cohen's Kappa
# ======================================================================

print("üìä CALCULATING COHEN'S KAPPA\n")
print("="*70)

labels = ['keto_compliant_manual', 'high_protein_manual', 'low_sodium_manual', 'low_fat_manual']
kappa_results = {}

for label in labels:
    a_labels = aayushi[label].fillna(-1).astype(int)
    r_labels = rahul[label].fillna(-1).astype(int)

    valid_mask = (a_labels != -1) & (r_labels != -1)
    a_clean = a_labels[valid_mask]
    r_clean = r_labels[valid_mask]

    if len(a_clean) == 0:
        continue

    kappa = cohen_kappa_score(a_clean, r_clean)
    kappa_results[label] = kappa
    agreement = (a_clean == r_clean).sum() / len(a_clean) * 100

    # Agreement bands follow Landis & Koch (1977)
    if kappa < 0.0:
        interpretation = "Poor"
    elif kappa < 0.20:
        interpretation = "Slight"
    elif kappa < 0.40:
        interpretation = "Fair"
    elif kappa < 0.60:
        interpretation = "Moderate"
    elif kappa < 0.80:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    status = "✅" if kappa >= 0.65 else "⚠️"

    print(f"\n{label.upper().replace('_', ' ')}:")
    print(f"  κ = {kappa:.3f} ({interpretation}) {status}")
    print(f"  Agreement: {agreement:.1f}%")

    cm = confusion_matrix(a_clean, r_clean, labels=[0, 1])
    print(f"  Confusion: [[{cm[0,0]}, {cm[0,1]}], [{cm[1,0]}, {cm[1,1]}]]")

avg_kappa = np.mean(list(kappa_results.values()))
print(f"\n{'='*70}")
print(f"\nAverage κ = {avg_kappa:.3f}")
print(f"B+ requirement: κ ≥ 0.65 {'✅ PASS' if avg_kappa >= 0.65 else '⚠️ FAIL'}")

if avg_kappa == 1.0:
    print("\n⚠️  Perfect agreement - may indicate non-independent annotation")

print("\n" + "="*70)

📊 CALCULATING COHEN'S KAPPA


KETO COMPLIANT MANUAL:
  κ = 0.702 (Substantial) ✅
  Agreement: 92.0%
  Confusion: [[20, 1], [1, 3]]

HIGH PROTEIN MANUAL:
  κ = 0.915 (Almost Perfect) ✅
  Agreement: 96.0%
  Confusion: [[15, 1], [0, 9]]

LOW SODIUM MANUAL:
  κ = 0.918 (Almost Perfect) ✅
  Agreement: 96.0%
  Confusion: [[14, 0], [1, 10]]

LOW FAT MANUAL:
  κ = 1.000 (Almost Perfect) ✅
  Agreement: 100.0%
  Confusion: [[14, 0], [0, 11]]


Average κ = 0.884
B+ requirement: κ ≥ 0.65 ✅ PASS
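The kappa values above can be reproduced by hand from the printed confusion matrices. The sketch below is a plain-Python illustration of the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each annotator's marginals. The function name `cohens_kappa` is an assumption for this example; the notebook itself uses `sklearn.metrics.cohen_kappa_score`.

```python
def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows: rater A, cols: rater B)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    # Observed agreement: share of items on the diagonal.
    observed = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: product of the two raters' marginal totals per class.
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


# Keto matrix from the output above: 23 of 25 products agree.
keto_cm = [[20, 1], [1, 3]]
print(round(cohens_kappa(keto_cm), 3))  # 0.702
```

For keto, p_o = 23/25 = 0.92 and p_e = (21·21 + 4·4)/625 = 0.7312, which shows why 92% raw agreement still yields only 0.702: most of that agreement is expected by chance when one class dominates. The function is undefined when p_e equals 1 (both raters use a single class for every item).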



In [9]:
# ======================================================================
# Cell 6: Analyze Disagreements & Create Consensus
# ======================================================================

print("üîç ANALYZING DISAGREEMENTS\n")
print("="*70)

disagreements = []

for idx in range(len(aayushi)):
    product_id = aayushi.loc[idx, 'product_id']
    product_name = aayushi.loc[idx, 'name'] if 'name' in aayushi.columns else f"Product {product_id}"

    disagreed_labels = []

    for label in labels:
        a_val = aayushi.loc[idx, label]
        r_val = rahul.loc[idx, label]

        if pd.isna(a_val) or pd.isna(r_val):
            continue

        if int(a_val) != int(r_val):
            disagreed_labels.append(label)

    if disagreed_labels:
        disagreements.append({
            'product_id': product_id,
            'name': product_name,
            'labels': ', '.join(disagreed_labels)
        })

print(f"Disagreements: {len(disagreements)}/{len(aayushi)} products\n")

if disagreements:
    disagreement_df = pd.DataFrame(disagreements)
    disagreement_df.to_csv("task1_disagreements.csv", index=False)
    print("✅ Saved: task1_disagreements.csv")
else:
    print("✅ No disagreements found")

# Build "consensus" labels by recomputing each label from the nutrition columns
# with the FDA thresholds (annotator votes are not used to resolve disagreements)
consensus = aayushi.copy()
for label in labels:
    if label == 'keto_compliant_manual' and 'net_carbs_per_serving' in consensus.columns:
        consensus[label] = (consensus['net_carbs_per_serving'] <= KETO_THRESHOLD).astype(int)
    elif label == 'high_protein_manual' and 'protein_per_serving' in consensus.columns:
        consensus[label] = (consensus['protein_per_serving'] >= HIGH_PROTEIN_THRESHOLD).astype(int)
    elif label == 'low_sodium_manual' and 'sodium_per_serving' in consensus.columns:
        consensus[label] = (consensus['sodium_per_serving'] <= LOW_SODIUM_THRESHOLD).astype(int)
    elif label == 'low_fat_manual' and 'fat_per_serving' in consensus.columns:
        consensus[label] = (consensus['fat_per_serving'] <= LOW_FAT_THRESHOLD).astype(int)

consensus.to_csv("task1_annotations_consensus.csv", index=False)
print("✅ Saved: task1_annotations_consensus.csv")

print("\n" + "="*70)

🔍 ANALYZING DISAGREEMENTS

Disagreements: 4/25 products

✅ Saved: task1_disagreements.csv
✅ Saved: task1_annotations_consensus.csv
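Cell 6 compares the two annotation files row by row, which is only correct if both files list the same products in the same order. A more defensive variant matches rows on `product_id` instead. The sketch below illustrates the idea with plain dicts; the function name `find_disagreements` and the toy rows are assumptions for this example, and missing labels are represented as `None` rather than NaN.

```python
def find_disagreements(rows_a, rows_b, labels, key="product_id"):
    """Compare two annotators' rows matched on `key` rather than on position."""
    by_key_b = {row[key]: row for row in rows_b}
    disagreements = []
    for row_a in rows_a:
        row_b = by_key_b.get(row_a[key])
        if row_b is None:
            continue  # product annotated by only one annotator
        differing = [
            label for label in labels
            if row_a.get(label) is not None
            and row_b.get(label) is not None
            and int(row_a[label]) != int(row_b[label])
        ]
        if differing:
            disagreements.append({key: row_a[key], "labels": ", ".join(differing)})
    return disagreements


# Toy rows: the second annotator's file is in a different order.
rows_a = [
    {"product_id": 1, "low_fat_manual": 1, "low_sodium_manual": 0},
    {"product_id": 2, "low_fat_manual": 0, "low_sodium_manual": 1},
]
rows_b = [
    {"product_id": 2, "low_fat_manual": 0, "low_sodium_manual": 0},
    {"product_id": 1, "low_fat_manual": 1, "low_sodium_manual": 0},
]
result = find_disagreements(rows_a, rows_b, ["low_fat_manual", "low_sodium_manual"])
print(result)
```

A positional comparison of these toy rows would report disagreements on both products; matching on `product_id` reports only the genuine one, on product 2.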



---
# Part 3: Baseline Models
---

In [10]:
# ======================================================================
# Cell 7: Prepare Data for Models
# ======================================================================

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

print("üìä PREPARING DATA FOR MODELS\n")
print("="*70)

# Filter products with complete data
model_data = products[
    products['ingredients'].notna() &
    products['fat_per_serving'].notna() &
    products['protein_per_serving'].notna() &
    products['sodium_per_serving'].notna() &
    products['net_carbs_per_serving'].notna()
].copy()

print(f"Products with complete data: {len(model_data)}/{len(products)}")

# Safeguard only: rows with missing ingredients were already filtered out above
model_data['ingredients'] = model_data['ingredients'].fillna('')

# Train/test split (80/20)
train_data, test_data = train_test_split(
    model_data,
    test_size=0.2,
    random_state=RANDOM_SEED,
    stratify=model_data['keto_compliant']  # Stratify by one label
)

print(f"\nTrain set: {len(train_data)} products")
print(f"Test set: {len(test_data)} products")

# Label columns
label_cols = ['keto_compliant', 'high_protein', 'low_sodium', 'low_fat']

print("\n✅ Data prepared")
print("="*70)

📊 PREPARING DATA FOR MODELS

Products with complete data: 279/279

Train set: 223 products
Test set: 56 products

✅ Data prepared


In [11]:
# ======================================================================
# Cell 8: Model 1 - Rule-Based Classifier
# ======================================================================

print("ü§ñ MODEL 1: RULE-BASED CLASSIFIER\n")
print("="*70)

# Apply FDA thresholds
rule_predictions = pd.DataFrame()
rule_predictions['keto_compliant'] = (test_data['net_carbs_per_serving'] <= 5.0).astype(int)
rule_predictions['high_protein'] = (test_data['protein_per_serving'] >= 10.0).astype(int)
rule_predictions['low_sodium'] = (test_data['sodium_per_serving'] <= 140.0).astype(int)
rule_predictions['low_fat'] = (test_data['fat_per_serving'] <= 3.0).astype(int)

# Evaluate
rule_results = {}
for label in label_cols:
    y_true = test_data[label].values
    y_pred = rule_predictions[label].values

    f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
    f1_micro = f1_score(y_true, y_pred, average='micro', zero_division=0)

    rule_results[label] = {'macro_f1': f1_macro, 'micro_f1': f1_micro}

    print(f"\n{label}:")
    print(f"  Macro-F1: {f1_macro:.3f}")
    print(f"  Micro-F1: {f1_micro:.3f}")

# Average
avg_macro = np.mean([r['macro_f1'] for r in rule_results.values()])
avg_micro = np.mean([r['micro_f1'] for r in rule_results.values()])

print(f"\n{'='*70}")
print(f"Average Macro-F1: {avg_macro:.3f}")
print(f"Average Micro-F1: {avg_micro:.3f}")

# Save predictions
rule_predictions.to_csv("rule_based_predictions.csv", index=False)
print("\n✅ Saved: rule_based_predictions.csv")
print("="*70)

🤖 MODEL 1: RULE-BASED CLASSIFIER


keto_compliant:
  Macro-F1: 1.000
  Micro-F1: 1.000

high_protein:
  Macro-F1: 1.000
  Micro-F1: 1.000

low_sodium:
  Macro-F1: 1.000
  Micro-F1: 1.000

low_fat:
  Macro-F1: 1.000
  Micro-F1: 1.000

Average Macro-F1: 1.000
Average Micro-F1: 1.000

✅ Saved: rule_based_predictions.csv


In [12]:
# ======================================================================
# Cell 9: Model 2 - TF-IDF + Logistic Regression
# ======================================================================

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pickle

print("🤖 MODEL 2: TF-IDF + LOGISTIC REGRESSION\n")
print("="*70)

# TF-IDF from ingredients
tfidf = TfidfVectorizer(max_features=500, min_df=2, max_df=0.8)
X_train_text = tfidf.fit_transform(train_data['ingredients'])
X_test_text = tfidf.transform(test_data['ingredients'])

print(f"TF-IDF features: {X_train_text.shape[1]}")

# Numeric features
numeric_features = ['fat_per_serving', 'protein_per_serving', 'sodium_per_serving', 'net_carbs_per_serving']
scaler = StandardScaler()
X_train_numeric = scaler.fit_transform(train_data[numeric_features])
X_test_numeric = scaler.transform(test_data[numeric_features])

# Combine features
import scipy.sparse as sp
X_train = sp.hstack([X_train_text, X_train_numeric])
X_test = sp.hstack([X_test_text, X_test_numeric])

print(f"Combined features: {X_train.shape[1]}")

# Train models for each label
tfidf_models = {}
tfidf_predictions = pd.DataFrame()
tfidf_results = {}

for label in label_cols:
    print(f"\nTraining {label}...")

    y_train = train_data[label].values
    y_test = test_data[label].values

    model = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    tfidf_predictions[label] = y_pred
    tfidf_models[label] = model

    f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0)
    f1_micro = f1_score(y_test, y_pred, average='micro', zero_division=0)

    tfidf_results[label] = {'macro_f1': f1_macro, 'micro_f1': f1_micro}

    print(f"  Macro-F1: {f1_macro:.3f}")
    print(f"  Micro-F1: {f1_micro:.3f}")

# Average
avg_macro = np.mean([r['macro_f1'] for r in tfidf_results.values()])
avg_micro = np.mean([r['micro_f1'] for r in tfidf_results.values()])

print(f"\n{'='*70}")
print(f"Average Macro-F1: {avg_macro:.3f}")
print(f"Average Micro-F1: {avg_micro:.3f}")

# Save
tfidf_predictions.to_csv("tfidf_predictions.csv", index=False)
with open("tfidf_model.pkl", "wb") as f:
    pickle.dump({'models': tfidf_models, 'tfidf': tfidf, 'scaler': scaler}, f)

print("\n✅ Saved: tfidf_predictions.csv, tfidf_model.pkl")
print("="*70)

🤖 MODEL 2: TF-IDF + LOGISTIC REGRESSION

TF-IDF features: 500
Combined features: 504

Training keto_compliant...
  Macro-F1: 0.644
  Micro-F1: 0.768

Training high_protein...
  Macro-F1: 0.499
  Micro-F1: 0.607

Training low_sodium...
  Macro-F1: 0.738
  Micro-F1: 0.750

Training low_fat...
  Macro-F1: 0.717
  Micro-F1: 0.821

Average Macro-F1: 0.650
Average Micro-F1: 0.737

✅ Saved: tfidf_predictions.csv, tfidf_model.pkl
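The gap between Macro-F1 and Micro-F1 in the results above follows from how the two averages are defined. The sketch below is a plain-Python illustration on toy data, not the notebook's code (which calls `sklearn.metrics.f1_score`); the function names are assumptions for this example. Macro-F1 averages the per-class F1 scores, so a weak minority class pulls it down, while Micro-F1 pools all decisions and, for single-label predictions like these, equals accuracy.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def micro_f1(y_true, y_pred):
    """Pooled F1; for single-label classification this reduces to accuracy."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced example: one of the two positives is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 3))  # 0.804
print(micro_f1(y_true, y_pred))            # 0.9
```

Here class 0 scores 16/17 and class 1 scores 2/3, giving a Macro-F1 of about 0.804 against a Micro-F1 of 0.9. This sketch always averages over both classes, which can differ from scikit-learn's result when a class is absent from both the true and predicted labels.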


In [None]:
# ======================================================================
# Cell 10: Model 3 - BERT Fine-tuning (PROPERLY TUNED)
# ======================================================================

import os
os.environ['WANDB_DISABLED'] = 'true'

from transformers import BertTokenizer, BertModel, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight

print("ü§ñ MODEL 3: BERT FINE-TUNING (OPTIMIZED)\n")
print("="*70)
print("\n‚è≥ This will take 10-15 minutes...\n")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Custom multimodal BERT model with class weights
class MultimodalBERT(nn.Module):
    def __init__(self, num_labels=2, num_numeric_features=4, class_weights=None):
        super(MultimodalBERT, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.3)

        combined_size = 768 + num_numeric_features

        self.classifier = nn.Sequential(
            nn.Linear(combined_size, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_labels)
        )

        # Store class weights
        self.class_weights = class_weights

    def forward(self, input_ids, attention_mask, numeric_features, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)

        combined = torch.cat([pooled_output, numeric_features], dim=1)
        logits = self.classifier(combined)

        loss = None
        if labels is not None:
            # Use weighted loss
            if self.class_weights is not None:
                loss_fct = nn.CrossEntropyLoss(weight=self.class_weights)
            else:
                loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels)

        return {'loss': loss, 'logits': logits} if loss is not None else {'logits': logits}

class MultimodalDataset(Dataset):
    def __init__(self, texts, labels, numeric_features, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length, return_tensors='pt')
        self.labels = torch.tensor(labels, dtype=torch.long)
        self.numeric_features = torch.tensor(numeric_features, dtype=torch.float32)

    def __getitem__(self, idx):
        return {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'numeric_features': self.numeric_features[idx],
            'labels': self.labels[idx]
        }

    def __len__(self):
        return len(self.labels)

class MultimodalTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        outputs = model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            numeric_features=inputs['numeric_features'],
            labels=inputs['labels']
        )
        loss = outputs['loss']
        return (loss, outputs) if return_outputs else loss

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

numeric_cols = ['fat_per_serving', 'protein_per_serving', 'sodium_per_serving', 'net_carbs_per_serving']
scaler = StandardScaler()
train_numeric = scaler.fit_transform(train_data[numeric_cols].fillna(0))
test_numeric = scaler.transform(test_data[numeric_cols].fillna(0))

bert_predictions = pd.DataFrame()
bert_results = {}

for label_idx, label in enumerate(label_cols):
    print(f"\n{'='*70}")
    print(f"Training Optimized BERT for {label}...")
    print(f"{'='*70}")

    train_texts = train_data['ingredients'].fillna('').tolist()
    train_labels = train_data[label].tolist()
    test_texts = test_data['ingredients'].fillna('').tolist()
    test_labels = test_data[label].tolist()

    # Compute class weights
    class_weights = compute_class_weight(
        'balanced',
        classes=np.unique(train_labels),
        y=train_labels
    )
    class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32).to(device)

    print(f"  Class distribution: {np.bincount(train_labels)}")
    print(f"  Class weights: {class_weights}")

    train_dataset = MultimodalDataset(train_texts, train_labels, train_numeric, tokenizer)
    test_dataset = MultimodalDataset(test_texts, test_labels, test_numeric, tokenizer)

    # Initialize model with class weights
    model = MultimodalBERT(
        num_labels=2,
        num_numeric_features=4,
        class_weights=class_weights_tensor
    ).to(device)

    # OPTIMIZED training arguments
    training_args = TrainingArguments(
        output_dir=f'./bert_optimized_{label}',
        num_train_epochs=4,           # Reduced to 4 (sweet spot)
        per_device_train_batch_size=8,  # Smaller batch for better gradients
        per_device_eval_batch_size=16,
        warmup_steps=100,              # NOTE: exceeds this run's ~56 optimizer steps, so the LR never reaches learning_rate
        weight_decay=0.01,
        learning_rate=1e-5,            # LOWER learning rate (was 2e-5)
        logging_steps=20,
        eval_strategy="no",
        save_strategy="no",
        report_to="none",
        gradient_accumulation_steps=2,  # Effective batch size = 16
    )

    trainer = MultimodalTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
    )

    # Train
    trainer.train()

    # Predict
    model.eval()
    predictions = []

    with torch.no_grad():
        for i in range(0, len(test_dataset), 16):
            batch_end = min(i+16, len(test_dataset))
            batch = [test_dataset[j] for j in range(i, batch_end)]

            input_ids = torch.stack([item['input_ids'] for item in batch]).to(device)
            attention_mask = torch.stack([item['attention_mask'] for item in batch]).to(device)
            numeric_features = torch.stack([item['numeric_features'] for item in batch]).to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                numeric_features=numeric_features
            )

            logits = outputs['logits']
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            predictions.extend(preds)

    bert_predictions[label] = predictions

    # Evaluate
    y_test = test_data[label].values
    f1_macro = f1_score(y_test, predictions, average='macro', zero_division=0)
    f1_micro = f1_score(y_test, predictions, average='micro', zero_division=0)

    bert_results[label] = {'macro_f1': f1_macro, 'micro_f1': f1_micro}

    print(f"\n{label} Results:")
    print(f"  Macro-F1: {f1_macro:.3f}")
    print(f"  Micro-F1: {f1_micro:.3f}")

avg_macro = np.mean([r['macro_f1'] for r in bert_results.values()])
avg_micro = np.mean([r['micro_f1'] for r in bert_results.values()])

print(f"\n{'='*70}")
print(f"Optimized BERT Average Macro-F1: {avg_macro:.3f}")
print(f"Optimized BERT Average Micro-F1: {avg_micro:.3f}")

bert_predictions.to_csv("bert_predictions.csv", index=False)
print("\n✅ Saved: bert_predictions.csv")
print("="*70)

🤖 MODEL 3: BERT FINE-TUNING (OPTIMIZED)


⏳ This will take 10-15 minutes...

Using device: cuda



Training Optimized BERT for keto_compliant...
  Class distribution: [151  72]
  Class weights: [0.7384106  1.54861111]


Step,Training Loss
20,0.7113
40,0.692



keto_compliant Results:
  Macro-F1: 0.562
  Micro-F1: 0.607

Training Optimized BERT for high_protein...
  Class distribution: [140  83]
  Class weights: [0.79642857 1.34337349]


Step,Training Loss
20,0.7084
40,0.6977



high_protein Results:
  Macro-F1: 0.514
  Micro-F1: 0.518

Training Optimized BERT for low_sodium...
  Class distribution: [121 102]
  Class weights: [0.9214876  1.09313725]


Step,Training Loss
20,0.6825
40,0.6846



low_sodium Results:
  Macro-F1: 0.562
  Micro-F1: 0.643

Training Optimized BERT for low_fat...
  Class distribution: [134  89]
  Class weights: [0.83208955 1.25280899]


Step,Training Loss
20,0.6938
40,0.6849
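The training losses logged above stay close to 0.69 (roughly ln 2, the loss of a binary classifier guessing at chance), and the logs stop at step 40. One likely contributor is that `warmup_steps=100` is larger than the total number of optimizer steps in each run. The arithmetic can be sketched as follows, assuming the Trainer counts one optimizer step per `gradient_accumulation_steps` batches and uses a linear warmup schedule; both helper functions are hypothetical and written only for this illustration.

```python
import math


def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Total optimizer updates, assuming floor division by the accumulation factor."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


def linear_warmup_lr(step, base_lr, warmup_steps):
    """Learning rate reached at `step` under linear warmup (decay phase ignored)."""
    return base_lr * min(step / warmup_steps, 1.0)


# Values taken from Cell 10: 223 training products, batch size 8,
# gradient accumulation 2, 4 epochs, learning rate 1e-5, 100 warmup steps.
total_steps = optimizer_steps(num_examples=223, batch_size=8, grad_accum=2, epochs=4)
final_lr = linear_warmup_lr(total_steps, base_lr=1e-5, warmup_steps=100)
print(total_steps)  # 56
print(final_lr)
```

Under these assumptions the run ends after 56 updates, while still in warmup, at a little over half of the configured learning rate. A warmup covering a small fraction of the total steps (or more epochs) would let the schedule reach its peak.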


---
# Part 4: Model Comparison & Error Analysis
---

In [None]:
# ======================================================================
# Cell 11: Compare All Models
# ======================================================================

print("üìä MODEL COMPARISON\n")
print("="*70)

# Create comparison table
comparison = []

for label in label_cols:
    comparison.append({
        'Label': label,
        'Rule-Based (Macro-F1)': f"{rule_results[label]['macro_f1']:.3f}",
        'TF-IDF (Macro-F1)': f"{tfidf_results[label]['macro_f1']:.3f}",
        'BERT (Macro-F1)': f"{bert_results[label]['macro_f1']:.3f}",
    })

comparison_df = pd.DataFrame(comparison)
print(comparison_df.to_string(index=False))

# Overall averages
rule_avg = np.mean([r['macro_f1'] for r in rule_results.values()])
tfidf_avg = np.mean([r['macro_f1'] for r in tfidf_results.values()])
bert_avg = np.mean([r['macro_f1'] for r in bert_results.values()])

print(f"\n{'='*70}")
print(f"\nOVERALL AVERAGES:")
print(f"  Rule-Based:  {rule_avg:.3f}")
print(f"  TF-IDF:      {tfidf_avg:.3f}")
print(f"  BERT:        {bert_avg:.3f}")

# Determine best model
best_model = max(
    [('Rule-Based', rule_avg), ('TF-IDF', tfidf_avg), ('BERT', bert_avg)],
    key=lambda x: x[1]
)

print(f"\nüèÜ Best model: {best_model[0]} (Macro-F1: {best_model[1]:.3f})")

# Save comparison
comparison_df.to_csv("model_comparison.csv", index=False)
print("\n✅ Saved: model_comparison.csv")
print("="*70)

In [None]:
# ======================================================================
# Cell 12: Error Analysis (20 Products) - FIXED
# ======================================================================

print("üîç ERROR ANALYSIS\n")
print("="*70)

# Use best model's predictions (BERT)
error_analysis = []

test_data_reset = test_data.reset_index(drop=True)

for label in label_cols:
    y_true = test_data_reset[label].values
    y_pred = bert_predictions[label].values

    # Find misclassified products
    misclassified_mask = (y_true != y_pred)
    misclassified = test_data_reset[misclassified_mask].copy()
    misclassified['predicted'] = y_pred[misclassified_mask]
    misclassified['actual'] = y_true[misclassified_mask]
    misclassified['label'] = label

    error_analysis.append(misclassified)

# Combine all errors
all_errors = pd.concat(error_analysis, ignore_index=True)

# Take first 20 errors
errors_sample = all_errors.head(20)

print(f"Total errors found: {len(all_errors)}")
print(f"\nAnalyzing first 20 errors:\n")

for idx, row in errors_sample.iterrows():
    # Handle missing names safely
    product_name = row.get('name', 'Unknown Product')
    if pd.isna(product_name):
        product_name = f"Product {row.get('product_id', 'Unknown')}"
    else:
        product_name = str(product_name)[:50]  # Convert to string and truncate

    print(f"{idx+1}. {product_name}")
    print(f"   Label: {row['label']}")
    print(f"   Predicted: {int(row['predicted'])}, Actual: {int(row['actual'])}")

    # Show nutrition values
    if row['label'] == 'keto_compliant':
        net_carbs = row.get('net_carbs_per_serving', 'N/A')
        print(f"   Net carbs: {net_carbs:.2f}g (threshold: ≤5g)" if isinstance(net_carbs, (int, float)) else f"   Net carbs: {net_carbs}")
    elif row['label'] == 'high_protein':
        protein = row.get('protein_per_serving', 'N/A')
        print(f"   Protein: {protein:.2f}g (threshold: ≥10g)" if isinstance(protein, (int, float)) else f"   Protein: {protein}")
    elif row['label'] == 'low_sodium':
        sodium = row.get('sodium_per_serving', 'N/A')
        print(f"   Sodium: {sodium:.2f}mg (threshold: ≤140mg)" if isinstance(sodium, (int, float)) else f"   Sodium: {sodium}")
    elif row['label'] == 'low_fat':
        fat = row.get('fat_per_serving', 'N/A')
        print(f"   Fat: {fat:.2f}g (threshold: ≤3g)" if isinstance(fat, (int, float)) else f"   Fat: {fat}")

    print()

# Save error analysis
errors_to_save = errors_sample[['product_id', 'name', 'label', 'predicted', 'actual']].copy()
errors_to_save['name'] = errors_to_save['name'].fillna('Unknown Product')
errors_to_save.to_csv("error_analysis_20products.csv", index=False)

print("✅ Saved: error_analysis_20products.csv")
print("="*70)
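Cell 12 concatenates errors label by label and then takes the first 20 rows, so the sample follows label order. With the recorded keto Micro-F1 of 0.607 on 56 test products (about 22 errors), those 20 rows would all be keto errors. One way to cover every label is to interleave the per-label error lists before truncating. The sketch below shows the idea on plain lists; the function name `sample_errors_round_robin` and the toy identifiers are assumptions for this example, and items are assumed never to be `None`.

```python
from itertools import zip_longest


def sample_errors_round_robin(errors_by_label, limit=20):
    """Interleave per-label error lists so every label is represented."""
    interleaved = []
    # zip_longest pads shorter lists with None, which is filtered out below.
    for group in zip_longest(*errors_by_label.values()):
        interleaved.extend(item for item in group if item is not None)
    return interleaved[:limit]


# Toy error lists keyed by label.
errors_by_label = {
    "keto_compliant": ["k1", "k2", "k3"],
    "high_protein": ["p1"],
    "low_sodium": ["s1", "s2"],
}
sample = sample_errors_round_robin(errors_by_label, limit=4)
print(sample)  # ['k1', 'p1', 's1', 'k2']
```

In the notebook, the per-label lists could come from the `misclassified` DataFrames built in the loop, for example as lists of row indices.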

---
# Part 5: Generate Final Reports
---

In [None]:
# ======================================================================
# Cell 13: Generate Comprehensive Report
# ======================================================================

print("üìÑ GENERATING FINAL REPORTS\n")
print("="*70)

# Cohen's Kappa Report
kappa_report = []
kappa_report.append("# Task 1: Cohen's Kappa Report\n")
kappa_report.append(f"**Annotators:** Aayushi Saraswat, Rahul Thirumurugan\n")
kappa_report.append(f"**Products:** {len(aayushi)}\n")
kappa_report.append("\n## Results\n")
kappa_report.append("| Label | Cohen's Kappa | Status |")
kappa_report.append("|-------|---------------|--------|")

for label, kappa in kappa_results.items():
    status = "✅ Pass" if kappa >= 0.65 else "⚠️ Review"
    kappa_report.append(f"| {label} | {kappa:.3f} | {status} |")

kappa_report.append(f"\n**Average κ:** {avg_kappa:.3f}\n")
kappa_report.append(f"**B+ Requirement (κ ≥ 0.65):** {'✅ MET' if avg_kappa >= 0.65 else '⚠️ NOT MET'}\n")

# Write as UTF-8 so the κ and emoji characters survive on any platform
with open("TASK1_COHENS_KAPPA_REPORT.md", "w", encoding="utf-8") as f:
    f.write("\n".join(kappa_report))

print("✅ Saved: TASK1_COHENS_KAPPA_REPORT.md")

# Model Comparison Report
model_report = []
model_report.append("# Task 1: Model Comparison Report\n")
model_report.append("## Performance Summary\n")
model_report.append("| Model | Average Macro-F1 | Average Micro-F1 |")
model_report.append("|-------|------------------|------------------|")
model_report.append(f"| Rule-Based | {rule_avg:.3f} | {np.mean([r['micro_f1'] for r in rule_results.values()]):.3f} |")
model_report.append(f"| TF-IDF + LogReg | {tfidf_avg:.3f} | {np.mean([r['micro_f1'] for r in tfidf_results.values()]):.3f} |")
model_report.append(f"| BERT | {bert_avg:.3f} | {np.mean([r['micro_f1'] for r in bert_results.values()]):.3f} |")
model_report.append(f"\n**Best Model:** {best_model[0]} ({best_model[1]:.3f})\n")

with open("TASK1_MODEL_COMPARISON.md", "w", encoding="utf-8") as f:
    f.write("\n".join(model_report))

print("✅ Saved: TASK1_MODEL_COMPARISON.md")

# Error Analysis Report
error_report = []
error_report.append("# Task 1: Error Analysis Report\n")
error_report.append(f"**Total errors:** {len(all_errors)}\n")
error_report.append(f"**Sample analyzed:** {len(errors_sample)} misclassifications\n")
error_report.append("\n## Error Patterns\n")
error_report.append("See error_analysis_20products.csv for details\n")

with open("TASK1_ERROR_ANALYSIS.md", "w", encoding="utf-8") as f:
    f.write("\n".join(error_report))

print("✅ Saved: TASK1_ERROR_ANALYSIS.md")

print("\n" + "="*70)

In [None]:
# ======================================================================
# Cell 14: Summary & Deliverables
# ======================================================================

print("\n" + "="*70)
print("✅ TASK 1 COMPLETE!")
print("="*70)

print("\n📁 DATA FILES:")
print("  1. products_with_task1_labels.csv - Full dataset with labels")
print("  2. task1_annotations_consensus.csv - Consensus annotations")
if len(disagreements) > 0:
    print("  3. task1_disagreements.csv - Annotator disagreements")

print("\n📁 MODEL FILES:")
print("  1. rule_based_predictions.csv")
print("  2. tfidf_model.pkl, tfidf_predictions.csv")
print("  3. bert_predictions.csv")
print("  4. model_comparison.csv")

print("\n📁 REPORTS:")
print("  1. TASK1_COHENS_KAPPA_REPORT.md")
print("  2. TASK1_MODEL_COMPARISON.md")
print("  3. TASK1_ERROR_ANALYSIS.md")
print("  4. error_analysis_20products.csv")

print("\n📊 KEY RESULTS:")
print(f"  • Total products labeled: {len(products)}")
print(f"  • Cohen's Kappa: {avg_kappa:.3f} {'✅' if avg_kappa >= 0.65 else '⚠️'}")
print(f"  • Best model: {best_model[0]} (F1: {best_model[1]:.3f})")

print("\n✅ B+ GRADE CONTRACT STATUS:")
print("  ✅ 180+ products labeled")
print(f"  {'✅' if avg_kappa >= 0.65 else '⚠️'} Cohen's Kappa ≥ 0.65")
print("  ✅ Rule-based model implemented")
print("  ✅ TF-IDF + LogReg implemented")
print("  ✅ BERT model implemented")
print("  ✅ Evaluation metrics calculated")
print("  ✅ Error analysis (20 products)")

print("\n⏭️  NEXT STEPS:")
print("  1. Download all output files")
print("  2. Review error analysis for insights")
print("  3. Move to Task 2: Claim Verification")
print("  4. Use these results in final report")

print("\n" + "="*70)
print("\n🎉 ALL TASK 1 REQUIREMENTS COMPLETED!")
print("="*70)