

# 00 – Data Collection & Preparation for DietCheck

**Course:** CS6120 – Natural Language Processing  
**Project:** DietCheck – NLP System for Dietary Claim Verification  
**Notebook:** `00` – Core data preparation, numeric labels for Task 1, and claim subsets for Task 2.

This notebook does the following:

1. Loads the **core product table** (`products.csv`) for DietCheck.
2. Computes **per-serving nutrition features** and **Task 1 dietary labels**:
   - `keto_compliant`, `high_protein`, `low_sodium`, `low_fat`  
     (using the FDA-style thresholds in the research plan).
3. Creates **train/validation/test splits** with label-combination awareness.
4. Extracts a **small, high-precision set of claim-like strings** from `products.csv`
   for **Task 2 manual annotation** → `candidate_claims_task2.csv`.
5. Builds a **claim-rich subset from OpenFoodFacts via HuggingFace** using
   `labels_tags` → `openfoodfacts_claims_subset.csv` for additional Task 2 data.

Run this notebook top to bottom in Colab or a local environment.
Internet access is needed only for the HuggingFace step (step 5).


In [11]:

# ======================================================================
# Cell 1: Imports, paths, and logging
# ======================================================================

import os
import math
from pathlib import Path

import numpy as np
import pandas as pd

import logging

# Configure basic logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger(__name__)

# Data directory
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print(f"DATA_DIR set to: {DATA_DIR.resolve()}")


DATA_DIR set to: /content/data



## 1. Load base products table (`products.csv`)

We assume there is a `products.csv` in the `data/` directory, containing one row per product with:

- `product_id`
- `name`
- `brand`
- `category`
- `ingredients`
- `serving_size_g` (serving size in grams)
- nutrient values per 100 g:
  - `energy_100g`, `fat_100g`, `saturated_fat_100g`, `carbs_100g`,
    `fiber_100g`, `sugars_100g`, `protein_100g`, `sodium_100g`

If you generated `products.csv` in an earlier step, just place it in `data/` before running this cell.


In [12]:

# ======================================================================
# Cell 2: Load products.csv
# ======================================================================

products_path = DATA_DIR / "products.csv"

if not products_path.exists():
    raise FileNotFoundError(
        f"Expected {products_path} to exist.\n"
        "Please copy your DietCheck products table to data/products.csv and re-run."
    )

df = pd.read_csv(products_path)
print(f"Loaded products.csv with shape: {df.shape}")
print("Columns:", list(df.columns))


Loaded products.csv with shape: (279, 29)
Columns: ['product_id', 'name', 'brand', 'category', 'ingredients', 'serving_size_g', 'energy_100g', 'fat_100g', 'saturated_fat_100g', 'carbs_100g', 'fiber_100g', 'sugars_100g', 'protein_100g', 'sodium_100g', 'net_carbs_100g', 'energy_per_serving', 'fat_per_serving', 'saturated_fat_per_serving', 'carbs_per_serving', 'fiber_per_serving', 'sugars_per_serving', 'protein_per_serving', 'sodium_per_serving', 'net_carbs_per_serving', 'keto_compliant', 'high_protein', 'low_sodium', 'low_fat', 'label_combination']



## 2. Compute per-serving features and Task 1 dietary labels

We follow the research plan thresholds:

- **keto_compliant**: net carbs ≤ 5g per serving  
  where `net_carbs = carbs – fiber – sugar_alcohols`. The table has no sugar-alcohol column, so the code computes `carbs – fiber`.
- **high_protein**: ≥ 10g protein per serving (≈ 20% DV).  
- **low_sodium**: ≤ 140mg sodium per serving.  
- **low_fat**: ≤ 3g fat per serving.

We recompute per-serving values from the per-100g columns and `serving_size_g` to ensure consistency.
Existing columns with the same names will be overwritten by these computed values.
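
As a worked check of these thresholds, the sketch below labels a single product in plain Python. The input values are back-computed from the "Family Pack" row shown in this notebook's `df.head()` output (40 g serving). The function name `label_product` and the constant names are illustrative only and are not part of the notebook's code.

```python
# Thresholds as stated above (per serving).
KETO_MAX_NET_CARBS_G = 5.0
HIGH_PROTEIN_MIN_G = 10.0
LOW_SODIUM_MAX_MG = 140.0   # sodium is treated as milligrams, as the 140 threshold implies
LOW_FAT_MAX_G = 3.0


def label_product(serving_size_g, carbs_100g, fiber_100g,
                  protein_100g, sodium_100g, fat_100g):
    """Scale per-100g values to one serving and apply the four thresholds."""
    factor = serving_size_g / 100.0
    net_carbs = (carbs_100g - fiber_100g) * factor  # no sugar-alcohol term
    labels = {
        "keto_compliant": int(net_carbs <= KETO_MAX_NET_CARBS_G),
        "high_protein": int(protein_100g * factor >= HIGH_PROTEIN_MIN_G),
        "low_sodium": int(sodium_100g * factor <= LOW_SODIUM_MAX_MG),
        "low_fat": int(fat_100g * factor <= LOW_FAT_MAX_G),
    }
    return net_carbs, labels


# Values back-computed from the "Family Pack" row (assumption: 40 g serving).
net_carbs, labels = label_product(
    serving_size_g=40.0, carbs_100g=74.0, fiber_100g=10.0,
    protein_100g=12.0, sodium_100g=110.0, fat_100g=2.0,
)
label_combination = "_".join(
    str(labels[k])
    for k in ("keto_compliant", "high_protein", "low_sodium", "low_fat")
)
print(round(net_carbs, 2), label_combination)  # 25.6 0_0_1_1
```

The result agrees with the `net_carbs_per_serving` and `label_combination` values reported for that row.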


In [13]:

# ======================================================================
# Cell 3: Per-serving features + Task 1 labels
# ======================================================================

REQUIRED_COLS = [
    "product_id", "name", "brand", "category",
    "serving_size_g",
    "energy_100g", "fat_100g", "saturated_fat_100g",
    "carbs_100g", "fiber_100g", "sugars_100g",
    "protein_100g", "sodium_100g",
]

missing = [c for c in REQUIRED_COLS if c not in df.columns]
if missing:
    raise KeyError(f"Missing required columns in products.csv: {missing}")

# Ensure numeric dtypes
numeric_cols = [
    "serving_size_g",
    "energy_100g", "fat_100g", "saturated_fat_100g",
    "carbs_100g", "fiber_100g", "sugars_100g",
    "protein_100g", "sodium_100g",
]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Net carbs per 100g and per serving
df["net_carbs_100g"] = df["carbs_100g"] - df["fiber_100g"]

factor = df["serving_size_g"] / 100.0

df["energy_per_serving"] = df["energy_100g"] * factor
df["fat_per_serving"] = df["fat_100g"] * factor
df["saturated_fat_per_serving"] = df["saturated_fat_100g"] * factor
df["carbs_per_serving"] = df["carbs_100g"] * factor
df["fiber_per_serving"] = df["fiber_100g"] * factor
df["sugars_per_serving"] = df["sugars_100g"] * factor
df["protein_per_serving"] = df["protein_100g"] * factor
df["sodium_per_serving"] = df["sodium_100g"] * factor
df["net_carbs_per_serving"] = df["net_carbs_100g"] * factor

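# Task 1 labels. Comparisons against NaN evaluate to False, so a product with a
# missing nutrient or serving-size value is labeled 0 for the affected diet.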
df["keto_compliant"] = (
    (df["net_carbs_per_serving"] <= 5.0)
).fillna(False).astype(int)

df["high_protein"] = (
    (df["protein_per_serving"] >= 10.0)
).fillna(False).astype(int)

df["low_sodium"] = (
    (df["sodium_per_serving"] <= 140.0)
).fillna(False).astype(int)

df["low_fat"] = (
    (df["fat_per_serving"] <= 3.0)
).fillna(False).astype(int)

df["label_combination"] = (
    df["keto_compliant"].astype(str)
    + "_" + df["high_protein"].astype(str)
    + "_" + df["low_sodium"].astype(str)
    + "_" + df["low_fat"].astype(str)
)

df.head()


Unnamed: 0,product_id,name,brand,category,ingredients,serving_size_g,energy_100g,fat_100g,saturated_fat_100g,carbs_100g,...,fiber_per_serving,sugars_per_serving,protein_per_serving,sodium_per_serving,net_carbs_per_serving,keto_compliant,high_protein,low_sodium,low_fat,label_combination
0,5010029000016,Weetabix,Weetabix,en:plant-based-foods-and-beverages,"Wholegrain Wheat (95%), Malted Barley Extract,...",238.0,358.0,2.11,0.526,68.4,...,23.8,10.0198,28.084,245.14,138.992,0,1,0,0,0_1_0_0
1,3168930010265,cruesly mélange de noix,Quaker,en:plant-based-foods-and-beverages,"_avoine_ complète (32%), _blé_ complet (18%), ...",45.0,462.0,19.0,2.0,57.0,...,4.5,5.4,3.825,0.0,21.15,0,0,1,0,0_0_1_0
2,5010029000801,Family Pack,Weetabix,en:plant-based-foods-and-beverages,"Wholegrain _Wheat_ (95%), Malted _Barley_ Extr...",40.0,362.0,2.0,0.6,74.0,...,4.0,1.68,4.8,44.0,25.6,0,0,1,1,0_0_1_1
3,20003166,Haferflocken,"Brownfield, CROWNFIELD",en:plant-based-foods-and-beverages,100 % wholemeal oat flakes,40.0,372.0,7.0,1.3,58.7,...,4.0,0.28,5.4,4.8,19.48,0,0,1,1,0_0_1_1
4,3229820019307,Flocons d'avoine,Bjorg,en:plant-based-foods-and-beverages,Flocons d'_avoine_ complète issue de l'agricul...,60.0,362.0,7.1,1.3,58.0,...,6.6,1.02,6.6,4.8,28.2,0,0,1,0,0_0_1_0



### Save updated `products.csv`

We now overwrite `data/products.csv` with the refreshed per-serving features and Task 1 labels.
This table is the **master dataset** for downstream notebooks.


In [14]:

# ======================================================================
# Cell 4: Save updated products.csv with Task 1 labels
# ======================================================================

df.to_csv(DATA_DIR / "products.csv", index=False)
print(f"Saved updated products.csv with Task 1 labels to: {DATA_DIR / 'products.csv'}")


Saved updated products.csv with Task 1 labels to: data/products.csv



## 3. Create train/validation/test splits

We stratify the splits by `label_combination`. Stratification needs at least two products
per combination, so combinations that appear only once are set aside before splitting.

- **Train**: 70%  
- **Validation**: 15%  
- **Test**: 15%  

We keep rare combinations (with count < 2) in the training set only.
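
The split sizes can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample, which is an assumption about library behaviour rather than something this notebook verifies. The counts (278 splittable products, 1 rare-combination product) are taken from this notebook's run, and `expected_split_sizes` is a hypothetical helper name.

```python
import math


def expected_split_sizes(n_splittable, n_rare, train=0.70, val=0.15, test=0.15):
    """Estimate train/val/test sizes for the two-stage split described above."""
    # Stage 1: hold out the test set from the splittable products.
    n_test = math.ceil(test * n_splittable)
    n_temp = n_splittable - n_test
    # Stage 2: validation is 15% of the total, i.e. 0.15 / 0.85 of what is left.
    val_ratio_adj = val / (train + val)
    n_val = math.ceil(val_ratio_adj * n_temp)
    # Rare combinations are appended to the training set only.
    n_train = n_temp - n_val + n_rare
    return n_train, n_val, n_test


print(expected_split_sizes(278, 1))  # (195, 42, 42)
```

Because both held-out sets are rounded up, the training share ends up slightly below a nominal 70%.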


In [15]:

# ======================================================================
# Cell 5: Train/Val/Test splits with label-aware logic
# ======================================================================

from sklearn.model_selection import train_test_split

SPLIT_SEED = 42
np.random.seed(SPLIT_SEED)

TRAIN_RATIO = 0.70
VAL_RATIO = 0.15
TEST_RATIO = 0.15

assert abs(TRAIN_RATIO + VAL_RATIO + TEST_RATIO - 1.0) < 1e-6

if "label_combination" not in df.columns:
    raise KeyError("Expected 'label_combination' column before splitting.")

combo_counts = df["label_combination"].value_counts()
keep_combos = combo_counts[combo_counts >= 2].index

df_for_splits = df[df["label_combination"].isin(keep_combos)].copy()
df_rare = df[~df["label_combination"].isin(keep_combos)].copy()

print(f"Total products          : {len(df)}")
print(f"Products used for splits: {len(df_for_splits)}")
print(f"Rare-combo products     : {len(df_rare)} (will be added to train only)")

# Split into train+val and test
df_temp, df_test = train_test_split(
    df_for_splits,
    test_size=TEST_RATIO,
    random_state=SPLIT_SEED,
    stratify=df_for_splits["label_combination"],
)

val_ratio_adj = VAL_RATIO / (TRAIN_RATIO + VAL_RATIO)

df_train, df_val = train_test_split(
    df_temp,
    test_size=val_ratio_adj,
    random_state=SPLIT_SEED,
    stratify=df_temp["label_combination"],
)

df_train = pd.concat([df_train, df_rare], ignore_index=True)

print("\nFinal split sizes:")
print("  Train:", len(df_train))
print("  Val  :", len(df_val))
print("  Test :", len(df_test))

df_train.to_csv(DATA_DIR / "products_train.csv", index=False)
df_val.to_csv(DATA_DIR / "products_val.csv", index=False)
df_test.to_csv(DATA_DIR / "products_test.csv", index=False)

print("\nSaved:")
print(f"  {DATA_DIR / 'products_train.csv'}")
print(f"  {DATA_DIR / 'products_val.csv'}")
print(f"  {DATA_DIR / 'products_test.csv'}")


Total products          : 279
Products used for splits: 278
Rare-combo products     : 1 (will be added to train only)

Final split sizes:
  Train: 195
  Val  : 42
  Test : 42

Saved:
  data/products_train.csv
  data/products_val.csv
  data/products_test.csv



## 4. Candidate claim strings from products table (Task 2, small manual set)

This step extracts a **small, high-precision set of claim-like strings** from the metadata of
products in the **training split**, to serve as a **manual annotation set for Task 2**.

We scan `name`, `category`, and `brand` for claims like "no added sugar", "0% fat",
"gluten free", etc., and save to `data/candidate_claims_task2.csv`.
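
The matching-plus-context idea can be sketched on a single string. The example below uses one of the "no added sugar" patterns and a context window around each match. The product name is hypothetical and `snippets` is an illustrative helper, not the notebook's function.

```python
import re

NO_ADDED_SUGAR = re.compile(r"\bno\s+added\s+sugar\b", flags=re.IGNORECASE)


def snippets(text, regex, context_window=25):
    """Return each match padded with up to `context_window` characters per side."""
    out = []
    for match in regex.finditer(text):
        start, end = match.span()
        left = max(0, start - context_window)
        right = min(len(text), end + context_window)
        out.append(text[left:right].strip())
    return out


name = "Crunchy Oat Granola No Added Sugar 500g"  # hypothetical product name
wide = snippets(name, NO_ADDED_SUGAR)                       # default 25-character window
narrow = snippets(name, NO_ADDED_SUGAR, context_window=5)   # tighter window

print(wide)
print(narrow)
```

The window is measured in characters, not words, so a small window can cut the surrounding text mid-word. For short fields such as `name`, a 25-character window often returns the whole field.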


In [16]:

# ======================================================================
# Cell 6: Candidate claim extraction from products table
# ======================================================================

import re

print("\n➤ Extracting claim-like strings for Task 2 (manual annotation)\n")

source_df = df_train.copy()
print(f"  ⮕ Using TRAINING split for claim extraction: {len(source_df)} products")

TEXT_FIELDS = [f for f in ["name", "category", "brand"] if f in source_df.columns]
print(f"  ⮕ Scanning text fields (claims likely here): {TEXT_FIELDS}\n")

CLAIM_PATTERNS = {
    "low_sugar": [
        r"\bno\s+added\s+sugar\b",
        r"\bwithout\s+added\s+sugar\b",
        r"\bsugar[-\s]?free\b",
        r"\bsans\s+sucre[s]?\s+ajout[ée]?[s]?\b",
        r"\bsans\s+sucre[s]?\b",
        r"\b0\s*%\s*sucre\b",
    ],
    "low_fat": [
        r"\blow[-\s]?fat\b",
        r"\b0\s*%\s*fat\b",
        r"\bfat[-\s]?free\b",
        r"\bfaible\s+en\s+matière[s]?\s+grasse[s]?\b",
        r"\bpauvre\s+en\s+matière[s]?\s+grasse[s]?\b",
    ],
    "high_protein": [
        r"\b(high|rich)\s+in\s+protein\b",
        r"\bprotein[-\s]?rich\b",
        r"\bsource\s+of\s+protein\b",
        r"\briche\s+en\s+protéines?\b",
        r"\bsource\s+de\s+protéines?\b",
    ],
    "high_fiber": [
        r"\b(high|rich)\s+in\s+fib(re|er)s?\b",
        r"\bsource\s+of\s+fib(re|er)s?\b",
        r"\briche\s+en\s+fibres?\b",
        r"\bsource\s+de\s+fibres?\b",
        r"\bfibres?\b",  # broad: matches any mention of "fibre"/"fibres", not only explicit claims
    ],
    "low_sodium": [
        r"\blow\s+(salt|sodium)\b",
        r"\breduced\s+salt\b",
        r"\breduced\s+sodium\b",
        r"\bno\s+added\s+salt\b",
        r"\bsans\s+sel\s+ajouté\b",
        r"\bfaible\s+en\s+sel\b",
        r"\bpauvre\s+en\s+sel\b",
    ],
    "gluten_free": [
        r"\bgluten[-\s]?free\b",
        r"\bsans\s+gluten\b",
    ],
    "lactose_free": [
        r"\blactose[-\s]?free\b",
        r"\bsans\s+lactose\b",
    ],
    "keto": [
        r"\bketo(?:genic)?\b",
        r"\bketo[-\s]?friendly\b",
    ],
    "light": [
        r"\blight\b",
        r"\blightly\s+salted\b",
        r"\bléger\b",
    ],
}

compiled_patterns = {
    k: [re.compile(p, flags=re.IGNORECASE) for p in v]
    for k, v in CLAIM_PATTERNS.items()
}

def extract_claims_from_text(pid, field_name, text, context_window=25):
    if not isinstance(text, str) or not text.strip():
        return []
    candidates = []
    for claim_type, regex_list in compiled_patterns.items():
        for regex in regex_list:
            for match in regex.finditer(text):
                start, end = match.span()
                left = max(0, start - context_window)
                right = min(len(text), end + context_window)
                snippet = text[left:right].strip()
                candidates.append(
                    {
                        "product_id": pid,
                        "claim_text": snippet,
                        "claim_type_hint": claim_type,
                        "source_field": field_name,
                        "full_text": text,
                    }
                )
    return candidates

all_candidates = []
for _, row in source_df.iterrows():
    pid = row.get("product_id", None)
    for field in TEXT_FIELDS:
        text = row.get(field, None)
        all_candidates.extend(
            extract_claims_from_text(pid, field, text, context_window=25)
        )

if not all_candidates:
    print("⚠️ No candidate claims found with current settings.")
    candidates_df = pd.DataFrame(
        columns=["product_id", "claim_text", "claim_type_hint", "source_field", "full_text"]
    )
else:
    candidates_df = pd.DataFrame(all_candidates)
    candidates_df = candidates_df.drop_duplicates(
        subset=["product_id", "claim_text", "claim_type_hint", "source_field"]
    ).reset_index(drop=True)

print(f"  ⮕ Extracted {len(candidates_df)} claim-like strings")

if not candidates_df.empty:
    print("\n  ⮕ Claim type counts:")
    print(candidates_df["claim_type_hint"].value_counts())
    print(f"\n  ⮕ Products with ≥1 claim: {candidates_df['product_id'].nunique()}")

claims_path = DATA_DIR / "candidate_claims_task2.csv"
candidates_df.to_csv(claims_path, index=False)
print(f"\n➤ Saved candidate claim strings to: {claims_path}")



➤ Extracting claim-like strings for Task 2 (manual annotation)

  ⮕ Using TRAINING split for claim extraction: 195 products
  ⮕ Scanning text fields (claims likely here): ['name', 'category', 'brand']

  ⮕ Extracted 6 claim-like strings

  ⮕ Claim type counts:
claim_type_hint
low_fat       3
low_sugar     1
high_fiber    1
light         1
Name: count, dtype: int64

  ⮕ Products with ≥1 claim: 4

➤ Saved candidate claim strings to: data/candidate_claims_task2.csv



## 5. Claim-rich subset from HuggingFace OpenFoodFacts

This step uses the `openfoodfacts/product-database` dataset on HuggingFace to build a **claim-rich**
subset of products, based on `labels_tags`. It outputs `data/openfoodfacts_claims_subset.csv`.
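
The core of this step is a lookup from `labels_tags` entries to coarse claim types. A minimal sketch, assuming a four-entry excerpt of the mapping and a made-up tag list (`CLAIM_LABEL_MAP_SAMPLE`, `claim_type_hint`, and `example_tags` are illustrative names):

```python
# Small excerpt of the tag -> claim-type mapping, for illustration only.
CLAIM_LABEL_MAP_SAMPLE = {
    "en:no-added-sugars": "low_sugar",
    "fr:sans-sucres-ajoutes": "low_sugar",
    "en:no-gluten": "gluten_free",
    "en:organic": "organic",
}


def claim_type_hint(labels_tags, label_map=CLAIM_LABEL_MAP_SAMPLE):
    """Map a product's tags to a sorted, de-duplicated ';'-joined claim-type string."""
    tags = [t for t in (labels_tags or []) if isinstance(t, str)]
    claim_types = {label_map[t] for t in tags if t in label_map}
    return ";".join(sorted(claim_types))


# Hypothetical tag list: two tags map to the same claim type, one is unmapped.
example_tags = ["en:organic", "en:no-added-sugars", "fr:sans-sucres-ajoutes", "en:green-dot"]
print(claim_type_hint(example_tags))  # low_sugar;organic
print(repr(claim_type_hint(None)))    # ''
```

A product is kept only when this string is non-empty, which is what makes the resulting subset claim-rich.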


In [None]:
# ======================================================================
# Cell 7: HuggingFace OpenFoodFacts claim-rich subset
# ======================================================================

# NOTE: This cell requires internet access.
!pip install -q datasets pyarrow

from datasets import load_dataset

CLAIM_LABEL_MAP = {
    # Sugar
    "en:no-added-sugars": "low_sugar",
    "en:without-added-sugars": "low_sugar",
    "en:without-added-sugar": "low_sugar",
    "en:without-sugars": "low_sugar",
    "en:low-sugar": "low_sugar",
    "en:sugar-free": "low_sugar",
    "en:without-sugar": "low_sugar",
    "en:no-sugars": "low_sugar",
    "en:reduced-sugars": "low_sugar",
    "en:no-added-salt-or-sugar": "low_sugar",  # one type per tag, so the salt half is not recorded
    "fr:sans-sucres-ajoutes": "low_sugar",
    "fr:sans-sucre-ajoute": "low_sugar",
    "fr:faible-teneur-en-sucres": "low_sugar",
    "fr:teneur-reduite-en-sucres": "low_sugar",

    # Fat
    "en:low-fat": "low_fat",
    "en:fat-free": "low_fat",
    "en:reduced-fat": "low_fat",
    "en:0-fat": "low_fat",
    "en:0-percent-fat": "low_fat",
    "en:skimmed-milk": "low_fat",
    "en:half-skimmed-milk": "low_fat",
    "fr:faible-teneur-en-matieres-grasses": "low_fat",
    "fr:teneur-reduite-en-matieres-grasses": "low_fat",
    "fr:0-de-matieres-grasses": "low_fat",

    # Sodium / salt
    "en:low-sodium": "low_sodium",
    "en:low-salt": "low_sodium",
    "en:very-low-sodium": "low_sodium",
    "en:very-low-salt": "low_sodium",
    "en:no-added-salt": "low_sodium",
    "en:reduced-salt": "low_sodium",
    "en:reduced-sodium": "low_sodium",
    "fr:faible-teneur-en-sel": "low_sodium",
    "fr:teneur-reduite-en-sel": "low_sodium",
    "fr:sans-sel-ajoute": "low_sodium",

    # Fibre
    "en:high-fibre": "high_fiber",
    "en:high-fiber": "high_fiber",
    "en:fibre-rich": "high_fiber",
    "en:fiber-rich": "high_fiber",
    "en:source-of-fibre": "high_fiber",
    "en:source-of-fiber": "high_fiber",
    "fr:riche-en-fibres": "high_fiber",
    "fr:source-de-fibres": "high_fiber",

    # Protein
    "en:high-protein": "high_protein",
    "en:protein-rich": "high_protein",
    "en:source-of-protein": "high_protein",
    "fr:riche-en-proteines": "high_protein",
    "fr:source-de-proteines": "high_protein",

    # Gluten / lactose
    "en:no-gluten": "gluten_free",
    "en:gluten-free": "gluten_free",
    "fr:sans-gluten": "gluten_free",
    "en:lactose-free": "lactose_free",
    "fr:sans-lactose": "lactose_free",

    # Keto / low-carb-ish
    "en:ketogenic": "keto",
    "en:keto": "keto",
    "en:low-carb": "keto",

    # Vegan / vegetarian
    "en:vegan": "vegan",
    "en:vegetarian": "vegetarian",
    "fr:vegetarien": "vegetarian",
    "fr:vegetalien": "vegan",
    "fr:vegan": "vegan",

    # Organic
    "en:organic": "organic",
    "fr:biologique": "organic",
    "fr:agriculture-biologique": "organic",

    # Palm oil free
    "en:palm-oil-free": "palm_oil_free",
    "fr:sans-huile-de-palme": "palm_oil_free",
}

TARGET_TAGS = set(CLAIM_LABEL_MAP.keys())
print(f"Mapped labels: {len(CLAIM_LABEL_MAP)} unique tags")

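def extract_main_text(val):
    """
    Return one display string. Plain strings are stripped. For lists, prefer the
    first 'en:'-prefixed entry (prefix removed); otherwise use the first string
    entry, dropping a two-letter 'xx:' language prefix if present.
    """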
    if val is None:
        return ""
    if isinstance(val, str):
        return val.strip()
    if isinstance(val, (list, tuple)):
        for v in val:
            if isinstance(v, str) and v.startswith("en:"):
                return v.split(":", 1)[1].strip()
        for v in val:
            if isinstance(v, str):
                if len(v) > 3 and v[2] == ":":
                    return v.split(":", 1)[1].strip()
                return v.strip()
        return ""
    return str(val).strip()

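def extract_ingredients_text(val):
    """
    Return ingredients as one string. For lists, join all string entries with
    ', ', dropping two-letter 'xx:' language prefixes.
    """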
    if val is None:
        return ""
    if isinstance(val, str):
        return val.strip()
    if isinstance(val, (list, tuple)):
        parts = []
        for v in val:
            if isinstance(v, str):
                if len(v) > 3 and v[2] == ":":
                    parts.append(v.split(":", 1)[1].strip())
                else:
                    parts.append(v.strip())
        return ", ".join(parts)
    return str(val).strip()

def extract_nutriments(nut):
    """
    Normalize the 'nutriments' field from the dataset into a flat dict of floats.
    Handles several possible shapes:
      - dict: {"energy_100g": ..., ...}
      - list of dicts: [{"energy_100g": ...}, {...}, ...]
      - anything else: treated as empty.
    """
    # Normalize to a dict
    if nut is None:
        nut = {}
    elif isinstance(nut, list):
        # Sometimes nutriments may come as a list of dicts. Merge them.
        merged = {}
        for item in nut:
            if isinstance(item, dict):
                for k, v in item.items():
                    # Only fill missing keys to avoid weird overwrites
                    if k not in merged:
                        merged[k] = v
        nut = merged
    elif not isinstance(nut, dict):
        # Any other unexpected type → treat as empty
        nut = {}

    def get_float(key):
        v = nut.get(key)
        try:
            return float(v)
        except (TypeError, ValueError):
            return None

    return {
        "energy_100g": get_float("energy_100g"),
        "fat_100g": get_float("fat_100g"),
        "saturated_fat_100g": get_float("saturated-fat_100g"),
        "carbs_100g": get_float("carbohydrates_100g"),
        "fiber_100g": get_float("fiber_100g"),
        "sugars_100g": get_float("sugars_100g"),
        "protein_100g": get_float("proteins_100g"),
        "sodium_100g": get_float("sodium_100g"),
    }


MAX_ROWS = 2000

print("\n➤ Loading OpenFoodFacts dataset (streaming from HuggingFace)...")
ds = load_dataset("openfoodfacts/product-database", split="food", streaming=True)

rows = []
seen_codes = set()
n_scanned = 0

for example in ds:
    n_scanned += 1
    labels_tags = example.get("labels_tags") or []
    labels_tags = [t for t in labels_tags if isinstance(t, str)]
    matching_tags = [t for t in labels_tags if t in TARGET_TAGS]
    if not matching_tags:
        continue

    code = example.get("code")
    if not code or code in seen_codes:
        continue
    seen_codes.add(code)

    # Every matching tag is a key of CLAIM_LABEL_MAP, so the lookup cannot fail
    # and claim_types is never empty here.
    claim_types = [CLAIM_LABEL_MAP[t] for t in matching_tags]

    product_name = extract_main_text(example.get("product_name"))
    brand = (example.get("brands") or "").strip()
    categories = (example.get("categories") or "").strip()
    ingredients_text = extract_ingredients_text(example.get("ingredients_text"))
    labels_str = (example.get("labels") or "").strip()
    nutriments = extract_nutriments(example.get("nutriments"))

    row = {
        "product_id": code,
        "name": product_name,
        "brand": brand,
        "category": categories,
        "ingredients_text": ingredients_text,
        "labels": labels_str,
        "labels_tags": "|".join(labels_tags),
        "claim_type_hint": ";".join(sorted(set(claim_types))),
        "source_field": "labels/labels_tags",
        "full_text": " | ".join(
            [x for x in [product_name, brand, categories, ingredients_text, labels_str] if x]
        ),
        **nutriments,
    }
    rows.append(row)

    if len(rows) % 200 == 0:
        print(f"  ⮕ Collected {len(rows)} claim-rich products (scanned {n_scanned})")

    if len(rows) >= MAX_ROWS:
        break

print(f"\n➤ Finished. Collected {len(rows)} claim-rich products (scanned {n_scanned} total rows).")

df_claims = pd.DataFrame(rows)
hf_output_path = DATA_DIR / "openfoodfacts_claims_subset.csv"
df_claims.to_csv(hf_output_path, index=False)
print(f"➤ Saved claim-rich subset to: {hf_output_path}")
df_claims.head()


Mapped labels: 65 unique tags

➤ Loading OpenFoodFacts dataset (streaming from HuggingFace)...
  ⮕ Collected 200 claim-rich products (scanned 3197)
  ⮕ Collected 400 claim-rich products (scanned 6498)
  ⮕ Collected 600 claim-rich products (scanned 11042)
  ⮕ Collected 800 claim-rich products (scanned 12621)
  ⮕ Collected 1000 claim-rich products (scanned 14639)
  ⮕ Collected 1200 claim-rich products (scanned 16900)
  ⮕ Collected 1400 claim-rich products (scanned 20423)
