# Cleaning the What's Cooking Dataset
## Ingredients
Before we can get started, we need to standardize our ingredients.
As it stands, one-hot encoding would be unreliable, because variants of the same ingredient are not recognized as the same ingredient.

For example:
- "skinless chicken breasts"
- "free range chicken breasts"
- "cooked chicken breasts"

Since our goal is to learn which ingredients make up which cuisine, all of these must be canonicalized as "chicken".
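
To make the problem concrete, here is a minimal sketch of how the three variants above behave under one-hot encoding before and after canonicalization. The mini recipes and the hand-written `canonical` mapping are made up for illustration and are not taken from the dataset; `one_hot` is a hypothetical helper, not part of this notebook's pipeline.

```python
# Toy illustration (made-up mini recipes, not rows from the dataset).
recipes = [
    ["skinless chicken breasts", "salt"],
    ["free range chicken breasts", "salt"],
    ["cooked chicken breasts", "salt"],
]

# Hypothetical hand-written mapping standing in for the canonicalization step.
canonical = {
    "skinless chicken breasts": "chicken",
    "free range chicken breasts": "chicken",
    "cooked chicken breasts": "chicken",
    "salt": "salt",
}


def one_hot(rows):
    """Return (sorted vocabulary, one 0/1 vector per row)."""
    vocab = sorted({item for row in rows for item in row})
    vectors = [[1 if item in row else 0 for item in vocab] for row in rows]
    return vocab, vectors


raw_vocab, raw_vectors = one_hot(recipes)
canon_vocab, canon_vectors = one_hot(
    [[canonical[item] for item in row] for row in recipes]
)

print(len(raw_vocab), "columns before canonicalization")
print(len(canon_vocab), "columns after canonicalization")
```

Before canonicalization, each recipe activates a different "chicken" column, so the three recipes share only the salt column. After canonicalization, all three recipes map to the same vector.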

In [2]:
# First, we have to collect our ingredients from the hypergraph.
# The ingredients can be found in the "node-data" of our hypergraph.

import json
import os

dataset_path = "dataset"
# Load the dataset
with open(os.path.join(dataset_path, 'kaggle-whats-cooking.json'), 'r') as f:
    data = json.load(f)

# Extract ingredients from node-data
ingredients_list = []
for key, value in data['node-data'].items():
    ingredients_list.append({
        'id': key,
        'ingredient': value['name'],
        'canonicalized': ''
    })

# Save to new JSON file
with open(os.path.join(dataset_path, 'ingredients.json'), 'w') as f:
    json.dump(ingredients_list, f, indent=2)

print(f"Extracted {len(ingredients_list)} ingredients to ingredients.json")

Extracted 6714 ingredients to ingredients.json


Next, we'll employ a local LLM to assist us in generating the canonicalized ingredients.
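
Since a local model will not always follow instructions, each response probably needs checking before it is merged back into `ingredients.json`. The following is a minimal sketch of such a check, assuming the model is asked to return a JSON array of objects with `id`, `ingredient`, and `canonical` fields, one per input item and in input order. `validate_batch` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
import json


def validate_batch(batch, response_text):
    """Parse a model response and check it against the input batch.

    Returns the parsed list on success and raises ValueError otherwise.
    (Hypothetical helper; the checks mirror the assumptions stated above.)
    """
    try:
        result = json.loads(response_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(result, list) or len(result) != len(batch):
        raise ValueError("expected a JSON array with one object per input")

    for source, item in zip(batch, result):
        if not isinstance(item, dict):
            raise ValueError("every element must be a JSON object")
        # ids may come back as strings or numbers, so compare as strings.
        if str(item.get("id")) != str(source["id"]):
            raise ValueError(f"id mismatch for input {source['id']}")
        if item.get("ingredient") != source["ingredient"]:
            raise ValueError(f"ingredient changed for id {source['id']}")
        canonical = item.get("canonical")
        if not isinstance(canonical, str) or not canonical:
            raise ValueError(f"missing canonical for id {source['id']}")
        if canonical != canonical.strip().lower():
            raise ValueError(f"canonical not lowercased/trimmed: {canonical!r}")
    return result
```

A batch that fails validation can simply be retried, which is usually cheaper than repairing a malformed response by hand.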

In [None]:
# Here is the prompt we will use for this step.
def generate_prompt(ingredients_list):
    num_ingredients = len(ingredients_list)
    prompt = f"""
    SYSTEM / INSTRUCTION PROMPT:
    You will canonicalize exactly {num_ingredients} raw ingredient strings and return a JSON array containing exactly {num_ingredients} objects. The input will be a JSON array of {num_ingredients} objects, each with fields:
    - id: string or number
    - ingredient: original ingredient string

    Required output
    - Return only a JSON array of length {num_ingredients}.
    - Each element must be an object with exactly these fields in this order:
    1) "id" (copy the original id)
    2) "ingredient" (copy the original ingredient string exactly)
    3) "canonical" (one canonical ingredient string)

    Output example for one item:
    {{"id": "3689", "ingredient": "coke zero", "canonical": "soft drink"}}

    Hard rules for canonical
    1. Output must be JSON only. No extra text, no commentary, no trailing commas.
    2. Output must contain exactly {num_ingredients} objects. If you cannot produce {num_ingredients}, return a JSON array of {num_ingredients} objects with "canonical":"other" for entries you cannot map.
    3. Preserve the input id and ingredient exactly as provided.
    4. The "canonical" value must be a single, general ingredient name in lowercase. Use plain words and spaces. Do not add extra fields.
    5. Use singular nouns (potato, not potatoes).
    6. Remove quantities, percentages, fat labels and marketing fluff. Example: "1% low-fat buttermilk" -> "buttermilk".
    7. Remove preparation/state words. Drop words like cooked, raw, peeled, shelled, chopped, diced, frozen, fresh.
    8. Collapse cuts and varieties into the base ingredient. Example: "chicken breast" -> "chicken", "fingerling potatoes" -> "potato".
    9. Collapse seasoning blends to "seasoning".
    10. Collapse mixed or composite products to their broad function: e.g., "tropical fruits" -> "fruit", "bloody mary mix" -> "cocktail mix", "meat stock" -> "stock".
    11. Brand-name packaged goods collapse to product type: e.g., "coke zero" -> "soft drink".
    12. If the string clearly names a distinct item (sugar, bucatini, shrimp), canonicalize to that item but prefer broader categories when unsure.
    13. If ambiguous or unknown, set canonical to "other".
    14. Use conservative generalization. Prefer fewer, stable tokens over many rare tokens.

    Formatting and validation
    - canonical must not be empty or null.
    - canonical must be lowercased and trimmed.
    - Do not include punctuation beyond internal hyphens or spaces if needed.
    - The assistant must validate output length: produce exactly {num_ingredients} outputs, one per input item and in input order, using "canonical":"other" for any item it cannot map.

    Short allowed examples (not an exhaustive list)
    chicken, beef, pork, lamb, shrimp, fish, potato, onion, garlic, tomato, pepper, peas, beans, rice, pasta, noodles, bread, milk, butter, buttermilk, oil, vinegar, seasoning, spice, salt, sugar, stock, broth, fruit, soft drink, cocktail mix, other

    Final instruction
    - Return only the JSON array described above, exactly {num_ingredients} objects. No text before or after the JSON.

    INPUT (JSON array of {num_ingredients} items):
    {json.dumps(ingredients_list)}
    """
    return prompt

# And here is the code to get a batch of 30 ingredients.

with open(os.path.join(dataset_path, 'ingredients.json'), 'r') as f:
    ingredients = json.load(f)


def get_ingredients_batch(n):
    # Return the n-th batch (0-indexed) of up to 30 ingredients;
    # the final batch may be shorter.
    start_idx = n * 30
    end_idx = start_idx + 30
    return ingredients[start_idx:end_idx]

# Let's see the first batch.
prompt = generate_prompt(get_ingredients_batch(0))
print(prompt)
print(f"Prompt length: {len(prompt)} characters")


    SYSTEM / INSTRUCTION PROMPT:
    You will canonicalize exactly 30 raw ingredient strings and return a JSON array containing exactly 30 objects. The input will be a JSON array of 30 objects, each with fields:
    - id: string or number
    - ingredient: original ingredient string

    Required output
    - Return only a JSON array of length 30.
    - Each element must be an object with exactly these fields in this order:
    1) "id" (copy the original id)
    2) "ingredient" (copy the original ingredient string exactly)
    3) "canonical" (one canonical ingredient string)

    Output example for one item:
    {"id": "3689", "ingredient": "coke zero", "canonical": "soft drink"}

    Hard rules for canonical
    1. Output must be JSON only. No extra text, no commentary, no trailing commas.
    2. Output must contain exactly 30 objects. If you cannot produce 30, return a JSON array of 30 objects with "canonical":"other" for entries you cannot map.
    3. Preserve the input id and ingr