# Keto/Vegan Diet classifier
Argmax, a consulting firm specializing in search and recommendation solutions with offices in New York and Israel, is hiring entry-level Data Scientists and Machine Learning Engineers.

At Argmax, we prioritize strong coding skills and a proactive, “get-things-done” attitude over a perfect resume. As part of our selection process, candidates are required to complete a coding task demonstrating their practical abilities.

In this task, you’ll work with a large recipe dataset sourced from Allrecipes.com. Your challenge will be to classify recipes based on their ingredients, accurately identifying keto (low-carb) and vegan (no animal products) dishes.

Successfully completing this assignment is a crucial step toward joining Argmax’s talented team.

In [1]:
!pip install transformers torch



In [3]:
import ast
import json
import os
import re
import sys
from argparse import ArgumentParser
from time import time
from typing import Any, Dict, List, Optional, Set

import pandas as pd
import transformers
from decouple import config
from opensearchpy import OpenSearch
from thefuzz import fuzz, process
from transformers import AutoModelForSequenceClassification, AutoTokenizer

os.environ["TOKENIZERS_PARALLELISM"] = "true"
client = OpenSearch(
    hosts=[config('OPENSEARCH_URL', 'http://localhost:9200')],
    http_auth=None,
    use_ssl=False,
    verify_certs=False,
    ssl_show_warn=False,
)

# Recipes Index
Our data is stored in OpenSearch, and you can query it using either Elasticsearch syntax or SQL.
## Elasticsearch Syntax

In [4]:
query = {
    "query": {
        "match": {
            "description": { "query": "egg" }
        }
    }
}

res = client.search(
    index="recipes",
    body=query,
    size=2
)

hits = res['hits']['hits']
hits

[{'_index': 'recipes',
  '_id': 'XCOOdJcBa3QUhkV0gTUH',
  '_score': 3.9817066,
  '_source': {'title': 'Genuine Egg Noodles',
   'description': 'These egg noodles are the original egg noodles.  ',
   'instructions': ['Combine flour, salt and baking powder. Mix in eggs and enough water to make the dough workable. Knead dough until stiff. Roll into ball and cut into quarters. Using 1/4 of the dough at a time, roll flat to about 1/8 inch use flour as needed, top and bottom, to prevent sticking. Peel up and roll from one end to the other. Cut roll into 3/8 inch strips. Noodles should be about 4 to 5 inches long depending on how thin it was originally flattened. Let dry for 1 to 3 hours.',
    'Cook like any pasta or, instead of drying first cook it fresh but make sure water is boiling and do not allow to stick. It takes practice to do this right.'],
   'ingredients': ['2 cups Durum wheat flour',
    '1/2 teaspoon salt',
    '1/4 teaspoon baking powder',
    '3 eggs',
    'water as needed'],
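
A `match` clause like the one above can also be combined with exclusions through a `bool` query. The sketch below only builds the request body. `build_ingredient_query` is a hypothetical helper, and it assumes that `ingredients` is indexed as a searchable text field, which the hit above suggests but does not confirm.

```python
from typing import Any, Dict, List


def build_ingredient_query(include: str, exclude: List[str]) -> Dict[str, Any]:
    """Build a bool query: `ingredients` must match `include` and none of `exclude`."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"ingredients": {"query": include}}}],
                "must_not": [
                    {"match": {"ingredients": {"query": term}}} for term in exclude
                ],
            }
        }
    }


body = build_ingredient_query("flour", ["egg", "milk"])
# The body can then be sent the same way as the query above, e.g.:
# res = client.search(index="recipes", body=body, size=2)
```

Keeping query construction in a plain function makes it easy to inspect the body before sending it to the cluster.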

## SQL syntax

In [5]:
query = """
SELECT *
FROM recipes
WHERE description like '%cheese%'
LIMIT 20
"""

res = client.sql.query(body={'query': query})
df = pd.DataFrame(res["datarows"], columns=[c["name"] for c in res["schema"]])
df

Unnamed: 0,description,ingredients,instructions,photo_url,title
0,"Bell peppers stuffed with hashbrowns, ground b...","[4 frozen hash brown patties, 4 bell peppers, ...",[Cook the hashbrown patties according to packa...,http://images.media-allrecipes.com/userphotos/...,Hash Brown Hot Dish Stuffed Bell Peppers
1,"I got this recipe from my sister, whom I nickn...",[1 (16 ounce) package fully cooked kielbasa sa...,[Cook and stir the cut-up kielbasa in a large ...,http://images.media-allrecipes.com/userphotos/...,Cheese's Baked Macaroni and Cheese
2,Chicken breasts are roasted with herbs and the...,"[1 cup red wine, 1/4 cup olive oil, 1 teaspoon...","[In a large resealable bag, combine the red wi...",http://images.media-allrecipes.com/userphotos/...,Sage Apple Chicken with Brie
3,Flatbread and chicken tenders pair with veggie...,"[2 Damascus Bakeries panini flatbread, 2 table...","[Heat George Foreman or panini grill., In a sm...",http://images.media-allrecipes.com/userphotos/...,Chicken Tender Panini Sandwiches
4,"Layers of flavors, including chili and cheese,...","[1/2 cup salsa, 1/2 teaspoon chili powder, 1 (...","[Mix salsa, chili powder and beans in 1-quart ...",http://images.media-allrecipes.com/userphotos/...,Refried Bean Roll-ups
5,This is a recipe I concocted when I feel like ...,"[2 pounds turkey tenderloins, cut into 1/2 inc...","[In a medium bowl, toss the turkey with the So...",http://images.media-allrecipes.com/userphotos/...,Spicy Turkey Wraps with Strawberry Salsa
6,I work at a coffee shop and my favorite coffee...,"[2 cups graham cracker crumbs, 1/2 cup butter,...",[Preheat oven to 350 degrees F (175 degrees C)...,http://images.media-allrecipes.com/userphotos/...,Caramel Macchiato Cheesecake
7,"This creamy pilaf incorporates the fluffy, nut...","[1/4 cup quinoa, 3 tablespoons olive oil, 2 ta...",[Bring a pot of lightly salted water to a boil...,http://images.media-allrecipes.com/userphotos/...,Cheesy Quinoa Pilaf with Spinach
8,"Deliciously rich and oh-so-garlicky. Crabmeat,...","[1 (8 ounce) package cream cheese, softened, 1...",[Heat oven to 375 degrees F. Mix all ingredien...,http://images.media-allrecipes.com/global/reci...,Roasted Garlic Crab Dip
9,It is not a holiday meal without a generous se...,"[3 pounds Yukon gold potatoes, cut into chunks...",[Heat 1-inch water to boiling in large saucepa...,http://images.media-allrecipes.com/global/reci...,Garlic and Parmesan Smashed Potatoes


# Task Instructions

Your goal is to implement two classifiers:

1.	Vegan Meal Classifier
1.	Keto Meal Classifier

Unlike typical supervised machine learning tasks, the labels are not provided in the dataset. Instead, you will rely on clear and verifiable definitions to classify each meal based on its ingredients.

### Definitions:

1. **Vegan Meal**: Contains no animal products whatsoever (no eggs, milk, meat, etc.).
1. **Keto Meal**: Contains no ingredients with more than 10g of carbohydrates per 100g serving. For example, eggs are keto-friendly, while apples are not.

Note that some meals may meet both vegan and keto criteria (e.g., meals containing avocados), though most meals typically fall into neither category.
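
The keto definition can be read as a per-ingredient threshold check. The following is a minimal sketch under that reading, not the required implementation: `CARBS_PER_100G` is a hypothetical lookup table, and its values are rough, illustrative figures rather than data from the recipe index.

```python
from typing import Dict, List, Optional

KETO_CARB_LIMIT_G = 10.0  # threshold from the definition above

# Hypothetical lookup table with rough, illustrative values (g carbs per 100 g).
CARBS_PER_100G: Dict[str, float] = {
    "egg": 1.1,
    "avocado": 8.5,
    "apple": 13.8,
}


def is_ingredient_keto(name: str) -> Optional[bool]:
    """Return True/False when the ingredient is in the table, None when unknown."""
    carbs = CARBS_PER_100G.get(name)
    if carbs is None:
        return None
    return carbs <= KETO_CARB_LIMIT_G


def is_keto(ingredient_names: List[str]) -> bool:
    """A meal counts as keto only if every ingredient is known and within the limit."""
    return all(is_ingredient_keto(name) is True for name in ingredient_names)


print(is_keto(["egg", "avocado"]))  # both within the limit
print(is_keto(["egg", "apple"]))    # apple exceeds the limit
```

Treating unknown ingredients as non-keto is a conservative design choice; a real solution needs a much larger carbohydrate table or another data source.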

## Example heuristic:

In [5]:
def is_ingredient_vegan(ing):
    for animal_product in "egg meat milk butter veal lamb beef chicken sausage".split():
        if animal_product in ing:
            return False
    return True

def is_vegan_example(ingredients):
    return all(map(is_ingredient_vegan, ingredients))
    
df["vegan"] = df["ingredients"].apply(is_vegan_example)
df

Unnamed: 0,description,ingredients,instructions,photo_url,title,vegan
0,"Bell peppers stuffed with hashbrowns, ground b...","[4 frozen hash brown patties, 4 bell peppers, ...",[Cook the hashbrown patties according to packa...,http://images.media-allrecipes.com/userphotos/...,Hash Brown Hot Dish Stuffed Bell Peppers,False
1,"I got this recipe from my sister, whom I nickn...",[1 (16 ounce) package fully cooked kielbasa sa...,[Cook and stir the cut-up kielbasa in a large ...,http://images.media-allrecipes.com/userphotos/...,Cheese's Baked Macaroni and Cheese,False
2,Chicken breasts are roasted with herbs and the...,"[1 cup red wine, 1/4 cup olive oil, 1 teaspoon...","[In a large resealable bag, combine the red wi...",http://images.media-allrecipes.com/userphotos/...,Sage Apple Chicken with Brie,False
3,Flatbread and chicken tenders pair with veggie...,"[2 Damascus Bakeries panini flatbread, 2 table...","[Heat George Foreman or panini grill., In a sm...",http://images.media-allrecipes.com/userphotos/...,Chicken Tender Panini Sandwiches,False
4,"Layers of flavors, including chili and cheese,...","[1/2 cup salsa, 1/2 teaspoon chili powder, 1 (...","[Mix salsa, chili powder and beans in 1-quart ...",http://images.media-allrecipes.com/userphotos/...,Refried Bean Roll-ups,True
5,This is a recipe I concocted when I feel like ...,"[2 pounds turkey tenderloins, cut into 1/2 inc...","[In a medium bowl, toss the turkey with the So...",http://images.media-allrecipes.com/userphotos/...,Spicy Turkey Wraps with Strawberry Salsa,True
6,I work at a coffee shop and my favorite coffee...,"[2 cups graham cracker crumbs, 1/2 cup butter,...",[Preheat oven to 350 degrees F (175 degrees C)...,http://images.media-allrecipes.com/userphotos/...,Caramel Macchiato Cheesecake,False
7,"This creamy pilaf incorporates the fluffy, nut...","[1/4 cup quinoa, 3 tablespoons olive oil, 2 ta...",[Bring a pot of lightly salted water to a boil...,http://images.media-allrecipes.com/userphotos/...,Cheesy Quinoa Pilaf with Spinach,True
8,"Deliciously rich and oh-so-garlicky. Crabmeat,...","[1 (8 ounce) package cream cheese, softened, 1...",[Heat oven to 375 degrees F. Mix all ingredien...,http://images.media-allrecipes.com/global/reci...,Roasted Garlic Crab Dip,False
9,It is not a holiday meal without a generous se...,"[3 pounds Yukon gold potatoes, cut into chunks...",[Heat 1-inch water to boiling in large saucepa...,http://images.media-allrecipes.com/global/reci...,Garlic and Parmesan Smashed Potatoes,False


### Limitations of the Simplistic Heuristic

The heuristic described above is straightforward but can lead to numerous false positives and negatives due to its reliance on keyword matching. Common examples of incorrect classifications include:
- "Peanut butter" being misclassified as non-vegan, as “butter” is incorrectly assumed to imply dairy.
- "eggless" recipes being misclassified as non-vegan, due to the substring “egg.”
- Animal-derived ingredients such as “pork” and “bacon” being incorrectly identified as vegan, as they may not be explicitly listed in the keyword set.
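
These failure modes are easy to reproduce. The sketch below uses the same substring logic as the example heuristic with a shortened keyword list; `naive_is_vegan` is a name introduced here for illustration only.

```python
# Shortened keyword list; the matching logic is the same substring test as above.
ANIMAL_KEYWORDS = ("egg", "meat", "milk", "butter", "beef", "chicken")


def naive_is_vegan(ingredient: str) -> bool:
    """Return False as soon as any keyword appears anywhere in the string."""
    return not any(keyword in ingredient for keyword in ANIMAL_KEYWORDS)


for ingredient in ["peanut butter", "eggless mayonnaise", "bacon", "pork chops"]:
    print(f"{ingredient!r:24} -> vegan={naive_is_vegan(ingredient)}")
```

The first two ingredients come out as non-vegan (false negatives) and the last two as vegan (false positives), which matches the three pitfalls listed above.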


# Submission
## 1. Implement Diet Classifiers
Complete the two classifier functions in the diet_classifiers.py file within this repository. Ensure your implementation correctly identifies “keto” and “vegan” meals. After implementing these functions, verify that the Flask server displays the appropriate badges (“keto” and “vegan”) next to the corresponding recipes.

> **Note**
>
> This repo contains two `diet_classifiers.py` files:
> 1. One in this folder (`nb/src/diet_classifiers.py`)
> 2. One in the Flask web app folder (`web/src/diet_classifiers.py`)
>
> You can develop your solution here in the notebook environment, but to apply your solution 
> to the Flask app you will need to copy your implementation into the `diet_classifiers.py` 
> file in the Flask folder!!!

In [6]:
import re
from typing import Set

# ==============================================================================
#  Constants for Maintainability
# ==============================================================================
# Using sets for efficient O(1) average time complexity for lookups.

# A comprehensive list of common cooking units.
# Includes singular, plural, and common abbreviations.
UNITS: Set[str] = {
    "c", "cup", "cups",
    "g", "gram", "grams",
    "kg", "kilogram", "kilograms",
    "l", "liter", "liters",
    "lb", "lbs", "pound", "pounds",
    "ml", "milliliter", "milliliters",
    "oz", "ounce", "ounces",
    "pinch", "pinches",
    "splash", "splashes",
    "sprig", "sprigs",
    "t", "tsp", "teaspoon", "teaspoons",
    "T", "tbsp", "tablespoon", "tablespoons",
    "can", "cans",
    "clove", "cloves",
    "dash", "dashes",
    "drizzle",
    "drop", "drops",
    "gallon", "gallons",
    "handful", "handfuls",
    "head", "heads",
    "package", "packages",
    "packet", "packets",
    "pint", "pints",
    "quart", "quarts",
    "scoop", "scoops",
    "sheet", "sheets",
    "slice", "slices",
    "stalk", "stalks",
    "stick", "sticks",
    "strip", "strips",
}
# A comprehensive and categorized set of non-essential words found in ingredient lists.
# The purpose of this set is to remove these words from an ingredient string
# to help isolate the core, identifiable name of the food item.
DESCRIPTORS: Set[str] = {
    # --- Preparation & Actions ---
    'beaten', 'blanched', 'boiled', 'braised', 'brewed', 'brined', 'broken',
    'charred', 'chilled', 'chopped', 'clarified', 'coarsely', 'crumbled', 'crushed',
    'cubed', 'cut', 'deboned', 'deglazed', 'deseeded', 'deveined', 'diced',
    'dissolved', 'divided', 'drained', 'finely', 'flaked', 'folded', 'grated',
    'grilled', 'halved', 'heated', 'hulled', 'husked', 'infused',
    'julienned', 'juiced', 'kneaded', 'marinated', 'mashed', 'melted', 'minced',
    'mixed', 'parboiled', 'patted', 'peeled', 'pitted', 'poached', 'pounded',
    'prepared', 'pressed', 'pureed', 'quartered', 'rinsed', 'roasted', 'rolled',
    'roughly', 'scalded', 'scored', 'scrubbed', 'seared', 'seeded', 'segmented',
    'shaved', 'shredded', 'shucked', 'sifted', 'skewered', 'sliced', 'slivered',
    'smashed', 'soaked', 'softened', 'squeezed', 'steamed', 'stemmed', 'stewed',
    'strained', 'stuffed', 'thawed', 'thinly', 'tied', 'toasted', 'torn', 'trimmed',
    'whisked', 'zested',

    # --- State, Condition & Temperature ---
    'canned', 'cold', 'condensed', 'cooked', 'cooled', 'cored', 'creamed', 'cured',
    'defrosted', 'dried', 'fermented', 'firmly', 'fresh', 'freshly', 'frozen',
    'hard', 'hot', 'instant', 'jarred', 'lean', 'leftover', 'light', 'lukewarm',
    'optional', 'pasteurized', 'powdered', 'preserved', 'raw', 'ready-to-use',
    'refrigerated', 'ripe', 'room', 'skin-on', 'skinless', 'soft', 'stiff',
    'temperature', 'uncooked', 'undrained', 'unripe', 'warm', 'washed', 'whole',

    # --- Size & Shape ---
    'bite-sized', 'chunky', 'clump', 'coarse', 'fine', 'jumbo', 'large', 'long',
    'medium', 'round', 'short', 'small', 'thick', 'thin',

    # --- Quantifiers & Qualifiers ---
    'about', 'additional', 'approximately', 'bunch', 'coarse', 'extra', 'generous',
    'heavy', 'heaping', 'level', 'more', 'packed', 'plus', 'scant', 'splash',
    'sprig', 'sprinkle',

    # --- Flavor & Taste ---
    'bitter', 'salty', 'savory', 'sour', 'spicy', 'sweet', 'sweetened', 'unsalted',
    'unsweetened',

    # --- Common Stop Words (Articles, Conjunctions, Prepositions) ---
    'a', 'an', 'and', 'as', 'at', 'for', 'in', 'into', 'of', 'on', 'or', 'the',
    'to', 'with', 'without',

    # --- Instructions & Meta-words ---
    'divided', 'dusting', 'garnish', 'needed', 'serving', 'taste',
}

def parse_ingredient(ingredient_string: str) -> str:
    """
    Parses a raw ingredient string to extract its essential name.

    This function cleans the input string by performing a series of sequential
    operations:
    1.  Converts the string to lowercase.
    2.  Removes text within parentheses (e.g., "(optional)").
    3.  Removes numerical quantities, including fractions and decimals.
    4.  Removes punctuation.
    5.  Splits the string into words and removes common units and descriptors.
    6.  Reassembles the string and normalizes whitespace.

    Args:
        ingredient_string: The raw ingredient string from a recipe.
                           Example: "2 1/2 cups (12.5 oz) sifted all-purpose flour, for dusting"

    Returns:
        A cleaned, normalized string representing the core ingredient.
        Example: "all-purpose flour"
    """
    if not isinstance(ingredient_string, str) or not ingredient_string:
        return ""

    # 1. Convert to lowercase for consistent processing.
    text = ingredient_string.lower()

    # 2. Remove parenthetical remarks (e.g., "(optional)", "(about 1 pound)").
    text = re.sub(r'\([^)]*\)', '', text)

    # 3. Remove numerical quantities, including integers, decimals, and fractions.
    # This regex handles formats like "1 1/2", "1/2", "1.5", "1".
    text = re.sub(r'(\d+\s+)?\d+/\d+|\d+(\.\d+)?|\d+', '', text)

    # 4. Remove common punctuation. We keep hyphens as they can be part of a name.
    text = re.sub(r'[,.;:?!"]', '', text)

    # 5. Tokenize and filter out units and descriptors.
    words = text.split()
    # This list comprehension is efficient for filtering. We check against the
    # predefined sets of UNITS and DESCRIPTORS.
    clean_words = [
        word for word in words if word not in UNITS and word not in DESCRIPTORS
    ]

    # 6. Reassemble the string.
    # text.split() above already discards repeated whitespace, so joining the
    # remaining words with single spaces yields a normalized name.
    clean_name = ' '.join(clean_words).strip()

    return clean_name


# ==============================================================================
#  Self-testing Block
# ==============================================================================
if __name__ == "__main__":
    print("Running Ingredient Parser Self-Test...\n")
    test_cases = [
        "2 cups Durum wheat flour",
        "1/2 teaspoon salt",
        "1/4 teaspoon baking powder",
        "3 large eggs, beaten",
        "water as needed",
        "1 tablespoon butter",
        "1/4 cup chopped mushrooms",
        "1 (10.75 ounce) can condensed cream of mushroom soup",
        "1 pound ground pork",
        "1 tablespoon peanut butter",
        "1/4 cup eggless mayonnaise", # Pitfall test
        "1/2 cup soy milk",           # Vegan-specific test
    ]

    for case in test_cases:
        # Print each raw string next to its parsed result.
        print(f'Original: "{case}"\nParsed:   "{parse_ingredient(case)}"\n')




Running Ingredient Parser Self-Test...

Original: "2 cups Durum wheat flour"
Parsed:   "durum wheat flour"

Original: "1/2 teaspoon salt"
Parsed:   "salt"

Original: "1/4 teaspoon baking powder"
Parsed:   "baking powder"

Original: "3 large eggs, beaten"
Parsed:   "eggs"

Original: "water as needed"
Parsed:   "water"

Original: "1 tablespoon butter"
Parsed:   "butter"

Original: "1/4 cup chopped mushrooms"
Parsed:   "mushrooms"

Original: "1 (10.75 ounce) can condensed cream of mushroom soup"
Parsed:   "cream mushroom soup"

Original: "1 pound ground pork"
Parsed:   "ground pork"

Original: "1 tablespoon peanut butter"
Parsed:   "peanut butter"

Original: "1/4 cup eggless mayonnaise"
Parsed:   "eggless mayonnaise"

Original: "1/2 cup soy milk"
Parsed:   "soy milk"
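
To see what each parsing step contributes, the stages can be traced one at a time on the docstring example. This is a sketch that repeats the same regular expressions outside the function; `SAMPLE_UNITS` and `SAMPLE_DESCRIPTORS` are small illustrative subsets of the full sets, chosen only to cover this one string.

```python
import re

# Small illustrative subsets of UNITS and DESCRIPTORS, enough for this one example.
SAMPLE_UNITS = {"cup", "cups", "oz"}
SAMPLE_DESCRIPTORS = {"sifted", "for", "dusting"}

raw = "2 1/2 cups (12.5 oz) sifted all-purpose flour, for dusting"

stage1 = raw.lower()
stage2 = re.sub(r"\([^)]*\)", "", stage1)                          # drops "(12.5 oz)"
stage3 = re.sub(r"(\d+\s+)?\d+/\d+|\d+(\.\d+)?|\d+", "", stage2)   # drops "2 1/2"
stage4 = re.sub(r'[,.;:?!"]', "", stage3)                          # drops the comma
tokens = [
    word for word in stage4.split()
    if word not in SAMPLE_UNITS and word not in SAMPLE_DESCRIPTORS
]
result = " ".join(tokens)

for label, value in [
    ("lower", stage1),
    ("no parens", stage2),
    ("no numbers", stage3),
    ("no punct", stage4),
    ("result", result),
]:
    print(f"{label:>10}: {value!r}")
```

The order matters: parentheses are removed before numbers so that "12.5 oz" disappears as a unit rather than leaving a stray "oz" behind.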



## Example

In [7]:
hits[0]['_source']['ingredients']

['2 cups Durum wheat flour',
 '1/2 teaspoon salt',
 '1/4 teaspoon baking powder',
 '3 eggs',
 'water as needed']

In [8]:
for ingredient in hits[0]['_source']['ingredients']:
    print(parse_ingredient(ingredient))

durum wheat flour
salt
baking powder
eggs
water


# Classifier with a model
To handle ingredients that the rules do not cover, I looked for a pretrained model during my research for this task and found a small one that is trained to classify foods as PLANT_BASED or ANIMAL_BASED. It was trained on a dataset from USDA FoodData Central, in which the ANIMAL_BASED and PLANT_BASED labels are based on the type of protein available in a food product.
This model is a fine-tuned version of distilbert-base-uncased. It achieves the following results on the evaluation set:

Loss: 0.0249
Accuracy: 0.9

and can be found here: https://huggingface.co/nisuga/food_type_classification_model940

In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def run_classification_example():
    """
    Loads a food classification model and runs it on several example ingredients.
    This function demonstrates the end-to-end process from loading the model
    to interpreting its predictions.
    """
    # Define the specific model we want to use from the Hugging Face Hub
    model_name = "nisuga/food_type_classification_model"
    print(f"--- Loading Model: {model_name} ---")

    # 1. LOAD TOKENIZER AND MODEL
    # The tokenizer prepares the text input in a format the model can understand.
    # The model is the pre-trained neural network for sequence classification.
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name)
        print("Model and tokenizer loaded successfully.\n")
    except Exception as e:
        print(f"Error loading model: {e}")
        print("Please ensure you have an internet connection and the correct model name.")
        return

    # Put the model in evaluation mode. This disables layers like dropout
    # that are only used during training.
    model.eval()

    # 2. DEFINE EXAMPLE INPUTS
    # Let's test the model with a variety of food types.
    example_ingredients = [
        "chicken breast",
        "fresh spinach",
        "cheddar cheese",
        "red wine vinegar",
        "whole wheat bread",
        '1 bunch of asparagus',
        'beef'
    ]

    print("--- Running Classification on Examples ---")

    # 3. PROCESS EACH EXAMPLE
    for ingredient in example_ingredients:
        print(f"\nProcessing ingredient: '{ingredient}'")

        # A. Tokenize the input text
        # `return_tensors="pt"` tells the tokenizer to return PyTorch tensors.
        # `padding=True` and `truncation=True` handle inputs of different lengths.
        inputs = tokenizer(ingredient, return_tensors="pt", padding=True, truncation=True)

        # B. Perform Inference
        # `torch.no_grad()` is a crucial optimization for inference. It tells
        # PyTorch not to calculate gradients, which saves memory and computation.
        with torch.no_grad():
            outputs = model(**inputs)

        # C. Interpret the output
        # The model's output contains `logits`, which are the raw, unnormalized
        # scores for each possible class.
        logits = outputs.logits

        # To get a confidence score, we apply the softmax function to the logits.
        # This converts the scores into probabilities that sum to 1.
        probabilities = torch.softmax(logits, dim=1)

        # To find the predicted class, we find the index of the highest logit score.
        predicted_class_id = torch.argmax(logits, dim=1).item()

        # We can get the human-readable label from the model's configuration.
        predicted_label = model.config.id2label[predicted_class_id]
        
        # Get the confidence score for the predicted class.
        confidence_score = probabilities[0][predicted_class_id].item()

        # D. Print the results
        print(f"  -> Predicted Label: '{predicted_label}'")
        print(f"  -> Confidence: {confidence_score:.4f}")

    # You can also inspect all possible labels the model knows about
    print("\n--- Model's Known Labels ---")
    print(model.config.id2label)


# This block ensures the function runs when the script is executed directly
# if __name__ == "__main__":
#     run_classification_example()
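
The logits-to-confidence step can be checked by hand without loading the model. The sketch below reimplements softmax in plain Python on made-up logits; the numbers are hypothetical and are not outputs of the model above.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one ingredient; indices 0 and 1 stand for the two classes.
logits = [2.0, 0.5]
probabilities = softmax(logits)

# The predicted class is the index of the largest logit (argmax).
predicted_class_id = max(range(len(logits)), key=lambda i: logits[i])
confidence = probabilities[predicted_class_id]

print(f"class={predicted_class_id} confidence={confidence:.4f}")
```

Because softmax is monotonic, taking the argmax of the logits or of the probabilities selects the same class; softmax is only needed for the confidence value.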

# Final Vegan Classifier

In [11]:
import re
from typing import List, Dict, Any, Set, Optional

# Make sure you have `transformers` and a backend like `torch` or `tensorflow` installed.
# pip install transformers torch
try:
    from transformers import pipeline, Pipeline
except ImportError:
    print("Warning: `transformers` library not found. ML-based classification will not be available.")
    print("Please run 'pip install transformers torch' to install.")
    pipeline = None
    Pipeline = None

# `parse_ingredient` is defined in an earlier cell of this notebook.
# When moving this code into a standalone module, import it from wherever the parser lives.


# ==============================================================================
#  Global State and Constants for Classifier
# ==============================================================================

# Caching mechanism to store results and avoid re-computation, critical for performance.
# Key: clean ingredient name (str), Value: vegan status (bool)
VEGAN_CACHE: Dict[str, bool] = {}

# Lazy-loaded Hugging Face pipeline. Initialized only when first needed.
CLASSIFIER_PIPELINE: Optional[Pipeline] = None

# --- Rule-Based Keyword Sets ---

# Keywords for ingredients that are definitively NOT vegan.
# This list is comprehensive to catch common animal products quickly.
NON_VEGAN_KEYWORDS: Set[str] = {
    # --- Meats (Red & White) ---
    'andouille', 'bacon', 'beef', 'biltong', 'bison', 'boar', 'bologna', 'bratwurst', 'brisket', 'capicola', 'chorizo', 'chops', 'corned beef', 'frankfurter', 'goat', 'ground chuck', 'guanciale', 'ham', 'head cheese', 'jerky', 'kebab', 'kielbasa', 'kidney', 'lamb', 'liver', 'meat', 'meatball', 'meatballs', 'mince', 'mortadella', 'mutton', 'pancetta', 'pastrami', 'pemmican', 'pepperoni', 'pork', 'prosciutto', 'ribs', 'salami', 'sausage', 'shank', 'soppressata', 'steak', 'sweetbreads', 'tenderloin', 'tongue', 'tripe', 'veal', 'venison',
    # --- Poultry ---
    'albumen', 'capon', 'chicken', 'confit', 'cornish hen', 'duck', 'egg', 'eggs', 'foie gras', 'giblets', 'goose', 'guinea fowl', 'meringue', 'nuggets', 'ostrich', 'partridge', 'pate', 'pheasant', 'poultry', 'quail', 'turkey', 'yolk',
    # --- Seafood (Fish) ---
    'anchovies', 'bass', 'bluefish', 'carp', 'catfish', 'caviar', 'cod', 'eel', 'escargot', 'fish', 'flounder', 'gefilte fish', 'grouper', 'haddock', 'halibut', 'herring', 'lox', 'mahi-mahi', 'mackerel', 'monkfish', 'perch', 'pickerel', 'pollock', 'roe', 'salmon', 'sardine', 'seabass', 'sole', 'sturgeon', 'surimi', 'swordfish', 'tilapia', 'trout', 'tuna', 'walleye',
    # --- Seafood (Shellfish & Other) ---
    'abalone', 'calamari', 'clams', 'cockle', 'conch', 'crab', 'crawfish', 'crayfish', 'cuttlefish', 'krill', 'langostino', 'lobster', 'mussels', 'octopus', 'oyster', 'oysters', 'prawns', 'scallop', 'scallops', 'scampi', 'sea urchin', 'seafood', 'shrimp', 'squid', 'uni', 'whelk',
    # --- Dairy ---
    'asiago', 'bleu', 'brie', 'butter', 'buttermilk', 'camembert', 'casein', 'caseinate', 'cheddar', 'cheese', 'colby', 'cottage', 'cream', 'creme', 'curd', 'edam', 'feta', 'ghee', 'gorgonzola', 'gouda', 'gruyere', 'half-and-half', 'halloumi', 'havarti', 'kefir', 'lactalbumin', 'lactose', 'manchego', 'mascarpone', 'milk', 'monterey jack', 'mozzarella', 'muenster', 'neufchatel', 'paneer', 'parmesan', 'provolone', 'queso', 'ricotta', 'sour cream', 'whey', 'yogurt',
    # --- Animal Fats, By-products & Additives ---
    'ambergris', 'aspic', 'bone char', 'bone meal', 'bone marrow', 'bouillon', 'broth', 'carmine', 'chitin', 'cochineal', 'collagen', 'consomme', 'demi-glace', 'drippings', 'fat', 'fish oil', 'gelatin', 'glycerides', 'glycerol', 'isinglass', 'keratin', 'l-cysteine', 'lanolin', 'lard', 'lipase', 'musk', 'pepsin', 'rennet', 'schmaltz', 'shellac', 'stearic acid', 'stock', 'suet', 'tallow', 'vitamin d3',
    # --- Bee Products ---
    'bee pollen', 'beeswax', 'honey', 'propolis', 'royal jelly',
}

# --- ALWAYS_VEGAN_KEYWORDS ---
# A comprehensive list of common ingredients that are almost always vegan. 
# This helps quickly classify simple ingredients without needing the ML model.
ALWAYS_VEGAN_KEYWORDS: Set[str] = {
    # --- Staples & Dry Goods ---
    'arrowroot', 'beans', 'bread crumbs', 'chickpeas', 'cornmeal', 'cornstarch', 'couscous', 'flour', 'lentils', 'pasta', 'quinoa', 'rice', 'sugar', 'yeast',
    # --- Fats & Oils ---
    'margarine', 'oil', 'canola oil', 'coconut oil', 'olive oil', 'sesame oil', 'sunflower oil', 'vegetable oil',
    # --- Seasonings, Herbs & Spices ---
    'basil', 'black pepper', 'cayenne', 'chili', 'cilantro', 'cinnamon', 'clove', 'cumin', 'curry', 'garlic', 'ginger', 'herbs', 'mustard', 'nutmeg', 'onion', 'oregano', 'paprika', 'parsley', 'pepper', 'rosemary', 'saffron', 'salt', 'spices', 'thyme', 'turmeric',
    # --- Liquids & Acids ---
    'coffee', 'club soda', 'lemon juice', 'lime juice', 'seltzer', 'tea', 'vegetable stock', 'vegetable broth', 'vinegar', 'water',
}


# --- VEGAN_ALTERNATIVE_PREFIXES ---
# A list of prefixes for plant-based alternatives that can otherwise be misidentified
# by simple keyword matching (e.g., 'soy milk', 'cashew cheese').
VEGAN_ALTERNATIVE_PREFIXES: Set[str] = {
    # --- Nuts & Seeds ---
    'almond', 'cashew', 'flax', 'hazelnut', 'hemp', 'macadamia', 'pecan', 'pistachio', 'pumpkinseed', 'sesame', 'sunflower', 'walnut',
    # --- Grains & Legumes ---
    'chickpea', 'lentil', 'oat', 'pea', 'quinoa', 'rice', 'soy',
    # --- Fruits & Vegetables ---
    'apple', 'avocado', 'banana', 'potato', 'vegetable', 'veggie',
    # --- Other ---
    'cocoa', 'coconut', 'plant', 'plant-based', 'vegan',
}

def _get_classifier() -> Optional[Pipeline]:
    """
    Initializes and returns the Hugging Face text classification pipeline.

    This function uses lazy loading: the model is only loaded into memory
    the first time it's needed, preventing slow startup times.
    It is robust against the absence of the `transformers` library.
    """
    global CLASSIFIER_PIPELINE
    if CLASSIFIER_PIPELINE is None:
        if pipeline is None:
            # transformers library is not installed
            return None
        # Model described in the "Classifier with a model" section above.
        model_name = "nisuga/food_type_classification_model"
        try:
            print("INFO: Initializing Hugging Face classifier for the first time. This may take a moment...")
            CLASSIFIER_PIPELINE = pipeline("text-classification", model=model_name)
            print("INFO: Classifier initialized successfully.")
        except Exception as e:
            print(f"ERROR: Failed to load Hugging Face model '{model_name}'. ML classification will be disabled.")
            print(f"Error details: {e}")
            # Set to a dummy value to prevent re-trying
            CLASSIFIER_PIPELINE = "failed"
    
    if CLASSIFIER_PIPELINE == "failed":
        return None
        
    return CLASSIFIER_PIPELINE


def is_ingredient_vegan(ingredient_string: str) -> bool:
    """
    Classifies a single ingredient as vegan or not using a multi-layered approach.

    The process is as follows:
    1. Parse the ingredient to get a clean name.
    2. Check a cache for a previously computed result.
    3. Apply a series of precise rules to handle common cases and pitfalls
       (e.g., "eggless", "peanut butter", "soy milk").
    4. If no rule applies, use a Hugging Face ML model for classification.
    5. Cache the final result before returning.

    Args:
        ingredient_string: The raw ingredient string from a recipe.

    Returns:
        True if the ingredient is determined to be vegan, False otherwise.
    """
    # 1. Parse the ingredient string to get a clean, standardized name.
    clean_name = parse_ingredient(ingredient_string)
    if not clean_name:
        return True  # An empty or unparsable ingredient is assumed to not make a dish non-vegan.

    # 2. Check cache for a quick result.
    if clean_name in VEGAN_CACHE:
        return VEGAN_CACHE[clean_name]

    # --- 3. Apply Rule-Based Logic (Fast and Accurate Checks) ---
    
    # Pitfall: Handle "eggless" before checking for "egg".
    if "eggless" in clean_name:
        VEGAN_CACHE[clean_name] = True
        return True

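    # Pitfall: Handle plant-based butters/milks before general "butter"/"milk" checks.
    # Prefixes are matched at the start of a word, so "pea" still covers "peanut butter"
    # while "oat" no longer fires inside "goat cheese".
    # Caveat: mixed ingredients such as "apple chicken sausage" still pass this rule.
    if any(re.search(rf'\b{re.escape(prefix)}', clean_name)
           for prefix in VEGAN_ALTERNATIVE_PREFIXES):
        # e.g., "peanut butter", "soy milk"
        VEGAN_CACHE[clean_name] = True
        return True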

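    # General check for common non-vegan keywords.
    # Single-word keywords are compared against whole tokens to avoid false
    # positives (e.g., "ham" in "shame"). Multi-word keywords such as "foie gras"
    # can never equal a single token, so they are matched as whole phrases.
    words_in_name = set(clean_name.split())
    has_phrase = any(
        ' ' in kw and re.search(rf'\b{re.escape(kw)}\b', clean_name)
        for kw in NON_VEGAN_KEYWORDS
    )
    if has_phrase or not NON_VEGAN_KEYWORDS.isdisjoint(words_in_name):
        VEGAN_CACHE[clean_name] = False
        return False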

    # Check for always-vegan ingredients.
    if clean_name in ALWAYS_VEGAN_KEYWORDS:
        VEGAN_CACHE[clean_name] = True
        return True

    # --- 4. Fallback to ML Model for Ambiguous Cases ---
    
    # Assume non-vegan as a safe default if ML model fails or is unavailable.
    result = False
    
    classifier = _get_classifier()
    if classifier:
        try:
            prediction = classifier(clean_name, top_k=1)[0]
            
            # The model labels are 'PLANT_BASED' and 'ANIMAL_BASED'.
            if prediction['label'] == 'PLANT_BASED':
                result = True
        except Exception as e:
            print(f"WARNING: ML classification failed for '{clean_name}'. Defaulting to non-vegan. Error: {e}")
            result = False
    else:
        print(f"WARNING: No ML classifier available. Defaulting '{clean_name}' to non-vegan.")
        result = False

    # 5. Cache the result before returning.
    VEGAN_CACHE[clean_name] = result
    return result


# # ==============================================================================
# #  Self-testing Block
# # ==============================================================================
# if __name__ == "__main__":
#     #Test cases to validate the logic, including pitfalls.
#     vegan_recipe = ["1 cup flour", "1/2 cup sugar", "1/4 cup soy milk", "1 tbsp apple butter"]
#     non_vegan_recipe = ["2 large eggs", "1 cup milk", "50g butter", "1 cup flour"]
#     tricky_vegan_recipe = ["1/4 cup eggless mayonnaise", "1 tbsp peanut butter", "water as needed"]
#     tricky_non_vegan_recipe = ["1 lb ground chicken", "1 tbsp honey"]

#     print("--- Testing Vegan Recipe Classifier ---\n")

#     print(f"Is `vegan_recipe` vegan? \t\t{all(map(is_ingredient_vegan, vegan_recipe))} \t(Expected: True)")
#     print(f"Is `non_vegan_recipe` vegan? \t{all(map(is_ingredient_vegan, non_vegan_recipe))} \t(Expected: False)")
#     print(f"Is `tricky_vegan_recipe` vegan? {all(map(is_ingredient_vegan, tricky_vegan_recipe))} \t(Expected: True)")
#     print(f"Is `tricky_non_vegan_recipe` vegan? {all(map(is_ingredient_vegan, tricky_non_vegan_recipe))} \t(Expected: False)")

#     print("\n--- Testing Individual Ingredients ---")
#     print(f"Is '2 tbsp honey' vegan? \t{is_ingredient_vegan('2 tbsp honey')} \t(Expected: False)")
#     print(f"Is '1 cup of cheese' vegan? \t{is_ingredient_vegan('1 cup of cheese')} \t(Expected: False)")
#     print(f"Is 'peanut butter' vegan? \t{is_ingredient_vegan('peanut butter')} \t(Expected: True)")
#     print(f"Is '1/2 cup soy milk' vegan? \t{is_ingredient_vegan('1/2 cup soy milk')} \t(Expected: True)")
#     # This might require the ML model:
#     print(f"Is '1 bunch of asparagus' vegan? {is_ingredient_vegan('1 bunch of asparagus')} \t(Expected: True)")
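The order of the rule checks above matters: exceptions such as "eggless" and plant-based "butters" have to be tested before the general keyword check, and keywords have to be compared against whole words rather than substrings. The following is a minimal, self-contained sketch of that idea. The keyword sets and function names are hypothetical, abbreviated stand-ins (not the notebook's actual `NON_VEGAN_KEYWORDS` or `VEGAN_ALTERNATIVE_PREFIXES`), and the sketch defaults to vegan where the real function would fall back to the ML model.

```python
# Hypothetical, abbreviated keyword sets for illustration only.
DEMO_NON_VEGAN = {"egg", "butter", "milk", "ham", "honey"}
DEMO_VEGAN_PREFIXES = {"peanut butter", "soy milk", "almond milk"}


def demo_is_vegan(clean_name: str) -> bool:
    """Apply the same rule order as above: exceptions first, then keywords."""
    if "eggless" in clean_name:
        return True
    if any(prefix in clean_name for prefix in DEMO_VEGAN_PREFIXES):
        return True
    # Whole-word matching: "ham" must not fire on "graham".
    return DEMO_NON_VEGAN.isdisjoint(clean_name.split())


def naive_substring_is_vegan(clean_name: str) -> bool:
    """Naive variant that matches keywords anywhere in the string."""
    return not any(keyword in clean_name for keyword in DEMO_NON_VEGAN)


for name in ["peanut butter", "butter", "graham cracker", "eggless mayonnaise"]:
    print(f"{name!r}: ordered rules={demo_is_vegan(name)}, "
          f"naive substring={naive_substring_is_vegan(name)}")
```

The naive variant rejects "graham cracker" (it contains "ham"), "peanut butter" and "eggless mayonnaise", which is exactly the class of false negatives the ordered, word-level rules are meant to avoid.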
    

In [12]:
def is_vegan(ingredients):
    return all(map(is_ingredient_vegan, ingredients))
    
df["vegan"] = df["ingredients"].apply(is_vegan)
df

INFO: Initializing Hugging Face classifier for the first time. This may take a moment...


Device set to use cpu


INFO: Classifier initialized successfully.


Unnamed: 0,description,ingredients,instructions,photo_url,title,vegan
0,"Bell peppers stuffed with hashbrowns, ground b...","[4 frozen hash brown patties, 4 bell peppers, ...",[Cook the hashbrown patties according to packa...,http://images.media-allrecipes.com/userphotos/...,Hash Brown Hot Dish Stuffed Bell Peppers,False
1,"I got this recipe from my sister, whom I nickn...",[1 (16 ounce) package fully cooked kielbasa sa...,[Cook and stir the cut-up kielbasa in a large ...,http://images.media-allrecipes.com/userphotos/...,Cheese's Baked Macaroni and Cheese,False
2,Chicken breasts are roasted with herbs and the...,"[1 cup red wine, 1/4 cup olive oil, 1 teaspoon...","[In a large resealable bag, combine the red wi...",http://images.media-allrecipes.com/userphotos/...,Sage Apple Chicken with Brie,False
3,Flatbread and chicken tenders pair with veggie...,"[2 Damascus Bakeries panini flatbread, 2 table...","[Heat George Foreman or panini grill., In a sm...",http://images.media-allrecipes.com/userphotos/...,Chicken Tender Panini Sandwiches,False
4,"Layers of flavors, including chili and cheese,...","[1/2 cup salsa, 1/2 teaspoon chili powder, 1 (...","[Mix salsa, chili powder and beans in 1-quart ...",http://images.media-allrecipes.com/userphotos/...,Refried Bean Roll-ups,False
5,This is a recipe I concocted when I feel like ...,"[2 pounds turkey tenderloins, cut into 1/2 inc...","[In a medium bowl, toss the turkey with the So...",http://images.media-allrecipes.com/userphotos/...,Spicy Turkey Wraps with Strawberry Salsa,False
6,I work at a coffee shop and my favorite coffee...,"[2 cups graham cracker crumbs, 1/2 cup butter,...",[Preheat oven to 350 degrees F (175 degrees C)...,http://images.media-allrecipes.com/userphotos/...,Caramel Macchiato Cheesecake,False
7,"This creamy pilaf incorporates the fluffy, nut...","[1/4 cup quinoa, 3 tablespoons olive oil, 2 ta...",[Bring a pot of lightly salted water to a boil...,http://images.media-allrecipes.com/userphotos/...,Cheesy Quinoa Pilaf with Spinach,True
8,"Deliciously rich and oh-so-garlicky. Crabmeat,...","[1 (8 ounce) package cream cheese, softened, 1...",[Heat oven to 375 degrees F. Mix all ingredien...,http://images.media-allrecipes.com/global/reci...,Roasted Garlic Crab Dip,False
9,It is not a holiday meal without a generous se...,"[3 pounds Yukon gold potatoes, cut into chunks...",[Heat 1-inch water to boiling in large saucepa...,http://images.media-allrecipes.com/global/reci...,Garlic and Parmesan Smashed Potatoes,False


# Nutrition Database

In [22]:

# --- Configuration ---
# Path to the directory where you unzipped the SR Legacy files.
DATA_SOURCE_PATH = './sr_legacy/'

# The name for our final, clean database file.
OUTPUT_DB_PATH = './data/nutrition_database.csv'

# Column names for the legacy text files, based on USDA documentation.
FOOD_DES_COLS = [
    'NDB_No', 'FdGrp_Cd', 'Long_Desc', 'Shrt_Desc', 'ComName', 'ManufacName', 
    'Survey', 'Ref_desc', 'Refuse', 'SciName', 'N_Factor', 'Pro_Factor', 
    'Fat_Factor', 'CHO_Factor'
]

NUT_DATA_COLS = [
    'NDB_No', 'Nutr_No', 'Nutr_Val', 'Num_Data_Pts', 'Std_Error', 'Src_Cd', 
    'Deriv_Cd', 'Ref_NDB_No', 'Add_Nutr_Mark', 'Num_Studies', 'Min', 'Max', 
    'DF', 'Low_EB', 'Up_EB', 'Stat_cmt', 'AddMod_Date', 'CC'
]

NUTR_DEF_COLS = [
    'Nutr_No', 'Units', 'Tagname', 'NutrDesc', 'Num_Dec', 'SR_Order'
]

# The specific nutrients we want to extract.
# `NutrDesc` is the column name in the legacy format.
TARGET_NUTRIENTS = {
    'Carbohydrate, by difference': 'carbs',
    'Protein': 'protein',
    'Sugars, total': 'sugar' # Note: In some versions, it's 'Sugars, total including NLEA'
}

def create_nutrition_database_from_txt():
    """
    Loads the raw USDA SR Legacy TXT files, processes them, and saves a clean,
    wide-format nutritional database as a single CSV file.
    
    This version is specifically adapted to parse the tilde/caret delimited format.
    """
    print("--- Starting Nutrition Database Preparation (from TXT files) ---")

    # --- 1. Load the necessary TXT files with correct parsing ---
    try:
        print(f"Loading data from '{DATA_SOURCE_PATH}'...")
        # food.csv equivalent is FOOD_DES.txt
        food_df = pd.read_csv(
            os.path.join(DATA_SOURCE_PATH, 'FOOD_DES.txt'),
            sep='^',
            quotechar='~',
            header=None,
            names=FOOD_DES_COLS,
            encoding='latin1' # This encoding is often needed for these files
        )

        # nutrient.csv equivalent is NUTR_DEF.txt
        nutrient_df = pd.read_csv(
            os.path.join(DATA_SOURCE_PATH, 'NUTR_DEF.txt'),
            sep='^',
            quotechar='~',
            header=None,
            names=NUTR_DEF_COLS,
            encoding='latin1'
        )

        # food_nutrient.csv equivalent is NUT_DATA.txt
        food_nutrient_df = pd.read_csv(
            os.path.join(DATA_SOURCE_PATH, 'NUT_DATA.txt'),
            sep='^',
            quotechar='~',
            header=None,
            names=NUT_DATA_COLS,
            encoding='latin1'
        )
        
    except FileNotFoundError as e:
        print(f"ERROR: Could not find required TXT files in '{DATA_SOURCE_PATH}'.")
        print("Please ensure FOOD_DES.txt, NUTR_DEF.txt, and NUT_DATA.txt are present.")
        print(f"Details: {e}")
        return

    # --- 2. Filter for only the nutrients we care about ---
    #print(f"Filtering for target nutrients: {list(TARGET_NUTRIENTS.keys())}")
    target_nutrient_ids = nutrient_df[nutrient_df['NutrDesc'].isin(TARGET_NUTRIENTS.keys())]
    
    filtered_food_nutrient_df = food_nutrient_df[
        food_nutrient_df['Nutr_No'].isin(target_nutrient_ids['Nutr_No'])
    ]

    # --- 3. Merge the tables to get food and nutrient names ---
    #print("Merging data tables...")
    merged_df = pd.merge(
        filtered_food_nutrient_df,
        food_df[['NDB_No', 'Long_Desc']],
        on='NDB_No',
        how='left'
    )
    merged_df = pd.merge(
        merged_df,
        nutrient_df[['Nutr_No', 'NutrDesc']],
        on='Nutr_No',
        how='left'
    )

    # --- 4. Pivot the table from "long" to "wide" format ---
    #print("Pivoting table to wide format...")
    # Use the original legacy column names for pivoting
    nutrition_pivot_df = merged_df.pivot_table(
        index='Long_Desc',
        columns='NutrDesc',
        values='Nutr_Val'
    ).reset_index()

    # --- 5. Clean up the final DataFrame ---
    #print("Cleaning up the final database...")
    nutrition_pivot_df = nutrition_pivot_df.rename(columns=TARGET_NUTRIENTS)
    nutrition_pivot_df = nutrition_pivot_df.rename(columns={'Long_Desc': 'food_name'})
    
    cols_to_fill = ['carbs', 'protein', 'sugar']
    for col in cols_to_fill:
        if col not in nutrition_pivot_df.columns:
            nutrition_pivot_df[col] = 0.0
    nutrition_pivot_df[cols_to_fill] = nutrition_pivot_df[cols_to_fill].fillna(0.0)
    
    nutrition_pivot_df['food_name'] = nutrition_pivot_df['food_name'].str.lower()
    
    # --- 6. Save the final database ---
    os.makedirs(os.path.dirname(OUTPUT_DB_PATH), exist_ok=True)
    nutrition_pivot_df.to_csv(OUTPUT_DB_PATH, index=False)
    
create_nutrition_database_from_txt()


--- Starting Nutrition Database Preparation (from TXT files) ---
Loading data from './sr_legacy/'...


  food_nutrient_df = pd.read_csv(
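To make the parsing and pivoting steps above easier to follow, here is a small sketch that runs the same `read_csv` settings (`sep='^'`, `quotechar='~'`) on two tiny in-memory tables and pivots the result from long to wide. The rows are illustrative values typed in by hand, not an extract of the real SR Legacy files, and the `NUTR_NAMES` mapping is an assumption introduced for this example. Passing `dtype=str` for the ID columns is also an addition here, intended to keep leading zeros in `NDB_No`.

```python
import io

import pandas as pd

# Two tiny tables in the SR Legacy layout: fields separated by '^',
# text fields wrapped in '~'. Values are illustrative only.
food_des = io.StringIO(
    "~01001~^~Butter, salted~\n"
    "~09003~^~Apples, raw, with skin~\n"
)
nut_data = io.StringIO(
    "~01001~^~203~^0.85\n"
    "~01001~^~205~^0.06\n"
    "~09003~^~203~^0.26\n"
    "~09003~^~205~^13.81\n"
)

food_df = pd.read_csv(
    food_des, sep="^", quotechar="~", header=None,
    names=["NDB_No", "Long_Desc"], dtype={"NDB_No": str},
)
nut_df = pd.read_csv(
    nut_data, sep="^", quotechar="~", header=None,
    names=["NDB_No", "Nutr_No", "Nutr_Val"],
    dtype={"NDB_No": str, "Nutr_No": str},
)

# Hypothetical mapping from nutrient number to a short column name.
NUTR_NAMES = {"203": "protein", "205": "carbs"}

long_df = nut_df.merge(food_df, on="NDB_No", how="left")
long_df["nutrient"] = long_df["Nutr_No"].map(NUTR_NAMES)

# Long -> wide: one row per food, one column per nutrient.
wide_df = long_df.pivot_table(
    index="Long_Desc", columns="nutrient", values="Nutr_Val"
).reset_index()
wide_df.columns.name = None
print(wide_df)
```

One design point worth keeping in mind: `pivot_table` aggregates with the mean by default, so if two foods shared the same description their nutrient values would be averaged rather than kept apart.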


# Keto Classifier

In [20]:
!pip install thefuzz 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [23]:

try:
    from sklearn.metrics import classification_report, confusion_matrix
except ImportError:
    def classification_report(y_true, y_pred, **kwargs):
        print("Warning: scikit-learn not found. Skipping classification report.", file=sys.stderr)
        return "scikit-learn not installed."
    
    def confusion_matrix(y_true, y_pred):
        print("Warning: scikit-learn not found. Cannot generate confusion matrix.", file=sys.stderr)
        return [[0, 0], [0, 0]]


UNITS: Set[str] = {"c", "cup", "cups", "g", "gram", "grams", "kg", "kilogram", "kilograms", "lb", "lbs", "pound", "pounds", "ml", "milliliter", "milliliters", "oz", "ounce", "ounces", "pinch", "pinches", "splash", "splashes", "sprig", "sprigs", "t", "tsp", "teaspoon", "teaspoons", "T", "tbsp", "tablespoon", "tablespoons", "can", "cans", "clove", "cloves", "dash", "dashes", "drizzle", "drop", "drops", "gallon", "gallons", "handful", "handfuls", "head", "heads", "package", "packages", "packet", "packets", "pint", "pints", "quart", "quarts", "scoop", "scoops", "sheet", "sheets", "slice", "slices", "stalk", "stalks", "stick", "sticks", "strip", "strips"}

DESCRIPTORS: Set[str] = {'beaten', 'blanched', 'boiled', 'braised', 'brewed', 'brined', 'broken', 'charred', 'chilled', 'chopped', 'clarified', 'coarsely', 'crumbled', 'crushed', 'cubed', 'cut', 'deboned', 'deglazed', 'deseeded', 'deveined', 'diced', 'dissolved', 'divided', 'drained', 'finely', 'flaked', 'folded', 'grated', 'grilled', 'halved', 'heated', 'hulled', 'husked', 'infused', 'julienned', 'juiced', 'kneaded', 'marinated', 'mashed', 'melted', 'minced', 'mixed', 'parboiled', 'patted', 'peeled', 'pitted', 'poached', 'pounded', 'prepared', 'pressed', 'pureed', 'quartered', 'rinsed', 'roasted', 'rolled', 'roughly', 'scalded', 'scored', 'scrubbed', 'seared', 'seeded', 'segmented', 'shaved', 'shredded', 'shucked', 'sifted', 'skewered', 'sliced', 'slivered', 'smashed', 'soaked', 'softened', 'squeezed', 'steamed', 'stemmed', 'stewed', 'strained', 'stuffed', 'thawed', 'thinly', 'tied', 'toasted', 'torn', 'trimmed', 'whisked', 'zested', 'canned', 'cold', 'condensed', 'cooked', 'cooled', 'cored', 'creamed', 'cured', 'defrosted', 'fermented', 'firmly', 'freshly', 'frozen', 'hard', 'hot', 'instant', 'jarred', 'lean', 'leftover', 'light', 'lukewarm', 'optional', 'pasteurized', 'powdered', 'preserved', 'raw', 'ready-to-use', 'refrigerated', 'ripe', 'room', 'skin-on', 'skinless', 'soft', 'stiff', 'temperature', 'uncooked', 'undrained', 'unripe', 'warm', 'washed', 'whole', 'bite-sized', 'chunky', 'clump', 'coarse', 'fine', 'jumbo', 'large', 'long', 'medium', 'round', 'short', 'small', 'thick', 'thin', 'about', 'additional', 'approximately', 'bunch', 'extra', 'generous', 'heavy', 'heaping', 'level', 'more', 'packed', 'plus', 'scant', 'splash', 'sprig', 'sprinkle', 'bitter', 'salty', 'savory', 'sour', 'spicy', 'sweetened', 'unsalted', 'unsweetened', 'a', 'an', 'and', 'as', 'at', 'for', 'in', 'into', 'of', 'on', 'or', 'the', 'to', 'with', 'without', 'dusting', 'garnish', 'needed', 'serving', 'taste'}

def _depluralize(word: str) -> str:
    """Naive de-pluralizer: strips one trailing 's' unless the word ends in 'ss'."""
    if word.endswith('ss'): return word
    if word.endswith('s'): return word[:-1]
    return word

def parse_ingredient(ingredient_string: str) -> str:
    """An improved parser that cleans an ingredient string and de-pluralizes it."""
    if not isinstance(ingredient_string, str) or not ingredient_string: return ""
    text = ingredient_string.lower()
    text = re.sub(r'\([^)]*\)', '', text)
    text = re.sub(r'(\d+\s+)?\d+/\d+|\d+(\.\d+)?|\d+', '', text)
    text = re.sub(r'[,.;:?!"]', '', text)
    words = text.split()
    clean_words = [_depluralize(word) for word in words if word not in UNITS and word not in DESCRIPTORS]
    return ' '.join(clean_words).strip()
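Because every token passes through the suffix rule in `_depluralize`, any keyword list used downstream has to contain the post-processed form of a word, not its dictionary form. The short sketch below re-implements the same rule under a different name (`depluralize`, introduced here for illustration) and prints what happens to a few ingredient words.

```python
def depluralize(word: str) -> str:
    """Same suffix rule as above: drop one trailing 's' unless the word ends in 'ss'."""
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


for word in ["eggs", "strawberries", "molasses", "couscous", "asparagus", "swiss"]:
    print(f"{word} -> {depluralize(word)}")
```

Words such as "molasses", "couscous" and "asparagus" lose their final letter, and "strawberries" becomes "strawberrie". A keyword only matches if it is stored in that truncated form, so entries like `'molasses'`, `'couscous'` or `'herbs'` in a word-level keyword set would appear unable to match a parsed token.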




In [24]:
# --- Tuning Parameters & Knowledge Bases ---
CONFIDENCE_THRESHOLD = 80
LOG_FAILED_LOOKUPS = True

# Keyword lists used for fast accept/reject decisions before any database lookup.
MANUAL_KETO_OVERRIDES = {
    'water', 'salt', 'oil', 'butter', 'avocado', 'egg', 'chicken', 'beef', 'pork', 
    'lamb', 'fish', 'salmon', 'tuna', 'shrimp', 'cheese', 'mayonnaise', 'vinegar', 
    'mustard', 'splenda', 'stevia', 'erythritol', 'garlic', 'paprika', 'olive',
    'cream', 'bacon', 'lettuce', 'spinach', 'broccoli', 'cauliflower', 'pepper',
    'onion', 'mushroom', 'lemon', 'lime', 'herbs', 'spice'
}

NON_KETO_KEYWORDS = { 
    # --- Grains, Flours & Starches ---
    'barley', 'bread', 'breadcrumb', 'cereal', 'corn', 'cornmeal', 'cornstarch', 'couscous', 
    'cracker', 'crouton', 'flour', 'granola', 'millet', 'oat', 'pasta', 'panko',
    'pretzel', 'quinoa', 'rice', 'rye', 'semolina', 'spelt', 'tapioca', 'tortilla', 'wheat',
    'baguette', 'bannock', 'ciabatta', 'noodle', 'rusk',

    # --- Starchy Vegetables & Tubers ---
    'parsnip', 'pea', 'plantain', 'potato', 'yam',

    # --- High-Sugar Fruits ---
    'banana', 'cherry', 'date', 'fig', 'grape', 'lychee', 'mango', 'pineapple', 
    'tangerine', 'apple', 'orange', 'strawberry', 'strawberrie', 
    
    # --- Alcoholic Beverages (High Carb) ---
    'vodka', 'rum', 'liqueur', 'beer', 'wine',  # FIXED: Added alcohol
    
    # --- Sugars & Syrups ---
    'agave', 'caramel', 'dextrose', 'fructose', 'glucose', 'honey', 
    'maltodextrin', 'maple', 'molasses', 'sugar', 'syrup',

    # --- Legumes ---
    'bean', 'chickpea', 'lentil', 'legume',

    # --- Processed Foods & Desserts ---
    'biscuit', 'cake', 'candy', 'chip', 'chocolate', 'cookie', 'dough',
    'doughnut', 'dumpling', 'ice cream', 'jam', 'jelly', 'ketchup', 'muffin', 'pie', 
    'pizza', 'popcorn', 'relish', 'sorbet', 'waffle',
}

# Database & Fuzzy Matching
DB_PATH = './data/nutrition_database.csv'
NUTRITION_DF: Optional[pd.DataFrame] = None
FOOD_NAME_CHOICES: Optional[List[str]] = None
SEARCH_CACHE: Dict[str, Optional[Dict]] = {}
KETO_CARB_THRESHOLD_PER_100G = 10.0
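One detail of word-level keyword sets deserves a quick illustration: a multi-word entry such as `'ice cream'` can never equal a single token, so a set-intersection check against `clean_name.split()` skips it. The sketch below contrasts token matching with a phrase-aware check. The sets and both helper functions are hypothetical and only demonstrate the behavior; they are not part of the notebook.

```python
# Hypothetical, abbreviated keyword set for illustration only.
DEMO_NON_KETO = {"sugar", "ice cream"}


def token_match(name: str, keywords: set) -> bool:
    """Match keywords against individual whitespace-separated tokens."""
    return not keywords.isdisjoint(name.split())


def phrase_match(name: str, keywords: set) -> bool:
    """Match single- and multi-word keywords on word boundaries."""
    padded = f" {name} "
    return any(f" {keyword} " in padded for keyword in keywords)


print(token_match("vanilla ice cream", DEMO_NON_KETO))   # multi-word keyword is missed
print(phrase_match("vanilla ice cream", DEMO_NON_KETO))  # multi-word keyword is found
print(phrase_match("creamed ice", DEMO_NON_KETO))        # no partial-word match
```

Padding both the name and the keyword with spaces is a simple way to keep whole-word semantics while still allowing phrases; a compiled regular expression with `\b` anchors would be an alternative.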




In [25]:
def _load_database() -> None:
    global NUTRITION_DF, FOOD_NAME_CHOICES
    if NUTRITION_DF is not None: return
    try:
        NUTRITION_DF = pd.read_csv(DB_PATH)
        FOOD_NAME_CHOICES = NUTRITION_DF['food_name'].tolist()
    except FileNotFoundError:
        print(f"CRITICAL ERROR: Nutritional database not found at '{DB_PATH}'. Aborting.", file=sys.stderr)
        sys.exit(1)

def find_ingredient_nutrition(ingredient_name: str) -> Optional[Dict[str, Any]]:
    """
    Searches for an ingredient in the nutritional database.
    Returns high-carb dummy data for failed lookups to ensure conservative classification.
    """
    _load_database()
    if NUTRITION_DF is None or NUTRITION_DF.empty:
        return {'food_name': f"db_load_fail: {ingredient_name}", 'carbs': 1001.0, 'protein': 0, 'sugar': 0}
        
    if ingredient_name in SEARCH_CACHE:
        return SEARCH_CACHE[ingredient_name]
        
    query_words = set(ingredient_name.split())
    if not query_words:
        result = {'food_name': f"empty_query: {ingredient_name}", 'carbs': 1001.0, 'protein': 0, 'sugar': 0}
        SEARCH_CACHE[ingredient_name] = result
        return result

    # Pre-filter choices
    filtered_choices = [name for name in FOOD_NAME_CHOICES if not query_words.isdisjoint(name.lower().split())]
    if not filtered_choices:
        if LOG_FAILED_LOOKUPS:
            with open("failed_lookups.log", "a") as f: 
                f.write(f"PREFILTER_FAIL: {ingredient_name}\n")
        result = {'food_name': f"prefilter_fail: {ingredient_name}", 'carbs': 1001.0, 'protein': 0, 'sugar': 0}
        SEARCH_CACHE[ingredient_name] = result
        return result

    # Fuzzy matching
    best_match = process.extractOne(ingredient_name, filtered_choices, scorer=fuzz.WRatio)
    
    if best_match and best_match[1] >= CONFIDENCE_THRESHOLD:
        matched_food_name = best_match[0]
        result = NUTRITION_DF[NUTRITION_DF['food_name'] == matched_food_name].iloc[0].to_dict()
    else:
        # FIXED: Failed matches get HIGH carbs (not 0.0)
        if LOG_FAILED_LOOKUPS:
            with open("failed_lookups.log", "a") as f: 
                f.write(f"SCORE_FAIL: {ingredient_name}\n")
        result = {'food_name': f"score_fail: {ingredient_name}", 'carbs': 1001.0, 'protein': 0, 'sugar': 0}

    SEARCH_CACHE[ingredient_name] = result
    return result
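The lookup above follows a two-stage pattern: a cheap word-overlap pre-filter, then a fuzzy score with a confidence threshold. The sketch below reproduces that pattern using only the standard library so it runs without `thefuzz`. `difflib.SequenceMatcher` is a stand-in here, and its scores are not comparable to `fuzz.WRatio`. The helper names `tokens` and `best_fuzzy_match` are assumptions made for this example. The tokenizer strips punctuation, because USDA-style names such as "butter, salted" keep the comma attached to the word when split on whitespace alone.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens with punctuation removed."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_fuzzy_match(
    query: str, choices: List[str], threshold: float = 80.0
) -> Optional[Tuple[str, float]]:
    """Pre-filter by shared words, then return the best choice at or above the threshold."""
    query_words = tokens(query)
    candidates = [c for c in choices if not query_words.isdisjoint(tokens(c))]
    if not candidates:
        return None
    # Score each surviving candidate on a 0-100 scale.
    scored = [(c, 100.0 * SequenceMatcher(None, query, c).ratio()) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


choices = ["olive oil", "oil, olive, salad or cooking", "butter, salted"]
print(best_fuzzy_match("olive oil", choices))
print(best_fuzzy_match("xylitol", choices))          # no shared word: pre-filter fails
print("butter" in "butter, salted".split())          # whitespace split keeps the comma
print("butter" in tokens("butter, salted"))          # regex tokens drop it
```

A short query against a long database name scores low with a plain ratio, which is one reason a weighted scorer with partial matching is attractive for this task.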


In [26]:
def is_ingredient_keto(ingredient: str, debug: bool = False) -> bool:
    """
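    Classify a single ingredient as keto-friendly by checking, in order: the
    non-keto keyword blocklist, the keto override list, and a carbs-per-100g
    database lookup. Set debug=True to print each decision.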
    """
    clean_name = parse_ingredient(ingredient)
    
    if debug:
        print(f"  Checking: '{ingredient[:50]}...' -> cleaned: '{clean_name}'")
    
    # Empty ingredients are safe
    if not clean_name:
        if debug: print(f"    ✓ Empty ingredient - KETO")
        return True

    # Fast fail for known non-keto ingredients (CHECK THIS FIRST!)
    ingredient_words = set(clean_name.split())
    if not NON_KETO_KEYWORDS.isdisjoint(ingredient_words):
        matching_words = NON_KETO_KEYWORDS.intersection(ingredient_words)
        if debug: print(f"    ✗ Contains non-keto keyword(s): {matching_words} - NOT KETO")
        return False

    # Fast pass for known keto ingredients
    if not MANUAL_KETO_OVERRIDES.isdisjoint(ingredient_words):
        matching_words = MANUAL_KETO_OVERRIDES.intersection(ingredient_words)
        if debug: print(f"    ✓ Contains keto override(s): {matching_words} - KETO")
        return True
        
    # Database lookup for ambiguous ingredients
    nutrition_data = find_ingredient_nutrition(clean_name)
    if nutrition_data:
        carbs = nutrition_data.get('carbs', KETO_CARB_THRESHOLD_PER_100G + 1)
        is_keto_by_carbs = carbs <= KETO_CARB_THRESHOLD_PER_100G
        if debug:
            print(f"    Database: {nutrition_data['food_name']}, carbs: {carbs}g -> {'KETO' if is_keto_by_carbs else 'NOT KETO'}")
        return is_keto_by_carbs

    # Default to NOT KETO for unknown ingredients
    if debug: print(f"    ? Unknown ingredient, defaulting to NOT KETO")
    return False

def is_keto(ingredients: List[str], debug: bool = False) -> bool:
    """
    Determines if a recipe is keto with optional debug output.
    """
    if not isinstance(ingredients, list):
        return False
        
    if debug:
        print(f"\n--- Analyzing recipe with {len(ingredients)} ingredients ---")
        
    for ingredient_str in ingredients:
        if not is_ingredient_keto(ingredient_str, debug=debug):
            if debug:
                print(f"  --> Recipe Verdict: NOT KETO (Failed on: '{ingredient_str}')")
            return False
            
    if debug:
        print(f"  --> Recipe Verdict: KETO (All ingredients passed)")
    return True
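The database branch of the check reduces to one comparison against `KETO_CARB_THRESHOLD_PER_100G`. The sketch below isolates that comparison to show how the boundary and the failed-lookup sentinel behave. The helper name `is_keto_by_carbs` and the sample dictionaries are made up for illustration. Since `find_ingredient_nutrition` returns a sentinel dictionary with `carbs` of 1001.0 for failed lookups rather than `None`, unknown ingredients seem to be rejected through this comparison, not through the final "unknown ingredient" fallback.

```python
KETO_CARB_THRESHOLD_PER_100G = 10.0
FAILED_LOOKUP_CARBS = 1001.0  # sentinel value used for failed lookups


def is_keto_by_carbs(nutrition: dict) -> bool:
    """Return True when carbs per 100 g are at or below the threshold."""
    # A missing 'carbs' key is treated as just above the threshold.
    carbs = nutrition.get("carbs", KETO_CARB_THRESHOLD_PER_100G + 1)
    return carbs <= KETO_CARB_THRESHOLD_PER_100G


samples = {
    "low carb": {"carbs": 0.06},
    "exactly at threshold": {"carbs": 10.0},
    "high carb": {"carbs": 13.81},
    "failed lookup": {"carbs": FAILED_LOOKUP_CARBS},
    "missing key": {},
}
for label, nutrition in samples.items():
    print(f"{label}: {is_keto_by_carbs(nutrition)}")
```

The threshold is inclusive, so an ingredient with exactly 10.0 g of carbs per 100 g still counts as keto under this rule.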


In [27]:
  
df["keto"] = df["ingredients"].apply(is_keto)
df

Unnamed: 0,description,ingredients,instructions,photo_url,title,vegan,keto
0,"Bell peppers stuffed with hashbrowns, ground b...","[4 frozen hash brown patties, 4 bell peppers, ...",[Cook the hashbrown patties according to packa...,http://images.media-allrecipes.com/userphotos/...,Hash Brown Hot Dish Stuffed Bell Peppers,False,False
1,"I got this recipe from my sister, whom I nickn...",[1 (16 ounce) package fully cooked kielbasa sa...,[Cook and stir the cut-up kielbasa in a large ...,http://images.media-allrecipes.com/userphotos/...,Cheese's Baked Macaroni and Cheese,False,False
2,Chicken breasts are roasted with herbs and the...,"[1 cup red wine, 1/4 cup olive oil, 1 teaspoon...","[In a large resealable bag, combine the red wi...",http://images.media-allrecipes.com/userphotos/...,Sage Apple Chicken with Brie,False,False
3,Flatbread and chicken tenders pair with veggie...,"[2 Damascus Bakeries panini flatbread, 2 table...","[Heat George Foreman or panini grill., In a sm...",http://images.media-allrecipes.com/userphotos/...,Chicken Tender Panini Sandwiches,False,False
4,"Layers of flavors, including chili and cheese,...","[1/2 cup salsa, 1/2 teaspoon chili powder, 1 (...","[Mix salsa, chili powder and beans in 1-quart ...",http://images.media-allrecipes.com/userphotos/...,Refried Bean Roll-ups,False,False
5,This is a recipe I concocted when I feel like ...,"[2 pounds turkey tenderloins, cut into 1/2 inc...","[In a medium bowl, toss the turkey with the So...",http://images.media-allrecipes.com/userphotos/...,Spicy Turkey Wraps with Strawberry Salsa,False,False
6,I work at a coffee shop and my favorite coffee...,"[2 cups graham cracker crumbs, 1/2 cup butter,...",[Preheat oven to 350 degrees F (175 degrees C)...,http://images.media-allrecipes.com/userphotos/...,Caramel Macchiato Cheesecake,False,False
7,"This creamy pilaf incorporates the fluffy, nut...","[1/4 cup quinoa, 3 tablespoons olive oil, 2 ta...",[Bring a pot of lightly salted water to a boil...,http://images.media-allrecipes.com/userphotos/...,Cheesy Quinoa Pilaf with Spinach,True,False
8,"Deliciously rich and oh-so-garlicky. Crabmeat,...","[1 (8 ounce) package cream cheese, softened, 1...",[Heat oven to 375 degrees F. Mix all ingredien...,http://images.media-allrecipes.com/global/reci...,Roasted Garlic Crab Dip,False,False
9,It is not a holiday meal without a generous se...,"[3 pounds Yukon gold potatoes, cut into chunks...",[Heat 1-inch water to boiling in large saucepa...,http://images.media-allrecipes.com/global/reci...,Garlic and Parmesan Smashed Potatoes,False,False


For your convenience, you can sanity-check your solution on a subset of labeled recipes by running `diet_classifiers.py`.

In [28]:
! python diet_classifiers.py --ground_truth /usr/src/data/ground_truth_sample.csv

--- Starting Nutrition Database Preparation (from TXT files) ---
Loading data from './sr_legacy/'...
  food_nutrient_df = pd.read_csv(
INFO: Initializing Hugging Face classifier for the first time. This may take a moment...
Device set to use cpu
INFO: Classifier initialized successfully.


              precision    recall  f1-score   support

       False       0.93      0.95      0.94        60
        True       0.92      0.90      0.91        40

    accuracy                           0.93       100
   macro avg       0.93      0.93      0.93       100
weighted avg       0.93      0.93      0.93       100


--- Confusion Matrix: Keto Classifier ---
                  Predicted: Non-Keto  Predicted: Keto
Actual: Non-Keto                   57                3
Actual: Keto                        4               36
-------------------------------------------------------
  True Negatives (TN):   57 (Correctly predicted Non-Keto)
  False Positives(FP):    3 (Incorrectly predicted Keto)
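As a cross-check, the figures in the classification report for the `True` (keto) class can be recomputed from the four cells of the confusion matrix above. This is plain arithmetic on the printed counts; the variable names are chosen here for readability.

```python
# Counts read from the confusion matrix above.
tn, fp = 57, 3   # actual Non-Keto row
fn, tp = 4, 36   # actual Keto row

precision = tp / (tp + fp)                      # 36 / 39
recall = tp / (tp + fn)                         # 36 / 40
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # 93 / 100

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.92 recall=0.90 f1=0.91 accuracy=0.93
```

These values agree with the `True` row and the accuracy line of the report.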
  

## 2. Repository Setup
Create a **private** GitHub repository for your solution, and invite the GitHub user `argmax2025` as a collaborator. **Do not** share your implementation using a **forked** repository.

## 3. Application Form
Once you’ve completed the implementation and shared your private GitHub repository with argmax2025, please fill out the appropriate application form:
1. [US Application Form](https://forms.clickup.com/25655193/f/rexwt-1832/L0YE9OKG2FQIC3AYRR)
2. [IL Application Form](https://forms.clickup.com/25655193/f/rexwt-1812/IP26WXR9X4P6I4LGQ6)


Your application will not be considered complete until this form is submitted.

## Evaluation process


Your submission will be assessed based on the following criteria:


1. **Readability & Logic** – Clearly explain your approach, including your reasoning and any assumptions. If you relied on external resources (e.g., ingredient databases, nutrition datasets), be sure to cite them.
2. **Executability** – Your code should run as is when cloned from your GitHub repository. Ensure that all paths are relative, syntax is correct, and no manual setup is required.
3. **Accuracy** – Your classifiers will be evaluated against a holdout set of 20,000 recipes with verified labels. Performance will be compared to the ground truth data.


## Next steps
If your submission passes the initial review, you’ll be invited to a 3-hour live coding interview, where you’ll be asked to extend and adapt your solution in real time.

Please make sure you join from a quiet environment and have access to a Python-ready workstation capable of running your submitted project.