# Multilingual Food Item Search with LLMs

This notebook explores a possible enhancement to the Superlinked framework: multilingual food search.

Superlinked already supports natural language querying. This example demonstrates my own implementation, which uses Large Language Models (LLMs) with the `Instructor` package to:

- Enable cross-language food search by translating non-English queries
- Enrich each query by inferring:
  - Standardized food categories
  - Estimated nutritional content
- Generate structured parameters for more precise semantic matching

This approach could be integrated into Superlinked to provide more robust multilingual search functionality.


In [11]:
# You may need to install the following packages:
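# pip install pandas numpy pydantic openai instructor superlinked umap-learn
# pip install python-dotenv pyarrow ipykernel  # .env loading, parquet support for pandas, Jupyter kernel registration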
# pip install -U transformers
# pip install -U torch
# pip install -U torchvision
# pip install -U torchaudio

# Then make sure your virtual environment is available as a Jupyter kernel:
# python -m ipykernel install --user --name=venv --display-name "Python (venv)"
#
# Replace "venv" with your environment name if different.




In [1]:
from dotenv import load_dotenv
import os
from pydantic import BaseModel, Field
from openai import OpenAI
import instructor
import pandas as pd
import numpy as np
from enum import Enum
from superlinked import framework as sl
import sys
sys.path.append('../code/')
from utills import build_superlinked_app, FoodItem



## Loading the data and setting up Superlinked

In [13]:
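# Load the food database
food_df = pd.read_parquet('../data/sr_legacy_food_db_clean.parquet')
# Unique category names found in the database
categories = food_df.food_category.drop_duplicates().to_list()
# Keep only the columns used for search
cols = ['fdc_id', 'description', 'food_category', 'calories']
food_df = food_df[cols]

df = food_df
# Note: building the Superlinked app in the next cell may take a minute,
# because it ingests roughly 7000 items from df
df.head()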

| | fdc_id | description | food_category | calories |
|---|---|---|---|---|
| 0 | 167512 | Pillsbury Golden Layer Buttermilk Biscuits, Ar... | Baked Products | 307.0 |
| 1 | 167513 | Pillsbury, Cinnamon Rolls with Icing, refriger... | Baked Products | 330.0 |
| 2 | 167514 | Kraft Foods, Shake N Bake Original Recipe, Coa... | Baked Products | 377.0 |
| 3 | 167515 | George Weston Bakeries, Thomas English Muffins | Baked Products | 232.0 |
| 4 | 167516 | Waffles, buttermilk, frozen, ready-to-heat | Baked Products | 273.0 |


In [14]:
(
    app,
    index,
    food_item_class,
    description_space,
    food_category_text_space,
    food_category_categorical_space,
    calorie_space,
) = build_superlinked_app(df)


## Using LLM Structured Outputs

Below, I define a Pydantic `BaseModel` and use an `Enum` for the food categories, based on the categories in the loaded database. This constrains the LLM to select a category from a predefined list.

A function then calls a chat completion endpoint. It takes a food item as plain text in any language and returns:

- The translated English name  
- The corresponding food category (from the enum)  
- Estimated caloric information
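
The enum in the next cell is written out by hand. Since `categories` is already loaded from the database, one alternative is to generate the enum from that list so the two cannot drift apart. The following is a minimal sketch using Python's functional `Enum` API; `make_category_enum` and `sample_categories` are hypothetical names introduced here for illustration, and I have not verified this variant against Instructor in this notebook.

```python
import re
from enum import Enum


def make_category_enum(categories, name="FoodCategory"):
    """Build an Enum whose values are the category strings (hypothetical helper)."""
    members = {}
    for value in categories:
        # Turn "Soups, Sauces, and Gravies" into "Soups_Sauces_and_Gravies"
        key = re.sub(r"\W+", "_", value).strip("_")
        members[key] = value
    return Enum(name, members)


# Stand-in for the `categories` list loaded from the database
sample_categories = [
    "Dairy and Egg Products",
    "Soups, Sauces, and Gravies",
    "American Indian/Alaska Native Foods",
]
FoodCategory = make_category_enum(sample_categories)
print(FoodCategory("Soups, Sauces, and Gravies").name)
```

Note that two category strings which differ only in punctuation would map to the same member name, with the later one overwriting the earlier, so a real implementation may want to check for collisions.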


In [15]:
# Load environment variables from a .env file
load_dotenv()

# Read the OpenAI API key from the environment
api_key = os.getenv('OPENAI_API_KEY')
if api_key is None:
    raise ValueError("OPENAI_API_KEY is not set; add your own key to the environment or a .env file")



class USDA_diet_categories(Enum):
    Dairy_and_Egg_Products = 'Dairy and Egg Products'
    Spices_and_Herbs = 'Spices and Herbs'
    Baby_Foods = 'Baby Foods'
    Fats_and_Oils = 'Fats and Oils'
    Poultry_Products = 'Poultry Products'
    Soups_Sauces_and_Gravies = 'Soups, Sauces, and Gravies'
    Sausages_and_Luncheon_Meats = 'Sausages and Luncheon Meats'
    Breakfast_Cereals = 'Breakfast Cereals'
    Fruits_and_Fruit_Juices = 'Fruits and Fruit Juices'
    Pork_Products = 'Pork Products'
    Vegetables_and_Vegetable_Products = 'Vegetables and Vegetable Products'
    Nut_and_Seed_Products = 'Nut and Seed Products'
    Beef_Products = 'Beef Products'
    Beverages = 'Beverages'
    Finfish_and_Shellfish_Products = 'Finfish and Shellfish Products'
    Legumes_and_Legume_Products = 'Legumes and Legume Products'
    Lamb_Veal_and_Game_Products = 'Lamb, Veal, and Game Products'
    Baked_Products = 'Baked Products'
    Sweets = 'Sweets'
    Cereal_Grains_and_Pasta = 'Cereal Grains and Pasta'
    Fast_Foods = 'Fast Foods'
    Meals_Entrees_and_Side_Dishes = 'Meals, Entrees, and Side Dishes'
    Snacks = 'Snacks'
    American_Indian_Alaska_Native_Foods = 'American Indian/Alaska Native Foods'
    Restaurant_Foods = 'Restaurant Foods'
    Branded_Food_Products_Database = 'Branded Food Products Database'
    Quality_Control_Materials = 'Quality Control Materials'
    Alcoholic_Beverages = 'Alcoholic Beverages'
    Dietary_Supplements = 'Dietary Supplements'

class PydanticFood(BaseModel):
    description: str = Field(..., 
        description="Translate the food item to English.")
    food_category: USDA_diet_categories = Field(..., 
        description="The category of the food item in English.")
    calories: int = Field(..., 
        description="The calories of the food item in kcal per 100g.")
    

# Patch the OpenAI client so that chat completions accept a `response_model`
client = instructor.patch(OpenAI(api_key=api_key))
gpt_model = "gpt-4o-mini"


def get_structured_output(food_description) -> PydanticFood:
    """Ask the LLM to translate a food item and infer its category and calories."""
    food_item = client.chat.completions.create(
        model=gpt_model,
        messages=[{"role": "user", "content": f"Convert this food item to the given format:\n{food_description}"}],
        response_model=PydanticFood,
    )

    return food_item


### Translating food items and using an LLM to create additional fields
- Below is a list of three items in three languages, used to demonstrate how the translation and the inference of extra fields work.

In [16]:
food_items = ["マンゴー", "جبنة", "Pão"]  # Japanese, Arabic, Portuguese
for item in food_items:
    food_description = item
    llm_result = get_structured_output(food_description)
    # llm_result is an instance of PydanticFood
    food_description = llm_result.description
    food_category = llm_result.food_category.value
    calories = llm_result.calories

    print("Original food item: ", item)
    print(f"Translated description: {food_description}")
    print(f"Food Category: {food_category}")
    print(f"Calories: {calories}")




Original food item:  マンゴー
Translated description: Mango
Food Category: Fruits and Fruit Juices
Calories: 60
Original food item:  جبنة
Translated description: Cheese
Food Category: Dairy and Egg Products
Calories: 402
Original food item:  Pão
Translated description: Bread
Food Category: Baked Products
Calories: 265


### Enhancing Search with LLM Outputs and Superlinked

The structured results from the LLM can be used with Superlinked to perform an enriched semantic search. The inferred attributes (standardized description, food category, and caloric content) add information that Superlinked can use downstream when searching the database.
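
The mapping from LLM output to query parameters can be sketched in isolation. The parameter names `query_text`, `query_categories`, and `calories_per_100g` are the ones used in the query in this notebook; `LLMFood`, `Category`, and `to_query_params` are hypothetical stand-ins so that the example runs without an API key.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Stand-in for the full USDA category enum
    DAIRY_AND_EGG = "Dairy and Egg Products"


@dataclass
class LLMFood:
    # Stand-in with the same fields as PydanticFood
    description: str
    food_category: Category
    calories: int


def to_query_params(result):
    """Map the structured LLM result onto the query's parameter names."""
    return {
        "query_text": result.description,
        # Pass the plain string, not the Enum member
        "query_categories": result.food_category.value,
        "calories_per_100g": result.calories,
    }


params = to_query_params(LLMFood("Egg", Category.DAIRY_AND_EGG, 155))
print(params)
```

With a helper like this, the search call could be written as `app.query(query, **params)`, which keeps the parameter names in one place.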


In [24]:
food_item = 'たまご'
llm_result = get_structured_output(food_item)
# llm_result is an instance of PydanticFood
food_description = llm_result.description
food_category = llm_result.food_category.value
calories = llm_result.calories


print("Original food item: ", food_item)

print("Performing search of ", food_description, " in ", food_category, " with ", calories, " calories")

query = (
    sl.Query(index)
    .find(food_item_class)
    .similar(food_category_text_space, sl.Param("query_categories"))
    .similar(description_space, sl.Param("query_text"))
    .similar(calorie_space, sl.Param("calories_per_100g"))
    .select_all()
)

search_results = app.query(
    query,
    query_categories=food_category,
    query_text=food_description,
    calories_per_100g=calories,
)

sl.PandasConverter.to_pandas(search_results).head()

Original food item:  たまご
Performing search of  Egg  in  Dairy and Egg Products  with  155  calories


| | description | food_category | calories | id | similarity_score |
|---|---|---|---|---|---|
| 0 | Eggnog | Dairy and Egg Products | 88.0 | 171258 | 0.920735 |
| 1 | Egg, whole, cooked, poached | Dairy and Egg Products | 143.0 | 172186 | 0.884487 |
| 2 | Egg, whole, cooked, scrambled | Dairy and Egg Products | 149.0 | 172187 | 0.882131 |
| 3 | Egg, whole, cooked, omelet | Dairy and Egg Products | 154.0 | 172185 | 0.880628 |
| 4 | Egg, whole, raw, fresh | Dairy and Egg Products | 143.0 | 171287 | 0.879458 |


## Natural language query
- Below, I attempt to use Superlinked's natural language query capability to do something similar.

In [27]:
# The OpenAI client config uses your API key; this drives the param extraction
openai_config = sl.OpenAIClientConfig(api_key=api_key, model=gpt_model)

# Descriptions can be added to a `Param` to aid the parsing of information from natural language queries.
# Note: `text_similar_param` is shown for illustration only; the query below defines its params inline.
text_similar_param = sl.Param(
    "query_text",
    description=(
        "The text in the user's query that is used to search in the products' description."
        " Extract info that does not apply to other spaces or params."
    ),
)

query = (
    sl.Query(index)
    .find(food_item_class)
    .similar(
        food_category_text_space,
        sl.Param("query_categories", description="Assign a food category based off the text in the user's query."),
    )
    .similar(
        description_space,
        sl.Param("query_text", description="Translate the food item to English of the user's query."),
    )
    .similar(
        calorie_space,
        sl.Param("calories_per_100g", description="Estimate the calories of the food item in kcal per 100g."),
    )
    .select_all()
    .limit(sl.Param("limit"))
    .with_natural_query(sl.Param("natural_query"), openai_config)
)

food_item = 'たまご'
search_results = app.query(query, natural_query=food_item, limit=10)
sl.PandasConverter.to_pandas(search_results).head()

| | description | food_category | calories | id | similarity_score |
|---|---|---|---|---|---|
| 0 | Eggnog | Dairy and Egg Products | 88.0 | 171258 | 0.80267 |
| 1 | Egg, whole, cooked, poached | Dairy and Egg Products | 143.0 | 172186 | 0.766422 |
| 2 | Egg, whole, cooked, scrambled | Dairy and Egg Products | 149.0 | 172187 | 0.764066 |
| 3 | Egg, whole, cooked, omelet | Dairy and Egg Products | 154.0 | 172185 | 0.762563 |
| 4 | Egg, whole, raw, fresh | Dairy and Egg Products | 143.0 | 171287 | 0.761393 |


In [28]:
search_results.metadata.search_params

{'query_categories': 'Eggs',
 'similar_filter_TextSimilaritySpace_3107_FoodItem_food_category_weight_param__': 1.0,
 'query_text': 'Egg',
 'similar_filter_TextSimilaritySpace_db50_FoodItem_description_weight_param__': 1.0,
 'calories_per_100g': 155.0,
 'similar_filter_NumberSpace_bdef_FoodItem_calories_weight_param__': 1.0,
 'select_param__': ['description', 'food_category', 'calories'],
 'limit': 10,
 'natural_query': 'たまご',
 'radius_param__': None,
 'space_weight_TextSimilaritySpace_db50_param__': 1.0,
 'space_weight_TextSimilaritySpace_3107_param__': 1.0,
 'space_weight_CategoricalSimilaritySpace_9674_param__': 1.0,
 'space_weight_NumberSpace_bdef_param__': 1.0}
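
The extracted parameters are easier to read once the framework-generated entries are separated out. In the output above, the natural language path produced `'Eggs'` as the category, which is not one of the USDA category names that the enum-constrained approach returns. The sketch below splits the dictionary and checks the category against a known list. Treating keys that end in `param__` as internal is an assumption based only on the output shown here, not on Superlinked documentation, and `split_search_params` and `valid_categories` are hypothetical names.

```python
def split_search_params(search_params):
    """Separate user-facing params from keys that look framework-generated."""
    user_params, internal_params = {}, {}
    for key, value in search_params.items():
        # Assumption: internal keys end with "param__", as in the output above
        target = internal_params if key.endswith("param__") else user_params
        target[key] = value
    return user_params, internal_params


# A shortened copy of the search_params output above
search_params = {
    "query_categories": "Eggs",
    "query_text": "Egg",
    "calories_per_100g": 155.0,
    "limit": 10,
    "natural_query": "たまご",
    "select_param__": ["description", "food_category", "calories"],
    "radius_param__": None,
    "space_weight_NumberSpace_bdef_param__": 1.0,
}

# Stand-in for the full list of category names from the database
valid_categories = {"Dairy and Egg Products", "Baked Products"}

user_params, internal_params = split_search_params(search_params)
category_is_valid = user_params["query_categories"] in valid_categories
print(user_params)
print("Category is a known USDA category:", category_is_valid)
```

A check like this makes the difference between the two approaches visible: the `Enum` in the structured-output version guarantees a category from the list, while the natural language query returns free text.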