# Final Group Project NLP - Part 6

* Ahmed Mohamed Elghamry Shehata
* Ahmed Mahmoud Abdelmoneim Abdelhamid
* Noureldin Mohamed Abdelsalm Mohamed Hamedo
* Sergio Rodrigo Fernandez Testa

# **Downloading the Data**

In [22]:
!gdown 1lnoaa6tE2gGDQEEz0DW2hvOnjIMK9oTo

Downloading...
From (original): https://drive.google.com/uc?id=1lnoaa6tE2gGDQEEz0DW2hvOnjIMK9oTo
From (redirected): https://drive.google.com/uc?id=1lnoaa6tE2gGDQEEz0DW2hvOnjIMK9oTo&confirm=t&uuid=a5742459-8fd1-4b77-8f93-31e0a5cf457a
To: /content/receipeData.zip
100% 621M/621M [00:15<00:00, 39.6MB/s]


In [23]:
!unzip /content/receipeData.zip

Archive:  /content/receipeData.zip
   creating: dataset/
  inflating: dataset/full_dataset.csv  


# **Setting Up the Environment**

## **Install dependencies**

In [None]:
!pip install langchain langchain_openai langchain_community python-dotenv \
 pymongo sqlalchemy rank-bm25 --quiet
!pip install nltk
!pip uninstall gensim numpy -y
!pip install --upgrade numpy==1.26.0
!pip install --upgrade gensim
!pip install scipy

In [1]:
!pip install -q python-terrier==0.11.0

## **Imports**

In [18]:
import pandas as pd
import numpy as np
import os
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer


from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
from langchain_openai import ChatOpenAI

from pymongo import MongoClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from gensim.models import Word2Vec

from scipy.spatial.distance import cosine

import pyterrier as pt
# from google.colab import drive (Just for Google Colab)

## **Download The Word2Vec Model**

In [None]:
!gdown 1EWw7GLrt_B0r8zPcubthMxNhJ37LRMjr

## **Downloading the Categorizer**

In [None]:
!gdown --folder https://drive.google.com/drive/folders/1f2vcG9ExocKL-MN-PSv-g3xvwE1hyig2

## **Loading the Model**


In [2]:
model = AutoModelForSequenceClassification.from_pretrained("recipe_classifier")
tokenizer = AutoTokenizer.from_pretrained("recipe_classifier")
trainer = Trainer(model=model)

## **Configuration**

In [19]:
# Set to False when running on Google Colab
local = True

if local:
    from dotenv import load_dotenv
    env_file = './.env'
    folder_path = './data/'

    load_dotenv(env_file)

    mongo_uri = os.getenv('MONGO_URI')
    db_url = os.getenv('DATABASE_URL')
    openai_key = os.getenv('OPENAI_API_KEY')
else:
    # Colab config: read the same settings from Colab secrets
    from google.colab import userdata
    mongo_uri = userdata.get('MONGO_URI')
    db_url = userdata.get('DATABASE_URL')
    openai_key = userdata.get('OPENAI_API_KEY')

# MongoDB connection
client = MongoClient(mongo_uri)
mongo_db = client["test"]
mongo_recipes = mongo_db["recipes"]

# SQLAlchemy setup
# check_same_thread is a SQLite-only connect argument, so it is not passed for PostgreSQL
engine = create_engine(db_url)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

# Word2Vec model
model_path = './data/embeddings.model'
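
The connection settings above are read from the environment, so a missing variable only surfaces later as a confusing connection error. A minimal sketch of failing fast instead, assuming a hypothetical helper `require_env` and a placeholder variable name `EXAMPLE_MONGO_URI` (neither is part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical usage with a placeholder value, for illustration only
os.environ["EXAMPLE_MONGO_URI"] = "mongodb://localhost:27017"
print(require_env("EXAMPLE_MONGO_URI"))
```

In the notebook this could wrap the `os.getenv` calls, so that an incomplete `.env` file is reported before any client is created.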


## **Loading the Indexed Documents**

In [20]:
if not pt.started():
  pt.init()

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation


In [21]:
index = pt.IndexFactory.of('./recipe_index/data.properties')
print(index.getCollectionStatistics().toString())

Number of documents: 2231141
Number of terms: 165126
Number of postings: 129767038
Number of fields: 3
Number of tokens: 221302602
Field names: [title, ingredients, directions]
Positions:   false



# **Agent**

## **Functions**

In [22]:
def init_pyterrier():
    if not pt.started():
        pt.init()

def create_index():
    index_path = os.path.expanduser("./recipe_index")
    if not os.path.exists(index_path):
        os.makedirs(index_path)
    index_ref = pt.IndexRef.of(index_path)
    index = pt.IndexFactory.of(index_ref)
    return index

def query_index_topk(query,k=5):
    init_pyterrier()
    index = create_index()
    bm25 = pt.BatchRetrieve(index, wmodel="BM25")
    results = bm25.search(query)
    recepies = results["docid"][0:k]
    print(recepies[0:k])
    return recepies.tolist()

# Example usage
# recepies = query_index_topk("eggs, flour, sugar, butter")
# print(query_index_topk("dark sweet pitted cherries,ginger ale,marshmallows,almond extract,flavor gelatin", 50))

print(query_index_topk("eggs chicken milk butter pepper onions", 5))

# print(recepies)





0     250127
1     146728
2     548834
3    1337326
4     250362
Name: docid, dtype: int64
[250127, 146728, 548834, 1337326, 250362]
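
To make the ranking step less of a black box, here is a small pure-Python sketch of textbook Okapi BM25 scoring followed by a top-k cut. It is an illustration only: the function names, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions, and Terrier's own BM25 implementation may differ in its exact formula and tokenisation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def topk_docids(query, docs, k=5):
    """Return the positions of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy documents invented for illustration
toy_docs = [
    "chicken rice casserole with butter",
    "chocolate cake with sugar and flour",
    "rice pudding with milk and sugar",
]
print(topk_docids("chicken rice", toy_docs, k=2))
```

The document that contains both query terms ranks first, the one that contains only `rice` ranks second, and the one with no match scores zero.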


In [23]:
def check_ingredient_compatibility(recipe_ingredients,
                                   candidate_ingredient,
                                   similarity_threshold=0.5):
    """
    Check if a candidate ingredient is compatible with a given recipe based on
        Word2Vec similarity.

    Parameters:
        recipe_ingredients (list of str): List of ingredient names.
        candidate_ingredient (str): The new ingredient to test.
        similarity_threshold (float): Threshold above which an ingredient is
        considered compatible.

    Returns:
        dict: {
            'similarity': float,
            'compatible': bool,
            'missing_words': list of str
        }
    """
    # Use the module-level Word2Vec model; it must already be loaded,
    # e.g. embedding_model = Word2Vec.load(model_path)
    global embedding_model

    # Track words not in the model
    missing_words = [word for word in recipe_ingredients +
                     [candidate_ingredient] if word not in embedding_model.wv]

    # Filter valid ingredients in the model
    valid_recipe_words = [
        word for word in recipe_ingredients if word in embedding_model.wv]
    if not valid_recipe_words or candidate_ingredient not in embedding_model.wv:
        return {
            'similarity': None,
            'compatible': False,
            'missing_words': missing_words
        }

    # Compute recipe embedding as average of ingredient vectors
    recipe_vector = np.mean([embedding_model.wv[word]
                            for word in valid_recipe_words], axis=0)

    # Get candidate vector
    candidate_vector = embedding_model.wv[candidate_ingredient]

    # Compute cosine similarity
    similarity = 1 - cosine(recipe_vector, candidate_vector)

    return {
        'similarity': similarity,
        'compatible': similarity >= similarity_threshold,
        'missing_words': missing_words
    }
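
The compatibility check boils down to two steps: average the ingredient vectors into one recipe vector, then compare it to the candidate with cosine similarity. A minimal pure-Python sketch of that arithmetic, using invented 2-D toy vectors rather than the trained Word2Vec embeddings (all names and values here are assumptions for illustration):

```python
import math


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-D "embeddings", invented for illustration only
toy_vectors = {
    "sugar": [1.0, 0.0],
    "butter": [0.0, 1.0],
    "milk": [1.0, 1.0],
    "fish": [-1.0, -1.0],
}

recipe_vector = mean_vector([toy_vectors["sugar"], toy_vectors["butter"]])
for candidate in ("milk", "fish"):
    similarity = cosine_similarity(recipe_vector, toy_vectors[candidate])
    print(candidate, round(similarity, 3), similarity >= 0.5)
```

With these toy values, `milk` points in the same direction as the recipe vector and passes the 0.5 threshold, while `fish` points the opposite way and fails it.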


In [24]:
def get_receipe_id(query,k=5):
    topk = query_index_topk(query,k)
    print("Topk:", topk)
    return topk


def get_titles_from_docids(docids):
    if not docids:
        return []

    # Fetch only the matching records, projecting docno and title
    cursor = mongo_recipes.find(
        {"docno": {"$in": list(docids)}},
        {"docno": 1, "title": 1, "_id": 0}
    )

    # Map docno -> title, then return titles in the same order as the input docids
    title_map = {doc["docno"]: doc.get("title", "Unknown") for doc in cursor}
    return [title_map.get(docid, "Unknown") for docid in docids]

def get_recommendation(query, k=5):
    recipe_ids = get_receipe_id(query, k) # Get the recipe IDs from the index
    titles = get_titles_from_docids(recipe_ids) # Get the titles from the recipe IDs

    return titles

def get_directions(titles):
    titles_set = set(titles)
    results = []

    # Project only title and directions fields
    cursor = mongo_recipes.find({}, {"title": 1, "directions": 1})

    for doc in cursor:
        title = doc.get("title")
        if title in titles_set:
            results.append((title, doc.get("directions", [])))  # default to empty list if directions not found

    # Build map and return in the same order as input titles
    directions_map = dict(results)
    return [directions_map.get(title, ["Directions not found"]) for title in titles]

def get_ingred(titles):
    titles_set = set(titles)
    results = []

    # Project only title and ingredients fields
    cursor = mongo_recipes.find({}, {"title": 1, "ingredients": 1})

    for doc in cursor:
        title = doc.get("title")
        if title in titles_set:
            results.append((title, doc.get("ingredients", [])))  # default to empty list if ingredients not found

    # Build map and return in the same order as input titles
    ingredients_map = dict(results)
    return [ingredients_map.get(title, ["ingredients not found"]) for title in titles]


## **Tools**

In [34]:
def search_recipes_tool(query: str, k: int = 5) -> list:
    """Search for recipes based on a query and return top k results."""

    def get_recipe_ids(query: str, k: int = 5) -> list:
        import re
        # Remove special characters that may interfere with the parser
        sanitized_query = re.sub(r'[^\w\s]', '', query)
        topk = query_index_topk(sanitized_query, k)
        print("Topk:", topk)
        return topk

    def get_recipes_from_mongo(docids: list) -> list:
        """Return full recipe records for the given list of docids, preserving order."""
        if not docids:
            return []

        # Fetch only matching records
        cursor = mongo_recipes.find(
            {"docno": {"$in": docids}},
            {
                "docno": 1,
                "title": 1,
                "ingredients": 1,
                "directions": 1,
                "_id": 0
            }
        )

        # Build a mapping from docno to the recipe dict
        doc_map = {doc["docno"]: doc for doc in cursor}

        # Return results in the same order as the input docids
        return [
            doc_map.get(docid, {
                "docno": docid,
                "title": "Unknown",
                "ingredients": ["Ingredients not found"],
                "directions": ["Directions not found"]
            })
            for docid in docids
        ]

    
    recipe_ids = get_recipe_ids(query, k)
    recipes_from_mongo = get_recipes_from_mongo(recipe_ids)

    # You can expand the return to include more details if needed
    return recipes_from_mongo
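
Two small ideas carry the search tool: the query is stripped of punctuation before it reaches the retrieval parser, and the records come back from the database in arbitrary order, so they are re-sorted to match the ranked docids. The sketch below isolates both steps with an in-memory list standing in for the MongoDB cursor; the helper names and sample records are hypothetical.

```python
import re


def sanitize_query(query: str) -> str:
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", query)


def order_by_docids(docids, fetched_docs, placeholder_title="Unknown"):
    """Return one record per docid, in ranking order, with a placeholder for misses."""
    doc_map = {doc["docno"]: doc for doc in fetched_docs}
    return [
        doc_map.get(docid, {"docno": docid, "title": placeholder_title})
        for docid in docids
    ]


# Stand-in for the records a database query might return, in arbitrary order
fetched = [{"docno": 2, "title": "B"}, {"docno": 1, "title": "A"}]

ordered = order_by_docids([1, 3, 2], fetched)
print([doc["title"] for doc in ordered])
print(repr(sanitize_query("chicken, rice & peas!")))
```

Removed characters are not replaced by spaces, so the sanitized string can contain doubled whitespace; whitespace-based tokenisation is unaffected by that.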


import json

def get_similar_ingredients_tool(input_data):
    """
    Check if a candidate ingredient is compatible with a given recipe based on Word2Vec similarity.

    Parameters:
        input_data (str or dict): JSON string (as passed by the agent) or dict of the form {
            'recipe_ingredients': list of str,
            'candidate_ingredient': str,
            'similarity_threshold': float (optional, default=0.5)
        }

    Returns:
        dict: {
            'similarity': float or None,
            'compatible': bool,
            'missing_words': list of str
        }
    """
    # The agent passes tool input as a string, so decode it when needed
    if isinstance(input_data, str):
        input_data = json.loads(input_data)
    recipe_ingredients = input_data.get("recipe_ingredients", [])
    candidate_ingredient = input_data.get("candidate_ingredient", "")
    similarity_threshold = float(input_data.get("similarity_threshold", 0.5))

    global model_path
    model = Word2Vec.load(model_path)
    print("Model loaded", model)

    missing_words = [word for word in recipe_ingredients +
                     [candidate_ingredient] if word not in model.wv]

    valid_recipe_words = [word for word in recipe_ingredients if word in model.wv]

    if not valid_recipe_words or candidate_ingredient not in model.wv:
        return {
            'similarity': None,
            'compatible': False,
            'missing_words': missing_words
        }

    recipe_vector = np.mean([model.wv[word] for word in valid_recipe_words], axis=0)
    candidate_vector = model.wv[candidate_ingredient]
    similarity = 1 - cosine(recipe_vector, candidate_vector)

    return {
        'similarity': similarity,
        'compatible': similarity >= similarity_threshold,
        'missing_words': missing_words
    }
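
Because the agent hands the tool a single string, the input has to be decoded and checked before any vector lookup happens. The following is a sketch of that validation step in isolation, assuming a hypothetical helper `parse_tool_input` and the same three keys described in the docstring above:

```python
import json


def parse_tool_input(raw: str) -> dict:
    """Decode a JSON tool input and return a validated, normalised dict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Tool input is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Tool input must be a JSON object")

    ingredients = data.get("recipe_ingredients", [])
    candidate = data.get("candidate_ingredient", "")
    if not isinstance(ingredients, list) or not isinstance(candidate, str) or not candidate:
        raise ValueError(
            "Expected 'recipe_ingredients' (list) and a non-empty 'candidate_ingredient' (str)"
        )

    return {
        "recipe_ingredients": [str(item).strip() for item in ingredients],
        "candidate_ingredient": candidate.strip(),
        "similarity_threshold": float(data.get("similarity_threshold", 0.5)),
    }


example = '{"recipe_ingredients": ["sugar", " butter "], "candidate_ingredient": "milk"}'
print(parse_tool_input(example))
```

Raising a `ValueError` with a readable message gives the agent something it can react to, instead of an unhandled exception from deeper in the function.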



def categorize_recipe_tool(ingredient_text: str) -> str:
    """Categorize a recipe based on its ingredients."""

    def predict_labels(text_list, tokenizer, trainer):
        # Create HF Dataset
        test_data = Dataset.from_dict({"text": text_list})

        # Tokenize
        def preprocess(examples):
            return tokenizer(examples["text"], truncation=True, padding=True)

        tokenized_test_data = test_data.map(preprocess, batched=True)
        preds_output = trainer.predict(tokenized_test_data)

        # Convert logits to predicted labels
        # argmax returns one index per input row; take the first since a single text is passed
        preds = int(np.argmax(preds_output.predictions, axis=1)[0])
        labels = ["Main Dish 🥘", "Dessert 🍰"]

        return labels[preds]

    return predict_labels([ingredient_text], tokenizer, trainer)
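
The last step of the categorizer turns raw logits into a label by taking the index of the largest value in each row. A pure-Python sketch of that mapping, with invented logits and plain-text label names (both are assumptions for illustration, not model output):

```python
def argmax_labels(logits_batch, labels):
    """Map each row of logits to the label at the position of its largest value."""
    predictions = []
    for logits in logits_batch:
        best_index = max(range(len(logits)), key=lambda i: logits[i])
        predictions.append(labels[best_index])
    return predictions


labels = ["Main Dish", "Dessert"]
# Invented logits: one row per input text, one column per class
toy_logits = [[2.1, -0.3], [-1.0, 0.7]]
print(argmax_labels(toy_logits, labels))
```

Handling the batch row by row avoids converting a whole array to a single integer, which only works when exactly one text is classified.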


In [10]:
get_similar_ingredients_tool([
    "sugar",
    "butter",
    "flour",
    "eggs"
  ], "milk", 0.5)

Model loaded Word2Vec<vocab=15280, vector_size=100, alpha=0.025>


True

In [35]:
tools = [
    Tool(
        name="search_recipes_tool",
        func=search_recipes_tool,
        description="Search for recipes based on a query."
    ),
    Tool(
        name="get_similar_ingredients_tool",
        func=get_similar_ingredients_tool,
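        description=(
            "Check whether a candidate ingredient is compatible with a recipe. "
            "Input must be a JSON string with the keys 'recipe_ingredients' (list of str) "
            "and 'candidate_ingredient' (str)."
        )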
    ),
    Tool(
        name="categorize_recipe_tool",
        func=categorize_recipe_tool,
        description="Categorize a recipe as Main Dish or Dessert based on its ingredients."
    )
]

## **Instance of the Agent**

In [36]:
llm = ChatOpenAI(
    temperature=0.3,
    model="gpt-4",
    openai_api_key=openai_key
)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)


# **Simulation**

In [37]:
prompt = "Give me 3 recipes with chicken and rice, and tell me if chicken is similar to turkey and categorize each recipe" # @param {"type":"string"}
agent.run(prompt)



> Entering new AgentExecutor chain...
First, I need to find 3 recipes with chicken and rice.
Action: search_recipes_tool




0    1866657
1    1294393
2    1331583
3    1233941
4     319124
Name: docid, dtype: int64
Topk: [1866657, 1294393, 1331583, 1233941, 319124]

Observation: [{'title': 'Toasted Walnut Sauce', 'ingredients': ['2 slices day-old white artisan bread, such as pugliese, levain, sweet baguette', '2 to 3 cups whole milk', '2 cups (8 ounces) walnut halves or pieces, toasted', '4 cloves roasted garlic (page 192)', '1 tablespoon fresh thyme leaves', 'Kosher salt and freshly ground white pepper', '1/3 cup olive oil', 'Red pepper flakes (optional)'], 'directions': ['Soak the bread in enough milk to cover.', 'Add the mixture to a food processor along with the walnuts, garlic, and thyme.', 'Pulse into a paste.', 'Add salt and white pepper to taste.', 'With the machine running, add the olive oil.', 'Add milk as necessary to achieve a sauce-like consistency.', 'Add red pepper flakes to taste.', 'Use now, or cover and refrigerate for up to 1 week.'], 'docno': 1866657}, {'title': 'Barbie Quer

Map:   0%|          | 0/1 [00:00<?, ? examples/s]


Observation: Main Dish 🥘
Thought: I now know the final answer
Final Answer: The recipe 'Chicken Rice Casserole' is a main dish and contains chicken and rice. However, chicken is not similar to turkey.

> Finished chain.


"The recipe 'Chicken Rice Casserole' is a main dish and contains chicken and rice. However, chicken is not similar to turkey."