## <font color = 'blue'>Smart Substitute Spectrum: Elevate Your Plate, Nourish Your Well-being!</font>

####  DSE 203 Project - Team Christopher Vanhook, Vaaruni Desai, Zufeshan Imran

### Import all the required libraries

In [1]:
import os
import re
import json
import openai
import psycopg2
import collections
import numpy as np
import pandas as pd
import pandas_dedupe
from fuzzywuzzy import fuzz
from nltk.util import ngrams
from fuzzywuzzy import process
from neo4j import GraphDatabase
from jsonpath_ng.ext import parse
from nltk.stem import PorterStemmer

### All the required settings to run the notebook

In [2]:
# Porter stemmer to get the stem of keywords
ps = PorterStemmer()

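# establishing connection to nourish_db
# credentials are read from environment variables instead of being hardcoded
# (the variable names NOURISH_DB_USER / NOURISH_DB_PASSWORD are placeholders)
connection = psycopg2.connect(
        dbname='nourish',
        user=os.environ['NOURISH_DB_USER'],
        password=os.environ['NOURISH_DB_PASSWORD'],
        host='awesome-hw.sdsc.edu',
        port=5432
    )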

# requirements to connect to neo4j
uri = "bolt://localhost:7687"
username = "neo4j"
password = "password"

# function to replace runs of non-word characters with a space and lowercase the text
def clean_text(text):
    text = re.sub(r'\W+', ' ', str(text))
    return text.lower()

# function to get the required data from nourish_db for later comparison
def get_experimental_table():
    cur = connection.cursor()
    exp_query = """SELECT fdc_id, description FROM usda_2022_food_branded_experimental"""
    cur.execute(exp_query)
    records = cur.fetchall()
    columns = [desc[0] for desc in cur.description]
    cur.close()
    
    df_experimental = pd.DataFrame(records, columns = columns)
    df_experimental.drop_duplicates(subset = 'description', keep = 'first', inplace = True)
    df_experimental.reset_index(drop = True, inplace = True)
    df_experimental['description'] = df_experimental['description'].apply(clean_text)
    return df_experimental

# get the table and store it in a variable for future use
df_experimental = get_experimental_table()

### Loading json and csv data

In [3]:
# loading the json data

with open('ingredient_and_instructions.json') as file:
    tasty_json_ingredients = json.load(file)

In [4]:
# load in the csv file into a dataframe

df_tasty_dishes = pd.read_csv('dishes.csv',usecols=['slug', 'protein', 'fat', 'calories', 'sugar', 'carbohydrates', 'fiber']) 
df_tasty_dishes.head()

|   | slug | protein | fat | calories | sugar | carbohydrates | fiber |
|---|---|---|---|---|---|---|---|
| 0 | homemade-cinnamon-rolls | 7.0 | 21.0 | 479.0 | 24.0 | 63.0 | 1.0 |
| 1 | whipped-coffee | 0.0 | 0.0 | 69.0 | 18.0 | 18.0 | 0.0 |
| 2 | fluffy-perfect-pancakes | 36.0 | 50.0 | 1102.0 | 12.0 | 123.0 | 3.0 |
| 3 | tasty-101-cinnamon-rolls | 8.0 | 25.0 | 562.0 | 28.0 | 74.0 | 1.0 |
| 4 | healthy-banana-pancakes | 7.0 | 4.0 | 184.0 | 9.0 | 30.0 | 4.0 |


### Load in the unstructured data using jsonpath_ng library

In [5]:
# unstructured data

jsonpath_expr = parse("$..['instructions']")
matches = [match.value for match in jsonpath_expr.find(tasty_json_ingredients)]

text = []
for instruction_lines in matches:
    list_text = " ".join(line['display_text'].strip() for line in instruction_lines)
    text.append(list_text)

### Data transformation - Load the products, ingredients data using jsonpath_ng library

In [6]:
# extract all products

expr = parse("$")
list_products = [match.value.keys() for match in expr.find(tasty_json_ingredients)]

In [7]:
# extract all ingredients from ingredient sections

jsonpath_expr = parse("$..['ingredient_sections']")
matches = [match.value for match in jsonpath_expr.find(tasty_json_ingredients)]

### Extracting list of ingredients per product

In [8]:
# extracting list of ingredients product by product

ingredients = []
ingredient_per_prod = []
for products in matches:
    for product in products:
        ing_per_prod = product['ingredients']
        for ing in ing_per_prod:
            ingredient_per_prod.append(ing['name'])
    ingredients.append(ingredient_per_prod)
    ingredient_per_prod = []

In [85]:
# extract all metric_unit for each ingredient per product

metric = []
metric_per_prod = []
for products in matches:
    for product in products:
        met_per_prod = product['ingredients']
        for met in met_per_prod:
            if met['metric_unit'] is not None:
                metric_per_prod.append((met['metric_unit']['quantity'],met['metric_unit']['display']))
            else:
                metric_per_prod.append((None,None))
    metric.append(metric_per_prod)
    metric_per_prod = []

### Loading the above data into one dataframe

In [10]:
# converting all data extracted from json file into pandas series

col_products = pd.Series(list_products[0])
col_ingredients = pd.Series(ingredients)
col_metric_unit = pd.Series(metric)
col_instructions = pd.Series(text)

In [11]:
# adding each pandas series as dataframe columns

df = pd.DataFrame()
df['products'] = col_products
df['ingredients'] = col_ingredients
df['metric_unit'] = col_metric_unit
df['instructions'] = col_instructions
df.head()

|   | products | ingredients | metric_unit | instructions |
|---|---|---|---|---|
| 0 | homemade-cinnamon-rolls | [unsalted butter, whole milk, granulated sugar... | [(115, g), (480, mL), (100, g), (None, None), ... | Generously butter two disposable foil pie/cake... |
| 1 | whipped-coffee | [hot water, sugar, instant coffee powder, milk... | [(28, g), (24, g), (12, g), (None, None), (Non... | Add the hot water, sugar, and instant coffee t... |
| 2 | fluffy-perfect-pancakes | [flour, baking powder, milk, butter, egg yolks... | [(500, g), (None, None), (960, mL), (170, g), ... | Whisk together the flour and baking powder in ... |
| 3 | tasty-101-cinnamon-rolls | [whole milk, sugar, unsalted butter, active dr... | [(480, mL), (100, g), (None, None), (None, Non... | Make the dough: In a large bowl, whisk togethe... |
| 4 | healthy-banana-pancakes | [ripe bananas, eggs, vanilla extract, quick-co... | [(None, None), (None, None), (None, None), (70... | Mash bananas in a large bowl until smooth. Mix... |


### Merging data using inner join operation and cleaning

In [12]:
# merging the newly formed dataframe with the tasty_dishes dataframe that contains csv data

df_products = df.merge(df_tasty_dishes,how='inner',left_on='products',right_on='slug')

In [13]:
# cleaning the data and dropping unnecessary columns and nulls

df_products.drop('slug',axis=1,inplace=True)
df_products.dropna(inplace = True)
df_products.reset_index(inplace = True, drop = False)

### Entity resolution 
**We're using the library 'pandas_dedupe' to cluster the products based on product names.**

In [14]:
df_final = pandas_dedupe.dedupe_dataframe(df_products,['products'])

Importing data ...
Reading from dedupe_dataframe_learned_settings
Clustering...
# duplicate sets 2855


In [16]:
# formatting the ingredients column back to a list (pandas_dedupe converts all columns into strings)

df_final['ingredients'] = df_final.apply(lambda row : [ingredient.strip()[1:-1] for ingredient in row['ingredients'][1:-1].split(', ')], axis =1)
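
Because the slicing above splits on `', '`, an ingredient name that itself contains a comma could be split in the wrong place. A possible alternative, sketched here on a made-up string rather than on the project data, is to parse the stringified list with the standard library's `ast.literal_eval`. This assumes the column still holds the usual Python repr of a list; the helper name `parse_list_string` is hypothetical.

```python
import ast

def parse_list_string(value):
    """Turn a string such as "['a', 'b']" back into a Python list."""
    parsed = ast.literal_eval(value)
    # Guard against strings that evaluate to something other than a list
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return [str(item).strip() for item in parsed]

# Made-up example value (assumption: same shape as the stringified column)
sample = "['unsalted butter', 'whole milk', \"baker's yeast\", 'salt, to taste']"
print(parse_list_string(sample))
```

With this approach, `'salt, to taste'` stays a single ingredient instead of being split at the comma.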

### Adding cluster-keywords to the dataframe to make it easier to access clusters

In [18]:
# extract_keywords stems each word of the product names and returns the 3 most frequent stems

def extract_keywords(products):
    keywords = [ps.stem(prod) for prod in products.split('-')]
    counter = collections.Counter(keywords)
    return [word for word,count in counter.most_common()[:3]]

# add_keywords adds the cluster keywords produced by extract_keywords to the dataframe
def add_keywords(products_df):
    cluster_groups = products_df.groupby('cluster id')['products'].agg(list)
    cluster_keywords_df = pd.DataFrame(index=cluster_groups.index)
    cluster_keywords_df['cluster_keywords'] = cluster_groups.apply(lambda products_list: ', '.join(extract_keywords('-'.join(products_list))))
    
    # Merge the cluster_keywords back to the original dataframe
    df = pd.merge(products_df, cluster_keywords_df, on='cluster id')
    
    # Keep only the columns needed downstream
    df = df[['products','ingredients','metric_unit','instructions','calories','protein','fat','sugar','carbohydrates','fiber','cluster id','cluster_keywords']]
    return df

In [21]:
# call the add_keywords function

df = add_keywords(df_final)

### Function to add nourish_db data into the dataframe if the cluster keywords match the product names in nourish_db

In [22]:
def add_usda_rows(df, df_experimental, cluster_id):
    # keywords of the requested cluster (the same for every row in the cluster)
    clus_key = df.groupby('cluster id').get_group(cluster_id)['cluster_keywords'].values[0]
    matches = process.extract(clus_key, df_experimental['description'], scorer=fuzz.token_sort_ratio)
    threshold = 95
    filtered_matches = [match for match in matches if match[1] >= threshold]
    fdc_to_match = []
    # Extract rows based on the filtered matches
    for match in filtered_matches:
        index = match[2]
        matched_row = df_experimental.iloc[index]
        # plain int (assumes fdc_id is an integer column) so psycopg2 can adapt the value
        fdc_to_match.append(int(matched_row.fdc_id))
    if len(fdc_to_match) > 0:
        fdc_to_match = tuple(fdc_to_match)
        # psycopg2 adapts a tuple parameter to a SQL list, so a single match also yields valid SQL
        sql_query = """SELECT t1.fdc_id, t1.description, t2.amount, t3.name, t3.unit_name, t4.ingredients
        FROM usda_2022_food_branded_experimental t1
        LEFT JOIN usda_2022_branded_food_nutrients t2 ON t1.fdc_id = t2.fdc_id
        LEFT JOIN usda_2022_branded_food_product t4 ON t1.fdc_id = t4.fdc_id
        LEFT JOIN usda_2022_nutrient_master t3 ON t2.nutrient_id = t3.id
        WHERE t1.fdc_id IN %s"""
        
        cur = connection.cursor()
        cur.execute(sql_query, (fdc_to_match,))
        records = cur.fetchall()
        columns = [desc[0] for desc in cur.description]
        cur.close()
        records_df = pd.DataFrame(records, columns = columns)
        for fdc in fdc_to_match:
            new_df = records_df.groupby('fdc_id').get_group(fdc)
            # build one row in the same column order as df
            new_row_append = pd.Series([new_df['description'].values[0].lower(),
                                re.sub(r"[\(\[].*?[\)\]]", "", new_df.ingredients.values[0].lower()),
                                None,
                                None,
                                float(new_df[new_df['name'] == 'Energy']['amount'].values[0]),
                                float(new_df[new_df['name'] == 'Protein']['amount'].values[0]),
                                float(new_df[new_df['name'] == 'Total lipid (fat)']['amount'].values[0]),
                                float(new_df[new_df['name'] == 'Sugars, Total']['amount'].values[0]),
                                float(new_df[new_df['name'] == 'Carbohydrate, by difference']['amount'].values[0]),
                                float(new_df[new_df['name'] == 'Fiber, total dietary']['amount'].values[0]),
                                cluster_id,
                                clus_key], index=df.columns)
            df = pd.concat([df, pd.DataFrame([new_row_append], columns=df.columns)], ignore_index=True)
        return df
    else:
        print("No match found, records not inserted")
        return df

### Create neo4j graph using the dataframe 

In [271]:
def create_graph(tx, cluster_id, cluster_keywords, product_name, ingredients, instructions, protein, fat, calories, sugar, carbohydrates, fiber):
    protein = float(protein)
    fat = float(fat)
    calories = float(calories)
    sugar = float(sugar)
    carbohydrates = float(carbohydrates)
    fiber = float(fiber)
    # create Cluster node
    tx.run("""
        MERGE (c:Cluster {name: $cluster_id, keywords: $cluster_keywords})
    """, cluster_id=cluster_id, cluster_keywords=cluster_keywords)

    # create Product node
    tx.run("""
        MERGE (p:Product {name: $product_name, protein: $protein, fat: $fat,
                          calories: $calories, sugar: $sugar, carbohydrates: $carbohydrates, fiber: $fiber})
    """, product_name=product_name, protein=protein, fat=fat, calories=calories, sugar=sugar, carbohydrates=carbohydrates, fiber=fiber)
    
    # create HAS_PRODUCTS relationships between cluster and Product
    if product_name:
        tx.run("""MATCH (c:Cluster {name: $cluster_id})
                  MATCH (p:Product {name: $product_name})
                  MERGE (c)-[:HAS_PRODUCTS]->(p)
               """, cluster_id=cluster_id, product_name=product_name)
            
    # create Ingredient nodes
    for ingredient in ingredients:
        if ingredient:
            tx.run("""
                MERGE (i:Ingredient {name: $ingredient})
            """, ingredient=ingredient)

    # create CONTAINS relationships between Product and Ingredient
    for ingredient in ingredients:
        if ingredient:
            tx.run("""
                MATCH (p:Product {name: $product_name})
                MATCH (i:Ingredient {name: $ingredient})
                MERGE (p)-[:CONTAINS]->(i)
            """, product_name=product_name, ingredient=ingredient) 

    # create Instruction nodes
    if instructions:
        tx.run("""
            MERGE (j:Instructions {instructions: $instructions})
        """, instructions=instructions)
        # create HAS_INSTRUCTIONS relationships between Product and Instructions
        tx.run("""
                MATCH (p:Product {name: $product_name})
                MATCH (j:Instructions {instructions: $instructions})
                MERGE (p)-[:HAS_INSTRUCTIONS]->(j)
            """, product_name=product_name, instructions=instructions)
        
# connect to the database and run the transaction
with GraphDatabase.driver(uri, auth=(username, password)) as driver:
    with driver.session() as session:
        for index, row in df.iterrows():
            session.execute_write(create_graph,
                                  row['cluster id'],
                                  row['cluster_keywords'],
                                  row['products'],
                                  row['ingredients'],
                                  row['instructions'],
                                  row['protein'],
                                  row['fat'],
                                  row['calories'],
                                  row['sugar'],
                                  row['carbohydrates'],
                                  row['fiber'])

### Queries to run on the neo4j graph

In [23]:
# function to run the query provided

def run_cypher_query(query):
    with GraphDatabase.driver(uri, auth=(username, password)) as driver:
        with driver.session() as session:
            result = session.run(query)
            return result.data()
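
The query functions in this notebook interpolate Python values directly into the Cypher text with f-strings. The Neo4j Python driver also accepts query parameters as a second argument to `session.run`, which avoids quoting problems when a value contains an apostrophe. The sketch below only builds the query text and the parameter dictionary, so it runs without a database; `build_cluster_query` is a hypothetical helper and the query is a simplified variant of the ones used in this notebook.

```python
def build_cluster_query(search_substring, limit=3):
    """Return Cypher text plus parameters for a lowest-calorie product lookup."""
    query = (
        "MATCH (n:Cluster) WHERE n.keywords CONTAINS $stem "
        "MATCH (n)-[:HAS_PRODUCTS]->(p:Product) "
        "WHERE p.name CONTAINS $stem "
        "RETURN p.name AS Product, p.calories AS calories "
        "ORDER BY calories ASC LIMIT $limit"
    )
    # Values travel separately from the query text, so no manual quoting is needed
    params = {"stem": search_substring, "limit": int(limit)}
    return query, params

query, params = build_cluster_query("pancak")
print(params)
# With a live session this would be executed as: session.run(query, params)
```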

#### Query 1 : "I have vegan margarine, soy milk, sunflower oil, plain flour, caster sugar, baking powder, salt at home. Can you suggest me a quick recipe to make a dish out of this?"

##### Function give_recipe_for_ingredients 
- searches for products that contain all the ingredients provided by the user
- sorts the products by the number of sentences in the instructions
- returns the recipe with the fewest sentences
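
The selection logic in the steps above can be illustrated in plain Python, independent of Neo4j. This is only a sketch on made-up toy data; `quickest_recipe` and the sample recipes are hypothetical and not part of the project.

```python
import re

def quickest_recipe(recipes, pantry):
    """Pick the recipe that uses every pantry item and has the fewest sentences."""
    wanted = set(pantry)
    # Keep recipes whose ingredient list contains all requested ingredients
    candidates = [r for r in recipes if wanted <= set(r["ingredients"])]
    if not candidates:
        return None
    # Approximate the sentence count by splitting on full stops
    return min(candidates, key=lambda r: len(re.split(r"\.", r["instructions"])))

# Made-up toy data, not taken from the project dataset
recipes = [
    {"name": "toast", "ingredients": ["bread", "butter"],
     "instructions": "toast the bread. spread butter."},
    {"name": "french-toast", "ingredients": ["bread", "butter", "eggs"],
     "instructions": "whisk eggs. dip bread. fry in butter. serve."},
    {"name": "omelette", "ingredients": ["eggs", "butter"],
     "instructions": "whisk eggs. fry in butter."},
]
print(quickest_recipe(recipes, ["bread", "butter"])["name"])
```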

In [76]:
def give_recipe_for_ingredients(list_of_ingredients):
    #graph_query - "MATCH (p:Product)-[:CONTAINS]->(i:Ingredient) WITH p, COLLECT(i.name) AS productIngredients WHERE ALL(i IN ['vegan margarine', 'soy milk', 'sunflower oil', 'plain flour', 'caster sugar', 'baking powder', 'salt'] WHERE i IN productIngredients) MATCH (p)-[:HAS_INSTRUCTIONS]->(instr) RETURN p, instr"
    # the backslash is doubled so that Python passes the regex \. to apoc.text.split unchanged
    query1 = f"""MATCH (p:Product)-[:CONTAINS]->(i:Ingredient) 
            WITH p, COLLECT(i.name) AS productIngredients 
            WHERE ALL(i IN {list_of_ingredients} WHERE i IN productIngredients) 
            MATCH (p)-[:HAS_INSTRUCTIONS]->(instr) 
            RETURN p.name AS productName, instr.instructions, size(apoc.text.split(instr.instructions, "\\.")) AS numsentences 
            ORDER BY numsentences ASC LIMIT 1
            """
    result1 = run_cypher_query(query1)
    # guard against an empty result before indexing into it
    if not result1:
        print("No recipe found that uses all the ingredients provided.")
        return
    print(f"With the ingredients provided, you can make {(' ').join(result1[0]['productName'].split('-')).upper()}.\nThese are the instructions to prepare - \n{result1[0]['instr.instructions']}")

In [77]:
list_of_ingredients = ['vegan margarine', 'soy milk', 'sunflower oil', 'plain flour', 'caster sugar', 'baking powder', 'salt']
give_recipe_for_ingredients(list_of_ingredients)

With the ingredients provided, you can make VEGAN DOUGHNUTS.
These are the instructions to prepare - 
gently melt the butter over a low-medium heat. add milk and 2 tablespoons of sunflower oil and mix together. once combined, take off the heat and set aside. in a separate bowl, combine the flour, half of the sugar, baking powder and salt with a fork. make a well in the center and pour in the butter mixture. combine gradually until a thick dough forms. using your hands, roll dough into little flat balls and with your thumb, press a hole in the center of each doughnut. (you may need to flour your hands for this part to avoid getting sticky!) heat up oil in a pan. to know when it's hot enough, fry a little bit of bread in the oil. if it goes brown and floats to the top, in 45-50 seconds the oil will be ready! gently lay the doughnuts into the oil using a spatula. fry for about 3-5 minutes on each side, until golden brown. transfer the doughnuts onto some tissue paper to soak up any excess

#### Query 2 : "I am a diabetes patient. Help me make some coffee while keeping my diabetes in mind"

##### Function food_for_condition 
- searches for clusters whose keywords contain the product_to_make
- for all the products under each cluster, removes the ingredients that the user needs to avoid for the given condition
- for example, a diabetic patient needs to avoid sugar and alcohol
- sorts the products in ascending order of calories
- lets the user choose among the 3 lowest-calorie products
- provides the user with the ingredients to use and the instructions to make the chosen product
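
The first two steps, detecting the condition in the user's query and dropping the ingredients to avoid, can be sketched without the graph. The lookup table below is a made-up stand-in for the food-to-avoid CSV, and both helper names are hypothetical.

```python
def find_condition(user_query, conditions):
    """Return the first known condition mentioned in the query, ignoring case."""
    lowered = user_query.lower()
    for condition in conditions:
        if condition.lower() in lowered:
            return condition
    return None

def drop_avoided(ingredients, avoid):
    """Remove ingredients whose name contains any of the words to avoid."""
    return [ing for ing in ingredients if not any(word in ing for word in avoid)]

# Hypothetical lookup table in the spirit of the food-to-avoid CSV
foods_to_avoid = {"Diabetes": ["sugar", "alcohol"], "Heart Disease": ["butter", "salt"]}

condition = find_condition("I am a diabetes patient", foods_to_avoid)
print(condition, drop_avoided(["milk", "sugar", "instant coffee", "hot water"],
                              foods_to_avoid[condition]))
```

Matching on substrings means that an entry such as `'brown sugar'` is also removed when `'sugar'` is on the avoid list, which mirrors the `CONTAINS` check in the Cypher query.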

In [64]:
def food_for_condition (prod_to_make, user_query, food_to_avoid_file):

    #get the food to avoid data
    food_to_avoid = pd.read_csv(food_to_avoid_file, delimiter = ';')

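    #from the user query fetch the condition mentioned (case-insensitive match)
    condition_to_consider = None
    for condition in food_to_avoid['Condition']:
        if condition.lower() in user_query.lower():
            condition_to_consider = condition
    if condition_to_consider is None:
        raise ValueError("No known condition found in the user query")
            
    #search_keywords - food items to avoid for the given condition (surrounding whitespace stripped)
    search_keywords = [item.strip() for item in food_to_avoid[food_to_avoid['Condition']==condition_to_consider]['food_to_avoid'].values[0].split(',')]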

    #graph_query - "WITH ['sugar','alcohol'] AS subs  MATCH (n:Cluster)  WHERE n.keywords CONTAINS 'coffe'  WITH n as cluster, subs  MATCH (cluster)-[:HAS_PRODUCTS]->(p:Product)  MATCH (p)-[:HAS_INSTRUCTIONS]->(instr:Instructions)  MATCH (p)-[:CONTAINS]->(i:Ingredient)   WHERE p.name CONTAINS 'coffe'   AND NOT ANY(word IN subs WHERE i.name CONTAINS word)  RETURN p, instr, COLLECT(i) ORDER BY p.calories ASC  LIMIT 3"
    #Cypher query to fetch low calorie product requested by user
    query2 = f"""WITH {search_keywords} AS subs 
                MATCH (n:Cluster) 
                WHERE n.keywords CONTAINS '{prod_to_make}' 
                WITH n as cluster, subs 
                MATCH (cluster)-[:HAS_PRODUCTS]->(p:Product) 
                MATCH (p)-[:HAS_INSTRUCTIONS]->(instr:Instructions) 
                MATCH (p)-[:CONTAINS]->(i:Ingredient)  
                WHERE p.name CONTAINS '{prod_to_make}'  
                AND NOT ANY(word IN subs WHERE i.name CONTAINS word) 
                RETURN p.name as Product, instr.instructions AS instructions, COLLECT(i.name) as ingredients, p.calories AS calories 
                ORDER BY calories ASC 
                LIMIT 3
                """
    result3 = run_cypher_query(query2)
    
    #Provide the user with 3 options of the product with least calories
    recipe = input(f"I have {len(result3)} recipes which are less in calories. Please pick one recipe among {(', ').join([result3[i]['Product'] for i in range(len(result3))])}\n\n")

    #format the requested recipe to search among the result
    input_string = ('-').join(recipe.split(' ')).lower()

    #print the ingredients and instructions for the requested recipe
    for r in result3:
        if r['Product'] == input_string:
            for key in search_keywords:
                if key in r['instructions']:
                    r['instructions'] = r['instructions'].replace(f"{key}, ", '')
            print(f"\nIngredients to use : {r['ingredients']}\n\nInstructions to make {input_string} - {r['instructions']}")

In [65]:
prod_to_make = ps.stem('Coffee')
user_query = "I am a diabetes patient. Help me make some coffee while keeping my diabetes in mind"
food_for_condition (prod_to_make, user_query, 'Disease_foods_to_avoid_no_names.csv')

I have 3 recipes which are less in calories. Please pick one recipe among dalgona-coffee, whipped-coffee, cinnamon-coffee

 dalgona coffee



Ingredients to use : ['milk', 'instant coffee', 'hot water']

Instructions to make dalgona-coffee - add coffee, and water into a bowl or cup, and whip until it reaches a meringue like consistency! take a cup of milk & spoon whipped coffee on it and enjoy!


#### Query 3 : "I want to make Healthy Pancakes. Can you give me a recipe?" 

##### Function give_healthy_recipe 
- searches for clusters whose cluster_keywords contain the stem of the product provided by the user
- gets the products under these clusters whose names contain the search_substring
- sorts them in ascending order of calories
- returns the 3 lowest-calorie recipes

In [36]:
# function to provide the user with healthiest recipe of the asked product
def give_healthy_recipe(product):

    # search substring - stem of the product
    search_substring = ps.stem(product)

    # query to run
    #Graph_Query - "MATCH (n:Cluster) WHERE n.keywords CONTAINS 'pancak' WITH n as cluster MATCH (cluster)-[:HAS_PRODUCTS]->(p:Product) MATCH (p)-[:HAS_INSTRUCTIONS]->(instr:Instructions)  MATCH (p)-[:CONTAINS]->(i:Ingredient)  RETURN p, instr, COLLECT(i) as ingredient"
    query1 = f"""MATCH (n:Cluster) 
                WHERE n.keywords CONTAINS '{search_substring}' 
                WITH n as cluster 
                MATCH (cluster)-[:HAS_PRODUCTS]->(p:Product) 
                MATCH (p)-[:HAS_INSTRUCTIONS]->(instr:Instructions) 
                MATCH (p)-[:CONTAINS]->(i:Ingredient) 
                WHERE p.name CONTAINS '{search_substring}' 
                RETURN p.name as Product, instr.instructions AS instructions, COLLECT(i.name) as ingredients, p.calories AS calories 
                ORDER BY calories ASC 
                LIMIT 3
                """
    clusters_data = run_cypher_query(query1)
    return clusters_data, search_substring

In [37]:
# call the function and store clusters_data and search_substring into variables

clusters_data, search_substring = give_healthy_recipe('Pancake')

In [38]:
# remove the ingredients that are similar to each other using fuzzywuzzy library

ingredients = sorted(list(set([ing for i in clusters_data for ing in i['ingredients']])))
for i in ingredients:
    for j in ingredients:
        ratio = fuzz.ratio(i,j)
        if ratio>65 and i!=j:
            ingredients.remove(j)

print(ingredients)

['banana', 'blueberry', 'butter', 'cinnamon', 'eggs', 'flour', 'maple syrup', 'milk', 'topping of your choice', 'vanilla']
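
The loop above removes items from `ingredients` while iterating over it, which can cause some elements to be skipped. A possible alternative is to build a new list and compare each candidate only with the items already kept. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so its scores may differ slightly from fuzzywuzzy's; `drop_near_duplicates` is a hypothetical helper and the sample list is made up.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(items, threshold=65):
    """Keep the first item of every group of similar strings."""
    kept = []
    for item in items:
        # Compare against the items already kept instead of mutating the list being looped over
        is_duplicate = any(
            SequenceMatcher(None, item, other).ratio() * 100 > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept

print(drop_near_duplicates(["banana", "bananas", "egg", "eggs", "flour"]))
```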


In [39]:
# get the instructions for the 3 healthiest products

instructions = [i['instructions'] for i in clusters_data]
instructions

['blend all ingredients in a blender until completely combined. divide among all the cups of a greased muffin tin, cups will be about  1/3  full. bake at 400degf (200degc) for 13-15 minutes. pancakes will puff up super big, then deflate when you remove them from the oven. fill with whatever you want--fresh fruit and syrup, bacon and eggs, sausage, or jam and whipped cream. enjoy!',
 'in a large bowl, mash the bananas until they reach a liquid state. whisk in the eggs. heat butter or oil in a large skillet over medium heat. pour batter into the warm skillet to cook. garnish with desired number of blueberries. after about a minute and a half (or until golden) flip to finish cooking. allow to cool for a minute. serve warm. enjoy!',
 'in a bowl, mash the banana with a fork. add eggs and cinnamon. mix until combined. heat a nonstick skillet over medium heat. add a spoonful of batter and cook for 3-4 minutes, then flip and cook for an additional 3-4 minutes. serve with maple syrup or honey. 

##### Function summarize_instructions
- uses openai to summarize the given instructions
- returns the summarized instructions to the user

In [83]:
# summarize_instructions uses openai to summarize the unstructured data (instructions) to provide to the user

def summarize_instructions(search_substring, instructions, ingredients):
    openai.api_key = os.environ["OPENAI_API_KEY"]  # read the key from the environment instead of hardcoding it
    
    prompt = """ Given the following instructions, ingredients and product to make, PLEASE summarize the instructions while STRICTLY following these rules
    1. Please get me a summary of the quickest of the recipes, where the instructions do NOT repeat any steps.
    2. Please include as many ingredients as possible in the instructions.
    3. DO NOT create fictitious data.
    4. The output content should be in text format.
    5. If you will be unable to output within the token limit, please DO NOT include that entry in the response. """

    #call the chat completions API (gpt-4) to summarize the instructions
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
        {"role": "system", "content": "You are a helpful assistant that gives the summarized instructions of the given recipes who does NOT repeat any steps"},
        {"role": "user", "content": f"{prompt}, instructions:{instructions}, ingredients: {ingredients}, product_to_make: {search_substring}"}
        ],
        temperature=0,
        max_tokens=1024 )
    #get the response
    print(f"Ingredients required - {ingredients}\nInstructions to prepare - {response.choices[0].message.content}")

In [84]:
summarize_instructions(search_substring, instructions, ingredients)

Ingredients required - ['banana', 'blueberry', 'butter', 'cinnamon', 'eggs', 'flour', 'maple syrup', 'milk', 'topping of your choice', 'vanilla']
Instructions to prepare - To make pancakes, start by mashing the banana in a large bowl until it reaches a liquid state. Whisk in the eggs and add cinnamon. Heat butter in a large skillet over medium heat. Pour the batter into the warm skillet and garnish with blueberries. Cook for about a minute and a half or until golden, then flip to finish cooking. Allow to cool for a minute before serving warm with maple syrup or your choice of topping. Enjoy!


#### Query 4 - "Please give me the vegan substitute of healthier chicken alfredo pasta"

##### Function get_vegan_recipe 
- gets the vegan substitutes data
- searches for non-vegan-product in the products
- gets the ingredient list and instructions for the non-vegan product
- using the vegan substitutes data, replaces the non-vegan ingredients with vegan ingredients
- returns the instructions for the new vegan product
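
One subtlety of replacing ingredients one key at a time with `str.replace` is that a shorter key (for example `chicken`) can match inside a longer one (for example `chicken stock`). A possible refinement, sketched below with a made-up substitution table, is to match longer keys first in a single regular-expression pass. `apply_substitutions` and the table contents are hypothetical.

```python
import re

def apply_substitutions(text, substitutions):
    """Replace each key with its value, matching longer keys first."""
    if not substitutions:
        return text
    # Sorting by length keeps 'chicken stock' from being consumed by 'chicken'
    keys = sorted(substitutions, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass means replaced text is never substituted a second time
    return pattern.sub(lambda match: substitutions[match.group(0)], text)

# Hypothetical substitution table, not the contents of the project CSV
subs = {"chicken": "tofu", "chicken stock": "vegetable stock", "milk": "oat milk"}
print(apply_substitutions("add the chicken stock, then the chicken and milk", subs))
```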

In [78]:
def get_vegan_recipe(non_veg_prod_name, vegan_csv_file):
    #dictionary for substitutes (non_veg:veg)
    subs = {}
    
    #read csv file
    vegan_data = pd.read_csv(vegan_csv_file, delimiter = ';')
    #convert Vegan_Ingredient string to list
    vegan_data['Vegan_Ingredient'] = vegan_data.apply(lambda row : [x for x in row['Vegan_Ingredient'][1:-1].split(', ')], axis =1)

    #graph_query - "MATCH(p:Product)-[:CONTAINS]->(i:Ingredient) MATCH (p)-[:HAS_INSTRUCTIONS]->(j:Instructions) WHERE p.name = 'healthier-chicken-alfredo-pasta' RETURN p, COLLECT(i), j"
    #cypher query to match the product and return recipe and ingredients
    query4 = f"""MATCH(p:Product)-[:CONTAINS]->(i:Ingredient) 
                MATCH (p)-[:HAS_INSTRUCTIONS]->(j:Instructions) 
                WHERE p.name = '{non_veg_prod_name}' 
                RETURN p.name, COLLECT(i.name) AS list_ingredients, j.instructions AS recipe
                """
    result4 = run_cypher_query(query4)

    #get substitutes for each non vegan ingredient in the recipe
    for ingredient in result4[0]['list_ingredients']:
        matches = process.extract(ingredient, vegan_data['Non_Vegan_Ingredient'])
        for mtch in matches:
            if mtch[1] > 85:
                vegan_item = vegan_data[vegan_data['Non_Vegan_Ingredient']==mtch[0]]['Vegan_Ingredient'].values[0][0]
                subs[mtch[0]] = vegan_item

    #replace each non vegan ingredient in the recipe with its vegan substitute
    for k,v in subs.items():
        result4[0]['recipe'] = result4[0]['recipe'].replace(k,v)
    return result4[0]['recipe']

In [79]:
vegan_recipe = get_vegan_recipe('healthier-chicken-alfredo-pasta', 'vegan_substitutes.csv')
print(f"Recipe to make the vegan version of chicken alfredo pasta is - \n\n{vegan_recipe}")

Recipe to make the vegan version of chicken alfredo pasta is - 

heat the olive oil over a skillet and add tofu. season with salt and pepper, and cook 5-8 minutes, or until no longer pink. remove the tofu from the pan and set aside. in the same pan, add the garlic and saute for one minute over medium heat. sprinkle the flour over the garlic and slowly add in the tofu stock. quickly stir to avoid lumps. add in the skim oat milk, stir ,and allow to reach a boil to thicken sauce. season with salt and pepper. once the sauce is thickened, add in the spinach and stir until wilted. remove from heat and add in the cooked penne, tofu, and cashew cheese. stir to coat. top with fresh cashew cheese. enjoy!


#### Query 5 - "I'm craving for something delicious, give me options for heart-disease-friendly recipes."

##### Function get_recipes_for_health_condition
- gets the list of food items to be avoided by a user with the given health condition
- unwinds the list so that each product is checked against one food item at a time
- keeps a (product, food item) pair only when none of the product's ingredients contains that item
- counts the surviving pairs per product
- keeps the products whose count equals the length of the food-items-to-avoid list, i.e. the products that avoid every item
- orders the results by fiber content (descending) and then by calories (ascending), treating higher fiber as generally beneficial
- returns the top 5 recipes to the user
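
The counting trick described above can be mirrored in plain Python to make the logic easier to follow. This is a sketch on made-up toy data, not the project's implementation; `recipes_avoiding` and the sample products are hypothetical.

```python
def recipes_avoiding(products, avoid, top_n=5):
    """Return products with no ingredient containing any avoided item."""
    safe = []
    for product in products:
        # Count the avoided items that appear in none of the product's ingredients
        passed = sum(
            1 for item in avoid
            if not any(item in ingredient for ingredient in product["ingredients"])
        )
        # A product qualifies only when it passes the check for every avoided item
        if passed == len(avoid):
            safe.append(product)
    # Highest fiber first, ties broken by lowest calories
    safe.sort(key=lambda p: (-p["fiber"], p["calories"]))
    return safe[:top_n]

# Made-up toy data, not taken from the project dataset
products = [
    {"name": "salad", "ingredients": ["lettuce", "tomato"], "fiber": 4, "calories": 80},
    {"name": "fries", "ingredients": ["potato", "vegetable oil", "salt"], "fiber": 3, "calories": 300},
    {"name": "oatmeal", "ingredients": ["oats", "water"], "fiber": 4, "calories": 150},
    {"name": "lentil-soup", "ingredients": ["lentils", "onion"], "fiber": 8, "calories": 200},
]
print([p["name"] for p in recipes_avoiding(products, ["butter", "salt", "oil"])])
```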

In [80]:
def get_recipes_for_health_condition (condition,health_condition_file):

    #read csv file specifying health condition and food to avoid for the condition
    health_condition_df = pd.read_csv(health_condition_file, delimiter = ';')

    #get the list of food items to avoid for the condition
    ingredients_to_remove = health_condition_df[health_condition_df['Condition']==condition]['food_to_avoid'].to_list()[0].split(', ')

    #graph_query - "UNWIND ['butter', 'salt', 'oil', 'ghee', 'avocado', 'beef'] AS x  WITH x, size(['butter', 'salt', 'oil', 'ghee', 'avocado', 'beef']) AS no_of_items  MATCH (p:Product)-[:CONTAINS]->(ing:Ingredient)  MATCH (p:Product)-[:HAS_INSTRUCTIONS]->(ins:Instructions)  WITH COLLECT(ing) AS ing_nodes, COLLECT(ing.name) as ingredients, ins,p,x,no_of_items  WHERE NOT ANY(ingredient IN ingredients WHERE ingredient CONTAINS x)  WITH COUNT(p) AS pcount,p AS product, ing_nodes,ingredients, ins, no_of_items  WHERE no_of_items-pcount=0  RETURN product,ing_nodes,ins ORDER BY product.fiber DESC, product.calories ASC  LIMIT 5"
    #frame cypher query to get all products which don't contain any of the food items listed above
    query5 = f"""UNWIND {ingredients_to_remove} AS x 
                WITH x, size({ingredients_to_remove}) AS no_of_items 
                MATCH (p:Product)-[:CONTAINS]->(ing:Ingredient) 
                MATCH (p:Product)-[:HAS_INSTRUCTIONS]->(ins:Instructions) 
                WITH COLLECT(ing.name) as ingredients, ins,p,x,no_of_items 
                WHERE NOT ANY(ingredient IN ingredients WHERE ingredient CONTAINS x) 
                WITH COUNT(p.name) AS pcount,p.name AS product, ingredients, ins.instructions AS instructions,p.calories AS calories, p.carbohydrates AS carbs, p.sugar AS total_sugar, p.fat AS fat, p.protein AS protein, p.fiber AS fiber, no_of_items 
                WHERE no_of_items-pcount=0 
                RETURN product,ingredients,instructions,calories,carbs,total_sugar,protein,fiber 
                ORDER BY fiber DESC, calories ASC 
                LIMIT 5
                """
    #run the cypher query
    result5 = run_cypher_query(query5)

    #return the result
    return result5

In [81]:
recipes_for_health_condition = pd.DataFrame(get_recipes_for_health_condition ('Heart Disease','Disease_foods_to_avoid_no_names.csv'))
recipes_for_health_condition

#### Future work

Our current knowledge graph also points to potential future work. 
The current average food cost, per patient, per day in a hospital is $268. 
This process requires a dietician to ensure that every patient's nutritional needs are met. 
Just as AI is beginning to be used to tailor medication dosage to patient profiles, it could be used to ensure that patients enjoy their food and have their nutritional needs met, at a lower cost to the healthcare industry.

- Use metric-unit data to incorporate exact measurements of each ingredient
- Use more advanced entity-resolution techniques (e.g. blocking, entity matching, clustering)
- Use food_to_avoid data from authoritative sources such as hospitals to give better recommendations to the user
- Improve suggestions by expanding the food nutrition profile
- Make the data/knowledge graph usable by popular food-logging applications