<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pandas" data-toc-modified-id="Pandas-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pandas</a></span></li><li><span><a href="#Fuzzy" data-toc-modified-id="Fuzzy-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Fuzzy</a></span></li><li><span><a href="#Similarity-algorithms" data-toc-modified-id="Similarity-algorithms-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Similarity algorithms</a></span></li><li><span><a href="#Merging-categories-with-recipes" data-toc-modified-id="Merging-categories-with-recipes-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Merging categories with recipes</a></span></li></ul></div>

In [25]:
import pandas as pd
import ast
import dask.dataframe as dd
from rapidfuzz import process

In [2]:
recipe_data = pd.read_csv(r"C:\Users\rishi\OneDrive - Monash University\Documents\Monash\MDS Y2 S2\IE\Iteration 2\recipe_Data\dataset\full_dataset.csv")

In [3]:
recipe_data = recipe_data.drop(["Unnamed: 0", "link", "source"], axis = 1)

In [4]:
recipe_data.shape

(2231142, 4)

In [5]:
recipe_data.head(2)

Unnamed: 0,title,ingredients,directions,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...","[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....","[""beef"", ""chicken breasts"", ""cream of mushroom..."


In [147]:
df = recipe_data[["NER"]]

In [148]:
#converting the str to text
for i in range(df.shape[0]):
    df.at[i, "NER"] = ast.literal_eval(df.iloc[i, 0].lower())

In [149]:
df.head()

Unnamed: 0,NER
0,"[brown sugar, milk, vanilla, nuts, butter, bit..."
1,"[beef, chicken breasts, cream of mushroom soup..."
2,"[frozen corn, cream cheese, butter, garlic pow..."
3,"[chicken, chicken gravy, cream of mushroom sou..."
4,"[peanut butter, graham cracker crumbs, butter,..."


I want to generate recipes based on NER
e.g.
- I have "milk", "vanilla", "chicken", "rice"
- give top 5 matches that has most of these ingredients
- show the ones that are used and not used in the recipe. 
- then when one of them is clicked --> the instructions is shown

To do code:
- I have input = ["milk", "vanilla", "chicken", "rice"]
- the 'NER' column in 'recipe_data' has '["brown sugar", "milk", "vanilla", "nuts", "butter", "bite size shredded rice biscuits"]'
- give top 5 matches that has most of input in 'NER' column(ingredients of the recipe)
- show the ones that are present and not present in the 'NER' column(ingredients) 

what are the different ways to do this:

- similarity search algorithms
- ML model. x = NER. y = Title ?
- decision tree model like thing - recipes under brown sugar. another branch of milk. now get the branch that goes down atleast a bit. 
- Fuzzy match - match your list with that NER list
- for loop. get input[i] in NER. get recipes with max count on input in NER. - for loop - search the count of items present and not present in recipe - time consuming


# Pandas

In [15]:
input_ingredients = set(["milk", "vanilla", "chicken", "rice"])

In [21]:
%%time

# Convert pandas DataFrame to dask DataFrame
ddf = dd.from_pandas(df, npartitions=10)

# Perform the same operation
ddf["present"] = ddf["NER"].apply(lambda x: list(input_ingredients.intersection(set(x))), meta=('x', 'object'))
ddf["not_present"] = ddf["NER"].apply(lambda x: list(input_ingredients.difference(set(x))), meta=('x', 'object'))
ddf["match_count"] = ddf["present"].apply(len, meta=('x', 'int'))

# Get top 5 matches
top_matches = ddf.nlargest(5, "match_count").compute()


CPU times: total: 8.67 s
Wall time: 21.8 s


In [24]:
top_matches

Unnamed: 0,NER,present,not_present,match_count
2866,"[chicken, rice, cream of mushroom soup, cream ...","[rice, milk, chicken]",[vanilla],3
3030,"[rice, sugar, raisins, vanilla, milk, milk]","[vanilla, rice, milk]",[chicken],3
3400,"[eggs, sugar, milk, cornstarch, vanilla, lemon...","[vanilla, rice, milk]",[chicken],3
4128,"[water, milk, rice, eggs, sugar, vanilla]","[vanilla, rice, milk]",[chicken],3
4532,"[chicken, bread crumbs, rice, onion, celery, p...","[rice, milk, chicken]",[vanilla],3


In [23]:
top_matches.index

Int64Index([2866, 3030, 3400, 4128, 4532], dtype='int64')

# Fuzzy

In [30]:
%%time
best_match = process.extractOne("[milk, vanilla, chicken, rice]", recipe_data["NER"])
best_match[0]

#this can only give one output

CPU times: total: 13.1 s
Wall time: 21.5 s


'["chicken", "rice"]'

In [29]:
%%time
best_match = process.extractOne("[milk, vanilla, chicken, rice]", ddf["NER"])
best_match[0]

#this can only give one output though

CPU times: total: 12.3 s
Wall time: 23.4 s


['will', ']']

# Similarity algorithms

https://spotintelligence.com/2022/12/19/text-similarity-python/ 

Jaccard index = proportion of common elements between two sets.
The Jaccard index is particularly useful when the presence or absence of elements in the sets is more important than their frequency or order. For example, it can be used to compare the similarity of two documents by considering the sets of words that appear in each document.
J(A,B) = |A ∩ B| / |A ∪ B|

When using this algorithm we will include recipe that has more items included from input in the recipe list and less items ignored in input list or recipe list. 

In [172]:
def jaccard_similarity(input_items, recipe_ingredients):
    input_items = set(input_items)
    recipe_ingredients = set(recipe_ingredients)
    # intersection of two sets
    intersection = len(input_items.intersection(recipe_ingredients))
    # Unions of two sets
    union = len(input_items.union(recipe_ingredients))
    return intersection / union

In [175]:
%%time
input_ingredients = set(["milk", "bread", "chicken", "rice"])
df["similarity_score"] = df['NER'].apply(lambda x: jaccard_similarity(input_ingredients, x))
indices = df.nlargest(n=5, columns = "similarity_score").index

CPU times: total: 2.23 s
Wall time: 5.72 s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [179]:
i = 4
print("Items from food inventry: ", input_ingredients)
print("Items available to do this recipe: ", input_ingredients.intersection(df.iloc[indices[i],0]))
#print(input_ingredients.difference(df.iloc[indices[i],0]))
print("Items not available to do this recipe: ", set(df.iloc[indices[i],0]).difference(input_ingredients))
print("\n")
print("Recipe name:", recipe_data.iloc[indices[i], 0])
print("Ingredients:", recipe_data.iloc[indices[i], 1])
print("Recipe: ", recipe_data.iloc[indices[i], 2])

Items from food inventry:  {'bread', 'rice', 'milk', 'chicken'}
Items available to do this recipe:  {'bread', 'rice', 'chicken'}
Items not available to do this recipe:  {'beef'}


Recipe name: Surprise Rice And Gravy
Ingredients: ["10 c. rice", "2 c. roast beef", "1 chicken leg", "bread"]
Recipe:  ["Cut the chicken and roast beef.", "Stir it; put in rice.", "Add some bread and cook for a while.", "Yum!"]


In [210]:
%%time
input_ingredients = set(["white toast soft"])
df["similarity_score"] = df['NER'].apply(lambda x: jaccard_similarity(input_ingredients, x))
indices = df.nlargest(n=5, columns = "similarity_score").index

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


CPU times: total: 1.52 s
Wall time: 5.39 s


In [211]:
i = 4
print("Items from food inventry: ", input_ingredients)
print("Items available to do this recipe: ", input_ingredients.intersection(df.iloc[indices[i],0]))
#print(input_ingredients.difference(df.iloc[indices[i],0]))
print("Items not available to do this recipe: ", set(df.iloc[indices[i],0]).difference(input_ingredients))
print("\n")
print("Recipe name:", recipe_data.iloc[indices[i], 0])
print("Ingredients:", recipe_data.iloc[indices[i], 1])
print("Recipe: ", recipe_data.iloc[indices[i], 2])

Items from food inventry:  {'white toast soft'}
Items available to do this recipe:  set()
Items not available to do this recipe:  {'butter', 'peanut butter', 'chocolate chips', 'graham cracker crumbs', 'powdered sugar'}


Recipe name: Reeses Cups(Candy)  
Ingredients: ["1 c. peanut butter", "3/4 c. graham cracker crumbs", "1 c. melted butter", "1 lb. (3 1/2 c.) powdered sugar", "1 large pkg. chocolate chips"]
Recipe:  ["Combine first four ingredients and press in 13 x 9-inch ungreased pan.", "Melt chocolate chips and spread over mixture. Refrigerate for about 20 minutes and cut into pieces before chocolate gets hard.", "Keep in refrigerator."]


In [212]:
%%time
input_ingredients = set(["toast"])
df["similarity_score"] = df['NER'].apply(lambda x: jaccard_similarity(input_ingredients, x))
indices = df.nlargest(n=5, columns = "similarity_score").index
i = 4
print("Items from food inventry: ", input_ingredients)
print("Items available to do this recipe: ", input_ingredients.intersection(df.iloc[indices[i],0]))
#print(input_ingredients.difference(df.iloc[indices[i],0]))
print("Items not available to do this recipe: ", set(df.iloc[indices[i],0]).difference(input_ingredients))
print("\n")
print("Recipe name:", recipe_data.iloc[indices[i], 0])
print("Ingredients:", recipe_data.iloc[indices[i], 1])
print("Recipe: ", recipe_data.iloc[indices[i], 2])

Items from food inventry:  {'toast'}
Items available to do this recipe:  {'toast'}
Items not available to do this recipe:  {'cheese', 'bacon', 'eggs'}


Recipe name: baconator the right way
Ingredients: ["2 eggs", "8 slice bacon", "3 slice toast", "2 slice cheese"]
Recipe:  ["coook off the bacon as instructed on the package", "take your 3 eggs and mix them for scrambled eggs.", "when the eggs are almost done add the cheese so it melts on top.", "when all infredients are done make a triple decker sandwich with 4 pieces of bacon on each slice if bread, and 1 egg each"]
CPU times: total: 3.62 s
Wall time: 4.97 s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [207]:
%%time
input_ingredients = set(["butter chicken simmer sauce mild"])
df["similarity_score"] = df['NER'].apply(lambda x: jaccard_similarity(input_ingredients, x))
indices = df.nlargest(n=5, columns = "similarity_score").index

CPU times: total: 3.67 s
Wall time: 4.93 s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [209]:
i = 1
print("Items from food inventry: ", input_ingredients)
print("Items available to do this recipe: ", input_ingredients.intersection(df.iloc[indices[i],0]))
#print(input_ingredients.difference(df.iloc[indices[i],0]))
print("Items not available to do this recipe: ", set(df.iloc[indices[i],0]).difference(input_ingredients))
print("\n")
print("Recipe name:", recipe_data.iloc[indices[i], 0])
print("Ingredients:", recipe_data.iloc[indices[i], 1])
print("Recipe: ", recipe_data.iloc[indices[i], 2])

Items from food inventry:  {'butter chicken simmer sauce mild'}
Items available to do this recipe:  set()
Items not available to do this recipe:  {'cream of mushroom soup', 'chicken breasts', 'beef', 'sour cream'}


Recipe name: Jewell Ball'S Chicken
Ingredients: ["1 small jar chipped beef, cut up", "4 boned chicken breasts", "1 can cream of mushroom soup", "1 carton sour cream"]
Recipe:  ["Place chipped beef on bottom of baking dish.", "Place chicken on top of beef.", "Mix soup and cream together; pour over chicken. Bake, uncovered, at 275\u00b0 for 3 hours."]


# Merging categories with recipes

In [88]:
import itertools

In [150]:
items_present = list(itertools.chain.from_iterable(list(df["NER"])))

In [100]:
len(items_present)

234059

In [103]:
items_present.sort()

In [108]:
items_present[1:15]

['"Cracker Barrel',
 '"Great Garlic',
 '"Great Guacamole Spice',
 '"Kirsch Liqueur',
 '"Natures Seasoning"',
 '#',
 '#\tOnion',
 '#\tmozzarella cheese',
 '# - Ham',
 '# - boneless flank',
 '# Andouille Sausage',
 '# Ap flour',
 '# BACON',
 '# Bacon']

In [109]:
items_present[400:415]

["'s Chili Cocktail Sauce",
 "'s Chili Mix",
 "'s Chili Pinto Beans",
 "'s Chili Powder Seasoning",
 "'s Chili Seasoning",
 "'s Chili Seasoning Mix",
 "'s Chili mix",
 "'s Chili seasoning",
 "'s Chips",
 "'s Choice",
 "'s Choice California",
 "'s Chunky",
 "'s Chunky Mexican",
 "'s Chunky Salsa",
 "'s Chunky chili"]

In [112]:
items_present[2200:2215]

['Al Capone roast from',
 'Al Dente',
 'Al Fresco',
 'Al Fresco Sweet Italian Style Chicken',
 'Al Fresco Tomato',
 'Al Purpose',
 'Al pastor',
 "Al's barbecue sauce",
 "Al's chow",
 'AlDente',
 'Alabama',
 'Alabama White',
 'Alaga syrup',
 'Alaskan crabmeat',
 'Alaskan halibut']

In [91]:
items_present

{'E', 'N', 'R'}

In [151]:
items_count = pd.Series(items_present).value_counts()

In [196]:
sum(items_count > 0.01*234059) #items present in 5% of the recipes

717

In [197]:
items_count[0:717]

salt                   1013708
sugar                   662832
butter                  539978
flour                   488086
eggs                    422212
                        ...   
peppermint                2351
pimientos                 2349
orange peel               2348
ground white pepper       2342
roma tomatoes             2341
Length: 717, dtype: int64

In [202]:
with open("product_groups.txt", "w") as f:
    f.write(" | ".join(list(items_count[0:350].index)))

In [166]:
import numpy as np

In [199]:
np.save("product_groups", np.array(items_count[0:717].index))

In [141]:
prod_groups = " | ".join(list(items_count[0:33411].index))

In [145]:
re.sub('\x95', "")

'ipe | \x95Salt | chunky'

In [139]:
f = open("product_groups.txt", "a")
f.write()
f.close()

UnicodeEncodeError: 'charmap' codec can't encode character '\x95' in position 162456: character maps to <undefined>