# Google News Embeddings for Ingredient Suggestions
BS"D

This notebook prepares the Google News embeddings for ingredient suggestions. I load the embeddings with the gensim library and then use them to suggest related ingredients for a given ingredient.

The following things need to be done to get the embeddings ready for use:
1. Filter out non-food items (the embeddings offer a lot of non-food words)
2. Set up the I/O for multi-word ingredients (e.g. "green beans"). The embeddings can handle multi-word ingredients, but they join the words with underscores. We need to do the same when we query the embeddings, and we need to turn the underscores back into spaces when we display the suggestions.
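
As a minimal sketch of the second step, the two conversions could look like the helpers below. The function names `to_embedding_token` and `to_display_name` are hypothetical and are not used elsewhere in this notebook; they only illustrate the underscore convention.

```python
def to_embedding_token(ingredient: str) -> str:
    """Convert a display name such as 'green beans' to the 'green_beans' form."""
    # split() with no arguments also collapses repeated or stray whitespace
    return "_".join(ingredient.strip().split())


def to_display_name(token: str) -> str:
    """Convert an embedding token such as 'green_beans' back to 'green beans'."""
    return token.replace("_", " ")


print(to_embedding_token("green beans"))      # green_beans
print(to_display_name("toasted_pine_nuts"))   # toasted pine nuts
```

Note that the round trip is only lossless for names that do not contain underscores of their own.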

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from gensim.models import Word2Vec
from gensim import downloader

from tqdm import tqdm

## Load the Data

In [8]:
filepath = 'data/dataset_2.json'

raw_recipes = pd.read_json(filepath, orient='table')

raw_recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
39769,"[light brown sugar, granulated sugar, butter, ..."
39770,"[KRAFT Zesty Italian Dressing, purple onion, b..."
39771,"[eggs, citrus fruit, raisins, sourdough starte..."
39772,"[boneless chicken skinless thigh, minced garli..."


## Load the Embeddings

In [5]:
# Download Word2Vec model
google_model = downloader.load("word2vec-google-news-300")

In [6]:
google_model.most_similar('chicken')

[('meat', 0.6799129843711853),
 ('Chicken', 0.6726199388504028),
 ('chickens', 0.6597973704338074),
 ('poultry', 0.6559159755706787),
 ('pork', 0.6541998386383057),
 ('grilled_herbed', 0.651627242565155),
 ('pasta_fazool', 0.6511082649230957),
 ('boneless_chicken', 0.6347483396530151),
 ('turkey', 0.6282519102096558),
 ('rotisserie_roasted', 0.6275515556335449)]

In [7]:
google_model.most_similar("parmesan")

[('Parmesan_cheese', 0.7763434648513794),
 ('pecorino', 0.7724556922912598),
 ('Parmesan', 0.7678955793380737),
 ('ricotta', 0.7483644485473633),
 ('Gruyere_cheese', 0.748071014881134),
 ('pancetta', 0.7410364747047424),
 ('fontina', 0.7386741042137146),
 ('parmesan_cheese', 0.7345759868621826),
 ('mascarpone', 0.7343435883522034),
 ('toasted_pine_nuts', 0.7295423150062561)]

## Filter out Non-Food Items
This will be done by creating a list of all the food items in the dataset and using it to filter the suggestions offered by the embeddings.

This filtering should be done on the words as formatted by the embeddings, i.e. with underscores instead of spaces. Therefore, the list used for filtering should also have underscores instead of spaces.

Furthermore, to prevent the filter from discarding too many suggestions, it should be case-insensitive. Every ingredient will be converted to lowercase before being checked against the list.

A complication this introduces is that we will end up with fewer suggestions than requested. To solve this, we will request many more suggestions than needed, filter them, and then return the top n.
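
The oversample-then-filter idea can be sketched independently of the model. In the example below, `filter_top_n` is a hypothetical helper, `ranked` holds the first five neighbours of "parmesan" from the output above (scores rounded), and `allowed` is a toy stand-in for the full ingredient list.

```python
def filter_top_n(ranked, allowed, top_n):
    """Keep the first top_n (token, score) pairs whose lowercased token is in allowed."""
    kept = [(token, score) for token, score in ranked if token.lower() in allowed]
    return kept[:top_n]


# Candidates in descending similarity order, as returned by most_similar
ranked = [
    ("Parmesan_cheese", 0.776),
    ("pecorino", 0.772),
    ("Parmesan", 0.768),
    ("ricotta", 0.748),
    ("Gruyere_cheese", 0.748),
]

# Toy subset of the filter list (lowercase, underscores instead of spaces)
allowed = {"parmesan_cheese", "ricotta", "gruyere_cheese"}

result = filter_top_n(ranked, allowed, top_n=2)
print(result)  # [('Parmesan_cheese', 0.776), ('ricotta', 0.748)]
```

Five candidates were needed to obtain two accepted suggestions here, which is why the request to the model has to be larger than the number of suggestions we ultimately want.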

In [5]:
all_ingredients_raw = set()

for recipe in raw_recipes['ingredients']:
    all_ingredients_raw.update(recipe)

print(all_ingredients_raw)

{'double smoked bacon', 'ground cashew', 'sugar', 'nori paper', 'au jus mix', 'szechwan peppercorns', 'whole wheat baguette', 'condensed cream', 'linguica', 'carrots', 'goose fat', 'french style sandwich rolls', 'ginger ale', 'low fat tortilla chip', 'mirin', 'zesty italian dressing', 'meat loaf mixture', 'hummus', 'dried fig', 'Niçoise olives', 'grill seasoning', 'low fat chunky mushroom pasta sauce', 'reduced fat cream cheese', 'cross rib roast', 'oysters', 'cooking liquid', 'chili oil', 'lingcod', 'biscuit mix', 'fine sea salt', 'lavender', 'turbinado', 'Kahlua Liqueur', 'gumbo file powder', 'baking powder', 'chinese cabbage', 'soft shelled crabs', 'vegan bouillon cubes', 'drippings', 'Spice Islands Oregano', 'mixed mushrooms', 'instant banana cream pudding', 'langoustines', 'garlic bulb', 'hard salami', 'korean vermicelli', 'nonfat dry milk', 'champagne vinegar', 'condensed fiesta nacho cheese soup', 'pure vanilla extract', 'spice cake mix', 'ti leaves', 'Neufchâtel', 'shao hsing w

In [6]:
# Format the ingredients to have underscores instead of spaces
# Also lowercase all the ingredients
all_ingredients = set()

for ingredient in all_ingredients_raw:
    all_ingredients.add(ingredient.lower().replace(' ', '_'))


In [38]:
print(len(all_ingredients))
print(all_ingredients)

6703
{'vegetable_demi-glace', 'pizza_crust_mix', 'beef_carpaccio', 'gluten-free_pie_crust', 'chile_powder', 'chile_puree', 'darjeeling_tea_leaves', 'bass_fillets', 'caster', 'new_york_style_panetini®_toasts', 'grated_nutmeg', 'refried_beans', 'sausages', 'baby_goat', 'flanken_short_ribs', 'crushed_pistachio', 'dairy_free_coconut_ice_cream', 'virgin_coconut_oil', 'coco', 'lamb_leg_steaks', 'store_bought_low_sodium_vegetable_stock', 'chocolate_sticks', 'baked_beans', 'unbleached_flour', 'hot_green_chile', 'pork_blood', 'ground_tumeric', 'long_buns', 'chicharron', 'preserved_lemon', 'red_wine_vinegar', 'kohlrabi', 'pace_chunky_salsa', 'malt', 'splenda_no_calorie_sweetener', 'fresh_parsley', 'fruit_juice', "soft_goat's_cheese", 'sliced_mango', 'tamari_soy_sauce', 'white_miso', 'candy_bar', 'low-fat_cream_cheese', 'gluten-free_flour', 'thai_chili_paste', 'dried_oysters', 'jambon_de_bayonne', 'frying_oil', 'jimmy_dean_pork_sausage', 'coleslaw_dressing', 'watercress', 'turnips', 'mixed_greens

In [60]:
def get_filtered_substitutions(ingredient, top_n=10):

    # The embeddings join multi-word ingredients with underscores,
    # so format the query the same way (e.g. "green beans" -> "green_beans")
    query = ingredient.replace(' ', '_')

    # Request 10x more suggestions to compensate for the filtering
    suggestions = google_model.most_similar(query, topn=top_n * 10)

    # Keep only the suggestions found (case-insensitively) in the ingredient list,
    # and replace the underscores with spaces for display
    filtered_suggestions = [
        (suggestion.replace('_', ' '), score)
        for suggestion, score in suggestions
        if suggestion.lower() in all_ingredients
    ]

    # Return the top_n suggestions
    return filtered_suggestions[:top_n]

In [61]:
get_filtered_substitutions('parmesan', 20)

[('Parmesan cheese', 0.7763434648513794),
 ('ricotta', 0.7483644485473633),
 ('Gruyere cheese', 0.748071014881134),
 ('pancetta', 0.7410364747047424),
 ('fontina', 0.7386741042137146),
 ('parmesan cheese', 0.7345759868621826),
 ('mascarpone', 0.7343435883522034),
 ('toasted pine nuts', 0.7295423150062561),
 ('goat cheese', 0.7287119030952454),
 ('pesto', 0.7269451022148132),
 ('fontina cheese', 0.714117169380188),
 ('prosciutto', 0.7138954401016235),
 ('pecorino cheese', 0.7137478590011597),
 ('gorgonzola', 0.7126846313476562),
 ('crème fraîche', 0.7000254988670349),
 ('gremolata', 0.699065625667572),
 ('provolone cheese', 0.6965590119361877),
 ('roasted garlic', 0.6964260935783386),
 ('Parmigiano', 0.6908614039421082),
 ('toasted almonds', 0.6900662779808044)]

In the above function, I filtered out non-food items *and* I set up the I/O to handle multi-word ingredients.

Now the filtering list needs to be saved and the function needs to be packaged up nicely. The list will be saved here and the function will be packaged up in a separate file.

In [7]:
# Save the filter list to a json file
import json

file_name = "all_ingredients.json"

with open(file_name, 'w') as f:
    json.dump(list(all_ingredients), f)
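
When the function is packaged in its own file, it will need to read this list back. A minimal sketch of that step follows, assuming the file keeps the JSON-list format written above; `load_ingredient_filter` is a hypothetical name, and the round trip uses a small stand-in list and a temporary file rather than the real `all_ingredients.json`.

```python
import json
import os
import tempfile


def load_ingredient_filter(path):
    """Read a JSON list of ingredient tokens and return it as a set for fast lookups."""
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


# Round-trip demonstration with a small stand-in list
sample = {"green_beans", "crème_fraîche", "parmesan_cheese"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "all_ingredients.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(sample), f)  # JSON has no set type, so write a list
    loaded = load_ingredient_filter(path)

print(loaded == sample)  # True
```

Converting back to a set matters because the filter performs one membership test per candidate suggestion.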
