# Ingredient Embeddings
BS"D

In this notebook, we will create embeddings for ingredients in the dataset. We will first attempt to use the gensim library.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from gensim.models import Word2Vec

## Load Data
We have two datasets, labeled `dataset_1.json` and `dataset_2.json`. We will initially only use `dataset_2.json` for the embeddings since it is seemingly more precise.

In [29]:
filepath = 'data/dataset_2.json'

raw_recipes = pd.read_json(filepath, orient='table')

raw_recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
39769,"[light brown sugar, granulated sugar, butter, ..."
39770,"[KRAFT Zesty Italian Dressing, purple onion, b..."
39771,"[eggs, citrus fruit, raisins, sourdough starte..."
39772,"[boneless chicken skinless thigh, minced garli..."


## Prepare data
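gensim's `Word2Vec` expects an iterable of token lists, so each recipe is passed in as its list of ingredients and every ingredient acts as one token.

Ingredients that consist of several words can first be joined into a single underscore-separated token. For example, `green onions` becomes `green_onions`. The function below does this, but the call that applies it is commented out, so the models in this notebook are trained on the raw ingredient names.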

In [4]:
def preprocess_text(ingredients):
    '''
    This function takes a list of ingredients and joins the words of each
    multi-word ingredient with underscores, so that every ingredient is a single token.

    Parameters
    ----------
    ingredients : list
        A list of ingredients.

    Returns
    -------
    list
        The ingredients, with spaces replaced by underscores.
    '''

    # Prepare ingredients with multiple words
    ingredients = [ingredient.replace(' ', '_') for ingredient in ingredients]

    return ingredients

recipes = raw_recipes.copy()
# recipes['ingredients'] = recipes['ingredients'].apply(preprocess_text)

recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
39769,"[light brown sugar, granulated sugar, butter, ..."
39770,"[KRAFT Zesty Italian Dressing, purple onion, b..."
39771,"[eggs, citrus fruit, raisins, sourdough starte..."
39772,"[boneless chicken skinless thigh, minced garli..."


## Train Embeddings

In [5]:
embedding_size = 100
window_size = 10
min_count = 1
workers = 4

model = Word2Vec(recipes['ingredients'], vector_size=embedding_size, window=window_size, min_count=min_count, workers=workers)

In [6]:
model.wv.most_similar('milk')

[('melted butter', 0.8466442823410034),
 ('mashed potatoes', 0.8283244371414185),
 ('shortening', 0.8156949877738953),
 ('evaporated milk', 0.8120179176330566),
 ('leftover gravy', 0.8001434803009033),
 ('self rising flour', 0.791695773601532),
 ('elbow macaroni', 0.7891352772712708),
 ('bread crumbs', 0.7812182307243347),
 ('popcorn', 0.7784833908081055),
 ('honey glazed ham', 0.777350902557373)]

Well, this is not working: the nearest neighbors of `milk` include `popcorn` and `honey glazed ham`, which are hardly substitutes.

One idea for improving it is to include the recipes from `dataset_1.json` as well. This increases the number of recipes and will hopefully improve the embeddings.

In [7]:
filepath = 'data/dataset_1.json'

additional_recipes = pd.read_json(filepath, orient='table')

additional_recipes

Unnamed: 0,ingredients
0,"[whole chicken, kosher salt, acorn squash, uns..."
1,"[egg white, new potato, kosher salt, pepper]"
2,"[evaporated milk, whole milk, garlic powder, o..."
3,"[round, loaf, olive oil, sausage, unsalted but..."
4,"[dark brown sugar, hot water, fresh lemon juic..."
...,...
13496,"[all-purpose flour, unsweetened cocoa powder, ..."
13497,"[lemon, squash, olive oil, onion, couscous, ac..."
13498,"[katsuo bushi, dried bonito flake, dashi, sake..."
13499,"[unsalted butter, baby spinach, phyllo]"


In [8]:
full_recipes = pd.concat([recipes, additional_recipes], ignore_index=True)

full_recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
53270,"[all-purpose flour, unsweetened cocoa powder, ..."
53271,"[lemon, squash, olive oil, onion, couscous, ac..."
53272,"[katsuo bushi, dried bonito flake, dashi, sake..."
53273,"[unsalted butter, baby spinach, phyllo]"


In [9]:
model_2 = Word2Vec(full_recipes['ingredients'], vector_size=embedding_size, window=window_size, min_count=min_count, workers=workers)

In [10]:
model_2.wv.most_similar('bread crumbs')

[('sausages', 0.8894049525260925),
 ('dried sage', 0.8628635406494141),
 ('louisiana hot sauce', 0.8434188961982727),
 ('pork sausages', 0.831591784954071),
 ('dried parsley', 0.820760190486908),
 ('Burgundy wine', 0.818422794342041),
 ('back bacon rashers', 0.8089631795883179),
 ('beef stock', 0.806433379650116),
 ('marjoram', 0.7972792387008667),
 ('Italian seasoned breadcrumbs', 0.7968472838401794)]

In [11]:
model_2.wv.most_similar('milk')

[('evaporated milk', 0.7722955346107483),
 ('shortening', 0.7642921209335327),
 ('melted butter', 0.7598229646682739),
 ('mashed potatoes', 0.7140193581581116),
 ('bread crumbs', 0.7137837409973145),
 ('pork sausages', 0.6934083104133606),
 ('cream of potato soup', 0.6927796602249146),
 ('self rising flour', 0.6874091625213623),
 ('vegetable shortening', 0.6867200136184692),
 ('luke warm water', 0.6821605563163757)]

This didn't really help: the neighbors of `milk` are largely the same as before. We probably need to look into a different approach, such as using a pre-trained model.
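One way to put a number on "didn't really help" is to compare the two neighbor lists directly. The sketch below copies the top-10 neighbors of `milk` from the two outputs above and computes their overlap. The helper name `jaccard` is introduced here for illustration only and is not part of the notebook.

```python
# Top-10 neighbors of 'milk', copied from the outputs of `model` and `model_2` above.
neighbors_small = [
    'melted butter', 'mashed potatoes', 'shortening', 'evaporated milk',
    'leftover gravy', 'self rising flour', 'elbow macaroni', 'bread crumbs',
    'popcorn', 'honey glazed ham',
]
neighbors_full = [
    'evaporated milk', 'shortening', 'melted butter', 'mashed potatoes',
    'bread crumbs', 'pork sausages', 'cream of potato soup',
    'self rising flour', 'vegetable shortening', 'luke warm water',
]


def jaccard(a, b):
    """Return |a & b| / |a | b| for two iterables of hashable items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Neighbors that appear in both top-10 lists.
shared = sorted(set(neighbors_small) & set(neighbors_full))
print(shared)
print(round(jaccard(neighbors_small, neighbors_full), 3))
```

Six of the ten neighbors are shared, so adding `dataset_1.json` mostly reshuffled the same co-occurring ingredients rather than surfacing better substitutes.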

## Approach 2: Pre-trained model
I will use the GloVe embeddings to find similar ingredients.

### Load GloVe
I will load the GloVe embeddings into a Gensim model.

In [None]:
!curl -O https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!pwd
!unzip glove.6B.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  822M  100  822M    0     0  5272k      0  0:02:39  0:02:39 --:--:-- 5108k
/Users/tuvyamacklin/Documents/Repos/Ingredient-Substitution-Capstone/models/distance_model
Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       

In [1]:
from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 100)
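As far as the file format goes, the conversion step above appears to amount to prepending a `<vocab_size> <dimensions>` header line to the GloVe text file, which matches the `(400000, 100)` it reports. The sketch below illustrates that on a tiny made-up file. `add_word2vec_header` and the toy file names are hypothetical and are not gensim functions.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<count> <dim>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    if not lines:
        raise ValueError("GloVe file is empty")
    # Each line holds a token followed by `dim` space-separated floats.
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        dst.writelines(lines)
    return len(lines), dim


# Tiny stand-in file with made-up numbers, so the sketch runs without the 822 MB download.
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy.word2vec.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("milk 0.1 0.2 0.3\ncream 0.2 0.1 0.4\n")

shape = add_word2vec_header(toy_glove, toy_w2v)
print(shape)
```

In gensim 4.x, `KeyedVectors.load_word2vec_format` also accepts `no_header=True`, which should let it read the GloVe file directly and may make the separate conversion unnecessary.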

In [2]:
from gensim.models import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

### Test it out

In [4]:
glove_model.most_similar('milk')

[('dairy', 0.7612762451171875),
 ('meat', 0.7481759786605835),
 ('sugar', 0.7345505952835083),
 ('yogurt', 0.6953763365745544),
 ('juice', 0.694653332233429),
 ('cream', 0.6850671172142029),
 ('egg', 0.6832371950149536),
 ('soda', 0.6767032742500305),
 ('foods', 0.6745815873146057),
 ('butter', 0.6701311469078064)]

In [23]:
glove_model.most_similar('green onions')

KeyError: "Key 'green onions' not present in vocabulary"

This works better, but it has a limitation: it cannot look up ingredients that consist of several words. For example, `green onions` is not in the GloVe vocabulary. We will have to find a way to handle this.

Another issue is that the suggestions contain words that are not ingredients. We will have to filter these out. My idea for how to do this is to collect a list of every ingredient in the dataset, then filter out any words that are not in this list. The filtering will be done by requesting a large number of similar words and then filtering out the ones that are not in the list.

### Filter out non-ingredients

In [25]:
def get_all_ingredients(recipes):
    '''
    This function takes a DataFrame of recipes and returns a list of all ingredients.

    Parameters
    ----------
    recipes : DataFrame
        A DataFrame of recipes.

    Returns
    -------
    list
        A list of all ingredients.
    '''

    all_ingredients = set()

    for ingredients in recipes['ingredients']:
        all_ingredients.update(ingredients)

    all_ingredients = list(all_ingredients)

    return all_ingredients

In [30]:
all_ingredients = get_all_ingredients(raw_recipes)

all_ingredients

['fresh thyme leaves',
 'myzithra',
 'sherry wine',
 'cod roe',
 'light pancake syrup',
 'dark soy',
 "tony chachere's seasoning",
 'tartlet shells',
 'Ragu Sauce',
 'chinese chili paste',
 'mandarin orange juice',
 'bone in skinless chicken thigh',
 'ground peanut',
 'scones',
 'dhaniya powder',
 'Conimex Wok Olie',
 'chicken gravy',
 'guajillo chile powder',
 'anise',
 'papalo',
 'oil',
 'black salt',
 'chili sauce',
 'low-fat chicken broth',
 'lamb rib roast',
 'freshly ground pepper',
 'sugarcane sticks',
 'rotini',
 'Johnsonville Mild Italian Sausage Links',
 'condiments',
 'salad leaves',
 'Italian turkey sausage',
 'chocolate candy bars',
 'vanilla ice cream',
 'brown ale',
 'fresh lemon juice',
 'tart cherries',
 'bacon drippings',
 'sweet biscuit crumbs',
 'ground Italian sausage',
 'dry fettuccine',
 'Heinz Ketchup',
 'Emmenthal',
 'marshmallows',
 'glace cherries',
 'roasted almond oil',
 'sweet mini bells',
 'wine vinegar',
 'blacan',
 'yellow miso',
 'instant potato flakes',
 ...]

In [None]:
def filter_ingredient(ingredient):
    '''
    This function takes an ingredient and returns whether it is in the list of all ingredients.

    Parameters
    ----------
    ingredient : str
        An ingredient.

    Returns
    -------
    bool
        Whether the ingredient is in the list of all ingredients.
    '''

    return ingredient in all_ingredients

def get_filtered_similar_ingredients(model, ingredient, filter=filter_ingredient, topn=10, words_to_search=1000):
    '''
    This function takes a set of word vectors and an ingredient, and returns the most similar words that pass the filter.

    Parameters
    ----------
    model : KeyedVectors
        Word vectors, e.g. the GloVe vectors loaded above.
    ingredient : str
        An ingredient.
    filter : function
        A filter function. The default is filter_ingredient.
    topn : int
        The number of similar words to request from the model before filtering.
    words_to_search : int
        The number of words to search in the model. Not used yet.

    Returns
    -------
    list
        The similar words that pass the filter (at most topn).
    '''

    similar_ingredients = []

    try:
        # Note: only topn candidates are requested, so the filtered list can be shorter than topn
        similar_ingredients = model.most_similar(ingredient, topn=topn)
    except KeyError:
        print(f'{ingredient} not in vocabulary')

    similar_ingredients = [similar_ingredient for similar_ingredient, _ in similar_ingredients]

    filtered_similar_ingredients = [similar_ingredient for similar_ingredient in similar_ingredients if filter(similar_ingredient)]

    return filtered_similar_ingredients

get_filtered_similar_ingredients(glove_model, 'milk')

['meat', 'sugar', 'juice', 'cream', 'soda', 'butter']
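The plan described earlier was to request many candidates and keep the first `topn` that pass the filter. A minimal sketch of that over-fetch-then-filter step is shown below. It works on a plain ranked list rather than a model, so `filtered_neighbors`, `ranked` and `known` are stand-ins introduced for illustration. The scores are rounded from the `milk` output above, and the `known` set is a hand-picked sample, not the real ingredient list.

```python
def filtered_neighbors(ranked, is_ingredient, topn=10):
    """Return the first `topn` words in `ranked` that pass `is_ingredient`."""
    kept = []
    for word, _score in ranked:
        if is_ingredient(word):
            kept.append(word)
            if len(kept) == topn:
                break
    return kept


# Stand-in for model.most_similar('milk', topn=words_to_search).
ranked = [
    ('dairy', 0.76), ('meat', 0.75), ('sugar', 0.73), ('yogurt', 0.70),
    ('juice', 0.69), ('cream', 0.69), ('egg', 0.68), ('soda', 0.68),
    ('foods', 0.67), ('butter', 0.67),
]

# A set gives constant-time membership checks; the notebook's list works too, just slower.
known = {'meat', 'sugar', 'juice', 'cream', 'soda', 'butter', 'eggs', 'yoghurt'}

top_three = filtered_neighbors(ranked, lambda word: word in known, topn=3)
print(top_three)
```

With the real vectors, `ranked` would presumably come from something like `glove_model.most_similar('milk', topn=1000)`, so that enough candidates survive the filter to fill `topn`.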

In [37]:
# Demo
print("Without the filter:")
print(get_filtered_similar_ingredients(glove_model, 'milk', filter=lambda x: True))
print("\nWith the filter:")
print(get_filtered_similar_ingredients(glove_model, 'milk'))

Without the filter:
['dairy', 'meat', 'sugar', 'yogurt', 'juice', 'cream', 'egg', 'soda', 'foods', 'butter']

With the filter:
['meat', 'sugar', 'juice', 'cream', 'soda', 'butter']


This is problematic. The filter should have kept "yogurt" and "egg". This is probably happening because the ingredient list only contains the plural "eggs". Maybe using a lemmatizer would help.

In [40]:
# find all ingredients starting with "yo"
yo_ingredients = [ingredient for ingredient in all_ingredients if ingredient.startswith('yo')]
yo_ingredients

['yolk',
 'yogurt low fat',
 'yoghurt natural low fat',
 'young nettle',
 'yogurt dressing',
 'young coconut meat',
 'yoghurt',
 'young leeks',
 'yogurt cheese',
 'yoplait']
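The output above suggests a second cause: the only bare form in the list is spelled `yoghurt`, so `yogurt` fails an exact match. One possible fix is to normalize both the ingredient list and the candidate word before comparing them. The sketch below uses a deliberately crude plural rule and a hand-written spelling map. Both are assumptions for illustration and will mangle words such as `hummus`, so a real lemmatizer would likely replace `normalize` in practice.

```python
# Hand-written and illustrative only; a real mapping would need many more entries.
SPELLING_VARIANTS = {"yoghurt": "yogurt"}


def normalize(word):
    """Lower-case a single word, unify known spellings, strip a simple plural."""
    word = word.lower()
    word = SPELLING_VARIANTS.get(word, word)
    if word.endswith("oes") and len(word) > 4:
        return word[:-2]          # tomatoes -> tomato
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]          # eggs -> egg
    return word


# Small sample standing in for `all_ingredients`.
known_raw = {"eggs", "yoghurt", "butter", "cream"}
known_normalized = {normalize(word) for word in known_raw}


def is_known(word):
    """Check membership after normalizing the candidate the same way."""
    return normalize(word) in known_normalized


kept = [word for word in ["dairy", "yogurt", "egg", "butter"] if is_known(word)]
print(kept)
```

Passing a function like `is_known` as the `filter` argument would let `egg` and `yogurt` through while still rejecting `dairy`.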

### Multi-word ingredients

To handle multi-word ingredients, averaging the embeddings of the individual words might work.

In [41]:
ingredient = "green onions"

# compute the average of the embeddings of the tokens in the ingredient
tokens = ingredient.split()
embeddings = [glove_model[token] for token in tokens]
average_embedding = np.mean(embeddings, axis=0)

# find the most similar words to the average embedding
similar_words = glove_model.similar_by_vector(average_embedding)
similar_words

[('onions', 0.8584727644920349),
 ('green', 0.8236134052276611),
 ('onion', 0.7655924558639526),
 ('peppers', 0.7291814088821411),
 ('olive', 0.7284407615661621),
 ('potatoes', 0.7228022813796997),
 ('garlic', 0.7122361660003662),
 ('carrots', 0.7114325165748596),
 ('pepper', 0.7009285688400269),
 ('brown', 0.6937432289123535)]

That looks promising, although the top two hits are just the input words themselves. Let's do a few more tests to see how well it works.

In [42]:
def get_similar_ingredients_multi_token(model, ingredient, topn=10):
    '''
    This function takes a set of word vectors and an ingredient with multiple tokens and returns the words most similar to the average of the token embeddings.

    Parameters
    ----------
    model : KeyedVectors
        Word vectors, e.g. the GloVe vectors loaded above.
    ingredient : str
        An ingredient. Every token must be in the vocabulary, otherwise a KeyError is raised.
    topn : int
        The number of similar words to return.

    Returns
    -------
    list
        A list of (word, similarity) tuples.
    '''

    # compute the average of the embeddings of the tokens in the ingredient
    tokens = ingredient.split()
    embeddings = [model[token] for token in tokens]
    average_embedding = np.mean(embeddings, axis=0)

    # find the most similar words to the average embedding
    similar_words = model.similar_by_vector(average_embedding, topn=topn)

    return similar_words

In [43]:
ingredients = [
    "bread crumbs",
    "green onions",
    "whole milk",
    "soy sauce",
    "black pepper",
    "garlic cloves",
    "olive oil"
]

similar_ingredients = {}

for ingredient in ingredients:
    similar_ingredients[ingredient] = get_similar_ingredients_multi_token(glove_model, ingredient)

pd.DataFrame(similar_ingredients)

Unnamed: 0,bread crumbs,green onions,whole milk,soy sauce,black pepper,garlic cloves,olive oil
0,"(bread, 0.9062777757644653)","(onions, 0.8584727644920349)","(milk, 0.873590886592865)","(sauce, 0.9212967753410339)","(pepper, 0.8429782390594482)","(cloves, 0.9522916674613953)","(oil, 0.8819882869720459)"
1,"(crumbs, 0.9026351571083069)","(green, 0.8236134052276611)","(whole, 0.8114242553710938)","(soy, 0.9028911590576172)","(black, 0.8321093320846558)","(garlic, 0.9475840330123901)","(olive, 0.8097949624061584)"
2,"(flour, 0.7854564189910889)","(onion, 0.7655924558639526)","(meat, 0.7294365763664246)","(vinegar, 0.7880284786224365)","(white, 0.7249667048454285)","(shallots, 0.8133406043052673)","(vegetable, 0.6754191517829895)"
3,"(butter, 0.7850861549377441)","(peppers, 0.7291814088821411)","(food, 0.7082301378250122)","(chili, 0.7876059412956238)","(brown, 0.7179739475250244)","(minced, 0.8043843507766724)","(sugar, 0.6473937034606934)"
4,"(cake, 0.7734878659248352)","(olive, 0.7284407615661621)","(sugar, 0.6913132667541504)","(tomato, 0.7620853781700134)","(red, 0.7128376364707947)","(clove, 0.8002690672874451)","(salt, 0.642488956451416)"
5,"(dough, 0.750171422958374)","(potatoes, 0.7228022813796997)","(egg, 0.689320981502533)","(mayonnaise, 0.7500492334365845)","(green, 0.7086984515190125)","(onion, 0.7846805453300476)","(garlic, 0.6273001432418823)"
6,"(cheese, 0.7412525415420532)","(garlic, 0.7122361660003662)","(chicken, 0.675452709197998)","(juice, 0.7486647963523865)","(blue, 0.6977642774581909)","(cumin, 0.7751067876815796)","(fresh, 0.6158108711242676)"
7,"(baking, 0.7250702381134033)","(carrots, 0.7114325165748596)","(taste, 0.6752119660377502)","(beans, 0.7398375868797302)","(orange, 0.695431649684906)","(chopped, 0.7692131400108337)","(water, 0.615207314491272)"
8,"(baked, 0.7221238613128662)","(pepper, 0.7009285688400269)","(bread, 0.6621983051300049)","(sauces, 0.7380563616752625)","(yellow, 0.6739792823791504)","(pepper, 0.7645142674446106)","(juice, 0.6122676134109497)"
9,"(loaf, 0.7158355712890625)","(brown, 0.6937432289123535)","(dairy, 0.661857545375824)","(tofu, 0.7380213737487793)","(olive, 0.664969265460968)","(onions, 0.7609161734580994)","(oils, 0.6121081113815308)"


## Approach 3: Pre-trained Word2Vec embeddings
My next approach is to try the Word2Vec embeddings trained on the Google News corpus. I may also look into fine-tuning the embeddings on our dataset.

### Load Word2Vec

In [47]:
from gensim import downloader

# Download Word2Vec model
google_model = downloader.load("word2vec-google-news-300")



In [48]:
# Find similar words
google_model.most_similar('milk')

[('dairy', 0.7323603630065918),
 ('cow_milk', 0.686015784740448),
 ('milk_powder', 0.6646486520767212),
 ('camels_Nancy_Riegler', 0.6561244130134583),
 ('powdered_milk', 0.6497933268547058),
 ('raw_milk', 0.6309322118759155),
 ('goat_milk', 0.6260649561882019),
 ('apple_juice', 0.6173228621482849),
 ('whey', 0.6159117817878723),
 ('chocolate_caramel_mousse', 0.6145175099372864)]

In [50]:
google_model.most_similar('green_onions')

[('scallions', 0.8018365502357483),
 ('spinach', 0.6734451055526733),
 ('Roma_tomatoes', 0.6642917990684509),
 ('tomatoes', 0.6593194007873535),
 ('fresh_spinach', 0.6574292182922363),
 ('bagged_spinach', 0.6486890316009521),
 ('lettuce', 0.6351801156997681),
 ('cilantro', 0.6311642527580261),
 ('jalapeno_peppers', 0.6137291193008423),
 ('fresh_bagged_spinach', 0.6132345795631409)]

In [51]:
google_model.most_similar('bread_crumbs')

[('breadcrumbs', 0.8321340084075928),
 ('grated_cheese', 0.6606205701828003),
 ('bread_cubes', 0.6587638854980469),
 ('cracker_crumbs', 0.6426653861999512),
 ('breadcrumb_mixture', 0.6280759572982788),
 ('melted_butter', 0.6277748346328735),
 ('panko', 0.6233553886413574),
 ('chopped_onion', 0.6210940480232239),
 ('chopped_parsley', 0.6194372773170471),
 ('chopped_nuts', 0.61697918176651)]

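Woah! This model is working much better! It handles multi-word ingredients (as underscore-joined phrases) and gives more plausible suggestions, such as `scallions` for `green_onions` and `breadcrumbs` for `bread_crumbs`, although some noise like `camels_Nancy_Riegler` remains.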
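The queries above use underscore-joined phrases, while the dataset stores ingredient names with spaces, so the names need to be mapped to vocabulary keys before lookup. A minimal sketch of such a mapping follows. `to_vocab_key` and `toy_vocab` are hypothetical names, and falling back to the last word is a design assumption that may or may not suit the substitution task.

```python
def to_vocab_key(ingredient, vocab):
    """Return the first candidate spelling of `ingredient` found in `vocab`, else None."""
    words = ingredient.strip().split()
    if not words:
        return None
    candidates = [
        "_".join(words),   # green onions -> green_onions
        "".join(words),    # bread crumbs -> breadcrumbs
        words[-1],         # whole milk -> milk (assumed fallback to the last word)
    ]
    for candidate in candidates:
        if candidate in vocab:
            return candidate
    return None


# Small stand-in for the model's vocabulary.
toy_vocab = {"green_onions", "breadcrumbs", "milk", "scallions"}

print(to_vocab_key("green onions", toy_vocab))
print(to_vocab_key("bread crumbs", toy_vocab))
print(to_vocab_key("whole milk", toy_vocab))
print(to_vocab_key("katsuo bushi", toy_vocab))
```

With the real model, `vocab` could presumably be `google_model.key_to_index`, since any container that supports `in` works here.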