# Ingredient Embeddings
BS"D

In this notebook, we will create embeddings for the ingredients in the dataset. As a first attempt, we will use `Word2Vec` from the gensim library.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from gensim.models import Word2Vec

## Load Data
We have two datasets, `dataset_1.json` and `dataset_2.json`. We will initially use only `dataset_2.json` for the embeddings, since it seems to be the more precise of the two.

In [39]:
filepath = 'data/dataset_2.json'

raw_recipes = pd.read_json(filepath, orient='table')

raw_recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
39769,"[light brown sugar, granulated sugar, butter, ..."
39770,"[KRAFT Zesty Italian Dressing, purple onion, b..."
39771,"[eggs, citrus fruit, raisins, sourdough starte..."
39772,"[boneless chicken skinless thigh, minced garli..."


## Prepare Data
Each recipe has to be turned into a list of tokens, with one token per ingredient.

Ingredients that consist of multiple words therefore have to be joined into a single token first. For example, `green onions` should become `green_onions`.

In [40]:
def preprocess_text(ingredients):
    '''
    Prepare the ingredients of one recipe for tokenization.

    Ingredients that have multiple words are joined with underscores,
    so that each ingredient becomes a single token.

    Parameters
    ----------
    ingredients : list of str
        A list of ingredients.

    Returns
    -------
    list of str
        The ingredients, with spaces replaced by underscores.
    '''

    # Join the words of multi-word ingredients with underscores
    ingredients = [ingredient.replace(' ', '_') for ingredient in ingredients]

    return ingredients

recipes = raw_recipes.copy()
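# Left disabled: each ingredient is already one list element,
# and Word2Vec treats every list element as a single token.
# recipes['ingredients'] = recipes['ingredients'].apply(preprocess_text)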

recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
39769,"[light brown sugar, granulated sugar, butter, ..."
39770,"[KRAFT Zesty Italian Dressing, purple onion, b..."
39771,"[eggs, citrus fruit, raisins, sourdough starte..."
39772,"[boneless chicken skinless thigh, minced garli..."
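
Since the `apply` call above is commented out, the table still shows the raw ingredient names. As a quick check of what the underscore step would produce, here is a minimal sketch that runs the same replacement on a made-up recipe. The function is restated in short form so the snippet runs on its own, and the `sample` list is hypothetical, not taken from either dataset file.

```python
def preprocess_text(ingredients):
    """Join the words of each ingredient with underscores and return a list."""
    return [ingredient.replace(' ', '_') for ingredient in ingredients]


# Hypothetical recipe used only for illustration (not from the dataset).
sample = ['green onions', 'salt', 'extra virgin olive oil']

tokens = preprocess_text(sample)
print(tokens)
```

Single-word ingredients such as `salt` pass through unchanged, so the step is safe to apply to every recipe.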


## Train Embeddings

In [44]:
embedding_size = 100
window_size = 10
min_count = 1
workers = 4

model = Word2Vec(recipes['ingredients'], vector_size=embedding_size, window=window_size, min_count=min_count, workers=workers)
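
With `min_count = 1`, gensim keeps every ingredient in the vocabulary, including those that occur in only one recipe. To judge whether a higher threshold might be worth trying, one option is to count how often each ingredient appears. The sketch below does this with `collections.Counter` on a tiny made-up list; `sample_recipes`, `ingredient_counts`, and `rare_ingredients` are hypothetical names, and in the notebook the input would presumably be `recipes['ingredients']`.

```python
from collections import Counter

# Hypothetical stand-in for recipes['ingredients'] (not real data).
sample_recipes = [
    ['milk', 'eggs', 'plain flour'],
    ['milk', 'butter', 'salt'],
    ['eggs', 'salt', 'katsuo bushi'],
]


def ingredient_counts(recipes):
    """Count in how many positions each ingredient occurs across all recipes."""
    counts = Counter()
    for ingredients in recipes:
        counts.update(ingredients)
    return counts


def rare_ingredients(counts, threshold):
    """Return the ingredients that occur fewer than `threshold` times, sorted by name."""
    return sorted(name for name, n in counts.items() if n < threshold)


counts = ingredient_counts(sample_recipes)
rare = rare_ingredients(counts, threshold=2)
print(rare)
```

Ingredients listed in `rare` are the ones that a `min_count` of 2 would drop from the vocabulary.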

In [46]:
model.wv.most_similar('milk')

[('melted butter', 0.8506343364715576),
 ('mashed potatoes', 0.8329598903656006),
 ('cream', 0.8165110945701599),
 ('shortening', 0.8154274225234985),
 ('evaporated milk', 0.812747597694397),
 ('bread crumbs', 0.7776293754577637),
 ('self rising flour', 0.7747952938079834),
 ('seasoned flour', 0.768001914024353),
 ('biscuits', 0.7614127397537231),
 ('poultry seasoning', 0.7577057480812073)]
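
The scores returned by `most_similar` are cosine similarities between the query vector and each neighbor's vector. As a reminder of what that number means, here is a minimal pure-Python sketch of cosine similarity. The three vectors are toy values chosen for illustration, not actual embeddings from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not taken from the trained model).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, different length
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
print(same_direction, orthogonal)
```

Only the direction of the vectors matters, not their length, so a score near 1 means the two ingredients point the same way in the embedding space.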

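Well, this is not working: several of the nearest neighbors of `milk`, such as `mashed potatoes`, `bread crumbs`, and `poultry seasoning`, are not similar ingredients at all.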

One idea to improve this is to include the recipes from `dataset_1.json` as well. This increases the number of recipes and will hopefully improve the embeddings.

In [49]:
filepath = 'data/dataset_1.json'

additional_recipes = pd.read_json(filepath, orient='table')

additional_recipes

Unnamed: 0,ingredients
0,"[whole chicken, kosher salt, acorn squash, uns..."
1,"[egg white, new potato, kosher salt, pepper]"
2,"[evaporated milk, whole milk, garlic powder, o..."
3,"[round, loaf, olive oil, sausage, unsalted but..."
4,"[dark brown sugar, hot water, fresh lemon juic..."
...,...
13496,"[all-purpose flour, unsweetened cocoa powder, ..."
13497,"[lemon, squash, olive oil, onion, couscous, ac..."
13498,"[katsuo bushi, dried bonito flake, dashi, sake..."
13499,"[unsalted butter, baby spinach, phyllo]"


In [48]:
full_recipes = pd.concat([recipes, additional_recipes], ignore_index=True)

full_recipes

Unnamed: 0,ingredients
0,"[romaine lettuce, black olives, grape tomatoes..."
1,"[plain flour, ground pepper, salt, tomatoes, g..."
2,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,"[water, vegetable oil, wheat, salt]"
4,"[black pepper, shallots, cornflour, cayenne pe..."
...,...
53270,"[all-purpose flour, unsweetened cocoa powder, ..."
53271,"[lemon, squash, olive oil, onion, couscous, ac..."
53272,"[katsuo bushi, dried bonito flake, dashi, sake..."
53273,"[unsalted butter, baby spinach, phyllo]"


In [50]:
model_2 = Word2Vec(full_recipes['ingredients'], vector_size=embedding_size, window=window_size, min_count=min_count, workers=workers)

In [54]:
model_2.wv.most_similar('bread crumbs')

[('sausages', 0.8712576031684875),
 ('lasagne', 0.8651108741760254),
 ('dried parsley', 0.8599311113357544),
 ('Italian seasoned breadcrumbs', 0.8477048873901367),
 ('lasagna noodle', 0.8269242644309998),
 ('pork sausages', 0.823008120059967),
 ('garlic bread', 0.8159353137016296),
 ('ricotta cheese', 0.8115701079368591),
 ('dried sage', 0.8112384676933289),
 ('parmesan cheese', 0.8015791177749634)]