# TasteBud: GAN Based Recipe Generation with Graph

Introductory words about this project...

----
## Data Processing

Data processing here has two main goals each with smaller milestones: tokenizing recipe data and creating the ingredients graph.
Tokenizing data requires parsing the RecipeNGL dataset, which will be subsetted due to its large size.
Creating the ingredients graph first requires a list of ingredients. A raw list will be obtained from the What's Cooking and RecipeNGL datasets. Then, the list will be filtered into a smaller list. The filtered list will be used for indexing ingredients, finding related/close ingredients, and creating the graph.

In [1]:
import csv
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torchtext
import pandas as pd
import matplotlib.pyplot as plt
import os.path
import json
import ast
import glob
import re

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
pre_processing = True
glove = torchtext.vocab.GloVe(name='6B', dim=50)

### What's Cooking Data

In [3]:
# loading the What's Cooking dataset from the .json file
wc_train_path = './data/whats_cooking/train.json'
wc_train_data = json.load(open(wc_train_path, 'r'))
print(wc_train_data[0])

{'id': 10259, 'cuisine': 'greek', 'ingredients': ['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']}


In [4]:
if pre_processing == True:
    # creating a list of unique ingredients from the datasets
    ingredients_set = set()

    for i in range(len(wc_train_data)):
        ingredients_set = ingredients_set | set(wc_train_data[i]['ingredients'])
    print(list(ingredients_set)[0:25])

['mustard sauce', 'yams', 'blanco tequila', 'back bacon rashers', 'gluten-free flour', 'whole wheat rotini pasta', 'barbecue rub', 'chili con carne', 'hazelnut flour', 'celery', 'anchovy fillets', 'dry milk powder', 'hard salami', 'cut up chicken', 'wish bone guacamol ranch dress', 'hot pork sausage', 'San Marzano tomatoes', 'low-fat balsamic vinaigrette', "Quorn Chik''n Tenders", 'curry leaves', 'sloe gin', 'saffron powder', 'passata', 'red curry paste', 'low sodium store bought chicken stock']


### Data Subsetting
Since RecipeNGL contains 2.23 million recipes, select a subset to use:

In [5]:
full_ngl_path = './data/recipe_ngl/full_dataset.csv'
# if the full recipeNGL dataset csv file exists, read a subset of it
if os.path.exists(full_ngl_path):
    ngl_subset = [0, 5000]
    ngl_df = pd.read_csv(full_ngl_path, skiprows=ngl_subset[0], nrows=ngl_subset[1], index_col=0)
    ngl_df.to_csv(f'./data/recipe_ngl/dataset_{ngl_subset[0]}_{ngl_subset[1]}.csv')

for file in glob.glob('./data/recipe_ngl/dataset*.csv'):
    print(file)
    ngl_df = pd.read_csv(file, index_col=0, 
                         converters={'ingredients':pd.eval, 'directions':pd.eval, 'NER':pd.eval})

./data/recipe_ngl\dataset_0_100.csv
./data/recipe_ngl\dataset_0_5000.csv


In [6]:
ngl_df[:3]

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[1 c. firmly packed brown sugar, 1/2 c. evapor...","[In a heavy 2-quart saucepan, mix brown sugar,...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[brown sugar, milk, vanilla, nuts, butter, bit..."
1,Jewell Ball'S Chicken,"[1 small jar chipped beef, cut up, 4 boned chi...","[Place chipped beef on bottom of baking dish.,...",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[beef, chicken breasts, cream of mushroom soup..."
2,Creamy Corn,"[2 (16 oz.) pkg. frozen corn, 1 (8 oz.) pkg. c...","[In a slow cooker, combine all ingredients. Co...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[frozen corn, cream cheese, butter, garlic pow..."


### Create List of Ingredients

In [7]:
if pre_processing == True:
    for i in range(len(ngl_df["NER"])):
        ingredients_set = ingredients_set | set(ngl_df["NER"][i])
    print(list(ingredients_set)[0:50])

['mustard sauce', 'back bacon rashers', 'gluten-free flour', 'whole wheat rotini pasta', 'red raspberry jello', 'hazelnut flour', 'celery', 'anchovy fillets', 'San Marzano tomatoes', 'wish bone guacamol ranch dress', 'low-fat balsamic vinaigrette', "Quorn Chik''n Tenders", 'sloe gin', 'saffron powder', 'passata', 'red curry paste', 'low sodium store bought chicken stock', 'fresno pepper', 'leaf parsley', "Campbell's cream", 'brown basmati rice', 'gumbo file powder', 'amaretti', 'reduced fat coconut milk', 'condensed reduced fat reduced sodium cream of chicken soup', 'frozen mustard greens', 'pineapple pie filling', 'king oyster mushroom', 'linguine pasta', 'Tabasco sauce', 'flour tortillas (not low fat)', 'asian noodles', 'low-fat parmesan cheese', 'sausage casings', 'new york strip steaks', 'beef rib roast', 'apple juice', 'cherry jello', 'White Pepper', 'caramel icing', 'Tia Maria', 'meat sauce', 'baby okra', 'green garlic', 'onion buns', 'white tuna', 'dry vermouth', 'small potatoes

In [8]:
if pre_processing == True:
    cw = csv.writer(open("data/raw_ingredients_list.csv",'w'))
    cw.writerow(list(ingredients_set))
    pd.read_csv('data/raw_ingredients_list.csv', header=None).T.to_csv('data/raw_ingredients_list_transpose.csv', header=False, index=False)

From a list of 8500 ingredients, we manually removed duplicated items that had spelling errors, semantic similarities, and extra numeric or qualitative descriptors (e.g. chopped, 2% fat, shredded, unsweetened), giving 955 ingredients. Then, each is paired its gloVe embedding.

In [9]:
# param: token_list - a list of tokens, no spaces or symbols
# return: a tensor of the averaged GloVe embeddings of each token
def glove_average(token_list):
    embeds_list = []
    for token in token_list:
        embeds_list.append(glove[token])
    embeds_average = torch.mean(torch.stack(embeds_list), dim=0)
    return embeds_average

In [10]:
if pre_processing == True:
    filtered_ingredients_df = pd.read_csv('data/filtered_ingredients_list_transpose.csv', header=None, names=["ingredient"])
    ingredient_embeddings = []

    for i, row in filtered_ingredients_df.iterrows():
        token_list = re.sub(r"[^a-zA-Z ]+", '', filtered_ingredients_df['ingredient'][i].lower()).split(' ')
        embed_list = []
        ingredient_embeddings.append(glove_average(token_list).tolist())
        
    filtered_ingredients_df['embedding'] = ingredient_embeddings
    filtered_ingredients_df.to_csv('data/glove_ingredients_list.csv', header=False, index=False)

In [24]:
ingredients_df = pd.read_csv('data/glove_ingredients_list.csv', header=None, names=["ingredient", "embedding"],
                         converters={'embedding':pd.eval})
ingredients_df['ingredient'] = ingredients_df['ingredient'].apply(lambda x: x.rstrip())

In [25]:
ingredients_df[:3]

Unnamed: 0,ingredient,embedding
0,abalone,"[0.34318000078201294, -0.8134999871253967, -0...."
1,absinthe,"[-0.06579200178384781, -0.009294699877500534, ..."
2,acai,"[0.2497600018978119, 0.08926700055599213, -0.4..."


Since GloVe give embeddings for out-of-vocabulary words, the words with null tensors `[0,0,0,...,0]` were removed from the ingredients list, giving a total of 840 ingredients. Using these embeddings will allow for ingredients not in the list to be mapped to the closest ingredient, then put in the ingredient graph.

In [None]:
# dictionary pairing ingredient with index value
ingredient_index_dict = {}
for i in range(ingredients_df.shape[0]):
    if ingredients_df['ingredient'][i] not in ingredient_index_dict:
        ingredient_index_dict[ ingredients_df['ingredient'][i] ] = i

### Recipe Data Cleaning

In [26]:
# filter for only "Gathered" sources, since those have more consistent format
gathered_ngl_df = ngl_df[ngl_df.source == 'Gathered']
# remove unnecessary columns
filtered_ngl_df = gathered_ngl_df[['title', 'ingredients', 'directions', 'NER']]

In [27]:
filtered_ngl_df[:7]

Unnamed: 0,title,ingredients,directions,NER
0,No-Bake Nut Cookies,"[1 c. firmly packed brown sugar, 1/2 c. evapor...","[In a heavy 2-quart saucepan, mix brown sugar,...","[brown sugar, milk, vanilla, nuts, butter, bit..."
1,Jewell Ball'S Chicken,"[1 small jar chipped beef, cut up, 4 boned chi...","[Place chipped beef on bottom of baking dish.,...","[beef, chicken breasts, cream of mushroom soup..."
2,Creamy Corn,"[2 (16 oz.) pkg. frozen corn, 1 (8 oz.) pkg. c...","[In a slow cooker, combine all ingredients. Co...","[frozen corn, cream cheese, butter, garlic pow..."
3,Chicken Funny,"[1 large whole chicken, 2 (10 1/2 oz.) cans ch...","[Boil and debone chicken., Put bite size piece...","[chicken, chicken gravy, cream of mushroom sou..."
4,Reeses Cups(Candy),"[1 c. peanut butter, 3/4 c. graham cracker cru...",[Combine first four ingredients and press in 1...,"[peanut butter, graham cracker crumbs, butter,..."
5,Cheeseburger Potato Soup,"[6 baking potatoes, 1 lb. of extra lean ground...",[Wash potatoes; prick several times with a for...,"[baking potatoes, extra lean ground beef, butt..."
6,Rhubarb Coffee Cake,"[1 1/2 c. sugar, 1/2 c. butter, 1 egg, 1 c. bu...","[Cream sugar and butter., Add egg and beat wel...","[sugar, butter, egg, buttermilk, flour, salt, ..."


### Recipe Data Tokenizing

In [15]:
tokenized_titles = []
tokenized_ingredients = []
tokenized_directions = []
tokenized_NER = []

for i, row in filtered_ngl_df.iterrows():
    # tokenize titles
    tokens_list = filtered_ngl_df.title.values[i].split(' ')
    tokenized_titles.append(tokens_list)
    
    # tokenize ingredients
    tokens_list = []
    for ingredient_item in row.ingredients:
        tokens_list.append(ingredient_item.split(' '))
    tokenized_ingredients.append(tokens_list)
    
    # tokenize directions
    tokens_list = []
    for direction_item in row.directions:
        tokens_list.append(direction_item.split(' '))
    tokenized_directions.append(tokens_list)
    
    # tokenize ingredients
    tokens_list = []
    for NER_item in row.NER:
        tokens_list.append( re.sub(r"[^a-zA-Z ]+", '', NER_item.lower()).split(' ') )
    tokenized_NER.append(tokens_list)

print(tokenized_titles[0])
print(tokenized_ingredients[0])
print(tokenized_directions[0])
print(tokenized_NER[0])

['No-Bake', 'Nut', 'Cookies']
[['1', 'c.', 'firmly', 'packed', 'brown', 'sugar'], ['1/2', 'c.', 'evaporated', 'milk'], ['1/2', 'tsp.', 'vanilla'], ['1/2', 'c.', 'broken', 'nuts', '(pecans)'], ['2', 'Tbsp.', 'butter', 'or', 'margarine'], ['3', '1/2', 'c.', 'bite', 'size', 'shredded', 'rice', 'biscuits']]
[['In', 'a', 'heavy', '2-quart', 'saucepan,', 'mix', 'brown', 'sugar,', 'nuts,', 'evaporated', 'milk', 'and', 'butter', 'or', 'margarine.'], ['Stir', 'over', 'medium', 'heat', 'until', 'mixture', 'bubbles', 'all', 'over', 'top.'], ['Boil', 'and', 'stir', '5', 'minutes', 'more.', 'Take', 'off', 'heat.'], ['Stir', 'in', 'vanilla', 'and', 'cereal;', 'mix', 'well.'], ['Using', '2', 'teaspoons,', 'drop', 'and', 'shape', 'into', '30', 'clusters', 'on', 'wax', 'paper.'], ['Let', 'stand', 'until', 'firm,', 'about', '30', 'minutes.']]
[['brown', 'sugar'], ['milk'], ['vanilla'], ['nuts'], ['butter'], ['bite', 'size', 'shredded', 'rice', 'biscuits']]


In [16]:
tokenized_ngl_df = filtered_ngl_df.copy()
tokenized_ngl_df['token_title'] = tokenized_titles
tokenized_ngl_df['token_ingredients'] = tokenized_ingredients
tokenized_ngl_df['token_directions'] = tokenized_directions
tokenized_ngl_df['token_NER'] = tokenized_NER

In [17]:
tokenized_ngl_df[:7]

Unnamed: 0,title,ingredients,directions,NER,token_title,token_ingredients,token_directions,token_NER
0,No-Bake Nut Cookies,"[1 c. firmly packed brown sugar, 1/2 c. evapor...","[In a heavy 2-quart saucepan, mix brown sugar,...","[brown sugar, milk, vanilla, nuts, butter, bit...","[No-Bake, Nut, Cookies]","[[1, c., firmly, packed, brown, sugar], [1/2, ...","[[In, a, heavy, 2-quart, saucepan,, mix, brown...","[[brown, sugar], [milk], [vanilla], [nuts], [b..."
1,Jewell Ball'S Chicken,"[1 small jar chipped beef, cut up, 4 boned chi...","[Place chipped beef on bottom of baking dish.,...","[beef, chicken breasts, cream of mushroom soup...","[Jewell, Ball'S, Chicken]","[[1, small, jar, chipped, beef,, cut, up], [4,...","[[Place, chipped, beef, on, bottom, of, baking...","[[beef], [chicken, breasts], [cream, of, mushr..."
2,Creamy Corn,"[2 (16 oz.) pkg. frozen corn, 1 (8 oz.) pkg. c...","[In a slow cooker, combine all ingredients. Co...","[frozen corn, cream cheese, butter, garlic pow...","[Creamy, Corn]","[[2, (16, oz.), pkg., frozen, corn], [1, (8, o...","[[In, a, slow, cooker,, combine, all, ingredie...","[[frozen, corn], [cream, cheese], [butter], [g..."
3,Chicken Funny,"[1 large whole chicken, 2 (10 1/2 oz.) cans ch...","[Boil and debone chicken., Put bite size piece...","[chicken, chicken gravy, cream of mushroom sou...","[Chicken, Funny]","[[1, large, whole, chicken], [2, (10, 1/2, oz....","[[Boil, and, debone, chicken.], [Put, bite, si...","[[chicken], [chicken, gravy], [cream, of, mush..."
4,Reeses Cups(Candy),"[1 c. peanut butter, 3/4 c. graham cracker cru...",[Combine first four ingredients and press in 1...,"[peanut butter, graham cracker crumbs, butter,...","[Reeses, Cups(Candy), , ]","[[1, c., peanut, butter], [3/4, c., graham, cr...","[[Combine, first, four, ingredients, and, pres...","[[peanut, butter], [graham, cracker, crumbs], ..."
5,Cheeseburger Potato Soup,"[6 baking potatoes, 1 lb. of extra lean ground...",[Wash potatoes; prick several times with a for...,"[baking potatoes, extra lean ground beef, butt...","[Cheeseburger, Potato, Soup]","[[6, baking, potatoes], [1, lb., of, extra, le...","[[Wash, potatoes;, prick, several, times, with...","[[baking, potatoes], [extra, lean, ground, bee..."
6,Rhubarb Coffee Cake,"[1 1/2 c. sugar, 1/2 c. butter, 1 egg, 1 c. bu...","[Cream sugar and butter., Add egg and beat wel...","[sugar, butter, egg, buttermilk, flour, salt, ...","[Rhubarb, Coffee, Cake]","[[1, 1/2, c., sugar], [1/2, c., butter], [1, e...","[[Cream, sugar, and, butter.], [Add, egg, and,...","[[sugar], [butter], [egg], [buttermilk], [flou..."


### Closest Ingredients

In [30]:
# param: ingredient - list of strings of an ingredient (tokenized)
#        ingredient_df - dataframe containing ingredient vocabulary and
#                        their corresponding GloVe embedding
# return: string of closest ingredient in vocabulary
#         if none (e.g. ingredient has is OOV in GloVe), returns empty string
def get_closest_ingredient (ingredient, ingredients_df):
    # check if ingredient is in ingredients_df
    if len(ingredient) == 1 and ingredient[0] in ingredients_df['ingredient'].values:
        return ingredient[0]
    
    closest_ingredient = ''
    smallest_distance = float('inf')
    
    # compute the GloVe embedding of the ingredient
    ingredient_embedding = glove_average(ingredient)
    
    if torch.count_nonzero(ingredient_embedding) == 0:
        return ''
    
    # compute distances between embeddings, choose the smallest distance
    for _, row in ingredients_df.iterrows():
        difference = ingredient_embedding - torch.FloatTensor(row['embedding'])
        distance = torch.sum(torch.square(difference))
        if distance < smallest_distance:
            smallest_distance = distance
            closest_ingredient = row['ingredient']
    
    return closest_ingredient

In [33]:
if pre_processing == True:
    # changes the ingredients list into their corresponding closest ingredients
    # in the vocabulary using the GloVe embeddings
    NER_closest_ingredients = []

    for i, row in tokenized_ngl_df.iterrows():
        NER_closest_list = []
        for ingredient_tokens in row.token_NER:
            NER_closest_list.append( get_closest_ingredient(ingredient_tokens, ingredients_df) )
        NER_closest_ingredients.append(NER_closest_list)

    print(NER_closest_ingredients[0])

['brown sugar', 'milk', 'vanilla', 'soy nut', 'butter', 'pasta']


In [41]:
if pre_processing == True:
    processed_ngl_df = tokenized_ngl_df.copy()
    processed_ngl_df['closest_ingredients'] = NER_closest_ingredients
    processed_ngl_df.to_csv('data/processed_ngl.csv', header=True, index=True)

In [53]:
processed_ngl_df = pd.read_csv('data/processed_ngl.csv', index_col=0,
                               converters={'ingredients':pd.eval, 'directions':pd.eval, 'NER':pd.eval, 
                                           'token_title':pd.eval, 'token_ingredients':pd.eval, 
                                           'token_directions':pd.eval,' token_NER':pd.eval, 'closest_ingredients':pd.eval})

In [55]:
processed_ngl_df[["token_NER", "closest_ingredients"]][:7]

Unnamed: 0,token_NER,closest_ingredients
0,"[['brown', 'sugar'], ['milk'], ['vanilla'], ['...","[brown sugar, milk, vanilla, soy nut, butter, ..."
1,"[['beef'], ['chicken', 'breasts'], ['cream', '...","[beef, chicken, passion fruit, sour cream]"
2,"[['frozen', 'corn'], ['cream', 'cheese'], ['bu...","[corn, cheese, butter, garlic, salt, pepper]"
3,"[['chicken'], ['chicken', 'gravy'], ['cream', ...","[chicken, chicken, passion fruit, cheese]"
4,"[['peanut', 'butter'], ['graham', 'cracker', '...","[peanut butter, graham cracker, butter, sugar,..."
5,"[['baking', 'potatoes'], ['extra', 'lean', 'gr...","[baking mix, crescent roll, butter, milk, salt..."
6,"[['sugar'], ['butter'], ['egg'], ['buttermilk'...","[sugar, butter, egg, buttermilk, flour, salt, ..."


In [None]:
# outputs the ingredient to closest ingredient pairings for What's Cooking data
if pre_processing == True:
    wc_ingredients, wc_closest_ingredients = [], []
    
    for i in range(len(wc_train_data)):
        wc_ingredients.append(wc_train_data[i]['ingredients'])
        
        item_closest_ingredients = []
        for ingredient in wc_train_data[i]['ingredients']:
            token_list = re.sub(r"[^a-zA-Z ]+", '', ingredient.lower()).split(' ')
            closest_ingredient = get_closest_ingredient(token_list, ingredients_df)
            item_closest_ingredients.append(closest_ingredient)
        wc_closest_ingredients.append(item_closest_ingredients)

    print(wc_closest_ingredients[0])
    
    wc_ingredients_df = pd.DataFrame({'ingredients': wc_ingredients, 'closest_ingredients': wc_closest_ingredients})
    wc_ingredients_df.to_csv('data/whats_cooking/closest_ingredients.csv', header=True, index=True)

### Ingredient Frequency

In [84]:
if pre_processing == True:
    # get frequency based on appearences in RecipeNGL recipes
    ingredient_frequency = torch.zeros(len(ingredients_df)).tolist()

    for i in range(processed_ngl_df.shape[0]):
        for closest_ingredient in processed_ngl_df["closest_ingredients"][i]:
            index = ingredient_index_dict.get(closest_ingredient)
            if (index != None):
                ingredient_frequency[index] += 1
    

In [85]:
if pre_processing == True:
    # get frequency based on appearences in What's Cooking recipes
    wc_ingredients_df = pd.read_csv('data/whats_cooking/closest_ingredients.csv', index_col=0,
                                    converters={'ingredients':pd.eval, 'closest_ingredients':pd.eval})
    
    for i in range(wc_ingredients_df.shape[0]):
        for closest_ingredient in wc_ingredients_df["closest_ingredients"][i]:
            index = ingredient_index_dict.get(closest_ingredient)
            if (index != None):
                ingredient_frequency[index] += 1
    """
    wc_limit = 5000
    for i in range(len(wc_train_data)):
        for ingredient in wc_train_data[i]['ingredients']:
            token_list = re.sub(r"[^a-zA-Z ]+", '', ingredient.lower()).split(' ')
            closest_ingredient = get_closest_ingredient(token_list, ingredients_df)
            index = ingredient_index_dict.get(closest_ingredient)
            if (index != None):
                ingredient_frequency[index] += 1
                
        if i > wc_limit:
            break
    """

In [86]:
if pre_processing == True:
    ingredient_frequency = ( torch.FloatTensor(ingredient_frequency) / max(ingredient_frequency) ).tolist()
    ingredients_frequency_df = (ingredients_df.copy()).drop('embedding', axis=1)
    ingredients_frequency_df['frequency'] = ingredient_frequency
    ingredients_frequency_df.to_csv('data/ingredients_frequency.csv', header=False, index=False)

In [88]:
ingredients_frequency_df = pd.read_csv('data/ingredients_frequency.csv', header=None, names=["ingredient", "frequency"])

In [89]:
ingredients_frequency_df[:10]

Unnamed: 0,ingredient,frequency
0,abalone,0.000234
1,absinthe,0.0
2,acai,0.0
3,acorn,0.000234
4,adobo,0.0
5,agar,0.0
6,aioli,0.00117
7,albacore,0.0
8,alcohol,0.000468
9,ale,0.007959


Using the ingredient frequency, we further filter the ingredients with less than 0.1% frequency. This yields 381 ingredients.

In [92]:
if pre_processing == True:
    filtered_ingredients_frequency_df = ingredients_frequency_df[ingredients_frequency_df['frequency'] > 0.001]
    print(filtered_ingredients_frequency_df.shape[0])
    ingredients_frequency_df.drop('frequency', axis=1).to_csv('data/frequency_filtered_ingredients.csv', header=False, index=False)

381


### Ingredient Graph

To create the graph, each time an ingredient appears in a recipe with another ingredient, their compatibility is increased. This is implemented with an adjacency matrix. The compatibilities will be normalized by each row of the matrix.

In [None]:
if pre_processing == True:
    ingredient_graph = torch.zeros(len(ingredients_df), len(ingredients_df)).tolist()

    for i in range(processed_ngl_df.shape[0]):
        for ingredient_1 in processed_ngl_df["closest_ingredients"][i]:
            index_1 = ingredient_index_dict[ingredient_1]
            for ingredient_2 in processed_ngl_df["closest_ingredients"][i]:
                index_2 = ingredient_index_dict[ingredient_2]
                
                ingredient_graph[index_1][index_2] += 1
    
    ingredient_graph = ( torch.FloatTensor(ingredient_graph) / max(ingredient_graph) ).tolist()

----
## Primary Model

### Ingredient Selector

### GAN RNN

### Training

----
## Baseline Model

----
## Results and Comparison