# GrocerySub - Prototype Notebook
## Chris Kimber
### Started June 1 2020

### Introduction

This notebook holds exploratory code for acquiring recipe data and modelling possible grocery substitutions from it.

### Loading recipe information

The first recipe dataset is the cleaned simplified-1M+ dataset from https://github.com/schmidtdominik/RecipeNet. Its recipe ingredient lists have been cleaned through a couple of simplification approaches and pruned to the 3500 most common ingredients using curated ingredient lists.

The data comes as a NumPy .npz archive of arrays. The format is explained here: https://dominikschmidt.xyz/simplified-recipes-1M/. Initially we will load the archive and explore the available information/structure.

In [4]:
import numpy as np

In [5]:
with np.load('/Users/chrki23/Documents/Insight_Project/data/raw/simplified-recipes-1M.npz', allow_pickle = True) as data:
    recipes = data['recipes']
    ingredients = data['ingredients']

In [6]:
ingredients

array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
       'emerils essence', 'corn flakes cereal'], dtype='<U39')

In [7]:
recipes[0]

array([ 233, 2754,   42,  120,  560,  345,  150, 2081,   12,   21])

In [8]:
ingredients[recipes[0]]

array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
       'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'],
      dtype='<U39')

In [23]:
ingredients[recipes[1067556]]

array(['basil', 'bay leaves', 'boneless', 'butter', 'canned chicken',
       'canned chicken broth', 'chicken', 'chicken broth', 'cloves',
       'dry white wine', 'freshly ground pepper', 'garlic', 'ground',
       'ground pepper', 'leaves', 'olive', 'olive oil', 'parsley',
       'pepper', 'rosemary', 'thyme', 'turkey', 'white', 'white wine',
       'wine'], dtype='<U39')

In [10]:
print(len(ingredients))
print(len(recipes))

3500
1067557


A quick exploration of the dataset, following the notes from the source website, shows that the .npz archive contains two arrays. The 'ingredients' array contains the 3500 most common ingredients with cleaned names, while the 'recipes' array contains the list of cleaned ingredients for each recipe; only ingredients present in 'ingredients' are included. Each recipe refers to its ingredients by their numerical index in 'ingredients', so indexing 'ingredients' with a recipe's index array returns the ingredient names.
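
As a rough illustration of that structure, the sketch below builds a tiny stand-in archive in memory and reads it back. The ingredient names and index values are made up, and only the two key names ('recipes' and 'ingredients') come from the real file.

```python
import io
import numpy as np

# Toy stand-in for the real archive (values are made up for illustration).
ingredients = np.array(['salt', 'pepper', 'butter', 'garlic'])
recipes = np.empty(2, dtype=object)  # one index array per recipe
recipes[0] = np.array([0, 2])
recipes[1] = np.array([1, 2, 3])

# Write the two arrays to an in-memory .npz archive.
buffer = io.BytesIO()
np.savez(buffer, recipes=recipes, ingredients=ingredients)
buffer.seek(0)

# Object arrays are pickled inside the archive, hence allow_pickle=True.
with np.load(buffer, allow_pickle=True) as data:
    keys = sorted(data.files)
    toy_recipes = data['recipes']
    toy_ingredients = data['ingredients']

# Indexing the names with a recipe's index array decodes that recipe.
decoded = toy_ingredients[toy_recipes[1]].tolist()
print(keys)
print(decoded)
```

Because the recipes have different lengths, they are stored as an object array of index arrays, which is presumably why the real file also needs `allow_pickle = True` when loading.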

### Fit a preliminary word2vec model

The first order of business is to fit a preliminary NLP model to the recipe data, to begin clustering ingredients with similar use cases together. To begin, I will use the word2vec word embedding algorithm from the gensim package, following the tutorial here: https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92

In [2]:
from gensim.models import Word2Vec

Note that word2vec expects the corpus as a list of lists of string tokens, whereas here each recipe is already encoded as an array of integer indices. Word2vec builds its own dictionary of tokens internally, and there's no easy facility for skipping that step, so we need to reverse the indexing and turn each recipe into a list of ingredient strings.

First, test looping through the recipes and printing the ingredient names for each set of indices.
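
A minimal sketch of the target corpus shape, using toy arrays rather than the real data (the names and indices below are assumptions for illustration only):

```python
import numpy as np

# Toy data standing in for the real arrays (values are illustrative only).
ingredients = np.array(['salt', 'pepper', 'butter', 'garlic'])
recipes = [np.array([0, 2]), np.array([1, 2, 3])]

# Map each index array back to ingredient names, then to a plain list of str.
corpus = [ingredients[r].tolist() for r in recipes]
print(corpus)
```

Each inner list plays the role of a "sentence" and each ingredient string the role of a "word".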

In [14]:
# Print the ingredient names for the first 10 recipes
for r in recipes[:10]:
    print(ingredients[r])

['basil leaves' 'focaccia' 'leaves' 'mozzarella' 'pesto' 'plum tomatoes'
 'rosemary' 'sandwiches' 'sliced' 'tomatoes']
['balsamic vinegar' 'boiling water' 'butter' 'cooking spray'
 'crumbled gorgonzola' 'currants' 'gorgonzola' 'grated orange' 'kosher'
 'kosher salt' 'orange rind' 'parsley' 'pine nuts' 'polenta' 'toasted'
 'vinegar' 'water']
['bottle' 'bouillon' 'carrots' 'celery' 'chicken bouillon' 'cilantro'
 'clam juice' 'cloves' 'fish' 'garlic' 'medium shrimp' 'olive' 'olive oil'
 'onion' 'pepper' 'pepper flakes' 'red pepper' 'red pepper flakes' 'salt'
 'sherry' 'shrimp' 'stewed tomatoes' 'tomatoes' 'water' 'white'
 'white wine']
['grand marnier' 'kahlua']
['black pepper' 'coarse sea salt' 'fresh lemon' 'fresh lemon juice'
 'ground' 'ground black pepper' 'lemon' 'lemon juice' 'lime' 'lime peel'
 'mayonaise' 'pepper' 'sea salt' 'shallots' 'sherry wine'
 'sherry wine vinegar' 'vinegar' 'wine vinegar']
['black pepper' 'blue cheese' 'buttermilk' 'cheese' 'chives'
 'cider vinegar' 'crack

Write all the recipes to a list of lists, using ingredient strings.

In [67]:
text_recipes = []
for r in recipes:
    text_recipes.append(ingredients[r])

IndexError: arrays used as indices must be of integer (or boolean) type

Apparently, somewhere within the full corpus of recipes there is at least one recipe without valid indices. We can find it manually through trial and error.

In [74]:
recipes[727892].dtype

dtype('float64')

In [66]:
print(recipes[727892])

[]


In [75]:
recipes[727891].dtype, recipes[727893].dtype

(dtype('int64'), dtype('int64'))

In [76]:
print(recipes[727891], recipes[727893])

[  77   14    2  135   78   46   19   99   64    5  537  218    9  235
  364  138  376   65    6   31    1   81  157  117    0   12  354  469
   94  166   23  239 1365  858   97] [ 219  212   14    2  135   46   82 2050    5    3    9   30   53   48
   42   65    6   31    1  837   20  687   56  633   70   21   39   27
   35]


Clearly there is an issue with recipe 727892: it is empty, and its float64 dtype cannot be used for indexing. We could remove it by hand, but it is wiser to automate omitting every recipe whose indices are not of integer type.
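
A minimal sketch, on toy data, of how such recipes could be located by dtype rather than by trial and error. It assumes the empty recipe is float64 because NumPy gives an empty array that dtype when none is specified; the toy arrays are made up for illustration.

```python
import numpy as np

# Toy recipes; the middle entry mimics the empty recipe found above.
recipes = [np.array([3, 1]), np.array([]), np.array([0, 2])]

# An empty array built without an explicit dtype defaults to float64,
# and float arrays cannot be used as indices.
empty_dtype = recipes[1].dtype

# Positions of every recipe whose dtype is not an integer type.
bad_positions = [i for i, r in enumerate(recipes)
                 if not np.issubdtype(r.dtype, np.integer)]
print(empty_dtype, bad_positions)
```

Run against the full `recipes` array, the same comprehension should report index 727892 if that is the only offending entry.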

In [86]:
cleaned_recipes = []
for r in recipes:
    # Keep only recipes whose index arrays have an integer dtype
    if np.issubdtype(r.dtype, np.integer):
        cleaned_recipes.append(r)

In [87]:
len(cleaned_recipes)

1067556

Based on the length of the cleaned recipes, exactly one recipe (the empty one) has been removed.

In [89]:
text_recipes = []
for r in cleaned_recipes:
    text_recipes.append(ingredients[r])

In [90]:
text_recipes[0]

array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
       'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'],
      dtype='<U39')

In [96]:
type(text_recipes[0])

numpy.ndarray

In [101]:
# Convert each NumPy array to a plain Python list of strings
list_recipes = []
for r in text_recipes:
    list_recipes.append(r.tolist())

In [102]:
list_recipes[0]

['basil leaves',
 'focaccia',
 'leaves',
 'mozzarella',
 'pesto',
 'plum tomatoes',
 'rosemary',
 'sandwiches',
 'sliced',
 'tomatoes']

In [103]:
assert type(list_recipes[0]) is list

In [107]:
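# CBOW (sg = 0), 100-dimensional vectors; `size` is the gensim 3.x name (`vector_size` in gensim 4+)
model = Word2Vec(list_recipes, min_count = 5, size = 100, workers = 3, window = 5, sg = 0)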

In [108]:
model.wv[ingredients[0]]

array([ 1.0585777 ,  0.4850875 , -1.4707472 ,  2.497749  ,  1.4412118 ,
        0.27811313,  1.5404923 ,  1.1948029 ,  0.2329819 , -0.08550996,
        0.44077128,  0.913686  ,  0.03075444, -1.7546394 , -0.1847868 ,
       -1.0058036 ,  0.10467979, -0.05081251, -3.057463  ,  0.2947511 ,
        2.7611332 , -1.0053027 , -2.078197  , -3.3024101 ,  0.55176747,
       -0.42651176, -0.91138643,  0.14629453, -0.34244934,  0.39473736,
       -0.60392606, -0.2406422 ,  0.1045967 ,  0.52240556,  0.1780885 ,
       -1.8578379 , -0.23963502, -1.7731903 , -1.034301  , -0.7699568 ,
       -1.7443256 ,  1.6858759 , -0.3828404 ,  0.96851957,  2.2337818 ,
       -0.8679948 ,  2.1282547 , -0.9511316 , -1.4064693 ,  0.70405763,
        0.34233823, -0.80627894, -2.032351  ,  0.01732059, -0.7112261 ,
       -0.92520565,  2.0927186 , -1.9403362 ,  1.7594867 , -1.5237436 ,
       -0.22815631,  0.47786748,  0.07252172,  0.61004233,  1.1692137 ,
        1.034396  ,  1.8622104 ,  1.4482703 ,  0.25655073,  1.63

In [109]:
model.wv.similarity(ingredients[0], ingredients[1])

0.31152964
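
For reference, the similarity score is the cosine similarity between the two embedding vectors. The sketch below reproduces that calculation with plain NumPy; the 3-dimensional vectors and the `cosine_similarity` helper are hypothetical and not taken from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings, not taken from the trained model.
vec_a = np.array([1.0, 0.0, 0.0])
vec_b = np.array([1.0, 1.0, 0.0])

sim = cosine_similarity(vec_a, vec_b)
print(round(sim, 4))
```

A score of 1 means the vectors point in the same direction and 0 means they are orthogonal, so a value near 0.31 for 'salt' and 'pepper' indicates only a weak association in this first model.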

In [110]:
model.wv.most_similar(ingredients[0], topn = 10)

[('other', 0.6590076684951782),
 ('plus', 0.5933234691619873),
 ('raw', 0.5520408749580383),
 ('peas', 0.5440870523452759),
 ('potato', 0.5384942293167114),
 ('salt and ground black pepper', 0.5342700481414795),
 ('pepper flakes', 0.5176024436950684),
 ('paste', 0.5124271512031555),
 ('organic', 0.507232666015625),
 ('plain flour', 0.5055059194564819)]

In [111]:
model.wv.most_similar(ingredients[1], topn = 10)

[('more', 0.5804999470710754),
 ('peas', 0.5798527002334595),
 ('minced parsley', 0.578831672668457),
 ('olives', 0.5770426988601685),
 ('paprika', 0.5755941867828369),
 ('medium zucchini', 0.5653430223464966),
 ('medium shrimp', 0.5589482188224792),
 ('other', 0.5543408393859863),
 ('medium tomatoes', 0.5524414777755737),
 ('okra', 0.5511312484741211)]
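
To make the ranking above easier to reason about, here is a rough NumPy sketch of a nearest-neighbour lookup by cosine similarity. The vocabulary, the 2-dimensional vectors and the `nearest_neighbours` helper are all hypothetical; this is not gensim's implementation.

```python
import numpy as np

# Hypothetical vocabulary and embeddings, for illustration only.
vocab = ['salt', 'pepper', 'sugar', 'honey']
vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])

def nearest_neighbours(word, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # highest similarity first
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

neighbours = nearest_neighbours('salt')
print(neighbours)
```

In the real results, neighbours such as 'other', 'plus' and 'more' look like non-ingredient tokens that survived the dataset's cleaning, so they may be worth filtering out before refitting.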