# GrocerySub - Prototype Notebook
## Chris Kimber
### Started June 1 2020

### Introduction

The purpose of this notebook is to hold exploratory code for the data acquisition and modelling of possible grocery substitutions through recipes.

### Loading recipe information

The first recipe dataset is the cleaned simplified-1M+ dataset from https://github.com/schmidtdominik/RecipeNet. The dataset available has recipe ingredient lists using ingredients that have been cleaned through a couple of simplification approaches and pruned to the most common 3500 ingredients using curated ingredient lists.

The data comes as a NumPy .npz archive. The format is explained here: https://dominikschmidt.xyz/simplified-recipes-1M/. Initially we will load the archive and explore the available information and structure.

In [3]:
import numpy as np

In [4]:
with np.load('/Users/chrki23/Documents/Insight_Project/data/raw/simplified-recipes-1M.npz', allow_pickle = True) as data:
    recipes = data['recipes']
    ingredients = data['ingredients']

In [5]:
ingredients

array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
       'emerils essence', 'corn flakes cereal'], dtype='<U39')

In [6]:
recipes[0]

array([ 233, 2754,   42,  120,  560,  345,  150, 2081,   12,   21])

In [7]:
ingredients[recipes[0]]

array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
       'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'],
      dtype='<U39')

In [8]:
ingredients[recipes[1067556]]

array(['basil', 'bay leaves', 'boneless', 'butter', 'canned chicken',
       'canned chicken broth', 'chicken', 'chicken broth', 'cloves',
       'dry white wine', 'freshly ground pepper', 'garlic', 'ground',
       'ground pepper', 'leaves', 'olive', 'olive oil', 'parsley',
       'pepper', 'rosemary', 'thyme', 'turkey', 'white', 'white wine',
       'wine'], dtype='<U39')

In [9]:
print(len(ingredients))
print(len(recipes))

3500
1067557


A quick exploration of the dataset, following the notes from the source website, shows that the .npz archive contains two arrays. The 'ingredients' array holds the cleaned names of the 3500 most common ingredients, while the 'recipes' array holds one ingredient list per recipe (1,067,557 in total), restricted to the ingredients that appear in 'ingredients'. Each recipe refers to its ingredients by their integer index into 'ingredients', so the names can be recovered by indexing 'ingredients' with the recipe array.
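The lookup described above can be sketched on a toy example. The arrays below are made up for illustration and are not taken from the dataset; only the structure (a vocabulary array plus per-recipe integer index arrays) mirrors it. The reverse dictionary `name_to_index` is a hypothetical helper, handy for checking whether an ingredient name is in the vocabulary.

```python
import numpy as np

# Hypothetical miniature vocabulary standing in for the 3500-entry 'ingredients' array.
toy_ingredients = np.array(['salt', 'pepper', 'butter', 'garlic'])

# Hypothetical recipes: each is an array of integer indices into toy_ingredients.
toy_recipes = [np.array([0, 2]), np.array([1, 3, 0])]

# NumPy fancy indexing turns an index array into the matching names.
toy_names = [toy_ingredients[r].tolist() for r in toy_recipes]
print(toy_names)

# Reverse mapping (name -> index), an assumption added for illustration.
name_to_index = {name: i for i, name in enumerate(toy_ingredients.tolist())}
print(name_to_index['butter'])
```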

### Fit a preliminary word2vec model

The first order of business is to fit an initial NLP model to the recipe data, to begin the process of clustering together ingredients with similar use cases. To begin, I will use the word2vec word embedding algorithm from the gensim package. I'm using the tutorial here: https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92

In [10]:
from gensim.models import Word2Vec

Note that word2vec expects the corpus as a list of lists of string tokens, whereas the recipes here are stored as arrays of integer indices into 'ingredients'. word2vec builds its own dictionary of tokens anyway, and there is no easy facility for skipping that step, so the indexing has to be reversed: each recipe becomes a list of ingredient strings.
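A minimal sketch of that conversion, assuming toy data and a hypothetical helper name `to_token_lists` (neither comes from the dataset or from gensim). It checks that the result has the list-of-lists-of-strings shape described above.

```python
import numpy as np

def to_token_lists(index_recipes, vocabulary):
    """Map each integer index array to a plain list of ingredient strings."""
    return [vocabulary[r].tolist() for r in index_recipes]

# Made-up vocabulary and recipes, for illustration only.
toy_vocabulary = np.array(['salt', 'pepper', 'olive oil'])
toy_index_recipes = [np.array([0, 1]), np.array([2, 0])]

corpus = to_token_lists(toy_index_recipes, toy_vocabulary)

# Each "sentence" is a list, and each token is a Python str.
assert all(isinstance(sentence, list) for sentence in corpus)
assert all(isinstance(token, str) for sentence in corpus for token in sentence)
print(corpus)
```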

First, test looping through the first ten recipes using the ingredient indices and print the ingredient names.

In [11]:
for r in recipes[:10]:
    print(ingredients[r])

['basil leaves' 'focaccia' 'leaves' 'mozzarella' 'pesto' 'plum tomatoes'
 'rosemary' 'sandwiches' 'sliced' 'tomatoes']
['balsamic vinegar' 'boiling water' 'butter' 'cooking spray'
 'crumbled gorgonzola' 'currants' 'gorgonzola' 'grated orange' 'kosher'
 'kosher salt' 'orange rind' 'parsley' 'pine nuts' 'polenta' 'toasted'
 'vinegar' 'water']
['bottle' 'bouillon' 'carrots' 'celery' 'chicken bouillon' 'cilantro'
 'clam juice' 'cloves' 'fish' 'garlic' 'medium shrimp' 'olive' 'olive oil'
 'onion' 'pepper' 'pepper flakes' 'red pepper' 'red pepper flakes' 'salt'
 'sherry' 'shrimp' 'stewed tomatoes' 'tomatoes' 'water' 'white'
 'white wine']
['grand marnier' 'kahlua']
['black pepper' 'coarse sea salt' 'fresh lemon' 'fresh lemon juice'
 'ground' 'ground black pepper' 'lemon' 'lemon juice' 'lime' 'lime peel'
 'mayonaise' 'pepper' 'sea salt' 'shallots' 'sherry wine'
 'sherry wine vinegar' 'vinegar' 'wine vinegar']
['black pepper' 'blue cheese' 'buttermilk' 'cheese' 'chives'
 'cider vinegar' 'crack

Write all the recipes to a list of lists, using the ingredient names as strings.

In [12]:
text_recipes = []
for r in recipes:
    text_recipes.append(ingredients[r])

IndexError: arrays used as indices must be of integer (or boolean) type

Apparently somewhere within the full corpus of recipes there is at least one recipe without valid indices. It can be found manually through trial and error.

In [13]:
recipes[727892].dtype

dtype('float64')

In [14]:
print(recipes[727892])

[]


In [15]:
recipes[727891].dtype, recipes[727893].dtype

(dtype('int64'), dtype('int64'))

In [16]:
print(recipes[727891], recipes[727893])

[  77   14    2  135   78   46   19   99   64    5  537  218    9  235
  364  138  376   65    6   31    1   81  157  117    0   12  354  469
   94  166   23  239 1365  858   97] [ 219  212   14    2  135   46   82 2050    5    3    9   30   53   48
   42   65    6   31    1  837   20  687   56  633   70   21   39   27
   35]


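Clearly there is an issue with recipe 727892, which is empty: NumPy gives an empty array the default float64 dtype, which cannot be used for indexing. It could be removed by hand, but it is wiser to automate omitting every recipe whose indices are not of integer (or boolean) type.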
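Rather than searching by trial and error, the offending recipes can be located programmatically. The sketch below uses made-up toy arrays, not the real dataset, and the variable names are hypothetical.

```python
import numpy as np

# Toy recipes for illustration; the middle one is empty.
# np.array([]) defaults to float64, so it cannot be used as an index array.
toy_recipes = [np.array([0, 2]), np.array([]), np.array([1])]

# Positions of every recipe whose index array is not an integer type.
bad_positions = [i for i, r in enumerate(toy_recipes)
                 if not np.issubdtype(r.dtype, np.integer)]

# Keep only the recipes that can safely index the vocabulary.
kept = [r for r in toy_recipes if np.issubdtype(r.dtype, np.integer)]
print(bad_positions, len(kept))
```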

In [17]:
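cleaned_recipes = []
for r in recipes:
    # Keep only recipes whose index arrays have an integer dtype.
    # np.issubdtype avoids depending on the platform's default int width.
    if np.issubdtype(r.dtype, np.integer):
        cleaned_recipes.append(r)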

In [18]:
len(cleaned_recipes)

1067556

The cleaned list is one recipe shorter than the original 1,067,557, so exactly one recipe (the empty one) has been removed.

In [19]:
text_recipes = []
for r in cleaned_recipes:
    text_recipes.append(ingredients[r])

In [20]:
text_recipes[0]

array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
       'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'],
      dtype='<U39')

In [21]:
type(text_recipes[0])

numpy.ndarray

In [22]:
list_recipes = []
for r in text_recipes:
    list_recipes.append(r.tolist())

In [23]:
list_recipes[0]

['basil leaves',
 'focaccia',
 'leaves',
 'mozzarella',
 'pesto',
 'plum tomatoes',
 'rosemary',
 'sandwiches',
 'sliced',
 'tomatoes']

In [24]:
assert type(list_recipes[0]) is list

In [25]:
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`.
model = Word2Vec(list_recipes, min_count = 5, size = 100, workers = 3, window = 5, sg = 0)

In [26]:
model.wv[ingredients[0]]

array([ 1.2160673 , -0.12984042, -1.886507  , -3.8187292 ,  0.2943414 ,
       -1.6767488 ,  0.63865805,  1.49091   , -0.01775747, -2.0751526 ,
       -0.3806102 , -1.9652799 , -3.930794  ,  4.30601   ,  2.1170402 ,
        1.6899687 , -0.9467472 ,  1.12574   ,  0.23056045, -1.611426  ,
       -3.3378751 ,  2.612072  , -0.69928527, -0.45738202,  1.6472564 ,
        1.081074  , -0.8120978 , -0.09280349,  3.1050208 ,  1.8500291 ,
       -0.56413156, -2.082214  , -0.09067021, -0.04519631, -1.6954857 ,
        1.050231  ,  1.4114212 , -3.2119138 , -1.5701051 , -0.32631353,
        0.23533545, -0.7077582 ,  1.9139322 , -3.7266355 ,  0.20405076,
       -2.2814717 ,  0.0098802 ,  4.4964423 ,  0.8857798 ,  1.6583738 ,
       -1.3082092 ,  0.90956247,  1.7435179 , -0.38278997,  0.15653719,
       -1.9615612 , -2.7696075 ,  1.6250622 ,  0.48924702, -0.612364  ,
       -1.7783903 ,  1.6029813 ,  0.04330054, -1.0454836 , -2.5077395 ,
        1.3289728 , -4.30793   , -3.0288746 , -2.256761  , -2.60

In [27]:
model.wv.similarity(ingredients[0], ingredients[1])

0.3072774
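For reference, the score returned by `similarity` is, to my understanding, the cosine similarity of the two word vectors. A minimal NumPy sketch with made-up vectors (not taken from the model) and a hypothetical helper name:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 2-d vectors for illustration.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(cosine_similarity(a, a))  # identical direction
print(cosine_similarity(a, b))  # 45 degrees apart
```

Values near 1 mean the vectors point in nearly the same direction; values near 0 mean they are close to orthogonal.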

In [28]:
model.wv.most_similar(ingredients[0])[:10]

[('salt and ground black pepper', 0.6500079035758972),
 ('other', 0.6026499271392822),
 ('plus', 0.594636082649231),
 ('pepper sauce', 0.544674277305603),
 ('raw', 0.5218769311904907),
 ('red bell pepper', 0.5072269439697266),
 ('paste', 0.5055068731307983),
 ('red', 0.4955565929412842),
 ('organic', 0.48495015501976013),
 ('parsley leaves', 0.4767211079597473)]

In [29]:
model.wv.most_similar(ingredients[1])[:10]

[('other', 0.5976935029029846),
 ('more', 0.5717605352401733),
 ('paprika', 0.5643827319145203),
 ('medium zucchini', 0.5610072016716003),
 ('minced parsley', 0.5593796968460083),
 ('low sodium chicken broth', 0.5514507293701172),
 ('parsley leaves', 0.5352662801742554),
 ('peppercorns', 0.5311011672019958),
 ('medium tomatoes', 0.5269858837127686),
 ('peas', 0.5219717025756836)]

In [30]:
model.wv.similarity(ingredients[1], 'peas')

0.52197164

In [31]:
ingredients[1]

'pepper'

In [32]:
ingredients[0]

'salt'

In [33]:
ingredients[10]

'olive'

In [34]:
model.wv.most_similar(ingredients[10])

[('medium zucchini', 0.5520144104957581),
 ('more', 0.5360108017921448),
 ('medium tomatoes', 0.5193006992340088),
 ('marjoram', 0.4985307455062866),
 ('mint', 0.498519629240036),
 ('minced parsley', 0.4984644949436188),
 ('kale', 0.4958930015563965),
 ('other', 0.4956119954586029),
 ('parsley', 0.48405665159225464),
 ('pancetta', 0.47830459475517273)]

In [36]:
model.wv.most_similar(ingredients[5])

[('evaporated milk', 0.49695873260498047),
 ('hot water', 0.49244439601898193),
 ('melted butter', 0.48003941774368286),
 ('cornstarch', 0.46315228939056396),
 ('glaze', 0.4627484679222107),
 ('ground nutmeg', 0.4598190188407898),
 ('halfandhalf', 0.44377824664115906),
 ('heavy whipping cream', 0.44305965304374695),
 ('extra', 0.44218361377716064),
 ('granulated sugar', 0.4406670331954956)]

In [37]:
ingredients[5]

'flour'

In [38]:
model.wv.most_similar('coffee')

[('espresso', 0.7049605250358582),
 ('coffee granules', 0.6738904714584351),
 ('espresso beans', 0.6235960721969604),
 ('coffee powder', 0.6215568780899048),
 ('espresso powder', 0.6128979921340942),
 ('frangelico', 0.6084223985671997),
 ('creamer', 0.604565441608429),
 ('chocolate shavings', 0.6017354726791382),
 ('coffee liqueur', 0.5933762788772583),
 ('kahlua', 0.5888408422470093)]

Let's play around with some of the word2vec parameters. First we want to widen the window so that all ingredients in a recipe are included in each other's context, which requires knowing the length of the longest recipe.
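To make the effect of the window concrete, here is a plain-Python sketch of which (center, context) pairs a fixed symmetric window produces. The function `context_pairs` and the toy recipe are illustrative assumptions, not gensim code; as far as I know, gensim also randomly shrinks the window per token during training, which this sketch ignores.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Made-up five-ingredient recipe.
toy_recipe = ['basil', 'focaccia', 'mozzarella', 'pesto', 'tomatoes']

narrow = context_pairs(toy_recipe, window=1)
wide = context_pairs(toy_recipe, window=len(toy_recipe))

# With window=1, 'basil' never sees 'tomatoes'; with a window at least as
# long as the recipe, every ordered pair of distinct ingredients appears.
print(('basil', 'tomatoes') in narrow)
print(('basil', 'tomatoes') in wide)
```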

In [39]:
recipe_length = []
for r in recipes:
    recipe_length.append(len(r))

In [40]:
max(recipe_length)

104

In [43]:
np.argmax(recipe_length)

772771

In [44]:
ingredients[recipes[np.argmax(recipe_length)]]

array(['alfredo sauce', 'all-purpose flour', 'artichoke hearts', 'basil',
       'basil leaves', 'bell pepper', 'black pepper', 'boneless',
       'boneless skinless chicken',
       'boneless skinless chicken breast halves',
       'boneless skinless chicken breasts', 'breadcrumbs', 'butter',
       'cajun seasoning', 'cayenne', 'cayenne pepper', 'cheese',
       'chicken', 'chicken breast', 'chicken breast halves',
       'chicken breasts', 'chicken cutlets', 'cilantro',
       'cilantro leaves', 'coarse salt', 'cooked chicken',
       'crushed red pepper', 'diced tomatoes', 'dried parsley', 'eggs',
       'extra-virgin olive oil', 'fettucine', 'flour', 'fresh',
       'fresh basil', 'fresh ginger', 'fresh parsley',
       'freshly ground pepper', 'frozen artichoke hearts', 'garlic',
       'garlic cloves', 'garlic powder', 'ginger',
       'grated parmesan cheese', 'green pepper', 'ground',
       'ground black pepper', 'ground pepper', 'kosher', 'kosher salt',
       'large eggs', 

In [45]:
def get_longest_recipe(recipes):
    """Return the number of ingredients in the longest recipe."""
    return max(len(r) for r in recipes)

In [46]:
get_longest_recipe(recipes)

104

In [54]:
window_size = get_longest_recipe(recipes)
model2 = Word2Vec(list_recipes, min_count = 5, size = 100, workers = 3, window = window_size, sg = 0)

In [49]:
model2.wv.most_similar('coffee')

[('black tea', 0.6247246265411377),
 ('black mission figs', 0.5961823463439941),
 ('black forest ham', 0.5756633281707764),
 ('black beans', 0.5723584294319153),
 ('black rice', 0.5432049632072449),
 ('black mustard seeds', 0.5334600210189819),
 ('black sesame seeds', 0.5306253433227539),
 ('black bass', 0.5167346596717834),
 ('espresso beans', 0.5025407075881958),
 ('brewed espresso', 0.4917338490486145)]

In [52]:
model2.wv.most_similar(ingredients[1])

[('chocolate chips', 0.3121936619281769),
 ('unsweetened', 0.3071199953556061),
 ('yellow bell pepper', 0.3043505549430847),
 ('ground white pepper', 0.2717072069644928),
 ('yellow pepper', 0.26964089274406433),
 ('hot red pepper flakes', 0.26312533020973206),
 ('seasoned salt', 0.2623046636581421),
 ('red chili pepper', 0.2523878514766693),
 ('oats', 0.23632235825061798),
 ('marjoram', 0.22881484031677246)]

In [55]:
window_size = get_longest_recipe(recipes)
model3 = Word2Vec(list_recipes, min_count = 5, size = 300, workers = 3, window = window_size, sg = 0)

In [56]:
model3.wv.most_similar('coffee')

[('black mission figs', 0.4134988784790039),
 ('black tea', 0.41199514269828796),
 ('brewed espresso', 0.39682382345199585),
 ('espresso beans', 0.3680141568183899),
 ('black mustard seeds', 0.36351361870765686),
 ('black forest ham', 0.35438230633735657),
 ('instant espresso', 0.34857529401779175),
 ('black rice', 0.3423813581466675),
 ('espresso powder', 0.34043288230895996),
 ('black beans', 0.3351942300796509)]

In [57]:
model4 = Word2Vec(list_recipes, min_count = 5, size = 300, workers = 3, window = 5, sg = 0)

model4.wv.most_similar('coffee')

In [59]:
model5 = Word2Vec(list_recipes, min_count = 5, size = 300, workers = 3, window = 5, sg = 1)

In [60]:
model5.wv.most_similar('coffee')

[('espresso', 0.7138556241989136),
 ('coffee granules', 0.6950829029083252),
 ('brewed coffee', 0.686455488204956),
 ('coffee powder', 0.6757040023803711),
 ('coffee beans', 0.6634635925292969),
 ('brewed espresso', 0.6594467759132385),
 ('coffee liqueur', 0.651513934135437),
 ('espresso powder', 0.6289201974868774),
 ('espresso beans', 0.6025258302688599),
 ('chocolate shavings', 0.6001182198524475)]

In [61]:
model5.wv.most_similar('yogurt')

[('yoghurt', 0.5822219848632812),
 ('yogurt cheese', 0.5784868001937866),
 ('vanilla soy milk', 0.5330559015274048),
 ('vanilla lowfat yogurt', 0.5262773633003235),
 ('vanilla yogurt', 0.4982944130897522),
 ('unsweetened soymilk', 0.4869977831840515),
 ('soy yogurt', 0.48266148567199707),
 ('whey protein', 0.4802514910697937),
 ('unsweetened vanilla almond milk', 0.47807183861732483),
 ('plain yogurt', 0.4733242690563202)]

In [62]:
model5.wv.most_similar(ingredients[1])

[('parsley', 0.4103143811225891),
 ('onions', 0.40506625175476074),
 ('minced garlic', 0.40300098061561584),
 ('rib', 0.4024224281311035),
 ('onion', 0.39498627185821533),
 ('minced parsley', 0.3933371305465698),
 ('scallion', 0.38600027561187744),
 ('paprika', 0.3816184997558594),
 ('red bell peppers', 0.37911564111709595),
 ('portobello caps', 0.37586045265197754)]

In [63]:
model5.wv.most_similar('shallot')

[('shallots', 0.7983604669570923),
 ('sliced shallots', 0.6157399415969849),
 ('radicchio leaves', 0.5654714107513428),
 ('watercress', 0.5613769292831421),
 ('tarragon sprigs', 0.5487103462219238),
 ('tarragon', 0.541921854019165),
 ('toasted baguette', 0.5380358695983887),
 ('watercress leaves', 0.5352039337158203),
 ('shellfish', 0.5350548028945923),
 ('wild salmon', 0.5334708094596863)]

In [64]:
model5.wv.most_similar('shallots')

[('shallot', 0.7983604669570923),
 ('sliced shallots', 0.61521315574646),
 ('shiitake', 0.578506350517273),
 ('shiitake mushrooms', 0.5595210790634155),
 ('shellfish', 0.5533171892166138),
 ('tarragon sprigs', 0.5497821569442749),
 ('sliced leeks', 0.5448044538497925),
 ('watercress', 0.5261567831039429),
 ('scallops', 0.5250446796417236),
 ('thyme sprig', 0.5189806222915649)]

In [65]:
# Cast to int: 1e7 is a float, and window is a count of tokens.
model6 = Word2Vec(list_recipes, min_count = 5, size = 300, workers = 3, window = int(window_size*1e7), sg = 0)

In [66]:
model6.wv.most_similar('coffee')

[('black tea', 0.44019243121147156),
 ('black mission figs', 0.4280414581298828),
 ('black beans', 0.3904173970222473),
 ('black forest ham', 0.37936240434646606),
 ('black bass', 0.3678361773490906),
 ('black rice', 0.3669908046722412),
 ('black sesame seeds', 0.363450288772583),
 ('treacle', 0.36084526777267456),
 ('black mustard seeds', 0.3542540371417999),
 ('brewed espresso', 0.34990209341049194)]

In [67]:
model6.wv.most_similar('mushrooms')

[('white button mushrooms', 0.43785637617111206),
 ('mushroom', 0.2608409523963928),
 ('madeira wine', 0.19623635709285736),
 ('pure vanilla', 0.18856610357761383),
 ('okra', 0.18066561222076416),
 ('portobello caps', 0.17346245050430298),
 ('oxtails', 0.1706920564174652),
 ('herbes de provence', 0.1667000651359558),
 ('italian pork sausage', 0.156202033162117),
 ('sangiovese', 0.15370628237724304)]

In [68]:
model6.wv.most_similar('mushroom')

[('shiitake mushroom caps', 0.6203939914703369),
 ('condensed golden mushroom soup', 0.2944466173648834),
 ('mushrooms', 0.2608409523963928),
 ('golden mushroom soup', 0.2511514723300934),
 ('oyster sauce', 0.227003812789917),
 ('cremini mushrooms', 0.22580403089523315),
 ('condensed cream of potato soup', 0.2188304364681244),
 ('condensed cream of celery soup', 0.2114531248807907),
 ('condensed cream of chicken soup', 0.20852962136268616),
 ('frozen green beans', 0.17412789165973663)]

In [69]:
model5.wv.most_similar('mushroom')

[('mushroom soup', 0.5826168060302734),
 ('cream of celery soup', 0.5491560101509094),
 ('mushrooms', 0.5449013113975525),
 ('mixed mushrooms', 0.5419656038284302),
 ('condensed golden mushroom soup', 0.5125951766967773),
 ('golden mushroom soup', 0.507859468460083),
 ('sliced mushrooms', 0.4991070032119751),
 ('cream of mushroom soup', 0.49063780903816223),
 ('cream of chicken soup', 0.4856976568698883),
 ('macaroni and cheese dinner', 0.47939327359199524)]

In [70]:
model5.wv.most_similar('spaghetti')

[('spaghettini', 0.6600141525268555),
 ('thin spaghetti', 0.6508853435516357),
 ('ziti', 0.615624189376831),
 ('rigatoni', 0.6082541942596436),
 ('pasta sauce', 0.5644978284835815),
 ('short pasta', 0.564304530620575),
 ('penne', 0.5575358271598816),
 ('tagliatelle', 0.5453338623046875),
 ('penne rigate', 0.5407254099845886),
 ('penne pasta', 0.5404622554779053)]

In [71]:
model4.wv.most_similar('spaghetti')

[('pasta sauce', 0.6220312118530273),
 ('penne', 0.6028362512588501),
 ('rigatoni', 0.600810170173645),
 ('ravioli', 0.5822018384933472),
 ('pizza sauce', 0.5759572386741638),
 ('romano cheese', 0.526602566242218),
 ('penne pasta', 0.5224217176437378),
 ('tortellini', 0.5099421739578247),
 ('seasoned bread crumbs', 0.5084934830665588),
 ('romano', 0.4826919734477997)]

In [72]:
model4.wv.most_similar('celery')

[('dried celery', 0.4342176914215088),
 ('frozen okra', 0.41190212965011597),
 ('cabbage', 0.40134549140930176),
 ('frozen peas', 0.39975273609161377),
 ('cream of mushroom soup', 0.3895207643508911),
 ('dried thyme', 0.38480284810066223),
 ('cooked white rice', 0.3760124146938324),
 ('gumbo', 0.3730083703994751),
 ('dry sherry', 0.3693051338195801),
 ('cooked rice', 0.3687528371810913)]

In [73]:
model4.wv.most_similar('thyme')

[('rib', 0.5926571488380432),
 ('thyme leaves', 0.5837690830230713),
 ('stock', 0.5584437847137451),
 ('savory', 0.5455754995346069),
 ('parsley sprigs', 0.5343202352523804),
 ('sage', 0.5275074243545532),
 ('sage leaves', 0.5144556760787964),
 ('shallots', 0.5085753202438354),
 ('rabbit', 0.49856045842170715),
 ('tarragon', 0.49356481432914734)]

In [74]:
model4.wv.most_similar('pineapple')

[('orange juice', 0.4284343123435974),
 ('mangoes', 0.390312135219574),
 ('mandarin orange segments', 0.3620172441005707),
 ('limes', 0.3605644702911377),
 ('mangos', 0.35693642497062683),
 ('pear juice', 0.35233616828918457),
 ('lime juice', 0.3479768633842468),
 ('fruit cocktail', 0.3459216356277466),
 ('mango', 0.3389469087123871),
 ('mandarin oranges', 0.3387060761451721)]