## Introduction

Using my web scraper, I've scraped recipes from Bon Appetit and saved them as JSON files. In this notebook, I'll combine these JSON files into a pandas dataframe and process it into a format I can use for my analyses (in subsequent notebooks).

### Load required packages and libraries

In [1]:
# import necessary packages
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# import data (read json files)
json_list = ['basically.json', 'healthyish.json', 'recipes.json']

dfs = [] # an empty list to store the data frames
for json in json_list:
    data = pd.read_json(json, orient='records') # read data frame from json file
    dfs.append(data) #append the data frame to the list

# concatenate all the data frames in the list
raw = pd.concat(dfs, ignore_index=True)
raw

Unnamed: 0,title,author,date,ingredients,rating,ratings_count,review_count,tags,url
0,Brussels Sprouts Nasi Goreng,Meera Sodha,2020-11-15,"[2, Tbsp. kecap manis or 3 Tbsp. agave nectar ...",3.6,26,1,"[Basically, brussels sprout, Agave, Soy Sauce,...",https://www.bonappetit.com/recipe/brussels-spr...
1,Huevos Rancheros con Rajas y Champiñones,Rick Martinez,2021-01-03,"[1, serrano chile, 16, oz. cherry tomatoes (ab...",4.6,12,1,"[Basically, Egg, Serrano Chiles, Cherry Tomato...",https://www.bonappetit.com/recipe/huevos-ranch...
2,Green Seasoning Baked Cod,Brigid Washington,2021-01-10,"[¼, Vidalia or other sweet onion, 4, 6-oz. ski...",4.2,19,6,"[Basically, Fish, Seafood, Onion, Cod, Kosher ...",https://www.bonappetit.com/recipe/green-season...
3,Black-Eyed Pea Masala With Kale,Rachel Gurjar,2020-12-27,"[1, large white onion, 4, garlic cloves, 1, 1""...",4.5,55,22,"[Basically, Black-Eyed Peas, Indian Food, Kale...",https://www.bonappetit.com/recipe/black-eyed-p...
4,Pot Roast Brisket With Harissa and Spices,Sabrina Ghayour,2020-11-22,"[5½, lb. untrimmed flat-cut beef brisket, pref...",4.8,11,3,"[Basically, Brisket, Harissa, Kosher Salt, Cin...",https://www.bonappetit.com/recipe/pot-roast-br...
...,...,...,...,...,...,...,...,...,...
4692,Spiced Pear Upside-Down Cake,Claire Saffitz,2015-09-22,"[2, tablespoons unsalted butter, plus more for...",4.2,21,13,"[Cake, Cardamom, Dessert, Olive Oil, Orange, P...",https://www.bonappetit.com/recipe/spiced-pear-...
4693,Double Ginger Sticky Toffee Pudding,Claire Saffitz,2015-09-22,"[Cake, ½, cup (1 stick) unsalted butter, room ...",4.8,13,7,"[Cake, Cream, date, Dessert, Ginger, Pudding, ...",https://www.bonappetit.com/recipe/double-ginge...
4694,Apple Caramels,Rick Martinez,2015-10-29,"[Nonstick vegetable oil spray, ½, cup blanched...",4.0,3,2,"[Apple, Apple Cider, Brandy, Calvados, Caramel...",https://www.bonappetit.com/recipe/apple-caramels
4695,"Chicken Skin with Peanuts, Chiles, and Lime","Eli Dahlin, Damn the Weather, Seattle, WA",2015-09-22,"[¼, cup peanut or vegetable oil, 8, garlic clo...",,0,0,"[chicken skin, Jalapeno, Lime, Peanut, Green O...",https://www.bonappetit.com/recipe/chicken-skin...


## Cleaning data

### Duplicated results

There are some duplicated recipes in the dataframe. This is because my spider crawled through results categorised by magazine issue date (recipes.json) before crawling through recipes from the Basically (basically.json) and Healthyish pages (healthyish.json) for any recipes that were not published in the magazine. So I'll start by removing the duplicates.

In [3]:
# drop duplicated rows (based on whether they share a url or not)
# .copy() gives an independent dataframe, so later column assignments do not warn
raw = raw.drop_duplicates(subset=['url'], keep='first').copy()

# check change in dimensions of df after dropping rows
raw.shape

(3731, 9)
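
Because later cells assign new columns to `raw`, it can help to take an explicit copy after dropping duplicates. In some pandas versions, modifying a dataframe derived from another one triggers a SettingWithCopyWarning. A minimal sketch on made-up rows (the `toy` frame and its URLs are hypothetical, not the scraped data):

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped recipes
toy = pd.DataFrame({
    "title": ["Aioli", "Aioli", "Pan Bagnat"],
    "url": [
        "https://example.com/aioli",
        "https://example.com/aioli",
        "https://example.com/pan-bagnat",
    ],
    "tags": [["Egg", "Garlic"], ["Egg", "Garlic"], ["Tuna", "Olive"]],
})

# Drop rows sharing a url, then copy so the result is independent of toy
deduped = toy.drop_duplicates(subset=["url"], keep="first").copy()

# Assigning to a column of the copy leaves the original frame untouched
deduped["tags"] = deduped["tags"].map(lambda tags: [t.lower() for t in tags])
print(deduped.shape)
```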

### Some considerations

I want to clean the ingredients column down to the 'base'/'stem' ingredients because the ingredients column currently consists of many descriptives that are unnecessary to the analyses I intend to perform.   

Ingredients present a slight challenge because I'm not sure of how much I should 'trim' the ingredients. For instance, should 'Parmesan cheese' be trimmed to 'cheese'? (Probably not.) Or should 'English hothouse cucumber' be trimmed to 'cucumber'? (Probably fine.)  

In a similar project, the author used a reductive approach to cleaning, where she removed all unnecessary stopwords and other words related to measurements and instructions (e.g. finely diced) before changing plural words to their singular forms. 

I've decided to depart from the reductive approach as I want to use the tags scraped from the recipe pages (contained in the tags column) to clean my ingredients column instead. The tags are a good place to start because they encapsulate the key ingredients used in a recipe. 

In [4]:
# change elements in tags columns to lowercase
raw['tags'] = raw['tags'].map(lambda x: list(map(str.lower, x)))


In [5]:
# get unique elements from tags column
import itertools
set(itertools.chain.from_iterable(raw.tags)) # itertools concatenates all the lists; set keeps the unique elements
# but I want to count the number of occurrences

{'bread pudding',
 'pork',
 'leafy greens',
 'ghee',
 'flatbread',
 'prosciutto',
 'sumac',
 'cardamom',
 'cold soda',
 'feta',
 'beef tenderloin',
 'hot pepper',
 'orecchiette',
 'jewish',
 'chowder',
 'lasagna',
 'summer desserts',
 'achiote paste',
 'low fat',
 'ina garten',
 'napa cabbage',
 'chai',
 'confit',
 'pork loin',
 'cognac/armagnac',
 'turnover',
 'lamb shank',
 'cucumber',
 'russet potato',
 'cacao nib',
 'southern',
 'palm sugar',
 'basics',
 'currant',
 'grape tomatoes',
 'bobby flay',
 'peanut',
 'meatloaf',
 'olive',
 'lunch al desko',
 'cinnamon',
 'potato',
 'cayenne',
 'peppadew',
 'cornbread',
 'star anise',
 'cherry tomatoes',
 'dark chocolate',
 'macaroons',
 'bread flour',
 'low sodium',
 'linguine',
 'tomato paste',
 'hominy',
 'roast beef',
 'sherry wine vinegar',
 'hard-boiled eggs',
 'green onion scallion',
 'vinaigrette',
 'fettuccine',
 'kitchen sketches',
 'custard',
 'ground pork',
 'date',
 'weeknight',
 'tartare',
 'oxtail',
 'syrup',
 'melon',
 'dan

In [6]:
# obtain counts for each unique tag from the tags columns
tags_count = pd.Series(list(itertools.chain.from_iterable(raw.tags))).value_counts()
tags_count

healthyish         1327
garlic             1044
lemon               621
web recipe          517
sugar               482
                   ... 
schnitzel             1
semolina              1
pasta maker           1
tuscan                1
root vegetables       1
Length: 1016, dtype: int64
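
The same kind of count can be obtained with the standard library alone. A small sketch on made-up tag lists (`tag_lists` is a hypothetical stand-in for `raw.tags`), using `collections.Counter`:

```python
from collections import Counter
from itertools import chain

# Hypothetical tag lists, one list per recipe
tag_lists = [
    ["healthyish", "garlic", "lemon"],
    ["garlic", "egg"],
    ["healthyish", "garlic"],
]

# Flatten the lists and count how often each tag occurs
toy_counts = Counter(chain.from_iterable(tag_lists))
print(toy_counts.most_common(2))  # [('garlic', 3), ('healthyish', 2)]
```

Unlike `set`, a `Counter` keeps the number of occurrences, and `most_common` returns the tags sorted by frequency.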

In [7]:
tags_count.head(50)

healthyish              1327
garlic                  1044
lemon                    621
web recipe               517
sugar                    482
chile                    454
egg                      444
onion                    426
ginger                   401
cilantro                 391
olive oil                378
green onion scallion     359
dessert                  356
salad                    346
butter                   340
lime                     320
vegetarian               319
shallot                  294
honey                    292
kosher salt              284
lemon juice              270
chicken recipes          265
tomato                   260
mint                     258
soy sauce                250
dinner                   237
basically                229
flour                    227
red pepper               215
cream                    206
parmesan                 201
cocktail                 195
bread                    191
cinnamon                 191
sesame seed   

Some tags are not ingredients, such as 'web recipe', 'basically' and 'healthyish', but these can be removed.
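
One possible way to remove such tags is a simple filter over each recipe's tag list. In this sketch the set name `NON_INGREDIENT_TAGS` and the helper `drop_non_ingredients` are illustrative, and the set only holds the three tags named above:

```python
# Tags named in the text as non-ingredients; the set name is illustrative
NON_INGREDIENT_TAGS = {"web recipe", "basically", "healthyish"}

def drop_non_ingredients(tags):
    """Return the tags that are not in NON_INGREDIENT_TAGS, preserving order."""
    return [tag for tag in tags if tag not in NON_INGREDIENT_TAGS]

example_tags = ["healthyish", "web recipe", "squash", "cauliflower", "salmon"]
print(drop_non_ingredients(example_tags))  # ['squash', 'cauliflower', 'salmon']
```

On the dataframe, this could be applied with `raw['tags'].map(drop_non_ingredients)`.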

Instead of using the ingredients column, the tags could possibly give a good enough representation of the key ingredients in a recipe. To check if this is the case, I'll randomly sample 30 rows to see if the tags sufficiently capture the ingredients in a recipe.

In [8]:
pd.set_option('display.max_colwidth', None) 
raw[['title', 'ingredients', 'tags', 'date']].sample(n=30)

Unnamed: 0,title,ingredients,tags,date
4082,Stuffed Shells with Marinara,"[12, ounces jumbo pasta shells, Kosher salt, 2, large egg yolks, 1, large egg, 2, cups whole-milk fresh ricotta, 3, ounces Parmesan, finely grated, plus more for serving, ¼, cup finely chopped parsley, 8, ounces low-moisture mozzarella, coarsely grated, divided, Freshly ground black pepper, 3, cups , Classic Marinara Sauce, , divided, Dried oregano and olive oil (for serving)]","[pasta, egg, ricotta, parmesan, mozzarella, sauce, oregano, best new restaurants 2017]",2017-08-15
4393,The Byrrh Special,"[1½, oz. Byrrh, 1½, oz. London dry gin, Lemon twist (for serving)]","[gin, inaki aizpitarte, cocktail]",2014-03-20
784,Sheet Pan Salmon and Squash with Miso Mojo,"[1, delicata squash (about 1 lb.), halved, 1, small head cauliflower, cut into florets, 1, red onion, cut into 8 wedges, 5, Tbsp. extra-virgin olive oil, divided, plus more for drizzling, Kosher salt, 1, lb. boneless salmon fillet, ⅓, cup raw pumpkin seeds (pepitas), ¼, cup fresh orange juice (from about 1 small orange), 2, Tbsp. fresh lime juice, 2, Tbsp. unseasoned rice vinegar, 2, Tbsp. white miso, 2, small serrano chiles, sliced into thin rings]","[healthyish, web recipe, squash, cauliflower, red onion, salmon, pumpkin seed, orange juice, lime juice, rice vinegar, miso, serrano chiles]",2018-03-18
4654,King Trumpet Yakitori,"[1, scallion, thinly sliced, ⅓, cup mirin, ⅓, cup sake, ⅓, cup soy sauce, ⅓, cup zarame sugar or raw sugar, 4, small king trumpet mushrooms, trimmed, halved lengthwise, cut crosswise into 2-inch pieces, 1, teaspoon vegetable oil, Kosher salt, Special Equipment, Eight 6-inch bamboo skewers, soaked at least 15 minutes]","[hot 10 2015, mirin, mushroom, sake, soy sauce, yakitori, vegetarian]",2015-08-18
1990,Spicy Cavatelli with Zucchini and Leeks,"[½, pound cavatelli, ¼, cup olive oil, 1, large leek, white and pale-green parts only, chopped, ¾, teaspoon crushed red pepper flakes, Kosher salt and freshly ground black pepper, 2, large zucchini, grated, ⅓, cup grated Pecorino]","[hot pepper, leek, pasta, spicy, summer, zucchini]",2014-05-14
2551,Cranberry and Cornmeal Upside-Down Cake,"[11, tablespoons unsalted butter, room temperature, divided, ½, cup (packed) dark brown sugar, 4, cups fresh cranberries, 1, cup granulated sugar, 3, large eggs, ½, cup sour cream, 1¼, cups all-purpose flour, ½, cup cornmeal, 1, tablespoon baking powder, 1, teaspoon kosher salt]","[cake, cornmeal, cranberry, dessert, egg, sour cream, thanksgiving]",2014-10-21
916,Pan Bagnat,"[2, oil-packed anchovy fillets, drained, finely chopped, 2, small garlic cloves, finely grated, 2, tablespoons capers, drained, chopped, 2, tablespoons red wine vinegar, 1, tablespoon Dijon mustard, 1, small red onion, very thinly sliced, 1, cup mixed olives (such as niçoise, kalamata, and/or Castelvetrano), coarsely chopped, 4, tablespoons extra-virgin olive oil, divided, plus more for drizzling, Kosher salt, freshly ground pepper, 1, 6–7-ounce jar oil-packed tuna, drained, ½, lemon, ½, baguette, lightly toasted, 1½, cups basil leaves, torn, 1½, cups parsley leaves with tender stems, coarsely chopped, 2, large , hard-boiled eggs, , peeled, thinly sliced, 1, large tomato, sliced, 2, jarred roasted red peppers, sliced]","[web recipe, healthyish, anchovy, garlic, capers, vinegar, mustard, onion, olive, tuna, lemon, baguette, basil, parsley, egg, tomato, pepper]",2017-09-05
4277,Real-Deal Aioli,"[1, large egg yolk, 4, medium garlic cloves, finely grated, ½, teaspoon kosher salt, ½, cup olive oil]","[egg, garlic, aioli]",2015-12-15
1921,The Slipway Crab Roll,"[8, oz. fresh Jonah or peekytoe crabmeat, 2, tablespoons mayonnaise, Kosher salt, 2, tablespoons unsalted butter, room temperature, 2, New England–style split-top hot dog buns, Green-leaf lettuce leaves (for serving), Freshly ground white pepper]","[crab, mayonnaise, sandwich, summer]",2014-06-24
4620,Classic Potato Gratin,"[5, garlic cloves, divided, 1, tablespoon unsalted butter, room temperature, 2, medium shallots, quartered through root ends, 2½, cups heavy cream, 1, tablespoon kosher salt, 1, teaspoon freshly ground black pepper, 1, tablespoon thyme leaves, plus more, 4, pounds russet potatoes, scrubbed, very thinly sliced on a mandoline, 3, ounces Gruyère, finely grated, 1, ounce Parmesan, finely grated]","[cheese, cream, garlic, casserole, gruyere, parmesan, potato, shallot, thanksgiving, thyme]",2015-10-20


Based on my inspection of the randomly selected rows, it looks like the older recipes from 2014 and 2015 tend to have fewer tags. Consequently, the tags are more likely to miss out on key ingredients. For instance, the Chicory-Apple Salad with Brown Butter Dressing does not include chicory in the tags.

In [9]:
# check if my hunch about the tags is right
# create new column with the number of tags per recipe
raw['tags_count'] = raw['tags'].str.len()

# add column approximating length of ingredients lists
raw['ingred_elem_count'] = raw['ingredients'].str.len()

# group by year
raw.groupby(pd.DatetimeIndex(raw['date']).year).agg({'tags_count': 'mean',
                                                     'ingred_elem_count': 'mean'})



date,tags_count,ingred_elem_count
2012.0,9.625,13.625
2013.0,7.792453,16.075472
2014.0,8.312693,18.930341
2015.0,6.2912,19.992
2016.0,8.479279,20.522523
2017.0,10.179612,20.144013
2018.0,10.87395,20.55042
2019.0,12.240541,21.535135
2020.0,12.09,22.78
2021.0,13.725806,24.16129


In [10]:
# reset only the display option changed above
pd.reset_option('display.max_colwidth')



In [11]:
raw['date'].groupby(pd.DatetimeIndex(raw['date']).year).agg('count')

date
2012.0      8
2013.0     53
2014.0    646
2015.0    625
2016.0    555
2017.0    618
2018.0    476
2019.0    370
2020.0    300
2021.0     62
Name: date, dtype: int64

The yearly averages above indeed show that recipes published in 2017 and after have more tags. Still, the larger number of tags could partly be due to the increased number of ingredients in a recipe. (Bear in mind that 'ingred_elem_count' is only an approximation of the number of ingredients since I have not cleaned the 'ingredients' column.)
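
One way to account for recipes with more ingredients is to look at the number of tags per ingredient element rather than the raw tag count. A rough sketch, assuming a hypothetical column name `tags_per_elem` and using only four rows copied from the tables in this notebook, so the numbers are illustrative rather than a result:

```python
import pandas as pd

# Four rows taken from the displayed tables; far too few to draw conclusions
toy = pd.DataFrame({
    "date": ["2015-09-22", "2015-10-29", "2020-11-15", "2020-12-27"],
    "tags_count": [7, 8, 15, 13],
    "ingred_elem_count": [28, 21, 26, 25],
})

# Tags per ingredient element ('tags_per_elem' is an assumed column name)
toy["tags_per_elem"] = toy["tags_count"] / toy["ingred_elem_count"]

# Average the ratio within each publication year
by_year = toy.groupby(pd.to_datetime(toy["date"]).dt.year)["tags_per_elem"].mean()
print(by_year)
```

If the ratio also rises over the years on the full data, the increase in tags is not explained by longer ingredient lists alone.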

Given that the early years have a larger number of recipes, it would not be possible to disregard them and focus only on the years which have more complete tags.

### Cleaning ingredients

#### Regex of tags over ingredients column

Since I'm assuming that the tags on the website contain all key ingredients used in the recipes, I can search for the tags in the ingredients column.

In [12]:
# change ingredients column to lowercase 
raw['ingredients']= raw['ingredients'].map(lambda x: list(map(str.lower, x)))


In [13]:
tags_query = r"\b(?:%s)" % "|".join(list(set(itertools.chain.from_iterable(raw.tags))))
tags_query

"\\b(?:bread pudding|pork|leafy greens|ghee|flatbread|prosciutto|sumac|cardamom|cold soda|feta|beef tenderloin|hot pepper|orecchiette|jewish|chowder|lasagna|summer desserts|achiote paste|low fat|ina garten|napa cabbage|chai|confit|pork loin|cognac/armagnac|turnover|lamb shank|cucumber|russet potato|cacao nib|southern|palm sugar|basics|currant|grape tomatoes|bobby flay|peanut|meatloaf|olive|lunch al desko|cinnamon|potato|cayenne|peppadew|cornbread|star anise|cherry tomatoes|dark chocolate|macaroons|bread flour|low sodium|linguine|tomato paste|hominy|roast beef|sherry wine vinegar|hard-boiled eggs|green onion scallion|vinaigrette|fettuccine|kitchen sketches|custard|ground pork|date|weeknight|tartare|oxtail|syrup|melon|danny bowien|strawberry|bars|thai basil|white wine|asparagus|schnitzel|casserole|tequila|shellfish|paprika|kaffir lime leaves|apple pie|anchovy|octopus|mascarpone|banana|pimms|miso|matcha|romaine|nutmeg|pinto beans|grill|popsicle|kimchi|loaf|porcini|egg yolks|crudite|potato

Within the regex search, some adjustments need to be made:
- a few tags have slashes in them, which may hinder the search (cognac/armagnac and sweet potato/yam)
- chicken recipes --> chicken
- green onion scallion --> green onion | scallion | spring onion (since spring onion is not within the tags)
- tea should be followed by a '\\b' because the string 'tea' may be found within other words (e.g. 'teaspoon')

In [14]:
for r in (("sweet potato/yam", "sweet potato|yam"), ("cognac/armagnac", "cognac|armagnac"), ("chicken recipes", "chicken"), ("green onion scallion", "green onion|scallion|spring onion"), ("|tea|", "|tea\\b|")):
    tags_query = tags_query.replace(*r)
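
An alternative to patching the query string by hand is to build the pattern from the tag list with escaping, word boundaries on both sides, and longer tags tried first. This is only a sketch: `toy_tags` and `build_tags_pattern` are hypothetical names, and the optional trailing `s` is a crude assumption about plurals:

```python
import re

# Hypothetical subset of tags; the real list comes from raw.tags
toy_tags = ["egg", "pie", "tea", "pork", "pork loin", "cognac/armagnac"]

def build_tags_pattern(tags):
    """Compile an alternation that prefers longer tags and matches whole words."""
    # Split slash-separated tags into separate alternatives
    terms = {part.strip() for tag in tags for part in tag.split("/")}
    # Longest first, so 'pork loin' is tried before 'pork'
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    # re.escape protects regex metacharacters; 's?' allows a simple plural
    # and sits outside the group, so findall returns the singular tag
    return re.compile(r"\b(%s)s?\b" % "|".join(re.escape(t) for t in ordered))

pattern = build_tags_pattern(toy_tags)
print(pattern.findall("2 eggs, 1 piece pork loin, steamed, with tea and cognac"))
# ['egg', 'pork loin', 'tea', 'cognac']
```

With the trailing `\b`, 'pie' no longer matches inside 'piece' and 'tea' no longer matches inside 'steamed'.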

In [15]:
# check if words from tags_query list are found in the ingredients column 
# and add a column showing the tag(s) from the query that show a match
# by joining the elements of the list into a string and then using str.findall
raw.loc[:,'tags_in_ingred'] = raw['ingredients'].str.join(' ').str.findall( pat = '({})'.format(tags_query))
raw

# but this might not be the best approach since it takes a long time



Unnamed: 0,title,author,date,ingredients,rating,ratings_count,review_count,tags,url,tags_count,ingred_elem_count,tags_in_ingred
0,Brussels Sprouts Nasi Goreng,Meera Sodha,2020-11-15,"[2, tbsp. kecap manis or 3 tbsp. agave nectar ...",3.6,26,1,"[basically, brussels sprout, agave, soy sauce,...",https://www.bonappetit.com/recipe/brussels-spr...,15,26,"[agave, soy sauce, rice, brussels sprout, red ..."
1,Huevos Rancheros con Rajas y Champiñones,Rick Martinez,2021-01-03,"[1, serrano chile, 16, oz. cherry tomatoes (ab...",4.6,12,1,"[basically, egg, serrano chiles, cherry tomato...",https://www.bonappetit.com/recipe/huevos-ranch...,13,26,"[serrano, chile, cherry tomatoes, onion, cilan..."
2,Green Seasoning Baked Cod,Brigid Washington,2021-01-10,"[¼, vidalia or other sweet onion, 4, 6-oz. ski...",4.2,19,6,"[basically, fish, seafood, onion, cod, kosher ...",https://www.bonappetit.com/recipe/green-season...,18,26,"[onion, cod, kosher salt, black pepper, olive,..."
3,Black-Eyed Pea Masala With Kale,Rachel Gurjar,2020-12-27,"[1, large white onion, 4, garlic cloves, 1, 1""...",4.5,55,22,"[basically, black-eyed peas, indian food, kale...",https://www.bonappetit.com/recipe/black-eyed-p...,13,25,"[onion, garlic, clove, pie, ginger, tuscan, ka..."
4,Pot Roast Brisket With Harissa and Spices,Sabrina Ghayour,2020-11-22,"[5½, lb. untrimmed flat-cut beef brisket, pref...",4.8,11,3,"[basically, brisket, harissa, kosher salt, cin...",https://www.bonappetit.com/recipe/pot-roast-br...,8,15,"[beef, brisket, roll, rose, harissa, kosher sa..."
...,...,...,...,...,...,...,...,...,...,...,...,...
4692,Spiced Pear Upside-Down Cake,Claire Saffitz,2015-09-22,"[2, tablespoons unsalted butter, plus more for...",4.2,21,13,"[cake, cardamom, dessert, olive oil, orange, p...",https://www.bonappetit.com/recipe/spiced-pear-...,7,28,"[butter, flour, orange, juice, pomegranate, mo..."
4693,Double Ginger Sticky Toffee Pudding,Claire Saffitz,2015-09-22,"[cake, ½, cup (1 stick) unsalted butter, room ...",4.8,13,7,"[cake, cream, date, dessert, ginger, pudding, ...",https://www.bonappetit.com/recipe/double-ginge...,7,33,"[cake, butter, flour, medjool, date, soda, kos..."
4694,Apple Caramels,Rick Martinez,2015-10-29,"[nonstick vegetable oil spray, ½, cup blanched...",4.0,3,2,"[apple, apple cider, brandy, calvados, caramel...",https://www.bonappetit.com/recipe/apple-caramels,8,21,"[oil, hazelnut, cinnamon, apple cider, sugar, ..."
4695,"Chicken Skin with Peanuts, Chiles, and Lime","Eli Dahlin, Damn the Weather, Seattle, WA",2015-09-22,"[¼, cup peanut or vegetable oil, 8, garlic clo...",,0,0,"[chicken skin, jalapeno, lime, peanut, green o...",https://www.bonappetit.com/recipe/chicken-skin...,7,16,"[peanut, oil, garlic, clove, scallion, green, ..."


In [16]:
# count number of times each tag appears
pd.Series(list(itertools.chain.from_iterable(raw.tags_in_ingred))).value_counts().head(50)

kosher salt         3600
oil                 2966
olive               1857
garlic              1603
clove               1482
pie                 1429
black pepper        1198
sugar               1148
butter              1082
seed                 995
lemon                840
pepper               813
egg                  785
chile                766
lemon juice          689
flour                654
vinegar              622
onion                566
green                558
cilantro             526
chicken              511
toast                495
ginger               482
parsley              480
red pepper           471
milk                 444
scallion             429
honey                375
shallot              367
mint                 366
lime juice           364
bread                348
zest                 347
rice                 342
sea salt             317
brown sugar          317
white wine           316
lime                 314
soy sauce            309
mustard              306


Based on this initial count, some ingredient tags have unusually large counts. For instance, the spice 'clove' is mixed up with results for 'garlic cloves', and 'pie' is probably matched inside longer words such as 'piece', because the pattern has no trailing word boundary.  
Furthermore, unsurprisingly, the more general tags (e.g. flour, juice, vinegar) have higher counts than the specific tags (e.g. wholewheat flour, lemon juice, red wine vinegar).
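
The 'clove' versus 'garlic cloves' mix-up could in principle be handled with a negative lookbehind. A minimal sketch, assuming garlic is always written as 'garlic clove(s)'; other phrasings such as 'cloves of garlic' would still be counted as the spice:

```python
import re

# 'clove' only counts as the spice when it is not preceded by 'garlic '
clove_spice = re.compile(r"(?<!garlic )\bcloves?\b")

lines = [
    "4 garlic cloves, finely grated",
    "6 whole cloves",
    "1 tsp. ground cloves",
]
print([bool(clove_spice.search(line)) for line in lines])  # [False, True, True]
```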

In [17]:
# value counts as dataframe
tags_ingred_counts = pd.Series(list(itertools.chain.from_iterable(raw.tags_in_ingred))).value_counts().rename_axis('tags').reset_index(name='counts')
tags_ingred_counts

Unnamed: 0,tags,counts
0,kosher salt,3600
1,oil,2966
2,olive,1857
3,garlic,1603
4,clove,1482
...,...,...
690,passion fruit,1
691,blackberry,1
692,scrambled eggs,1
693,omelet,1


In [18]:
# manually inspect to remove irrelevant tags
tags_ingred_counts.tags.unique()

array(['kosher salt', 'oil', 'olive', 'garlic', 'clove', 'pie',
       'black pepper', 'sugar', 'butter', 'seed', 'lemon', 'pepper',
       'egg', 'chile', 'lemon juice', 'flour', 'vinegar', 'onion',
       'green', 'cilantro', 'chicken', 'toast', 'ginger', 'parsley',
       'red pepper', 'milk', 'scallion', 'honey', 'shallot', 'mint',
       'lime juice', 'bread', 'zest', 'rice', 'sea salt', 'brown sugar',
       'white wine', 'lime', 'soy sauce', 'mustard', 'sauce', 'cream',
       'corn', 'sesame seed', 'coconut', 'tomato', 'thyme', 'red onion',
       'red wine vinegar', 'bean', 'parmesan', 'roast', 'carrot',
       'cinnamon', 'cucumber', 'juice', 'celery', 'cumin', 'basil',
       'orange', 'rice vinegar', 'vanilla extract', 'coriander', 'grape',
       'almond', 'cheese', 'chive', 'fennel', 'rib', 'spice',
       'greek yogurt', 'soda', 'dill', 'fish', 'mushroom', 'broth',
       'apple cider', 'serrano', 'mayonnaise', 'salt', 'radish', 'pork',
       'potato', 'peanut', 'sesame

From my manual inspection, I can see quite a few tags that should not be considered ingredients. These include but are not limited to:
- overly general terms (e.g. juice, zest, poultry)
- food or components of a dish (e.g. waffle, turnover, breakfast)
- special equipment (e.g. instant pot)
- descriptive terms (e.g. thai, persian, low sodium)

In [19]:
remove_tags = ['juice', 'green', 'zest', 'pepper', 'fry', 'rib', 'japanese', 'italian', 'wine',
               'persian', 'dough', 'spring', 'tart', 'syrup', 'chinese', 'liqueur', 'side', 'flan',
               'salsa verde', 'breakfast', 'barbecue', 'porridge', 'gravy', 'pie crust',
               'waffle', 'frosting', 'pancake', 'dairy', 'grains', 'liver', 'braise', 'winter',
               'shortbread', 'shake', 'thanksgiving', 'torte', 'poultry', 'smoker',
               'snack', 'burgers', 'panna cotta', 'macaroons', 'instant pot', 'dumplings', 
               'easter', 'pressure cooker', 'hash', 'gluten free', 'guacamole', 'tosca',
               'brownies', 'fall', 'easy', 'frittata', 'coleslaw', 'chowder', 'pho',
               'posole', 'low sodium', 'crepe', 'omelet', 'brittle', 'dessert',
               'carbonara', 'cast iron', 'turnover', 'clove', 'american', 'asian', 'cake',
               'crisp', 'crust', 'french', 'grill', 'herb', 'korean', 'mandoline', 'middle eastern',
               'nut', 'pie', 'quick', 'roast', 'rub', 'sauce', 'skewer', 'spanish', 'thai', 'tuscan', 'vegetables', 'seed', 'tex-mex']
tags_ingred_clean = tags_ingred_counts[~tags_ingred_counts.tags.isin(remove_tags)]
tags_ingred_clean

Unnamed: 0,tags,counts
0,kosher salt,3600
1,oil,2966
2,olive,1857
3,garlic,1603
6,black pepper,1198
...,...,...
689,string beans,1
690,passion fruit,1
691,blackberry,1
692,scrambled eggs,1


The tag counts alone are of limited use because they are detached from the rest of each recipe's data. What I need to do is create dummy columns for the tags and then remove any dummy columns for irrelevant tags.

In [20]:
# convert the tags_in_ingred column into dummy (0/1) columns
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

ingred_dummy = pd.DataFrame(mlb.fit_transform(raw['tags_in_ingred']), columns=mlb.classes_, index=raw.index)

# remove unwanted cols (see remove_tags list); errors='ignore' skips tags with no matching column
ingred_dummy = ingred_dummy.drop(columns=remove_tags, errors='ignore')

# merge dummy cols with raw df
# note: the 'date' tag clashes with the 'date' column, so merge renames them date_x and date_y
raw_dummy = raw.merge(ingred_dummy, left_index=True, right_index=True)


In [21]:
# quick check to see if merge was performed correctly
raw_dummy.loc[0, 'brussels sprout']

1

The resulting dataframe contains the binary columns for each tag found in the ingredients column.

In [22]:
raw_dummy

Unnamed: 0,title,author,date_x,ingredients,rating,ratings_count,review_count,tags,url,tags_count,...,whole wheat,wild rice,worcestershire sauce,yam,yeast,yogurt,yukon gold,yuzu,ziti,zucchini
0,Brussels Sprouts Nasi Goreng,Meera Sodha,2020-11-15,"[2, tbsp. kecap manis or 3 tbsp. agave nectar ...",3.6,26,1,"[basically, brussels sprout, agave, soy sauce,...",https://www.bonappetit.com/recipe/brussels-spr...,15,...,0,0,0,0,0,0,0,0,0,0
1,Huevos Rancheros con Rajas y Champiñones,Rick Martinez,2021-01-03,"[1, serrano chile, 16, oz. cherry tomatoes (ab...",4.6,12,1,"[basically, egg, serrano chiles, cherry tomato...",https://www.bonappetit.com/recipe/huevos-ranch...,13,...,0,0,0,0,0,0,0,0,0,0
2,Green Seasoning Baked Cod,Brigid Washington,2021-01-10,"[¼, vidalia or other sweet onion, 4, 6-oz. ski...",4.2,19,6,"[basically, fish, seafood, onion, cod, kosher ...",https://www.bonappetit.com/recipe/green-season...,18,...,0,0,0,0,0,0,0,0,0,0
3,Black-Eyed Pea Masala With Kale,Rachel Gurjar,2020-12-27,"[1, large white onion, 4, garlic cloves, 1, 1""...",4.5,55,22,"[basically, black-eyed peas, indian food, kale...",https://www.bonappetit.com/recipe/black-eyed-p...,13,...,0,0,0,0,0,0,0,0,0,0
4,Pot Roast Brisket With Harissa and Spices,Sabrina Ghayour,2020-11-22,"[5½, lb. untrimmed flat-cut beef brisket, pref...",4.8,11,3,"[basically, brisket, harissa, kosher salt, cin...",https://www.bonappetit.com/recipe/pot-roast-br...,8,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4692,Spiced Pear Upside-Down Cake,Claire Saffitz,2015-09-22,"[2, tablespoons unsalted butter, plus more for...",4.2,21,13,"[cake, cardamom, dessert, olive oil, orange, p...",https://www.bonappetit.com/recipe/spiced-pear-...,7,...,0,0,0,0,0,0,0,0,0,0
4693,Double Ginger Sticky Toffee Pudding,Claire Saffitz,2015-09-22,"[cake, ½, cup (1 stick) unsalted butter, room ...",4.8,13,7,"[cake, cream, date, dessert, ginger, pudding, ...",https://www.bonappetit.com/recipe/double-ginge...,7,...,0,0,0,0,0,0,0,0,0,0
4694,Apple Caramels,Rick Martinez,2015-10-29,"[nonstick vegetable oil spray, ½, cup blanched...",4.0,3,2,"[apple, apple cider, brandy, calvados, caramel...",https://www.bonappetit.com/recipe/apple-caramels,8,...,0,0,0,0,0,0,0,0,0,0
4695,"Chicken Skin with Peanuts, Chiles, and Lime","Eli Dahlin, Damn the Weather, Seattle, WA",2015-09-22,"[¼, cup peanut or vegetable oil, 8, garlic clo...",,0,0,"[chicken skin, jalapeno, lime, peanut, green o...",https://www.bonappetit.com/recipe/chicken-skin...,7,...,0,0,0,0,0,0,0,0,0,0
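
The `date_x` column in the table above appears because the tag 'date' (the fruit) produces a dummy column with the same name as the recipe's `date` column, and `merge` adds suffixes to tell them apart. A pandas-only sketch of one way to avoid the clash, assuming an illustrative `ingred_` prefix and a made-up two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame; the first recipe uses dates as an ingredient
toy = pd.DataFrame({
    "title": ["Double Ginger Sticky Toffee Pudding", "Real-Deal Aioli"],
    "date": ["2015-09-22", "2015-12-15"],
    "tags_in_ingred": [["butter", "date", "flour"], ["egg", "garlic"]],
})

# One 0/1 column per tag, built with pandas only
dummies = toy["tags_in_ingred"].str.join("|").str.get_dummies(sep="|")

# The prefix (an illustrative choice) keeps the 'date' tag apart from the 'date' column
combined = toy.join(dummies.add_prefix("ingred_"))
print(combined.columns.tolist())
```

A prefix also makes it easy to select all ingredient columns later, for example with `combined.filter(like="ingred_")`.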


In [27]:
# save the raw_dummy dataframe into a pickle
raw_dummy.to_pickle("recipes_processed.pkl")