# Data Preprocessing

This notebook is part of the `Fried Chicken Cost Analysis` project and contains the steps taken to clean and transform the web-scraped HTML data into structured tabular data. The cleaned data will then be used to build the cost comparison chart.

Goal: Extract ingredients and amounts from the web-scraped data.

# Import Packages and Define Functions

In [1]:
# General data processing
import numpy as np
import pandas as pd

# Packages for pre-processing text
import nltk                       # Natural Language Tool Kit
nltk.download('wordnet')          # For lemmatization
nltk.download('stopwords')        # For processing stop words (words too common to hold significant meaning)
from nltk.corpus import stopwords # Import above downloaded stopwords
import re                         # Regular Expression
import string                     # For identifying punctuation

# Converting lists and dictionaries stored as strings within DataFrames back to lists and dictionaries
import ast

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Load Data

In [2]:
# Load the scraped data from allrecipes.com
df = pd.read_csv("../11_raw_data/20231031-2328_scraped_fc_recipes.csv", index_col = 0)

# Examine DataFrame
df.head()

Unnamed: 0,recipe_url,contents
0,https://www.allrecipes.com/recipe/8805/crispy-...,"{'@context': 'http://schema.org', '@type': ['R..."
1,https://www.allrecipes.com/recipe/8841/oven-fr...,"{'@context': 'http://schema.org', '@type': ['R..."
2,https://www.allrecipes.com/recipe/89268/triple...,"{'@context': 'http://schema.org', '@type': ['R..."
3,https://www.allrecipes.com/recipe/220128/chef-...,"{'@context': 'http://schema.org', '@type': ['R..."
4,https://www.allrecipes.com/recipe/150306/the-b...,"{'@context': 'http://schema.org', '@type': ['R..."


The loaded data consists of 2 columns:
- `recipe_url`: the Uniform Resource Locator (URL) of each recipe.
- `contents`: the data scraped from each recipe's URL stored as a dictionary.

To better understand the scraped data, the keys of the dictionary within `contents` were explicitly examined.

In [3]:
# Examine keys in JSON dictionary containing data within `contents`
sorted_dict_keys = sorted(list(ast.literal_eval(df.loc[0,"contents"]).keys()))

print(f"Dictionary Keys for Scraped Recipe Data:\n")
for index, key in enumerate(sorted_dict_keys):
    print(f"{str(index + 1).rjust(2,'0')}: {key} ")

Dictionary Keys for Scraped Recipe Data:

01: @context 
02: @type 
03: about 
04: aggregateRating 
05: author 
06: cookTime 
07: dateModified 
08: datePublished 
09: description 
10: headline 
11: image 
12: mainEntityOfPage 
13: name 
14: nutrition 
15: prepTime 
16: publisher 
17: recipeCategory 
18: recipeCuisine 
19: recipeIngredient 
20: recipeInstructions 
21: recipeYield 
22: review 
23: totalTime 
24: video 


Specific to material cost analysis, keys `19: recipeIngredient` and `21: recipeYield` are most likely to contain information on the materials used in each recipe and their portioning. This was confirmed below when examining the values associated with each key.

In [4]:
# Examine recipeIngredient
recipe_name        = ast.literal_eval(df.loc[3,"contents"])["name"]
recipe_portion     = ast.literal_eval(df.loc[3,"contents"])["recipeYield"]
recipe_ingredients = ast.literal_eval(df.loc[3,"contents"])["recipeIngredient"]

print(f"Ingredients for {recipe_name}, yields {recipe_portion} portions.\n")

for index, ing in enumerate(recipe_ingredients):
    print(f"{str(index + 1).rjust(2,'0')}: {ing} ")

Ingredients for Chef John&#39;s Buttermilk Fried Chicken, yields ['4'] portions.

01: 1 (3 1/2) pound chicken, cut into 8 pieces 
02: 1 teaspoon black pepper 
03: 1 teaspoon salt 
04: 1 teaspoon paprika 
05: 0.5 teaspoon white pepper 
06: 0.25 teaspoon dried rosemary 
07: 0.25 teaspoon ground thyme 
08: 0.25 teaspoon dried oregano 
09: 0.25 teaspoon dried sage 
10: 0.25 teaspoon cayenne pepper 
11: 2 cups buttermilk 
12: 2 cups flour 
13: 1 teaspoon salt 
14: 0.5 teaspoon paprika 
15: 0.5 teaspoon cayenne pepper 
16: 0.5 teaspoon garlic powder 
17: 0.5 teaspoon white pepper 
18: 0.5 teaspoon onion powder 
19: 2.5 quarts peanut oil for frying 


Observing Chef John's Buttermilk Fried Chicken recipe's ingredient list:
- ingredient amounts are listed first, followed by the unit of measurement and the ingredient name itself.
- given Allrecipes.com is an American recipe website, the units were assumed to be US customary units.
- units switch between volume and mass.
- no system to separate ingredients into subprocesses (seasoning the chicken vs. preparing the batter).

Thus, the steps below were taken to process the ingredient lists for each recipe:
1) Extract ingredient amounts
2) Extract unit of measurement
3) Extract ingredient name

## Identify only Fried Chicken Recipes

Among the 120 scraped URLs, not every URL refers to a fried chicken recipe, because the URLs came from a search query on Allrecipes.com. Thus, before extracting ingredients, the scraped URLs need to be filtered so that only fried chicken recipes remain. This raises the question: what is considered a fried chicken recipe?

### Definition of a `fried chicken recipe`:

- the star of the dish must be fried chicken
    - mixtures like `fried chicken fried rice` were not considered fried chicken dishes
- the protein must be whole chicken
    - recipes using only specific parts of chicken, such as the wings and breasts, were excluded
- the chicken must be fried in oil
    - recipes that use ovens to bake the chicken were excluded

### Remove Non-recipe URLs

The first step was to remove any URLs that are not recipes. On Allrecipes, any URL that does not contain the string `/recipe/` refers to another type of page, such as a guide or a recipe collection.

In [5]:
# Keep only recipes, exclude articles, recipe repositories, and others
cond = df["recipe_url"].str.contains("/recipe/")

print(f"Before dropping non-recipe URLs: {df.shape}")
df = df.loc[cond]
print(f"After dropping non-recipe URLs: {df.shape}")

Before dropping non-recipe URLs: (120, 2)
After dropping non-recipe URLs: (102, 2)
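`Series.str.contains` interprets its pattern as a regular expression by default. For a literal substring such as `/recipe/` the result is the same, but passing `regex=False` makes the intent explicit. A small sketch with hypothetical URLs:

```python
import pandas as pd

# Hypothetical URLs standing in for the scraped search results
toy_urls = pd.Series([
    "https://example.com/recipe/1/crispy-fried-chicken/",
    "https://example.com/gallery/fried-chicken-ideas/",
    "https://example.com/article/how-to-fry-chicken/",
])

# Literal substring match; no regular-expression interpretation
is_recipe = toy_urls.str.contains("/recipe/", regex=False)
print(toy_urls[is_recipe].tolist())
```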


### Filter by Rating?

Next, the average rating and rating count for each URL were extracted using the `aggregateRating` key.

In [6]:
# Examine the aggregate rating key
ast.literal_eval(df.loc[3,"contents"])["aggregateRating"]

{'@type': 'AggregateRating', 'ratingValue': '4.5', 'ratingCount': '487'}

In [7]:
# Initiate blank lists to store values and counts
rating_values = []
rating_counts = []

# Iterate through the dataframe, extracting ratings and rating counts
for row in df.itertuples():
    # Extract rating values and counts
    rating_values.append(ast.literal_eval(row[2])["aggregateRating"]["ratingValue"])
    rating_counts.append(ast.literal_eval(row[2])["aggregateRating"]["ratingCount"])
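The loop above evaluates each `contents` string twice and assumes every recipe carries an `aggregateRating` key. Whether any scraped recipe lacks the key is not known from this notebook, so the following is only a defensive sketch: it parses each string once and falls back to `NaN` when a rating is missing. The toy records and the helper `to_float` are hypothetical.

```python
import ast
import math

# Toy records; the second one is a hypothetical recipe without ratings
toy_contents = [
    "{'aggregateRating': {'ratingValue': '4.5', 'ratingCount': '487'}}",
    "{'name': 'Recipe without ratings'}",
]

def to_float(value):
    """Convert a rating string to float, or NaN when the value is missing."""
    return float(value) if value is not None else math.nan

rating_values, rating_counts = [], []
for raw in toy_contents:
    # Parse once, and use an empty dictionary when the key is absent
    rating = ast.literal_eval(raw).get("aggregateRating", {})
    rating_values.append(to_float(rating.get("ratingValue")))
    rating_counts.append(to_float(rating.get("ratingCount")))

print(rating_values, rating_counts)
```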

In [8]:
# Add the extracted values back to the DataFrame
df["rating_value"] = rating_values
df["rating_count"] = rating_counts

# Convert types
df["rating_value"] = df["rating_value"].astype('float')
df["rating_count"] = df["rating_count"].astype('float')

# Examine dataframe
df.head()

Unnamed: 0,recipe_url,contents,rating_value,rating_count
0,https://www.allrecipes.com/recipe/8805/crispy-...,"{'@context': 'http://schema.org', '@type': ['R...",4.6,743.0
1,https://www.allrecipes.com/recipe/8841/oven-fr...,"{'@context': 'http://schema.org', '@type': ['R...",4.3,1076.0
2,https://www.allrecipes.com/recipe/89268/triple...,"{'@context': 'http://schema.org', '@type': ['R...",4.4,965.0
3,https://www.allrecipes.com/recipe/220128/chef-...,"{'@context': 'http://schema.org', '@type': ['R...",4.5,487.0
4,https://www.allrecipes.com/recipe/150306/the-b...,"{'@context': 'http://schema.org', '@type': ['R...",4.5,1479.0


In [9]:
# Examine distribution
df.iloc[:,2:].describe()

Unnamed: 0,rating_value,rating_count
count,102.0,102.0
mean,4.512745,280.803922
std,0.312344,707.749649
min,3.3,1.0
25%,4.3,10.25
50%,4.5,53.5
75%,4.7,203.0
max,5.0,5899.0


The median rating count was 53.5 ratings. Although it is possible to limit the number of recipes by setting a threshold, that threshold would be arbitrary and subject to individual judgement. Thus, the recipe ratings were not used to filter recipes.

### Remove Recipes that fail Definition

A list of common terms was created to identify recipes that do not meet the definition of a `fried chicken recipe` outlined above. The list was curated by examining the titles of recipes and, for now, is a relatively manual process. In larger datasets, more complex NLP models may be trained and used to perform this task.

In [10]:
# Define a list of terms to exclude, then join with pipe for regex
terms_to_exclude = "|".join([
    # terms that indicate no deep frying
    "pan", "bowl", "oven", "bake", "air",

    # cuisines that tend to not use whole chicken
    "korea", "japan", "asia", "marsala","biryani",

    # terms that indicate chicken is not the star
    "stir", "rice", "general", "sandwich", "salad", "steak", "pork", 

    # terms that indicate whole chicken is not used
    "ball", "skin", "leg", "chunk", "liver", "drum", "wing", "breast", "strip", "gizzard", "sauce","loin","thigh","tender"
])

# Create condition that identifies URLs that contain any of these terms.
cond = df["recipe_url"].str.contains(terms_to_exclude)

# Filter out recipes that don't match the conditions
print(f"Before dropping URLs that don't meet definition: {df.shape}")
df = df.loc[~cond]
print(f"After dropping URLs that don't meet definition: {df.shape}")

Before dropping URLs that don't meet definition: (102, 4)
After dropping URLs that don't meet definition: (33, 4)
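Because the joined pattern is matched as a substring anywhere in the URL, a short term can also match inside an unrelated longer word. One way to audit the term list, sketched here with an abbreviated set of terms and hypothetical URLs, is to record which term triggered each exclusion.

```python
import re

# Abbreviated subset of the exclusion terms, for illustration only
terms = ["pan", "oven", "wing", "rice"]
pattern = re.compile("|".join(re.escape(term) for term in terms))

# Hypothetical URLs; the third shows "pan" matching inside "spanish"
toy_urls = [
    "https://example.com/recipe/1/crispy-fried-chicken/",
    "https://example.com/recipe/2/oven-fried-chicken/",
    "https://example.com/recipe/3/spanish-fried-chicken/",
]

# Record the first matching term for each URL (None when nothing matches)
matched_terms = []
for url in toy_urls:
    match = pattern.search(url)
    matched_terms.append(match.group(0) if match else None)

for url, term in zip(toy_urls, matched_terms):
    print(f"{str(term):<5} {url}")
```

Reviewing the matched term next to each URL makes unintended matches easy to spot before the rows are dropped.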


In [11]:
# Reset index for neatness
df.reset_index(inplace = True, drop = True)

# Visually examine the remaining 33 recipes
for row in df.itertuples():
    print(f"""{str(row[0] + 1).rjust(2,'0')}: {row[1]}""")

01: https://www.allrecipes.com/recipe/8805/crispy-fried-chicken/
02: https://www.allrecipes.com/recipe/89268/triple-dipped-fried-chicken/
03: https://www.allrecipes.com/recipe/220128/chef-johns-buttermilk-fried-chicken/
04: https://www.allrecipes.com/recipe/8970/millie-pasquinellis-fried-chicken/
05: https://www.allrecipes.com/recipe/16573/chicken-fried-chicken/
06: https://www.allrecipes.com/recipe/8635/southern-fried-chicken/
07: https://www.allrecipes.com/recipe/24778/better-than-best-fried-chicken/
08: https://www.allrecipes.com/recipe/86047/garlic-chicken-fried-chicken/
09: https://www.allrecipes.com/recipe/15375/fried-chicken-with-creamy-gravy/
10: https://www.allrecipes.com/recipe/87473/mustard-fried-chicken/
11: https://www.allrecipes.com/recipe/8802/tanyas-louisiana-southern-fried-chicken/
12: https://www.allrecipes.com/recipe/8717/deep-south-fried-chicken/
13: https://www.allrecipes.com/recipe/178809/southern-style-buttermilk-fried-chicken/
14: https://www.allrecipes.com/reci

By removing non-recipe URLs and URLs that contain terms contradicting the definition of a `fried chicken recipe`, the original 120 scraped URLs were reduced to just 33. Ingredient extraction will later reveal which recipes actually use whole chicken, which will further narrow down the list.

# Extract Ingredients

Now that mostly fried chicken recipes remain, ingredient extraction was performed on a single recipe first (Chef John's Buttermilk Fried Chicken) before the same extraction methods were repeated for the remaining 32 recipes. 

## Ingredient Amounts

Ingredient amounts were presented first for each ingredient. Thus, the words within each ingredient were split using whitespace, with the first split being the ingredient amounts.

In [12]:
# Store ingredients in a list
ingredient_list = ast.literal_eval(df.loc[2,"contents"])["recipeIngredient"]

print(f"Ingredient Amounts extracted from Chef John's Buttermilk Fried Chicken\n")

# Visually examine the results
for index, ing in enumerate(ingredient_list):
    print(f"""{str(index + 1).rjust(2,'0')}: {str.split(ing," ")[0].ljust(5," ")} \t {str.split(ing," ")[1:]}""")    

Ingredient Amounts extracted from Chef John's Buttermilk Fried Chicken

01: 1     	 ['(3', '1/2)', 'pound', 'chicken,', 'cut', 'into', '8', 'pieces']
02: 1     	 ['teaspoon', 'black', 'pepper']
03: 1     	 ['teaspoon', 'salt']
04: 1     	 ['teaspoon', 'paprika']
05: 0.5   	 ['teaspoon', 'white', 'pepper']
06: 0.25  	 ['teaspoon', 'dried', 'rosemary']
07: 0.25  	 ['teaspoon', 'ground', 'thyme']
08: 0.25  	 ['teaspoon', 'dried', 'oregano']
09: 0.25  	 ['teaspoon', 'dried', 'sage']
10: 0.25  	 ['teaspoon', 'cayenne', 'pepper']
11: 2     	 ['cups', 'buttermilk']
12: 2     	 ['cups', 'flour']
13: 1     	 ['teaspoon', 'salt']
14: 0.5   	 ['teaspoon', 'paprika']
15: 0.5   	 ['teaspoon', 'cayenne', 'pepper']
16: 0.5   	 ['teaspoon', 'garlic', 'powder']
17: 0.5   	 ['teaspoon', 'white', 'pepper']
18: 0.5   	 ['teaspoon', 'onion', 'powder']
19: 2.5   	 ['quarts', 'peanut', 'oil', 'for', 'frying']


Ingredient amounts were extracted successfully in decimal form.
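The amounts in this recipe are already decimals, so `float()` is sufficient. If another recipe were to write an amount as a fraction such as `1/2` (an assumption, not something observed above), `float()` would raise a `ValueError`. A hedged sketch of a more tolerant parser using the standard library's `fractions.Fraction` follows; `parse_amount` is a hypothetical helper.

```python
from fractions import Fraction

def parse_amount(token):
    """Return the amount as a float, or None if the token is not numeric."""
    try:
        # Fraction accepts integers, decimals and "a/b" strings
        return float(Fraction(token))
    except (ValueError, ZeroDivisionError):
        return None

for token in ["1", "0.25", "1/2", "salt"]:
    print(token, "->", parse_amount(token))
```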

## Unit of Measurement (UoM)

Next, the unit of each ingredient was observed to not necessarily come directly after the ingredient amount, as is the case with ingredient `01`, chicken, in the previous section. Thus, the Natural Language Toolkit (NLTK) was used to standardize the form of each word (singular vs. plural) and to exclude stopwords from being picked up. After Porter stemming and stopword removal, each token was checked against a list of units of measurement common in the American kitchen.

In [13]:
# Define units of measurement common to the American home kitchen
measurements = [
    "teaspoon", 
    "tablespoon",
    "cup",
    "quart",
    "pound",
    "ounce"
]

# Define stopwords
eng_stopwords = stopwords.words("english")

# Define a stemmer
stemmer = nltk.stem.PorterStemmer()

print(f"Units of Measurement extracted from Chef John's Buttermilk Fried Chicken\n")

# Iterate through each ingredient
for index, ing in enumerate(ingredient_list):

    # Create variable for printing
    ing_org = ing
    
    # Remove punctuation and take lower case
    for punctuation_mark in string.punctuation:
        ing = ing.replace(punctuation_mark,"").lower()
        
    # Split words into tokens based on whitespace
    tokens = ing.split(" ")

    # Initiate blank list to store stemmed tokens
    stemmed_tokens = []

    # Iterate through all but first token (1st token is ingredient amount)
    for token in tokens[1:]:

        # Exclude stopwords and "", then append stemmed token to blank list
        if (not token in eng_stopwords) and token != "":
            stemmed_tokens.append(stemmer.stem(token))

    # Compare each token to list of common measurements, keeping only those which are units and the first token
    # Each ingredient can only have 1 unit of measurement
    uom = [token for token in stemmed_tokens if token in measurements][0]
    
    print(f"""{str(index + 1).rjust(2,'0')}: {uom} \t was extracted from \t {ing_org}""")   

Units of Measurement extracted from Chef John's Buttermilk Fried Chicken

01: pound 	 was extracted from 	 1 (3 1/2) pound chicken, cut into 8 pieces
02: teaspoon 	 was extracted from 	 1 teaspoon black pepper
03: teaspoon 	 was extracted from 	 1 teaspoon salt
04: teaspoon 	 was extracted from 	 1 teaspoon paprika
05: teaspoon 	 was extracted from 	 0.5 teaspoon white pepper
06: teaspoon 	 was extracted from 	 0.25 teaspoon dried rosemary
07: teaspoon 	 was extracted from 	 0.25 teaspoon ground thyme
08: teaspoon 	 was extracted from 	 0.25 teaspoon dried oregano
09: teaspoon 	 was extracted from 	 0.25 teaspoon dried sage
10: teaspoon 	 was extracted from 	 0.25 teaspoon cayenne pepper
11: cup 	 was extracted from 	 2 cups buttermilk
12: cup 	 was extracted from 	 2 cups flour
13: teaspoon 	 was extracted from 	 1 teaspoon salt
14: teaspoon 	 was extracted from 	 0.5 teaspoon paprika
15: teaspoon 	 was extracted from 	 0.5 teaspoon cayenne pepper
16: teaspoon 	 was extracted from 	 0

The units of measurement for all 19 ingredients were extracted successfully.
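Indexing the list of matches with `[0]` raises an `IndexError` when an ingredient has no recognised unit, for example one that is counted by the piece. A minimal sketch of returning `None` in that case is given below. The helper `extract_uom` is hypothetical, and `naive_singular` is a deliberately crude stand-in for the stemmer so that the example runs without NLTK.

```python
import string

measurements = ["teaspoon", "tablespoon", "cup", "quart", "pound", "ounce"]

def naive_singular(token):
    """Stand-in for the stemmer: strip one trailing 's' (an assumption, not NLTK)."""
    return token[:-1] if token.endswith("s") else token

def extract_uom(ingredient, normalize=naive_singular):
    """Return the first known unit in the ingredient string, or None."""
    # Remove punctuation and lower-case, as in the notebook's own cleaning step
    cleaned = ingredient.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = cleaned.split()
    # Skip the first token, which holds the ingredient amount
    candidates = (normalize(token) for token in tokens[1:])
    return next((token for token in candidates if token in measurements), None)

print(extract_uom("2 cups buttermilk"))
print(extract_uom("6 skinless, boneless chicken breast halves"))
```

In the notebook itself, `stemmer.stem` could be passed as the `normalize` argument instead of the stand-in.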

## Ingredient Name

In the third extraction, identifying the core ingredient in a list of words requires a way of assigning importance to each token based on its neighbours. Although this can be achieved using more complex NLP models that take into account word ordering and semantics, a similar method to the previous section was used, whereby each token was compared to a list of common ingredients found in fried chicken recipes.

Furthermore, as ingredients may contain more than 1 word (example: black pepper, white pepper), the strategy used was to first identify if the ingredient contains the common term `pepper`, then to add the matched tokens `white` or `black` to the common term `pepper`, resulting in `black pepper` and `white pepper`. Aside from `pepper`, this strategy was applied to other common terms like `oil` and `powder`.

In [14]:
# Define common ingredients in fried chicken
common_ingredients = [
    "chicken",
    "cayenne",
    "paprika",
    "rosemary",
    "thyme",
    "oregano",
    "sage",
    "buttermilk",
    "salt",
    "flour",
    "onion",
    "garlic",
    "vegetable",
    "peanut",
    "coconut",    
    "white",      # white pepper
    "black"       # black pepper
]
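The compound-name strategy described above can be sketched as a small function. It assumes the tokens are already lower-cased and normalized; `extract_name` and `head_terms` are hypothetical names, and the ingredient list is abbreviated.

```python
# Abbreviated version of the common-ingredient list
common_ingredients = ["chicken", "cayenne", "paprika", "salt", "flour", "onion",
                      "garlic", "vegetable", "peanut", "white", "black"]

# Terms that are combined with a preceding modifier, checked in this order
head_terms = ["pepper", "powder", "oil"]

def extract_name(tokens):
    """Return e.g. 'white pepper' or 'flour', or None if nothing matches."""
    # First token that appears in the list of common ingredients
    modifier = next((token for token in tokens if token in common_ingredients), None)
    if modifier is None:
        return None
    # Append the head term (pepper, powder, oil) when one is present
    head = next((term for term in head_terms if term in tokens), None)
    return f"{modifier} {head}" if head else modifier

print(extract_name(["teaspoon", "white", "pepper"]))
print(extract_name(["quart", "peanut", "oil", "frying"]))
print(extract_name(["cup", "flour"]))
```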

### PorterStemmer

Similar to `unit of measurement`, a stemmer was used to standardize the form of tokens within the ingredient list.

In [15]:
print(f"Ingredients extracted from Chef John's Buttermilk Fried Chicken\n")

# Again iterating through each ingredient 
for index, ing in enumerate(ingredient_list):

    # Create variable for printing
    ing_org = ing
    
    # Remove punctuation and take lower case
    for punctuation_mark in string.punctuation:
        ing = ing.replace(punctuation_mark,"").lower()

    # Split words into tokens based on whitespace
    tokens = ing.split(" ")

    # Initiate blank list to store stemmed tokens
    stemmed_tokens = []

    # Iterate through all but first token (1st token is ingredient amount)
    for token in tokens[1:]:

        # Exclude stopwords and "", then append stemmed token to blank list
        if (not token in eng_stopwords) and token != "":
            stemmed_tokens.append(stemmer.stem(token))
    
    # Create blank list to store ingredients
    extracted_ingredients = []

    # Logic for identifying and disambiguating common ingredients
    # The code here is left explicit for easier reading of logic, a condensed version is used when combining all three extractions
    if "pepper" in stemmed_tokens:        
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token + " pepper")
    elif "powder" in stemmed_tokens:
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token + " powder")
    elif "oil" in stemmed_tokens:
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token + " oil")
    else:
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token)

    # print(f"""{str(index + 1).rjust(2,'0')}: {stemmed_tokens} \t was extracted from \t {ing_org}""")   
    try:
        print(f"""{str(index + 1).rjust(2,'0')}: {extracted_ingredients[0].ljust(12," ")} \t was extracted from \t {ing_org}""")
    except IndexError:
        print(f"""{str(index + 1).rjust(2,'0')}: {" ".ljust(12," ")} \t was extracted from \t {ing_org}""")

Ingredients extracted from Chef John's Buttermilk Fried Chicken

01: chicken      	 was extracted from 	 1 (3 1/2) pound chicken, cut into 8 pieces
02: black pepper 	 was extracted from 	 1 teaspoon black pepper
03: salt         	 was extracted from 	 1 teaspoon salt
04: paprika      	 was extracted from 	 1 teaspoon paprika
05: white pepper 	 was extracted from 	 0.5 teaspoon white pepper
06:              	 was extracted from 	 0.25 teaspoon dried rosemary
07: thyme        	 was extracted from 	 0.25 teaspoon ground thyme
08: oregano      	 was extracted from 	 0.25 teaspoon dried oregano
09: sage         	 was extracted from 	 0.25 teaspoon dried sage
10:              	 was extracted from 	 0.25 teaspoon cayenne pepper
11: buttermilk   	 was extracted from 	 2 cups buttermilk
12: flour        	 was extracted from 	 2 cups flour
13: salt         	 was extracted from 	 1 teaspoon salt
14: paprika      	 was extracted from 	 0.5 teaspoon paprika
15:              	 was extracted from 	 0

After stemming, ingredients 06, 10, and 15 were not extracted. This is because PorterStemmer converts `rosemary` to `rosemari` and `cayenne` to `cayenn`, so neither matches an entry in the list of common ingredients.

### WordNetLemmatizer

Next, a lemmatizer was used. Unlike stemming, which cuts off word endings, lemmatization returns the base or dictionary form of the word.

In [16]:
print(f"Ingredients extracted from Chef John's Buttermilk Fried Chicken\n")

# Instantiate a lemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()

# Again iterating through each ingredient 
for index, ing in enumerate(ingredient_list):

    # Create variable for printing
    ing_org = ing
    
    # Remove punctuation and take lower case
    for punctuation_mark in string.punctuation:
        ing = ing.replace(punctuation_mark,"").lower()

    # Split words into tokens based on whitespace
    tokens = ing.split(" ")

    # Initiate blank list to store lemmatized tokens
    stemmed_tokens = []

    # Iterate through all but first token (1st token is ingredient amount)
    for token in tokens[1:]:

        # Exclude stopwords and "", then append stemmed token to blank list
        if (not token in eng_stopwords) and token != "":
            stemmed_tokens.append(lemmatizer.lemmatize(token))
    
    # Create blank list to store ingredients
    extracted_ingredients = []

    # Logic for identifying and disambiguating common ingredients
    # The code here is left explicit for easier reading of logic, a condensed version is used when combining all three extractions
    if "pepper" in stemmed_tokens:        
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token + " pepper")
    elif "powder" in stemmed_tokens:
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token + " powder")
    elif "oil" in stemmed_tokens:
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token + " oil")
    else:
        for token in stemmed_tokens:
            if token in common_ingredients:
                extracted_ingredients.append(token)

    # print(f"""{str(index + 1).rjust(2,'0')}: {stemmed_tokens} \t was extracted from \t {ing_org}""")   
    print(f"""{str(index + 1).rjust(2,'0')}: {extracted_ingredients[0].ljust(12," ")} \t was extracted from \t {ing_org}""")   

Ingredients extracted from Chef John's Buttermilk Fried Chicken

01: chicken      	 was extracted from 	 1 (3 1/2) pound chicken, cut into 8 pieces
02: black pepper 	 was extracted from 	 1 teaspoon black pepper
03: salt         	 was extracted from 	 1 teaspoon salt
04: paprika      	 was extracted from 	 1 teaspoon paprika
05: white pepper 	 was extracted from 	 0.5 teaspoon white pepper
06: rosemary     	 was extracted from 	 0.25 teaspoon dried rosemary
07: thyme        	 was extracted from 	 0.25 teaspoon ground thyme
08: oregano      	 was extracted from 	 0.25 teaspoon dried oregano
09: sage         	 was extracted from 	 0.25 teaspoon dried sage
10: cayenne pepper 	 was extracted from 	 0.25 teaspoon cayenne pepper
11: buttermilk   	 was extracted from 	 2 cups buttermilk
12: flour        	 was extracted from 	 2 cups flour
13: salt         	 was extracted from 	 1 teaspoon salt
14: paprika      	 was extracted from 	 0.5 teaspoon paprika
15: cayenne pepper 	 was extracted from

The ingredients were extracted successfully using the lemmatizer. Thus, the WordNetLemmatizer was used instead of PorterStemmer.

## Combining Ingredient Amounts, UoM and Ingredients

Finally, all three extractions were combined and condensed into one block of code for brevity.

In [21]:
# Initiate blank dictionary to store ingredients
dict = {
    "recipe_name":[],
    "ing_amt":[],
    "ing_uom":[],
    "ing_name":[]
}

for row in df.loc[[2],:].itertuples():

    # Extract ingredients into a list from JSON dictionary
    ing_list = ast.literal_eval(row[2])["recipeIngredient"]
    recipe_name = ast.literal_eval(row[2])["name"]

    # Iterate through each ingredient
    for ing in ing_list:

        # Append recipe name
        dict["recipe_name"].append(recipe_name)
        
        # Extract ingredient amounts
        try:
            dict["ing_amt"].append(float(ing.split(" ")[0]))
        except ValueError:
            dict["ing_amt"].append(np.nan)
        
        # Remove punctuation and take lower case
        # This step has to come after extracting amount else decimal point will be removed
        for punctuation_mark in string.punctuation:
            ing = ing.replace(punctuation_mark,"").lower()

        # Split string into tokens based on whitespace
        tokens = ing.split(" ")

        # Initiate blank list to store lemmatized tokens
        stemmed_tokens = []
    
        # Iterate through all but first token (1st token is ingredient amount)
        for token in tokens[1:]:
    
            # Exclude stopwords and "", then append stemmed token to blank list
            if (not token in eng_stopwords) and token != "":
                stemmed_tokens.append(lemmatizer.lemmatize(token))
                
        # Extract ingredient UoM
        try:
            dict["ing_uom"].append([uom for uom in stemmed_tokens if uom in measurements][0])
        except IndexError:
            dict["ing_uom"].append(np.nan)

        # Extract ingredient name
        try:
            if "pepper" in tokens:
                dict["ing_name"].append([name + " pepper" for name in stemmed_tokens if name in common_ingredients][0])
            elif "powder" in tokens:
                dict["ing_name"].append([name + " powder" for name in stemmed_tokens if name in common_ingredients][0])
            elif "oil" in tokens:
                dict["ing_name"].append([name + " oil" for name in stemmed_tokens if name in common_ingredients][0])
            else:
                dict["ing_name"].append([name for name in stemmed_tokens if name in common_ingredients][0])
        except IndexError:
            dict["ing_name"].append(np.nan)

# Convert dictionary into DataFrame
ing_df = pd.DataFrame(dict)

# Examine DataFrame
ing_df

Unnamed: 0,recipe_name,ing_amt,ing_uom,ing_name
0,Chef John&#39;s Buttermilk Fried Chicken,1.0,pound,chicken
1,Chef John&#39;s Buttermilk Fried Chicken,1.0,teaspoon,black pepper
2,Chef John&#39;s Buttermilk Fried Chicken,1.0,teaspoon,salt
3,Chef John&#39;s Buttermilk Fried Chicken,1.0,teaspoon,paprika
4,Chef John&#39;s Buttermilk Fried Chicken,0.5,teaspoon,white pepper
5,Chef John&#39;s Buttermilk Fried Chicken,0.25,teaspoon,rosemary
6,Chef John&#39;s Buttermilk Fried Chicken,0.25,teaspoon,thyme
7,Chef John&#39;s Buttermilk Fried Chicken,0.25,teaspoon,oregano
8,Chef John&#39;s Buttermilk Fried Chicken,0.25,teaspoon,sage
9,Chef John&#39;s Buttermilk Fried Chicken,0.25,teaspoon,cayenne pepper


The resulting DataFrame contains all 19 ingredients present in Chef John's Buttermilk Fried Chicken recipe. Note that index 0, chicken, has a slight problem: instead of extracting one chicken weighing 3.5 pounds, only `1` and `pound` were extracted. One workaround is to assume an average whole-chicken weight of roughly 4 pounds; this will be addressed later.
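As an alternative to assuming an average weight, the parenthesised weight could be read directly from the ingredient string. The sketch below is not the method used in this project; it only covers the three forms that appear in this notebook's outputs (`(3 1/2) pound`, `(4 pound)` and `(2 to 3 pound)`), and taking the midpoint of a range is an assumption. The helper names are hypothetical.

```python
import re
from fractions import Fraction

# Matches "(3 1/2) pound", "(4 pound)" and "(2 to 3 pound)"
WEIGHT = re.compile(
    r"\(([\d\s/.]+?)(?:\s+to\s+([\d\s/.]+?))?\s*(?:pound\)|\)\s*pound)"
)

def to_number(text):
    """Convert '3', '3.5', or a mixed number such as '3 1/2' to a float."""
    return float(sum(Fraction(part) for part in text.split()))

def whole_chicken_pounds(ingredient):
    """Return total pounds (count x weight per bird), or None if no weight is found."""
    match = WEIGHT.search(ingredient)
    if match is None:
        return None
    # The first token is the number of chickens
    count = to_number(ingredient.split(" ")[0])
    low = to_number(match.group(1))
    high = to_number(match.group(2)) if match.group(2) else low
    # Midpoint of a weight range is an assumption
    return count * (low + high) / 2

print(whole_chicken_pounds("1 (3 1/2) pound chicken, cut into 8 pieces"))
print(whole_chicken_pounds("2 (2 to 3 pound) whole chickens, cut into pieces"))
```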

Repeating for all recipes:

In [28]:
# Initiate blank dictionary to store ingredients
dict = {
    "recipe_url"  : [],
    "recipe_name" : [],
    "ing_amt"     : [],
    "ing_uom"     : [],
    "ing_name"    : [],
    "ing_org"     : []
}

for row in df.itertuples():

    # Extract ingredients into a list from JSON dictionary
    ing_list = ast.literal_eval(row[2])["recipeIngredient"]
    recipe_name = ast.literal_eval(row[2])["name"]
    recipe_url = row[1]

    # Iterate through each ingredient
    for ing in ing_list:

        # Append recipe name, URL, and original ingredient string
        dict["recipe_name"].append(recipe_name)
        dict["recipe_url"].append(recipe_url)
        dict["ing_org"].append(ing)
        
        # Extract ingredient amounts
        try:
            dict["ing_amt"].append(float(ing.split(" ")[0]))
        except ValueError:
            dict["ing_amt"].append(np.nan)
        
        # Remove punctuation and take lower case
        # This step has to come after extracting amount else decimal point will be removed
        for punctuation_mark in string.punctuation:
            ing = ing.replace(punctuation_mark,"").lower()

        # Split string into tokens based on whitespace
        tokens = ing.split(" ")

        # Initiate blank list to store lemmatized tokens
        stemmed_tokens = []
    
        # Iterate through all but first token (1st token is ingredient amount)
        for token in tokens[1:]:
    
            # Exclude stopwords and "", then append stemmed token to blank list
            if (not token in eng_stopwords) and token != "":
                stemmed_tokens.append(lemmatizer.lemmatize(token))
                
        # Extract ingredient UoM
        try:
            dict["ing_uom"].append([uom for uom in stemmed_tokens if uom in measurements][0])
        except IndexError:
            dict["ing_uom"].append(np.nan)

        # Extract ingredient name
        try:
            if "pepper" in tokens:
                dict["ing_name"].append([name + " pepper" for name in stemmed_tokens if name in common_ingredients][0])
            elif "powder" in tokens:
                dict["ing_name"].append([name + " powder" for name in stemmed_tokens if name in common_ingredients][0])
            elif "oil" in tokens:
                dict["ing_name"].append([name + " oil" for name in stemmed_tokens if name in common_ingredients][0])
            else:
                dict["ing_name"].append([name for name in stemmed_tokens if name in common_ingredients][0])
        except IndexError:
            dict["ing_name"].append(np.nan)

    print(f"""Finished extraction for recipe {str(row[0] + 1).rjust(2,'0')}/{df.shape[0]}""",end = "\r")

Finished extraction for recipe 33/33

## Remove Recipes without Whole Chicken

The extracted ingredients were stored in a DataFrame.

In [29]:
# Convert dictionary into DataFrame
ing_df = pd.DataFrame(dict)

# Examine the dataframe
ing_df.head()

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org
0,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,pound,chicken,"1 (4 pound) chicken, cut into pieces"
1,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,cup,buttermilk,1 cup buttermilk
2,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,2.0,cup,flour,2 cups all-purpose flour for coating
3,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,teaspoon,paprika,1 teaspoon paprika
4,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,,,,salt and pepper to taste


In [30]:
# Verify 33 recipes remain
ing_df["recipe_url"].nunique()

33

In [31]:
# Extract chicken from each recipe, then store in a dataframe
cond = ing_df["ing_name"] == "chicken"
chicken_df = ing_df.loc[cond]

# Use the previous terms to exclude list to identify recipes to drop
cond = chicken_df["ing_org"].str.contains(terms_to_exclude)

# Examine the recipes to be excluded
x = chicken_df.loc[cond,"recipe_url"].nunique()
print(f"Number of recipes to be excluded because they don't use whole chicken: {x}.")

# Examine the first 5 recipes to be excluded
chicken_df.loc[cond].head()

Number of recipes to be excluded because they don't use whole chicken: 15.


Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org
49,https://www.allrecipes.com/recipe/16573/chicke...,Chicken Fried Chicken,6.0,,chicken,"6 skinless, boneless chicken breast halves"
63,https://www.allrecipes.com/recipe/24778/better...,Better than Best Fried Chicken,4.0,,chicken,"4 skinless, boneless chicken breast halves"
78,https://www.allrecipes.com/recipe/86047/garlic...,Garlic Chicken Fried Chicken,4.0,,chicken,"4 skinless, boneless chicken breast halves - p..."
91,https://www.allrecipes.com/recipe/87473/mustar...,Mustard Fried Chicken,5.0,pound,chicken,"5 pounds chicken wings, separated at joints, t..."
159,https://www.allrecipes.com/recipe/230854/perfe...,Perfect Crispy Fried Chicken,3.0,,chicken,"3 chicken leg quarters, cut into thighs and dr..."


A total of 15 recipes were found to not use whole chicken as the main ingredient, and thus were excluded. The recipes that remain were also examined.

In [32]:
# Examine the recipes that use whole chicken
chicken_df.loc[~cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org
0,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,pound,chicken,"1 (4 pound) chicken, cut into pieces"
16,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,1.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces"
17,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,1.0,pound,chicken,"1 (3 1/2) pound chicken, cut into 8 pieces"
41,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,2.0,pound,chicken,"2 (2 to 3 pound) whole chickens, cut into pieces"
58,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,1.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces"
60,https://www.allrecipes.com/recipe/24778/better...,Better than Best Fried Chicken,1.0,ounce,chicken,1 (10.5 ounce) can condensed cream of chicken ...
87,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,1.0,pound,chicken,"1 (4 pound) whole chicken, cut into pieces"
89,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,1.0,cup,chicken,1 cup chicken broth
100,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,1.0,pound,chicken,"1 (3 pound) whole chicken, cut into 6 pieces"
113,https://www.allrecipes.com/recipe/8717/deep-so...,Deep South Fried Chicken,1.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces"


Some recipes use `chicken broth` and `chicken bouillon granules` (index 89 and 145), which were extracted as `chicken`. Furthermore, although the recipes all use whole chicken, the weight of each chicken differs. These irregularities will be fixed in one go when identifying missed ingredients.

An anti join was used to remove recipes that do not use whole chicken.

In [33]:
# Anti join: first outer join then filter for left only
# outer join
outer = ing_df.merge(
    chicken_df.loc[cond],
    how = "outer",
    left_on = "recipe_url",
    right_on = "recipe_url",
    suffixes = ("", "_drop"),
    indicator = True # needed for anti join
)

# Filter for left_only
cond = outer["_merge"] == "left_only"
ing_df = outer.loc[cond].iloc[:,0:6]
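The anti join above can also be written as a membership filter. The following is a minimal sketch on made-up data, not the notebook's own method: `ing` and `urls_to_drop` are hypothetical stand-ins for `ing_df` and `chicken_df.loc[cond, "recipe_url"]`.

```python
import pandas as pd

# Hypothetical miniature of ing_df: three recipes, five ingredient rows
ing = pd.DataFrame({
    "recipe_url": ["a", "a", "b", "b", "c"],
    "ing_name": ["chicken", "flour", "chicken", "salt", "chicken"],
})

# Hypothetical stand-in for the URLs of recipes flagged for exclusion
urls_to_drop = pd.Series(["b"])

# Keep every row whose recipe_url is NOT among the flagged URLs
kept = ing.loc[~ing["recipe_url"].isin(urls_to_drop)].reset_index(drop=True)
print(kept)
```

Unlike the outer merge, this filter adds no `_drop` or `_merge` columns, so no column slicing is needed afterwards.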

In [34]:
# Reset index for neatness
ing_df.reset_index(inplace = True, drop = True)

# Verify only 18 recipes remain
ing_df["recipe_url"].nunique()

18

Now that only recipes that use whole chicken remain, any missed ingredients can be identified.

## Identify Missed Ingredients

In [35]:
# Create a condition to identify any row with null values
cond = ing_df.isna().any(axis = 1)

# Identify recipes with null values
recipes_with_null = ing_df.loc[cond,"recipe_name"].nunique()
print(f"The number of recipes with null values:{recipes_with_null}.")
print(f"The number of missed ingredients: {cond.sum()}")

The number of recipes with null values:17.
The number of missed ingredients: 56


As expected, since the extraction process was modelled after Chef John's recipe, most recipes have at least one missed ingredient: 17 of the 18 remaining recipes contain a row with null values.

In [36]:
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org
4,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,,,,salt and pepper to taste
11,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,0.5,teaspoon,,0.5 teaspoon poultry seasoning
12,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,1.5,cup,,"1.5 cups beer, or as needed"
13,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,2.0,,,"2 egg yolks, beaten"
39,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,,,,salt and pepper to taste
40,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,4.0,,,"4 large eggs, beaten"
45,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,,,,salt to taste
46,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,,,black pepper,ground black pepper to taste
49,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,,teaspoon,cayenne pepper,1/2 teaspoon cayenne pepper (optional)
52,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,0.5,cup,,0.5 cup milk
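Row 49 shows one cause of a missing amount: `float()` cannot parse a fraction such as `1/2`. One possible workaround, sketched here under the assumption that the amount is always the first whitespace-separated token, is to parse it with the standard library's `fractions.Fraction`. The helper name `parse_amount` is hypothetical and not part of the notebook.

```python
from fractions import Fraction
import math

def parse_amount(ingredient):
    """Return the leading amount as a float, or NaN if it is not numeric."""
    first_token = ingredient.split(" ")[0]
    try:
        # Fraction accepts "2", "1.5" and "1/2" alike
        return float(Fraction(first_token))
    except (ValueError, ZeroDivisionError):
        return math.nan

print(parse_amount("1/2 teaspoon cayenne pepper (optional)"))
print(parse_amount("1.5 cups beer, or as needed"))
print(parse_amount("salt to taste"))
```

Mixed numbers such as `3 1/2` would still need separate handling, since only the first token is read.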


In [39]:
ing_df.loc[cond].to_csv("../11_raw_data/x.csv",index = False)

## Final Extraction

In [37]:
# Keep only recipes that use whole chicken
df_clean = df.merge(
    ing_df[["recipe_url"]].drop_duplicates(),
    how = "inner",
    left_on = "recipe_url",
    right_on = "recipe_url",
    validate = "one_to_one"
)

In [116]:
# Define unit of measurements common to the American home kitchen
measurements = [
    "teaspoon", 
    "tablespoon",
    "cup",
    "quart",
    "pound",
    "ounce",
    "clove",
    "packet"
]

# Define common ingredients in fried chicken
common_ingredients = [
    "chicken",
    "cayenne",
    "paprika",
    "rosemary",
    "thyme",
    "oregano",
    "sage",
    "buttermilk",
    "milk",
    "salt",
    "flour",
    "onion",
    "garlic",
    "vegetable",
    "peanut",
    "coconut",    
    "white",      # white pepper
    "black",      # black pepper
    "egg",
    "beer",
    "shortening",
    "mustard",
    "honey",
    "butter",
    "lard",
    "sugar",
    "curry",
    "chili",
    "sherry",
    "baking",
    "barbeque",
    "worcestershire",
    "steak",
    "italian",
    "lemon",
    "oyster",
    "hot",
    "poultry",
    "pickle",
    "celery"
]

In [129]:
# Initiate blank dictionary to store ingredients
dict = {
    "recipe_url"  : [],
    "recipe_name" : [],
    "ing_amt"     : [],
    "ing_uom"     : [],
    "ing_name"    : [],
    "ing_org"     : [],
    "ing_stemmed" : []
}

common_base_terms = ["pepper","seasoning", "brine", "sauce","powder", "oil", "seed", "broth", "granule"]

for row in df_clean.itertuples():

    # Extract ingredients into a list from JSON dictionary
    ing_list = ast.literal_eval(row[2])["recipeIngredient"]
    recipe_name = ast.literal_eval(row[2])["name"]
    recipe_url = row[1]

    # Iterate through each ingredient
    for ing in ing_list:

        # Append recipe name, URL, and the original ingredient text
        dict["recipe_name"].append(recipe_name)
        dict["recipe_url"].append(recipe_url)
        dict["ing_org"].append(ing)
        
        # Extract ingredient amounts
        try:
            dict["ing_amt"].append(float(ing.split(" ")[0]))
        except:
            dict["ing_amt"].append(np.NaN)
        
        # Remove punctuation and take lower case
        # This step has to come after extracting amount else decimal point will be removed
        for punctuation_mark in string.punctuation:
            ing = ing.replace(punctuation_mark,"").lower()

        # Split string into tokens based on whitespace
        tokens = ing.split(" ")

        # Initiate blank list to store stemmed tokens
        stemmed_tokens = []
    
        # Iterate through all but first token (1st token is ingredient amount)
        for token in tokens[1:]:
    
            # Exclude stopwords and "", then append stemmed token to blank list
            if (not token in eng_stopwords) and token != "":
                stemmed_tokens.append(lemmatizer.lemmatize(token))

        dict["ing_stemmed"].append(stemmed_tokens)
                
        # Extract ingredient UoM
        try:
            dict["ing_uom"].append([uom for uom in stemmed_tokens if uom in measurements][0])
        except:
            dict["ing_uom"].append(np.NaN)

        # Extract ingredient name
        try:
            # First if clause to deal with the ambiguous "to taste"
            if "taste" in stemmed_tokens:
                dict["ing_name"].append("to taste")
                
            # Second if clause to deal with common base terms (x powder, y oil)
            elif any(x in common_base_terms for x in stemmed_tokens):
                match_term = [name for name in stemmed_tokens if name in common_ingredients]
                if len(match_term) > 0:
                    dict["ing_name"].append(match_term[0] + " " + [x for x in stemmed_tokens if x in common_base_terms][0])
                else:
                    dict["ing_name"].append([x for x in stemmed_tokens if x in common_base_terms][0])
                    
            # Third if clause to deal with common ingredients that do not have common base terms
            elif any(x in common_ingredients for x in stemmed_tokens):
                dict["ing_name"].append([name for name in stemmed_tokens if name in common_ingredients][0])

            # Finally, for any stragglers (e.g. Kikkoman Tempura Batter Mix)
            else:
                dict["ing_name"].append(" ".join(tokens[2:]))
                
        except:
            dict["ing_name"].append(np.NaN)
            
    print(f"""Finished extraction for recipe {str(row[0] + 1).rjust(2,'0')}/{df_clean.shape[0]}""",end = "\r")

Finished extraction for recipe 18/18
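The naming rules in the loop above form a cascade: "to taste" first, then base terms, then common ingredients, then a fallback. The sketch below restates that cascade as a standalone function so it can be tried on individual token lists. The function name `name_ingredient` and the shortened word lists are illustrative assumptions, and the last input line is a hypothetical ingredient string.

```python
def name_ingredient(tokens, stemmed_tokens, common_ingredients, common_base_terms):
    """Apply the four naming rules in order and return the ingredient name."""
    if "taste" in stemmed_tokens:
        return "to taste"
    base = [t for t in stemmed_tokens if t in common_base_terms]
    known = [t for t in stemmed_tokens if t in common_ingredients]
    if base:
        # "onion" + "powder" -> "onion powder"; a bare base term stays as-is
        return f"{known[0]} {base[0]}" if known else base[0]
    if known:
        return known[0]
    # Fallback: everything after the amount and the unit, as in the loop
    return " ".join(tokens[2:])

# Shortened, illustrative versions of the notebook's lists
ingredients = ["onion", "buttermilk", "cayenne"]
base_terms = ["pepper", "powder", "oil"]

print(name_ingredient(["1", "teaspoon", "onion", "powder"],
                      ["teaspoon", "onion", "powder"], ingredients, base_terms))
print(name_ingredient(["vegetable", "oil", "for", "deepfrying"],
                      ["oil", "deepfrying"], ingredients, base_terms))
print(name_ingredient(["1", "package", "kikkoman", "tempura", "batter", "mix"],
                      ["package", "kikkoman", "tempura", "batter", "mix"],
                      ingredients, base_terms))
```

Because "taste" is checked first, a line such as "ground black pepper to taste" is named `to taste` rather than `black pepper`, which matches the output in the "To taste" section.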

In [130]:
# Convert dictionary into DataFrame
ing_df = pd.DataFrame(dict)

# Examine the dataframe
ing_df.head()

# Create a condition to identify any row with null values
cond = ing_df["ing_name"].isna()

# Identify recipes with null ingredients
recipes_with_null = ing_df.loc[cond,"recipe_name"].nunique()
print(f"The number of recipes with null ingredients:{recipes_with_null}.")
print(f"The number of missed ingredients: {cond.sum()}")

The number of recipes with null ingredients:0.
The number of missed ingredients: 0


Although null values still exist in `ing_amt` and `ing_uom`, all ingredients have been extracted successfully.

# More Data Cleaning Part I

## Chicken

Assumption: 1 whole chicken is 4 pounds.

In [131]:
cond = ing_df["ing_name"] == "chicken"
ing_df.loc[cond].head()

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed
0,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,pound,chicken,"1 (4 pound) chicken, cut into pieces","[4, pound, chicken, cut, piece]"
16,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,1.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces","[3, pound, whole, chicken, cut, piece]"
17,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,1.0,pound,chicken,"1 (3 1/2) pound chicken, cut into 8 pieces","[3, 12, pound, chicken, cut, 8, piece]"
41,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,2.0,pound,chicken,"2 (2 to 3 pound) whole chickens, cut into pieces","[2, 3, pound, whole, chicken, cut, piece]"
50,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,1.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces","[3, pound, whole, chicken, cut, piece]"


In [132]:
ing_df.loc[cond, "ing_amt"] = ing_df.loc[cond, "ing_amt"] * 4 # 4 pounds per whole chicken
ing_df.loc[cond, "ing_uom"] = "pound"

ing_df.loc[cond].head()

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed
0,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,4.0,pound,chicken,"1 (4 pound) chicken, cut into pieces","[4, pound, chicken, cut, piece]"
16,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,4.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces","[3, pound, whole, chicken, cut, piece]"
17,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,4.0,pound,chicken,"1 (3 1/2) pound chicken, cut into 8 pieces","[3, 12, pound, chicken, cut, 8, piece]"
41,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,8.0,pound,chicken,"2 (2 to 3 pound) whole chickens, cut into pieces","[2, 3, pound, whole, chicken, cut, piece]"
50,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,4.0,pound,chicken,"1 (3 pound) whole chicken, cut into pieces","[3, pound, whole, chicken, cut, piece]"
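The flat 4-pound assumption overrides the weights stated in `ing_org`, so a 3-pound bird is counted as 4 pounds. As an alternative, the stated weight could be read where it has the simple form `N (W pound)`. This is a minimal sketch, not the notebook's method: the helper name and the regular expression are assumptions, and ranges or mixed numbers fall back to the 4-pound default.

```python
import re

DEFAULT_CHICKEN_LB = 4.0  # same fallback weight as the assumption above

def stated_chicken_weight(ing_org, default=DEFAULT_CHICKEN_LB):
    """Return total pounds of chicken: count * stated weight, else count * default."""
    simple = re.match(r"\s*(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?) pounds?\)", ing_org)
    if simple:
        return float(simple.group(1)) * float(simple.group(2))
    # No simple "(W pound)" group: keep the count and assume the default weight
    count = re.match(r"\s*(\d+(?:\.\d+)?)", ing_org)
    n_chickens = float(count.group(1)) if count else 1.0
    return n_chickens * default

for text in [
    "1 (4 pound) chicken, cut into pieces",
    "1 (3 pound) whole chicken, cut into pieces",
    "2 (2 to 3 pound) whole chickens, cut into pieces",
]:
    print(text, "->", stated_chicken_weight(text))
```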


## Eggs

In [134]:
cond = ing_df["ing_name"] == "egg"
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed
13,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,2.0,,egg,"2 egg yolks, beaten","[egg, yolk, beaten]"
40,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,4.0,,egg,"4 large eggs, beaten","[large, egg, beaten]"
53,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,1.0,,egg,"1 egg, beaten","[egg, beaten]"
64,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,2.0,,egg,"2 eggs, beaten","[egg, beaten]"
97,https://www.allrecipes.com/recipe/57676/a-sout...,A Southern Fried Chicken,2.0,,egg,2 eggs,[egg]
126,https://www.allrecipes.com/recipe/254804/chef-...,Chef John&#39;s Nashville Hot Chicken,1.0,,egg,1 large egg,"[large, egg]"
154,https://www.allrecipes.com/recipe/261544/juicy...,Juicy Honey Fried Chicken,1.0,,egg,1 egg,[egg]


In [135]:
ing_df.loc[cond,"ing_uom"] = "piece"

ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed
13,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,2.0,piece,egg,"2 egg yolks, beaten","[egg, yolk, beaten]"
40,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,4.0,piece,egg,"4 large eggs, beaten","[large, egg, beaten]"
53,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,1.0,piece,egg,"1 egg, beaten","[egg, beaten]"
64,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,2.0,piece,egg,"2 eggs, beaten","[egg, beaten]"
97,https://www.allrecipes.com/recipe/57676/a-sout...,A Southern Fried Chicken,2.0,piece,egg,2 eggs,[egg]
126,https://www.allrecipes.com/recipe/254804/chef-...,Chef John&#39;s Nashville Hot Chicken,1.0,piece,egg,1 large egg,"[large, egg]"
154,https://www.allrecipes.com/recipe/261544/juicy...,Juicy Honey Fried Chicken,1.0,piece,egg,1 egg,[egg]


## Milk

In [140]:
cond = ing_df["ing_name"].str.contains("milk") == True
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed
1,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,cup,buttermilk,1 cup buttermilk,"[cup, buttermilk]"
27,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,2.0,cup,buttermilk,2 cups buttermilk,"[cup, buttermilk]"
52,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,0.5,cup,milk,0.5 cup milk,"[cup, milk]"
62,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,1.0,cup,milk,1 cup milk,"[cup, milk]"
65,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,1.0,ounce,milk,1 (12 fluid ounce) can evaporated milk,"[12, fluid, ounce, evaporated, milk]"
77,https://www.allrecipes.com/recipe/178809/south...,Southern-Style Buttermilk Fried Chicken,2.0,cup,buttermilk,2 cups buttermilk,"[cup, buttermilk]"
98,https://www.allrecipes.com/recipe/57676/a-sout...,A Southern Fried Chicken,4.0,cup,buttermilk,4 cups buttermilk,"[cup, buttermilk]"
113,https://www.allrecipes.com/recipe/196428/south...,Southern Spicy Fried Chicken,1.0,quart,buttermilk,1 quart buttermilk,"[quart, buttermilk]"
123,https://www.allrecipes.com/recipe/254804/chef-...,Chef John&#39;s Nashville Hot Chicken,1.0,cup,buttermilk,1 cup buttermilk,"[cup, buttermilk]"
153,https://www.allrecipes.com/recipe/261544/juicy...,Juicy Honey Fried Chicken,0.5,cup,milk,0.5 cup milk,"[cup, milk]"


In [142]:
# Row 65 is one 12 fluid ounce can of evaporated milk, so record the can size as the amount
ing_df.loc[65,"ing_amt"] = 12

In [143]:
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed
1,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,cup,buttermilk,1 cup buttermilk,"[cup, buttermilk]"
27,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,2.0,cup,buttermilk,2 cups buttermilk,"[cup, buttermilk]"
52,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,0.5,cup,milk,0.5 cup milk,"[cup, milk]"
62,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,1.0,cup,milk,1 cup milk,"[cup, milk]"
65,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,12.0,ounce,milk,1 (12 fluid ounce) can evaporated milk,"[12, fluid, ounce, evaporated, milk]"
77,https://www.allrecipes.com/recipe/178809/south...,Southern-Style Buttermilk Fried Chicken,2.0,cup,buttermilk,2 cups buttermilk,"[cup, buttermilk]"
98,https://www.allrecipes.com/recipe/57676/a-sout...,A Southern Fried Chicken,4.0,cup,buttermilk,4 cups buttermilk,"[cup, buttermilk]"
113,https://www.allrecipes.com/recipe/196428/south...,Southern Spicy Fried Chicken,1.0,quart,buttermilk,1 quart buttermilk,"[quart, buttermilk]"
123,https://www.allrecipes.com/recipe/254804/chef-...,Chef John&#39;s Nashville Hot Chicken,1.0,cup,buttermilk,1 cup buttermilk,"[cup, buttermilk]"
153,https://www.allrecipes.com/recipe/261544/juicy...,Juicy Honey Fried Chicken,0.5,cup,milk,0.5 cup milk,"[cup, milk]"


# Data Enrichment

Before dealing with the other null values, the units of measurement need to be converted from US customary units to the metric system for 2 reasons:
- grocery prices in Canada are listed in metric units
- `to taste` measurements are quantified using the USDA measurement of salt (sodium) content in fried chicken

## Unit Conversion

In [144]:
# Define metric unit conversions
metric_conversion_rate = {
    # units regarding mass (metric unit gram)
    "pound"     : 453.59233, # https://www.metric-conversions.org/weight/pounds-to-grams.htm

    # units regarding volume (metric unit mL or cm3)
    "teaspoon"  : 4.9289215, # https://www.metric-conversions.org/volume/us-teaspoons-to-milliliters.htm#metricConversionTable?val=1
    "tablespoon": 14.786765, # https://www.metric-conversions.org/volume/us-tablespoons-to-milliliters.htm
    "quart"     : 946.35295, # https://www.metric-conversions.org/volume/us-liquid-quarts-to-milliliters.htm
    "cup"       : 236.58824, # https://www.metric-conversions.org/volume/us-cups-to-milliliters.htm 
    "ounce"     : 29.573529  # https://www.metric-conversions.org/volume/us-ounces-to-milliliters.htm
}

In [145]:
# Define metric units
metric_uom = {
    # units regarding mass (metric unit gram)
    "pound"     : "g", # https://www.metric-conversions.org/weight/pounds-to-grams.htm

    # units regarding volume (metric unit mL or cm3)
    "teaspoon"  : "mL", # https://www.metric-conversions.org/volume/us-teaspoons-to-milliliters.htm#metricConversionTable?val=1
    "tablespoon": "mL", # https://www.metric-conversions.org/volume/us-tablespoons-to-milliliters.htm
    "quart"     : "mL", # https://www.metric-conversions.org/volume/us-liquid-quarts-to-milliliters.htm
    "cup"       : "mL", # https://www.metric-conversions.org/volume/us-cups-to-milliliters.htm 
    "ounce"     : "mL"  # https://www.metric-conversions.org/volume/us-ounces-to-milliliters.htm
}

In [146]:
# Map (VLOOKUP) the conversion rates
ing_df["ing_amt_metric"] = ing_df["ing_amt"] * ing_df["ing_uom"].map(metric_conversion_rate)
ing_df["ing_uom_metric"] = ing_df["ing_uom"].map(metric_uom)
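Units that are absent from the two dictionaries (such as `clove`, `packet`, or `piece`) map to `NaN` rather than raising an error, so they are easy to miss. The sketch below, on made-up rows with a subset of the rates, shows the same `map` pattern and one way to list the units that were left unconverted. The names `rates`, `units`, `demo`, and `unmapped` are illustrative only.

```python
import pandas as pd

# Subset of the conversion dictionaries defined above
rates = {"pound": 453.59233, "cup": 236.58824}
units = {"pound": "g", "cup": "mL"}

# Made-up rows; "clove" has no entry in either dictionary
demo = pd.DataFrame({
    "ing_amt": [4.0, 2.0, 3.0],
    "ing_uom": ["pound", "cup", "clove"],
})

# Look up the rate and the metric unit for each row (missing keys give NaN)
demo["ing_amt_metric"] = demo["ing_amt"] * demo["ing_uom"].map(rates)
demo["ing_uom_metric"] = demo["ing_uom"].map(units)

# List the original units that did not receive a metric equivalent
unmapped = demo.loc[demo["ing_uom_metric"].isna(), "ing_uom"].unique().tolist()
print(demo)
print("Unconverted units:", unmapped)
```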

In [147]:
# Examine the conversion result
ing_df

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed,ing_amt_metric,ing_uom_metric
0,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,4.0,pound,chicken,"1 (4 pound) chicken, cut into pieces","[4, pound, chicken, cut, piece]",1814.369320,g
1,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,cup,buttermilk,1 cup buttermilk,"[cup, buttermilk]",236.588240,mL
2,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,2.0,cup,flour,2 cups all-purpose flour for coating,"[cup, allpurpose, flour, coating]",473.176480,mL
3,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,1.0,teaspoon,paprika,1 teaspoon paprika,"[teaspoon, paprika]",4.928922,mL
4,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,,,to taste,salt and pepper to taste,"[pepper, taste]",,
...,...,...,...,...,...,...,...,...,...
168,https://www.allrecipes.com/recipe/216975/butte...,Buttermilk Fried Chicken,1.0,teaspoon,onion powder,1 teaspoon onion powder,"[teaspoon, onion, powder]",4.928922,mL
169,https://www.allrecipes.com/recipe/216975/butte...,Buttermilk Fried Chicken,1.0,teaspoon,poultry seasoning,1 teaspoon poultry seasoning,"[teaspoon, poultry, seasoning]",4.928922,mL
170,https://www.allrecipes.com/recipe/216975/butte...,Buttermilk Fried Chicken,1.0,teaspoon,celery seed,1 teaspoon celery seeds,"[teaspoon, celery, seed]",4.928922,mL
171,https://www.allrecipes.com/recipe/216975/butte...,Buttermilk Fried Chicken,,,oil,Vegetable oil for deep-frying,"[oil, deepfrying]",,


# More Data Cleaning Part II

## Oil

In [156]:
cond = ing_df["ing_name"].str.contains("oil") == True
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed,ing_amt_metric,ing_uom_metric
5,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,2.0,quart,vegetable oil,2 quarts vegetable oil for frying,"[quart, vegetable, oil, frying]",1892.7059,mL
6,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,1.0,quart,vegetable oil,1 quart vegetable oil for frying,"[quart, vegetable, oil, frying]",946.35295,mL
35,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,2.5,quart,peanut oil,2.5 quarts peanut oil for frying,"[quart, peanut, oil, frying]",2365.882375,mL
42,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,1.0,quart,vegetable oil,1 quart vegetable oil for frying,"[quart, vegetable, oil, frying]",946.35295,mL
51,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,1.0,quart,vegetable oil,1 quart vegetable oil for frying,"[quart, vegetable, oil, frying]",946.35295,mL
60,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,3.0,cup,vegetable oil,3 cups vegetable oil,"[cup, vegetable, oil]",709.76472,mL
71,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,1.5,cup,vegetable oil,1.5 cups vegetable oil for frying,"[cup, vegetable, oil, frying]",354.88236,mL
87,https://www.allrecipes.com/recipe/178809/south...,Southern-Style Buttermilk Fried Chicken,5.0,cup,vegetable oil,5 cups vegetable oil for frying,"[cup, vegetable, oil, frying]",1182.9412,mL
91,https://www.allrecipes.com/recipe/8836/fried-c...,Fried Chicken,2.0,quart,vegetable oil,2 quarts vegetable oil for frying,"[quart, vegetable, oil, frying]",1892.7059,mL
103,https://www.allrecipes.com/recipe/57676/a-sout...,A Southern Fried Chicken,2.0,cup,vegetable oil,2 cups oil for frying,"[cup, oil, frying]",473.17648,mL


In [151]:
# Calculate mean amount of oil used
avg_oil = ing_df.loc[cond, "ing_amt_metric"].mean()
print(f"Average oil used throughout recipes: {np.round(avg_oil,2)} mL.")

Average oil used throughout recipes: 997.05 mL.


Average oil usage is roughly 1L.

In [153]:
ing_df.loc[cond, "ing_amt_metric"] = ing_df.loc[cond, "ing_amt_metric"].fillna(avg_oil)

In [155]:
# Update oil to vegetable oil
cond = ing_df["ing_name"] == "oil"
ing_df.loc[cond, "ing_name"] = "vegetable oil"

In [158]:
# Set the metric unit to mL for oil rows that have no unit
cond = ing_df["ing_uom_metric"].isna()
cond2 = ing_df["ing_name"].str.contains("oil") == True
ing_df.loc[cond & cond2, "ing_uom_metric"] = "mL"

In [159]:
cond = ing_df["ing_name"].str.contains("oil") == True
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed,ing_amt_metric,ing_uom_metric
5,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,2.0,quart,vegetable oil,2 quarts vegetable oil for frying,"[quart, vegetable, oil, frying]",1892.7059,mL
6,https://www.allrecipes.com/recipe/89268/triple...,Triple-Dipped Fried Chicken,1.0,quart,vegetable oil,1 quart vegetable oil for frying,"[quart, vegetable, oil, frying]",946.35295,mL
35,https://www.allrecipes.com/recipe/220128/chef-...,Chef John&#39;s Buttermilk Fried Chicken,2.5,quart,peanut oil,2.5 quarts peanut oil for frying,"[quart, peanut, oil, frying]",2365.882375,mL
42,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,1.0,quart,vegetable oil,1 quart vegetable oil for frying,"[quart, vegetable, oil, frying]",946.35295,mL
51,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,1.0,quart,vegetable oil,1 quart vegetable oil for frying,"[quart, vegetable, oil, frying]",946.35295,mL
60,https://www.allrecipes.com/recipe/15375/fried-...,Fried Chicken with Creamy Gravy,3.0,cup,vegetable oil,3 cups vegetable oil,"[cup, vegetable, oil]",709.76472,mL
71,https://www.allrecipes.com/recipe/8802/tanyas-...,Tanya&#39;s Louisiana Southern Fried Chicken,1.5,cup,vegetable oil,1.5 cups vegetable oil for frying,"[cup, vegetable, oil, frying]",354.88236,mL
87,https://www.allrecipes.com/recipe/178809/south...,Southern-Style Buttermilk Fried Chicken,5.0,cup,vegetable oil,5 cups vegetable oil for frying,"[cup, vegetable, oil, frying]",1182.9412,mL
91,https://www.allrecipes.com/recipe/8836/fried-c...,Fried Chicken,2.0,quart,vegetable oil,2 quarts vegetable oil for frying,"[quart, vegetable, oil, frying]",1892.7059,mL
103,https://www.allrecipes.com/recipe/57676/a-sout...,A Southern Fried Chicken,2.0,cup,vegetable oil,2 cups oil for frying,"[cup, oil, frying]",473.17648,mL


## To taste

In [161]:
cond = ing_df["ing_name"] == "to taste"
ing_df.loc[cond]

Unnamed: 0,recipe_url,recipe_name,ing_amt,ing_uom,ing_name,ing_org,ing_stemmed,ing_amt_metric,ing_uom_metric
4,https://www.allrecipes.com/recipe/8805/crispy-...,Crispy Fried Chicken,,,to taste,salt and pepper to taste,"[pepper, taste]",,
39,https://www.allrecipes.com/recipe/8970/millie-...,Millie Pasquinelli&#39;s Fried Chicken,,,to taste,salt and pepper to taste,"[pepper, taste]",,
45,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,,,to taste,salt to taste,[taste],,
46,https://www.allrecipes.com/recipe/8635/souther...,Southern Fried Chicken,,,to taste,ground black pepper to taste,"[black, pepper, taste]",,
89,https://www.allrecipes.com/recipe/8836/fried-c...,Fried Chicken,,,to taste,salt and pepper to taste,"[pepper, taste]",,
105,https://www.allrecipes.com/recipe/9000/honey-f...,Honey Fried Chicken,,,to taste,salt and pepper to taste,"[pepper, taste]",,
112,https://www.allrecipes.com/recipe/196428/south...,Southern Spicy Fried Chicken,,,to taste,salt and ground black pepper to taste,"[ground, black, pepper, taste]",,
120,https://www.allrecipes.com/recipe/196428/south...,Southern Spicy Fried Chicken,,,to taste,salt and ground black pepper to taste,"[ground, black, pepper, taste]",,
161,https://www.allrecipes.com/recipe/8785/moms-ol...,Mom&#39;s Old-Fashioned Fried Chicken,,,to taste,salt and pepper to taste,"[pepper, taste]",,


In [None]:
sodium_per_100g_chicken = 288 #mg https://fdc.nal.usda.gov/fdc-app.html#/food-details/172386/nutrients
sodium_per_100g_salt = 38800 #mg https://fdc.nal.usda.gov/fdc-app.html#/food-details/173468/nutrients
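The notebook does not show how these two figures are combined. One possible reading, sketched here as an assumption, is to treat all sodium in the finished chicken as coming from added salt and solve for the salt mass. The function name `salt_grams_for_chicken` is hypothetical.

```python
sodium_per_100g_chicken = 288    # mg of sodium per 100 g of fried chicken
sodium_per_100g_salt = 38800     # mg of sodium per 100 g of salt

def salt_grams_for_chicken(chicken_grams):
    """Estimate grams of salt, assuming all sodium in the chicken comes from added salt."""
    sodium_mg = chicken_grams / 100 * sodium_per_100g_chicken
    return sodium_mg / sodium_per_100g_salt * 100

# One 4-pound chicken, using the pound-to-gram rate from the conversion table
salt_g = salt_grams_for_chicken(4 * 453.59233)
print(f"Estimated salt for one chicken: {salt_g:.1f} g")
```

For a 4-pound chicken this gives roughly 13.5 g of salt. The estimate ignores sodium naturally present in the meat and other ingredients, so it is likely an upper bound.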

# Load Costs

In [None]:
cost_df = pd.read_csv("../11_raw_data/20231103-1016_ingredient_cost.csv")

In [None]:
cost_df.head()

In [None]:
final_df = ing_df.merge(
    cost_df.loc[:,["Material", "Price\n(CAD)", "Unit", "Density\nMeasurement", "Density\nUnit"]],
    left_on = "ing_name",
    right_on = "Material"
)

In [None]:
final_df.head()

In [None]:
# Treat "-" and blank densities as missing and default them to 0 (replacing "" as a substring would pad every character with "0")
final_df["Density\nMeasurement"] = pd.to_numeric(final_df["Density\nMeasurement"], errors = "coerce").fillna(0)

In [25]:
cost_list = []

for index, row in final_df.iterrows():
    if row["ing_uom_metric"] == row["Unit"]:
        cost_list.append(row["ing_amt_metric"] * row["Price\n(CAD)"])
    else:
        cost_list.append(row["ing_amt_metric"] * row["Price\n(CAD)"] * row["Density\nMeasurement"])
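The row-by-row loop above can also be expressed as a single vectorized step with `numpy.where`. This is a minimal sketch on made-up prices and densities, using the same column names as `final_df`. The frame `demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: one priced in the same unit, one needing a density factor
demo = pd.DataFrame({
    "ing_amt_metric": [1000.0, 100.0],
    "ing_uom_metric": ["g", "mL"],
    "Price\n(CAD)": [0.01, 0.002],
    "Unit": ["g", "g"],
    "Density\nMeasurement": [0.0, 0.5],
})

# True where the recipe unit already matches the unit the price is quoted in
same_unit = demo["ing_uom_metric"] == demo["Unit"]
base_cost = demo["ing_amt_metric"] * demo["Price\n(CAD)"]

# Apply the density factor only where the units differ
demo["cost"] = np.where(same_unit, base_cost, base_cost * demo["Density\nMeasurement"])
print(demo[["ing_amt_metric", "Unit", "cost"]])
```

As in the loop, a missing `ing_uom_metric` compares as unequal, so such rows take the density branch.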

In [26]:
final_df.loc[:,"cost"] = cost_list
final_df

Unnamed: 0,recipe_name,ing_amt,ing_uom,ing_name,ing_amt_metric,ing_uom_metric,Material,Price\n(CAD),Unit,Density\nMeasurement,Density\nUnit,cost
0,Crispy Fried Chicken,4.0,pound,chicken,1814.36932,g,chicken,0.00998,g,0.0,-,18.107406
1,Chef John&#39;s Buttermilk Fried Chicken,4.0,pound,chicken,1814.36932,g,chicken,0.00998,g,0.0,-,18.107406
2,Crispy Fried Chicken,1.0,cup,buttermilk,236.58824,mL,buttermilk,0.004,mL,0.0,-,0.946353
3,Chef John&#39;s Buttermilk Fried Chicken,2.0,cup,buttermilk,473.17648,mL,buttermilk,0.004,mL,0.0,-,1.892706
4,Crispy Fried Chicken,2.0,cup,flour,473.17648,mL,flour,0.001508,g,0.0503,g/mL,0.035892
5,Chef John&#39;s Buttermilk Fried Chicken,2.0,cup,flour,473.17648,mL,flour,0.001508,g,0.0503,g/mL,0.035892
6,Crispy Fried Chicken,1.0,teaspoon,paprika,4.928922,mL,paprika,0.0175,g,0.0406,g/mL,0.003502
7,Chef John&#39;s Buttermilk Fried Chicken,1.0,teaspoon,paprika,4.928922,mL,paprika,0.0175,g,0.0406,g/mL,0.003502
8,Chef John&#39;s Buttermilk Fried Chicken,0.5,teaspoon,paprika,2.464461,mL,paprika,0.0175,g,0.0406,g/mL,0.001751
9,Crispy Fried Chicken,2.0,quart,vegetable oil,1892.7059,mL,vegetable oil,0.003511,mL,0.0,-,6.64529


In [27]:
final_df = final_df.groupby(
    by = ["recipe_name","ing_name"],
    as_index = False
).agg(
    cost = ("cost","sum")
)

In [29]:
pivot_df = final_df.pivot(
    columns = "recipe_name",
    index   = "ing_name",
    values = "cost"
)

In [30]:
pivot_df

recipe_name,Chef John&#39;s Buttermilk Fried Chicken,Crispy Fried Chicken
ing_name,Unnamed: 1_level_1,Unnamed: 2_level_1
black pepper,0.005369,
buttermilk,1.892706,0.946353
cayenne pepper,0.002054,
chicken,18.107406,18.107406
flour,0.035892,0.035892
garlic powder,0.00226,
onion powder,0.001518,
oregano,0.002684,
paprika,0.005253,0.003502
peanut oil,16.4902,


In [33]:
pivot_df.columns = ["recipe 1","recipe 2"]
pivot_df

Unnamed: 0_level_0,recipe 1,recipe 2
ing_name,Unnamed: 1_level_1,Unnamed: 2_level_1
black pepper,0.005369,
buttermilk,1.892706,0.946353
cayenne pepper,0.002054,
chicken,18.107406,18.107406
flour,0.035892,0.035892
garlic powder,0.00226,
onion powder,0.001518,
oregano,0.002684,
paprika,0.005253,0.003502
peanut oil,16.4902,


In [78]:
pivot_df.to_csv("../12_processed_data/recipes_pivot.csv")