### Text Mining Student Specialist Problem Set - Shrusti Ghela


#### Problem Statement:
Scrape at least 100 recipes from the web, provide their ingredient lists and clean the ingredient data for the further calculation.

You will have three main tasks:
- scraping data: scrape at least 100 recipes from the website that you choose
- cleaning data: clean your ingredient data for further calculation. This includes, but is not limited to, removing excess white spaces, correcting for all edge cases, and correcting any remaining formatting issues
- calculating: what are the 10 most common ingredients used in these recipes?

In [193]:
#library import
from urllib.request import urlopen
import re
import csv
import pandas as pd
import spacy

### About the website

While starting on this assignment, I came across this website called allrecipes.com. While analyzing the structure of the website to begin scrapping, I realized that the website is very well structured and it would be fairly easy to extract data from this. Also, the website has a lot of recipes from various cuisines and seems to be wellmaintained and moderated in terms of the recipes. Thus, I chose to work with this website.

### Task 1: Web Scrapping Recipes

There were two approaches that I considered to complete this task. 
- Approach 1: To extract required information from the HTML front end, directly use code to download the HTML contents and extract out useful information using requests and regex. 
- Approach 2: To extract required information from the HTML front end, directly use code to download the HTML contents and extract out useful information using using requests and Beautiful Soup.

About:
- Regular Expression (shortened as regex): It is a sequence of characters that specifies a search pattern in text.
- Requests: It is a Python module in which you can send HTTP requests to retrieve contents. It helps you to access website HTML contents by sending Get or Post requests.
- Beautiful Soup: It helps you parse the HTML or XML documents into a readable format. It allows you to search different elements within the documents and help you retrieve required information faster.

#### Step 1: Understanding the HTML 
After surfing the website for a couple of minutes, I understood that there is a base url to all the recipes on the website: https://www.allrecipes.com/recipe/ 
So, I checked out if all of these recipes with the base url have a similar HTML structure. To do this, I took a couple of pages at random and checked the structure. For demonstration here, I have provided the HTML structure of one such page. 

In [194]:
# url = "https://www.allrecipes.com/recipe/281728/double-cheddar-rotisserie-chicken-lasagna"
# page = urlopen(url)
# html_bytes = page.read()
# html = html_bytes.decode("utf-8")
#print(html) 
#uncomment the above print statement to see the html struct


There is a lot of unwanted information here. But, we understand a few key poits from this which makes our task fairly simpler. 
- The title of the HTML page gives us the name of the recipe with some minor modifications. Hence, regex can be used to extract the name of the recipe. 
- The ingredients of that particular recipe is stored as a list with the key "recipeIngredient" for all the pages. Hence, regex with some simple functions can give us a list of ingredients. 
- Even though the url to any recipe is our input in this case, we can extract the complete url from the HTML using the key "url"

#### Step 2: Scrape the website allrecipes.com

In [195]:
# recipies =[] #list of all the recipe names
# link =[] #list of all the recipe url
# ingri = [] #list of all the recipe ingredients
# for i in range(23000, 23132):
#         try:
            
#             url = baseurl + str(i) #adding the page to the base url
#             page = urlopen(url)
#             html_bytes = page.read() 
#             html = html_bytes.decode("utf-8") 
            
#             pattern = "<title.*?>.*?</title.*?>" #regex to extract the title of the page - this contains the name
#             match_results = re.search(pattern, html, re.IGNORECASE)
#             title = match_results.group()
#             title = re.sub("<.*?>", "", title) #removing the <title> tag from the found pattern
#             title = re.sub("\| Allrecipes", "", title) #removing the unwanted data from the found pattern
#             title = re.sub("Recipe", "", title) #removing the Recipe at the end of the name (completely optional)
            
#             recipies.append(title) #adding the data to the list
            
#             #print(title)
            
#             pattern2 = r'"url": ".*?"' #regex to extract the complete url of the page
#             match_results2 = re.search(pattern2, html, re.IGNORECASE)
#             url2 = match_results2.group() 
#             url2 = re.sub('"url":', "", url2) #removing the unwanted '"url": ' tag from the found pattern
#             url2 = re.sub('"', "", url2) #removing unwanted " marks
            
#             link.append(url2) #adding the data to the list 
            
#             #print(url2)
            
#             pattern3 = '(?<="recipeIngredient": \[)[\S\s]*(?="recipeInstructions")' #regex to extract the ingredients
#             match_results3 = re.search(pattern3, html, re.IGNORECASE)
#             ingridients = match_results3.group()
#             ingridients = re.sub('\],', "", ingridients) #removing unwanted symbols from the pattern found
#             ingridients = re.sub('"', "", ingridients) #removing unwanted symbols from the pattern found
#             ingridients = re.sub('\n', "", ingridients) #removing new lines from the pattern
#             ingridients = re.sub('\\s+', ' ', ingridients) #removing multiple white spaces and replacing it with single white space
#             ingridients = ingridients.split(',') #converting the string to a list so that it could be converted into multiple rows later
            
#             ingri.append(ingridients) #adding the data to the list 
            
#             #print(ingridients)
            
#         except:
#             continue

 

#### Step 3: Scraping more than 100 recipes' name, link, and ingredients

In [196]:
# baseurl = "https://www.allrecipes.com/recipe/"

In [197]:
import json

f = open ('../recipes/recipes3.json', "r")
data = json.loads(f.read())

df = pd.DataFrame()
  
print(df)

recipies = list(map(lambda x: x['title'], data))
link = list(map(lambda x: x['url'], data))
ingri = list(map(lambda x: x['ingredients'], data))
  
# append columns to an empty DataFrame
df['name'] = recipies
df['url'] = link
df['ingridient'] = ingri

  
df

Empty DataFrame
Columns: []
Index: []


Unnamed: 0,name,url,ingridient
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,"[1 lb rinsed mung bean sprouts, 1/2 teaspoon s..."
1,No-Bake Alaska,https://www.food.com/recipe/no-bake-alaska-282721,"[1 (12 ounce) pre-made poundcake, 2 cups choco..."
2,Freezer Frosting,https://www.food.com/recipe/freezer-frosting-2...,"[1/3 cup shortening, 4 1/2-5 cups confectioner..."
3,Beet Salad with Mixed Greens,https://www.allrecipes.com/recipe/281980/beet-...,"[1 bunch red beets, trimmed and washed, 2 tabl..."
4,Braised Baby Bok Choy,https://www.allrecipes.com/recipe/283007/brais...,"[1 tablespoon extra-virgin olive oil, 4 large ..."
...,...,...,...
779,Vietnamese Tofu Salad,https://www.allrecipes.com/recipe/106069/vietn...,"[1 tablespoon vegetable oil, 2 tablespoons cho..."
780,Sunset Rum Punch,https://www.allrecipes.com/recipe/246985/sunse...,"[crushed ice, 1.5 fluid ounces spiced rum (suc..."
781,Chicken In Sour Cream,https://www.allrecipes.com/recipe/8519/chicken...,"[1 tablespoon vegetable oil, 8 chicken thighs,..."
782,Big Al's K.C. Bar-B-Q Sauce,https://www.allrecipes.com/recipe/44491/big-al...,"[2 cups ketchup, 2 cups tomato sauce, 1.25 cup..."


In [198]:
df = df.explode('ingridient') #convert the ingridients from the list to multiple rows


In [199]:
df

Unnamed: 0,name,url,ingridient
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,1 lb rinsed mung bean sprouts
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,1/2 teaspoon salt
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,2 tablespoons white distilled vinegar
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,2 tablespoons sesame oil
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,1 tablespoon sugar
...,...,...,...
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,1 (14.75 ounce) can cream-style corn
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,1.5 cups cubed potatoes
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,1 (10.75 ounce) can condensed cream of mushroo...
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,3 cups milk


In [200]:
df.to_csv('rawData.csv', index=False) #creating a csv

### Task 2: Cleaning Scraped Data

On investigating the 109 scraped recipes, I observed that there are several formatting "edge cases" unique to this website. These edge cases were not at all related to the ingredient names. These edge cases are:

- Measurements are represented as ½ (called vulgar fractions)
- Several non-alphanumeric characters such as copyright and trademark symbols used to identify ingredients, comma as used in "½ onion, finely chopped", brackets as used in "1 (1 ounce) envelope dry onion soup mix", hyphens (-) as used in "all-purpose flour"


I will clean the data in two phases:

- Primary Cleaning: The objective of the first phase is to ensure that the data is readable and accessible on all platforms by fixing encoding errors and eliminating symbols which aren't translated well across platforms. This cleaning will not get rid of any punctuations, stopwords etc.

- Problem-specific cleaning :The objective of the second phase of the cleaning is to prepare the data for our calculations and is centered on the problem set requirement.

I believe it is a good practice to separate the both, as if requirements change in the future you can always proceed with the result of the first cleaning phase to perform another analysis altogether.

#### Step 1: Primary Cleaning

In [201]:
#primary cleaning
#convert vulgar fractions 
import unicodedata
unicodedata.numeric(u'⅕')
unicodedata.name(u'⅕')

#convert vulgar fractions
for ix, row in df.iterrows():
    for char in row['ingridient']:
        if unicodedata.name(char).startswith('VULGAR FRACTION'):  
            normalized = unicodedata.normalize('NFKC', char)
            df.iloc[ix, 2] = df.iloc[ix, 2].replace(char, normalized)

In [202]:
#sanity check for vulgar fractions removal
df.iloc[1:15, :]

Unnamed: 0,name,url,ingridient
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,1/2 teaspoon salt
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,2 tablespoons white distilled vinegar
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,2 tablespoons sesame oil
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,1 tablespoon sugar
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,salt and pepper
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,1 tablespoon toasted sesame seeds
1,No-Bake Alaska,https://www.food.com/recipe/no-bake-alaska-282721,1 (12 ounce) pre-made poundcake
1,No-Bake Alaska,https://www.food.com/recipe/no-bake-alaska-282721,"2 cups chocolate ice cream, slightly softened"
1,No-Bake Alaska,https://www.food.com/recipe/no-bake-alaska-282721,"1 (8 ounce) container whipped topping, thawed"
1,No-Bake Alaska,https://www.food.com/recipe/no-bake-alaska-282721,"1/4 cup flaked coconut, toasted"


#### Step 2: Problem-Specific Cleaning


The objective is to extract the ingredient name from sentences which contain additional information such as measurement, unit of measurement, ingredient state-specific information (chopped, minced, frozen etc).

#####  Step 2a. Data Exploration

In order to eliminate the additional information, the position of the additional information w.r.t the ingredient name is helpful. It is indicative of the position of the ingredient and almost follows a pattern though not strictly.

A few patterns and their example are:

- Pattern: quantity measurement ingredient
Example: 1 teaspoon soy sauce

- Pattern: quantity ingredient
Example: 2 eggs

- Pattern: quantity quantity ingredient, ingredient-specific information
Example: 1⁄2 onion, finely chopped

A few other patterns can be observed here


In [203]:
df.ingridient

0                          1 lb rinsed mung bean sprouts
0                                      1/2 teaspoon salt
0                  2 tablespoons white distilled vinegar
0                               2 tablespoons sesame oil
0                                     1 tablespoon sugar
                             ...                        
783                 1 (14.75 ounce) can cream-style corn
783                              1.5 cups cubed potatoes
783    1 (10.75 ounce) can condensed cream of mushroo...
783                                          3 cups milk
783                             salt and pepper to taste
Name: ingridient, Length: 7044, dtype: object

Let's see if there are any overlaps in this cleaned data.

In [204]:
df.ingridient.value_counts()

1 teaspoon vanilla extract                           76
0.5 teaspoon salt                                    68
1 teaspoon salt                                      64
1 egg                                                48
2 eggs                                               47
                                                     ..
1 (12 ounce) package frozen blueberries               1
1 prepared angel food cake, cut into chunks           1
2 cups chopped fresh strawberries                     1
1 (12 ounce) package frozen peach slices, chopped     1
1.5 cups cubed potatoes                               1
Name: ingridient, Length: 4046, dtype: int64


##### 2b. Ingredient Extraction Methodology via Named Entity Recognition

Since there is a dependency among the components of the sentence and we know that the ingredient name will be a noun, we can use this information along with custom regex expression to eliminate measurement units to extract the ingredient name.

Here, to find the ingredient,

PSEUDOCODE

for each token do the following  

  1.   If on checking the token dependencies, the dependencies of the token for sentences' subject or root is true then move to step 2          
  2.   If the token is a noun, then move to step 3.
  3.   Scan the token for childrens which are either modifications or compounds and not measurements and return the identified token as ingredient name



In [347]:
from nltk.stem import WordNetLemmatizer
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex
# load the existing small model from spacy
base_model = spacy.load('en_core_web_sm')


def custom_tokenizer(nlp):
    inf = list(nlp.Defaults.infixes)               # Default infixes
    # Remove the generic op between numbers or between a number and a -
    inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")
    inf = tuple(inf)                               # Convert inf to tuple
    # Add the removed rule after subtracting (?<=[0-9])-(?=[0-9]) pattern
    infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"])
    # Remove - between letters rule
    infixes = [x for x in infixes if '-|–|—|--|---|——|~' not in x]
    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)


base_model.tokenizer = custom_tokenizer(base_model)


def ingredient_parser(ingredients):
    # measures and common words (already lemmatized)
    measures = ["bowl", "bulb", "cube", "cup", "drop", "ounce", "oz", "pinch", "pound", "teaspoon",
                "tablespoon", "kg", "kilogram", "gram", "package", "gallon", "liter", "lb", "container", "inch", "chunk"]
    if 'garlic' in ingredients:
        measures.append('clove')
    words_to_remove = ['fresh', 'a', 'bunch', 'crushed', 'minced', 'ground', 'melted', 'softened', 'thawed', 'trimmed', '-inch' 'divided', 'piece', 'chopped',
                       'can', 'sliced', 'diced', 'grated', 'shredded', 'cubed', 'large', 'small', 'medium', 'thinly', 'coarsely', 'to', 'taste', 'rinsed', "thick", "drained", "cut", "half"]
    # Turn ingredient list from string into a list
    # We first get rid of all the punctuation
    translator = str.maketrans('', '', string.punctuation)
    # initialize nltk's lemmatizer
    lemmatizer = WordNetLemmatizer()
    ingredients.translate(translator)
    # We split up with hyphens as well as spaces
    items = re.split(' |,', ingredients)
    # Get rid of words containing non alphabet letters
    items = [word for word in items if word.isalpha(
    ) or re.search(r"^.*?(?=-[A-Za-z])", word)]
    # Turn everything to lowercase
    items = [word.lower() for word in items]
    # remove accents
    items = [unidecode.unidecode(word) for word in items]
    # Lemmatize words so we can compare words to measuring words
    items = [lemmatizer.lemmatize(word) for word in items]
    # get rid of stop words
    stop_words = set(corpus.stopwords.words('english'))
    items = [word for word in items if word not in stop_words]
    # Gets rid of measuring words/phrases, e.g. heaped teaspoon
    items = [word for word in items if word not in measures]
    # Get rid of common easy words
    items = [word for word in items if word not in words_to_remove]
    return ' '.join(items)


def extract_ingredient(ingri):
    parsedIngri = ingredient_parser(ingri)
    tokens = list(base_model(parsedIngri))
    ingredients = []
    if len(tokens) == 1:
        ingredients.insert(0, tokens[0].text)
    else:
      for token in tokens:
        if (token.dep_ in ['nsubj', 'ROOT', 'dobj', 'npadvmod', 'appos', 'compound']) and (token.pos_ in ['NOUN', 'PROPN', 'VERB']):
            # explore children
            for child in token.children:
                if child.dep_ in ['amod', 'conj', 'aux', 'nmod', 'oprd']:
                    ingredients.insert(0, child.text)
            ingredients.insert(0, token.text)
    return ' '.join(ingredients)


tests = []
test_extract = tests.append(extract_ingredient('1 kg all-purpose flour'))
test_extract = tests.append(extract_ingredient(
    '1 (14.75 ounce) can cream-style corn'))
test_extract = tests.append(extract_ingredient('1 tablespoon garlic powder'))
test_extract = tests.append(extract_ingredient('salt and pepper to taste'))
test_extract = tests.append(extract_ingredient(
    '3 (5 ounce) skinless, boneless chicken breasts, cut into chunks'))
test_extract = tests.append(extract_ingredient(
    '5 medium carrots, cut into 1-inch pieces'))
test_extract = tests.append(extract_ingredient(
    '1 (1.25 ounce) package beef with onion soup mix'))
test_extract = tests.append(extract_ingredient('vodka,liters'))
test_extract = tests.append(
    extract_ingredient('2 1/2 teaspoons ground cloves'))
test_extract = tests.append(
    extract_ingredient('2 cloves crushed garlic'))
test_extract = tests.append(
    extract_ingredient('2 cloves garlic, minced'))
test_extract = tests.append(
    extract_ingredient('2 tablespoons sesame oil'))
test_extract = tests.append(
    extract_ingredient('1 lb sugar'))
test_extract = tests.append(
    extract_ingredient('1 cup trimmed and coarsely chopped watercress'))


print(tests)


['flour all-purpose', 'corn cream-style', 'powder garlic', 'pepper salt', 'breast chicken boneless skinless', 'carrot', 'mix soup onion beef', 'vodka', 'clove', 'garlic', 'garlic', 'oil sesame', 'sugar', 'watercress']


In [348]:
extracted = []

for ix, row in df.iterrows():
    print('\r', "Extracting ingredient for row", ix, end='')
    extract = extract_ingredient(row['ingridient'])
    extracted.append(extract)

 Extracting ingredient for row 783

In [349]:
#convert to dataframe to view 
clean_recipe = df[['name', 'url']]
clean_recipe['ingredient'] = extracted

In [350]:
"""I earlier exploded the ingredient using comma as a delimiter, and due to the type of data, 
   there were multiple such rows where there were no ingredients but words such as 'chopped' or 'softened' which 
   will be dropped after the above function and we will have empty strings in ingredients column for such values. 
   Hence we need to get rid of those empty strings"""

nan_value = float("NaN")

clean_recipe.replace("", nan_value, inplace=True)

clean_recipe.dropna(subset = ["ingredient"], inplace=True)

clean_recipe

Unnamed: 0,name,url,ingredient
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,sprout bean mung
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,salt
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,vinegar distilled
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,oil sesame
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,sugar
...,...,...,...
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,corn cream-style
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,potato
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,soup condensed mushroom cream
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,milk


In [351]:
clean_recipe.to_csv("cleanData.csv") 

In [352]:
clean_recipe

Unnamed: 0,name,url,ingredient
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,sprout bean mung
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,salt
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,vinegar distilled
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,oil sesame
0,Mung Bean Sprout Salad,https://www.food.com/recipe/mung-bean-sprout-s...,sugar
...,...,...,...
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,corn cream-style
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,potato
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,soup condensed mushroom cream
783,Easy Corn Chowder I,https://www.allrecipes.com/recipe/15212/easy-c...,milk


### Task 3: Analysis and Calculation

#### Step 1: Count Calculation

In [353]:
count_df = pd.DataFrame(clean_recipe.ingredient.value_counts().rename_axis('ingredient').reset_index(name='count'))

In [354]:
print("There are {} unique ingredients".format(count_df.shape[0]))

There are 1749 unique ingredients


In [355]:
count_df.head(11)

Unnamed: 0,ingredient,count
0,salt,249
1,sugar white,238
2,flour all-purpose,197
3,butter,190
4,egg,189
5,onion,164
6,garlic,123
7,extract vanilla,122
8,milk,109
9,pepper black,100


#### Step 2: Proportion Calculation


Let us find if one ingredient appears more than once in a recipe. This is important as if they don't appear more than once than the count divided by the number of recipe will give us the proportion.

However, if an ingredient occurs more than once the count is not reflective of the number of recipes it occurs in alone and includes multiple occurence within a recipe.


In [356]:
count_recipe_ingredient = clean_recipe.groupby(['name', 'ingredient']).count()

In [357]:
count_recipe_ingredient

Unnamed: 0_level_0,Unnamed: 1_level_0,url
name,ingredient,Unnamed: 2_level_1
2-Ingredient Wheat-Free Banana Pancakes (Paleo),broken banana,1
2-Ingredient Wheat-Free Banana Pancakes (Paleo),butter,1
2-Ingredient Wheat-Free Banana Pancakes (Paleo),egg,1
3-Ingredient Pancakes,banana ripe,1
3-Ingredient Pancakes,cinnamon,1
...,...,...
Zucchini Walnut Bread,salt,1
Zucchini Walnut Bread,soda baking,1
Zucchini Walnut Bread,sugar white,1
Zucchini Walnut Bread,walnut,1


In [358]:
count_recipe_ingredient.url.value_counts()

1    6732
2     137
3       8
4       1
Name: url, dtype: int64



Since there are a few recipes which contain the same ingredient multiple number of times. This could be because there could be variation of the ingredient such as chopped, diced onions etc.

Thus, I first group the recipes by name and find the set of ingredient associated with each and then count each ingredients occurence to eventually calculate the proportion.


In [359]:
recipe_ingredient = clean_recipe.groupby('name')['ingredient'].apply(set)
print(recipe_ingredient)

name
2-Ingredient Wheat-Free Banana Pancakes (Paleo)                         {butter, broken banana, egg}
3-Ingredient Pancakes                              {banana ripe, powder baking, egg, cinnamon, ne...
A 20-Minute Chicken Parmesan                       {cheese parmesan, butter, egg, crumb seasoned ...
A Michelada for All (Vegan and Gluten Free)        {needed salt sriracha, juiced lime, pepper bla...
A and Z Dip                                        {cheese cream, heart artichoke, seasoning gall...
                                                                         ...                        
Zucchini Cake II                                   {zucchini, flour all-purpose, soda baking, pow...
Zucchini Noodles with Bolognese Sauce              {beef, spiralized zucchini, divided garlic, to...
Zucchini Relish                                    {zucchini, pepper red bell, sugar white, juice...
Zucchini Tomato Soup II                            {zucchini, broth chicken, tomato, n

In [360]:
ingd_count = {}
for el in count_df.ingredient:
    for r in recipe_ingredient.index:
        if el in recipe_ingredient[r]:
            if el not in ingd_count:
                ingd_count[el] = 1
            else:
                ingd_count[el] += 1

KeyboardInterrupt: 

In [329]:
prop_df = pd.DataFrame(ingd_count.items(), columns = ['ingredient', 'proportion'])

In [330]:
prop_df['proportion'] = prop_df['proportion'].div(len(df))

In [331]:
prop_df.sort_values( by = 'proportion', ascending = False)

Unnamed: 0,ingredient,proportion
0,salt,0.033788
1,sugar white,0.030097
2,flour all-purpose,0.026547
3,egg,0.026263
4,onion,0.022998
...,...,...
984,sauce canned tomato,0.000142
983,crumbled bouillon chicken,0.000142
982,pea,0.000142
981,bean black sodium reduced,0.000142


In [332]:
#join with count_df
results_df = pd.merge(count_df, prop_df, on = 'ingredient')
results_df

Unnamed: 0,ingredient,count,proportion
0,salt,249,0.033788
1,sugar white,238,0.030097
2,flour all-purpose,197,0.026547
3,egg,189,0.026263
4,onion,163,0.022998
...,...,...,...
1796,round eggplant,1,0.000142
1797,pumpkin peeled,1,0.000142
1798,shredded cheese swiss,1,0.000142
1799,mix dressing ranch,1,0.000142


In [333]:
#save to file only top 10
results_df.iloc[:10, :].to_csv("results.csv")
#save all results
results_df.to_csv("resultsAll.csv")

The top 10 ingredients are filled with condiments and diary mainly. The only vegetables here is garlic since it is used in almost all sauces, gravies etc. There is flour too and the presence of this along with eggs and butter suggest substantial baking recipes among the scraped dataset. 

Funnily enough, water is the top 5th ingredient, even though it is used in almost all recipes. This is because some of the recipes don't consider water as an ingredient.

We see that the 10th ingredient is 'package' and it was not properly removed during the cleaning step. 

### Further Improvements:

Task 1 was completed entirely using regex. It is simple and direct. However, if the website's front-end structure changes then I will need to adjust the code accordingly. 

I also coded for this using the Beautiful Soup module. This is similar to the approach that I demonstrated above. Here, instead of writing regex for getting the data, the formatting is fairly simpler. But I enjoy writing regex so I went with that for the demonstration. 

There is another way of going about the scrapping task. If website stores data in API and the website queries the API each time when user visit the website, you can simulate the request and directly query data from the API. It is definitely a preferred approach if you can find the API request. The data you receive will be more structured and stable. This is because compared to the website front end, it is less likely for the company to change its backend API. However, I could not find API for allrecipes.com. 


Task 2, specifically Problem-Specific cleaning was done using a simple approach, which yielded good results. But there is another approach that could be used to get even better results.

Another approach for Task 2: Named Entity Recognition
Named Entity Recognition (NER) can be used to extract ingredient name from the unstructured text. Named Entity Recognition is one of the first tasks of information extraction that seeks to locate and classify named entity mentioned in unstructed text into pre-defined categories such as people, organizations etc.

To create a custom named entity recongition model, we need training data which is annonated in the format specified by spacy documentation. I attempted this approach as well.  I am still in the process of structuring the code for this approach. So, for the purpose of this demonstration, I have not added that approach here. If needed, I can provide the code for the same. 




### Takeaway:
As a Data Science student, I have always believed that Data Science is not just about fancy Machine Learning algorithms. It involves a lot more than fitting just ML models. Understanding the problem, Data gathering, Data Cleaning and Feature Engineering are few of the tasks. In this assignment, I not only understood the problem statement and got the answers, I gathered my own data based on the problem statement, I cleaned the data and then performed calculation and reached the end goal. Even though there is still a scope for improvement (which I am working on), I had a fun time working on this end-to-end project! I am hoping to get to work on more such projects.  