### Text Mining Student Specialist Problem Set - Shrusti Ghela


#### Problem Statement:
Scrape at least 100 recipes from the web, provide their ingredient lists and clean the ingredient data for the further calculation.

You will have three main tasks:
- scraping data: scrape at least 100 recipes from the website that you choose
- cleaning data: clean your ingredient data for further calculation. This includes, but is not limited to, removing excess white spaces, correcting for all edge cases, and correcting any remaining formatting issues
- calculating: what are the 10 most common ingredients used in these recipes?

In [199]:
#library import
from urllib.request import urlopen
import re
import csv
import pandas as pd
import spacy

### About the website

While starting on this assignment, I came across this website called allrecipes.com. While analyzing the structure of the website to begin scrapping, I realized that the website is very well structured and it would be fairly easy to extract data from this. Also, the website has a lot of recipes from various cuisines and seems to be wellmaintained and moderated in terms of the recipes. Thus, I chose to work with this website.

### Task 1: Web Scrapping Recipes

There were two approaches that I considered to complete this task. 
- Approach 1: To extract required information from the HTML front end, directly use code to download the HTML contents and extract out useful information using requests and regex. 
- Approach 2: To extract required information from the HTML front end, directly use code to download the HTML contents and extract out useful information using using requests and Beautiful Soup.

About:
- Regular Expression (shortened as regex): It is a sequence of characters that specifies a search pattern in text.
- Requests: It is a Python module in which you can send HTTP requests to retrieve contents. It helps you to access website HTML contents by sending Get or Post requests.
- Beautiful Soup: It helps you parse the HTML or XML documents into a readable format. It allows you to search different elements within the documents and help you retrieve required information faster.

#### Step 1: Understanding the HTML 
After surfing the website for a couple of minutes, I understood that there is a base url to all the recipes on the website: https://www.allrecipes.com/recipe/ 
So, I checked out if all of these recipes with the base url have a similar HTML structure. To do this, I took a couple of pages at random and checked the structure. For demonstration here, I have provided the HTML structure of one such page. 

In [258]:
url = "https://www.allrecipes.com/recipe/23010"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
#print(html) 
#uncomment the above print statement to see the html struct


There is a lot of unwanted information here. But, we understand a few key poits from this which makes our task fairly simpler. 
- The title of the HTML page gives us the name of the recipe with some minor modifications. Hence, regex can be used to extract the name of the recipe. 
- The ingredients of that particular recipe is stored as a list with the key "recipeIngredient" for all the pages. Hence, regex with some simple functions can give us a list of ingredients. 
- Even though the url to any recipe is our input in this case, we can extract the complete url from the HTML using the key "url"

#### Step 2: Scrape the website allrecipes.com

In [140]:

recipies =[] #list of all the recipe names
link =[] #list of all the recipe url
ingri = [] #list of all the recipe ingredients
for i in range(23000, 23132):
        try:
            
            url = baseurl + str(i) #adding the page to the base url
            page = urlopen(url)
            html_bytes = page.read() 
            html = html_bytes.decode("utf-8") 
            
            pattern = "<title.*?>.*?</title.*?>" #regex to extract the title of the page - this contains the name
            match_results = re.search(pattern, html, re.IGNORECASE)
            title = match_results.group()
            title = re.sub("<.*?>", "", title) #removing the <title> tag from the found pattern
            title = re.sub("\| Allrecipes", "", title) #removing the unwanted data from the found pattern
            title = re.sub("Recipe", "", title) #removing the Recipe at the end of the name (completely optional)
            
            recipies.append(title) #adding the data to the list
            
            #print(title)
            
            pattern2 = r'"url": ".*?"' #regex to extract the complete url of the page
            match_results2 = re.search(pattern2, html, re.IGNORECASE)
            url2 = match_results2.group() 
            url2 = re.sub('"url":', "", url2) #removing the unwanted '"url": ' tag from the found pattern
            url2 = re.sub('"', "", url2) #removing unwanted " marks
            
            link.append(url2) #adding the data to the list 
            
            #print(url2)
            
            pattern3 = '(?<="recipeIngredient": \[)[\S\s]*(?="recipeInstructions")' #regex to extract the ingredients
            match_results3 = re.search(pattern3, html, re.IGNORECASE)
            ingridients = match_results3.group()
            ingridients = re.sub('\],', "", ingridients) #removing unwanted symbols from the pattern found
            ingridients = re.sub('"', "", ingridients) #removing unwanted symbols from the pattern found
            ingridients = re.sub('\n', "", ingridients) #removing new lines from the pattern
            ingridients = re.sub('\\s+', ' ', ingridients) #removing multiple white spaces and replacing it with single white space
            ingridients = ingridients.split(',') #converting the string to a list so that it could be converted into multiple rows later
            
            ingri.append(ingridients) #adding the data to the list 
            
            #print(ingridients)
            
        except:
            continue

 

#### Step 3: Scraping more than 100 recipes' name, link, and ingredients

In [133]:
baseurl = "https://www.allrecipes.com/recipe/"

[' 1 yellow onion',
 ' chopped',
 ' 1 tablespoon olive oil',
 ' 4 cloves garlic',
 ' minced',
 ' ½ pound cubed beef stew meat',
 ' 4 tablespoons tomato paste',
 ' 1 (28 ounce) can crushed tomatoes',
 ' 1 pinch dried parsley',
 ' 1 pinch dried basil',
 ' 1 pinch fennel seed',
 ' salt and pepper to taste ']

In [160]:
df = pd.DataFrame()
  
print(df)
  
# append columns to an empty DataFrame
df['name'] = recipies
df['url'] = link
df['ingridient'] = ingri

  
df

Empty DataFrame
Columns: []
Index: []


Unnamed: 0,name,url,ingridient
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,"[ 1 yellow onion, chopped, 1 tablespoon oliv..."
1,Pumpkin Rice with Apricots,https://www.allrecipes.com/recipe/23004/pumpk...,"[ ½ cup uncooked white rice, 1 ¾ cups water, ..."
2,Sweet Heavenly Rice Dessert,https://www.allrecipes.com/recipe/23005/sweet...,"[ 1 (15 ounce) can crushed pineapple, drained..."
3,Cold Oven Popovers,https://www.allrecipes.com/recipe/23006/cold-...,"[ 3 eggs, 1 cup milk, 1 cup all-purpose flou..."
4,Delicious Raisin Nut Banana Bread,https://www.allrecipes.com/recipe/23008/delic...,"[ 2 cups raisin nut bran cereal, ½ cup milk, ..."
...,...,...,...
105,Spinach and Bean Casserole,https://www.allrecipes.com/recipe/23126/spina...,"[ 1 cup dry black-eyed peas, ¼ cup olive oil,..."
106,Skillet Spaghetti Supper,https://www.allrecipes.com/recipe/23128/skill...,"[ 1 cup diced onion, 1 teaspoon minced garlic..."
107,Bobotie,https://www.allrecipes.com/recipe/23129/bobotie/,"[ 2 onions, thickly sliced, 3 cups water, o..."
108,Easy Candy Yams,https://www.allrecipes.com/recipe/23130/easy-...,"[ 4 cups chopped sweet potato, 1 (12 ounce) j..."


In [161]:
df = df.explode('ingridient') #convert the ingridients from the list to multiple rows


In [162]:
df

Unnamed: 0,name,url,ingridient
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 yellow onion
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,chopped
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 tablespoon olive oil
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,4 cloves garlic
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,minced
...,...,...,...
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,3 eggs
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,1 cup buttermilk
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,1 teaspoon vanilla extract
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,¼ teaspoon ground nutmeg


In [163]:
df.to_csv('rawData.csv', index=False) #creating a csv

### Task 2: Cleaning Scraped Data

On investigating the 109 scraped recipes, I observed that there are several formatting "edge cases" unique to this website. These edge cases were not at all related to the ingredient names. These edge cases are:

- Measurements are represented as ½ (called vulgar fractions)
- Several non-alphanumeric characters such as copyright and trademark symbols used to identify ingredients, comma as used in "½ onion, finely chopped", brackets as used in "1 (1 ounce) envelope dry onion soup mix", hyphens (-) as used in "all-purpose flour"


I will clean the data in two phases:

- Primary Cleaning: The objective of the first phase is to ensure that the data is readable and accessible on all platforms by fixing encoding errors and eliminating symbols which aren't translated well across platforms. This cleaning will not get rid of any punctuations, stopwords etc.

- Problem-specific cleaning :The objective of the second phase of the cleaning is to prepare the data for our calculations and is centered on the problem set requirement.

I believe it is a good practice to separate the both, as if requirements change in the future you can always proceed with the result of the first cleaning phase to perform another analysis altogether.

#### Step 1: Primary Cleaning

In [164]:
#primary cleaning
#convert vulgar fractions 
import unicodedata
unicodedata.numeric(u'⅕')
unicodedata.name(u'⅕')

#convert vulgar fractions
for ix, row in df.iterrows():
    for char in row['ingridient']:
        if unicodedata.name(char).startswith('VULGAR FRACTION'):  
            normalized = unicodedata.normalize('NFKC', char)
            df.iloc[ix, 2] = df.iloc[ix, 2].replace(char, normalized)

In [166]:
#sanity check for vulgar fractions removal
df.iloc[1:15, :]

Unnamed: 0,name,url,ingridient
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,chopped
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 tablespoon olive oil
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,4 cloves garlic
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,minced
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1⁄2 pound cubed beef stew meat
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,4 tablespoons tomato paste
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 (28 ounce) can crushed tomatoes
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 pinch dried parsley
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 pinch dried basil
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,1 pinch fennel seed


#### Step 2: Problem-Specific Cleaning


The objective is to extract the ingredient name from sentences which contain additional information such as measurement, unit of measurement, ingredient state-specific information (chopped, minced, frozen etc).

#####  Step 2a. Data Exploration

In order to eliminate the additional information, the position of the additional information w.r.t the ingredient name is helpful. It is indicative of the position of the ingredient and almost follows a pattern though not strictly.

A few patterns and their example are:

- Pattern: quantity measurement ingredient
Example: 1 teaspoon soy sauce

- Pattern: quantity ingredient
Example: 2 eggs

- Pattern: quantity quantity ingredient, ingredient-specific information
Example: 1⁄2 onion, finely chopped

A few other patterns can be observed here


In [167]:
df.ingridient

0                      1 yellow onion
0                             chopped
0              1 tablespoon olive oil
0                     4 cloves garlic
0                              minced
                    ...              
109                            3 eggs
109                  1 cup buttermilk
109        1 teaspoon vanilla extract
109          ¼ teaspoon ground nutmeg
109     1 (9 inch) unbaked pie crust 
Name: ingridient, Length: 1077, dtype: object

Let's see if there are any overlaps in this cleaned data.

In [169]:
df.ingridient.value_counts()

 chopped                          31
 softened                         16
 2 eggs                           13
 1 teaspoon salt                  11
 drained                          11
                                  ..
 1 cup canned pumpkin              1
 ¾ teaspoon ground cinnamon        1
 ¼ teaspoon ground nutmeg          1
 4 tablespoons shortening          1
 1 (9 inch) unbaked pie crust      1
Name: ingridient, Length: 717, dtype: int64


##### 2b. Ingredient Extraction Methodology via Named Entity Recognition

Since there is a dependency among the components of the sentence and we know that the ingredient name will be a noun, we can use this information along with custom regex expression to eliminate measurement units to extract the ingredient name.

Here, to find the ingredient,

PSEUDOCODE

for each token do the following  

  1.   If on checking the token dependencies, the dependencies of the token for sentences' subject or root is true then move to step 2          
  2.   If the token is a noun, then move to step 3.
  3.   Scan the token for childrens which are either modifications or compounds and not measurements and return the identified token as ingredient name



In [207]:
#load the existing small model from spacy
base_model = spacy.load('en_core_web_sm')

In [209]:
measurements = re.compile(r'(bowl|bulb|cube|clove|cup|drop|ounce|oz|pinch|pound|teaspoon|tablespoon)s?')
extracted = []

for ix, row in df.iterrows():
    print('\r', "Extracting ingredient for row", ix, end='')
    tokens = base_model(row['ingridient'])
    extract = ''
    for token in tokens:
        if (token.dep_ in ['nsubj', 'ROOT']) and (token.pos_ in ['NOUN', 'PROPN']) and (not measurements.match(token.text)):
        #explore children
            for child in token.children:
                if (not measurements.match(child.text)) and (child.dep_ in ['amod', 'compound']):
                    extract += child.text + ' '
            extract += token.text + ' '
    extracted.append(extract)

 Extracting ingredient for row 109

In [227]:
#convert to dataframe to view 
clean_recipe = df[['name', 'url']]
clean_recipe['ingredient'] = extracted

In [230]:
"""I earlier exploded the ingredient using comma as a delimiter, and due to the type of data, 
   there were multiple such rows where there were no ingredients but words such as 'chopped' or 'softened' which 
   will be dropped after the above function and we will have empty strings in ingredients column for such values. 
   Hence we need to get rid of those empty strings"""

nan_value = float("NaN")

clean_recipe.replace("", nan_value, inplace=True)

clean_recipe.dropna(subset = ["ingredient"], inplace=True)

clean_recipe

                       name  \
0    Sam's Original Sauce     
0    Sam's Original Sauce     
0    Sam's Original Sauce     
0    Sam's Original Sauce     
0    Sam's Original Sauce     
..                      ...   
109    Buttermilk Pie III     
109    Buttermilk Pie III     
109    Buttermilk Pie III     
109    Buttermilk Pie III     
109    Buttermilk Pie III     

                                                   url        ingredient  
0     https://www.allrecipes.com/recipe/23000/sams-...     yellow onion   
0     https://www.allrecipes.com/recipe/23000/sams-...        olive oil   
0     https://www.allrecipes.com/recipe/23000/sams-...           garlic   
0     https://www.allrecipes.com/recipe/23000/sams-...             beef   
0     https://www.allrecipes.com/recipe/23000/sams-...     tomato paste   
..                                                 ...               ...  
109   https://www.allrecipes.com/recipe/23131/butte...    purpose flour   
109   https://www.allrecipe

In [257]:
clean_recipe.to_csv("cleanData.csv") 

In [231]:
clean_recipe

Unnamed: 0,name,url,ingredient
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,yellow onion
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,olive oil
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,garlic
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,beef
0,Sam's Original Sauce,https://www.allrecipes.com/recipe/23000/sams-...,tomato paste
...,...,...,...
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,purpose flour
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,eggs
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,buttermilk
109,Buttermilk Pie III,https://www.allrecipes.com/recipe/23131/butte...,vanilla extract


### Task 3: Analysis and Calculation

#### Step 1: Count Calculation

In [233]:
count_df = pd.DataFrame(clean_recipe.ingredient.value_counts().rename_axis('ingredient').reset_index(name='count'))

In [234]:
print("There are {} unique ingredients".format(count_df.shape[0]))

There are 264 unique ingredients


In [237]:
count_df.head(11)

Unnamed: 0,ingredient,count
0,white sugar,33
1,butter,31
2,salt,28
3,eggs,25
4,water,19
5,vanilla extract,17
6,milk,15
7,purpose flour,15
8,garlic,13
9,package,12


#### Step 2: Proportion Calculation


Let us find if one ingredient appears more than once in a recipe. This is important as if they don't appear more than once than the count divided by the number of recipe will give us the proportion.

However, if an ingredient occurs more than once the count is not reflective of the number of recipes it occurs in alone and includes multiple occurence within a recipe.


In [238]:
count_recipe_ingredient = clean_recipe.groupby(['name', 'ingredient']).count()

In [239]:
count_recipe_ingredient

Unnamed: 0_level_0,Unnamed: 1_level_0,url
name,ingredient,Unnamed: 2_level_1
Ann's Sister's Meatloaf,eggs,1
Ann's Sister's Meatloaf,ketchup,1
Ann's Sister's Meatloaf,monosodium glutamate,1
Ann's Sister's Meatloaf,package,1
Ann's Sister's Meatloaf,slices bacon,1
...,...,...
Yorkshire Pudding I,milk,1
Yorkshire Pudding I,purpose flour,1
Zucchini Chive Dip,cream cheese,1
Zucchini Chive Dip,milk,1


In [240]:
count_recipe_ingredient.url.value_counts()

1    643
2     18
Name: url, dtype: int64



Since there are a few recipes which contain the same ingredient multiple number of times. This could be because there could be variation of the ingredient such as chopped, diced onions etc.

Thus, I first group the recipes by name and find the set of ingredient associated with each and then count each ingredients occurence to eventually calculate the proportion.


In [241]:
recipe_ingredient = clean_recipe.groupby('name')['ingredient'].apply(set)

In [242]:
ingd_count = {}
for el in count_df.ingredient:
    for r in recipe_ingredient.index:
        if el in recipe_ingredient[r]:
            if el not in ingd_count:
                ingd_count[el] = 1
            else:
                ingd_count[el] += 1

In [243]:
prop_df = pd.DataFrame(ingd_count.items(), columns = ['ingredient', 'proportion'])

In [245]:
prop_df['proportion'] = prop_df['proportion'].div(len(df))

In [246]:
prop_df.sort_values( by = 'proportion', ascending = False)

Unnamed: 0,ingredient,proportion
0,white sugar,0.028784
1,butter,0.025998
2,salt,0.025070
3,eggs,0.023213
4,water,0.016713
...,...,...
154,curry powder,0.000929
155,turmeric,0.000929
156,chutney,0.000929
157,beef,0.000929


In [248]:
#join with count_df
results_df = pd.merge(count_df, prop_df, on = 'ingredient')
results_df

Unnamed: 0,ingredient,count,proportion
0,white sugar,33,0.028784
1,butter,31,0.025998
2,salt,28,0.025070
3,eggs,25,0.023213
4,water,19,0.016713
...,...,...,...
259,instant tea powder,1,0.000929
260,flavored drink mix,1,0.000929
261,mixed nuts,1,0.000929
262,graham cracker crumbs,1,0.000929


In [249]:
#save to file only top 10
results_df.iloc[:10, :].to_csv("results.csv")
#save all results
results_df.to_csv("resultsAll.csv")

The top 10 ingredients are filled with condiments and diary mainly. The only vegetables here is garlic since it is used in almost all sauces, gravies etc. There is flour too and the presence of this along with eggs and butter suggest substantial baking recipes among the scraped dataset. 

Funnily enough, water is the top 5th ingredient, even though it is used in almost all recipes. This is because some of the recipes don't consider water as an ingredient.

We see that the 10th ingredient is 'package' and it was not properly removed during the cleaning step. 

### Further Improvements:

Task 1 was completed entirely using regex. It is simple and direct. However, if the website's front-end structure changes then I will need to adjust the code accordingly. 

I also coded for this using the Beautiful Soup module. This is similar to the approach that I demonstrated above. Here, instead of writing regex for getting the data, the formatting is fairly simpler. But I enjoy writing regex so I went with that for the demonstration. 

There is another way of going about the scrapping task. If website stores data in API and the website queries the API each time when user visit the website, you can simulate the request and directly query data from the API. It is definitely a preferred approach if you can find the API request. The data you receive will be more structured and stable. This is because compared to the website front end, it is less likely for the company to change its backend API. However, I could not find API for allrecipes.com. 


Task 2, specifically Problem-Specific cleaning was done using a simple approach, which yielded good results. But there is another approach that could be used to get even better results.

Another approach for Task 2: Named Entity Recognition
Named Entity Recognition (NER) can be used to extract ingredient name from the unstructured text. Named Entity Recognition is one of the first tasks of information extraction that seeks to locate and classify named entity mentioned in unstructed text into pre-defined categories such as people, organizations etc.

To create a custom named entity recongition model, we need training data which is annonated in the format specified by spacy documentation. I attempted this approach as well.  I am still in the process of structuring the code for this approach. So, for the purpose of this demonstration, I have not added that approach here. If needed, I can provide the code for the same. 




### Takeaway:
As a Data Science student, I have always believed that Data Science is not just about fancy Machine Learning algorithms. It involves a lot more than fitting just ML models. Understanding the problem, Data gathering, Data Cleaning and Feature Engineering are few of the tasks. In this assignment, I not only understood the problem statement and got the answers, I gathered my own data based on the problem statement, I cleaned the data and then performed calculation and reached the end goal. Even though there is still a scope for improvement (which I am working on), I had a fun time working on this end-to-end project! I am hoping to get to work on more such projects.  