# Part 2 - Scraping www.food.com

#### Name: Amirhossein Kiani
#### Contact: ahosseinkiani@gmail.com

In this notebook I **re**-scrape food.com to retrieve a large amount of information that is missing from two crucial columns: `ingredient_quants` and `ingredient_items`. This information turns out to be crucial down the road, in at least two stages:
- enriching the recipe text from which we extract the semantic embeddings used in our semantic search and recommendation systems.
- retrieving accurate recipe information when deploying our recommender system using Streamlit.

But **what is wrong** with the original `ingredient_quants` and `ingredient_items`? Many of the recipes suffer from one of the following:
- Inaccurate ingredient-quantity lists (e.g., in '1 1/2 cans of beans', 1 and 1/2 have been separated) and inaccurate ingredient-item lists (e.g., in '2 spoons of butter or margarine', butter and margarine have been separated and added to the list as two items). As a result of such mistakes, many mismatches occur between `ingred_quants` and `ingred_items`.
- Missing measures (lbs, teaspoons, cans, etc.), so it is not clear what units the quantities are measured in. Run the following code to see an example:
    - `print('https://www.food.com/recipe/crunchy-peanut-biscuit-ring-18289')`
    - `print(recipes.iloc[14987]['RecipeIngredientQuantities'])`
    - `print(recipes.iloc[14987]['RecipeIngredientParts'])`
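
A mismatch of this kind can be detected by comparing list lengths. The snippet below is a minimal sketch, not the check used by the author; the helper name `length_mismatch` is hypothetical, and the values are copied from the Kirshwasser Ice Cream row printed in the results section of this notebook.

```python
def length_mismatch(quantities, parts):
    """Return True when the two ingredient lists cannot be paired one-to-one."""
    return len(quantities) != len(parts)


# Values copied from the Kirshwasser Ice Cream row of the original dataset.
quantities = ["3", "3", "1⁄2", "1⁄2", "2", "2"]
parts = ["half-and-half", "heavy cream", "brandy", "sugar"]

# Six quantities but only four parts, so zip() would silently drop two values.
print(length_mismatch(quantities, parts))
```

Applied row by row, a check like this gives a rough count of how many recipes are affected, although it cannot catch rows where both lists are wrong in the same way.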
                        
I fix these problems by following the procedures below:

- Feature-engineer the recipe urls, which are missing from the original dataset uploaded on Kaggle.
- Loop through all the links and extract the information missing from `ingredient_quants` and `ingredient_items` from each page.
- Create a dataframe containing `link`, `RecipeID`, etc., and join it with the Recipes dataset on `RecipeID`.
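
The last step can be sketched as follows. This is an illustration under assumptions, not the author's join: the frames `recipes_toy` and `scraped_toy` are toy stand-ins, the helper `recipe_id_from_url` is hypothetical, and it assumes the recipe ID is always the trailing integer of the url, as it is in the urls built in this notebook.

```python
import re

import pandas as pd


def recipe_id_from_url(url):
    """Pull the trailing integer (assumed to be the recipe ID) out of a url."""
    match = re.search(r"-(\d+)$", url)
    return int(match.group(1)) if match else None


# Toy stand-ins for the real frames; only the column names follow the notebook.
recipes_toy = pd.DataFrame({"RecipeId": [40, 42],
                            "Name": ["Best Lemonade", "Cabbage Soup"]})
scraped_toy = pd.DataFrame({
    "url": ["https://www.food.com/recipe/Best-Lemonade-40",
            "https://www.food.com/recipe/Cabbage-Soup-42"],
    "ingred_quants": [["1 1⁄2", "1", "", "1 1⁄2", "", "3⁄4"],
                      ["46", "4", "1", "2", "1"]],
})

# Recover the join key from the url, then left-join onto the recipes.
scraped_toy["RecipeId"] = scraped_toy["url"].map(recipe_id_from_url)
merged = recipes_toy.merge(scraped_toy, on="RecipeId", how="left")
print(merged[["RecipeId", "Name", "ingred_quants"]])
```

A left join keeps every original recipe, so pages that failed to scrape show up as missing values rather than disappearing from the dataset.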

-----------------------------
**Selenium** is an open-source framework for automating web browsers, primarily used for testing web applications across different browsers and platforms. Among other features, it provides a set of APIs (Application Programming Interfaces) for controlling web browsers programmatically, which lets you simulate user interactions such as clicking buttons, filling out forms, and navigating through web pages. These features are used for scraping food.com in this notebook.


----------------------------

### Table of Contents

- [Imports and Sanity Checks](#imports)
- [Adding Recipe url to the Dataframe](#url)
- [Selenium Playground](#play)
- [Scrape the Engineered `url` Column](#scrape)
- [Checking the Results](#results)
- [Authors](#authors)
- [Conclusion](#conc)

## Imports and Sanity Checks <a class ='author' id='imports'></a>

In [23]:
# Regular imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# This allows scrolling through all the columns. Useful for dataframes with too many columns:

pd.set_option('display.max_columns', None)

In [10]:
recipes = pd.read_parquet('../recipes.parquet')

In [11]:
recipes.sample()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
147059,154198.0,Easy Pizza Bake,156034,Sherrybeth,PT40M,PT15M,PT55M,2006-02-01 14:10:00+00:00,This is such a quick and easy recipe and kids ...,[],...,7.7,29.5,1386.5,41.5,1.4,12.2,14.3,12.0,,[Use a large baking dish (13 X 9 is too small)...


## Adding Recipe `url` to the Dataframe <a class ='author' id='url'></a>
We first reconstruct the recipe urls from the original recipes dataset.
* We can use these urls to compare the recipe data recorded in the dataset against the actual info on the respective recipe webpages.
* We also use these links to scrape food.com in order to upgrade the ingredients (currently ongoing in another notebook).

In [14]:
recipes['url']= recipes['Name'].apply(lambda x: x.replace(' ','-')+'-')
recipes['url']

0                        Low-Fat-Berry-Blue-Frozen-Dessert-
1                                                  Biryani-
2                                            Best-Lemonade-
3                           Carina's-Tofu-Vegetable-Kebabs-
4                                             Cabbage-Soup-
                                ...                        
522512                      Meg's-Fresh-Ginger-Gingerbread-
522513    Roast-Prime-Rib-au-Poivre-with-Mixed-Peppercorns-
522514                               Kirshwasser-Ice-Cream-
522515            Quick-&-Easy-Asian-Cucumber-Salmon-Rolls-
522516                             Spicy-Baked-Scotch-Eggs-
Name: url, Length: 522517, dtype: object

In [35]:
recipes['url'] = recipes[['url', 'RecipeId']].apply(lambda x: 'https://www.food.com/recipe/' + x['url'] + str(int(x['RecipeId'])), axis=1)
recipes['url']

0         https://www.food.com/recipe/Low-Fat-Berry-Blue...
1                    https://www.food.com/recipe/Biryani-39
2              https://www.food.com/recipe/Best-Lemonade-40
3         https://www.food.com/recipe/Carina's-Tofu-Vege...
4               https://www.food.com/recipe/Cabbage-Soup-42
                                ...                        
522512    https://www.food.com/recipe/Meg's-Fresh-Ginger...
522513    https://www.food.com/recipe/Roast-Prime-Rib-au...
522514    https://www.food.com/recipe/Kirshwasser-Ice-Cr...
522515    https://www.food.com/recipe/Quick-&-Easy-Asian...
522516    https://www.food.com/recipe/Spicy-Baked-Scotch...
Name: url, Length: 522517, dtype: object
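
The urls built above keep characters such as `'` and `&` from the recipe names, and some names in the dataset contain HTML entities such as `&amp;`. The helper below is a hypothetical sketch of a stricter slug that reproduces the lowercase form of the url quoted in the introduction. Whether food.com requires this exact form has not been verified here, so treat it as one possible normalization rather than the site's rule.

```python
import html
import re


def build_recipe_url(name, recipe_id):
    """Build a lowercase, hyphen-separated recipe url from a name and an ID."""
    text = html.unescape(name).lower()        # turns '&amp;' back into '&'
    slug = re.sub(r"[^a-z0-9]+", "-", text)   # any run of other characters becomes '-'
    slug = slug.strip("-")                    # no leading or trailing hyphen
    return f"https://www.food.com/recipe/{slug}-{int(recipe_id)}"


# RecipeId is stored as a float in the dataset, hence the int() conversion.
print(build_recipe_url("Crunchy Peanut Biscuit Ring", 18289.0))
```

With a pandas dataframe this could be applied row-wise over `Name` and `RecipeId`, in the same way as the lambda above.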

In [9]:
recipes.sample()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
343826,356621.0,Grilled Scallops &amp; Walla Walla Sweet Onion...,576273,Chef GreanEyes,PT45M,PT55M,PT1H40M,2009-02-19 11:07:00+00:00,Make and share this Grilled Scallops &amp; Wal...,[],...,252.6,699.6,12.5,1.2,3.1,16.2,6.0,6,[Saute onion and garlic in olive oil over lowe...,https://www.food.com/recipe/Grilled-Scallops-&...


In [None]:
# Save the new recipes dataset with urls

#recipes.to_parquet('recipes_with_urls.parquet')

## Selenium Playground <a class ='author' id='play'></a>
You can safely ignore this section, as it serves merely as a playground for working out the best code for our scraping.

In [143]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pandas as pd

#pattern = r'\d+(?:-\d+)?\n?.*?(?:\n\d+(?:\.\d+)?)?(?:\s*ml|\s*g)\s*\w+|\d+(?:-\d+)?\n?.*'

# Set up Selenium webdriver with Chrome
driver = webdriver.Chrome()

# Open the webpage
driver.get("https://www.food.com/recipe/crock-pot-chicken-with-black-beans-cream-cheese-89204")
time.sleep(1)

#button = driver.find_element(By.CLASS_NAME, 'button.button--units.svelte-ovaflp')

# Click the button
#button.click()
#time.sleep(1)  # Add a delay if necessary

# CSS selectors: each one matches elements that carry both classes
ingred_quants = driver.find_elements(By.CSS_SELECTOR, '.ingredient-quantity.svelte-ovaflp')
ingred_quants = [quant.text for quant in ingred_quants]

ingred_items = driver.find_elements(By.CSS_SELECTOR, '.ingredient-text.svelte-ovaflp')
ingred_items = [item.text for item in ingred_items]

# Close the browser
driver.quit()

In [144]:
len(ingred_quants)

5

In [145]:
len(ingred_items)

5

In [146]:
for i,j in zip(ingred_quants,ingred_items):
    print(i,j)

4 -5 boneless chicken breasts, frozen
1 (15 1/2 ounce) can black beans
1 (15 ounce) can corn
1 (15 ounce) jar salsa, any kind
1 (8 ounce) package cream cheese
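
The pairing printed above can be wrapped in a small helper that also guards against unequal list lengths and tolerates empty quantities, which do occur in the scraped data. This is a minimal sketch; the name `pair_ingredients` is hypothetical, the first four pairs are taken from a recipe shown later in this notebook, and the last pair is a made-up example of an ingredient without a quantity.

```python
def pair_ingredients(quants, items):
    """Join each quantity with its item text; empty quantities are dropped."""
    if len(quants) != len(items):
        raise ValueError("quantity and item lists differ in length")
    return [f"{quant} {item}".strip() for quant, item in zip(quants, items)]


quants = ["4", "1⁄4", "1", "1", ""]
items = ["cups blueberries, fresh or frozen", "cup granulated sugar",
         "cup vanilla yogurt, 1% or nonfat", "tablespoon lemon juice",
         "salt, to taste"]  # hypothetical item with no quantity

for line in pair_ingredients(quants, items):
    print(line)
```

Raising on a length mismatch is a deliberate choice here: a silent `zip` would hide exactly the kind of misalignment this notebook sets out to fix.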


## Scrape the Engineered `url` Column <a class ='author' id='scrape'></a>

- Loop through the links.
- Extract `ingred_quants` and `ingred_items` from each page and add them to two new columns in `recipes_with_urls`.
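
The two steps above can be sketched with the browser call abstracted away, which makes the control flow easy to test and to resume after an interruption. Everything here is an assumption for illustration: `scrape_all`, `fetch`, `save` and `done` are hypothetical names, and the real notebook uses Selenium and a pickle file in their place.

```python
def scrape_all(links, fetch, save, checkpoint_every=100, done=None):
    """Scrape links in order, skipping finished ones and saving periodically.

    fetch(url) returns (quants, items); save(results) persists the dict so far.
    Both are hypothetical callables supplied by the caller.
    """
    results = dict(done or {})
    since_save = 0
    for url in links:
        if url in results:
            continue                      # already scraped in an earlier run
        try:
            results[url] = fetch(url)
        except Exception as exc:          # keep going, record nothing for this url
            print(f"skipped {url}: {exc}")
            continue
        since_save += 1
        if since_save == checkpoint_every:
            save(results)
            since_save = 0
    save(results)                         # final state
    return results


def fake_fetch(url):
    """Stand-in for the Selenium call; fails on one url to show the skip path."""
    if url.endswith("bad"):
        raise RuntimeError("page failed to load")
    return (["1"], ["cup sugar"])


saved = []
out = scrape_all(["a", "bad", "b"], fake_fetch, saved.append, checkpoint_every=2)
print(sorted(out))
```

Passing the previously saved results as `done` lets a restarted run continue where the last checkpoint left off instead of starting from the first link.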

In [5]:
# recipes = pd.read_parquet('recipes_with_urls.parquet')

In [7]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
135813,142553.0,Spinach Pasta Rolls,89831,Kittencalrecipezazz,PT45M,PT1H,PT1H45M,2005-10-24 16:20:00+00:00,"This recipe goes back years, and I make it qui...",[],Savory,"[Brunch, Oven, Stove Top, < 4 Hours]","[1, 2, 2, 1⁄2, 1, 1, 1⁄2, None, 3, 2, 8]","[frozen spinach, garlic, ricotta cheese, parme...",4.0,1.0,436.3,22.6,12.4,96.5,900.5,32.6,2.3,9.8,25.8,,8 lasagna rolls,"[Grease an 8 x 8-inch baking dish., Boil lasag...",https://www.food.com/recipe/Spinach-Pasta-Roll...
411758,426857.0,Healthy Single-Serve Key Lime Pie Yogurt,1375473,the80srule,PT5M,PT0S,PT5M,2010-05-24 09:45:00+00:00,"Ok, not really key lime unless you use actual ...",[],Breakfast,"[Dessert, Lime, Citrus, Fruit, Low Cholesterol...","[1, 2, 1⁄8, 1⁄2]","[plain yogurt, key lime juice, vanilla extract...",5.0,1.0,182.6,8.0,5.1,31.9,113.4,20.2,0.1,18.3,8.6,1.0,1 cup,"[Mix all the ingredients together well, then s...",https://www.food.com/recipe/Healthy-Single-Ser...


In [8]:
recipes['ingred_quants'] = recipes['RecipeId'].apply(lambda x: [])
recipes['ingred_items'] = recipes['RecipeId'].apply(lambda x: [])

In [9]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items
19297,22690.0,Butterscotch-Ginger Ice Cream Sauce For Two,20371,Lennie,PT3M,PT5M,PT8M,2002-03-16 19:31:00+00:00,This is a wonderful sauce to serve over French...,[],Sauces,"[Frozen Desserts, Dessert, Canadian, Kid Frien...","[1⁄3, 2, 1⁄4, 3, 1⁄4, 1, 2]","[brown sugar, cornstarch, ground ginger, honey...",,,335.0,9.2,5.8,26.5,69.2,65.5,0.1,61.2,1.1,2.0,,"[In a two-cup glass measuring cup, combine the...",https://www.food.com/recipe/Butterscotch-Ginge...,[],[]
125661,132079.0,Cincinnati Chili,185105,Stacky5,PT15M,PT20M,PT35M,2005-08-02 18:25:00+00:00,This is a wonderful chili that is a personal f...,[https://img.sndimg.com/food/image/upload/w_55...,Spaghetti,"[Meat, Healthy, Kid Friendly, < 60 Mins, Begin...","[1 1⁄2, 2, 2, 2, 1⁄4, 1⁄8, 1, 4, 2, 16, 2]","[onions, garlic cloves, chili powder, ground c...",4.5,4.0,1177.5,36.6,17.9,169.9,1786.5,136.4,15.5,18.4,75.6,,,"[Cook the beef, onions and garlic in a large (...",https://www.food.com/recipe/Cincinnati-Chili-1...,[],[]


In [None]:
from datetime import datetime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException

# Links to visit (assumed here to be the engineered `url` column) and an
# empty frame that collects the scraped lists
links = recipes['url'].tolist()
cf = pd.DataFrame(columns=['url', 'ingred_quants', 'ingred_items'])

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode

# Set up Selenium webdriver with Chrome and the specified options
driver = webdriver.Chrome(options=chrome_options)
print(datetime.now())

counter = 0  # Pages scraped since the last checkpoint

for i, url in enumerate(links):
    try:
        driver.get(url)

        # find_elements returns an empty list (it does not raise) when nothing matches
        ingred_quants = driver.find_elements(By.CSS_SELECTOR, '.ingredient-quantity.svelte-ovaflp')
        ingred_quants = [quant.text for quant in ingred_quants]
        ingred_items = driver.find_elements(By.CSS_SELECTOR, '.ingredient-text.svelte-ovaflp')
        ingred_items = [item.text for item in ingred_items]

        cf.at[i, 'url'] = url
        cf.at[i, 'ingred_quants'] = ingred_quants
        cf.at[i, 'ingred_items'] = ingred_items

    except WebDriverException:
        # The page failed to load or an element went stale: skip this link
        continue

    counter += 1

    if counter == 100:
        # Checkpoint: save DataFrame to pickle file
        cf.to_pickle('Recipes_final2.pkl')
        print(datetime.now())
        counter = 0

# Save the final state of the DataFrame
cf.to_pickle('Recipes_final2.pkl')

# Close the browser
driver.quit()

2023-06-20 23:06:44.835341
2023-06-20 23:14:28.966519
2023-06-20 23:21:33.428642
2023-06-20 23:27:27.634854
2023-06-20 23:34:05.705968
2023-06-20 23:40:35.944785
2023-06-20 23:49:07.375720
2023-06-20 23:56:59.933512
2023-06-21 00:04:22.983725
2023-06-21 00:10:32.905152
2023-06-21 00:16:59.944395
2023-06-21 00:22:42.561568
2023-06-21 00:28:32.541835
2023-06-21 00:34:20.254934
2023-06-21 00:40:42.640386
2023-06-21 00:46:42.143293
2023-06-21 00:52:58.582498
2023-06-21 00:59:04.755266
2023-06-21 01:04:56.126777
2023-06-21 01:10:55.557169
2023-06-21 01:16:32.642942
2023-06-21 01:22:53.419275
2023-06-21 01:29:08.453856
2023-06-21 01:35:32.834558
2023-06-21 01:41:17.508761
2023-06-21 01:47:46.103258
2023-06-21 01:53:44.297089
2023-06-21 01:59:56.135518
2023-06-21 02:06:17.192477
2023-06-21 02:12:20.063296
2023-06-21 02:18:09.547667
2023-06-21 02:24:19.655364
2023-06-21 02:30:53.449689
2023-06-21 02:36:54.436831
2023-06-21 02:46:52.949007
2023-06-21 02:54:36.668068
2023-06-21 03:03:50.745124
2
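
The timestamps above allow a rough throughput estimate, since the loop prints one timestamp at the start and one after every 100 scraped pages. The calculation below is a back-of-the-envelope sketch that uses only the first and last complete timestamps in the log and assumes the pace stays constant over the whole run.

```python
from datetime import datetime

# First and last complete timestamps in the log above.
start = datetime.fromisoformat("2023-06-20 23:06:44.835341")
end = datetime.fromisoformat("2023-06-21 03:03:50.745124")

checkpoints = 36                  # 37 timestamps give 36 intervals of 100 pages
pages = checkpoints * 100

seconds_per_page = (end - start).total_seconds() / pages
total_links = 522_517             # length of the url column shown earlier
estimated_days = seconds_per_page * total_links / 86_400

print(f"{seconds_per_page:.2f} s/page, about {estimated_days:.0f} days for all links")
```

At roughly four seconds per page, a single headless browser needs on the order of weeks to cover every link, which is why the periodic checkpointing in the loop matters.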

## Checking the Results <a class ='author' id='results'></a>

Let's now compare the extracted ingredient data with the original data:

In [32]:
new_data = pd.read_pickle('../Recipes_final2.pkl')

In [33]:
new_data

Unnamed: 0,url,ingred_quants,ingred_items
0,https://www.food.com/recipe/Low-Fat-Berry-Blue...,"[4, 1⁄4, 1, 1]","[cups blueberries, fresh or frozen, cup granul..."
1,https://www.food.com/recipe/Biryani-39,"[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[tablespoon saffron, teaspoons milk, warm, hot..."
2,https://www.food.com/recipe/Best-Lemonade-40,"[1 1⁄2, 1, , 1 1⁄2, , 3⁄4]","[cups sugar, tablespoon lemons, rind of or 1 t..."
3,https://www.food.com/recipe/Carina's-Tofu-Vege...,"[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[ounces extra firm tofu, water-packed, medium ..."
4,https://www.food.com/recipe/Cabbage-Soup-42,"[46, 4, 1, 2, 1]","[ounces plain tomato juice, cups cabbage, shre..."
...,...,...,...
521707,https://www.food.com/recipe/Budget-Friendly-Ga...,"[8, 12, 6, 1, 3, 3, 5, 2, ]","[ounces fettuccine pasta, ounces button mushro..."
521708,https://www.food.com/recipe/Cafe-Au-Cin-540556,"[1, 1, 1⁄2, 2, 1⁄2]","[cup dark roast coffee, tablespoon sugar, teas..."
521709,https://www.food.com/recipe/Bubbly-Champagne-P...,"[1, 1, 1, 1, 16, 25, 1, 1, ]","[Brut champagne, orange, large apple, large pe..."
521710,https://www.food.com/recipe/Cheesy-Baked-Lobst...,"[2, 3, 1, 1, 1, 1, 1, 1, 1, 5]","[lobster tails, tablespoons garlic powder, cup..."


In [51]:
recipes[['url', 'RecipeIngredientQuantities', 'RecipeIngredientParts']]

Unnamed: 0,url,RecipeIngredientQuantities,RecipeIngredientParts
0,https://www.food.com/recipe/Low-Fat-Berry-Blue...,"[4, 1⁄4, 1, 1]","[blueberries, granulated sugar, vanilla yogurt..."
1,https://www.food.com/recipe/Biryani-39,"[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[saffron, milk, hot green chili peppers, onion..."
2,https://www.food.com/recipe/Best-Lemonade-40,"[1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4]","[sugar, lemons, rind of, lemon, zest of, fresh..."
3,https://www.food.com/recipe/Carina's-Tofu-Vege...,"[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[extra firm tofu, eggplant, zucchini, mushroom..."
4,https://www.food.com/recipe/Cabbage-Soup-42,"[46, 4, 1, 2, 1]","[plain tomato juice, cabbage, onion, carrots, ..."
...,...,...,...
522512,https://www.food.com/recipe/Meg's-Fresh-Ginger...,"[3, 1⁄2, 1⁄2, 1⁄4, 1⁄4, 1, 1⁄4, 1 1⁄2, 1, 1⁄2,...","[fresh ginger, unsalted butter, dark brown sug..."
522513,https://www.food.com/recipe/Roast-Prime-Rib-au...,"[9, 2, 4, 2, 1⁄3, 3 1⁄2, 1⁄3]","[Dijon mustard, garlic, peppercorns, shallot, ..."
522514,https://www.food.com/recipe/Kirshwasser-Ice-Cr...,"[3, 3, 1⁄2, 1⁄2, 2, 2]","[half-and-half, heavy cream, brandy, sugar]"
522515,https://www.food.com/recipe/Quick-&-Easy-Asian...,"[4, 1⁄4, 1, 1⁄2, 8 1⁄2, 40]","[wasabi paste, dill, English cucumber, smoked ..."


In [39]:
for i,j in zip(recipes['RecipeIngredientQuantities'].iloc[0],recipes['RecipeIngredientParts'].iloc[0]):
    print(f"{i} {j}")

4 blueberries
1⁄4 granulated sugar
1 vanilla yogurt
1 lemon juice


In [40]:
for i,j in zip(new_data['ingred_quants'].iloc[0],new_data['ingred_items'].iloc[0]):
    print(f"{i} {j}")

4 cups blueberries, fresh or frozen
1⁄4 cup granulated sugar
1 cup vanilla yogurt, 1% or nonfat
1 tablespoon lemon juice


We can see that the measures are now retrieved, along with contextual terms such as 'fresh or frozen'.
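
Because the scraped item text starts with the measure, the unit can be separated from the ingredient when a structured field is needed. The sketch below assumes a deliberately small, hand-written unit vocabulary (`UNITS`) and a hypothetical helper `split_measure`; the real data contains many more units and formats, such as '(15 ounce) can'.

```python
# Assumed, deliberately small unit vocabulary; the real data has many more units.
UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
         "ounce", "ounces", "lb", "lbs"}


def split_measure(item_text):
    """Split a leading unit off the item text; return (None, text) if none is found."""
    first, _, rest = item_text.partition(" ")
    if first.lower() in UNITS:
        return first, rest
    return None, item_text


for text in ["cups blueberries, fresh or frozen",
             "tablespoon lemon juice",
             "lobster tails"]:
    print(split_measure(text))
```

Items without a recognised unit, such as 'lobster tails', are returned unchanged, so nothing is lost when the vocabulary is incomplete.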

In [44]:
recipes[recipes['url'] =='https://www.food.com/recipe/Crunchy-Peanut-Biscuit-Ring-18289']

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
14987,18289.0,Crunchy Peanut Biscuit Ring,28397,Alexis Mom,PT20M,PT5M,PT25M,2002-01-28 11:12:00+00:00,This is great recipe for a breakfast or brunch...,[],Breads,"[Dessert, < 30 Mins, Oven, Easy]","[1⁄2, 1⁄2 - 1, 3⁄4, 1⁄4, 2]","[butter, margarine, peanuts, brown sugar, mapl...",,,483.6,25.5,9.8,30.5,808.2,58.4,0.8,26.2,7.4,,,"[Heat oven to 350F., Lightly grease a 12 cup f...",https://www.food.com/recipe/Crunchy-Peanut-Bis...


We can also check that the mismatch between ingredient quantities and items is now gone (see https://www.food.com/recipe/crunchy-peanut-biscuit-ring-18289 for comparison):

In [49]:
# The original recipes dataset

for i,j in zip(recipes.iloc[14987]['RecipeIngredientQuantities'],recipes.iloc[14987]['RecipeIngredientParts']):
    print(f"{i} {j}")

1⁄2 butter
1⁄2 - 1 margarine
3⁄4 peanuts
1⁄4 brown sugar
2 maple syrup


In [50]:
# The scraped data

for i,j in zip(new_data.iloc[14987]['ingred_quants'],new_data.iloc[14987]['ingred_items']):
    print(f"{i} {j}")

1⁄2 cup butter or 1/2 cup margarine
1⁄2 - 1 cup chopped peanuts
3⁄4 cup firmly packed brown sugar
1⁄4 cup maple syrup
2 (10 ounce) cans Hungry Jack big flaky refrigerated biscuits


## Conclusion <a class ='author' id='conc'></a>

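We created recipe urls and scraped them to retrieve much of the data missing from the original Kaggle dataset. We also illustrated how this resolves the problems described at the start: the measures are recovered, and the quantities line up with the ingredient items again.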