# Upgrade the recipes dataset with accurate ingredient info
Many of the recipes suffer from one of the following problems: an inaccurate quantities list (e.g., in '1 1/2 cans of beans', the 1 and the 1/2 have been separated into two entries) or an inaccurate items list (e.g., in '2 spoons of butter or margarine', butter and margarine have been separated and added to the list as two entries). As a result, the quantities (`RecipeIngredientQuantities`) and the items (`RecipeIngredientParts`) often do not line up. The measures (lbs, teaspoons, cans, etc.) are also missing, so it is not clear what unit each quantity refers to. Run the following code to see an example:

    print('https://www.food.com/recipe/crunchy-peanut-biscuit-ring-18289')
    print(recipes.iloc[14987]['RecipeIngredientQuantities'])
    print(recipes.iloc[14987]['RecipeIngredientParts'])
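
To get a feel for how widespread the mismatch is, one option is to flag every recipe whose quantity list and item list differ in length. The sketch below is only an illustration: `find_length_mismatches` is a hypothetical helper, and the toy rows are made up to mirror the two failure modes described above rather than taken from the dataset.

```python
def find_length_mismatches(rows):
    """Return the ids of recipes whose quantity and item lists differ in length."""
    mismatched = []
    for recipe_id, quantities, parts in rows:
        if len(quantities) != len(parts):
            mismatched.append(recipe_id)
    return mismatched


# Hypothetical toy rows (recipe_id, quantities, parts); not real dataset values.
toy_rows = [
    (1, ["1", "1/2"], ["beans"]),            # '1 1/2 cans' split into two quantities
    (2, ["2"], ["butter", "margarine"]),     # 'butter or margarine' split into two items
    (3, ["1", "2"], ["flour", "eggs"]),      # consistent row
]

print(find_length_mismatches(toy_rows))
```

On the real frame, the same helper could probably be fed `zip(recipes['RecipeId'], recipes['RecipeIngredientQuantities'], recipes['RecipeIngredientParts'])`, assuming neither list column contains null cells.
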
I'm going to fix these problems by following the procedure below:

- Scrape all the links and recipe ids by making Selenium scroll through food.com/search for as long as it takes.
- Loop through all the links and extract the original `ingred_quants` and `ingred_items` from each page (maybe also other things such as `picked_by`).
- Create a dataframe containing `link`, `RecipeId`, etc., and join it with the Recipes dataset on `RecipeId`.
- Drop the original quantity and item columns (`RecipeIngredientQuantities` and `RecipeIngredientParts`) from the Recipes dataset.
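
The last two steps (join on `RecipeId`, then drop the old columns) might look roughly like the sketch below. The frames `recipes_toy` and `scraped_toy` and all their values are hypothetical stand-ins; only the column names come from this notebook. A left join is assumed so that recipes that were never scraped are kept with missing values.

```python
import pandas as pd

# Hypothetical stand-in for the Recipes dataset.
recipes_toy = pd.DataFrame({
    "RecipeId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "RecipeIngredientQuantities": [["1", "1/2"], ["2"], ["1"]],
    "RecipeIngredientParts": [["beans"], ["butter", "margarine"], ["flour"]],
})

# Hypothetical stand-in for the scraped frame; recipe 3 was not scraped.
scraped_toy = pd.DataFrame({
    "RecipeId": [1, 2],
    "ingred_quants": [["1 1/2"], ["2"]],
    "ingred_items": [["cans beans"], ["spoons butter or margarine"]],
})

# Left join keeps every recipe, even those without scraped data.
merged = recipes_toy.merge(scraped_toy, on="RecipeId", how="left")

# Drop the original, unreliable columns once the scraped ones are in place.
merged = merged.drop(columns=["RecipeIngredientQuantities", "RecipeIngredientParts"])
print(merged.columns.tolist())
```

Keeping unscraped recipes as rows with missing values makes it easy to see later which pages still need a retry.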


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
recipes = pd.read_parquet('recipes.parquet')

## Selenium Playground

In [143]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pandas as pd

#pattern = r'\d+(?:-\d+)?\n?.*?(?:\n\d+(?:\.\d+)?)?(?:\s*ml|\s*g)\s*\w+|\d+(?:-\d+)?\n?.*'

# Set up Selenium webdriver with Chrome
driver = webdriver.Chrome()

# Open the webpage
driver.get("https://www.food.com/recipe/crock-pot-chicken-with-black-beans-cream-cheese-89204")
time.sleep(1)

#button = driver.find_element(By.CLASS_NAME, 'button.button--units.svelte-ovaflp')

# Click the button
#button.click()
#time.sleep(1)  # Add a delay if necessary

ingred_quants = driver.find_elements(By.CLASS_NAME, 'ingredient-quantity.svelte-ovaflp')
ingred_quants = [quant.text for quant in ingred_quants]

ingred_items = driver.find_elements(By.CLASS_NAME, 'ingredient-text.svelte-ovaflp')
ingred_items = [item.text for item in ingred_items]

# Close the browser
driver.quit()

In [144]:
len(ingred_quants)

5

In [145]:
len(ingred_items)

5

In [146]:
for i,j in zip(ingred_quants,ingred_items):
    print(i,j)

4 -5 boneless chicken breasts, frozen
1 (15 1/2 ounce) can black beans
1 (15 ounce) can corn
1 (15 ounce) jar salsa, any kind
1 (8 ounce) package cream cheese
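
Note that `zip` silently stops at the shorter list, so the pairing above is only trustworthy because both lengths were checked first. A small helper that refuses mismatched lists could make that check explicit. This is a sketch: `pair_ingredients` is a hypothetical name, and the split between quantity and item text in the sample values is assumed from the printed lines above.

```python
def pair_ingredients(quants, items):
    """Pair quantities with items, raising if the two lists do not line up."""
    if len(quants) != len(items):
        raise ValueError(
            f"length mismatch: {len(quants)} quantities vs {len(items)} items"
        )
    # Collapse repeated whitespace so empty quantities do not leave stray spaces.
    return [" ".join(f"{q} {it}".split()) for q, it in zip(quants, items)]


# Sample values; the quantity/item split is an assumption.
sample_quants = ["1", "1"]
sample_items = ["(15 ounce) can corn", "(15 ounce) jar salsa, any kind"]

for line in pair_ingredients(sample_quants, sample_items):
    print(line)
```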


## Scrape the new Recipes Dataframe (`recipes_with_urls`)
- Loop through the links.
- Extract `ingred_quants` and `ingred_items` and add them to two new columns in `recipes_with_urls`.

In [4]:
import pandas as pd
pd.set_option('display.max_columns', 100)

In [5]:
df = pd.read_parquet('recipes_with_urls.parquet')

In [7]:
df.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
135813,142553.0,Spinach Pasta Rolls,89831,Kittencalrecipezazz,PT45M,PT1H,PT1H45M,2005-10-24 16:20:00+00:00,"This recipe goes back years, and I make it qui...",[],Savory,"[Brunch, Oven, Stove Top, < 4 Hours]","[1, 2, 2, 1⁄2, 1, 1, 1⁄2, None, 3, 2, 8]","[frozen spinach, garlic, ricotta cheese, parme...",4.0,1.0,436.3,22.6,12.4,96.5,900.5,32.6,2.3,9.8,25.8,,8 lasagna rolls,"[Grease an 8 x 8-inch baking dish., Boil lasag...",https://www.food.com/recipe/Spinach-Pasta-Roll...
411758,426857.0,Healthy Single-Serve Key Lime Pie Yogurt,1375473,the80srule,PT5M,PT0S,PT5M,2010-05-24 09:45:00+00:00,"Ok, not really key lime unless you use actual ...",[],Breakfast,"[Dessert, Lime, Citrus, Fruit, Low Cholesterol...","[1, 2, 1⁄8, 1⁄2]","[plain yogurt, key lime juice, vanilla extract...",5.0,1.0,182.6,8.0,5.1,31.9,113.4,20.2,0.1,18.3,8.6,1.0,1 cup,"[Mix all the ingredients together well, then s...",https://www.food.com/recipe/Healthy-Single-Ser...


In [8]:
df['ingred_quants'] = df['RecipeId'].apply(lambda x: [])
df['ingred_items'] = df['RecipeId'].apply(lambda x: [])

In [9]:
df.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items
19297,22690.0,Butterscotch-Ginger Ice Cream Sauce For Two,20371,Lennie,PT3M,PT5M,PT8M,2002-03-16 19:31:00+00:00,This is a wonderful sauce to serve over French...,[],Sauces,"[Frozen Desserts, Dessert, Canadian, Kid Frien...","[1⁄3, 2, 1⁄4, 3, 1⁄4, 1, 2]","[brown sugar, cornstarch, ground ginger, honey...",,,335.0,9.2,5.8,26.5,69.2,65.5,0.1,61.2,1.1,2.0,,"[In a two-cup glass measuring cup, combine the...",https://www.food.com/recipe/Butterscotch-Ginge...,[],[]
125661,132079.0,Cincinnati Chili,185105,Stacky5,PT15M,PT20M,PT35M,2005-08-02 18:25:00+00:00,This is a wonderful chili that is a personal f...,[https://img.sndimg.com/food/image/upload/w_55...,Spaghetti,"[Meat, Healthy, Kid Friendly, < 60 Mins, Begin...","[1 1⁄2, 2, 2, 2, 1⁄4, 1⁄8, 1, 4, 2, 16, 2]","[onions, garlic cloves, chili powder, ground c...",4.5,4.0,1177.5,36.6,17.9,169.9,1786.5,136.4,15.5,18.4,75.6,,,"[Cook the beef, onions and garlic in a large (...",https://www.food.com/recipe/Cincinnati-Chili-1...,[],[]


In [10]:
cf = pd.DataFrame(columns=['ingred_quants', 'ingred_items'])
cf

Unnamed: 0,ingred_quants,ingred_items


In [11]:
import numpy as np
links = np.array(df['url'])

In [12]:
len(links)

522517

In [16]:
for i, url in zip(range(13452,13459),links[13452:13459]):
    print(i)

13452
13453
13454
13455
13456
13457
13458


In [None]:
import time
from datetime import datetime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException

#cf= pd.DataFrame(columns=['url','ingred_quants','ingred_items'])

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode

# Set up Selenium webdriver with Chrome and the specified options
driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.Chrome()
print(datetime.now())

start = 23698  # Resume point: rows 0..23697 are already in cf
counter = 0    # Pages scraped since the last checkpoint

for i, url in enumerate(links[start:], start=start):
    try:
        driver.get(url)

        # find_elements returns an empty list (it does not raise) when nothing matches
        ingred_quants = driver.find_elements(By.CLASS_NAME, 'ingredient-quantity.svelte-ovaflp')
        ingred_quants = [quant.text for quant in ingred_quants]
        ingred_items = driver.find_elements(By.CLASS_NAME, 'ingredient-text.svelte-ovaflp')
        ingred_items = [item.text for item in ingred_items]

        cf.at[i, 'url'] = url
        cf.at[i, 'ingred_quants'] = ingred_quants
        cf.at[i, 'ingred_items'] = ingred_items

    except WebDriverException:
        # The page failed to load or an element lookup failed; skip this URL
        continue

    counter += 1  # Increment counter

    if counter == 100:
        # Checkpoint the DataFrame to Parquet every 100 scraped pages
        cf.to_parquet('Recipes_final.parquet')
        print(datetime.now())

        counter = 0  # Reset counter

# Save the final state of the DataFrame
cf.to_parquet('Recipes_final.parquet')

# Close the browser
driver.quit()

2023-06-20 23:06:44.835341
2023-06-20 23:14:28.966519
2023-06-20 23:21:33.428642
2023-06-20 23:27:27.634854
2023-06-20 23:34:05.705968
2023-06-20 23:40:35.944785
2023-06-20 23:49:07.375720
2023-06-20 23:56:59.933512
2023-06-21 00:04:22.983725
2023-06-21 00:10:32.905152
2023-06-21 00:16:59.944395
2023-06-21 00:22:42.561568
2023-06-21 00:28:32.541835
2023-06-21 00:34:20.254934
2023-06-21 00:40:42.640386
2023-06-21 00:46:42.143293
2023-06-21 00:52:58.582498
2023-06-21 00:59:04.755266
2023-06-21 01:04:56.126777
2023-06-21 01:10:55.557169
2023-06-21 01:16:32.642942
2023-06-21 01:22:53.419275
2023-06-21 01:29:08.453856
2023-06-21 01:35:32.834558
2023-06-21 01:41:17.508761
2023-06-21 01:47:46.103258
2023-06-21 01:53:44.297089
2023-06-21 01:59:56.135518
2023-06-21 02:06:17.192477
2023-06-21 02:12:20.063296
2023-06-21 02:18:09.547667
2023-06-21 02:24:19.655364
2023-06-21 02:30:53.449689
2023-06-21 02:36:54.436831
2023-06-21 02:46:52.949007
2023-06-21 02:54:36.668068
2023-06-21 03:03:50.745124
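
The timestamps above allow a rough throughput estimate. The first line is printed before the loop starts and each later line marks another 100 scraped pages, so the log covers 36 checkpoints. The sketch below assumes the pace stays constant and that skipped pages are negligible; it suggests roughly 4 seconds per page, which puts the remaining links on the order of three weeks of continuous scraping.

```python
from datetime import datetime

# First and last timestamps copied from the log above.
start_time = datetime.fromisoformat("2023-06-20 23:06:44.835341")
last_time = datetime.fromisoformat("2023-06-21 03:03:50.745124")

checkpoints = 36            # timestamps printed after the initial one
pages_per_checkpoint = 100  # the loop saves every 100 scraped pages
total_links = 522517        # len(links)
resume_index = 23698        # where this run started

pages_done = checkpoints * pages_per_checkpoint
seconds_per_page = (last_time - start_time).total_seconds() / pages_done

# Links still to visit after the last logged checkpoint.
remaining = total_links - resume_index - pages_done
eta_days = remaining * seconds_per_page / 86400

print(f"{seconds_per_page:.2f} s per page, about {eta_days:.1f} days remaining")
```

At that pace it may be worth considering several headless browsers in parallel, or checking whether the ingredient data can be fetched without rendering the full page.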

In [20]:
cf

Unnamed: 0,url,ingred_quants,ingred_items
0,https://www.food.com/recipe/Low-Fat-Berry-Blue...,"[4, 1⁄4, 1, 1]","[cups blueberries, fresh or frozen, cup granul..."
1,https://www.food.com/recipe/Biryani-39,"[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[tablespoon saffron, teaspoons milk, warm, hot..."
2,https://www.food.com/recipe/Best-Lemonade-40,"[1 1⁄2, 1, , 1 1⁄2, , 3⁄4]","[cups sugar, tablespoon lemons, rind of or 1 t..."
3,https://www.food.com/recipe/Carina's-Tofu-Vege...,"[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[ounces extra firm tofu, water-packed, medium ..."
4,https://www.food.com/recipe/Cabbage-Soup-42,"[46, 4, 1, 2, 1]","[ounces plain tomato juice, cups cabbage, shre..."
...,...,...,...
23694,https://www.food.com/recipe/Nutty-Cherry-Pie-2...,"[2, 1, 1, 3⁄4, 2, , 3⁄4]","[pie crusts, i use the ones in the dairy secti..."
23695,https://www.food.com/recipe/Texas-Hot-Cocoa-Ca...,"[3 1⁄2 - 4, 1⁄4, 1⁄2, 1, 2, 2, 1⁄2, 2, 1, 1, 1...","[tablespoons cocoa, lb butter, cup vegetable o..."
23696,https://www.food.com/recipe/Jack-Daniel's-Stea...,"[3, 10, 3 3⁄4]","[lbs sirloin steaks, ounces Worcestershire sau..."
23697,https://www.food.com/recipe/Horseradish-Mashed...,"[4, 8, 1⁄4, 1⁄2, 1, 1⁄8, 1⁄4, 1⁄4]","[ounces celery root, peeled and cut into chunk..."
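
The index of `cf` ends at 23697, which matches the hard-coded restart position of 23698 in the scraping loop. Instead of editing that number by hand after every interruption, the restart position could be derived from the rows already collected. This is a minimal sketch: `next_start_index` is a hypothetical helper, and it assumes the index values are integer positions into `links` and that every earlier position has already been attempted.

```python
def next_start_index(scraped_indices, default=0):
    """Return the position to resume from: one past the highest scraped index."""
    indices = list(scraped_indices)
    if not indices:
        return default
    return max(indices) + 1


print(next_start_index([0, 1, 2, 23697]))
print(next_start_index([]))
```

With the real frame this would be called as `next_start_index(cf.index)`.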


Now, make comparisons with the original columns of the `Recipes` dataframe (`RecipeIngredientQuantities` and `RecipeIngredientParts`) and then drop the latter.
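
One thing to watch for in that comparison: the original quantities use `None` for a missing value, while the scraped lists appear to contain empty strings. A sketch of a comparison that treats the two as equal is shown below. The helper names and the toy lists are hypothetical, not values from either dataset.

```python
def normalize(values):
    """Map None to '' and strip whitespace so the two sources compare fairly."""
    return ["" if v is None else str(v).strip() for v in values]


def same_quantities(original, scraped):
    """True when both lists hold the same quantities after normalization."""
    return normalize(original) == normalize(scraped)


# Hypothetical toy lists.
print(same_quantities(["1", None, "1⁄2"], ["1", "", "1⁄2 "]))
print(same_quantities(["1", "1⁄2"], ["1 1⁄2"]))
```

Rows where this returns `False` are the ones most likely affected by the splitting problems described at the top of the notebook.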