<a href="https://colab.research.google.com/github/cwf2/clas-3801-fa23/blob/main/allrecipes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install additional packages

We're going to use `requests-html` instead of `requests` because it seems to do a better job of mimicking a web browser, which makes it less likely that the site will block our requests.

In [1]:
!pip install requests-html

Collecting requests-html
  Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting pyquery (from requests-html)
  Downloading pyquery-2.0.0-py3-none-any.whl (22 kB)
Collecting fake-useragent (from requests-html)
  Downloading fake_useragent-1.4.0-py3-none-any.whl (15 kB)
Collecting parse (from requests-html)
  Downloading parse-1.20.0-py2.py3-none-any.whl (19 kB)
Collecting bs4 (from requests-html)
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting w3lib (from requests-html)
  Downloading w3lib-2.1.2-py3-none-any.whl (21 kB)
Collecting pyppeteer>=0.0.14 (from requests-html)
  Downloading pyppeteer-1.0.2-py3-none-any.whl (83 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.4/83.4 kB 10.4 MB/s eta 0:00:00
Collecting pyee<9.0.0,>=8.1.0 (from pyppeteer>=0.0.14->requests-html)
  Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting urllib3<2.0.0,>=1.25.8 (from pyppeteer>=0.

### Import statements

In addition to **requests-html** we're using **json** to decode the data stored in the page header, and **pandas** to process some of the data.

In [2]:
import os
import json
import time, random
import pandas as pd
from matplotlib import pyplot as plt
from requests_html import HTMLSession

### Preliminaries

Here we set up the aspects of the connection to AllRecipes that are common to the whole script.

In [3]:
# create a session for all HTTP requests
session = HTMLSession()

# this part of the URL doesn't change
base_url = 'https://www.allrecipes.com/recipe/'

### Download a single recipe

Here we test out our methods on a single page. Make sure to run this cell as infrequently as possible, so that the site doesn't get mad about bot-like patterns of requests.

In [4]:
page = '18379/best-green-bean-casserole/'
response = session.get(base_url+page)
if not response.ok:
  response.raise_for_status()

html = response.html


## Processing the results

Now that we have the HTTP response, we can take our time parsing it to pull out the information we want.

### Parsing the HTML

A lot of the information we want is right on the page. We just need to be able to navigate the HTML structure in order to find the bits we're interested in.

In [5]:
# title
title = html.find('h1.article-heading', first=True).text
print(title)

Best Green Bean Casserole


In [6]:
# ingredients
ingredients = []
for item in html.find('li.mntl-structured-ingredients__list-item > p'):
  quantity = item.find('span[@data-ingredient-quantity="true"]', first=True)
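  if quantity is not None:
    # keep the text as-is: other recipes use quantities like '1 ½' or leave it empty,
    # which int() cannot convert
    quantity = quantity.text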

  unit = item.find('span[@data-ingredient-unit="true"]', first=True)
  if unit is not None:
    unit = unit.text

  name = item.find('span[@data-ingredient-name="true"]', first=True)
  if name is not None:
    name = name.text

  ingredients.append(dict(
      quantity = quantity,
      unit = unit,
      name = name,
  ))

ingredients = pd.DataFrame(ingredients)
display(ingredients)

| | quantity | unit | name |
|---|---|---|---|
| 0 | 2 | (14.5 ounce) cans | green beans, drained |
| 1 | 1 | (10.5 ounce) can | condensed cream of mushroom soup |
| 2 | 1 | cup | shredded Cheddar cheese, divided |
| 3 | 1 | (6 ounce) can | French-fried onions |
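
The quantities on this page happen to be whole numbers, but other recipes use values such as `1 ½` or leave the quantity empty. Here is a minimal sketch of one way to turn such strings into numbers. It assumes that each non-ASCII token is a single Unicode fraction character; `parse_quantity` is a hypothetical helper, not something the rest of the notebook depends on.

```python
import unicodedata
from fractions import Fraction

def parse_quantity(text):
    """Convert a quantity string such as '2', '1/2' or '1 ½' to a float.

    Returns None for missing or empty quantities (e.g. 'salt to taste').
    """
    if text is None or not text.strip():
        return None
    total = 0.0
    for token in text.split():
        if token.isascii():
            # plain numbers and ASCII fractions: '2', '1.5', '1/2'
            total += float(Fraction(token))
        else:
            # assumption: a single Unicode fraction character such as '½'
            total += unicodedata.numeric(token)
    return total

print(parse_quantity('1 ½'))   # 1.5
print(parse_quantity(''))      # None
```

Keeping the raw text in the table and converting it in a separate step means that an unexpected format raises an error in one place, without interrupting the download itself.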


In [7]:
# nutrition facts summary
facts_summary = {}

for tr in html.find('tr.mntl-nutrition-facts-summary__table-row'):
  cells = tr.find('td')
  if len(cells) == 2:
    value, label = [cell.text for cell in cells]
    facts_summary[label] = value

print(facts_summary)

{'Calories': '322', 'Fat': '23g', 'Carbs': '20g', 'Protein': '7g'}


In [8]:
# steps
steps = []
for li in html.find('#recipe__steps-content_1-0 > ol > li'):
  step_paras = [p.text for p in li.find('li > p')]
  step_text = '\n'.join(step_paras)
  steps.append(step_text)

for i, step in enumerate(steps):
  print(i, step, sep='\t')

0	Preheat the oven to 350 degrees F (175 degrees C).
1	Mix green beans and condensed soup together in a large microwave-safe bowl until well combined. Microwave on high until warm, 3 to 5 minutes.
2	Stir 1/2 cup Cheddar cheese into the green bean mixture. Microwave on high for 2 to 3 minutes.
3	Transfer mixture to a casserole dish and spread evenly over the bottom. Sprinkle French-fried onions over top, then sprinkle with remaining Cheddar.
4	Bake in the preheated oven until cheese is melted and the onions are just turning brown, about 10 minutes.


### Parsing embedded JSON data

Some of the things we want, like the comments at the bottom of the page, are not initially in the HTML structure of the page; rather, they're included in JSON format as part of a `<script>` element in the header.

Here we extract a dictionary of recipe attributes from that element:

In [9]:
# get JSON data stored in the header
script = html.find('#allrecipes-schema_1-0', first=True)
data = json.loads(script.text)
data = data[0]

Now we can pull out the data we need using keys to the dictionary.

In [10]:
# number of ratings
n_ratings = data.get('aggregateRating', {}).get('ratingCount')

# average rating
rating = data.get('aggregateRating', {}).get('ratingValue')
if rating is not None:
  rating = float(rating)
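
The chained `.get()` calls are what keep this code from crashing when a recipe has no ratings yet. The sketch below shows the same pattern on a hand-written stand-in for the script text. The JSON here is invented for illustration and is much smaller than what the site actually serves; `get_rating` is a hypothetical helper.

```python
import json

# hypothetical, trimmed-down stand-in for the text of the <script> element
script_text = '''
[{"@type": ["Recipe"],
  "name": "Example Casserole",
  "aggregateRating": {"@type": "AggregateRating",
                      "ratingValue": "4.5",
                      "ratingCount": "12"}}]
'''

# the payload is a list containing a single dictionary
data = json.loads(script_text)[0]

def get_rating(data):
    """Return (average rating, number of ratings), using None for anything missing."""
    agg = data.get('aggregateRating', {})
    value = agg.get('ratingValue')
    count = agg.get('ratingCount')
    return (
        float(value) if value is not None else None,
        int(count) if count is not None else None,
    )

print(get_rating(data))   # (4.5, 12)
print(get_rating({}))     # (None, None)
```

Passing `{}` as the default to the first `.get()` is the important part: the second lookup then runs against an empty dictionary and returns `None`, so a missing block never raises a `KeyError`.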

In [11]:
# comments
comments = pd.DataFrame(dict(
    rating = comment.get('reviewRating', {}).get('ratingValue'),
    name = comment.get('author', {}).get('name'),
    text = comment.get('reviewBody'),
) for comment in data['review'])
display(comments)

| | rating | name | text |
|---|---|---|---|
| 0 | 5 | Karen | Added chopped water chestnuts and bacon, and m... |
| 1 | 5 | PamMiami | Very good!! I did add a splash of water to the... |
| 2 | 5 | queenandrea1988 | This is a traditional meal, not only for potlu... |
| 3 | 5 | Allrecipes Member | Quick and easy and great tasting |
| 4 | 5 | Allrecipes Member | It was delicious, despite not being a good che... |
| ... | ... | ... | ... |
| 94 | 5 | brittany | i loved it it was so delicious |
| 95 | 5 | Kelli Annette Holder | Very simple, I used fresh green beans so I boi... |
| 96 | 4 | Nicole | The dish was a major hit!! I made it for Frien... |
| 97 | 5 | barb63 | I've been making this variation of the green b... |


In [12]:
# full nutrition info
nutrition = {}
for k, v in data['nutrition'].items():
  if k.startswith('@'):
    continue
  if k.endswith('Content'):
    k = k[:-7]
  nutrition[k] = v

print(nutrition)

{'calories': '322 kcal', 'carbohydrate': '20 g', 'cholesterol': '20 mg', 'fiber': '2 g', 'protein': '7 g', 'saturatedFat': '9 g', 'sodium': '1068 mg', 'sugar': '3 g', 'fat': '23 g', 'unsaturatedFat': '0 g'}
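
All of these values are strings that combine a number and a unit, so they need one more step before they can be compared or plotted. Here is a minimal sketch, assuming every value has the form `number unit` separated by a single space, as in the output above; `split_amount` is a hypothetical helper.

```python
def split_amount(text):
    """Split a string like '1068 mg' into (1068.0, 'mg')."""
    number, _, unit = text.partition(' ')
    return float(number), unit

# a few entries copied from the output above
nutrition = {'calories': '322 kcal', 'sodium': '1068 mg', 'fat': '23 g'}

parsed = {k: split_amount(v) for k, v in nutrition.items()}
print(parsed['sodium'])   # (1068.0, 'mg')
```

If a value ever arrives without a numeric part, `float()` will raise a `ValueError`, which is probably what we want while exploring: it flags the odd case so that it isn't silently recorded as zero.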


# Part II

Now that we've tested each individual element of the page that we might want to extract, let's synthesize all the steps above in one custom function.
- It takes as an argument the path to a recipe.
- It downloads the recipe page and extracts all the data we want.
- It returns a dictionary with the extracted data.

In [13]:
def ripRecipe(path):
    '''Rip a single recipe from AllRecipes and return a dictionary'''

    # request the page from AllRecipes
    response = session.get(base_url+path)

    # check for HTTP errors
    if not response.ok:
        response.raise_for_status()

    # get just the HTML content of the response
    html = response.html

    #
    # extract data from the HTML
    #

    # title
    title = html.find('h1.article-heading', first=True).text


    # ingredients
    ingredients = []
    for item in html.find('li.mntl-structured-ingredients__list-item > p'):
        quantity = item.find('span[@data-ingredient-quantity="true"]', first=True)
        if quantity is not None:
            quantity = quantity.text

        unit = item.find('span[@data-ingredient-unit="true"]', first=True)
        if unit is not None:
            unit = unit.text

        name = item.find('span[@data-ingredient-name="true"]', first=True)
        if name is not None:
            name = name.text

        ingredients.append(dict(
            quantity = quantity,
            unit = unit,
            name = name,
        ))

    # nutrition facts summary
    facts_summary = {}

    for tr in html.find('tr.mntl-nutrition-facts-summary__table-row'):
        cells = tr.find('td')
        if len(cells) == 2:
            value, label = [cell.text for cell in cells]
            facts_summary[label] = value

    # steps
    steps = []
    for li in html.find('#recipe__steps-content_1-0 > ol > li'):
        step_paras = [p.text for p in li.find('li > p')]
        step_text = '\n'.join(step_paras)
        steps.append(step_text)

    #
    # get additional attributes from JSON data in the header
    #

    # extract the JSON data
    script = html.find('#allrecipes-schema_1-0', first=True)
    data = json.loads(script.text)
    data = data[0]

    # number of ratings
    n_ratings = data.get('aggregateRating', {}).get('ratingCount')

    # average rating
    rating = data.get('aggregateRating', {}).get('ratingValue')
    if rating is not None:
        rating = float(rating)

    # comments
    comments = [dict(
        rating = comment.get('reviewRating', {}).get('ratingValue'),
        name = comment.get('author', {}).get('name'),
        text = comment.get('reviewBody'),
    ) for comment in data.get('review') or []]

    # full nutrition info
    nutrition = {}
    for k, v in data.get('nutrition', {}).items():
        if k.startswith('@'):
            continue
        if k.endswith('Content'):
            k = k[:-7]
        # store every remaining key, not only those that ended in 'Content'
        nutrition[k] = v

    #
    # package everything as one dictionary
    #

    recipe = dict(
        title = title,
        ingredients = ingredients,
        facts_summary = facts_summary,
        steps = steps,
        n_ratings = n_ratings,
        rating = rating,
        comments = comments,
        nutrition = nutrition,
    )

    return recipe


### Testing it out

Let's try the new function on a different recipe.

In [14]:
path = '275305/green-beans-almondine/'
recipe = ripRecipe(path)
print(recipe)

{'title': 'Green Beans Almondine', 'ingredients': [{'quantity': '1', 'unit': 'pound', 'name': 'fresh green beans'}, {'quantity': '1 ½', 'unit': 'ounces', 'name': 'slivered almonds'}, {'quantity': '2', 'unit': 'tablespoons', 'name': 'butter'}, {'quantity': '2', 'unit': 'cloves', 'name': 'garlic, minced, or more to taste'}, {'quantity': '', 'unit': '', 'name': 'salt and ground black pepper to taste'}], 'facts_summary': {'Calories': '152', 'Fat': '12g', 'Carbs': '11g', 'Protein': '4g'}, 'steps': ['Place beans into a microwave-safe casserole dish with enough water to cover the bottom. Microwave on high power until almost tender, 8 to 10 minutes. Drain in a colander and refresh under cold running water to keep the nice green color.', 'Heat a frying pan over medium heat. Add almonds and cook until just starting to turn golden, about 3 to 5 minutes, watching closely so as to not burn. Take pan off of heat and add butter to melt. Return to heat and add garlic. Mix in green beans, season with s

### Try loading a list of recipes in batch mode

In [15]:
# download a list page instead of a single recipe

list_url = 'https://www.allrecipes.com/9x13-thanksgiving-casseroles-8401120'
response = session.get(list_url)
if not response.ok:
    response.raise_for_status()

html = response.html

In [16]:
# extract the links to featured recipes from the list

paths = []

for link in html.find('a.mntl-sc-block-featuredlink__link'):
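    # default to '' so that links without an href don't raise an error
    href = link.attrs.get('href', '')
    if href.startswith(base_url):
        # keep only the part of the URL after the shared prefix
        paths.append(href[len(base_url):])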

print('\n'.join(paths))

255942/savory-sausage-sun-dried-tomato-and-pumpkin-bread-pudding/
13478/sweet-potato-casserole-ii/
230283/tasty-baked-mac-n-cheese/
165112/merritts-butternut-squash-gratin/
231988/twice-baked-potato-casserole-with-bacon/
240157/loaded-mashed-potatoes/
234620/mascarpone-mashed-potatoes/
183620/savory-roasted-root-vegetables/
18249/baked-sweet-potatoes/
152886/thanksgiving-corn-casserole/
209928/spinach-casserole/
235239/thanksgiving-tradition-oyster-dressing/
26492/minnesota-wild-rice-dressing/
24741/cornbread-stuffing-with-sausage/
25891/garlic-potatoes-gratin/
270373/basic-funeral-potatoes/
240207/vintage-fresh-green-bean-casserole-circa-1956/
274668/tater-tot-corn-casserole/
235663/sweet-potato-crunch/
23386/sausage-and-cornbread-dressing/
19547/grandmas-corn-bread-dressing/
13606/awesome-broccoli-cheese-casserole/
13634/broccoli-and-stuffing-casserole/
276794/sweet-potato-and-gruyere-gratin/
23161/southern-cornbread-dressing/
219472/french-onion-green-bean-casserole/
275305/green-be
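
Slicing the link text works as long as every link is written exactly like `base_url`. A slightly more defensive sketch uses the standard library's `urllib.parse` to look at the host and the path separately. `recipe_path` and `RECIPE_PREFIX` are hypothetical names introduced for this example, and the host check assumes recipe links always point at `www.allrecipes.com`.

```python
from urllib.parse import urlparse

RECIPE_PREFIX = '/recipe/'   # assumed path prefix, matching base_url above

def recipe_path(href):
    """Return the part of a recipe URL after '/recipe/', or None for other links."""
    if not href:
        return None
    parsed = urlparse(href)
    if parsed.netloc != 'www.allrecipes.com':
        return None
    if not parsed.path.startswith(RECIPE_PREFIX):
        return None
    return parsed.path[len(RECIPE_PREFIX):]

print(recipe_path('https://www.allrecipes.com/recipe/13478/sweet-potato-casserole-ii/'))
# 13478/sweet-potato-casserole-ii/
```

Because `urlparse` separates the query string from the path, a link with tracking parameters appended would still yield a clean recipe path.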

### Precautions in downloading recipes in batch

- Cache results locally so that we don't have to re-do things that work
- Add a randomly timed pause in between downloads, to seem less robotic

In [17]:
# use a cache
cache_file = 'cache.json'

if os.path.exists(cache_file):
    with open(cache_file) as f:
        cache = json.load(f)
else:
    cache = {}

In [18]:
# download all the recipes

for i, path in enumerate(paths):
    print(f'[{i+1}/{len(paths)}] {path}', end=' ... ')

    # check cache before downloading
    if path in cache:
        print('cached')
        continue

    # download this recipe and save to cache
    cache[path] = ripRecipe(path)
    print('downloaded')

    # pause
    time.sleep(random.randint(2,10))

[1/31] 255942/savory-sausage-sun-dried-tomato-and-pumpkin-bread-pudding/ ... downloaded
[2/31] 13478/sweet-potato-casserole-ii/ ... downloaded
[3/31] 230283/tasty-baked-mac-n-cheese/ ... downloaded
[4/31] 165112/merritts-butternut-squash-gratin/ ... downloaded
[5/31] 231988/twice-baked-potato-casserole-with-bacon/ ... downloaded
[6/31] 240157/loaded-mashed-potatoes/ ... downloaded
[7/31] 234620/mascarpone-mashed-potatoes/ ... downloaded
[8/31] 183620/savory-roasted-root-vegetables/ ... downloaded
[9/31] 18249/baked-sweet-potatoes/ ... downloaded
[10/31] 152886/thanksgiving-corn-casserole/ ... downloaded
[11/31] 209928/spinach-casserole/ ... downloaded
[12/31] 235239/thanksgiving-tradition-oyster-dressing/ ... downloaded
[13/31] 26492/minnesota-wild-rice-dressing/ ... downloaded
[14/31] 24741/cornbread-stuffing-with-sausage/ ... downloaded
[15/31] 25891/garlic-potatoes-gratin/ ... downloaded
[16/31] 270373/basic-funeral-potatoes/ ... downloaded
[17/31] 240207/vintage-fresh-green-bean-ca

In [19]:
# save cache to file
print(f'writing {cache_file}')

with open(cache_file, 'w') as f:
    json.dump(cache, f)

writing cache.json
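
The cells above write the cache once, after the whole batch has finished. If a download fails partway through and the notebook session is lost, everything fetched so far goes with it. One possible variant, sketched below, saves after every download. `load_cache` and `download_all` are hypothetical helpers, and the `fetch` argument stands in for a function such as `ripRecipe`, so the sketch can be tried without touching the network.

```python
import json
import os
import random
import tempfile
import time

def load_cache(cache_file):
    """Read the cache from disk, or start with an empty one."""
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    return {}

def download_all(paths, fetch, cache_file, min_pause=2, max_pause=10):
    """Fetch every path not already cached, saving the cache after each download."""
    cache = load_cache(cache_file)
    for path in paths:
        if path in cache:
            continue
        cache[path] = fetch(path)
        # write immediately, so an interruption loses at most one recipe
        with open(cache_file, 'w') as f:
            json.dump(cache, f)
        # randomly timed pause between downloads
        time.sleep(random.uniform(min_pause, max_pause))
    return cache

# demonstration with a fake fetch function and no pause
calls = []

def fake_fetch(path):
    calls.append(path)
    return {'title': path}

with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, 'cache.json')
    download_all(['a/', 'b/'], fake_fetch, demo_file, 0, 0)
    # the second run finds 'a/' and 'b/' in the cache and only fetches 'c/'
    download_all(['a/', 'b/', 'c/'], fake_fetch, demo_file, 0, 0)

print(calls)   # ['a/', 'b/', 'c/']
```

Rewriting the whole file each time is wasteful for large batches, but for a few dozen recipes the cost is negligible next to the pause between requests.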
