<a href="https://colab.research.google.com/github/cwf2/clas-3801-fa23/blob/main/allrecipes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install additional packages

We're going to use `requests-html` instead of `requests` because it seems to do a better job of mimicking a web browser and thus not getting booted from the site.

In [None]:
!pip install requests-html

### Import statements

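In addition to **requests-html** we're using **json** to decode the data stored in the page header and **pandas** to process some of the data. The standard-library modules **os**, **time**, and **random** come in later, for caching results and for pausing between downloads.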

In [None]:
import os
import json
import time, random
import pandas as pd
from matplotlib import pyplot as plt
from requests_html import HTMLSession

### Preliminaries

Aspects of the connection to AllRecipes that are common to the whole script.

In [None]:
# create a session for all HTTP requests
session = HTMLSession()

# this part of the URL doesn't change
base_url = 'https://www.allrecipes.com/recipe/'

### Download a single recipe

Here we test out our methods on a single page. Make sure to run this cell as infrequently as possible, so that the site doesn't get mad about bot-like patterns of requests.

In [None]:
page = '18379/best-green-bean-casserole/'
response = session.get(base_url+page)
if not response.ok:
  response.raise_for_status()

html = response.html


## Processing the results

Now that we have the HTTP response, we can take our time parsing it to pull out the information we want.

### Parsing the HTML

A lot of the information we want is right on the page. We just need to be able to navigate the HTML structure in order to find the bits we're interested in.
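
To make the idea concrete without contacting the site, here is a minimal offline sketch. It uses the standard library's `html.parser` rather than `requests-html`, and the HTML snippet is made up; only the `data-ingredient-*` attribute names are taken from this notebook. `IngredientParser` is a hypothetical helper written for this illustration, not part of any library.

```python
from html.parser import HTMLParser

class IngredientParser(HTMLParser):
    """Hypothetical helper: collect text from <span data-ingredient-*="true">."""
    FIELDS = ('quantity', 'unit', 'name')

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per <li> encountered
        self._field = None   # field of the <span> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            # start a new ingredient with every field unset
            self.rows.append(dict.fromkeys(self.FIELDS))
        elif tag == 'span':
            for field in self.FIELDS:
                if attrs.get(f'data-ingredient-{field}') == 'true':
                    self._field = field

    def handle_endtag(self, tag):
        if tag == 'span':
            self._field = None

    def handle_data(self, data):
        # only text inside a recognized span is recorded
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

# made-up markup, loosely shaped like an ingredient list
snippet = '''
<ul>
  <li><p><span data-ingredient-quantity="true">2</span>
         <span data-ingredient-unit="true">cups</span>
         <span data-ingredient-name="true">green beans</span></p></li>
  <li><p><span data-ingredient-quantity="true"></span>
         <span data-ingredient-unit="true"></span>
         <span data-ingredient-name="true">salt to taste</span></p></li>
</ul>
'''

parser = IngredientParser()
parser.feed(snippet)
rows = parser.rows
print(rows)
```

Empty spans leave their field as `None`, which mirrors how the cells below treat a missing quantity or unit.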

In [None]:
# title
title = html.find('h1.article-heading', first=True).text
print(title)

In [None]:
# ingredients
ingredients = []
for item in html.find('li.mntl-structured-ingredients__list-item > p'):
  # CSS attribute selectors take no '@' prefix (that is XPath syntax)
  quantity = item.find('span[data-ingredient-quantity="true"]', first=True)
  if quantity is not None:
    # keep as text: quantities can be fractions such as '¾', which int() rejects
    quantity = quantity.text

  unit = item.find('span[data-ingredient-unit="true"]', first=True)
  if unit is not None:
    unit = unit.text

  name = item.find('span[data-ingredient-name="true"]', first=True)
  if name is not None:
    name = name.text

  ingredients.append(dict(
      quantity = quantity,
      unit = unit,
      name = name,
  ))

ingredients = pd.DataFrame(ingredients)
display(ingredients)
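
If numeric quantities are needed later, note that values such as `¾` or `1/2` can't be passed to `int()`. Below is a minimal sketch of a more forgiving converter. It assumes quantities arrive as whitespace-separated tokens (whole numbers, decimals, `a/b` fractions, or single Unicode fraction characters); `parse_quantity` is a hypothetical helper, and the exact formats used on the site are not verified here.

```python
from fractions import Fraction
import unicodedata

def parse_quantity(text):
    """Convert strings like '2', '10.5', '1/2', '¾' or '1 ½' to a float.

    Returns None for empty or unparseable input.
    """
    if text is None or not text.strip():
        return None
    total = 0.0
    for token in text.split():
        try:
            if len(token) == 1:
                # a single character: a digit or a vulgar fraction such as '¾'
                total += unicodedata.numeric(token)
            else:
                # several characters: integer, decimal, or 'a/b'
                total += float(Fraction(token))
        except (ValueError, ZeroDivisionError):
            return None
    return total

for sample in ['2', '1/2', '¾', '1 ½', '', 'a pinch']:
    print(repr(sample), '->', parse_quantity(sample))
```

A token that glues a digit to a fraction character without a space (for example `1½`) is not handled by this sketch and yields `None`.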

In [None]:
# nutrition facts summary
facts_summary = {}

for tr in html.find('tr.mntl-nutrition-facts-summary__table-row'):
  cells = tr.find('td')
  if len(cells) == 2:
    value, label = [cell.text for cell in cells]
    facts_summary[label] = value

print(facts_summary)

In [None]:
# steps
steps = []
for li in html.find('#recipe__steps-content_1-0 > ol > li'):
  step_paras = [p.text for p in li.find('li > p')]
  step_text = '\n'.join(step_paras)
  steps.append(step_text)

for i, step in enumerate(steps):
  print(i, step, sep='\t')

### Parsing embedded JSON data

Some of the things we want, like the comments at the bottom of the page, are not initially in the HTML structure of the page; rather, they're included in JSON format as part of a `<script>` element in the header.

Here we extract a dictionary of recipe attributes from that element:

In [None]:
# get JSON data stored in the header
script = html.find('#allrecipes-schema_1-0', first=True)
data = json.loads(script.text)
data = data[0]

Now we can pull out the data we need using keys to the dictionary.
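
The chained `.get()` calls used below return `None` instead of raising `KeyError` when a key is absent. A minimal sketch of that behavior, using a made-up payload (the values are invented; the key names are the ones this notebook reads):

```python
import json

# made-up JSON text: a list holding one recipe dictionary
raw = '''
[{"name": "Example casserole",
  "aggregateRating": {"ratingValue": "4.6", "ratingCount": "120"}}]
'''

data = json.loads(raw)[0]

# present keys: walk down one level at a time
rating = data.get('aggregateRating', {}).get('ratingValue')
rating = float(rating) if rating is not None else None

# absent keys: the default ({} or []) keeps the chain from failing
calories = data.get('nutrition', {}).get('calories')
reviews = data.get('review', [])

print(rating, calories, reviews)
```

Supplying `{}` or `[]` as the default matters: a bare `data.get('review')` would return `None`, and looping over `None` raises `TypeError`.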

In [None]:
# number of ratings
n_ratings = data.get('aggregateRating', {}).get('ratingCount')

# average ratings
rating = data.get('aggregateRating', {}).get('ratingValue')
if rating is not None:
  rating = float(rating)

In [None]:
# comments
comments = pd.DataFrame(dict(
    rating = comment.get('reviewRating', {}).get('ratingValue'),
    name = comment.get('author', {}).get('name'),
    text = comment.get('reviewBody'),
) for comment in data['review'])
display(comments)

In [None]:
# full nutrition info
nutrition = {}
for k, v in data['nutrition'].items():
  if k.startswith('@'):
    continue
  if k.endswith('Content'):
    k = k[:-7]
  nutrition[k] = v

print(nutrition)

# Part II

Now that we've tested each individual element of the page that we might want to extract, let's synthesize all the steps above in one custom function.
- It takes as an argument the path to a recipe.
- It downloads the recipe page and extracts all the data we want.
- It returns a dictionary with the extracted data.
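
One consequence of returning plain dictionaries and lists, rather than pandas objects, is that the result can be serialized as JSON, which matters once results are cached to disk. A small sketch with a made-up recipe (every value here is invented for illustration):

```python
import json

# made-up result with the same kinds of values the function returns
recipe = {
    'title': 'Example casserole',
    'ingredients': [
        {'quantity': '2', 'unit': 'cups', 'name': 'green beans'},
        {'quantity': None, 'unit': None, 'name': 'salt to taste'},
    ],
    'steps': ['Preheat the oven.', 'Bake until bubbly.'],
    'n_ratings': '120',
    'rating': 4.6,
}

text = json.dumps(recipe)      # dicts, lists, str, float and None all serialize
restored = json.loads(text)    # and come back unchanged
print(restored == recipe)
```

A `DataFrame` stored in the dictionary would make `json.dumps` raise `TypeError`, so tabular views are better built afterwards, for example with `pd.DataFrame(recipe['ingredients'])`.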

In [None]:
def ripRecipe(path):
    '''Rip a single recipe from AllRecipes and return a dictionary'''

    # request the page from AllRecipes
    response = session.get(base_url+path)

    # check for HTTP errors
    if not response.ok:
        response.raise_for_status()

    # get just the HTML content of the response
    html = response.html

    #
    # extract data from the HTML
    #

    # title
    title = html.find('h1.article-heading', first=True).text


    # ingredients
    ingredients = []
    for item in html.find('li.mntl-structured-ingredients__list-item > p'):
        # CSS attribute selectors take no '@' prefix (that is XPath syntax)
        quantity = item.find('span[data-ingredient-quantity="true"]', first=True)
        if quantity is not None:
            quantity = quantity.text

        unit = item.find('span[data-ingredient-unit="true"]', first=True)
        if unit is not None:
            unit = unit.text

        name = item.find('span[data-ingredient-name="true"]', first=True)
        if name is not None:
            name = name.text

        ingredients.append(dict(
            quantity = quantity,
            unit = unit,
            name = name,
        ))

    # nutrition facts summary
    facts_summary = {}

    for tr in html.find('tr.mntl-nutrition-facts-summary__table-row'):
        cells = tr.find('td')
        if len(cells) == 2:
            value, label = [cell.text for cell in cells]
            facts_summary[label] = value

    # steps
    steps = []
    for li in html.find('#recipe__steps-content_1-0 > ol > li'):
        step_paras = [p.text for p in li.find('li > p')]
        step_text = '\n'.join(step_paras)
        steps.append(step_text)

    #
    # get additional attributes from JSON data in the header
    #

    # extract the JSON data
    script = html.find('#allrecipes-schema_1-0', first=True)
    data = json.loads(script.text)
    data = data[0]

    # number of ratings
    n_ratings = data.get('aggregateRating', {}).get('ratingCount')

    # average ratings
    rating = data.get('aggregateRating', {}).get('ratingValue')
    if rating is not None:
        rating = float(rating)

    # comments
    comments = [dict(
        rating = comment.get('reviewRating', {}).get('ratingValue'),
        name = comment.get('author', {}).get('name'),
        text = comment.get('reviewBody'),
    ) for comment in data.get('review', [])]

    # full nutrition info
    nutrition = {}
    for k, v in data.get('nutrition', {}).items():
        if k.startswith('@'):
            continue
        if k.endswith('Content'):
            k = k[:-7]
        # store every non-'@' key, not only those ending in 'Content'
        nutrition[k] = v

    #
    # package everything as one dictionary
    #

    recipe = dict(
        title = title,
        ingredients = ingredients,
        facts_summary = facts_summary,
        steps = steps,
        n_ratings = n_ratings,
        rating = rating,
        comments = comments,
        nutrition = nutrition,
    )

    return recipe


### Testing it out

Let's try the new function on a different recipe.

In [None]:
path = '275305/green-beans-almondine/'
recipe = ripRecipe(path)
print(recipe)

### Try loading a list of recipes in batch mode

In [None]:
# download a list page instead of a single recipe

list_url = 'https://www.allrecipes.com/9x13-thanksgiving-casseroles-8401120'
response = session.get(list_url)
if not response.ok:
    response.raise_for_status()

html = response.html

In [None]:
# extract the links to featured recipes from the list

paths = []

for link in html.find('a.mntl-sc-block-featuredlink__link'):
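    # default to '' so links without an href don't raise AttributeError
    href = link.attrs.get('href', '')
    if href.startswith(base_url):
        # keep only the part after the common prefix
        paths.append(href[len(base_url):])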

print('\n'.join(paths))

### Precautions in downloading recipes in batch

- Cache results locally so that we don't have to re-download recipes we already have
- Add a randomly timed pause in between downloads, to seem less robotic
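
The two precautions can be sketched together as follows. This is an illustration under stated assumptions, not the notebook's own code: `load_cache`, `save_cache`, and `download_all` are hypothetical helpers, `fake_fetch` stands in for the real download function so that nothing touches the network, and the pauses are set to zero for the demo. Unlike the cells below, this variant writes the cache after every download and skips a recipe that raises an error, so that one failure doesn't discard earlier results.

```python
import json
import os
import random
import tempfile
import time

def load_cache(cache_file):
    """Return the cached dictionary, or an empty one if no file exists yet."""
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    return {}

def save_cache(cache, cache_file):
    with open(cache_file, 'w') as f:
        json.dump(cache, f)

def download_all(paths, fetch, cache_file, min_pause=2, max_pause=10):
    """Fetch every path not already cached, pausing randomly between requests."""
    cache = load_cache(cache_file)
    for path in paths:
        if path in cache:
            continue                      # already have it: no request, no pause
        try:
            cache[path] = fetch(path)
        except Exception as err:
            print(f'skipping {path}: {err}')
            continue
        save_cache(cache, cache_file)     # persist progress immediately
        time.sleep(random.uniform(min_pause, max_pause))
    return cache

# demo with a stand-in for the real download function
calls = []
def fake_fetch(path):
    calls.append(path)
    return {'title': path.upper()}

demo_file = os.path.join(tempfile.mkdtemp(), 'cache.json')
first = download_all(['a/', 'b/'], fake_fetch, demo_file, min_pause=0, max_pause=0)
second = download_all(['a/', 'b/', 'c/'], fake_fetch, demo_file, min_pause=0, max_pause=0)
print(calls)
```

On the second call only `c/` is fetched, because `a/` and `b/` are read back from the cache file.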

In [None]:
# use a cache
cache_file = 'cache.json'

if os.path.exists(cache_file):
    with open(cache_file) as f:
        cache = json.load(f)
else:
    cache = {}

In [None]:
# download all the recipes

for i, path in enumerate(paths):
    print(f'[{i+1}/{len(paths)}] {path}', end=' ... ')

    # check cache before downloading
    if path in cache:
        print('cached')
        continue

    # download this recipe and save to cache
    cache[path] = ripRecipe(path)
    print('downloaded')

    # pause
    time.sleep(random.randint(2,10))

In [None]:
# save cache to file
print(f'writing {cache_file}')

with open(cache_file, 'w') as f:
    json.dump(cache, f)