# DS3000 Day 9

Oct 11 2022

Admin:
- due Tonight before midnight
    - hw3
    - Project Proposals
        - Check Canvas for an example Data Analysis Plan
- lab on Friday (no lecture; maybe visitor)
- hw4 will be posted by the end of today, due next Tuesday before midnight

Content:
- Continue Intro to Web Scraping/DS Pipelines

### Previously on DS 3000

## Beginning the pipeline

**Goal:** Get a list of recipe names from www.allrecipes.com like we did for:

https://www.allrecipes.com/search?q=cheese+fondue

1. Write function `crawl_recipes(query)` which:
    * takes the search phrase (the ingredient) as input argument
    * builds the correct url that leads directly to the page that lists the recipes
    * uses `requests` to get the content of this page and returns the html text of the page
1. Write `extract_recipes(text)` which:
    * takes the text returned by `crawl_recipes` as argument
    * builds a BeautifulSoup object out of that text 
    * finds names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns the list of recipe names
    

A new function that will help if you wish to query multiple words:

`string.replace()`

So, if you wish to turn `cheese fondue` into `cheese+fondue`:

`string = 'cheese fondue'`

`string.replace(" ", "+")`
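
As a hedged alternative sketch, the standard library's `urllib.parse.quote_plus` performs the same space-to-plus replacement and also escapes other characters that are not safe inside a URL. The helper name `build_search_url` is hypothetical and is not part of the lecture code.

```python
from urllib.parse import quote_plus


def build_search_url(query):
    """Build an allrecipes.com search url for a (possibly multi-word) query."""
    # quote_plus turns spaces into '+' and percent-encodes characters such as '&'
    return f'https://www.allrecipes.com/search?q={quote_plus(query)}'


print(build_search_url('cheese fondue'))
print(build_search_url('mac & cheese'))
```

For plain multi-word queries this gives the same url as `.replace(" ", "+")`; the difference only shows up when the query contains special characters.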

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import json
from datetime import datetime
import pandas as pd
import plotly
import plotly.express as px
from bs4 import BeautifulSoup

In [None]:
def crawl_recipes(query):
    """ gets html from allrecipes.com for a search query
    
    Args:
        query (str): search string
        
    Returns:
        html_str (str): html response from allrecipes.com
    """
    
    query = query.replace(" ", "+")
    url = f'https://www.allrecipes.com/search?q={query}'
    return requests.get(url).text

def extract_recipes(text):
    """ builds list of recipe names from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        recipe_list (list): list of recipes
    """
    # build soup object from text
    soup = BeautifulSoup(text)
    
    
    recipe_list = []
    for recipe in soup.find_all(class_='card__title-text'):
        # extract / store recipe
        recipe = recipe.text
        recipe_list.append(recipe)
    return recipe_list
    

In [None]:
cheesefondue_html = crawl_recipes('cheese fondue')
recipe_list = extract_recipes(cheesefondue_html)
recipe_list

In [None]:
meatloaf_html = crawl_recipes('meatloaf')
new_recipe_list = extract_recipes(meatloaf_html)

In [None]:
new_recipe_list

## Getting info from each recipe's own page:

When we interact with the webpage in the browser, clicking on the header with the recipe name leads us to the actual recipe. Let's have a look at how it's done. Here are the links (`<a>` tags) for the first and third cards of the meatloaf search:

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663943" 
   data-ordinal="1" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/" 
   id="mntl-card-list-items_1-0">
```

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663443" 
   data-ordinal="3" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/223381/melt-in-your-mouth-meat-loaf/" 
   id="mntl-card-list-items_1-0-2">
```



In [None]:
meatloaf_html = crawl_recipes('meatloaf')
soup = BeautifulSoup(meatloaf_html)

In [None]:
# get a single recipe with link
recipe = soup.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card card--no-image')[0]

In [None]:
recipe

`BeautifulSoup` exposes a tag's attributes as a dictionary:

In [None]:
recipe.attrs

In [None]:
recipe.attrs['href']
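
To experiment with tag attributes without requesting the page again, one option is to parse a literal string. The sketch below uses a trimmed copy of the first meatloaf card shown above; `'html.parser'` is Python's built-in parser, chosen here so no extra install is assumed.

```python
from bs4 import BeautifulSoup

# a trimmed copy of the first meatloaf card shown above (no network needed)
html = '''
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image"
   data-doc-id="6663943"
   href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/"
   id="mntl-card-list-items_1-0">
</a>
'''

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('a')

# multi-valued attributes such as class come back as a list of strings
print(tag.attrs['class'][:2])
# single-valued attributes come back as plain strings
print(tag.attrs['href'])
# .get() returns None instead of raising KeyError when an attribute is missing
print(tag.get('title'))
```

Using `.get()` rather than `['...']` can be handy when not every card is guaranteed to carry the attribute.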

# Adding `href` to our dataframe of recipes

Let's modify our `extract_recipes()` function so that, rather than returning just the names of the dishes, it returns a data frame with a `name` column and an `href` column:

## `from_dict`

First, a useful tool: `pd.DataFrame.from_dict()` turns a dictionary into a data frame. The keys become the features (columns), and each value is a list holding that feature's entries (one per row):

In [None]:
example_dict = {'col1': [1,2,3,4,5],
                'col2': [6,7,8,9,10],
                'col3': ['who', 'what', 'when', 'where', 'why']}
pd.DataFrame.from_dict(example_dict)
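
If the data arrives one record at a time instead (one dictionary per row), the `pd.DataFrame` constructor also accepts a list of dictionaries. A small sketch with made-up placeholder values:

```python
import pandas as pd

# one dictionary per record (made-up placeholder values)
record_list = [{'name': 'recipe a', 'href': 'https://example.com/a'},
               {'name': 'recipe b', 'href': 'https://example.com/b'}]

# each dictionary becomes a row, each key becomes a column
df = pd.DataFrame(record_list)
print(df.shape)
print(list(df.columns))
```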

In [None]:
def extract_recipes(text):
    """ builds dataframe of recipe names and links from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        df_recipe (pd.DataFrame): dataframe of recipes
    """
    # build soup object from text
    soup = BeautifulSoup(text)
    
    recipe_list = []
    for recipe in soup.find_all(class_='card__title-text'):
        # extract / store recipe
        recipe_name = recipe.text
        recipe_list.append(recipe_name)

    href_list = []
    for recipe in soup.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card card--no-image'):
        # grab the link from each recipe
        recipe_link = recipe.attrs['href']
        href_list.append(recipe_link)
        
        
    # bundle as a dictionary (then use from_dict)
    recipe_dict = {'name': recipe_list,
                   'href': href_list}
    df_recipe = pd.DataFrame.from_dict(recipe_dict)
        
    return df_recipe

In [None]:
extract_recipes(meatloaf_html)

## String Manipulations
- `.split()` & `.join()`
- `.strip()`
- `.replace()`
- `.upper()` & `.lower()`

Visiting [a specific recipe's page](https://www.allrecipes.com/recipe/219171/classic-meatloaf/) yields data stored in a string.  The methods above allow us to extract this information.

In [None]:
# .strip removes all leading and trailing whitespace (spaces and newlines)
'\n\n\n hello!      \n    hello! \n\n    \n \n'.strip()

In [None]:
# we saw .replace last class:
'cheese fondue'.replace(' ', '+')

In [None]:
"hello fred".replace("fred", "george")

In [None]:
# can use replace to delete parts of the string
'lets forget about it, okay?'.replace(' it', '')

In [None]:
# capitalize everything
'dont shout!'.upper()

In [None]:
# lowercase everything
'BE QuieT'.lower()

In [None]:
# split will split a string on every occurrence of the given string (',' below)
'fat: 54 g, calories: 430 cal, sugar: 10g'.split(',')

In [None]:
# put disparate strings into a single string, glued together by some other string
'<glue>'.join(['a', 'b', 'c', 'd'])

In [None]:
''.join(['a', 'b', 'c', 'd'])

In [None]:
name_list = 'last0, first0, last1, first1, last2, first2'.split(',')

','.join(name_list[:2])

In [None]:
','.join(name_list[2:4]).strip()
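
Combining `.split()` and `.strip()` is enough to turn a delimited string into a dictionary. A minimal sketch, reusing the example string from the `.split()` cell above:

```python
nutr_str = 'fat: 54 g, calories: 430 cal, sugar: 10g'

nutr_dict = dict()
for item in nutr_str.split(','):
    # 'fat: 54 g' -> ['fat', ' 54 g']
    key, val = item.split(':')
    # strip removes the leftover spaces around each piece
    nutr_dict[key.strip()] = val.strip()

print(nutr_dict)
```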

In [None]:
# visit specific recipe's page
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
html = requests.get(url).text
soup = BeautifulSoup(html)

In [None]:
soup

In [None]:
# get prep info from 'mntl-recipe-details__content'
info_str = soup.find_all(class_='mntl-recipe-details__content')[0].text.strip().replace('\n', ' ')
info_str

As a string, this isn't very useful; we'd like to transform it into a dictionary:

```python
prep_info_dict = {'Prep Time': '10 mins',
                  'Cook Time': '15 mins',
                  'Total Time': '25 mins',
                  'Servings': '10',
                  'Yield': '10 servings'}
```

In [None]:
# getting nutrition information
# after some crawling we can find the labels here
soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')

In [None]:
# and the values can be found using the .next_sibling attribute
soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')[0].next_sibling

In [None]:
# getting nutrition information
nutr_dict = dict()
nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
for fact in nutr_list:
    nutr_dict[fact.text] = fact.next_sibling.strip()
    
nutr_dict
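
To see what `.next_sibling` does in isolation, here is a sketch on a tiny hand-written snippet. The markup is a simplified assumption (a label `<span>` followed directly by a bare text node), not a copy of the real page, and the class name `name` is made up.

```python
from bs4 import BeautifulSoup

# simplified, made-up markup: a label <span> followed directly by a bare text node
html = ('<td><span class="name">Total Fat</span> 14g </td>'
        '<td><span class="name">Sodium</span> 179mg </td>')
soup = BeautifulSoup(html, 'html.parser')

nutr_dict = dict()
for label in soup.find_all('span', class_='name'):
    # next_sibling is the node that follows the tag inside the same parent;
    # here it is a text node, so string methods like .strip() are available
    nutr_dict[label.text] = label.next_sibling.strip()

print(nutr_dict)
```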

## Lecture Break/Practice
Write two functions, `extract_prep_info()` and `extract_nutrition()`, which each accept the url of a particular recipe (see examples above) and return a dictionary of the prep info or nutritional information, respectively. For example:

```python
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
extract_prep_info(url)
extract_nutrition(url)

```

yields:

```python
prep_info_dict = {'Prep Time': '10 mins',
                  'Cook Time': '15 mins',
                  'Total Time': '25 mins',
                  'Servings': '10',
                  'Yield': '10 servings'}

```

and

```python
nutr_info_dict = {'Total Fat': '14g',
                  'Saturated Fat': '9g',
                  'Cholesterol': '46mg',
                  'Sodium': '179mg',
                  'Total Carbohydrate': '3g',
                  'Total Sugars': '1g',
                  'Protein': '13g',
                  'Vitamin C': '0mg',
                  'Calcium': '461mg',
                  'Iron': '0mg',
                  'Potassium': '67mg'}

```

In [None]:
info_str.split("   ")[0].split(':')

In [None]:
def extract_prep_info(url):
    """ returns a dictionary of recipe preparation info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        prep_info_dict (dict): keys are features ('prep'), 
            vals are str that describe feature ('20 mins')
    """
    html = requests.get(url).text
    soup = BeautifulSoup(html)
    
    prep_str = soup.find_all(class_='mntl-recipe-details__content')[0].text.strip().replace('\n', ' ')
    prep_dict = dict()
    
    for line in prep_str.split('   '):
        line_list = line.split(':')
        prep_dict[line_list[0].strip()] = line_list[1].strip()
    
    return prep_dict
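
Splitting on exactly three spaces depends on how the page happens to be formatted. As a hedged sketch, the string-parsing step could be made a little more tolerant by splitting on any run of two or more whitespace characters. The helper name `parse_prep_str` and the shape of the example string are assumptions; the real scraped string may differ.

```python
import re


def parse_prep_str(prep_str):
    """Turn 'Prep Time: 10 mins   Cook Time: 15 mins' into a dictionary."""
    prep_dict = dict()
    # split wherever two or more whitespace characters separate the entries
    for line in re.split(r'\s{2,}', prep_str.strip()):
        # partition splits on the first ':' only; sep is '' if no ':' is found
        key, sep, val = line.partition(':')
        if sep:
            prep_dict[key.strip()] = val.strip()
    return prep_dict


# assumed shape of the scraped string (the real page may differ)
example = 'Prep Time: 10 mins   Cook Time: 15 mins    Total Time: 25 mins'
print(parse_prep_str(example))
```

Entries without a colon are skipped rather than raising an `IndexError`.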

In [None]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are str of quantity ('24 g')
    """
    html = requests.get(url).text
    soup = BeautifulSoup(html)
    
    nutr_dict = dict()
    nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
    for fact in nutr_list:
        nutr_dict[fact.text] = fact.next_sibling.strip()
    
    return nutr_dict

In [None]:
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
extract_prep_info(url)

In [None]:
extract_nutrition(url)

### Grabbing numeric values (float/int) from messy strings

- We have strings which describe recipe nutrition info (`'100 mg'`)
- We want numeric data types (`float, int`) so that we can plot and operate on these values

In [None]:
# float from string
float('123')

In [None]:
# potential problem when dealing with a full string: replacing g also modifies sugar
nutr_val = 'sugars: 40 g'
nutr_val.replace('g', '')

In [None]:
# endswith is a method of strings.  allows us to test if a string ends with another string
s = 'youll never guess whats last'
s.endswith('t')

In [None]:
# startswith does the same for the beginning of the string
s = 'hello asdf!'
s.startswith('hello')

In [None]:
# removing the unit in the example above
nutr_val = 'sugars: 40 g'

if nutr_val.endswith('g'):
    # reset nutr_val to exclude the last character (the unit)
    nutr_val = nutr_val[:-1]

In [None]:
nutr_val

In [None]:
# removing the unit in the example above (programmatically)
nutr_val = 'sugars: 40 g'
s_remove = 'g'
if nutr_val.endswith(s_remove):
    nutr_val = nutr_val[:-len(s_remove)]

In [None]:
nutr_val

In [None]:
# removing many units in a loop
nutr_val = 'sugars: 40 Grams'
for s_rm in ['Grams', 'mg', 'g']:
    if nutr_val.endswith(s_rm):
        nutr_val = nutr_val[:-len(s_rm)]

nutr_val.strip()
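
The loop above can be wrapped in a small helper that goes all the way to a `float`. This is a sketch; `strip_unit` is a hypothetical name. Longer units are listed first so that `'mg'` is tried before `'g'`, and the `break` ensures at most one unit is removed.

```python
def strip_unit(val_str, unit_list=('Grams', 'mg', 'g')):
    """Remove one trailing unit from a string like '40 g' and return a float."""
    val_str = val_str.strip()
    for unit in unit_list:
        if val_str.endswith(unit):
            # drop the unit and stop, so at most one unit is removed
            val_str = val_str[:-len(unit)]
            break
    # float() ignores leading and trailing whitespace
    return float(val_str)


print(strip_unit('40 g'))
print(strip_unit('179mg'))
print(strip_unit(' 40 Grams '))
```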

## Rest of Class (Go slowly; if we don't finish, we can continue next week)
Complete the `extract_nutrition()` function below such that:

```python
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

for row_idx in range(df_recipe.shape[0]):
    # get / extract nutrition info for a particular recipe
    recipe_url = df_recipe.loc[row_idx, 'href']
    nutr_dict = extract_nutrition(recipe_url)
    
    # add each new nutrition feature to the dataframe
    # only if there ARE nutrition features
    if len(nutr_dict) != 0:
        for nutr_feat, nutr_val in nutr_dict.items():
            df_recipe.loc[row_idx, nutr_feat] = nutr_val
    else:
        df_recipe = df_recipe.drop(row_idx, axis=0)

```

generates the `df_recipe`:

|    | name                           | href                                              | Total Fat | Saturated Fat | Cholesterol | Sodium | Total Carbohydrate | Dietary Fiber | Total Sugars | Protein | Vitamin C | Calcium | Iron | Potassium |
|----|--------------------------------|---------------------------------------------------|-----------|---------------|-------------|--------|--------------------|---------------|--------------|---------|-----------|---------|------|-----------|
| 0  | Chef John's Boston Cream Pie   | https://www.allrecipes.com/recipe/220942/chef-... | 41        | 17            | 199         | 514    | 72                 | 2             | 46           | 10      | 0         | 168     | 2    | 230       |
| 1  | Boston Cream Pie               | https://www.allrecipes.com/recipe/8138/boston-... | 13        | 6             | 61          | 230    | 47                 | 1             | 34           | 5       | 0         | 101     | 2    | 134       |
| 2  | Boston Cream Pie I             | https://www.allrecipes.com/recipe/8137/boston-... | 15        | 9             | 94          | 223    | 43                 | 1             | 26           | 5       | 0         | 97      | 2    | 95        |
| 3  | Semi-Homemade Boston Cream Pie | https://www.allrecipes.com/recipe/278930/semi-... | 41        | 16            | 219         | 568    | 79                 | 3             | 53           | 11      | 0         | 186     | 3    | 194       |
| 9  | Hot Milk Sponge Cake II        | https://www.allrecipes.com/recipe/8159/hot-mil... | 3         | 2             | 52          | 231    | 34                 | 0             | 20           | 4       | NaN       | 61      | 2    | 60        |
| 17 | Boston Cream Dessert Cups      | https://www.allrecipes.com/recipe/213446/bosto... | 15        | 7             | 44          | 237    | 32                 | 0             | 22           | 3       | 0         | 41      | 1    | 101       |
| 19 | Boston Creme Mini-Cupcakes     | https://www.allrecipes.com/recipe/220809/bosto... | 12        | 4             | 32          | 253    | 34                 | 0             | 24           | 3       | 0         | 62      | 1    | 100       |

In [None]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are floats of quantity ('24 g' = 24)
    """
    pass

In [None]:
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

In [None]:
df_recipe

In [None]:
url = 'https://www.allrecipes.com/recipe/220942/chef-johns-boston-cream-pie/'

# get soup from url
html = requests.get(url).text
soup = BeautifulSoup(html)

nutr_dict = dict()
nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
for fact in nutr_list:
    nutr_feat = fact.next_sibling.strip()
    # strip units
    for str_rm in ['mg', 'g']:
        if nutr_feat.endswith(str_rm):
            nutr_feat = nutr_feat[:-len(str_rm)]
            
    nutr_dict[fact.text] = float(nutr_feat)
    
nutr_dict

Some recipes will not have nutrition facts:

In [None]:
url2 = 'https://www.allrecipes.com/gallery/most-popular-dessert-from-each-state/'

# get soup from url
html2 = requests.get(url2).text
soup2 = BeautifulSoup(html2)

nutr_dict2 = dict()
nutr_list2 = soup2.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
for fact in nutr_list2:
    nutr_feat = fact.next_sibling.strip()
    # strip units
    for str_rm in ['mg', 'g']:
        if nutr_feat.endswith(str_rm):
            nutr_feat = nutr_feat[:-len(str_rm)]
            
    nutr_dict2[fact.text] = float(nutr_feat)
    
nutr_dict2

In [None]:
len(nutr_dict2)

In [None]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are floats of quantity ('24 g' = 24)
    """

    html = requests.get(url).text
    soup = BeautifulSoup(html)

    nutr_dict = dict()
    nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
    for fact in nutr_list:
        nutr_feat = fact.next_sibling.strip()
        # strip units
        for str_rm in ['mg', 'g']:
            if nutr_feat.endswith(str_rm):
                nutr_feat = nutr_feat[:-len(str_rm)]
            
        nutr_dict[fact.text] = float(nutr_feat)
    
    return nutr_dict

In [None]:
extract_nutrition(url)

In [None]:
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

for row_idx in range(df_recipe.shape[0]):
    # get / extract nutrition info for a particular recipe
    recipe_url = df_recipe.loc[row_idx, 'href']
    nutr_dict = extract_nutrition(recipe_url)
    
    # add each new nutrition feature to the dataframe
    # only if there ARE nutrition features
    if len(nutr_dict) != 0:
        for nutr_feat, nutr_val in nutr_dict.items():
            df_recipe.loc[row_idx, nutr_feat] = nutr_val
    else:
        df_recipe = df_recipe.drop(row_idx, axis=0)

In [None]:
df_recipe
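
An alternative pattern, sketched here under stated assumptions, is to collect one dictionary per recipe and build the data frame once at the end, rather than adding cells and dropping rows inside the loop. So that the sketch runs offline, `fake_extract_nutrition` is a hypothetical stand-in for `extract_nutrition()` with made-up values and urls.

```python
import pandas as pd


def fake_extract_nutrition(url):
    """Hypothetical offline stand-in for extract_nutrition(); made-up values."""
    fake_pages = {'https://example.com/a': {'Total Fat': 14.0, 'Sodium': 179.0},
                  'https://example.com/b': {}}
    return fake_pages[url]


df_recipe = pd.DataFrame({'name': ['recipe a', 'recipe b'],
                          'href': ['https://example.com/a', 'https://example.com/b']})

row_list = []
for _, row in df_recipe.iterrows():
    nutr_dict = fake_extract_nutrition(row['href'])
    if len(nutr_dict) == 0:
        # skip recipes without nutrition facts
        continue
    # merge name / href with the nutrition features into one record
    row_list.append({'name': row['name'], 'href': row['href'], **nutr_dict})

df_nutr = pd.DataFrame(row_list)
print(df_nutr)
```

Note that this version renumbers the surviving rows from 0, whereas dropping rows in place keeps the original row labels.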

## Putting it all together
- get data frame of recipes, with name and href (done!)
- get dictionary of nutrition info per recipe (done!)
- aggregating info into dataframe (see below)
- scatter plot (up next)

In [None]:
def get_df_recipe(str_query, recipe_limit=None):
    """ searches for recipes and returns dataframe, with nutrition info
    
    Args:
        str_query (str): search string
        recipe_limit (int): if passed, limits recipe (helpful
            to speed up nutrition scraping for teaching!)
        
    Returns:
        df_recipe (pd.DataFrame): dataframe, each row is recipe.
            includes columns href, name, and nutrition facts
    """    
    # get / extract a data frame of recipes (only name and href)
    html_str = crawl_recipes(str_query)
    df_recipe = extract_recipes(html_str)
    
    if recipe_limit is not None:
        # discard all but first few recipes
        df_recipe = df_recipe.iloc[:recipe_limit, :].copy()

    for row_idx in range(df_recipe.shape[0]):
        # get / extract nutrition info for a particular recipe
        recipe_url = df_recipe.loc[row_idx, 'href']
        nutr_dict = extract_nutrition(recipe_url)
        
        # add each new nutrition feature to the dataframe
        # only if there ARE nutrition features
        if len(nutr_dict) != 0:
            for nutr_feat, nutr_val in nutr_dict.items():
                df_recipe.loc[row_idx, nutr_feat] = nutr_val
        else:
            df_recipe = df_recipe.drop(row_idx, axis=0)

    return df_recipe

In [None]:
query_list = ['pickles', 'truffles', 'peanut butter']

big_df_recipe = pd.DataFrame()
for str_query in query_list:
    # get recipes
    df_recipe_query = get_df_recipe(str_query)
    
    # record the query used to search for these recipes & aggregate
    df_recipe_query['query'] = str_query
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    big_df_recipe = pd.concat([big_df_recipe, df_recipe_query])

In [None]:
big_df_recipe
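
Another way to aggregate per-query data frames, sketched with made-up stand-in data, is to collect them in a list and call `pd.concat` once. Passing `ignore_index=True` gives the combined frame a fresh index, which avoids repeated row labels across queries.

```python
import pandas as pd

# made-up stand-ins for the per-query data frames returned by get_df_recipe()
df_by_query = {
    'pickles': pd.DataFrame({'name': ['recipe a'], 'Calcium': [10.0]}),
    'truffles': pd.DataFrame({'name': ['recipe b', 'recipe c'],
                              'Calcium': [20.0, 30.0]})}

df_list = []
for str_query, df_query in df_by_query.items():
    # record which query produced these rows
    df_query['query'] = str_query
    df_list.append(df_query)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
big_df = pd.concat(df_list, ignore_index=True)
print(big_df)
```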

In [None]:
import plotly.express as px

px.scatter(data_frame=big_df_recipe, x='Calcium', y='Potassium', color='query', hover_data=['name'])