# 010 Gathering Data - Web Scraping

**Author**: Andrew Yew Chean Yang <br>
**Date**: 2023-04-11

A table of contents is available in JupyterLab.

# Project Introduction

Given supply disruptions caused by recent global events such as the COVID-19 pandemic and the war in Ukraine, Canadian food prices rose at an alarming rate of [11% year on year in 2022](https://www150.statcan.gc.ca/n1/pub/62f0014m/62f0014m2022014-eng.htm), putting pressure on the average consumer’s budget. This, combined with the benefits of [home cooking](https://food-guide.canada.ca/en/healthy-eating-recommendations/cook-more-often/), has undoubtedly led many busy working Canadians to cook more at home. However, with a busy work life, many adults need a way to ensure that whatever they choose to cook is worth the precious time and effort after work. The recipe classifier seeks to address this issue for busy working adults by classifying recipes as worth the time and effort or not, based on the different elements available in online food recipes.

# Notebook Introduction

This notebook details the process of using the requests and BeautifulSoup packages to gather recipe data from [allrecipes.com](https://www.allrecipes.com). A total of 40,001 recipes were scraped using the following steps:

Step 1: Gather initial list of recipes
- As there was no single index that contained the uniform resource locator (url) of all recipes, one of the landing pages of allrecipes.com was chosen to initialize the recipe gathering process.

Step 2: Scrape individual recipe webpages
- Each of the webpages gathered from step 1, be it recipe webpages or landing pages, was scraped using requests.

Step 3: Soupify response
- The responses were converted into BeautifulSoup objects to gather specific parts of the response as some parts are not useful to the project, such as formatting information.

Step 4: Store recipe contents into DataFrame
- The scraped data was stored in a DataFrame, with 1 row representing 1 recipe

Step 5: Extract links to other recipes
- One feature of allrecipes.com webpages is that they contain links to other recipes. These links were extracted to find new recipes to scrape.

Repeat steps 2-5 until sufficient data is collected.
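The loop described by the steps above can be sketched as a breadth-first crawl. This is a minimal illustration only, not the code used in this notebook (which works in batches per round): the function `crawl`, its parameters, and the toy link graph are all hypothetical, and the toy graph stands in for live requests to allrecipes.com.

```python
from collections import deque


def crawl(seed_urls, fetch_links, is_recipe, target_count):
    """Breadth-first crawl that stops once target_count recipe urls are found."""
    to_visit = deque(seed_urls)  # urls waiting to be scraped
    seen = set(seed_urls)        # every url already queued, to avoid repeats
    recipes = []                 # recipe urls in order of discovery

    while to_visit and len(recipes) < target_count:
        url = to_visit.popleft()
        if is_recipe(url):
            recipes.append(url)        # steps 2-4: scrape and store the recipe
        for link in fetch_links(url):  # step 5: follow links to other pages
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return recipes


# Toy link graph standing in for the live site (hypothetical urls)
toy_site = {
    "landing": ["recipe/1", "recipe/2"],
    "recipe/1": ["recipe/2", "recipe/3"],
    "recipe/2": ["recipe/1"],
    "recipe/3": [],
}

found = crawl(
    seed_urls=["landing"],
    fetch_links=lambda url: toy_site.get(url, []),
    is_recipe=lambda url: url.startswith("recipe/"),
    target_count=10,
)
print(found)
```

The `seen` set plays the same role as the duplicate checks performed with DataFrames later in this notebook: no url is requested twice.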

## Disclaimer
As the entire web scraping process depends on the live webpages at allrecipes.com and the recipes recommended within each recipe, the data collection process is time dependent and may produce different results on each scrape. The data for this study was collected over approximately 100 hours of scraping across two weeks in the month of February. This notebook is a tidier recreation of the original notebook, with truncated lists to simulate the original run. To repeat the original run, remove the truncation filters from the iterations that scrape the 40,001 recipes.

# Import Required Packages

In [114]:
# for array and data processing
import numpy as np
import pandas as pd
import re

# for sending and receiving url requests
import requests

# for cleaning up received http
from bs4 import BeautifulSoup

# for timing functions and diagnostics
import time

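# for exiting the program
import sys
# for saving objects to disk as pickle files
import joblib
# for parsing scraped dictionary-like text into Python objects
import ast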

# Define Functions and Global Settings

In [21]:
# Pandas option to display recipe urls in full for visual inspection
pd.set_option('display.max_colwidth', None)

In [65]:
def send_request(url_list):
    '''
    This function will send and receive requests for each uniform resource locator(url) in a list.
    
    **Usage**
    This function was used to gather even more recipe urls given a list of recipe urls.
    
    **Input**
    url_list : a list containing urls to send and receive requests for
    
    **Output**
    response_list : a list of gathered responses
    '''
    
    # perform assertions for early error detection due to data type
    assert isinstance(url_list, list), "url_list should be a list of urls"
    for url in url_list:
        assert isinstance(url, str), "url should be a string"
    
    # initiate empty response list
    response_list = []
    
    # send and get a response for each url in list
    for index, url in enumerate(url_list):
        # Initiate timer
        start = time.perf_counter()
        
        # Set cooldown time between requests
        time.sleep(0.1)
        
        # Send a GET request to gather response
        response = requests.get(url = url,
                                allow_redirects = True)
        
        # End timer
        end = time.perf_counter()
        
        # Statement to deal with unsuccessful responses (responses that are not 200)
        if response.status_code != 200:
            print(f"Problem url at index: {index} for link: {url}") 
            print(f"Response code: {response.status_code}")
            print(f"Response reason: {response.reason}")
            print(f"Response encoding: {response.encoding}")
            # Stop the run; the bare exit() helper is intended for interactive sessions only
            sys.exit()
        else:
            # Append the text from each response into response_list
            response_list.append(response.text)
        
        # Print statement to check on progress through url_list
        print(f"{index+1} of {len(url_list)} done, time taken: {np.round(end-start)} seconds", end='\r')
    
    # return the response_list
    return response_list        

In [4]:
def convert_response(response_list):
    '''
    This function takes the output from the send_request function and converts each response text into a BeautifulSoup object
    
    **Usage**
    This function was used to soupify the output from the send_request function
    
    **Input**
    response_list: a list of response_texts
    
    **Output**
    soup_list: a list of converted(soupified) responses
    '''
    #Perform assertions for early error detection due to data type
    assert isinstance(response_list, list), "response_list should be a list of http responses"
    for response in response_list:
        assert isinstance(response, str), "response should be a string"
        
    # Initiate blank list to store soupified response text
    soup_list = []
    
    # Iterate through each response text in list
    for response in response_list:
        soup = BeautifulSoup(response, 'html.parser')
        soup_list.append(soup)
    
    # Return soupified responses
    return soup_list

In [5]:
def extract_link(soup_list, url_list, search_term):
    '''
    This function extracts recipe_urls from soupified responses. Specifically, 1 source recipe url may contain multiple recipe urls. This function converts the one to many relationship into a long table.
    
    **Usage**
    This function was used to gather more recipe_urls to further scrape more data.
    
    **Input**
    soup_list: The output from function 'convert_response'. A list of soupified responses.
    url_list: A list of urls from which the responses were gathered. The same url list used in the function 'send_request'.
    search_term: A string to identify recipe or specific types of urls from the soupified object.
    
    **Output**
    extracted_url_df: A DataFrame containing extracted urls for further scraping along with the original urls they were scraped from.
    '''
    
    # Perform assertions for early error detection in length of lists
    assert len(soup_list) == len(url_list)
    
    # Initiate a blank DataFrame to store extracted links
    extracted_url_df = pd.DataFrame()
    
    # Iterate through each soupified response
    for index, soup in enumerate(soup_list):
        
        # Initiate a blank list to store extracted urls
        extracted_url_list = []
        
        # Iterate through all url objects in the soupified response
        for link in soup.find_all('a'):
            
            # Append the href value of each link to the blank list
            extracted_url_list.append(link.get('href'))
        
        # Convert the url list into a DataFrame
        temp_extracted_url_df = pd.DataFrame(extracted_url_list, columns = ["extracted_url"])
        
        # Keep only absolute urls that contain the specified search term
        # (na = False treats links without an href as non-matches instead of raising an error)
        temp_extracted_url_df = temp_extracted_url_df[(temp_extracted_url_df['extracted_url'].str.contains('http', na = False))
                                                      &(temp_extracted_url_df['extracted_url'].str.contains(search_term, na = False))
                                                     ].reset_index(drop = True)
        
        # Add the source url for back tracking if needed
        temp_extracted_url_df['source_url'] = url_list[index]
        
        # Concatenate the extracted urls back to DataFrame
        extracted_url_df = pd.concat([extracted_url_df, temp_extracted_url_df]).reset_index(drop = True)
    
    # Return all extracted DataFrame
    return extracted_url_df

# Initialize the Web Scrape

As there was no single index containing a complete list of recipe urls, an initial list of recipes was used to kick-start the web scraping process. From this initial list, more recipes can be gathered because each recipe page contains promotional links to other recipes.

To achieve this, the landing page of Allrecipes ingredients was chosen for the initial scrape: [Ingredients A-Z](https://www.allrecipes.com/ingredients-a-z-6740416) 

In [6]:
# Store landing page for Allrecipes ingredients in variable
url_list = ["https://www.allrecipes.com/ingredients-a-z-6740416"]

In [7]:
# Use predefined function to send a GET request to the url
response_list = send_request(url_list)

1 of 1 done, time taken: 3.0 seconds

In [8]:
# Use predefined function to soupify received response
soup_list = convert_response(response_list)

In [10]:
# Use predefined function to extract urls that contain recipes
# The search term "/recipes/" was used to identify urls that act as landing pages for more recipes
extracted_url_df = extract_link(soup_list, url_list, "/recipes/")

In [15]:
# DataFrame shape from initial scrape
print(f"Shape from initial scrape: {extracted_url_df.shape}.")

Shape from initial scrape: (168, 2).


In [13]:
# Checking for duplicated urls
print(f"Number of duplicated urls: {extracted_url_df.duplicated().sum()}.")

Number of duplicated urls: 56.


In [16]:
# Dropping duplicates to avoid web-scraping the same website twice
extracted_url_df.drop_duplicates(inplace = True)
print(f"Shape after dropping duplicates: {extracted_url_df.shape}.")

Shape after dropping duplicates: (112, 2).


In [20]:
# Examine the recipes from the initial scrape
extracted_url_df.head()

|   | extracted_url | source_url |
|---|---------------|------------|
| 0 | https://www.allrecipes.com/recipes/17562/dinner/ | https://www.allrecipes.com/ingredients-a-z-6740416 |
| 1 | https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/ | https://www.allrecipes.com/ingredients-a-z-6740416 |
| 2 | https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/ | https://www.allrecipes.com/ingredients-a-z-6740416 |
| 3 | https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/ | https://www.allrecipes.com/ingredients-a-z-6740416 |
| 4 | https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/ | https://www.allrecipes.com/ingredients-a-z-6740416 |


The initial scrape of recipes from the ingredients landing page at allrecipes.com did not yield any recipe urls to scrape. Instead, over 100 sub-landing pages were found, each leading to a list of recipes to web scrape. As such, the initialization of the web-scraping process was successful.
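The distinction between landing pages and recipe pages rests on the search term: "/recipes/" (plural) for landing pages and "/recipe/" (singular) for recipes. The sketch below shows that the two terms select disjoint sets of urls. It uses plain substring checks rather than the pandas `str.contains` call in `extract_link`, and the helper `filter_links` and all urls except the first are hypothetical.

```python
def filter_links(hrefs, search_term):
    """Keep absolute links (containing 'http') that also contain search_term."""
    return [
        href for href in hrefs
        if href is not None and "http" in href and search_term in href
    ]


hrefs = [
    "https://www.allrecipes.com/recipes/17562/dinner/",       # landing page seen in the output above
    "https://www.allrecipes.com/recipe/12345/example-dish/",  # hypothetical recipe url
    "/recipes/relative-link/",                                # hypothetical relative link, no 'http'
    None,                                                     # a link tag without an href
]

landing_pages = filter_links(hrefs, "/recipes/")
recipe_pages = filter_links(hrefs, "/recipe/")
print(landing_pages)
print(recipe_pages)
```

Because the trailing slash is part of each search term, "/recipe/" never matches inside "/recipes/", so a url is classified as one kind or the other, never both.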

# Gather List of Recipe URLs

Now that a list of landing pages to gather recipe urls from has been established, an iterative process was used to gather a list of recipe urls from those landing pages and other recipe pages. 

In [43]:
# Initialize a blank DataFrame to store recipes
recipe_url_df = pd.DataFrame()

Similar to the initial web scrape, the below process was repeated for each iteration of data gathering:
- send_request
- convert_response
- extract_link

In [22]:
# Define new url list using the extracted urls from the initial scrape
url_list = extracted_url_df['extracted_url'].to_list()

In [35]:
# Use predefined function to send a GET request to the url
response_list = send_request(url_list)

# Use predefined function to soupify received response
soup_list = convert_response(response_list)

# Use predefined function to extract urls that contain recipes
# The search term "/recipe/" (singular) was used to identify recipe urls
temp_df = extract_link(soup_list,url_list,"/recipe/")

112 of 112 done, time taken: 3.0 seconds

In [39]:
print(f"Number of recipe urls gathered: {temp_df.shape[0]}.")

Number of recipe urls gathered: 6377.


In [41]:
# Remove duplicated recipes to avoid scraping the same recipe twice
temp_df.drop_duplicates(subset = ['extracted_url'], inplace = True)
print(f"Number of unique recipe urls gathered: {temp_df.shape[0]}.")

Number of unique recipe urls gathered: 4618.


In [45]:
# Store the gathered recipe urls into recipe_url_df
recipe_url_df = pd.concat([recipe_url_df,temp_df],axis = 0).reset_index(drop=True)

In [48]:
print(f"Number of unique recipe urls gathered: {recipe_url_df.shape[0]}.")
display(recipe_url_df.head())

Number of unique recipe urls gathered: 4618.


|   | extracted_url | source_url |
|---|---------------|------------|
| 0 | https://www.allrecipes.com/recipe/83646/corned... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 1 | https://www.allrecipes.com/recipe/158799/stout... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 2 | https://www.allrecipes.com/recipe/8509102/chic... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 3 | https://www.allrecipes.com/recipe/8508920/miss... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 4 | https://www.allrecipes.com/recipe/255462/lasag... | https://www.allrecipes.com/recipes/17562/dinner/ |


Using the extracted landing pages from the initial scrape, about 4,600 recipe urls were gathered.

In [50]:
# From the same scrape, examine if any new landing pages were extracted
temp_df2 = extract_link(soup_list,url_list,"/recipes/")

print(f"Landing pages scraped: {temp_df2.shape[0]}.")

Landing pages scraped: 14310.


About 14,000 landing pages appears to be too many given that the process started with a single landing page, which suggests heavy duplication. Duplicate landing pages were therefore identified and removed.

In [51]:
# Remove duplicate landing pages
temp_df2.drop_duplicates(subset = ['extracted_url'], inplace = True)
print(f"Number of unique recipe landing pages gathered: {temp_df2.shape[0]}.")

Number of unique recipe landing pages gathered: 902.


Finally, for the next round of scraping, the above 900 urls were checked against previously scraped landing pages to avoid repetition.

In [58]:
# Define condition to check for previously scraped landing pages
cond1 = temp_df2['extracted_url'].isin(extracted_url_df['extracted_url'])
print(f"Number of previously scraped landing pages: {cond1.sum()}.")

# Apply condition to remove all duplicates
temp_df2 = temp_df2[~cond1]
print(f"Number of unique new recipe landing pages gathered: {temp_df2.shape[0]}.")

Number of previously scraped landing pages: 111.
Number of unique new recipe landing pages gathered: 791.


In [61]:
# Store unique new landing pages into a DataFrame
new_landing_pages = temp_df2.copy()

# Add in new landing pages into previous DataFrame of landing pages for reference
extracted_url_df = pd.concat([extracted_url_df,new_landing_pages], axis = 0).reset_index(drop = True)

Thus, to summarize the first round of gathering recipes, approximately 4,600 recipe urls and 800 new recipe landing pages were gathered. This meant that approximately 5,400 requests would be sent out in the next round.
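As a rough illustration of how the list for the next round is assembled, the sketch below combines landing pages and recipe urls, drops repeats while keeping order, and skips anything already scraped. The helper `build_next_round` and the short urls are hypothetical; the notebook itself concatenates two DataFrame columns. The final lines only check the request count from the figures reported above.

```python
def build_next_round(new_landing_pages, recipe_urls, already_scraped):
    """Combine both url lists, drop repeats (keeping order), skip scraped urls."""
    combined = list(dict.fromkeys(new_landing_pages + recipe_urls))
    return [url for url in combined if url not in already_scraped]


# Hypothetical short urls, for illustration only
next_urls = build_next_round(
    new_landing_pages=["recipes/b", "recipes/c"],
    recipe_urls=["recipe/1", "recipe/2", "recipe/1"],
    already_scraped={"recipes/a", "recipe/2"},
)
print(next_urls)

# Counts reported in this round: 791 new landing pages plus 4,618 recipe urls
requests_next_round = 791 + 4618
print(requests_next_round)
```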

# Gather List of Recipe URLs (2nd iteration)

In [64]:
# Initiate list of urls from which to scrape
url_list = new_landing_pages['extracted_url'].to_list() + recipe_url_df['extracted_url'].to_list()
print(f"Number of urls to scrape : {len(url_list)}")

Number of urls to scrape : 5409


In [66]:
# Use predefined function to send a GET request to the url
response_list = send_request(url_list)

# Use predefined function to soupify received response
soup_list = convert_response(response_list)

# Use predefined function to extract urls that contain recipes
# The search term "/recipe/" (singular) was used to identify recipe urls
temp_df = extract_link(soup_list,url_list,"/recipe/")

5409 of 5409 done, time taken: 1.0 seconds

In [67]:
print(f"Number of recipe urls gathered: {temp_df.shape[0]}.")

Number of recipe urls gathered: 114520.


In [68]:
# Remove duplicated recipes to avoid scraping the same recipe twice
temp_df.drop_duplicates(subset = ['extracted_url'], inplace = True)
print(f"Number of unique recipe urls gathered: {temp_df.shape[0]}.")

Number of unique recipe urls gathered: 28350.


In [70]:
# Define condition to check for previously scraped recipes
cond1 = temp_df['extracted_url'].isin(recipe_url_df['extracted_url'])
print(f"Number of previously scraped recipes: {cond1.sum()}.")

# Apply condition to remove all duplicates
temp_df = temp_df[~cond1]
print(f"Number of unique new recipe urls gathered: {temp_df.shape[0]}.")

Number of previously scraped recipes: 0.
Number of unique new recipe urls gathered: 24256.


In [71]:
# Store the gathered recipe urls into recipe_url_df
recipe_url_df = pd.concat([recipe_url_df,temp_df],axis = 0).reset_index(drop=True)

In [72]:
print(f"Number of unique recipe urls gathered: {recipe_url_df.shape[0]}.")
display(recipe_url_df.head())

Number of unique recipe urls gathered: 28874.


|   | extracted_url | source_url |
|---|---------------|------------|
| 0 | https://www.allrecipes.com/recipe/83646/corned... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 1 | https://www.allrecipes.com/recipe/158799/stout... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 2 | https://www.allrecipes.com/recipe/8509102/chic... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 3 | https://www.allrecipes.com/recipe/8508920/miss... | https://www.allrecipes.com/recipes/17562/dinner/ |
| 4 | https://www.allrecipes.com/recipe/255462/lasag... | https://www.allrecipes.com/recipes/17562/dinner/ |


In [73]:
# From the same scrape, examine if any new landing pages were extracted
temp_df2 = extract_link(soup_list,url_list,"/recipes/")

print(f"Landing pages scraped: {temp_df2.shape[0]}.")

Landing pages scraped: 571592.


In [74]:
# Remove duplicate landing pages
temp_df2.drop_duplicates(subset = ['extracted_url'], inplace = True)
print(f"Number of unique recipe landing pages gathered: {temp_df2.shape[0]}.")

Number of unique recipe landing pages gathered: 2290.


In [75]:
# Define condition to check for previously scraped landing pages
cond1 = temp_df2['extracted_url'].isin(extracted_url_df['extracted_url'])
print(f"Number of previously scraped landing pages: {cond1.sum()}.")

# Apply condition to remove all duplicates
temp_df2 = temp_df2[~cond1]
print(f"Number of unique new recipe landing pages gathered: {temp_df2.shape[0]}.")

Number of previously scraped landing pages: 901.
Number of unique new recipe landing pages gathered: 1389.


In [76]:
# Store unique new landing pages into a DataFrame
new_landing_pages = temp_df2.copy()

# Add in new landing pages into previous DataFrame of landing pages for reference
extracted_url_df = pd.concat([extracted_url_df,new_landing_pages], axis = 0).reset_index(drop = True)

In the second iteration, approximately 24,000 new recipe urls and 1,400 new landing pages were gathered. Although not shown in this notebook due to time constraints, the above steps were iterated one more time to arrive at the final dataset of 40,001 recipes.

In [93]:
# Save the column of recipe urls as a pickle file
joblib.dump(recipe_url_df['extracted_url'], 'data/recipe_url_df.pkl')

# Scrape Data from Recipe URLs

Now that a list of recipe urls has been gathered, the next step is to iterate through each recipe url and scrape the data of each recipe. To identify what data is available on each recipe webpage and how to scrape it, the 'inspect element' function of the Google Chrome web browser was used. Note that elements of a webpage can be inspected using a web browser of your choice.

The recipe webpage's elements were inspected to identify the tag names and attribute key-value pairs (such as `id` and `class`) that can be used with BeautifulSoup to extract the specific data. As most of a webpage's elements consist of design elements, this method of specifically targeting parts of the webpage to store reduces the size of the data gathered and keeps it as simple as possible.

The for loop below goes through each recipe url and extracts 16 elements from each webpage that were deemed useful at this stage of the project. Note that each element has a try and except clause to deal with cases where the specified element is not present in the webpage. Regular expressions were used with BeautifulSoup to extract specific elements of data.
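To make the two techniques above concrete, here is a small sketch using only the standard library. It first shows which attribute values the patterns used in the loop would accept, assuming BeautifulSoup applies a compiled pattern with `search`; the id values are hypothetical. It then shows a hypothetical helper, `safe_extract`, that captures the try and except idea in one place, with a plain dictionary standing in for a parsed page.

```python
import re

# Anchored pattern copied from the title extraction; "_*" means "zero or more underscores"
title_pattern = re.compile("^article-heading_*")

# Hypothetical id values, for illustration only
ids = [
    "article-heading_1-0",
    "article-heading",
    "article-subheading_1-0",
    "my-article-heading_2",
]
title_matches = [value for value in ids if title_pattern.search(value)]
print(title_matches)

# Unanchored pattern copied from the rating_average extraction:
# it also accepts the id prefix used for the rating count
rating_pattern = re.compile("mntl-recipe-review-bar__rating_*")
also_matches_count = bool(rating_pattern.search("mntl-recipe-review-bar__rating-count_1-0"))
print(also_matches_count)


def safe_extract(extractor, default=None):
    """Run extractor(); return default if the element is missing or malformed."""
    try:
        return extractor()
    except Exception:
        return default


# A lookup on a missing key stands in for an element that is absent from the page
page = {"title": "  Example Title \n"}
title = safe_extract(lambda: page["title"].strip(" \t\n\r"))
rating = safe_extract(lambda: float(page["rating"]))
print(title, rating)
```

Since `_*` also matches zero underscores, these patterns behave as prefix matches. For the unanchored rating pattern, the result therefore depends on which matching element appears first in the page.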

Finally, a pause was set between requests to avoid sending too many requests in quick succession, which may lead to allrecipes blocking the IP address of the local computer. As a result, the code below took approximately 83 hours to run in its entirety (40,000 recipe urls * 7.5 seconds per url = 300,000 seconds, or approximately 83 hours). The code below appears not to have been run in full because this is a tidied-up version of the original notebook used to scrape the data.
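The runtime estimate can be checked with a few lines of arithmetic. The 7.5 seconds per url is the figure quoted above; the breakdown into an average pause plus request time is an assumption based on the `np.random.randint(low = 5, high = 10)` call in the loop, which draws whole seconds from 5 to 9.

```python
# Figures quoted in the text above
n_urls = 40_000
seconds_per_url = 7.5

total_seconds = n_urls * seconds_per_url
total_hours = total_seconds / 3600
print(f"{total_seconds:,.0f} seconds is about {total_hours:.1f} hours")

# Average pause if whole seconds from 5 to 9 are equally likely
possible_pauses = range(5, 10)
mean_pause = sum(possible_pauses) / len(possible_pauses)

# Assumption: the rest of the 7.5 seconds is the request itself
implied_request_time = seconds_per_url - mean_pause
print(mean_pause, implied_request_time)
```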

| Column_number | Column_name            | Data Type  | Description                                                                        |
|---------------|------------------------|------------|------------------------------------------------------------------------------------|
| 1             | recipe_url             | string     | the url of the recipe                                                              |
| 2             | title                  | string     | the title of the recipe                                                            |
| 3             | image                  | list       | any image urls found within the recipe                                             |
| 4             | rating_average         | float      | average of ratings, the target feature                                             |
| 5             | rating_count           | string     | the number of ratings for the recipe                                               |
| 6             | review_count           | string     | the number of reviews for the recipe                                               |
| 7             | description            | string     | the description section beneath each title of the recipe                           |
| 8             | update_date            | string     | the last date of update for the recipe                                             |
| 9             | ingredient             | list       | a list of ingredients and their amounts                                            |
| 10            | direction              | list       | a list of cooking directions or instructions                                       |
| 11            | nutrition_summary      | dictionary | a dictionary of nutritional information summary                                    |
| 12            | nutrition_detail       | list       | a list of detailed nutritional information                                         |
| 13            | time                   | dictionary | a dictionary containing time related values in the recipe                          |
| 14            | label                  | list       | a list containing the labels or tags associated with the recipe                    |
| 15            | review_dict            | dictionary | a JSON dictionary object containing reviews and other data elements of the webpage |
| 16            | description_additional | list       | additional description if available for the recipe                                 |

In [109]:
# Truncate to the first 3 recipe urls for demonstration; remove the slice to scrape all recipes
url_list = recipe_url_df['extracted_url'][0:3].to_list()

In [117]:
# Store the number of recipe urls to be scraped into a variable for progress status printing
# Only the first 3 recipe urls were run for demonstration purposes
number_of_url = len(url_list)

# Initiate a blank DataFrame to store the scraped data
raw_data_df = pd.DataFrame()

# Iterate through each recipe url
for index, recipe_url in enumerate(url_list):
    
    # Send a get request for the recipe url and receive a response
    try:
        # Initiate timer
        start = time.perf_counter()
        
        # Use random integer to set cooldown time between scrapes
        time.sleep(np.random.randint(low = 5,high = 10))
        
        # Store the response in a variable
        response = requests.get(url = recipe_url, allow_redirects = True)
        
        # End timer
        end = time.perf_counter()
    
    # Except clause to deal with urls whose request failed
    # (no response object exists at this point, so the error itself is printed)
    except requests.exceptions.RequestException as error:
        print(f"Problem url at {index+1} of {number_of_url}: {recipe_url}")
        print(f"Request error: {error}")
        sys.exit()
    
    
    # Convert response to soup object
    try:
        soup = BeautifulSoup(response.text, 'html.parser')
    except Exception:
        print(f"soup_error at {index+1} of {number_of_url}")
        # Skip this url, since no soup object is available to parse
        continue
    
    
    # Initiate a blank dictionary to store values
    temp_dict = dict()
    
    # column 00: url, the url of the recipe
    try:
        temp_dict.update({"recipe_url": recipe_url})
    except Exception:
        temp_dict.update({"recipe_url": np.nan})
        print(f"url_error at {index+1} of {number_of_url}")
    
    
    # column 01: title, the title of the recipe
    try:
        temp_dict.update({"title":
                          soup.find("h1", {"id": re.compile("^article-heading_*")}).get_text().strip(' \t\n\r')
                         })
    except Exception:
        temp_dict.update({"title": np.nan})
        print(f"title_error at {index+1} of {number_of_url}")

        
    # column 02: image, any image urls found within the recipes
    try:
        # Search the article content once and reuse the image tags
        img_tags = soup.find("div", {"class": "loc article-content"}).find_all("img")
        # Keep non-empty src values and any lazy-loaded data-src values
        t_main_img = [img.get("src") for img in img_tags if img.get("src")]
        t_sub_img = [img.get("data-src") for img in img_tags if img.get("data-src") is not None]
        t_img = list(set(t_main_img + t_sub_img))
        temp_dict.update({"image": t_img})
    except Exception:
        temp_dict.update({"image": np.nan})
        print(f"image_error at {index+1} of {number_of_url}")

    
    # column 03: rating_average, the target feature
    try:
        temp_dict.update({"rating_average":
                          float(soup.find("div", {"id": re.compile("mntl-recipe-review-bar__rating_*")}).get_text().strip(' \t\n\r'))
                         })
    except Exception:
        temp_dict.update({"rating_average": np.nan})
        print(f"rating_average_error at {index+1} of {number_of_url}")

        
        
    # column 04: rating_count, the number of ratings for the recipe
    try:
        temp_dict.update({"rating_count":
                          soup.find("div", {"id": re.compile("^mntl-recipe-review-bar__rating-count_*")}).get_text().strip(' \t\n\r()')
                         })
    except Exception:
        temp_dict.update({"rating_count": np.nan})
        print(f"rating_count_error at {index+1} of {number_of_url}")

        
        
    # column 05: review_count, the number of reviews for the recipe
    try:
        temp_dict.update({"review_count":
                         soup.find("div", {"id": re.compile("^mntl-recipe-review-bar__comment-count_*")}).get_text().strip(' \t\n\r()')
                         })
    except Exception:
        temp_dict.update({"review_count": np.nan})
        print(f"review_count_error at {index+1} of {number_of_url}")

        
    # column 06: description, the description section beneath each title of the recipe
    try:
        temp_dict.update({"description":
                         soup.find("p", {"id" : re.compile("^article-subheading_*")}).get_text().strip(' \t\n\r')
                         })
    except Exception:
        temp_dict.update({"description": np.nan})
        print(f"description_error at {index+1} of {number_of_url}")
        
        
        
    # column 07: update_date, the last date of update for the recipe
    try:
        temp_dict.update({"update_date":
                         soup.find_all("div", {"class": re.compile("^mntl-attribution__item-date*")})[0].get_text()
                         })
    except Exception:
        temp_dict.update({"update_date": np.nan})
        print(f"update_date_error at {index+1} of {number_of_url}")
        
        
    
    # column 08: ingredient, a list of ingredients and their amounts
    try:
        temp_dict.update({"ingredient":
                         [li.get_text().strip(' \t\n\r') for li in soup.find("div", {"id": re.compile("^mntl-structured-ingredients_*")}).find_all("li")]
                         })
    except Exception:
        temp_dict.update({"ingredient": np.nan})
        print(f"ingredient_error at {index+1} of {number_of_url}")
        
        
        
    # column 09: direction, a list of cooking directions or instructions
    try:
        temp_dict.update({"direction":
                          [li.get_text().strip(' \t\n\r') for li in soup.find("div", {"id": re.compile("^recipe__steps-content_*")}).find_all("li")]
                         })
    except Exception:
        temp_dict.update({"direction": np.nan})
        print(f"direction_error at {index+1} of {number_of_url}")
        
        
        
    # column 10: nutrition_summary, a dictionary of nutritional information summary
    try:
        tag = soup.find("div", {"id": re.compile("^mntl-nutrition-facts-summary_*")})
        
        t_value = [line.get_text() for line in tag.find_all("td",{"class":"mntl-nutrition-facts-summary__table-cell type--dog-bold"})]
        header_1 = [line.get_text() for line in tag.find_all("td",{"class":"mntl-nutrition-facts-summary__table-cell type--dogg"})]
        header_2 = [line.get_text() for line in tag.find_all("td",{"class":"mntl-nutrition-facts-summary__table-cell type--dog"})]
        t_header = header_1+header_2
        
        temp_dict.update({"nutrition_summary":
                          {key:value for (key,value) in zip(t_header,t_value)}
                         })
    except Exception:
        temp_dict.update({"nutrition_summary": np.nan})
        print(f"nutrition_summary_error at {index+1} of {number_of_url}")
        
        
        
    # column 11: nutrition_detail, a dictionary of detailed nutritional information
    try:
        temp_dict.update({"nutrition_detail":
                          pd.read_html(str(soup.find_all("table",{"class": "mntl-nutrition-facts-label__table"})))[0]\
                          .iloc[:,0].to_list()
                         })
    except Exception:
        temp_dict.update({"nutrition_detail": np.nan})
        print(f"nutrition_detail_error at {index+1} of {number_of_url}")
    
    
    
    # column 12: time, a dictionary containing time related values in the recipe
    try:
        t_value = [div.get_text().strip(' \t\n\r') for div in soup.find("div", {"id": re.compile("^recipe-details_*")}).find_all("div", {"class":re.compile("^mntl-recipe-details__val*")})]
        t_header = [div.get_text().strip(' \t\n\r') for div in soup.find("div", {"id": re.compile("^recipe-details_*")}).find_all("div", {"class":re.compile("^mntl-recipe-details__la*")})]
        temp_dict.update({"time":
                          {key:value for (key,value) in zip(t_header,t_value)}
                         })
    except Exception:
        temp_dict.update({"time": np.nan})
        print(f"time_error at {index+1} of {number_of_url}")

        
        
    # column 13: label, a list containing the labels or tags associated with the recipe
    try:
        temp_dict.update({"label":
                        [label.get_text() for label in soup.find("div", {"class":re.compile("^loc article-header")}).find_all("span",{"class":"link__wrapper"})]
                         })
    except Exception:
        temp_dict.update({"label": np.nan})
        print(f"label_error at {index+1} of {number_of_url}")
        
        
        
    # column 14: review_dict, dictionary containing a JSON dictionary of reviews and other data elements of the webpage
    try:
        temp_dict.update({"review_dict":
                         ast.literal_eval(
                             soup.find('script',{"class":"comp allrecipes-schema mntl-schema-unified"}).text
                         )})
    except Exception:
        temp_dict.update({"review_dict": np.nan})
        print(f"review_dict_error at {index+1} of {number_of_url}")
    
    
    
    # column 15: description_additional, additional description if available for the recipe
    try:
        temp_dict.update({"description_additional":
                         [p.get_text().strip(' \t\n\r') for p in soup.find_all('p',{"class":re.compile("^mntl-sc-block*")})]
                         })
    except Exception:
        temp_dict.update({"description_additional": np.nan})
        print(f"description_additional_error at {index+1} of {number_of_url}")
    
    
    
    # Create a DataFrame with 1 row using the above data scraped into temp_dict
    temp_df = pd.DataFrame({k: pd.Series([v]) for k,v in temp_dict.items()})
    
    # Concatenate the DataFrame with 1 row with the raw_data_df
    raw_data_df = pd.concat([raw_data_df,temp_df],ignore_index= True, axis = 0)
    
    # Progress check
    print(f"{index+1} of {number_of_url} done, time taken: {np.round(end-start)} seconds.", end='\r')

3 of 3 done, time taken: 6.0 seconds.

In [None]:
# Save the scraped raw data as a pickle file
joblib.dump(raw_data_df, 'data/raw_data_df.pkl')

# Conclusion for Notebook 010

To summarize, this notebook goes through the process of:
- Gathering an initial list of urls to scrape from a single landing page.
- Iteratively using the gathered urls to scrape more recipe urls.
- Gathering data from each of the gathered recipe urls using BeautifulSoup and regular expressions.

Data for 40,001 recipes were gathered for the next notebook, which will detail the exploratory analysis and feature engineering of the data.