# Web Scraping for Data

## What are HTML and CSS?

[![Website](images/website.png)](https://www.homechef.com/our-menu)

*It's just glorified text.*
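
To make that concrete, here is a minimal sketch using only Python's standard library. The `sample_html` string is a made-up fragment (not Home Chef's real markup), and `TagLister` is a hypothetical helper name. It shows that a page is a string of tags, and that the `class` attributes CSS uses for styling are also something a scraper can use to find elements.

```python
from html.parser import HTMLParser

# A made-up page fragment for illustration; not the real Home Chef markup.
sample_html = """
<div class="card">
  <h1 class="title">Crispy Pepper Shrimp</h1>
  <p class="subtitle">with ponzu sauce</p>
</div>
"""


class TagLister(HTMLParser):
    """Record every opening tag with its class attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "title")]
        self.tags.append((tag, dict(attrs).get("class")))


lister = TagLister()
lister.feed(sample_html)

# Print each tag name next to the class it carries
for tag, css_class in lister.tags:
    print(tag, css_class)
```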

## Why learn how to web scrape?

![Items to scrape](images/items_to_scrape.png)

- To gather information that isn't already compiled in one place
- To build your own data sets
- To quickly extract similar data from several different sources
- To answer niche questions that publicly available data can't
- ... To feel like a cool hacker

## Grabbing a website's HTML text

In [1]:
# Splinter automates a real web browser, which lets us scrape pages interactively
from splinter import Browser

# webdriver_manager downloads a ChromeDriver that matches the installed Chrome
from webdriver_manager.chrome import ChromeDriverManager

# Initialize Chrome
executable_path = {"executable_path": ChromeDriverManager().install()}
browser = Browser("chrome", **executable_path, headless=False)

# Point Chrome to the website we want to visit (Home Chef Menu)
url = "https://www.homechef.com/our-menu"
browser.visit(url)

# Grab the HTML from the web page
html = browser.html

print(html)



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Driver [C:\Users\Charles Phil\.wdm\drivers\chromedriver\win32\100.0.4896.60\chromedriver.exe] found in cache


<html class="js no-ie svg localstorage svgfilters csspointerevents csspositionsticky progressbar meter srcset inlinesvg supports svgclippaths no-touchevents svgasimg csscolumns csscolumns-width csscolumns-span csscolumns-fill csscolumns-gap csscolumns-rule csscolumns-rulecolor csscolumns-rulestyle csscolumns-rulewidth csscolumns-breakbefore csscolumns-breakafter csscolumns-breakinside cssfilters flexbox flexboxlegacy no-flexboxtweener flexwrap wf--loaded" data-rails-env="production" lang="en" style="" data-useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36" data-platform="Win32"><!--<![endif]--><head>
<meta charset="utf-8">
<title>Meals for the Week of May 08 | Home Chef</title>

<meta content="Our weekly deliveries of fresh, perfectly-portioned ingredients have everything you need to prepare home-cooked meals in about 30 minutes." name="description">
<meta content="Our weekly deliveries of fresh, perfectly-po

That is pretty disgusting to look at. How do we get the data we need?

## Using BeautifulSoup to parse HTML

![soup](images/soup.webp)

![Target title](images/target_title.png)

In [2]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Parse the HTML from our automated Chrome browser using the lxml parser
soup = BeautifulSoup(html, "lxml")

# Target all of our meal titles (they all have the same class of "h6-refresh" attached to "h1" tags)
meals_html = soup.select("h1.h6-refresh")

# Print out the text of all of our meal titles
for meal in meals_html:
    print(meal.text.strip())

Apple Butter Demi Steak
Fried Mahi-Mahi and Lemon Garlic Aioli
Crispy Pepper Shrimp
Pecan-Crusted Chicken
Adobo Chicken Enchiladas
Acapulco-Style Beef Burger
Honey Mustard Pork Meatloaf
Chicken Taco Stuffed Peppers
Baked Italian Sausage Farfalle
Chipotle Cheddar Turkey Meatloaf
Mango Mostarda Pork Medallions
One-Pan Supreme Pizza Pasta Bake
Honey Teriyaki Fried Chicken Rice Bowl
Philly Cheesesteak Tacos
Buffalo-Style Turkey Meatballs
Spicy Dijonnaise Chicken
Hoisin BBQ Brisket
Chipotle Beef Tenderloin Penne
Sun-Dried Tomato Chicken Pasta
Carnitas Pork Tacos
Romesco-Style Salmon
Butter Cracker-Crusted Chicken
Honey Garlic Butter Shrimp
Italian Sausage Roasted Red Pepper Pasta
Buffalo Ranch Chopped Salad & Chicken
Margherita Pizza & Buffalo Ranch Chopped Salad
Apple Crisp Cake & Blueberry Lemon Butter Cake
Blueberry Lemon Butter Cake
Apple Pie Crisp
Chocolate Lava Cake
Three Cheese Asiago - Demi Loaf
Banana Bread Slices
Mini Sausage, Egg & Cheese Sandwich
Marionberry Greek Yogurt Cups
Mo
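
The selector syntax used above can be tried on a small hand-written fragment without opening a browser. The sketch below is illustrative only: `sample_html` is invented (it borrows the `h6-refresh` and `text-refresh` classes and the `meal` id from the cells in this notebook, but the real page is more complex), and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented fragment for illustration; class and id names are borrowed from the notebook.
sample_html = """
<div id="meal">
  <h1 class="h6-refresh"> Crispy Pepper Shrimp </h1>
  <p class="text-refresh">with ponzu sauce</p>
</div>
<div id="meal">
  <h1 class="h6-refresh"> Pecan-Crusted Chicken </h1>
  <p class="text-refresh">with BBQ-spiced carrots</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "tag.class" matches every <h1> whose class list includes "h6-refresh"
titles = [h1.text.strip() for h1 in sample_soup.select("h1.h6-refresh")]

# "#id" matches elements by id; select() returns a list of all matches
cards = sample_soup.select("#meal")

# select_one() returns only the first match (or None when nothing matches)
first_desc = sample_soup.select_one("p.text-refresh").text.strip()

print(titles)
print(len(cards))
print(first_desc)
```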

In [3]:
# Even better, we can target each meal card on the page and store a bunch of information

# Dictionary of empty lists to hold each scraped field (difficulty and spice_level are filled in later)
meal_info = {
    "name": [], 
    "description": [], 
    "time_required": [], 
    "allergy": [], 
    "difficulty": [], 
    "spice_level": []
}

# Each meal card is marked with the id "meal"
for card in soup.select("#meal"):
    # Create empty list to hold list of allergies from each card later
    meal_allergies = []
    # Target the information in the card we want to store
    meal_name = card.select_one("h1.h6-refresh").text.strip()
    meal_desc = card.select_one("p.text-refresh").text.strip()
    meal_time = card.select_one("li.textSm-refresh").text.strip()
    # Get allergy tags and combine into a single string
    for allergy in card.select("li.vertical-line-before"):
        meal_allergies.append(allergy.text.strip())
    # Remove any blank tags due to "Carb-Conscious" and "Calorie Conscious" tags
    meal_allergies = [allergy for allergy in meal_allergies if allergy]
    meal_allergies = ", ".join(meal_allergies)
    # Append our scraped information into the dictionary
    meal_info["name"].append(meal_name)
    meal_info["description"].append(meal_desc)
    meal_info["time_required"].append(meal_time)
    meal_info["allergy"].append(meal_allergies)

print(meal_info)

{'name': ['Apple Butter Demi Steak', 'Fried Mahi-Mahi and Lemon Garlic Aioli', 'Crispy Pepper Shrimp', 'Pecan-Crusted Chicken', 'Adobo Chicken Enchiladas', 'Acapulco-Style Beef Burger', 'Honey Mustard Pork Meatloaf', 'Chicken Taco Stuffed Peppers', 'Baked Italian Sausage Farfalle', 'Chipotle Cheddar Turkey Meatloaf', 'Mango Mostarda Pork Medallions', 'One-Pan Supreme Pizza Pasta Bake', 'Honey Teriyaki Fried Chicken Rice Bowl', 'Philly Cheesesteak Tacos', 'Buffalo-Style Turkey Meatballs', 'Spicy Dijonnaise Chicken', 'Hoisin BBQ Brisket', 'Chipotle Beef Tenderloin Penne', 'Sun-Dried Tomato Chicken Pasta', 'Carnitas Pork Tacos', 'Romesco-Style Salmon', 'Butter Cracker-Crusted Chicken', 'Honey Garlic Butter Shrimp', 'Italian Sausage Roasted Red Pepper Pasta', 'Buffalo Ranch Chopped Salad & Chicken', 'Margherita Pizza & Buffalo Ranch Chopped Salad', 'Apple Crisp Cake & Blueberry Lemon Butter Cake', 'Blueberry Lemon Butter Cake', 'Apple Pie Crisp', 'Chocolate Lava Cake', 'Three Cheese Asiago
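
One thing to watch for in a loop like the one above: `select_one` returns `None` when a card has no matching element, so calling `.text` on the result would raise an `AttributeError` and stop the scrape. A minimal sketch of one way to guard against that follows. The helper name `text_or_default` and the `sample_html` fragment are hypothetical, and the second card deliberately has no description.

```python
from bs4 import BeautifulSoup

# Invented fragment: the second card deliberately has no description.
sample_html = """
<div id="meal">
  <h1 class="h6-refresh">Crispy Pepper Shrimp</h1>
  <p class="text-refresh">with ponzu sauce</p>
</div>
<div id="meal">
  <h1 class="h6-refresh">Banana Bread Slices</h1>
</div>
"""


def text_or_default(parent, selector, default=""):
    """Return the stripped text of the first match, or `default` if there is none."""
    element = parent.select_one(selector)
    return element.text.strip() if element is not None else default


sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in sample_soup.select("#meal"):
    rows.append(
        {
            "name": text_or_default(card, "h1.h6-refresh"),
            "description": text_or_default(card, "p.text-refresh", default="(none)"),
        }
    )

print(rows)
```

Filling in a default keeps every list the same length, which matters later when the dictionary is turned into a table.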

## But wait, where's the cooking difficulty and spice level information?

In [4]:
# With Splinter, we can go one step further and click on each card for more information...
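for card in soup.select("#meal"):
    # Follow the card's link to open the meal's detail page
    browser.links.find_by_href(card.select_one("a")["href"]).click()
    # Parse the detail page's HTML
    html_inner = browser.html
    soup_inner = BeautifulSoup(html_inner, "lxml")
    # Difficulty and spice level are the third span of the third and fourth overview items
    overview_items = soup_inner.select("div.meal__overviewItem")
    difficulty = overview_items[2].select("span")[2].text.strip()
    spice_level = overview_items[3].select("span")[2].text.strip()
    # Store the values alongside the information scraped earlier
    meal_info["difficulty"].append(difficulty)
    meal_info["spice_level"].append(spice_level)
    # Print progress for each meal
    print(card.select_one("h1.h6-refresh").text.strip())
    print(difficulty)
    print(spice_level)
    print("-" * 10)
    # Return to the menu page before opening the next card
    browser.back()

# Close the browser once every card has been visited
browser.quit()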

Apple Butter Demi Steak
Intermediate
Not Spicy
----------
Fried Mahi-Mahi and Lemon Garlic Aioli
Expert
Not Spicy
----------
Crispy Pepper Shrimp
Intermediate
Mild
----------
Pecan-Crusted Chicken
Intermediate
Mild
----------
Adobo Chicken Enchiladas
Intermediate
Medium
----------
Acapulco-Style Beef Burger
Intermediate
Mild
----------
Honey Mustard Pork Meatloaf
Intermediate
Not Spicy
----------
Chicken Taco Stuffed Peppers
Intermediate
Spicy
----------
Baked Italian Sausage Farfalle
Intermediate
Not Spicy
----------
Chipotle Cheddar Turkey Meatloaf
Intermediate
Spicy
----------
Mango Mostarda Pork Medallions
Easy
Medium
----------
One-Pan Supreme Pizza Pasta Bake
Intermediate
Not Spicy
----------
Honey Teriyaki Fried Chicken Rice Bowl
Intermediate
Mild
----------
Philly Cheesesteak Tacos
Easy
Mild
----------
Buffalo-Style Turkey Meatballs
Easy
Medium
----------
Spicy Dijonnaise Chicken
Easy
Mild
----------
Hoisin BBQ Brisket
Easy
Mild
----------
Chipotle Beef Tenderloin Penne
Easy
Mi
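
An alternative to clicking into each card and pressing back is to collect the links first and visit each one directly, which avoids relying on the menu page being in the same state after `browser.back()`. The sketch below is one way this could look, not the approach used above. `collect_detail_pages`, `FakeBrowser`, the example paths, and the one-second default delay are all assumptions for illustration; the only browser features it relies on are `browser.visit(url)` and `browser.html`, both used earlier in this notebook. The stand-in browser lets the example run without Chrome.

```python
import time
from urllib.parse import urljoin


def collect_detail_pages(browser, base_url, hrefs, delay_seconds=1.0):
    """Visit each link directly and return the HTML of every detail page."""
    pages = []
    for href in hrefs:
        # urljoin handles both relative ("/meals/...") and absolute links
        browser.visit(urljoin(base_url, href))
        pages.append(browser.html)
        # Pause between page loads to avoid hammering the site
        time.sleep(delay_seconds)
    return pages


class FakeBrowser:
    """Stand-in that exposes visit() and .html like the browser used above."""

    def __init__(self):
        self.html = ""
        self.visited = []

    def visit(self, url):
        self.visited.append(url)
        self.html = f"<html><body>page for {url}</body></html>"


fake = FakeBrowser()
# Hypothetical paths; in the notebook they would come from card.select_one("a")["href"]
pages = collect_detail_pages(
    fake,
    "https://www.homechef.com/our-menu",
    ["/meals/example-a", "/meals/example-b"],
    delay_seconds=0,
)
print(fake.visited)
```

With the real browser in place of `FakeBrowser`, each returned page could be parsed with BeautifulSoup in the same way as `html_inner` above.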

## Bringing it all together with Pandas

![Pandas](images/pandas.svg)

## What is Pandas?

- Organizes data into tables called **DataFrames**, built from ordinary Python structures such as dictionaries of lists
- Imports and exports data in many formats, including CSV, Excel, JSON, and SQL
- Supports the usual tabular operations: filtering, sorting, grouping, and joining
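
As a quick illustration of those points, here is a small sketch that builds a DataFrame from a dictionary of lists and then filters and sorts it. The three rows are copied from the scraped results in this notebook; the variable names are arbitrary.

```python
import pandas as pd

# Three rows copied from the scraped results in this notebook
sample = {
    "name": ["Apple Butter Demi Steak", "Crispy Pepper Shrimp", "Adobo Chicken Enchiladas"],
    "time_required": ["35-45 Min", "25-35 Min", "45-55 Min"],
    "spice_level": ["Not Spicy", "Mild", "Medium"],
}
sample_df = pd.DataFrame(sample)

# Filter rows with a boolean condition
with_heat = sample_df[sample_df["spice_level"] != "Not Spicy"]

# Sort rows alphabetically by a column
by_name = sample_df.sort_values("name")

print(with_heat["name"].tolist())
print(by_name["name"].tolist())
```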

## Exporting our scraped data into a CSV file

In [8]:
import pandas as pd

# Construct a dataframe using our meal information
df = pd.DataFrame(meal_info)

df

|  | name | description | time_required | allergy | difficulty | spice_level |
|---|---|---|---|---|---|---|
| 0 | Apple Butter Demi Steak | with rosemary potatoes and bacon Asiago Brusse... | 35-45 Min | Milk | Intermediate | Not Spicy |
| 1 | Fried Mahi-Mahi and Lemon Garlic Aioli | with cheesy broccoli orzo | 50-60 Min | Fish (Mahi Mahi), Milk, Eggs, Wheat | Expert | Not Spicy |
| 2 | Crispy Pepper Shrimp | with ponzu sauce | 25-35 Min | Shellfish (Shrimp), Eggs, Wheat, Soy | Intermediate | Mild |
| 3 | Pecan-Crusted Chicken | with BBQ-spiced carrots | 30-40 Min | Tree Nuts (Pecans), Milk | Intermediate | Mild |
| 4 | Adobo Chicken Enchiladas | with jalapeño pepper and sour cream | 45-55 Min | Milk, Wheat | Intermediate | Medium |
| 5 | Acapulco-Style Beef Burger | with fresh pico de gallo and cilantro-lime fries | 45-55 Min | Milk, Eggs, Wheat | Intermediate | Mild |
| 6 | Honey Mustard Pork Meatloaf | with red cabbage and Brussels sprouts | 35-45 Min | Milk, Wheat | Intermediate | Not Spicy |
| 7 | Chicken Taco Stuffed Peppers | with pico de gallo and sour cream | 30-40 Min | Milk | Intermediate | Spicy |
| 8 | Baked Italian Sausage Farfalle | with zucchini and garlic bread | 30-40 Min | Milk, Wheat | Intermediate | Not Spicy |
| 9 | Chipotle Cheddar Turkey Meatloaf | with corn, poblano, and potato hash | 35-45 Min | Milk, Wheat | Intermediate | Spicy |


In [9]:
# Export to CSV file
df.to_csv("menu.csv", index=False)
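
The `index=False` argument is worth a note. The sketch below shows what happens with and without it, writing to in-memory buffers (`io.StringIO`) rather than files so that nothing is left on disk. The two-row `sample_df` is a cut-down stand-in for the full table.

```python
import io

import pandas as pd

# Cut-down stand-in for the full scraped table
sample_df = pd.DataFrame(
    {
        "name": ["Apple Butter Demi Steak", "Crispy Pepper Shrimp"],
        "spice_level": ["Not Spicy", "Mild"],
    }
)

# With the default index=True, the row index is written as an unnamed first column
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, only the real columns are written
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Reading back a CSV that was saved with the index produces an extra `Unnamed: 0` column, which is usually not wanted.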