<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

_Author: Joseph Nelson (DC)_

---

## Before Class

#### Install Selenium

Selenium is a browser automation tool. It drives a real browser (optionally in headless mode), so pages render JavaScript just as they would in a human-navigated browser.

To install Selenium, use one of the following:
- **Anaconda:** `conda install -c conda-forge selenium`
- **pip:** `pip install selenium`


#### Install GeckoDriver

You will also need GeckoDriver (this assumes you are using Homebrew for Mac): 

- ```brew install geckodriver```

#### Install Firefox

Additionally, you will need the [Firefox browser](https://www.mozilla.org/en-US/firefox/new/) installed for the application in this lesson.

## Learning Objectives
- Revisit how to locate elements on a webpage
- Acquire unstructured data from the internet using Beautiful Soup
- Discuss the limitations of the simple `requests` and `urllib` libraries
- Introduce Selenium as a solution, and implement a scraper using Selenium

## Lesson Guide

- [Introduction](#intro)
- [Building a web scraper](#building-scraper)
- [Retrieving data from the HTML page](#retrieving-data)
    - [Retrieving the restaurant names](#retrieving-names)
    - [Challenge: Retrieving the restaurant locations](#retrieving-locations)
    - [Retrieving the restaurant prices](#retrieving-prices)
    - [Retrieving the restaurant number of bookings](#retrieving-bookings)


- [Introducing Selenium](#selenium)
    - [Running JavaScript before scraping](#selenium-js)
    - [Using regex to only get digits](#selenium-regex)
    - [Challenge: Use Pandas to create a DataFrame of bookings](#challenge-pandas)
    - [Auto-typing using Selenium](#selenium-typing)


- [Summary](#summary)

<a id="intro"></a>
## Introduction

In this codealong lesson, we'll build a web scraper using Beautiful Soup. We will also explore Selenium, a tool that automates a real browser.

We'll begin by scraping Resy's DC listings. We're interested in knowing the restaurant's **name, neighborhood, price, and star ratings.**

Resy provides all of this information on this page: https://resy.com/cities/dc?seats=2&date=2022-08-02

---

<a id="retrieving-data"></a>
### Retrieving data from the HTML page

Let's first find each restaurant name listed on the page we've loaded. Where on the page does the restaurant name live? (Hint: We need to know where in the **HTML** the restaurant element is housed.) To find the HTML that renders the restaurant name, we can use Google Chrome's Inspect tool:

> https://resy.com/cities/dc?seats=2&date=2022-08-02

> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [4]:
import requests

result = requests.get("https://resy.com/cities/dc?seats=2&date=2022-08-02")

In [8]:
result.text

'<!doctype html><html ng-app="resyApp" ng-strict-di class="no-js"><head><base href="/"><title ng-bind="metadata.title">Resy | Right This Way</title><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1,user-scalable=yes"/><meta name="description" content="Discover restaurants to love in your city and beyond. Get the latest restaurant intel and explore Resyâ\x80\x99s curated guides to find the right spot for any occasion. Book your table now through the Resy iOS app or Resy.com." ng-attr-content="{{metadata.description}}"/><meta name="prerender-status-code" content="200" ng-attr-content="{{prerender.responseCode}}"/><meta ng-if="prerender.location" name="prerender-header" ng-attr-content="{{prerender.location}}"/><meta name="robots" ng-attr-content="{{metadata.robots}}" content="index"/><link rel="canonical" ng-attr-href="{{ metadata.canonical | resySce: \'resourceUrl\' }}"/><link rel="shortcut icon" href="../images/favicon.ico?v=5" type="image/x-icon"

In [9]:
from bs4 import BeautifulSoup

In [10]:
# name the parser explicitly so Beautiful Soup does not have to guess (and warn)
soup = BeautifulSoup(result.text, "lxml")

In [12]:
soup.find_all("div", {"class": "SearchResult__title--container"})

[]
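
The empty list is the key observation here. A plausible reading, given the `ng-app="resyApp"` attribute in the response, is that the restaurant cards are built by JavaScript after the page loads, so they never appear as tags in the raw HTML that `requests` downloads. The sketch below reproduces the effect offline with a made-up page (`static_html` is an illustrative stand-in, not Resy's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what a plain HTTP request gets back from a
# JavaScript-driven page: an empty container plus a script that fills it later.
static_html = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML =
      '<div class="SearchResult__title--container">Chicken + Whiskey</div>';
  </script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# The restaurant name exists only as text inside the <script>, not as a parsed
# tag, because nothing has executed the JavaScript.
matches = soup.find_all("div", {"class": "SearchResult__title--container"})
print(matches)
```

Beautiful Soup only parses the markup it is given. It never runs scripts, which is why a tool that drives a real browser is needed for pages like this one.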

<a id="selenium"></a>
## Introducing Selenium

Selenium is a browser automation tool. It drives a real browser, so pages render JavaScript just as they would in a human-navigated browser.

In [13]:
# import
from bs4 import BeautifulSoup
from selenium import webdriver

Selenium requires us to choose a browser to drive. I'm going to opt for Firefox, but Chrome/Chromium is also a very common choice. See the FAQ: http://selenium-python.readthedocs.io/faq.html

In [None]:
# STOP
# what is going to happen when I run the next cell?

In [20]:
# create a driver called Firefox
driver = webdriver.Firefox()

Pretty crazy, right? Let's close that driver.

In [21]:
# close it
driver.close()
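
The window that pops up is a full Firefox instance under Selenium's control. If you would rather not see a window, Firefox can be started in headless mode. A minimal sketch, assuming Selenium 4 and GeckoDriver are installed; `open_headless_firefox` is a name made up for this example:

```python
def open_headless_firefox():
    """Start Firefox without a visible window and return the driver."""
    # Imported inside the function so this cell can be defined even on a
    # machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox's command-line flag for headless mode
    return webdriver.Firefox(options=options)


# Usage (starts a real browser process, so it is left commented out here):
# driver = open_headless_firefox()
# driver.get("http://www.python.org")
# print(driver.title)
# driver.quit()  # quit() ends the whole session; close() only closes the current window
```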

In [22]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Firefox()
driver.get("http://www.python.org")

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

In [26]:
# visit our Resy page
driver = webdriver.Firefox()
driver.get("https://resy.com/cities/dc?seats=2&date=2022-08-02")

# always good to check we've got the page we think we do
assert "Resy" in driver.title

In [27]:
driver.title

'Discover Washington D.C. Restaurants to Love on Resy'

In [28]:
driver.close()

In [30]:
# import sleep
from time import sleep

In [31]:
# visit our relevant page
driver = webdriver.Firefox()
driver.get("https://resy.com/cities/dc?seats=2&date=2022-08-02")

# wait one second
sleep(1)

# grab the page source
html = driver.page_source
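
A fixed `sleep(1)` is a guess: too short on a slow connection, wasted time on a fast one. Selenium's explicit waits poll until a condition holds or a timeout expires. A sketch, assuming Selenium 4; `fetch_rendered_html` and the ten-second default are choices made for this example:

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in Firefox and return the page source once css_selector is present."""
    # Imported inside the function so the definition loads without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Poll until at least one matching element is in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()


# Usage (opens a real browser, so it is left commented out here):
# html = fetch_rendered_html(
#     "https://resy.com/cities/dc?seats=2&date=2022-08-02",
#     "div.SearchResult__title--container",
# )
```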

**Pop Quiz:** What do we need to do with this HTML?

In [32]:
# BeautifulSoup it!
html = BeautifulSoup(html, "lxml")

In [33]:
html

<html class="no-js" lang="en" ng-app="resyApp" ng-strict-di=""><head><style>@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide:not(.ng-hide-animate){display:none !important;}ng\:form{display:block;}.ng-animate-shim{visibility:hidden;}.ng-anchor{position:absolute;}</style><meta content="default-src 'self' 'unsafe-inline' 'unsafe-eval' blob: data: *.betrad.com *.evidon.com *.evidon.com 37g8q83dpternslae3eh1f8t-wpengine.netdna-ssl.com annotation.resy.com api.braintreegateway.com api.ipdata.co api.resy.com api.resyos.com assets.braintreegateway.com blog.resy.com client-analytics.braintreegateway.com dw5imgvvi8wn1.cloudfront.net gct.americanexpress.com icm.aexp-static.com image.resy.com js.braintreegateway.com js.stripe.com khms0.googleapis.com maps.googleapis.com maps.gstatic.com resy.com resynetworkinc.applytojob.com s3.amazonaws.com widgets.resy.com www.aexp-static.com www.google.com www.gstatic.com 189445.fls.doubleclick.net ad.doubleclic

It is important to always keep in mind the data types that are returned. When we call `find_all()` on this object below, the result is a `list` (a Beautiful Soup `ResultSet`), and we know that immediately by observing the outer square brackets and commas separating each tag.

Next, note the elements of the list are `Tag` objects, not strings. (If they were strings, they would be surrounded by quotes.) The Beautiful Soup authors chose to display a `Tag` object visually as a text representation of the tag and its contents. However, being an object, it has many attributes and methods that we can use. For example, next we will use the `.text` attribute to return only the text inside the tag, without the surrounding HTML.
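
To make the types concrete without a browser, here is a small offline sketch. The `sample` markup is a simplified, hand-written imitation of two result cards, not the page's full HTML:

```python
from bs4 import BeautifulSoup

# Simplified, hand-written imitation of two restaurant title containers.
sample = """
<div class="SearchResult__title--container"><a href="cities/dc/chicken-and-whiskey"><h3>Chicken + Whiskey</h3></a></div>
<div class="SearchResult__title--container"><a href="cities/dc/oyster-oyster"><h3>Oyster Oyster</h3></a></div>
"""

sample_soup = BeautifulSoup(sample, "html.parser")
tags = sample_soup.find_all("div", {"class": "SearchResult__title--container"})

# find_all returns a ResultSet, which behaves like (and subclasses) a list
print(isinstance(tags, list))

# each element is a Tag object, not a string
first = tags[0]
print(type(first).__name__)

# .text keeps only the text inside the tag, dropping the markup
print(first.text)

# attributes of nested tags are available with dictionary-style access
print(first.a["href"])
```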

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

Once we have found a list of tags containing the restaurant names, we can loop through them one by one. In the following cells, we'll first look at the raw list, then print out the name (and **only** the clean name, not the rest of the HTML) of each restaurant.

In [34]:
html.find_all('div', {'class': "SearchResult__title--container"})

[<div class="SearchResult__title--container"><a class="Link SearchResult__container-link" data-test-id="search-result-link-details" href="cities/dc/chicken-and-whiskey?date=2022-08-02&amp;seats=2" tabindex="0"><h3 class="SearchResult__venue-name">Chicken + Whiskey</h3></a><div aria-checked="false" aria-label="Add Chicken + Whiskey to favorites" class="FavoriteButton FavoriteButton--off" role="checkbox" tabindex="0" variant="no-style"><div aria-live="polite" role="alert"></div><div class="IconHeart IconHeart--off"><i class="ResyIcon ResyIcon--heart"><svg height="17px" viewbox="-1 0 20 16" width="20px"><path d="m7.45512031 1.23104366c-1.62586449-1.64045822-4.24363991-1.64021083-5.86913275-.00012762l-.15715361.1585642c-1.90481563 1.92191327-1.90497285 5.05656284-.00130571 6.97731725l7.55545343 7.62327111c.01313023.013248.02000841.0132358.03312668-.0000002l7.55545325-7.62327091c1.9049054-1.92200365 1.9046986-5.05420481-.0013054-6.97731706l-.1571539-.15856442c-1.6265486-1.64114841-4.2422709

In [36]:
# for each element you find, print out the restaurant name
for entry in html.find_all('div', {'class': "SearchResult__title--container"}):
    print(entry.text)

Chicken + Whiskey
Oyster Oyster
The Green Zone
Cafe Fili DC
CHIKO Dupont Circle
The Dabney
The Red Hen
Rasika West End
Rasika Penn Quarter
All Purpose
Tail Up Goat
Rooster & Owl
Maydan
Sushi Nakazawa DC
Ellē
Fancy Radish
Reveler’s Hour
St. Vincent Wine


In [37]:
# first, see if you can identify the location for all elements -- print it out
html.find_all('div', {'class': 'neighborhood'})

[<div class="neighborhood"><i class="ResyIcon ResyIcon--pin"><svg height="1em" viewbox="0 0 20 20" width="1em"><path d="m10 1.5c3.5898509 0 6.5 2.91014913 6.5 6.5 0 3.1783736-1.8115848 5.4745544-4.9479735 8.6580075l-.8458482.8488788c-.611314.6003241-.75280912.6319201-1.31466848.094788l-.63253781-.6293629-.60151476-.6104719c-2.92840208-3.0013562-4.65745725-5.2721018-4.65745725-8.3618395 0-3.58985087 2.91014913-6.5 6.5-6.5zm0 4c-1.38071187 0-2.5 1.11928813-2.5 2.5s1.11928813 2.5 2.5 2.5c1.3807119 0 2.5-1.11928813 2.5-2.5s-1.1192881-2.5-2.5-2.5z" fill-rule="evenodd"></path></svg></i>Logan/ 14th Street Corridor</div>,
 <div class="neighborhood"><i class="ResyIcon ResyIcon--pin"><svg height="1em" viewbox="0 0 20 20" width="1em"><path d="m10 1.5c3.5898509 0 6.5 2.91014913 6.5 6.5 0 3.1783736-1.8115848 5.4745544-4.9479735 8.6580075l-.8458482.8488788c-.611314.6003241-.75280912.6319201-1.31466848.094788l-.63253781-.6293629-.60151476-.6104719c-2.92840208-3.0013562-4.65745725-5.2721018-4.65745725

In [38]:
# now print out EACH location for the restaurants
for entry in html.find_all('div', {'class': 'neighborhood'}):
    print(entry.text)

Logan/ 14th Street Corridor
Shaw
Adams Morgan
Capitol Hill
Dupont Circle
Shaw
Bloomingdale
West End
Penn Quarter
Shaw
Adams Morgan
14th Street NW
14th Street
East End
Mount Pleasant
H Street Corridor
Lanier Heights
Park View


Ok, we've figured out the restaurant name and neighborhood. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

<a id="retrieving-prices"></a>
#### Retrieving the restaurant prices

The loop below prints the price of each restaurant as a string of dollar signs. But what if we wanted just the number of dollar signs per restaurant? Can you figure out a way to print that number for each restaurant listed?

In [39]:
for entry in html.find_all('div', {'class': 'price'}):
    print(entry.text)

$$
$$$
$
$$
$$
$$$
$$
$$$
$$$
$$
$$
$$$
$$
$$$$
$$
$$
$$
$$


In [40]:
# print the number of dollar signs per restaurant
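
One possible approach, sketched on a few of the price strings printed above rather than on the live page: since each price is just a string of `$` characters, `str.count` gives the number directly.

```python
# A few of the price strings printed by the loop above
prices = ["$$", "$$$", "$", "$$$$"]

# count the dollar signs in each string
dollar_counts = [price.count("$") for price in prices]
print(dollar_counts)  # [2, 3, 1, 4]
```

On the scraped page, the same idea would read `entry.text.count("$")` inside the loop over the `price` tags.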

In [41]:
# do the same for the cuisine of each restaurant
for entry in html.find_all('div', {'class': 'cuisine'}):
    print(entry.text)

Cocktail Bar
Vegetarian
Middle Eastern
Mediterranean
Asian
American
Italian
Indian
Indian
Pizza
Mediterranean
American
Middle Eastern
Sushi
American
Vegan
American
Wine


In [42]:
# do the same for the star rating and number of ratings
for entry in html.find_all('div', {'class': 'SearchResult__metadata--rating'}):
    print(entry.text)

5.0(30)·
4.9(868)·
4.9(731)·
4.9(188)·
4.9(117)·
4.8(16.7k)·
4.8(9.1k)·
4.8(9k)·
4.8(7.9k)·
4.8(7.6k)·
4.8(5.8k)·
4.8(5.6k)·
4.8(5.1k)·
4.8(3.3k)·
4.8(3.2k)·
4.8(2.9k)·
4.8(2.5k)·
4.8(1.7k)·
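
The rating strings above pack two values together, for example `4.8(16.7k)·`. A regular expression is one way to separate the star rating from the number of ratings. This is a sketch rather than the lesson's official solution; `parse_rating` is a hypothetical helper, and counts abbreviated with `k` can only be recovered approximately:

```python
import re

# score, then "(" count, an optional "k" suffix, and ")"
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\((\d+(?:\.\d+)?)(k?)\)")


def parse_rating(text):
    """Split a string like '4.8(16.7k)·' into (score, number_of_ratings)."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    score = float(match.group(1))
    count = float(match.group(2))
    if match.group(3) == "k":
        count *= 1000  # "16.7k" means roughly 16,700
    return score, int(round(count))


for raw in ["5.0(30)·", "4.9(868)·", "4.8(16.7k)·", "4.8(9k)·"]:
    print(raw, "->", parse_rating(raw))
```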


In [43]:
for entry in html.find_all('div', {'class': 'SearchResult__why-we-like-it body--sm color--text-secondary'}):
    print(entry.text)
    

Before chef Enrique Limardo made a name for himself helming the kitchens at Seven Reasons and Immigrant Food, he was cutting his teeth in D.C. ...
Don’t let the name fool you. Oyster Oyster serves up more than just bivalves in its charming and airy space. In fact, much of chef Rob Rubba’s exce...
In its mission to highlight the food of the Mid-Atlantic region, The Dabney excels fiercely. That most of its hyper-seasonal dishes are cooked in a ...
For those nights when all you want (or need) is a satisfying bowl of pasta and a glass of nice wine, you head straight to one of DC’s coziest and m...
One bite into the naan and you’ll understand why Rasika belongs to the city’s fine dining pantheon. This is acclaimed restaurateur Ashok Bajaj’...
One bite into the naan and you’ll understand why Rasika belongs to the city’s fine dining pantheon. This is acclaimed restaurateur Ashok Bajaj’...
The pizza here is very notably delicious. It’s no wonder chef Mike Friedman got a James Beard Award nom.


In [44]:
driver.close()

<a id="challenge-pandas"></a>
### Challenge: Use Pandas to create a DataFrame of bookings

However, the onus is on you to now put all the pieces together.

Loop through each entry. For each entry, grab the relevant information we want (name, location, price, star rating, number of ratings). Produce a DataFrame with the columns "name", "location", "price", "star ratings", and "# of ratings" that contains one row per restaurant on the page.

In [47]:
import pandas as pd

In [77]:
# I'm going to create my empty df first
dc_eats = pd.DataFrame(columns=["name","location",
                                "price","star ratings",
                                "# of ratings"])

In [78]:
dc_eats

| | name | location | price | star ratings | # of ratings |
|---|---|---|---|---|---|


In [84]:
dc_eats["name"] = [
    entry.text for entry in html.find_all(
        'div', {'class': "SearchResult__title--container"}
    )
]
dc_eats["location"] = [
    entry.text for entry in html.find_all(
        'div', {'class': "neighborhood"}
    )
]
dc_eats["price"] = [
    entry.text.count("$") for entry in html.find_all(
        'div', {'class': "price"}
    )
]

dc_eats["star ratings"] = [
    entry.text for entry in html.find_all(
        'span', {'class': "score"}
    )
]

dc_eats["# of ratings"] = [
    entry.text.replace("(", "").replace(")", "") for entry in html.find_all(
        'span', {'class': "ratings"}
    )
]

In [85]:
dc_eats

| | name | location | price | star ratings | # of ratings |
|---|---|---|---|---|---|
| 0 | Chicken + Whiskey | Logan/ 14th Street Corridor | 2 | 5.0 | 30 |
| 1 | Oyster Oyster | Shaw | 3 | 4.9 | 868 |
| 2 | The Green Zone | Adams Morgan | 1 | 4.9 | 731 |
| 3 | Cafe Fili DC | Capitol Hill | 2 | 4.9 | 188 |
| 4 | CHIKO Dupont Circle | Dupont Circle | 2 | 4.9 | 117 |
| 5 | The Dabney | Shaw | 3 | 4.8 | 16.7k |
| 6 | The Red Hen | Bloomingdale | 2 | 4.8 | 9.1k |
| 7 | Rasika West End | West End | 3 | 4.8 | 9k |
| 8 | Rasika Penn Quarter | Penn Quarter | 3 | 4.8 | 7.9k |
| 9 | All Purpose | Shaw | 2 | 4.8 | 7.6k |
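
The column-by-column approach above works only when every `find_all` call returns the same number of items in the same order. The "why we like it" blurb printed earlier appeared for only some restaurants, so a column built from it would not line up. A more defensive sketch loops over one card at a time and tolerates missing fields. The HTML below, including the `card` class for the outer container, is a simplified, hypothetical stand-in and not Resy's real markup:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one outer "card" div per restaurant.
# The second card deliberately omits the score to show the missing-field case.
sample_html = """
<div class="card">
  <h3 class="SearchResult__venue-name">Chicken + Whiskey</h3>
  <div class="neighborhood">Logan/ 14th Street Corridor</div>
  <div class="price">$$</div>
  <span class="score">5.0</span>
</div>
<div class="card">
  <h3 class="SearchResult__venue-name">The Green Zone</h3>
  <div class="neighborhood">Adams Morgan</div>
  <div class="price">$</div>
</div>
"""


def text_or_none(card, name, css_class):
    """Return the stripped text of the first matching tag in card, or None."""
    tag = card.find(name, {"class": css_class})
    return tag.text.strip() if tag is not None else None


cards = BeautifulSoup(sample_html, "html.parser").find_all("div", {"class": "card"})

rows = []
for card in cards:
    price = text_or_none(card, "div", "price")
    rows.append({
        "name": text_or_none(card, "h3", "SearchResult__venue-name"),
        "location": text_or_none(card, "div", "neighborhood"),
        "price": price.count("$") if price is not None else None,
        "star ratings": text_or_none(card, "span", "score"),
    })

# one dictionary per restaurant becomes one row per restaurant
sample_df = pd.DataFrame(rows)
print(sample_df)
```

Building a list of dictionaries first keeps each restaurant's fields together, so a missing value becomes an empty cell instead of shifting every later row.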


In [None]:
# Put code here that populates the DataFrame using Selenium and BeautifulSoup!

In [None]:
# check out our work
dc_eats.head()

Awesome! We succeeded.

<a id="selenium-typing"></a>
### Auto-typing using Selenium

Now, let's explore some of the other functionality of a webdriver. We've barely scratched the surface.

In [91]:
# we can send keys as well

from selenium.webdriver.common.keys import Keys

In [92]:
# open Firefox
driver = webdriver.Firefox()

# visit Python
driver.get("http://www.python.org")

# verify we're in the right place
assert "Python" in driver.title

Let's try automatically typing `pycon` in the search box and hitting the return key:

In [93]:
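from selenium.webdriver.common.by import By

# find the search box (find_element_by_name was removed in Selenium 4.3)
elem = driver.find_element(By.NAME, "q")

# clear it
elem.clear()

# type in pycon
elem.send_keys("pycon")

# press Return to submit the search
elem.send_keys(Keys.RETURN)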

In [94]:
# close
driver.close()

In [95]:
# all at once:
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title

elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
#assert "No results found." not in driver.page_source
driver.close()

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

It is especially worth exploring functionality such as locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html
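
As a taste of element location, Selenium can also pull text straight from the rendered page without handing the HTML to Beautiful Soup. A rough sketch, assuming Selenium 4 (which uses `find_element(By..., ...)` and `find_elements(By..., ...)` rather than the older `find_element_by_*` helpers); `visible_texts` is a helper name invented for this example:

```python
def visible_texts(driver, css_selector):
    """Return the text of every element on the current page matching css_selector."""
    # Imported inside the function so the definition loads without Selenium installed.
    from selenium.webdriver.common.by import By

    # find_elements (plural) returns a list, which is empty when nothing matches
    elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
    return [element.text for element in elements]


# Usage (needs an open driver, so it is left commented out here):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("https://resy.com/cities/dc?seats=2&date=2022-08-02")
# print(visible_texts(driver, "div.neighborhood"))
# driver.quit()
```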

<a id="summary"></a>
## Summary

In this lesson, we used the Beautiful Soup library to locate elements on a website and then scrape their text. We also used Selenium to drive a real browser so that the page's JavaScript runs before we retrieve the page contents.