<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

_Author: Dave Yerrington (SF)_

---

## Before Class

#### Install Selenium

Selenium is a browser automation tool. It drives a real browser (optionally in headless mode, with no visible window), which allows us to render JavaScript just as a human-navigated browser would.

To install Selenium, use one of the following:
- **Anaconda:** `conda install -c conda-forge selenium`
- **pip:** `pip install selenium`


#### Install GeckoDriver

You will also need GeckoDriver (this assumes you are using Homebrew for Mac): 

- ```brew install geckodriver```

#### Install Firefox

Additionally, you will need to have downloaded the [Firefox browser](https://www.mozilla.org/en-US/firefox/new/) for the application in this lesson.

## Learning Objectives
- Revisit how to locate elements on a webpage
- Acquire unstructured data from the internet using Beautiful Soup
- Discuss the limitations of simple HTTP libraries such as requests and urllib
- Introduce Selenium as a solution, and implement a scraper using Selenium

## Lesson Guide

- [Introduction](#intro)
- [Building a web scraper](#building-scraper)
- [Retrieving data from the HTML page](#retrieving-data)
    - [Retrieving the restaurant names](#retrieving-names)
    - [Challenge: Retrieving the restaurant locations](#retrieving-locations)
    - [Retrieving the restaurant prices](#retrieving-prices)
    - [Retrieving the restaurant number of bookings](#retrieving-bookings)


- [Introducing Selenium](#selenium)
    - [Running JavaScript before scraping](#selenium-js)
    - [Using regex to only get digits](#selenium-regex)
    - [Challenge: Use Pandas to create a DataFrame of bookings](#challenge-pandas)
    - [Auto-typing using Selenium](#selenium-typing)


- [Summary](#summary)

<a id="intro"></a>
## Introduction

In this codealong lesson, we'll build a web scraper using requests and Beautiful Soup. We will also explore how to use Selenium, a tool that automates a real browser.

We'll begin by scraping OpenTable's DC listings. We're interested in knowing the restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this given page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each of the bits of information in which we're interested.

---

<a id="building-scraper"></a>
## Building a web scraper

Now, let's build a web scraper for OpenTable using requests and Beautiful Soup:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import requests

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = requests.get(url)

At this point, what is in html?

In [3]:
# .text returns the response body decoded to a Unicode string
html.text[:500]

# showing the first 500 characters

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><l'

We will need to convert this HTML text into a soup object so we can parse it using Python and Beautiful Soup (BS4).

In [4]:
# convert this into a soup object
soup = BeautifulSoup(html.text, 'html.parser')

<a id="retrieving-data"></a>
### Retrieving data from the HTML page

Let's first find each restaurant name listed on the page we've loaded. How do we find where on the page the name lives? (Hint: We need to know where in the **HTML** the restaurant element is housed.) To find the HTML that renders the restaurant name, we can use Google Chrome's Inspect tool:

> http://www.opentable.com/washington-dc-restaurant-listings

> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [5]:
# print the restaurant names
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

#name of the html tag, and extra attributes



[<span class="rest-row-name-text">Durward Ziemann</span>,
 <span class="rest-row-name-text">Cales</span>,
 <span class="rest-row-name-text">Caroline Bergnaum</span>,
 <span class="rest-row-name-text">Et Bailey</span>,
 <span class="rest-row-name-text">Id</span>,
 <span class="rest-row-name-text">Nayeli Bartoletti</span>,
 <span class="rest-row-name-text">Fernes</span>,
 <span class="rest-row-name-text">Agloe Bar &amp; Grill</span>,
 <span class="rest-row-name-text">1138 Rippin</span>,
 <span class="rest-row-name-text">Glover</span>,
 <span class="rest-row-name-text">Ernestos</span>,
 <span class="rest-row-name-text">Centers</span>,
 <span class="rest-row-name-text">Omnis Heaney</span>,
 <span class="rest-row-name-text">Streich Ridge</span>,
 <span class="rest-row-name-text">Vincent Mission</span>,
 <span class="rest-row-name-text">Harum Roob</span>,
 <span class="rest-row-name-text">Mount</span>,
 <span class="rest-row-name-text">Et Route</span>,
 <span class="rest-row-name-text">1152 

It is important to always keep in mind the data types that were returned. Note this is a `list`, and we know that immediately by observing the outer square brackets and commas separating each tag.

Next, note the elements of the list are `Tag` objects, not strings. (If they were strings, they would be surrounded by quotes.) The Beautiful Soup authors chose to display a `Tag` object visually as a text representation of the tag and its contents. However, being an object, it has many attributes and methods that we can use. For example, next we will use the `.text` attribute to return the tag's text content as a Python string.
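
To make the distinction between the list, the `Tag` objects, and their text concrete, here is a minimal sketch. It parses a made-up two-restaurant snippet rather than the live page, so `sample_html` and `sample_soup` are illustrative names, not part of the lesson's notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the OpenTable page
sample_html = """
<div>
  <span class="rest-row-name-text">Agloe Bar &amp; Grill</span>
  <span class="rest-row-name-text">Glover</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
tags = sample_soup.find_all(name="span", attrs={"class": "rest-row-name-text"})

first = tags[0]
tag_type = type(first).__name__   # the elements are Tag objects, not strings
tag_name = first.name             # the HTML tag name
tag_classes = first["class"]      # attributes are read like dictionary keys
tag_text = first.text             # the text content as a plain Python string

print(tag_type, tag_name, tag_classes, tag_text)
```

Note that `.text` also decodes the HTML entity `&amp;` into a plain `&`.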

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

Now that we've found a list of tags containing the restaurant names, let's think about how we can loop through them one by one. In the following cell, we'll print out the name (and **only** the clean name, not the rest of the HTML) of each restaurant.

In [6]:
# for each element you find, print out the restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

Durward Ziemann
Cales
Caroline Bergnaum
Et Bailey
Id
Nayeli Bartoletti
Fernes
Agloe Bar & Grill
1138 Rippin
Glover
Ernestos
Centers
Omnis Heaney
Streich Ridge
Vincent Mission
Harum Roob
Mount
Et Route
1152 Wyman
Bosco Fords
1113 DuBuque
Et
Vel Schaefer
Qui Dicki
Florida Wells
Squares
Alexandro Ankunding
Jannie Johns
Dolor
Kianas
1324 Gerlach
Inventore Gateway
Island
Sunt
643 Bode
Overpass
Est Cliffs
Radial
Dicta
Johnny Will
Alaniss
Oberbrunner Village
Euna Tunnel
Explicabo
789 Rice
1133 Hayes
Dolor
Recusandae Treutel
Tevins
Herman Junction
Vel
567 Rutherford
Mustafa Dam
Court
Nulla Abshire
Stream
Ville
Non Village
Harum
Trycia Walk
Chyna Graham
White
At Lemke
Adipisci Bednar
Fuga Bergnaum
Odit Pines
Hayes
Beatae Mertz
Lexi O'Keefe
Occaecati Gaylord
Birdie McGlynn
Non Lakin
Sit Howe
Quis Hyatt
Et
Vitae Paucek
Mann
Gerhard Koch
Ducimus Lodge
Brekke Cape
409 Corkery
Kunde
Landing
Okuneva
Bergnaum
Lueilwitz
Elizabeth Hartmann
Daniel
Ferne Parisian
Canyon
Chance Lebsack
Bogisich
Ezekiel Pol

Great!
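
If the names are needed later (for example, to build a table), they can be collected into a list instead of printed. The sketch below uses a made-up snippet in place of the live page; `sample_html` and `sample_soup` are illustrative names, and `select()` with a CSS selector is shown only as an alternative way to find the same tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the OpenTable page
sample_html = """
<a><span class="rest-row-name-text">Agloe Bar &amp; Grill</span></a>
<a><span class="rest-row-name-text"> Glover </span></a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same lookup as in the loop above, but keep the cleaned text in a list
names = [tag.text.strip()
         for tag in sample_soup.find_all("span", {"class": "rest-row-name-text"})]

# Equivalent lookup using a CSS selector
names_css = [tag.text.strip()
             for tag in sample_soup.select("span.rest-row-name-text")]

print(names)
```

Calling `.strip()` is a small precaution against stray whitespace around the text inside the tag.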

<a id="retrieving-locations"></a>
#### Challenge: Retrieving the restaurant locations

Can you repeat that process for finding the location? For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [7]:
# first, see if you can identify the location for all elements -- print it out
soup.find_all('span', {'class':'rest-row-meta--location rest-row-meta-text'})

[<span class="rest-row-meta--location rest-row-meta-text">South Mathew</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Darianstad</span>,
 <span class="rest-row-meta--location rest-row-meta-text">North Jovanimouth</span>,
 <span class="rest-row-meta--location rest-row-meta-text">New Jean</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Greysonfort</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Lake Amirbury</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Wolffmouth</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Murrayfurt</span>,
 <span class="rest-row-meta--location rest-row-meta-text">West Erynfurt</span>,
 <span class="rest-row-meta--location rest-row-meta-text">West Simone</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Lake Hans</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Braulioside</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Sou

In [8]:
# now print out EACH location for the restaurants
for entry in soup.find_all('span', {'class':'rest-row-meta--location rest-row-meta-text'}):
    print(entry.text)

South Mathew
Darianstad
North Jovanimouth
New Jean
Greysonfort
Lake Amirbury
Wolffmouth
Murrayfurt
West Erynfurt
West Simone
Lake Hans
Braulioside
South Clay
Turnerland
Port Aracelyburgh
Mrazland
Kleinton
New Chaim
East Keira
South Prince
Victoriaburgh
Toyfurt
North Jaycee
Stephonstad
Carmellafort
Lake Juliaborough
Wisokytown
Port Travonview
Juddview
Jaydeview
Lake Alvisport
Lake Bernhardfurt
Mertzside
Gerardohaven
Gaybury
Alland
North Madaline
Lake Lacyhaven
Kirstinmouth
Angelitachester
Cadeside
Weimannfurt
New Amarashire
East Kielfurt
South Alexaneview
Port Winona
East Rachelle
Runolfsdottirport
New Karinaview
Domingotown
Port Steveshire
Handburgh
Eastonmouth
Dickimouth
South Ashtynport
Vandervortmouth
Grimesfort
O'Connershire
Vonbury
West Viviane
Port Charles
Kochhaven
Lake Elnoraberg
Lake Arden
Port Duaneside
Dickiborough
Oswaldoland
Buckmouth
Bodeview
Barrowstown
East Daphneville
Johnathonshire
Elsieland
New Kenyattafurt
Wisokyfort
Augustinebury
Reynoldschester
Hobartport
Lake Fab

<a id="retrieving-prices"></a>
#### Retrieving the restaurant prices

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [9]:
# print out all prices
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="

In [10]:
# print out EACH number of dollar signs per restaurant
# this one is trickier to eliminate the html. Hint: try a nested find
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    $  
  $ 

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?

In [11]:
# print the number of dollar signs per restaurant
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    price = entry.find('i').text
    print(price.count('$'))

2
4
4
4
4
4
4
2
4
3
4
4
4
2
4
4
3
3
4
2
3
2
4
4
3
3
2
4
2
4
4
4
3
3
3
3
2
3
3
3
3
4
3
3
4
2
3
3
4
3
2
4
4
4
3
4
2
3
2
4
3
2
4
2
2
3
4
4
2
3
2
4
3
2
2
3
3
2
2
4
4
3
3
4
2
3
4
3
4
2
2
3
2
3
3
4
3
3
3
4


Phew, nice work. 

<a id="retrieving-bookings"></a>
#### Retrieving the restaurant number of bookings

One more, right? We only need to find the number of times a restaurant was booked. In the next cell, print out all objects that contain the number of times the restaurant was booked.

In [12]:
# print out all objects that contain the number of times the restaurant was booked
soup.find_all('div', {'class':'booking'})

[]

That's weird -- an empty list. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?

In [13]:
# let's first try printing out all 'div' objects
#  NOTE: This is too many objects to display in this notebook!
#        Expect the output below to be cut off by the notebook's data rate limit.

for entry in soup.find_all('div'):
    print(entry)

<div class="master-container" id="search-master-container"> <script>
  if (window.Cypress) {
    window.__TestGaCalls = [];
    window.MapTrackGA = function(dataPoint) {
      window.ga = function(_) {
        window.__TestGaCalls.push(arguments);
      }
      var data = {};
      data[dataPoint] = '1';
      ga('gtm1.send', 'event', 'map_event', 'map_event', data);
    }
  }
  else {
    window.MapTrackGA = function(dataPoint) {
      if (typeof ga === 'function') {
        var data = {};
        data[dataPoint] = '1';
        ga('gtm1.send', 'event', 'map_event', 'map_event', data);
      }
    }
  }
</script> <style>.icon-font{font-family:icons;speak:none;font-style:normal;font-weight:400;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.breadcrumb li.icon-visible a:before{font-family:icons;speak:none;font-style:normal;font-weight:400;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothin

<div class="overall-search-container" id="no-re-render-container"> <div class="filters-bar show-filter-content" id="filters-bar"><div class="max-width-wrapper js-sticky-target" id="sticker"><div class="row"><div class="column"><div class="visual_map_icon" data-view="map" role="button" tabindex="0"><div class="visual-button button"> Map</div></div> <div aria-label="filters" id="search_filters" role="search"> <ul class="filters-list"> </ul> </div></div></div></div></div> <div class="stack-selected-filters"> <div class="toggle-filter-bar"><div id="filter-toggle"><div class="filter-toggle-icon"></div> <span class="show-filter-text">Show filters</span> <span class="hide-filter-text">Hide filters</span></div></div><div class="selected-filters js-selected-filters full-width-wrapper"><div class="row"><div class="column" id="js-selected-filters-column"></div></div></div> <div aria-label="search results" class="search-results-container page-main-content max-width-wrapper" id="search_results_cont

<div aria-label="search results" class="search-results-container page-main-content max-width-wrapper" id="search_results_container" role="main"><div class="close-filters"></div> <div class="loader" id="loading_animation"><div class="spinner"></div><div aria-live="assertive" class="loader-content" id="loading_error_container"></div><span aria-live="assertive" id="loading_message" style="opacity: 0"></span></div> <div class="results-set results-table search-results" data-name="ResultsTable" id="search_results"> <div class="content-section-header"> <div class="flex-row-justify results-header"> <h3 class="results-title color-dark" id="results-title">  100 Restaurants </h3> <div class="sort-view-filters"><div class="filter-option right filter-option-sort-orders" id="sort-filters"> <label class="sort-dropdown" data-target="sort-filter-menu" id="js-sort-dropdown"><div class="sort-dropdown__container"> <input class="sort-dropdown__checkbox" id="js-toggle-sort-menu" type="checkbox"/><div class=

<div class="results-set results-table search-results" data-name="ResultsTable" id="search_results"> <div class="content-section-header"> <div class="flex-row-justify results-header"> <h3 class="results-title color-dark" id="results-title">  100 Restaurants </h3> <div class="sort-view-filters"><div class="filter-option right filter-option-sort-orders" id="sort-filters"> <label class="sort-dropdown" data-target="sort-filter-menu" id="js-sort-dropdown"><div class="sort-dropdown__container"> <input class="sort-dropdown__checkbox" id="js-toggle-sort-menu" type="checkbox"/><div class="sort-dropdown__button"> A-Z</div><ul class="sort-dropdown__options"> <li class="sort-dropdown__option"> <label class="menu-list-label sort-dropdown__option-label" for="SortOrder_Popular"><input class="menu-list-input sort-option sort-dropdown__radio" id="SortOrder_Popular" name="SortOrder" type="radio" value="Popularity"/> <span class="sort-dropdown__option-text">Featured</span></label></li><li class="sort-drop

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



I still don't see it. Let's search our entire soup object:

In [14]:
# print out soup, then use command+f to search for "booked "

soup

 <!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=9; IE=8; IE=7; IE=EDGE" http-equiv="X-UA-Compatible"/> <title>Restaurant Reservation Availability</title> <meta content="noindex,nofollow" name="robots"/> <link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" rel="icon" sizes="16x16"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-32.png" rel="icon" sizes="32x32"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-48.png" rel="icon" sizes="48x48"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-64.png" rel="icon" sizes="64x64"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-128.png" rel="icon" sizes="128x128"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/a

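What do you notice? Why is this happening?

The booking counts are added by JavaScript after the page loads, so they never appear in the HTML that `requests` downloads. For contrast, a page whose listings are already present in the initial HTML can be scraped with `requests` alone, as in this Craigslist example: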

In [15]:
html = requests.get('https://santabarbara.craigslist.org/d/apts-housing-for-rent/search/apa')
soup_cl = BeautifulSoup(html.text, 'html.parser')

In [16]:
results = soup_cl.find_all('p', attrs={'class': 'result-info'})
listings = []
for result in results:
    result_dict = {}
    result_dict['title'] = result.find('a').text
    result_dict['price'] = result.find('span', attrs={'class': 'result-price'}).text
    listings.append(result_dict)

listings

[{'title': '3BR in Peaceful Neighborhood', 'price': '$3700'},
 {'title': '3BR 2BATHS close to Butterfly beach.', 'price': '$3800'},
 {'title': '2BR In the Heart of Downtown SB', 'price': '$3500'},
 {'title': 'Buellton 3bed 2bath', 'price': '$2995'},
 {'title': '4 beds 3 baths 2,357 sqft home available now', 'price': '$3000'},
 {'title': 'Adorable 3beds 2baths for rent', 'price': '$4000'},
 {'title': 'Newly Remodeled 4bed 2bath home for rent', 'price': '$3200'},
 {'title': 'Newly Renovated 4bed 2bath home', 'price': '$3200'},
 {'title': 'House for lease ,14 acres for Commercial Cannabis or Horses (Oklahoma',
  'price': '$5500'},
 {'title': 'Family Home', 'price': '$4950'},
 {'title': 'Luminous 1 bedroom 1 bath condo', 'price': '$2200'},
 {'title': '4 Bed_3 Baths with air conditioning, hardwood floors, walk-in shower',
  'price': '$596'},
 {'title': 'Beautiful Cottage 2 blocks to Cottage Hospital/Oak Park',
  'price': '$3900'},
 {'title': 'East Beach /2+2 condo/Furnished w/ Private Garag
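
One way to see why the booking `div`s were missing from the OpenTable result: Beautiful Soup parses markup, but it does not execute scripts. The sketch below is a minimal illustration using a made-up page (`static_html` is hypothetical) in which the booking element is created by a script rather than written in the markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the booking div is created by a script, not present in the markup
static_html = """
<html><body>
  <div class="rest-row"><span class="rest-row-name-text">Glover</span></div>
  <script>
    var d = document.createElement('div');
    d.className = 'booking';
    d.textContent = 'Booked 31 times today';
    document.body.appendChild(d);
  </script>
</body></html>
"""
static_soup = BeautifulSoup(static_html, "html.parser")

# The name is in the markup, so it is found
names = [t.text for t in static_soup.find_all("span", {"class": "rest-row-name-text"})]

# The booking div only exists after a browser runs the script, so nothing is found
booking_divs = static_soup.find_all("div", {"class": "booking"})

print(names)
print(booking_divs)
```

A browser would run the script and add the element to the page; a plain HTTP download followed by parsing never does.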

<a id="selenium"></a>
## Introducing Selenium

Selenium is a browser automation tool. It drives a real browser (optionally in headless mode, with no visible window), which allows us to render JavaScript just as a human-navigated browser would.


In [17]:
# import
from selenium import webdriver

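Selenium requires us to choose a browser to drive. The install steps above set up Firefox with GeckoDriver, but Chrome (or Chromium) with ChromeDriver is also a very common choice, and it is what the cells below use. Note that `executable_path` is the Selenium 3 way of pointing at the driver binary; recent Selenium 4 releases take the path through a `Service` object instead. See http://selenium-python.readthedocs.io/faq.html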

In [18]:
# STOP
# what is going to happen when I run the next cell?

In [19]:
# create a Chrome driver (this opens a browser window)
driver = webdriver.Chrome(executable_path="./practice/chromedriver/chromedriver")

Pretty crazy, right? Let's close that driver.

In [20]:
# close it
driver.close()

In [21]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Chrome(executable_path="./practice/chromedriver/chromedriver")
driver.get("http://www.python.org")

In [22]:
driver.close()

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. In the next cell, prove you can programmatically visit the page.

In [23]:
# visit our OpenTable page
driver = webdriver.Chrome(executable_path="./practice/chromedriver/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# always good to check we've got the page we think we do
assert "OpenTable" in driver.title

In [24]:
driver.title

'Washington, D.C. Area Restaurants List | OpenTable'

In [25]:
driver.close()

<a id="selenium-js"></a>
### Running JavaScript before scraping

Now, to resolve our JavaScript problem, there are a few things we can do. In this case, I'll request the page, wait one second, and then grab the source HTML. Because the page is loaded in a real browser, its JavaScript runs and whatever it renders becomes part of the page source, which I can then grab.
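
A fixed one-second wait may be too short on a slow connection and wastefully long on a fast one. One alternative is to poll until a condition holds or a timeout expires. The helper below, `wait_until`, is a hypothetical pure-Python sketch of that idea and is not part of Selenium (Selenium provides its own `WebDriverWait` for the same purpose); the `page_ready` function is a stand-in for a real check against the browser.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout is reached.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Stand-in for "has the page finished rendering?": becomes true on the third check
checks = {"count": 0}

def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3

ready = wait_until(page_ready, timeout=2.0, interval=0.01)
print(ready, checks["count"])
```

With a live driver, the condition could be something like `lambda: "booking" in driver.page_source`, so that scraping starts as soon as the bookings have rendered.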

In [26]:
# import sleep
from time import sleep

In [27]:
# visit our relevant page
driver = webdriver.Chrome(executable_path="./practice/chromedriver/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# wait one second
sleep(1)

#grab the page source
html = driver.page_source

**Pop Quiz:** What do we need to do with this HTML?

In [28]:
# BeautifulSoup it!
html = BeautifulSoup(html, "lxml")

Now, let's return to our earlier problem: How do we locate bookings on the page?

In [29]:
# print out the number of bookings for all restaurants
html.find_all('div', {'class':'booking'})

[<div class="booking"><span class="tadpole"></span>Booked 31 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 6 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 43 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 43 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 54 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 1 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 107 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 28 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 12 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 46 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 58 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 64 times today</div>,
 <div class="booking"><span class="tadpol

In [30]:
# now print out each booking for the listings using a loop
for entry in html.find_all('div', {'class':'booking'}):
    print(entry)

<div class="booking"><span class="tadpole"></span>Booked 31 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 6 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 43 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 43 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 54 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 1 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 107 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 28 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 12 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 46 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 58 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 64 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 33 times

Let's grab just the text of each of these entries.

In [31]:
# do the same as above, but grabbing only the text content
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

Booked 31 times today
Booked 6 times today
Booked 43 times today
Booked 43 times today
Booked 54 times today
Booked 1 times today
Booked 107 times today
Booked 28 times today
Booked 12 times today
Booked 46 times today
Booked 58 times today
Booked 64 times today
Booked 33 times today
Booked 41 times today
Booked 15 times today
Booked 40 times today
Booked 40 times today
Booked 37 times today
Booked 113 times today
Booked 38 times today
Booked 38 times today
Booked 29 times today
Booked 32 times today
Booked 23 times today
Booked 10 times today
Booked 22 times today
Booked 3 times today
Booked 22 times today
Booked 207 times today
Booked 15 times today
Booked 30 times today
Booked 50 times today
Booked 534 times today
Booked 61 times today
Booked 166 times today
Booked 5 times today
Booked 63 times today
Booked 13 times today
Booked 56 times today
Booked 56 times today
Booked 77 times today
Booked 9 times today
Booked 24 times today
Booked 10 times today
Booked 12 times today
Booked 28 

In [32]:
driver.close()

We've succeeded!

<a id="selenium-regex"></a>
### Using regex to only get digits

But we can clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits in each of these text strings.

The best way to get good at regex is to, well, just keep trying and testing: http://pythex.org/

In [33]:
# import regex
import re

Since we haven't covered regex yet, I'll show you how to use the `search` function to match the first run of digits in a string.

In [34]:
# for each entry, grab the text
for booking in html.find_all('div', {'class':'booking'}):
    # match the first run of one or more digits
    match = re.search(r'\d+', booking.text)
    
    if match:
        # print if found
        print(match.group())
    else:
        # otherwise pass
        pass

31
6
43
43
54
1
107
28
12
46
58
64
33
41
15
40
40
37
113
38
38
29
32
23
10
22
3
22
207
15
30
50
534
61
166
5
63
13
56
56
77
9
24
10
12
28
17
3
11
5
27
37
3
38
15
26
63
49
15
150
73
31
89
74
40
7
58
1
49
3
57
30
71
49
31
33
2
19
35
31
68
70
71
9
44
70
14
38
36
193
112
261
257
79
145
17
23
60
15
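
The pattern can also be tried on a single string, with the match converted to an integer so it can be used in arithmetic later. This is a minimal sketch; `booking_count` is a hypothetical helper name, not something defined elsewhere in the lesson.

```python
import re

def booking_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

print(booking_count("Booked 31 times today"))
print(booking_count("Reviews coming soon"))
```

`re.search` returns `None` when nothing matches, which is why the result is checked before calling `.group()`.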


Before we demonstrate all the other amazing things about browser automation, let's finish collecting the data we want from this current example. Do you suppose the HTML parsing we wrote above will still work on the page source we've grabbed from our Selenium-driven browser?

To be most efficient, we want to do only a single loop over the entries on the page. That means we want to find the element that all of the other elements (name, location, price, bookings) are housed within. Where on the page is each entry located?

In [35]:
# print out all entries
#   NOTE: This returns many entries, so the output is long!

soup.find_all('div', {'class':'result content-section-list-row cf with-times'})

[<div class="result content-section-list-row cf with-times" data-id="0" data-index="1" data-is-marketed="true" data-is-promoted="undefined" data-lat="69.0280" data-lon="-145.1894" data-offers="" data-restaurant-availability-token="" data-rid="18251"><div class="rest-row with-image"> <div class="rest-row-image"> <a href="undefined" onclick="OT.BestAnalytics.logRestaurantVisit(18251)" target="_blank"><img alt="photo of durward ziemann restaurant" class="lazy rest-image" data-src="//resizer.otstatic.com/v2/profiles/legacy/18251.jpg" src="//media.otstatic.com/search-result-node/images/no-image.png"/></a></div> <div class="rest-row-info"><div class="rest-row-header"> <a class="rest-row-name rest-name" href="undefined" onclick="OT.BestAnalytics.logRestaurantVisit(18251)" target="_blank"><span class="rest-row-name-text">Durward Ziemann</span> </a> </div> <div class="flex-row-justify"> <div class="rest-row-review"> <div class="review-container"> Reviews coming soon</div> </div> <div class="res

Look over the page. Does every single entry have each element we're seeking?
> I did this previously. I know for a fact that not every entry has a number of recent bookings. That's probably exactly why OpenTable populates this with JavaScript: they want to continuously update the number of bookings with the most recent value.

In [36]:
# We want to only retrieve the text of the bookings.
# But what would happen if we just naively print the text of each node?

for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(entry.find('div', {'class':'booking'}))   # try adding .text

<div class="booking"><span class="tadpole"></span>Booked 31 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 6 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 43 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 43 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 54 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 1 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 107 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 28 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 12 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 46 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 58 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 64 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 33 times
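As for adding `.text`: Beautiful Soup's `find` returns `None` when nothing matches, so the attribute lookup fails on entries without a booking. A minimal sketch of that failure in plain Python, where `booking_tag` is a stand-in for the result of `entry.find(...)` rather than a real tag:

```python
# Stand-in for what entry.find('div', {'class': 'booking'}) gives back
# when an entry has no booking element.
booking_tag = None

try:
    booking_tag.text  # same attribute access as in the loop
except AttributeError as err:
    message = str(err)

print(message)
```

The loop would stop at the first entry without a booking, which is why the next cells test the tag before reading its text.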

In [37]:
# if we find the element we want, we print it. Otherwise, we print 'ZERO'
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    divs = entry.find('div', {'class':'booking'})

    if divs:
        print(divs.text)
    else:
        print('ZERO')

Booked 31 times today
Booked 6 times today
Booked 43 times today
Booked 43 times today
Booked 54 times today
Booked 1 times today
Booked 107 times today
Booked 28 times today
Booked 12 times today
Booked 46 times today
Booked 58 times today
Booked 64 times today
Booked 33 times today
Booked 41 times today
Booked 15 times today
Booked 40 times today
Booked 40 times today
Booked 37 times today
Booked 113 times today
Booked 38 times today
Booked 38 times today
Booked 29 times today
Booked 32 times today
Booked 23 times today
Booked 10 times today
Booked 22 times today
Booked 3 times today
Booked 22 times today
Booked 207 times today
Booked 15 times today
Booked 30 times today
Booked 50 times today
Booked 534 times today
Booked 61 times today
Booked 166 times today
Booked 5 times today
Booked 63 times today
Booked 13 times today
Booked 56 times today
Booked 56 times today
Booked 77 times today
Booked 9 times today
Booked 24 times today
Booked 10 times today
Booked 12 times today
Booked 28 

What do you notice appears in place of the booking when it is not found?

We could use exception handling (`try`/`except` blocks) to resolve this. However, exceptions are best reserved for rare or unexpected errors, not for outcomes we already expect as part of normal program flow.

In this case, we expect that some entries will have no booking count. So, we can just use an `if` statement that tests whether a booking `div` was found; if not, display `'ZERO'`. Here's a demo:

In [38]:
# if we find the element we want, we print it. Otherwise, we print 'ZERO'
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    booking_tag = entry.find('div', {'class':'booking'})
    
    if booking_tag:
        print(booking_tag.text)
    else:
        print('ZERO')

Booked 31 times today
Booked 6 times today
Booked 43 times today
Booked 43 times today
Booked 54 times today
Booked 1 times today
Booked 107 times today
Booked 28 times today
Booked 12 times today
Booked 46 times today
Booked 58 times today
Booked 64 times today
Booked 33 times today
Booked 41 times today
Booked 15 times today
Booked 40 times today
Booked 40 times today
Booked 37 times today
Booked 113 times today
Booked 38 times today
Booked 38 times today
Booked 29 times today
Booked 32 times today
Booked 23 times today
Booked 10 times today
Booked 22 times today
Booked 3 times today
Booked 22 times today
Booked 207 times today
Booked 15 times today
Booked 30 times today
Booked 50 times today
Booked 534 times today
Booked 61 times today
Booked 166 times today
Booked 5 times today
Booked 63 times today
Booked 13 times today
Booked 56 times today
Booked 56 times today
Booked 77 times today
Booked 9 times today
Booked 24 times today
Booked 10 times today
Booked 12 times today
Booked 28 

Having worked through this page before, we know that all the other elements WILL be returned for every entry. This means we only need this kind of check for the bookings.
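To work with the counts as numbers, the `if` check can be combined with a digit-matching regex. A minimal sketch, assuming a hypothetical helper named `booking_count` that receives the tag's text, or `None` when the tag is missing; it is not part of the lesson's code:

```python
import re
from typing import Optional


def booking_count(text: Optional[str]) -> Optional[int]:
    """Return the number in text like 'Booked 31 times today', or None."""
    if text is None:  # the booking <div> was missing for this entry
        return None
    match = re.search(r'\d+', text)  # first run of digits, if any
    return int(match.group()) if match else None


# Sample strings taken from the output above, plus a missing entry.
samples = ["Booked 31 times today", "Booked 534 times today", None]
counts = [booking_count(s) for s in samples]
print(counts)
```

Inside the scraping loop, this could be called as `booking_count(booking_tag.text if booking_tag else None)`.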

<a id="challenge-pandas"></a>
### Challenge: Use Pandas to create a DataFrame of bookings

Now the onus is on you to put all the pieces together.

Loop through each entry. For each entry, grab the relevant information we want (name, location, price, bookings). Produce a DataFrame with the columns `name`, `location`, `price`, and `bookings` that contains the 100 entries we would like.

In [39]:
import pandas as pd

In [45]:
# I'm going to create my empty df first
dc_eats = pd.DataFrame(columns=["name","cuisine","location","price","bookings"])

#don't specify any values

dc_eats

| | name | cuisine | location | price | bookings |
|---|---|---|---|---|---|


In [47]:
import re  # needed for the regex below (harmless if already imported earlier)

for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # grab the name
    name = entry.find('span', {'class':'rest-row-name-text'}).text

    # grab the cuisine
    cuisine = entry.find('span', {'class': 'rest-row-meta--cuisine rest-row-meta-text'}).text

    # grab the location; find_all returns a list and we want its second element, hence [1]
    location = entry.find_all('span', {'class': 'rest-row-meta--cuisine rest-row-meta-text sfx1388addContent'})[1].text

    # grab the price as the number of '$' characters
    price = entry.find('div', {'class':'rest-row-pricing'}).find('i').text.count('$')

    # try to find the number of bookings; keep 'NA' when the tag is missing
    bookings = 'NA'
    booking_tag = entry.find('div', {'class':'booking'})
    if booking_tag:
        # raw string so that \d is passed to the regex engine unchanged
        match = re.search(r'\d+', booking_tag.text)

        if match:
            bookings = match.group()

    # append the row at the next integer index
    dc_eats.loc[len(dc_eats)] = [name, cuisine, location, price, bookings]

In [48]:
# check out our work
dc_eats.head()

| | name | cuisine | location | price | bookings |
|---|---|---|---|---|---|
| 0 | Medium Rare - Cleveland Park | Steak | Cleveland Park | 2 | 31 |
| 1 | Maya Bistro | Mediterranean | Arlington | 2 | 6 |
| 2 | BlackSalt | Seafood | Palisades Northwest | 3 | 43 |
| 3 | Bistro Aracosia | Afghan | Palisades Northwest | 2 | 43 |
| 4 | El Centro D.F. - Georgetown | Contemporary Mexican | Georgetown | 2 | 54 |


Awesome! We succeeded.
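Appending with `dc_eats.loc[len(dc_eats)]` works, but another common pattern is to collect each row as a dictionary and build the table once at the end. A minimal sketch in plain Python, assuming a hypothetical `make_row` helper; the names, cuisines, locations, and bookings are the first rows shown above, and the `$` strings are stand-ins for the scraped price text:

```python
COLUMNS = ["name", "cuisine", "location", "price", "bookings"]


def make_row(name, cuisine, location, price_text, bookings):
    """Bundle one restaurant's fields into a dict keyed by column name."""
    return {
        "name": name,
        "cuisine": cuisine,
        "location": location,
        "price": price_text.count("$"),  # same '$'-counting idea as the loop above
        "bookings": bookings,
    }


rows = []
for fields in [
    ("Medium Rare - Cleveland Park", "Steak", "Cleveland Park", "$$", "31"),
    ("Maya Bistro", "Mediterranean", "Arlington", "$$", "6"),
    ("BlackSalt", "Seafood", "Palisades Northwest", "$$$", "43"),
]:
    rows.append(make_row(*fields))

print(rows[0])
```

With pandas imported as `pd`, `pd.DataFrame(rows, columns=COLUMNS)` would turn the list into a DataFrame in one step, which avoids growing the DataFrame one row at a time.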

<a id="selenium-typing"></a>
### Auto-typing using Selenium

Now, let's explore some of the other functionality of a webdriver. We've barely scratched the surface.

In [50]:
# we can send keys as well
# import
from selenium.webdriver.common.keys import Keys

In [52]:
# open Chrome (this cell uses ChromeDriver, not GeckoDriver)
driver = webdriver.Chrome(executable_path="./practice/chromedriver/chromedriver")
# visit Python
driver.get("http://www.python.org")
# verify we're in the right place
assert "Python" in driver.title

Let's try automatically typing `pycon` in the search box and hitting the return key:

In [53]:
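# find the search box (the element whose name attribute is "q")
# note: Selenium 4 deprecates find_element_by_name in favor of find_element(By.NAME, "q")
elem = driver.find_element_by_name("q")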

# clear it
elem.clear()

# type in pycon
elem.send_keys("pycon")

# press the return key to submit the search
elem.send_keys(Keys.RETURN)

In [None]:
# close
driver.close()

In [55]:
# all at once:
driver = webdriver.Chrome(executable_path="./practice/chromedriver/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title

elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
#assert "No results found." not in driver.page_source
#driver.close()

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

It is especially worth exploring functionality such as locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ: http://selenium-python.readthedocs.io/faq.html

<a id="summary"></a>
### Summary

In this lesson, we used the Beautiful Soup library to locate elements on a web page and scrape their text. We also used Selenium to drive a browser so that the page's JavaScript runs before we retrieve its contents.