<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping OpenTable with Selenium: Guided Lab

_Authors: Joseph Nelson (DC)_

---

> *Note: this lab is intended to be instructor-guided.*


In today's codealong lab, we will build a scraper using urllib and BeautifulSoup. We will then remedy some of the pitfalls of automated scraping by using Selenium, a browser automation tool that can drive a real (or "headless") browser.

You will be scraping OpenTable's DC listings. We're interested in knowing the restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this page: http://www.opentable.com/washington-dc-restaurant-listings.

### 1. Inspect the elements of this page to ensure we can find each piece of information we're interested in.

### 2. Use `urllib` and `BeautifulSoup` to read the contents of the HTML.

In [40]:
from bs4 import BeautifulSoup as bs
import urllib

In [103]:
# set the url we want to visit
url = "https://www.bloomberg.com/asia"

# A:
contents = urllib.urlopen(url).read()
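
The cell above uses the Python 2 form `urllib.urlopen`. In Python 3 that function lives in `urllib.request`, so a rough equivalent might look like the sketch below. The helper name `fetch_html` and its `timeout` default are assumptions for illustration, not part of the lab.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=30):
    """Return the raw bytes served at url (Python 3)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Example usage (needs a network connection):
# contents = fetch_html("http://www.opentable.com/washington-dc-restaurant-listings")
# print(len(contents))
```

Note that in Python 3 `read()` returns `bytes`, which BeautifulSoup accepts directly.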

In [104]:
len(contents)

553139

### 3. Print out the HTML (only print a fraction of it). What is in it?

In [43]:
# A:
contents[:100000]

'<!DOCTYPE html>\n<html xmlns:og="http://ogp.me/ns#" data-view-uid="0"><head>\n<base href=\'https://www.bloomberg.com/\'> <meta charset="utf-8"> <title>Bloomberg -  Asia Edition</title> <meta http-equiv="X-UA-Compatible" content="IE=11,10,9"> <script type="text/javascript">!function(e,t){function u(){return t.getElementById("bb-nav")}function M(){return document.querySelector(".bb-unsupported-message__text")}function L(){var e=document.getElementById("bb_unsupported_custom_message");if(e)return JSON.parse(e.innerHTML).message}function o(){var o=\'body.bb-unsupported-browser .bb-nav-placeholder{height:auto}body.bb-unsupported-browser #bb-that,body.bb-unsupported-browser .bb-nav-root{display:none;height:auto}body.bb-unsupported-browser .bb-nav-root{display:block}.bb-unsupported-message{background-color:#000;padding:20px 0}@media screen and (max-width: 63.6875rem){.bb-unsupported-message__logo-box{padding-bottom:10px}.bb-unsupported-message__copy{margin:0}}@media screen and (min-width: 63

### 4. Use Beautiful Soup to convert the raw HTML into a soup object.

In [44]:
# A:
html_obj = bs(contents, 'html.parser', from_encoding='utf-8')

### 5. Extract the name of each restaurant.

Let's first find each restaurant name listed on the page we've loaded. How do we locate the restaurant name within the page?

> *Hint: we need to know where in the **html** the restaurant element is housed.*

**5.A See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.**

In [45]:
# A:
print(html_obj.find_all('a', {'class': 'hero-v6-story__headline-link'}))

[<a class="hero-v6-story__headline-link" data-tracker-action="click" data-tracker-label="headline" href="/news/articles/2017-12-22/un-imposes-new-sanctions-targeting-north-korean-oil-imports"> North Korea Sanctions Tightened by UN Following Missile Test </a>, <a class="hero-v6-story__headline-link" data-tracker-action="click" data-tracker-label="headline" href="/news/articles/2017-12-22/novogratz-shelves-hedge-fund-sees-bitcoin-dropping-to-8-000"> Novogratz Halts Hedge Fund, Says Bitcoin May Drop to $8,000 </a>, <a class="hero-v6-story__headline-link" data-tracker-action="click" data-tracker-label="headline" href="/news/articles/2017-12-22/apple-sanctioned-in-qualcomm-ftc-case-for-withholding-documents"> Apple Sanctioned in Qualcomm FTC Case Over Withheld Evidence </a>]


**5.B Create a list of _only_ the restaurant names (no tags).**


In [46]:
# A:
a_names = []
# for each element found, store the article title with the tags and whitespace stripped
for article in html_obj.find_all('a', {'class': 'hero-v6-story__headline-link'}):
    title = article.renderContents()
    a_names.append(title.strip())
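
As an alternative sketch (not the approach used in the cell above), `get_text(strip=True)` returns only the text inside a tag and decodes HTML entities such as `&amp;`, which `renderContents()` leaves as raw markup. The HTML snippet below is a hand-written stand-in for the downloaded page, and the code assumes Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for the downloaded page.
sample_html = """
<a class="hero-v6-story__headline-link" href="/a"> Goldman Is Setting Up a Trading Desk </a>
<a class="hero-v6-story__headline-link" href="/b"> iPhone, iPad &amp; Mac Apps </a>
<a class="other-link" href="/c"> Not a headline </a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) drops the tags, trims whitespace, and decodes entities.
titles = [
    link.get_text(strip=True)
    for link in soup.find_all("a", {"class": "hero-v6-story__headline-link"})
]
print(titles)
```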

### 6. Repeat this process but for location.

For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [47]:
# A:
for article in html_obj.find_all('a', {'class': 'highlights-v6-story__headline-link'}):
    title = article.renderContents()
    a_names.append(title.strip())

In [48]:
a_names

['North Korea Sanctions Tightened by UN Following Missile Test',
 'Novogratz Halts Hedge Fund, Says Bitcoin May Drop to $8,000',
 'Apple Sanctioned in Qualcomm FTC Case Over Withheld Evidence',
 'Trump Travel Ban Dealt Blow by San Francisco Appeals Court',
 'Even With Trump Minerals Order, U.S. Miners Face Uphill Battle With China',
 'MLB Teams Up With Bejing Enterprises to Build Baseball Academies',
 'Another Multimillion-Dollar da Vinci Is Hiding in Plain Sight',
 'Goldman Is Setting Up a Cryptocurrency Trading Desk',
 'Hauling a Christmas Tree in a Ferrari Is Totally Normal',
 'The Best Wines I Tasted in 2017',
 'The 16 Very Best Dishes I Ate in 2017',
 'Bitcoin Lost Almost 20% of Its Value This Week',
 'Apple Plans Combined iPhone, iPad &amp; Mac Apps to Create One User Experience']

### 7. Get the price for each restaurant.

The price is the number of dollar signs, on a scale of one to four, for each restaurant. We'll follow the same process.

In [82]:
# A:
# find the market-summary iframe(s) on the homepage
asset_classes = html_obj.find_all('iframe', {'class': 'market-summary-v3'})

In [85]:
asset_classes

[<iframe class="market-summary-v3" data-view-uid="1|0_2_1_1_1" src="//www.bloomberg.com/markets/components/data-drawer?linksType=nav"></iframe>]

**7.B Convert the dollar sign strings to a count of the number of dollar signs.**

Can you figure out a way to simply print out the number of dollar signs per restaurant listed?

In [51]:
# A:
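
One possible approach, sketched on made-up price strings rather than scraped data, is to count the `$` characters in each string with `str.count`:

```python
# Hypothetical price strings, as they might look after extracting the tag text.
price_strings = ["$$", "$$$$", " $ ", "$$$"]

# str.count tallies the dollar signs and ignores any surrounding whitespace.
price_counts = [price.count("$") for price in price_strings]
print(price_counts)
```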

### 8. Can you find the number of times a restaurant was booked?

In the next cell, print out a sample of objects that contain the number of times the restaurant was booked.

> *Note: if you can't, why do you think this is happening?*

In [52]:
# A:

## Enter Selenium

---

Selenium is a browser automation tool: it drives a real browser (optionally in headless mode, with no visible window). That lets us mimic human browsing behavior, including waiting for JavaScript elements to load.

If you do not already have Selenium installed, you can install it with pip: `pip install selenium`

In [92]:
# import
from selenium import webdriver

Selenium requires us to choose a browser to drive. This lab uses Chrome (via chromedriver); Firefox is another very common choice. See the FAQ: http://selenium-python.readthedocs.io/faq.html

### 9. What is going to happen when I run the next cell?

The chromedriver has been provided in the 'chromedriver' folder, so there is no need to download another.

In [93]:
# create a driver called driver
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

Pretty crazy, right? In case you're wondering, this should have opened a new browser window. Check all of your desktop displays if you didn't see it pop up automatically.

Let's close that driver.

In [94]:
# close it
driver.close()

### 10. Use the driver to visit `www.python.org`

In [95]:
# A:
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get("https://www.bloomberg.com/asia")

### 11. Visit the OpenTable page using the driver

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. 

In the next cell, prove you can programmatically visit the page.

In [98]:
# A:
assert 'Bloomberg - Asia Edition' in driver.title

### 12. Resolve the JavaScript issue using the driver and find the bookings.

What we can do in this case is:
1. Request that the page load
2. Wait for the JavaScript to run (one second may be enough; slower pages need longer)
3. Grab the source HTML from the page

Because the page is loaded in a real browser, its JavaScript runs and the rendered elements become part of the page source. I can then grab that page source.

**Once you have the HTML with the JavaScript rendered, repeat the processes above to find the bookings.**
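
The three steps above can be sketched as a small helper. The name `get_rendered_source` and its `wait_seconds` default are assumptions for illustration; the function only relies on the driver having a `get` method and a `page_source` attribute, which Selenium drivers provide.

```python
from time import sleep


def get_rendered_source(driver, url, wait_seconds=1):
    """Load url in the browser, wait, then return the rendered HTML."""
    driver.get(url)            # 1. request that the page load
    sleep(wait_seconds)        # 2. give the JavaScript time to run
    return driver.page_source  # 3. grab the rendered source


# Example usage with a real browser (requires Selenium and the chromedriver binary):
# from selenium import webdriver
# driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
# html = get_rendered_source(driver, "http://www.opentable.com/washington-dc-restaurant-listings")
# driver.close()
```

A fixed sleep is the simplest option. Selenium also offers explicit waits (`WebDriverWait`) that return as soon as a chosen element appears, which may be worth exploring if the fixed delay proves too short or too slow.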

In [99]:
# import sleep
from time import sleep

In [110]:
# A:
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get("https://www.bloomberg.com/asia")
sleep(20)
blmberg_content = driver.page_source

In [111]:
# the Selenium page source is longer than the urllib response (553139 characters)
len(blmberg_content)

676538

In [112]:
# convert to beautiful soup first
blmberg_soup = bs(blmberg_content, 'lxml')  # BeautifulSoup was imported under the alias `bs`

In [113]:
# get asset class summaries on page
asset_classes = blmberg_soup.find_all('iframe', {'class':'market-summary-v3'})

In [114]:
asset_classes

[<iframe class="market-summary-v3" data-view-uid="1|0_2_1_1_1" src="//www.bloomberg.com/markets/components/data-drawer?linksType=nav"></iframe>]

### 13. Can we get all of the items we want from the page in a single `find_all`?

To be most efficient, we want to do only a single loop over the entries on the page. That means we need to find the element that houses all of the other elements (name, location, price, bookings). Where on the page is each entry located?

In [60]:
# A:
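
A minimal sketch of the idea, using hand-written HTML: one `find_all` locates the container for each entry, and the individual fields are then looked up inside that container. The tag and class names (`result`, `name`, `location`, `price`, `booking`) are placeholders, not OpenTable's real markup, so they would need to be replaced with whatever the element inspector shows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are placeholders, not OpenTable's real ones.
sample_html = """
<div class="result">
  <span class="name">Cafe One</span>
  <span class="location">Penn Quarter</span>
  <span class="price">$$</span>
  <span class="booking">Booked 12 times today</span>
</div>
<div class="result">
  <span class="name">Bistro Two</span>
  <span class="location">Dupont Circle</span>
  <span class="price">$$$</span>
  <span class="booking">Booked 3 times today</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
# One find_all for the container, then look inside each entry for its fields.
for entry in soup.find_all("div", {"class": "result"}):
    rows.append({
        "name": entry.find("span", {"class": "name"}).get_text(strip=True),
        "location": entry.find("span", {"class": "location"}).get_text(strip=True),
        "price": entry.find("span", {"class": "price"}).get_text(strip=True),
        "bookings": entry.find("span", {"class": "booking"}).get_text(strip=True),
    })

print(rows)
```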

### 14. Does every single entry have each element we want?

In [61]:
# A:

### 15. Use Python exceptions to handle cases when bookings aren't found.

When a booking is not found, store `'ZERO'`.

In [62]:
# A:
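
One way this could look, sketched on hand-written HTML with placeholder class names: `find` returns `None` when a tag is missing, so calling `.get_text` on the result raises `AttributeError`, which we catch and replace with `'ZERO'`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second entry has no booking tag.
sample_html = """
<div class="result"><span class="name">Cafe One</span><span class="booking">Booked 12 times today</span></div>
<div class="result"><span class="name">Bistro Two</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

bookings = []
for entry in soup.find_all("div", {"class": "result"}):
    try:
        # find() returns None when the tag is missing, so .get_text raises AttributeError.
        booking = entry.find("span", {"class": "booking"}).get_text(strip=True)
    except AttributeError:
        booking = "ZERO"
    bookings.append(booking)

print(bookings)
```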

### 16. Putting it all together in a dataframe.

**Loop through each entry. For each entry:**
1. Grab the relevant information we want (name, location, price, bookings). 
2. Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [63]:
# A:
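
A rough sketch of the full loop, again on hand-written HTML with placeholder class names rather than OpenTable's real markup. The helper `field_text` is a hypothetical name introduced here; the price is stored as a count of dollar signs, and a missing booking is stored as `'ZERO'`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are placeholders, not OpenTable's real ones.
sample_html = """
<div class="result">
  <span class="name">Cafe One</span><span class="location">Penn Quarter</span>
  <span class="price">$$</span><span class="booking">Booked 12 times today</span>
</div>
<div class="result">
  <span class="name">Bistro Two</span><span class="location">Dupont Circle</span>
  <span class="price">$$$</span>
</div>
"""


def field_text(entry, css_class):
    """Return the stripped text of the first matching span inside entry."""
    return entry.find("span", {"class": css_class}).get_text(strip=True)


soup = BeautifulSoup(sample_html, "html.parser")

records = []
for entry in soup.find_all("div", {"class": "result"}):
    try:
        bookings = field_text(entry, "booking")
    except AttributeError:  # find() returned None: no booking tag in this entry
        bookings = "ZERO"
    records.append({
        "name": field_text(entry, "name"),
        "location": field_text(entry, "location"),
        "price": field_text(entry, "price").count("$"),
        "bookings": bookings,
    })

df = pd.DataFrame(records, columns=["name", "location", "price", "bookings"])
print(df)
```

On the real page, the same loop would run over the rendered source from Selenium instead of `sample_html`.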

### 17. [Bonus] Sending keys over the driver.

We can send keys to the page using the driver. Below is a demonstration of how to search the page using the Selenium driver.

In [64]:
# we can send keys as well
# from selenium.webdriver.common.keys import Keys

In [65]:
# # open the driver
# driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
# # visit Python
# driver.get("http://www.python.org")
# # verify we're in the right place
# assert "Python" in driver.title

In [66]:
# # find the search position
# elem = driver.find_element_by_name("q")
# # clear it
# elem.clear()
# # type in pycon
# elem.send_keys("pycon")


In [67]:
# # send those keys
# elem.send_keys(Keys.RETURN)
# # no results
# assert "No results found." not in driver.page_source

In [68]:
# driver.close()

In [69]:
# # all at once:
# driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
# driver.get("http://www.python.org")
# assert "Python" in driver.title
# elem = driver.find_element_by_name("q")
# elem.clear()
# elem.send_keys("pycon")
# elem.send_keys(Keys.RETURN)
# assert "No results found." not in driver.page_source
# driver.close()

## Additional resources

---

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html