<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping with Selenium

---


## Enter Selenium

---

Selenium is a browser automation tool: it drives a real browser, optionally in headless mode (with no visible window). That means it enables us to mimic human browsing behavior, including waiting for JavaScript elements to load.

If you do not already have Selenium installed, you can install it via pip: `pip install selenium`

In [3]:
# import
from selenium import webdriver

Selenium requires us to choose a browser to drive. I'm going to opt for Chrome (via chromedriver), but Firefox is also a very common choice. See the FAQ: http://selenium-python.readthedocs.io/faq.html
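
As a rough sketch of what choosing a browser can look like in code, the helper below builds a Chrome driver in the Selenium 4 style and can optionally run it headless. The name `make_chrome_driver` and its parameters are illustrative assumptions, not part of this notebook, and the sketch assumes a Selenium 4 installation.

```python
def make_chrome_driver(driver_path, headless=False):
    """Build a Chrome driver (hypothetical helper; assumes Selenium 4 is installed)."""
    # Imported inside the function so that defining the helper
    # does not itself require Selenium to be importable.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    if headless:
        # Run Chrome without opening a visible window.
        options.add_argument("--headless")
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

Calling `make_chrome_driver("./chromedriver/chromedriver")` should open a visible window, while passing `headless=True` should keep the browser in the background.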

### 1. What is going to happen when I run the next cell?

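The chromedriver has been provided in the 'chromedriver' folder, so there is no need to download another. (The `executable_path` argument used in this notebook is the Selenium 3 style; Selenium 4 deprecated it in favor of `webdriver.Chrome(service=Service(path))`, so adjust the call if your installed version rejects it.)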

In [4]:
# create a driver called driver
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

Pretty crazy, right? In case you're wondering, this should have opened a new browser window. Check all of your desktop displays if you didn't see it pop up automatically.

Let's close that driver.

In [5]:
# close it
driver.close()
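
Note that `close()` closes only the current window, whereas `quit()` ends the whole browser session. If a scraping cell raises an error before the driver is shut down, the browser stays open. One possible way to guard against that is a small context manager; `managed_driver` and `FakeDriver` below are hypothetical names, and the fake driver exists only so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield the driver returned by `make_driver()` and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the session even if the body of the `with` block raised
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, used only for demonstration."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("simulated scraping error")
except RuntimeError:
    pass

print(fake.quit_called)  # True
```

With a real browser, the factory would be something like `lambda: webdriver.Chrome(...)`.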

### 2. Use the driver to visit `www.python.org`

In [6]:
# A:
# visit www.python.org
# (same driver path as the other cells; on Windows the file is named chromedriver.exe)
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get('https://www.python.org/')

#close
driver.close()

### 3. Visit the OpenTable page using the driver

Let's return to the problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the HTML to load.

In the next cell, prove you can programmatically visit the page.

In [11]:
# A:
# visit our OpenTable page.
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get('http://www.opentable.com/washington-dc-restaurant-listings')

In [12]:
driver.close()

### 4. Resolve the JavaScript issue using the driver and find the bookings.

What we can do in this case is:
1. Request that the page load
2. Wait one second
3. Grab the source HTML from the page

Because Selenium drives a real browser, the page's JavaScript runs just as it would for a human visitor, and the rendered elements become part of the page source. I can then grab the page source.

**Once you have the HTML with the JavaScript rendered, repeat the process above to find the bookings.**

In [13]:
# import sleep
from time import sleep

In [14]:
# A:
# visiting our relevant page.
driver = webdriver.Chrome(executable_path = './chromedriver/chromedriver')
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# waiting one second.
sleep(1)

#getting the html from the page source
html = driver.page_source
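
A fixed one-second sleep can be too short on a slow connection and wastes time on a fast one. A hedged alternative is to poll the page source until the content we need shows up. The helper `wait_for_source` below is a hypothetical sketch, not a Selenium function, and the simulated snapshots stand in for a page whose bookings render late.

```python
import time


def wait_for_source(get_source, marker, timeout=10.0, poll=0.5):
    """Poll `get_source()` until its result contains `marker`, then return it.

    Raises TimeoutError if the marker has not appeared after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} seconds")
        time.sleep(poll)


# Simulated page that only "renders" its bookings on the third read.
snapshots = iter([
    "<html></html>",
    "<html></html>",
    '<html><div class="booking">Booked 416 times today</div></html>',
])
html_text = wait_for_source(lambda: next(snapshots), 'class="booking"',
                            timeout=5, poll=0.01)
print('class="booking"' in html_text)  # True
```

With a live driver, the first argument would be `lambda: driver.page_source`. Selenium also ships its own explicit-wait mechanism, `WebDriverWait`, which serves the same purpose.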

### 5. Can we get all of the items we want from the page in a single `find_all`?

To be efficient, we want a single loop over the entries on the page. That means finding the element that houses all of the other elements (name, location, price, bookings) for each entry. Where on the page is each entry located?

In [15]:
#import 
from bs4 import BeautifulSoup

In [16]:
# A:
html = BeautifulSoup(html, "lxml")

In [19]:
for entry in html.find_all('div', {'class':'booking'}):
    print(entry)

<div class="booking"><span class="tadpole"></span>Booked 416 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 138 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 243 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 55 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 116 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 77 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 98 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 26 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 54 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 55 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 40 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 39 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 59 

In [20]:
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

Booked 416 times today
Booked 138 times today
Booked 243 times today
Booked 55 times today
Booked 116 times today
Booked 77 times today
Booked 98 times today
Booked 26 times today
Booked 54 times today
Booked 55 times today
Booked 40 times today
Booked 39 times today
Booked 59 times today
Booked 40 times today
Booked 61 times today
Booked 32 times today
Booked 41 times today
Booked 42 times today
Booked 45 times today
Booked 90 times today
Booked 55 times today
Booked 73 times today
Booked 44 times today
Booked 70 times today
Booked 40 times today
Booked 26 times today
Booked 28 times today
Booked 54 times today
Booked 82 times today
Booked 96 times today
Booked 66 times today
Booked 7 times today
Booked 75 times today
Booked 12 times today
Booked 26 times today
Booked 14 times today
Booked 86 times today
Booked 40 times today
Booked 70 times today
Booked 45 times today
Booked 17 times today
Booked 11 times today
Booked 84 times today
Booked 49 times today
Booked 74 times today
Booked 

### 6. Does every single entry have each element we want?

In [21]:
# A:
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(entry.find('div', {'class':'booking'}))

None
None
<div class="booking"><span class="tadpole"></span>Booked 416 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 138 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 243 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 55 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 116 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 77 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 98 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 26 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 54 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 55 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 40 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 39 times today</div>
<div class="booking"><span class="tadpole"></span>

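### 7. Use Python exceptions to handle cases when bookings aren't found.

When a booking is not found, store `'ZERO'`. The answer cell below uses a `None` check rather than a `try`/`except`: `find` returns `None` when nothing matches, so testing the result avoids the `AttributeError` that `.text` would otherwise raise.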

In [22]:
# A:
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    booking_tag = entry.find('div', {'class':'booking'})
    
    if booking_tag:
        print(booking_tag.text)
    else:
        print('ZERO')

ZERO
ZERO
Booked 416 times today
Booked 138 times today
Booked 243 times today
Booked 55 times today
Booked 116 times today
Booked 77 times today
Booked 98 times today
Booked 26 times today
Booked 54 times today
Booked 55 times today
Booked 40 times today
Booked 39 times today
Booked 59 times today
Booked 40 times today
Booked 61 times today
Booked 32 times today
Booked 41 times today
Booked 42 times today
Booked 45 times today
Booked 90 times today
Booked 55 times today
Booked 73 times today
Booked 44 times today
Booked 70 times today
Booked 40 times today
Booked 26 times today
Booked 28 times today
Booked 54 times today
Booked 82 times today
Booked 96 times today
Booked 66 times today
Booked 7 times today
Booked 75 times today
Booked 12 times today
Booked 26 times today
Booked 14 times today
Booked 86 times today
Booked 40 times today
Booked 70 times today
Booked 45 times today
Booked 17 times today
Booked 11 times today
Booked 84 times today
Booked 49 times today
Booked 74 times tod
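
The same fallback can also be written with an exception, as the heading suggests. This is a minimal sketch: `booking_text` is a hypothetical helper, and `FakeEntry` is a stand-in that mimics only the `find` behavior of a parsed entry so the example runs without the scraped page.

```python
from types import SimpleNamespace


def booking_text(entry, default="ZERO"):
    """Return the booking text for one entry, or `default` if it has none."""
    try:
        # find() returns None when nothing matches, so .text raises AttributeError
        return entry.find("div", {"class": "booking"}).text
    except AttributeError:
        return default


class FakeEntry:
    """Hypothetical stand-in for a parsed listing entry."""

    def __init__(self, booking=None):
        self._booking = booking

    def find(self, name, attrs=None):
        if self._booking is None:
            return None
        return SimpleNamespace(text=self._booking)


entries = [FakeEntry(), FakeEntry("Booked 416 times today")]
results = [booking_text(e) for e in entries]
print(results)  # ['ZERO', 'Booked 416 times today']
```

With the real page, `booking_text(entry)` would be called inside the `find_all` loop in place of the `if`/`else`.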

### 8. Putting it all together in a dataframe.

**Loop through each entry:**
1. For each entry, grab the relevant information we want (name, location, price, bookings).
2. Produce a DataFrame with the columns "name", "location", "price", "bookings" that contains the 100 entries we would like.

In [23]:
# A:
import pandas as pd
import re

In [24]:
dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

In [25]:
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # bookings: default to 'NA' when the entry has no booking tag or no number
    booking_tag = entry.find('div', {'class':'booking'})
    bookings = 'NA'
    if booking_tag:
        match = re.search(r'\d+', booking_tag.text)
        if match:
            bookings = match.group()
    # name, location and price (price = number of '$' signs shown)
    name = entry.find('span', {'class':'rest-row-name-text'}).text
    location = entry.find('span', {'class':'rest-row-meta--location rest-row-meta-text'}).text
    price = entry.find('div', {'class':'rest-row-pricing'}).find('i').text.count('$')

    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    result = {'price': price, 'location': location, 'name': name, 'bookings': bookings}
    dc_eats = pd.concat([dc_eats, pd.DataFrame([result])], ignore_index=True)
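
The loop above stores the booking count as the string matched by the regular expression. If numeric sorting or arithmetic is needed later, the count could be converted while parsing. The helper name `parse_bookings` is an assumption for illustration, and returning `None` for a missing count is one possible convention rather than the notebook's.

```python
import re


def parse_bookings(text):
    """Return the first integer found in `text`, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_bookings("Booked 416 times today"))  # 416
print(parse_bookings("No bookings yet"))         # None
```

The raw string `r"\d+"` keeps Python from treating `\d` as a string escape before the pattern reaches the regex engine.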

In [26]:
dc_eats.head()

| | name | location | price | bookings |
|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | |
| 1 | Joe's Place Pizza and Pasta | Arlington | 2 | |
| 2 | Founding Farmers - DC | Foggy Bottom | 2 | 416.0 |
| 3 | Filomena Ristorante | Georgetown | 3 | 138.0 |
| 4 | Farmers Fishers Bakers | Georgetown | 2 | 243.0 |


### 9. [Bonus] Sending keys over the driver.

We can send keys to the page using the driver. Below is a demonstration of how to search the page using the Selenium driver.

In [27]:
# we can send keys as well
from selenium.webdriver.common.keys import Keys

In [28]:
# open the driver
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

# visit Python
driver.get("http://www.python.org")

# verify we're in the right place
assert "Python" in driver.title

In [29]:
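# find the search box by its name attribute
# (find_element_by_name was removed in Selenium 4.3; find_element(By.NAME, ...) works in 3 and 4)
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, "q")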
# clear it
elem.clear()
# type in pycon
elem.send_keys("pycon")


In [30]:
# send those keys
elem.send_keys(Keys.RETURN)

# confirm the search returned results
assert "No results found." not in driver.page_source

In [31]:
driver.close()

In [32]:
# all at once:
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title
# find_element_by_name was removed in Selenium 4.3; this form works in 3 and 4
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

## Additional resources

---

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html