# Scraping the Unscrapable

### What happens if I try to parse my Gmail with `requests` and `BeautifulSoup`?

In [2]:
import requests
from bs4 import BeautifulSoup

gmail_url = "https://mail.google.com"

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(requests.get(gmail_url).text, "html.parser")

print(soup.prettify())

<html>
 <head>
  <meta content="0;URL=https://mail.google.com/mail/" http-equiv="Refresh"/>
 </head>
 <body>
  <script language="javascript" type="text/javascript">
   <!--
location.replace("https://mail.google.com/mail/")
-->
  </script>
 </body>
</html>


Well, this is a tiny page. All it does is redirect us, using a meta refresh tag and a line of JavaScript, neither of which `requests` follows. Soupifying this is useless, of course. Luckily, in this case we can see where we are being sent. In many cases you won't be so lucky: the page contents are rendered by JavaScript running in a browser, so just fetching the source won't help you.
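
If you wanted to pull that redirect target out programmatically instead of reading it by eye, a rough sketch could look like the one below. It assumes Python 3 and uses only the standard library's `html.parser` rather than BeautifulSoup; the names `MetaRefreshFinder` and `find_meta_refresh` are made up for this illustration, and the sample page is a trimmed copy of the output above.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0;URL=https://example.com/"
        _, _, rest = (attrs.get("content") or "").partition(";")
        key, _, url = rest.strip().partition("=")
        if key.strip().lower() == "url" and url:
            self.target = url.strip()


def find_meta_refresh(html):
    """Return the meta refresh target in html, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target


page = """<html><head>
<meta content="0;URL=https://mail.google.com/mail/" http-equiv="Refresh"/>
</head><body></body></html>"""

print(find_meta_refresh(page))
```

This only covers the meta refresh case. A redirect done purely in JavaScript would still need a real browser.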

Anyway, let's follow the redirection for now.

In [3]:
new_url = "https://mail.google.com/mail"

soup = BeautifulSoup(requests.get(new_url).text, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v13/DXI1ORHCpsQm3Vp6mXoaTYnF5uFdDttMLvmWuJdhhgs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v13/cJZKeOuBrn4kERxqtaUH3aCWcynf_cDxXwCLxiixG1c.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fon

In [4]:
print(soup.find(id='Email'))

<input class="" id="Email" name="Email" placeholder="Email" spellcheck="false" type="email" value=""/>


We have hit the login page. We can't get to the emails without logging in. Again, reading the HTML source is useless, because it is only the source for the login page.

### Open Firefox, go to Gmail, log in to an account, compose an email, send it

In [6]:
# pip install selenium

# Documentation on finding elements:
# http://selenium-python.readthedocs.org/en/latest/locating-elements.html
# Xpath tutorial:
# http://www.w3schools.com/xpath/xpath_syntax.asp

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

firefox = webdriver.Firefox()

gmail_url = "https://mail.google.com"

firefox.get(gmail_url)

# Alternatives to Firefox:
#  * Chrome (requires chromedriver)
#  * phantomjs

Fill out the username, hit Enter, then do the same with the password to log in

In [8]:
username_form = firefox.find_element_by_id("Email")
username_form.send_keys("gagaforga")

In [10]:
username_form.send_keys(Keys.RETURN)

In [12]:
password_form = firefox.find_element_by_id("Passwd")
password_form.send_keys("ohsosecret")

In [14]:
password_form.send_keys(Keys.RETURN)
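
Running these cells by hand leaves the browser time to load each page. Run as a script, the next lookup can fire before the element exists. Selenium ships its own explicit waits (`WebDriverWait`), but the idea can be sketched with nothing more than the `time` module imported earlier. The helper name `wait_for` and the stand-in `flaky_lookup` below are made up for this illustration.

```python
import time


def wait_for(action, timeout=10.0, poll=0.5):
    """Call action() until it stops raising; re-raise once timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except Exception:
            # In real use you would catch only the "element not found" error.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)


# Stand-in for a lookup that fails until the page has finished loading.
attempts = []


def flaky_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError("element not there yet")
    return "found"


print(wait_for(flaky_lookup, timeout=2.0, poll=0.01))
```

With a live browser the call would look something like `wait_for(lambda: firefox.find_element_by_id("Passwd"))`.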

Click the compose button to start a new email draft

In [21]:
compose_button = firefox.find_element_by_xpath('//div[text()="COMPOSE"]')
compose_button.click()

Write a nice, friendly message to your favorite person

In [22]:
to_field = firefox.find_element_by_name("to")
to_field.send_keys("aaron@thisismetis.com")

In [24]:
subject = firefox.find_element_by_name("subjectbox")
subject.send_keys("Skynet imminent")

In [27]:
message_body = firefox.find_element_by_xpath("//div[@aria-label='Message Body']")
message_body.send_keys("Hello,")
message_body.send_keys([Keys.RETURN, Keys.RETURN])
message_body.send_keys("I am a computer and I just became self aware.")

Press the send button

In [28]:
send_button = firefox.find_element_by_xpath("//div[contains(@aria-label, 'Send')]")
send_button.click()

Get the hell out of Dodge

In [29]:
# quit() closes every window and also shuts down the driver process
firefox.quit()

### Scraping Box Office Mojo with Selenium

In [30]:
firefox = webdriver.Firefox()

matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"

firefox.get(matrix_url)

In [32]:
gross_selector = '//font[contains(text(), "Domestic Total")]/b'
print(firefox.find_element_by_xpath(gross_selector).text)

$171,479,930
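
The scraped value is still a string. To do any arithmetic with it, it has to become a number first. A minimal sketch, assuming the text always has the `$1,234,567` shape seen here (the helper name `parse_dollars` is made up for this illustration):

```python
def parse_dollars(text):
    """Turn a string such as "$171,479,930" into an integer number of dollars."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


domestic_gross = parse_dollars("$171,479,930")
print(domestic_gross)
```

`int()` raises a `ValueError` on anything unexpected (an empty string, "n/a"), which is usually what you want while scraping: a loud failure beats a silent wrong number.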


In [34]:
genre_selector = '//a[contains(@href, "/genres/chart/")]'
for genre_anchor in firefox.find_elements_by_xpath(genre_selector):
    print(genre_anchor.text)

Action - Wire-Fu
Man vs. Machine
Virtual Reality


#### Let's use the inflation adjuster tool on the page, and get the gross in 2000 dollars

In [35]:
inf_adjust_2000_selector = '//select[@name="ticketyr"]/option[@value="2000"]'
firefox.find_element_by_xpath(inf_adjust_2000_selector).click()

In [36]:
go_button = firefox.find_element_by_name("Go")
go_button.click()

Now the page has changed: it's showing inflation-adjusted numbers. We can grab the new, adjusted number.

In [37]:
gross_selector = '//font[contains(text(), "Domestic Total")]/b'
print(firefox.find_element_by_xpath(gross_selector).text)

$181,944,300
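
As a quick sanity check, the two scraped totals can be compared to see how large the adjustment is. The sketch below just hard-codes the two numbers printed in this notebook; the variable names are made up for this illustration.

```python
nominal_gross = 171479930   # domestic total as first scraped
adjusted_gross = 181944300  # domestic total after choosing 2000 in the adjuster

# float() keeps this a true division on Python 2 as well
factor = float(adjusted_gross) / nominal_gross

print("adjustment factor: %.4f" % factor)
print("increase: %.1f%%" % ((factor - 1) * 100))
```

The dropdown is named `ticketyr`, which suggests the site scales by ticket prices for the chosen year, but that is a guess from the field name rather than anything the page source confirms here.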


In [38]:
# quit() closes every window and also shuts down the driver process
firefox.quit()