# Collecting Digital Trace Data: Web Scraping / APIs
June 19th, 2018 - Javier Garcia-Bernardo & Allie Morgan

In [41]:
## Requirements
import requests               # Simple HTTP operations (GET and POST)
import selenium               # Loads dynamic (JavaScript) pages
import json                   # Parsing the responses from APIs
import re                     # Regular expressions (pattern matching on strings)
from bs4 import BeautifulSoup # Parsing HTML

## Web Scraping

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) is a method for extracting data from the web. Techniques range from fully manual (“human copy-paste”) to fully automated (using computer vision). Somewhere in the middle is HTML parsing, which we will describe here.

Web scraping using [HTML parsing](https://en.wikipedia.org/wiki/Web_scraping#HTML_parsing) is often used on webpages which share similar HTML structure. For example, you might want to scrape the ingredients from chocolate chip cookie recipes to identify correlations between ingredients and five-star worthy cookies, predict who will win March Madness by looking at game play-by-plays, or list all the local pets up for adoption.

### Static Webpages

In [42]:
pet_pages = ["https://www.boulderhumane.org/animals/adoption/dogs", 
             "https://www.boulderhumane.org/animals/adoption/cats", 
             "https://www.boulderhumane.org/animals/adoption/adopt_other"]

r = requests.get(pet_pages[0])
html = r.text
print(html[:500]) # Print the first 500 characters of the HTML

<!DOCTYPE html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.boulderhumane.org/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<meta name="viewport" content="width=1000px, initial-scale=1.0, maximum-scale=1.0" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
<link type="text/css" rel="stylesheet


When you visit a webpage, your web browser renders an HTML document with CSS and JavaScript to produce a visually appealing page. (See the HTML above.) [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for parsing HTML. We'll use it to extract all of the names, ages, and breeds of the [dogs](https://www.boulderhumane.org/animals/adoption/dogs), [cats](https://www.boulderhumane.org/animals/adoption/cats), and [small animals](https://www.boulderhumane.org/animals/adoption/adopt_other) currently up for adoption at the Boulder Humane Society.

In [43]:
soup = BeautifulSoup(html, 'html.parser')

Note that the feature of these pages we are exploiting is their repeated HTML structure. Every animal listed has a variant of the following HTML:
```{html}
<div class="views-row ... ">
  ...
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content">
      <a href="/animals/adoption/" title="Adopt Me!">Romeo</a>
    </div>
  </div>
  <div class="views-field views-field-field-pp-primarybreed">
    <div class="field-content">New Zealand</div>
  </div>
  <div class="views-field views-field-field-pp-secondarybreed">
    <div class="field-content">Rabbit</div>
  </div>
  <div class="views-field views-field-field-pp-age">
    ...
    <span class="field-content">0 years 2 months</span>
  </div>
  <div class="views-field views-field-field-pp-gender">
    ...
    <span class="field-content">Male</span>
  </div>
  ...
</div>
``` 
So to get at the HTML object for each pet, we can run the following:

In [48]:
pets = soup.find_all('div', {'class': re.compile('.*views-row.*')})

That is, find all of the `div` tags with the `class` attribute which contains the string `views-row`. 
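To see the class matching in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The sample markup, the second pet name, and the extra class names (`views-row-1`, `views-row-odd`, and so on) are made up for illustration; only the `views-row` prefix and the field class come from the structure shown above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample that mirrors the structure shown above.
sample_html = """
<div class="views-row views-row-1 views-row-odd">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/" title="Adopt Me!">Romeo</a></div>
  </div>
</div>
<div class="views-row views-row-2 views-row-even">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/" title="Adopt Me!">Juliet</a></div>
  </div>
</div>
<div class="footer">Not a pet</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Regular-expression match on the class attribute, as in the cell above.
by_regex = sample_soup.find_all("div", {"class": re.compile(".*views-row.*")})

# Plain-string match: selects tags that have "views-row" as one of their classes.
by_token = sample_soup.find_all("div", class_="views-row")

print(len(by_regex), len(by_token))
print([row.find("a").get_text(strip=True) for row in by_regex])
```

On this sample both searches select the same two rows. The regular expression is the looser of the two, since it would also match a tag whose only class was, say, `views-row-1`.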

Next, to grab the name, breeds, and ages of these pets, we’ll look up the children of each pet HTML object. For example:

In [49]:
head = "views-field views-field-field-pp-"
for pet in pets:
    name = pet.find('div', {'class': head + 'animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': head + 'primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': head + 'secondarybreed'}).get_text(strip=True)
    age = pet.find('div', {'class': head + 'age'}).get_text(strip=True)
    print(name, primary_breed, secondary_breed, age)

(u'Audi', u'Belgian Malinois', u'', u'Age:5 years 7 months')
(u'Kobe', u'Cane Corso', u'', u'Age:1 year 8 months')
(u'Roxy', u'Terrier, American Pit Bull', u'Mix', u'Age:1 year 6 months')
(u'Bear', u'Retriever, Chesapeake Bay', u'Mix', u'Age:2 years 5 months')
(u'Teddy', u'Terrier, Jack Russell', u'Mix', u'Age:6 years 0 months')
(u'Chloe', u'Akita', u'Mix', u'Age:5 years 0 months')
(u'Drew', u'Retriever, Labrador', u'Retriever, Golden', u'Age:2 years 0 months')
(u'Harley', u'Terrier, American Pit Bull', u'Mix', u'Age:2 years 0 months')
(u'Sadie', u'Great Dane', u'Retriever, Labrador', u'Age:8 years 10 months')
(u'Megrita', u'Dutch Shepherd', u'Mix', u'Age:5 years 0 months')
(u'Molly', u'Mastiff', u'Rottweiler', u'Age:8 years 0 months')
(u'Butterbean', u'Terrier, Jack Russell', u'Mix', u'Age:7 years 0 months')
(u'Rowdy', u'Rottweiler', u'Mix', u'Age:6 years 0 months')
(u'Marco', u'Spaniel, American Cocker', u'', u'Age:5 years 0 months')
(u'Eva', u'Boxer', u'Mix', u'Age:6 years 0 months')

where each call to `find` gets a child of a pet object, in particular the `div` whose `class` attribute looks like `views-field views-field-field-pp-*`. Feel free to rerun the code above with the cat or small animal pages (`pet_pages[1]` or `pet_pages[2]`) and see how the output changes.
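The scraped age comes back as free text such as `Age:5 years 7 months`, which is awkward to analyze. A minimal sketch of one way to turn it into a number, assuming every age follows the "N year(s) M month(s)" pattern seen in the output above; the function name `age_in_months` is hypothetical and not part of the original notebook:

```python
import re

# Matches strings such as "Age:5 years 7 months" or "Age:1 year 8 months".
AGE_PATTERN = re.compile(r"(\d+)\s+years?\s+(\d+)\s+months?")


def age_in_months(age_text):
    """Return the age in months, or None if the text does not match the pattern."""
    match = AGE_PATTERN.search(age_text)
    if match is None:
        return None
    years, months = (int(group) for group in match.groups())
    return 12 * years + months


for text in ["Age:5 years 7 months", "Age:1 year 8 months", "0 years 2 months"]:
    print(text, "->", age_in_months(text))
```

Returning `None` for unexpected text keeps one odd listing from crashing a longer scrape. Whether to skip or log such rows is a choice left to the analysis.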

### Dynamic (Javascript) Webpages

In the above example, the webpages we loaded required no [JavaScript](https://en.wikipedia.org/wiki/JavaScript). In other words, no input was required on the user's end to view the content of the page (e.g. a login). Let's try a more complicated example of web scraping where content is loaded dynamically.


Some characteristics of scraping with Selenium:
- It's slow
- It can handle JavaScript
- You get **HTML** code back
- It behaves like a person


Requirements (one of the following):
- Firefox + geckodriver (https://github.com/mozilla/geckodriver/releases)
- Chrome + chromedriver

The geckodriver/chromedriver binary must have execute permissions (`chmod +x geckodriver`).

We first import Selenium's `webdriver` module. Helper functions that wait until the page has finished loading are not needed most of the time, but they don't hurt.

In [61]:
import selenium.webdriver
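The wait helpers mentioned above are not shown in this notebook. Below is a minimal sketch of what two such functions might look like, using only the standard library plus the driver's `execute_script` method. The names `page_has_loaded` and `wait_until` are hypothetical, and checking `document.readyState` is one assumption about what "finished loading" means.

```python
import time


def page_has_loaded(driver):
    """Return True once the browser reports that the document has finished loading."""
    return driver.execute_script("return document.readyState") == "complete"


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Usage with a real Selenium driver (not run here):
# driver.get("https://xkcd.com/")
# wait_until(lambda: page_has_loaded(driver), timeout=30)
```

Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui`, which may be preferable in practice. The hand-rolled version here only aims to make the polling logic visible.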

Start the browser and define how long you are willing to wait for a page to load.

In [62]:
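# Open the driver (change the executable path to geckodriver_mac or geckodriver.exe)
driver = selenium.webdriver.Firefox(executable_path="./geckodriver")
driver.implicitly_wait(10)        # Wait up to 10 seconds when looking for elements
driver.set_page_load_timeout(30)  # Give up if a page takes more than 30 seconds to load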

Get xkcd and click through the comics

In [63]:
#get a website
driver.get("https://xkcd.com/")

In [64]:
# Let's find the "Random" button
element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a')
element.click()

Find an attribute

In [65]:
element = driver.find_element_by_xpath('//*[@id="comic"]/img')
element.get_attribute("title")

"This happens in geek circles every so often. The 'Hey, this is just a system I can figure out easily!' is also a problem among engineers first diving into the stock market."

Log in to Facebook

In [66]:
##DO NOT WRITE YOUR PASSWORD IN NOTEBOOKS!!
fb_email, fb_pass = "f1692418@mvrht.com","pedropalotes"

In [67]:
#go to facebook
driver.get("https://www.facebook.com/")

In [68]:
#send email and password
driver.find_element_by_xpath('//*[@id="email"]').send_keys(fb_email)
driver.find_element_by_xpath('//*[@id="pass"]').send_keys(fb_pass)

In [69]:
#click on login
driver.find_element_by_xpath('//*[@id="loginbutton"]').click()

In [74]:
#find JP
element = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[1]/div/div/div/div[1]/div[2]/div/form/div/div/div/div/input[2]')
element.send_keys("john paul gonzales")
element = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[1]/div/div/div/div[1]/div[2]/div/form/button')
element.click()

In [77]:
#click on him
element = driver.find_element_by_xpath('/html/body/div[1]/div[3]/div[1]/div/div[3]/div[2]/div/div/div[3]/div/div/div/div[1]/div/div/div/div/div/div[2]/div[1]/div/div[1]/a/div')
element.click()

In [None]:
#send a friend request
element = driver.find_element_by_xpath('/html/body/div[1]/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[1]/div/div[4]/div/div[2]/div/div[2]/span/span/span[1]/a')
element.click()

## APIs

To allow users to access large amounts of data, companies may provide an [Application Programming Interface (API)](https://en.wikipedia.org/wiki/Application_programming_interface). Often these requests are made via GET and POST HTTP requests. For example, to make a request to the Twitter API:

```{bash}
curl --request GET \
 --url 'https://api.twitter.com/1.1/search/tweets.json?q=nasa&result_type=popular' \
 --header 'authorization: OAuth oauth_consumer_key="consumer-key-for-app", ... , 
 oauth_token="access-token-for-authed-user", oauth_version="1.0"'
```
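The same request can be assembled in Python. The sketch below builds it with the standard library and does not send it, because a real call needs a complete, signed OAuth header. The `auth_header` value is a placeholder, and the endpoint and query parameters are copied from the `curl` example.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://api.twitter.com/1.1/search/tweets.json"
params = {"q": "nasa", "result_type": "popular"}

# Placeholder only: a real OAuth 1.0 header carries signed credentials.
auth_header = "OAuth <credentials elided>"

# Build (but do not send) the GET request.
request = Request(
    base_url + "?" + urlencode(params),
    headers={"Authorization": auth_header},
    method="GET",
)

print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, or of calling `requests.get(base_url, params=params, headers=...)` with the `requests` library imported at the top of this notebook.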

APIs often return data in [JavaScript Object Notation (JSON)](https://en.wikipedia.org/wiki/JSON) format. For example:

```{json}
{"status": 200, "message": "hello world"}
```
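In Python, a JSON string like this one can be turned into a dictionary with the `json` module imported at the top of the notebook. A minimal sketch using the example response above as a literal string:

```python
import json

raw = '{"status": 200, "message": "hello world"}'

# Parse the JSON text into Python objects (here, a dict).
payload = json.loads(raw)
print(payload["status"], payload["message"])

# Serialize back to JSON text, indented for readability.
print(json.dumps(payload, indent=2))
```

When the response comes from `requests`, calling `r.json()` on the response object performs the same parsing step.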

### "Hidden" APIs

### Explicit APIs