# Collecting Digital Trace Data: Web Scraping / APIs
June 19th, 2018 - Javier Garcia-Bernardo & Allie Morgan

In [1]:
## Requirements
import requests               # Simple HTTP operations (GET and POST)
import selenium               # Loads dynamic (javascript) pages
import json                   # Parsing the responses from APIs
import re                     # Regular expressions (pattern matching)
from bs4 import BeautifulSoup # Parsing HTML
import pandas as pd           # Read tables

## Web Scraping

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) is a method for extracting data from the web. There are many techniques which can be used for web scraping — ranging from requiring human involvement (“human copy-paste”) to fully automated systems (using computer vision). Somewhere in the middle is HTML parsing, which we will describe here.

Web scraping using [HTML parsing](https://en.wikipedia.org/wiki/Web_scraping#HTML_parsing) is often used on webpages which share a similar HTML structure. For example, you might want to scrape the ingredients from chocolate chip cookie recipes to identify correlations between ingredients and five-star-worthy cookies, predict who will win March Madness by looking at game play-by-plays, or list all the local pets up for adoption.

### Static Webpages

In [2]:
pet_pages = ["https://www.boulderhumane.org/animals/adoption/dogs", 
             "https://www.boulderhumane.org/animals/adoption/cats", 
             "https://www.boulderhumane.org/animals/adoption/adopt_other"]

r = requests.get(pet_pages[0])
html = r.text
print(html[:500]) # Print the first 500 characters of the HTML

<!DOCTYPE html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.boulderhumane.org/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<meta name="viewport" content="width=1000px, initial-scale=1.0, maximum-scale=1.0" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
<link type="text/css" rel="stylesheet


When you visit a webpage, your web browser renders an HTML document with CSS and Javascript to produce a visually appealing page. (See the HTML above.) [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for parsing HTML. We'll use it to extract all of the names, ages, and breeds of the [dogs](https://www.boulderhumane.org/animals/adoption/dogs), [cats](https://www.boulderhumane.org/animals/adoption/cats), and [small animals](https://www.boulderhumane.org/animals/adoption/adopt_other) currently up for adoption at the Boulder Humane Society.

In [50]:
soup = BeautifulSoup(html, 'html.parser')

Note that the feature of these pages which we are exploiting is their repeated HTML structure. Every animal listed has a variant of the following HTML:
```{html}
<div class="views-row ... ">
  ...
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content">
      <a href="/animals/adoption/" title="Adopt Me!">Romeo</a>
    </div>
  </div>
  <div class="views-field views-field-field-pp-primarybreed">
    <div class="field-content">New Zealand</div>
  </div>
  <div class="views-field views-field-field-pp-secondarybreed">
    <div class="field-content">Rabbit</div>
  </div>
  <div class="views-field views-field-field-pp-age">
    ...
    <span class="field-content">0 years 2 months</span>
  </div>
  <div class="views-field views-field-field-pp-gender">
    ...
    <span class="field-content">Male</span>
  </div>
  ...
</div>
``` 
So to get at the HTML object for each pet, we can run the following:

In [51]:
pets = soup.find_all('div', {'class': re.compile('.*views-row.*')})

That is, find all of the `div` tags with the `class` attribute which contains the string `views-row`. 
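As a small illustration of class matching, the sketch below runs three similar-looking queries against a made-up HTML snippet (the wrapper `div` and the second animal name are invented for the example). Beautiful Soup applies a compiled regular expression with `search`, so the leading and trailing `.*` are not required, and a loose pattern can also match unrelated classes that merely contain the text. Matching on the class token is stricter.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet: two animal rows inside a wrapper whose class merely
# contains the text "views-row".
html = """
<div class="views-rows-wrapper">
  <div class="views-row views-row-1 views-row-odd">Romeo</div>
  <div class="views-row views-row-2 views-row-even">Juliet</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Regex match: any div with a class containing "views-row" (wrapper included)
by_regex = soup.find_all("div", {"class": re.compile("views-row")})
# Token match: only divs that have the exact class "views-row"
by_token = soup.find_all("div", class_="views-row")
# The same token match written as a CSS selector
by_css = soup.select("div.views-row")

print(len(by_regex), len(by_token), len(by_css))
row_names = [div.get_text(strip=True) for div in by_token]
```

On the real pages the regular expression works fine; the stricter forms mainly help when a site nests containers with similar class names.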

Next, to grab the name, breeds, and age of each pet, we'll grab the children of each pet HTML object. For example:

In [5]:
head = "views-field views-field-field-pp-"
for pet in pets:
    name = pet.find('div', {'class': head + 'animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': head + 'primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': head + 'secondarybreed'}).get_text(strip=True)
    age = pet.find('div', {'class': head + 'age'}).get_text(strip=True)
    print(name, primary_breed, secondary_breed, age)

Kobe Cane Corso  Age:1 year 8 months
Roxy Terrier, American Pit Bull Mix Age:1 year 6 months
Bear Retriever, Chesapeake Bay Mix Age:2 years 5 months
Chloe Akita Mix Age:5 years 0 months
Drew Retriever, Labrador Retriever, Golden Age:2 years 0 months
Harley Terrier, American Pit Bull Mix Age:2 years 0 months
Sadie Great Dane Retriever, Labrador Age:8 years 11 months
Megrita Dutch Shepherd Mix Age:5 years 0 months
Molly Mastiff Rottweiler Age:8 years 0 months
Butterbean Terrier, Jack Russell  Age:7 years 0 months
Rowdy Rottweiler Mix Age:6 years 0 months
Marco Spaniel, American Cocker  Age:5 years 0 months
Eva Boxer Mix Age:6 years 0 months
Stew Chihuahua, Short Coat Mix Age:7 years 0 months
Rudy Doberman Pinscher  Age:3 years 2 months
Tom Beagle  Age:10 years 0 months
Walnut Terrier, American Pit Bull Mix Age:4 years 0 months
Sandy May Coonhound Mix Age:7 years 0 months
Bruno Vizsla, Smooth Haired Beagle Age:1 year 6 months
Fen Pointer Mix Age:1 year 7 months
Madre Terrier, American Pit

where each call to `find` is getting the children of a pet object, in particular, the `div`s with `class` attributes which look like `views-field views-field-field-pp-*`. Feel free to replace the above code with the cat or small animal pages provided and see how the output changes.
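The scraped age arrives as text with its label attached (e.g. `Age:1 year 8 months`), which is awkward to analyze. A minimal sketch of one way to turn it into a number, assuming every age follows the "N year(s) M month(s)" pattern seen in the output above; the helper name `age_in_months` is ours, not part of any library:

```python
import re

# Matches "1 year 8 months", "10 years 0 months", etc.
AGE_PATTERN = re.compile(r"(\d+)\s+years?\s+(\d+)\s+months?")

def age_in_months(age_text):
    """Convert text like 'Age:1 year 8 months' to a number of months.

    Returns None when the text does not match the expected pattern.
    """
    match = AGE_PATTERN.search(age_text)
    if match is None:
        return None
    years, months = (int(group) for group in match.groups())
    return 12 * years + months

print(age_in_months("Age:1 year 8 months"))    # 20
print(age_in_months("Age:10 years 0 months"))  # 120
```

Returning `None` for unexpected text keeps one oddly formatted listing from stopping the whole loop.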

#### Aside: Read Tables from Webpages

In [54]:
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches",header=0)[0]

# Write table to CSV
#table.to_csv("filenamehere.csv")

table.head(20)

Unnamed: 0,Name,Image,Origin,Description
0,Bacon,,United Kingdom,Often eaten with ketchup or brown sauce
1,"Bacon, egg and cheese",,United States,"Breakfast sandwich, usually with fried or scra..."
2,Bagel toast,,Israel,"Pressed, toasted bagel filled with vegetables ..."
3,Baked bean,,United States,"Canned baked beans on white or brown bread, so..."
4,Bánh mì[4],,Vietnam,"Filling is typically meat, but can contain a w..."
5,Barbecue[5][6][7],,United States,"Served on a bun, with chopped, sliced, or shre..."
6,Barros Jarpa,,Chile,"Ham and cheese, usually mantecoso, which is si..."
7,Barros Luco,,Chile,Beef (usually thin-cut steak) and cheese
8,Bauru,,Brazil,"Melted cheese, roast beef, tomato, and pickled..."
9,Beef on weck,,"United States (Buffalo, New York)",Roast beef on a Kummelweck roll
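Tables scraped from Wikipedia often carry footnote markers in their cells, such as `Bánh mì[4]` above. A small sketch of one way to clean them, assuming the markers are always bracketed numbers; `strip_citations` is a hypothetical helper name:

```python
import re

# Bracketed numeric footnote markers such as "[4]" or "[12]"
CITATION = re.compile(r"\[\d+\]")

def strip_citations(text):
    """Remove bracketed numeric footnote markers from a string."""
    return CITATION.sub("", text).strip()

print(strip_citations("Bánh mì[4]"))         # Bánh mì
print(strip_citations("Barbecue[5][6][7]"))  # Barbecue
```

With the table loaded as above, something like `table["Name"] = table["Name"].map(strip_citations)` should apply it to the whole column, provided the column has no missing values.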


### Dynamic (Javascript) Webpages

Above, we requested webpages whose content did not depend on [Javascript](https://en.wikipedia.org/wiki/JavaScript). In other words, everything we needed was already in the HTML returned by the server, and no input was required on the user's end to view the content of the page (e.g. a login). Let's try a more complicated example of web scraping where content is loaded dynamically.

Some characteristics of HTML scraping with [Selenium](https://www.seleniumhq.org/download/): it (a) can handle Javascript, (b) returns the **HTML** after the Javascript has been rendered, and (c) can behave like a person, though it (d) can be slow.

Requirements (one of the below):
- Firefox + geckodriver (https://github.com/mozilla/geckodriver/releases)
- Chrome + chromedriver (https://sites.google.com/a/chromium.org/chromedriver/)
    
Note: geckodriver/chromedriver must have execution permissions (chmod +x geckodriver)

In [55]:
import selenium.webdriver

Start the browser and define how long you are willing to wait for a page to load. (Many times this is not needed, but it doesn't hurt.)

In [8]:
# Open the driver (change the executable path to point at your chromedriver binary, e.g. chromedriver.exe on Windows)
driver = selenium.webdriver.Chrome(executable_path="./chromedriver")
#driver = selenium.webdriver.Chrome()

Visit [xkcd](https://xkcd.com) and click through the comics.

In [57]:
# Get the xkcd website
driver.get("https://xkcd.com/")

In [58]:
# Let's find the 'random' button
element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a')
element.click()

Find an attribute of this page.

In [11]:
element = driver.find_element_by_xpath('//*[@id="comic"]/img')
element.get_attribute("title")

'Although the Markov chain-style text model is still rudimentary; it recently gave me "Massachusetts Institute of America". Although I have to admit it sounds prestigious.'

In [62]:
driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a').click()

Let's visit a webpage which requires a login. Signing in to Facebook ...

In [63]:
##DO NOT WRITE YOUR PASSWORD IN NOTEBOOKS!!
fb_email, fb_pass = "<email>", "<password>"
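One way to keep credentials out of the notebook is to read them from environment variables and fall back to an interactive prompt. This is only a sketch: the variable names `FB_EMAIL` and `FB_PASS` and the helper `load_credentials` are hypothetical choices, not something Facebook or Selenium define.

```python
import os
from getpass import getpass

def load_credentials(email_var="FB_EMAIL", pass_var="FB_PASS"):
    """Return (email, password) read from environment variables.

    Falls back to interactive prompts when a variable is not set.
    The variable names are hypothetical; pick whatever suits your setup.
    """
    email = os.environ.get(email_var) or input("Email: ")
    # getpass hides the typed password instead of echoing it
    password = os.environ.get(pass_var) or getpass("Password: ")
    return email, password
```

Calling `fb_email, fb_pass = load_credentials()` would then replace the hard-coded placeholders, and nothing sensitive is saved with the notebook.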

In [64]:
# Go to Facebook
driver.get("https://www.facebook.com/")

In [14]:
# Send email and password
driver.find_element_by_xpath('//*[@id="email"]').send_keys(fb_email)
driver.find_element_by_xpath('//*[@id="pass"]').send_keys(fb_pass)

In [66]:
# Click on login
driver.find_element_by_xpath('//*[@id="loginbutton"]').click()

In [67]:
# Search for a friend by name
element = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[1]/div/div/div/div[1]/div[2]/div/form/div/div/div/div/input[2]')
element.send_keys("<friend's name>")
element = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[1]/div/div/div/div[1]/div[2]/div/form/button')
element.click()

In [68]:
# Click on him
element = driver.find_element_by_xpath('//*[@class="_2yet"]')
element.click()

In [69]:
# Send a friend request (only run during workshop)
element = driver.find_element_by_xpath('/html/body/div[1]/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[1]/div/div[4]/div/div[2]/div/div[2]/span/span/span[1]/a')
element.click()

In [70]:
# Always remember to close your browser!
driver.close()

## APIs

To allow users to access large amounts of data, companies may provide an [Application Programming Interface (API)](https://en.wikipedia.org/wiki/Application_programming_interface). Often these requests are handled via GET and POST HTTP requests. For example, to make a request to the Twitter API:

```{bash}
curl --request GET \
 --url 'https://api.twitter.com/1.1/search/tweets.json?q=nasa&result_type=popular' \
 --header 'authorization: OAuth oauth_consumer_key="consumer-key-for-app", ... , oauth_token="access-token-for-authed-user", oauth_version="1.0"'
 ```
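The part of the URL after the `?` is the query string, a list of `key=value` pairs. As a rough sketch of how it is assembled, the example below builds the same search URL with the standard library (no request is sent, and the OAuth header is left out). Passing a `params` argument to `requests.get` performs this encoding for you.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://api.twitter.com/1.1/search/tweets.json"
params = {"q": "nasa", "result_type": "popular"}

# urlencode escapes each value and joins the pairs with '&'
url = base_url + "?" + urlencode(params)
print(url)

# Special characters are percent-encoded, e.g. the ':' in a 'from:' query
encoded_query = urlencode({"q": "from:sfiscience"})
print(encoded_query)

# Going the other way: split a URL back into its query parameters
query = parse_qs(urlsplit(url).query)
print(query)
```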

APIs often return data in the format of [Javascript Object Notation (JSON)](https://en.wikipedia.org/wiki/JSON). For example:

```{json}
{"status": 200, "message": "hello world"}
```
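In Python, the `json` module converts between JSON text and ordinary dictionaries and lists. A minimal sketch using the example response above:

```python
import json

raw = '{"status": 200, "message": "hello world"}'
payload = json.loads(raw)       # JSON object -> Python dict

print(type(payload).__name__)   # dict
print(payload["status"])        # 200
print(payload["message"])       # hello world

# json.dumps goes the other way, from Python objects back to a JSON string
round_trip = json.dumps(payload)
```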

### "Hidden" APIs

First, let's try to access what we are calling a "hidden" API. That is, we investigate the resources requested by a webpage (e.g. a list of faculty), and make requests directly to that API.

In [71]:
import requests
import json

def get_names(letters):
    params = (
        ('name', letters),
        ('request_num', '1'),
    )

    response = requests.get('https://www.uvm.edu/directory/api/query_results.php', params=params)
    if response.ok:
        return response.text
    else:
        return None

In [74]:
response = get_names("ab")

In [77]:
response_json = json.loads(response)

In [22]:
for i, person in enumerate(response_json["data"]):
    if i == 10: 
        break # Make sure we don't print too much
        
    #print(person["edupersonprimaryaffiliation"]["0"], person["edupersonprincipalname"]["0"], person["cn"]["0"])
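Records from undocumented endpoints are rarely uniform, so it helps to read fields defensively. The sketch below assumes the shape implied by the commented-out line above, where each field maps to a dictionary keyed by `"0"`. The sample payload and its values are invented for illustration, and `first_value` is a hypothetical helper.

```python
import json

# Hypothetical payload mimicking the shape implied by the commented-out line:
# {"data": [{field: {"0": value}, ...}, ...]}. All values are invented.
sample = json.loads("""
{"data": [
  {"cn": {"0": "Ada Example"},
   "edupersonprimaryaffiliation": {"0": "Faculty"}},
  {"cn": {"0": "Ben Example"}}
]}
""")

def first_value(person, field, default=None):
    """Return person[field]["0"], or default if either key is missing."""
    return person.get(field, {}).get("0", default)

rows = [
    (first_value(p, "cn"), first_value(p, "edupersonprimaryaffiliation", "unknown"))
    for p in sample["data"]
]
print(rows)
```

Using `.get` with a default means a record with a missing field yields a placeholder instead of raising a `KeyError` halfway through the loop.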

### Explicit APIs

Next, let's try a more typical example of an API. The perks of this approach: (a) send a request and get back JSON, (b) chances are that somebody else has created a Python wrapper for you, but keep in mind that (c) APIs have limits.

Let's consider a common API example -- Twitter. To get started:
- Get a key: https://apps.twitter.com/
- Documentation: https://dev.twitter.com/rest/public
- Find a library: https://dev.twitter.com/resources/twitter-libraries (We'll use https://github.com/tweepy/tweepy)

Limitations: 100 messages / query, 180 messages every 15 min, & only the last seven days of data 
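Because each query returns a limited number of tweets, collecting more means paging backwards through the results with `max_id`. The sketch below shows the idea against an in-memory stand-in rather than the real API: `fake_search` and `collect` are hypothetical helpers, and the ids are invented.

```python
def fake_search(all_ids, max_id=None, count=100):
    """Stand-in for a search endpoint: newest first, at most `count` ids,
    all less than or equal to `max_id` when it is given."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]

def collect(all_ids, wanted, page_size=100):
    """Page backwards through results until `wanted` ids are collected
    or the source is exhausted."""
    collected, max_id = [], None
    while len(collected) < wanted:
        page = fake_search(all_ids, max_id=max_id, count=page_size)
        if not page:           # nothing older is left
            break
        collected.extend(page)
        max_id = page[-1] - 1  # continue just below the oldest id seen
    return collected[:wanted]

ids = list(range(1, 251))      # 250 invented tweet ids
result = collect(ids, wanted=1000)
print(len(result), result[0], result[-1])
```

Checking for an empty page before reading its last element is what lets the loop stop cleanly when fewer results exist than were requested.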

In [57]:
!pip install tweepy



In [23]:
import tweepy
import time
import pickle

def twitter(d_keys,query):
    # Authenticate
    auth = tweepy.OAuthHandler(d_keys["CONSUMER_KEY"], d_keys["CONSUMER_SECRET"])
    auth.set_access_token(d_keys["ACCESS_KEY"], d_keys["ACCESS_SECRET"])
    api = tweepy.API(auth)

    # We want 1000 tweets
    num_results = 1000
    result_count = 0
    last_id = None
    
    # Max 180 tweets 15 min
    cumulative = 0

    #While we don't have them
    while (result_count <  num_results):
        previous_tweets = result_count
        # Ask for more tweets, starting at 'last_id' (identifier of the tweet)
        results = api.search(q = query,
                              count = 90, max_id = last_id, result_type="recent")
                                # geocode = "{},{},{}km".format(latitude, longitude, max_range) #for geocode

        # For each tweet extract some info (JSON structure)
        for result in results:
            result_count += 1
            user = result.user.screen_name
            text = result.text
            followers_count = result.user.followers_count
            time_zone = result.user.time_zone
            print("_"*10)
            print(user,time_zone,followers_count)
            print(text)

        new_tweets = result_count - previous_tweets
        print("Number of results: {} ({} new)".format(result_count, new_tweets))

        # If we don't get new tweets exit (this also avoids using 'result'
        # when the very first query returns nothing)
        if new_tweets == 0:
            break

        # Keep the last_id to know where to continue
        last_id = int(result.id) - 1
        
        time.sleep(1)
        
        if ((result_count + 90) // 150) > cumulative:
            cumulative += 1
            time.sleep(15*60)


d_keys = pickle.load(open(".key","rb")) # Don't share your keys ;)
twitter(d_keys,"from:sfiscience")

__________
sfiscience None 26176
Live now: panel discussion on the nature of time with David Krakauer, Jim Hartle, and @seanmcarroll, moderated by s… https://t.co/KjEdjIoYJv
__________
sfiscience None 26176
RT @michaelgarfield: All the panel discussion live notes, 360º videos, and musical recordings I made at InterPlanetary Festival last week:…
__________
sfiscience None 26176
Enroll now for @ComplexExplorer's MOOC: Introduction to Agent-based Modeling with Bill Rand. Students learn to crea… https://t.co/BSz6XqxsPN
__________
sfiscience None 26176
Seats are sold out, but you can catch the livestream on our YouTube channel: https://t.co/Sf8LLAYOxk https://t.co/6O2gFDc03D
__________
sfiscience None 26176
We're hiring a full-time social media specialist. This is one of the most dynamic positions @sfiscience. https://t.co/MBgWWTjUNC
__________
sfiscience None 26176
Happening now at SFI: workshop brings together thinkers in disciplines ranging from cosmology to chronobiology to n… https://t