In [1]:
#!pip3 install requests
#!pip3 install beautifulsoup4
#!pip3 install selenium

In [10]:
import requests
from bs4 import BeautifulSoup
import selenium

# Collecting Digital Trace Data: Web Scraping

This morning we learned how to scrape data from the web using the Python packages `requests` and `BeautifulSoup`. Some of you already have experience with web scraping, but for others this may have been your first time collecting digital trace data. This group exercise is designed to strike a balance between practicing rudimentary skills (for those of you with little or no experience in this area) and cutting-edge techniques (for those of you with extensive expertise in this area). As an added bonus, the exercise challenges you not only to practice your coding skills, but also to think about how to ask questions that contribute new knowledge to sociological theory.

<ol>
<li>First, independently brainstorm one or two research questions that you believe can be answered using online data sources and web scraping. </li>
<li>Divide yourselves into groups of three or four. Try to join a group with people you haven't worked with.</li>
<li>For 10 minutes, work together to identify a research question based on one of the data sources proposed by your group members.</li>
<li>Evaluate the strengths and weaknesses of the data you plan to collect.</li>
<li>Outline a hybrid research design (e.g. an app or a bot) that could be used to address the weaknesses of the data you collected, or otherwise improve your ability to answer the research question.</li>
<li>(If you have time, write code to collect data from each unit of analysis in your sample. See the code below for help.)</li>
</ol>

There is only one requirement: the group member with the least amount of experience coding should be responsible for typing the code into a computer. After 45 minutes, we will share our work with the group. Let us know if you'd like to present your group's potential project. Remember that these daily exercises are a way for you to explore new possible topics and to get to know each other better.

## Demonstration (Static HTML)

Web scraping with HTML parsing is often used on webpages that share a similar HTML structure. For example, you might want to scrape the ingredients from chocolate chip cookie recipes to identify correlations between ingredients and five-star-worthy cookies, predict who will win March Madness by looking at game play-by-plays, or list all the local pets up for adoption.


### Case Study #1: Boulder Humane Society

In [3]:
pet_pages = ["https://www.boulderhumane.org/animals/adoption/dogs", 
             "https://www.boulderhumane.org/animals/adoption/cats", 
             "https://www.boulderhumane.org/animals/adoption/adopt_other"]

r = requests.get(pet_pages[0])
r.raise_for_status()  # Stop early if the request failed
html = r.text
print(html[:500]) # Print the first 500 characters of the HTML

<!DOCTYPE html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.boulderhumane.org/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<meta name="viewport" content="width=1000px, initial-scale=1.0, maximum-scale=1.0" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
<link type="text/css" rel="stylesheet


When you visit a webpage, your web browser renders an HTML document with CSS and JavaScript to produce a visually appealing page. (See the HTML above.) BeautifulSoup is a Python library for parsing HTML. We'll use it to extract the names, ages, and breeds of the dogs, cats, and small animals currently up for adoption at the Boulder Humane Society.

In [4]:
soup = BeautifulSoup(html, 'html.parser')

Note that the feature of these pages we are exploiting is their repeated HTML structure. Every animal listed has some variant of the following HTML:

```html
<div class="views-row ... ">
  ...
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content">
      <a href="/animals/adoption/" title="Adopt Me!">Romeo</a>
    </div>
  </div>
  <div class="views-field views-field-field-pp-primarybreed">
    <div class="field-content">New Zealand</div>
  </div>
  <div class="views-field views-field-field-pp-secondarybreed">
    <div class="field-content">Rabbit</div>
  </div>
  <div class="views-field views-field-field-pp-age">
    ...
    <span class="field-content">0 years 2 months</span>
  </div>
  <div class="views-field views-field-field-pp-gender">
    ...
    <span class="field-content">Male</span>
  </div>
  ...
</div>
```

So to get at the HTML object for each pet, we can run the following:

In [5]:
pets = soup.find_all('div', {'class': 'views-row'})

That is, find all of the `div` tags whose `class` attribute includes the class `views-row`.
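To see how this class matching behaves, here is a minimal sketch on a toy HTML string modeled on the structure shown earlier. The second pet name and the `sidebar` div are made up for illustration. Each `div` carries several classes, and BeautifulSoup matches `views-row` against each whitespace-separated class, so `views-row-1` on its own would not count as a match.

```python
from bs4 import BeautifulSoup

# Toy HTML modeled on the structure shown above (names are illustrative)
sample_html = """
<div class="views-row views-row-1 views-row-odd">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/">Romeo</a></div>
  </div>
</div>
<div class="views-row views-row-2 views-row-even">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/">Juliet</a></div>
  </div>
</div>
<div class="sidebar">Not a pet</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only the two divs that have "views-row" among their classes are returned
sample_pets = sample_soup.find_all("div", {"class": "views-row"})
sample_names = [p.find("a").get_text(strip=True) for p in sample_pets]
print(len(sample_pets), sample_names)
```

The nested `views-field` divs and the `sidebar` div are skipped because none of their classes is exactly `views-row`.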

Next, to grab the name, breeds, and age of each pet, we'll look up the children of each pet HTML object. For example:

In [6]:
head = "views-field views-field-field-pp-"
for pet in pets:
    name = pet.find('div', {'class': head + 'animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': head + 'primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': head + 'secondarybreed'}).get_text(strip=True)
    age = pet.find('div', {'class': head + 'age'}).get_text(strip=True)
    print([name, primary_breed, secondary_breed, age])

['Rocky', 'Retriever, Labrador', 'Mix', 'Age:3 years 2 months']
['Max', 'Terrier, American Pit Bull', '', 'Age:6 years 11 months']
['Angel', 'Terrier, American Pit Bull', 'Mix', 'Age:3 years 3 months']
['Bruno', 'Terrier, American Pit Bull', 'Mix', 'Age:6 years 6 months']
['Zeus', 'Boxer', 'Mix', 'Age:2 years 1 month']
['Pitkin', 'Retriever, Labrador', 'Mix', 'Age:3 years 0 months']
['Penny', 'Australian Cattle Dog', 'Boxer', 'Age:7 years 0 months']
['Valentino', 'Chihuahua, Short Coat', 'Mix', 'Age:3 years 6 months']
['Bellina', 'Schnauzer, Miniature', 'Mix', 'Age:1 year 0 months']
['Mona', 'Rottweiler', 'Mix', 'Age:4 years 0 months']
['Little Bit', 'Terrier, Jack Russell', 'Mix', 'Age:7 years 0 months']
['Tina Roo', 'Beagle', '', 'Age:6 years 0 months']
['Trooper', 'Chihuahua, Short Coat', 'Mix', 'Age:0 years 5 months']
['Linus', 'Terrier, Airedale', 'Mix', 'Age:8 years 0 months']
['Woofie', 'Siberian Husky', '', 'Age:4 years 0 months']
['Willow Moon', 'Terrier, American Pit Bull', '
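The loop above assumes every field is present on every pet; `pet.find(...)` returns `None` when a field is missing, and calling `.get_text()` on it would raise an error. The following is a sketch of a more defensive version, not the site's actual markup: the helper names `field_text` and `parse_pets` are hypothetical, and the `views-label` span in the toy HTML is an assumption about where the `Age:` prefix in the output comes from.

```python
from bs4 import BeautifulSoup

HEAD = "views-field views-field-field-pp-"  # same class prefix as in the loop above


def field_text(pet, suffix):
    """Return the stripped text of one field, or '' if the field is absent."""
    node = pet.find("div", {"class": HEAD + suffix})
    return node.get_text(strip=True) if node is not None else ""


def parse_pets(html):
    """Parse one adoption page into a list of dicts, one per pet."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for pet in soup.find_all("div", {"class": "views-row"}):
        age = field_text(pet, "age")
        if age.startswith("Age:"):  # drop the label seen in the printed output
            age = age[len("Age:"):].strip()
        rows.append({
            "name": field_text(pet, "animalname"),
            "primary_breed": field_text(pet, "primarybreed"),
            "secondary_breed": field_text(pet, "secondarybreed"),
            "age": age,
        })
    return rows


# Toy page; the secondary breed is omitted on purpose to exercise the fallback
toy_html = """
<div class="views-row views-row-1">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/">Romeo</a></div>
  </div>
  <div class="views-field views-field-field-pp-primarybreed">
    <div class="field-content">New Zealand</div>
  </div>
  <div class="views-field views-field-field-pp-age">
    <span class="views-label">Age:</span>
    <span class="field-content">0 years 2 months</span>
  </div>
</div>
"""

toy_rows = parse_pets(toy_html)
print(toy_rows)
```

With a function like this, the three URLs in `pet_pages` could be handled in one loop by passing each page's `requests.get(url).text` to `parse_pets`.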

This may seem like a fairly silly example of web scraping, but one could imagine several research questions that use this data. For example, if we collected this data over time, could we identify which features of pets -- names, breeds, ages -- make them more likely to be adopted? Are certain names more common for certain breeds?
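Collecting the data over time means storing each scrape with the date it was taken. Here is one possible sketch, assuming rows shaped like the lists printed above. The function name `append_snapshot` and the file name `pets.csv` are hypothetical, and an in-memory buffer stands in for the file so the example runs without touching disk.

```python
import csv
import io
from datetime import date


def append_snapshot(rows, handle, snapshot_date):
    """Write one dated row per pet so repeated scrapes can be compared later."""
    writer = csv.writer(handle)
    for name, primary_breed, secondary_breed, age in rows:
        writer.writerow([snapshot_date.isoformat(), name,
                         primary_breed, secondary_breed, age])


# Two rows copied from the printed output above
rows = [
    ["Rocky", "Retriever, Labrador", "Mix", "Age:3 years 2 months"],
    ["Max", "Terrier, American Pit Bull", "", "Age:6 years 11 months"],
]

buffer = io.StringIO()  # stands in for open("pets.csv", "a", newline="")
append_snapshot(rows, buffer, date.today())
print(buffer.getvalue())
```

A pet that appears in one day's snapshot and is missing from the next is a candidate adoption, although it may also have been removed from the listing for other reasons.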

### Case Study #2: Box Office Mojo

Let's say we wanted to know for how many movie franchises a sequel was more successful than the original. For example, <a href="https://www.boxofficemojo.com/franchises/chart/?view=main&id=fastandthefurious.htm&sort=gross&order=DESC&p=.htm">Furious 7</a> (see 'Adjusted for Ticket Price Inflation') was the most lucrative <a href="https://en.wikipedia.org/wiki/The_Fast_and_the_Furious">Fast and the Furious</a> movie. We can collect data from the site <a href="https://www.boxofficemojo.com">"Box Office Mojo"</a> to answer this question.
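Once the grosses are collected, the comparison itself is small. The sketch below assumes each franchise has been reduced to a list of `(release_year, gross)` pairs. The function name `sequel_beat_original` is hypothetical, and the franchises and numbers are placeholders, not real box office figures.

```python
def sequel_beat_original(films):
    """films: list of (release_year, gross) tuples for one franchise.

    Returns True if any later film out-grossed the earliest release.
    """
    ordered = sorted(films)          # earliest release first
    original_gross = ordered[0][1]
    return any(gross > original_gross for _, gross in ordered[1:])


# Placeholder numbers for two made-up franchises; NOT real box office data
franchises = {
    "Franchise A": [(2001, 100.0), (2003, 80.0), (2006, 150.0)],
    "Franchise B": [(1999, 300.0), (2002, 250.0)],
}

winners = [name for name, films in franchises.items()
           if sequel_beat_original(films)]
print(len(winners), winners)
```

Treating the earliest release as "the original" is itself an assumption; prequels and reboots would need a more careful definition.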

Notice <a href="https://www.boxofficemojo.com/franchises/?view=Franchise&sort=gross&order=DESC&adjust_yr=2016&p=.htm">this table of movie franchises</a>, sorted by adjusted gross income:

<img src="box_office_mojo.png" width="500px" align="left">

Let's pick off the first ten franchises and find out for how many of them a sequel was the most successful film. First, let's request the page with the table and pass it to BeautifulSoup to parse its HTML.

In [15]:
table = "https://www.boxofficemojo.com/franchises/?view=Franchise&sort=gross&order=DESC&adjust_yr=2016&p=.htm"
r = requests.get(table)
soup = BeautifulSoup(r.text, 'html.parser')  # Name the parser explicitly, as in Case Study #1

In [24]:
len(soup.find_all('table')[2])

1

Visit the webpage in your browser and open the "Web Inspector" (Safari) or "Developer Tools" (Chrome). Within the page's HTML source, hover over the `<table>...</table>` containing the HTML object we'd like to collect.

![Copy XPath](copy_xpath.gif "Copy XPath")
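After locating the right `<table>`, its rows can be turned into Python lists. This is a minimal sketch on a toy table. The column names and values are placeholders and do not reflect Box Office Mojo's actual markup, which you would need to inspect as described above.

```python
from bs4 import BeautifulSoup

# Toy table; the columns and values are placeholders, not the site's markup
toy_table = """
<table>
  <tr><th>Rank</th><th>Franchise</th></tr>
  <tr><td>1</td><td>Franchise A</td></tr>
  <tr><td>2</td><td>Franchise B</td></tr>
</table>
"""

table_soup = BeautifulSoup(toy_table, "html.parser")

table_rows = []
for tr in table_soup.find("table").find_all("tr"):
    # Header cells are <th>, data cells are <td>; collect both in order
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    table_rows.append(cells)

header, body = table_rows[0], table_rows[1:]
print(header)
print(body)
```

On a real page with several tables, `soup.find_all('table')` returns them in document order, so the index of the one you want has to be found by inspection.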

## Demonstration (Dynamic HTML)

## List of Online Open Source Data & Websites

Here is a short list of open source data and websites. If you have any to add, please tell Allie, or better yet, [start an issue or submit a pull request](https://github.com/allisonmorgan/sicss_boulder).
<ul>
    <li><a href="https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0">"Data is Plural" Newsletter</a></li>
    <li><a href="http://archive.ics.uci.edu/ml/index.php">UCI Machine Learning Repository</a></li>
    <li><a href="https://icon.colorado.edu/#!/">University of Colorado Network Dataset</a></li> 
    <li><a href="https://snap.stanford.edu/data/">Stanford Network Datasets</a></li>
    <li><a href="https://networkdata.ics.uci.edu/resources.php">UCI Network Datasets</a></li>
    <li><a href="https://aws.amazon.com/datasets/8172056142375670">Google Books nGrams</a></li>
    <li><a href="http://about.reuters.com/researchandstandards/corpus/">Reuters News Corpus</a></li>
    <li><a href="http://www.nltk.org/nltk_data/">NLTK Corpora</a></li>
    <li><a href="https://www.cs.cmu.edu/~./enron/">Enron Emails Dataset</a></li>
    <li><a href="http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-013-final.pdf">Political Blogs 2008</a></li>
    <li><a href="https://aws.amazon.com/public-data-sets/common-crawl/">The Common Crawl</a></li>
    <li><a href="https://meta.wikimedia.org/wiki/Data_dumps">Wikipedia</a></li>
    <li><a href="http://archive.ics.uci.edu/ml/datasets/Amazon+Commerce+reviews+set">Amazon Commerce Reviews Set</a></li>
    <li><a href="http://archive.ics.uci.edu/ml/datasets/NSF+Research+Award+Abstracts+1990-2003">NSF Research Award Abstracts</a></li>
    <li><a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">20 Newsgroups</a></li>
    <li><a href="https://catalog.ldc.upenn.edu/LDC2008T19">New York Times Annotated Corpus</a></li>
    <li><a href="https://code.google.com/p/graphlabapi/downloads/detail?name=daily_kos.tar.bz2&can=2&q=">Daily Kos Blog Posts</a></li>
    <li><a href="http://deepdive.stanford.edu/doc/opendata/">Stanford DeepDive Open Datasets</a></li>
    <li><a href="http://trec.nist.gov/data/tweets/">Tweets 2011</a></li>
    <li><a href="http://www.boards.ie/">Irish Discussion Boards</a></li>
    <li><a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/">Movie Review Data</a></li>
    <li><a href="https://www.yelp.com/academic_dataset">Yelp Dataset</a></li>
    <li><a href="http://www.biomedcentral.com/1471-2105/15/S11/S11">BMC BioInformatics</a></li>
    <li><a href="http://socialcomputing.asu.edu/datasets/Twitter">ASU Twitter Dataset</a></li>
    <li><a href="https://snap.stanford.edu/data/higgs-twitter.html">Higgs Twitter Network Dataset</a></li>
    <li><a href="http://icwsm.org/2013/datasets/datasets/">ICWSM (Various Datasets)</a></li>
    <li><a href="http://labrosa.ee.columbia.edu/millionsong/musixmatch">Million Song Dataset</a></li>
    <li><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XPCVEI">EUSpeech</a></li>
    <li><a href="http://humanrightstexts.org/">Human Rights texts</a></li>
    <li><a href="http://textlab.econ.columbia.edu/~jjacobs/marx/">Marx Corpus</a></li>
</ul>