In [5]:
!pip3 install requests



In [6]:
!pip3 install beautifulsoup4



In [9]:
!pip3 install selenium



In [10]:
import requests
from bs4 import BeautifulSoup
import selenium

# Collecting Digital Trace Data: Web Scraping

This morning we learned how to scrape data from the web using the Python packages `requests` and `BeautifulSoup`. Some of you already have experience with web scraping, but for others, this may have been your first time collecting digital trace data. This group exercise is designed to strike a balance between practicing rudimentary skills (for those of you with little or no experience in this area) and trying cutting-edge techniques (for those of you with extensive expertise in this area). As an added bonus, this exercise challenges you not only to practice your coding skills, but also to think about how to ask questions that contribute new knowledge to sociological theory.

<ol>
<li>Divide yourselves into groups of four by counting off in order around the room.</li>
<li>For 10 minutes, work together to identify a research question that you believe can be answered using some of the methods we discussed this morning.</li>
<li>Identify a sampling frame to help you answer this research question.</li>
<li>Evaluate the strengths and weaknesses of the data you plan to collect.</li>
<li>Outline a hybrid research design (e.g. an app or a bot) that could be used to address the weaknesses of the data you collected, or otherwise improve your ability to answer the research question.</li>
<li>(If you have time, write code to collect data from each unit of analysis in your sample. See the code below for help.)</li>
</ol>

There is only one requirement: the group member with the least coding experience should be responsible for typing the code into a computer. You need not take the steps above in order. However, after 1 hour you should be prepared to give a 5-minute presentation of your activities. Remember that these daily exercises are a way for you to explore possible new topics and to get to know each other better.

## Demonstration (Static HTML)

Web scraping with HTML parsing works best on webpages that share a similar HTML structure. For example, you might want to scrape the ingredients from chocolate chip cookie recipes to identify correlations between ingredients and five-star-worthy cookies, you might want to predict who will win March Madness by looking at game play-by-plays, or you might want to list all the local pets up for adoption.


In [11]:
pet_pages = ["https://www.boulderhumane.org/animals/adoption/dogs", 
             "https://www.boulderhumane.org/animals/adoption/cats", 
             "https://www.boulderhumane.org/animals/adoption/adopt_other"]

r = requests.get(pet_pages[0])
html = r.text
print(html[:500]) # Print the first 500 characters of the HTML

<!DOCTYPE html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.boulderhumane.org/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<meta name="viewport" content="width=1000px, initial-scale=1.0, maximum-scale=1.0" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
<link type="text/css" rel="stylesheet


When you visit a webpage, your web browser renders an HTML document with CSS and JavaScript to produce a visually appealing page. (See the HTML above.) BeautifulSoup is a Python library for parsing HTML. We'll use it to extract the names, ages, and breeds of the dogs, cats, and small animals currently up for adoption at the Humane Society of Boulder Valley.
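Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original demonstration: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It relies only on `requests.get`, `Response.raise_for_status`, and the `requests.exceptions.RequestException` base class.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx status codes instead of
        # silently returning an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # RequestException covers malformed URLs, connection errors,
        # timeouts, and HTTP errors raised above.
        print(f"Could not fetch {url}: {err}")
        return None
    return response.text
```

With this helper, `html = fetch_html(pet_pages[0])` would yield either the page text or `None`, so a failed request can be skipped rather than parsed as if it were a pet listing.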

In [12]:
soup = BeautifulSoup(html, 'html.parser')
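To get a feel for what the parsed object offers, here is a small sketch that parses a hand-copied fragment of the header printed above instead of the live page. The names `snippet`, `demo_soup`, `page_title`, and `generator` are illustrative only.

```python
from bs4 import BeautifulSoup

# A fragment copied from the header printed above, closed off by hand
# so that it forms a complete <head> element.
snippet = """
<head>
<meta charset="utf-8" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
</head>
"""

demo_soup = BeautifulSoup(snippet, "html.parser")

# Tags can be reached as attributes of the soup object.
page_title = demo_soup.title.get_text()

# find() returns the first matching tag; attribute values are read like dict keys.
generator = demo_soup.find("meta", {"name": "Generator"})["content"]

print(page_title)
print(generator)
```

The same calls work on the full `soup` object built from the downloaded page.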

Note that the feature of these pages we are exploiting is their repeated HTML structure. Every animal listed has some variant of the following HTML:

```html
<div class="views-row ... ">
  ...
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content">
      <a href="/animals/adoption/" title="Adopt Me!">Romeo</a>
    </div>
  </div>
  <div class="views-field views-field-field-pp-primarybreed">
    <div class="field-content">New Zealand</div>
  </div>
  <div class="views-field views-field-field-pp-secondarybreed">
    <div class="field-content">Rabbit</div>
  </div>
  <div class="views-field views-field-field-pp-age">
    ...
    <span class="field-content">0 years 2 months</span>
  </div>
  <div class="views-field views-field-field-pp-gender">
    ...
    <span class="field-content">Male</span>
  </div>
  ...
</div>
```

So to get at the HTML object for each pet, we can run the following:

In [16]:
import re
pets = soup.find_all('div', {'class': re.compile('.*views-row.*')})

That is, find all of the `div` tags whose `class` attribute contains the string `views-row`.
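The class-matching behavior can be checked offline on a small hand-written sample. This is a sketch under stated assumptions: the extra class names (`views-row-1`, `views-row-odd`, and so on), the second pet "Juliet", and the `footer` div are invented for illustration and are not taken from the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the structure shown above.
sample_html = """
<div class="views-row views-row-1 views-row-odd">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/" title="Adopt Me!">Romeo</a></div>
  </div>
</div>
<div class="views-row views-row-2 views-row-even">
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content"><a href="/animals/adoption/" title="Adopt Me!">Juliet</a></div>
  </div>
</div>
<div class="footer">Not a pet</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The pattern is applied with a regex search, so 'views-row' alone
# matches any class containing that substring.
sample_pets = sample_soup.find_all("div", {"class": re.compile("views-row")})

# Pull the pet name out of the link inside each matched row.
names = [pet.find("a").get_text(strip=True) for pet in sample_pets]

print(len(sample_pets))
print(names)
```

Only the two outer rows match; the nested `views-field` and `field-content` divs and the `footer` div do not contain the substring `views-row`, so they are skipped.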

Next, to grab the name, breeds, and age of each pet, we'll search the children of each pet's HTML object. For example:

In [17]:
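head = "views-field views-field-field-pp-"  # shared prefix of each field's class attribute
for pet in pets:
    # Each field lives in its own div; get_text(strip=True) returns its text without surrounding whitespace.
    name = pet.find('div', {'class': head + 'animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': head + 'primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': head + 'secondarybreed'}).get_text(strip=True)
    # The age div also contains its label, so the text reads like "Age:3 years 2 months".
    age = pet.find('div', {'class': head + 'age'}).get_text(strip=True)
    print(name, primary_breed, secondary_breed, age)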

Rocky Retriever, Labrador Mix Age:3 years 2 months
Max Terrier, American Pit Bull  Age:6 years 10 months
Angel Terrier, American Pit Bull Mix Age:3 years 3 months
Rylie Coonhound, Black and Tan Mix Age:8 years 1 month
Jules Chihuahua, Short Coat Mix Age:0 years 4 months
Bruno Terrier, American Pit Bull Mix Age:6 years 6 months
Zeus Boxer Mix Age:2 years 0 months
Oscar Dachshund, Standard Smooth Haired Mix Age:5 years 0 months
Jackson Terrier, Rat Mix Age:8 years 0 months
Pitkin Retriever, Labrador Mix Age:3 years 0 months
Penny Australian Cattle Dog Boxer Age:7 years 0 months
Valentino Chihuahua, Short Coat Mix Age:3 years 5 months
Ethel Terrier, Norwich Mix Age:8 years 0 months
Lucy Chihuahua, Short Coat Mix Age:5 years 0 months
Mona Rottweiler Mix Age:4 years 0 months
Sugar Welsh Corgi, Pembroke  Age:8 years 0 months
Dylan Terrier, American Pit Bull Mix Age:0 years 4 months
Little Bit Terrier, Jack Russell Mix Age:7 years 0 months
Tina Roo Beagle  Age:6 years 0 months
Marty Chihuahua