# A. Web Scraping

- October 24 2024

## The roadmap

Moving into our "getting data" modules

- Today: web scraping, APIs, Census data, (natural language processing - some basics)
- Then: big data, geo data science in the wild, dashboarding & web servers, machine learning

*The final project will ask you to combine several of these topics/techniques to analyze a data set and produce a web-based data visualization*

## Today: web scraping & APIs

- Why web scraping? 
- Getting familiar with the Web
- Web scraping: extracting data from static sites
- APIs (weather, Google, etc.)
- (We won't cover dynamic content)

## What is web scraping? 

Using software to gather and extract data/content from websites

## Why is web scraping useful? 

- Not every data source provides an API
- The Web contains **a lot** of information
- Unique data sources that may not be available elsewhere

## What is possible: 11 million rental listings from Craigslist


<center>
<img src="imgs/scraping-craigslist.jpeg" width=500></img>
</center>

[Source: Geoff Boeing](https://geoffboeing.com/2016/08/craigslist-rental-housing-insights/)

## Why isn't web scraping incredibly popular?

- It can be time-consuming and difficult to extract large volumes of data
- You are at the mercy of website maintainers — if the website structure changes, your code breaks
- Most importantly, there are ethical and legal concerns

<center>
    <img src="imgs/google-search.png" width=700></img>
</center>

## Legal concerns

RadPad scraped the entirety of Craigslist; Craigslist sued RadPad and was [awarded $60.5 million](http://labusinessjournal.com/news/2017/apr/14/radpad-ordered-pay-605-million-judgment-craigslist/)

<center>
<img src="imgs/radpad.png" width=400></img>
</center>

## Two types of legal issues

1. Copyright infringement
    - For example: pictures, rental listing text
2. Terms of Use violations
    - **Unauthorized**: Is scraping prohibited in the website’s terms of use?
    - **Intentional**: Was the person aware of the terms? Did they check an “I agree to these terms” box?
    - **Causes damage**: Did the scraping overload the website, blocking user access?


## Web scraping public sites is legal

- A 2022 ruling by the U.S. Ninth Circuit Court of Appeals said that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act
- The case arose from a dispute between LinkedIn and hiQ Labs, a company that scraped publicly available information from user profiles


[More info on the case](https://techcrunch.com/2022/04/18/web-scraping-legal-court/)

## Some more problematic use cases

- The facial recognition startup Clearview AI scraped billions of photos from social media websites. They [recently settled](https://www.mediapost.com/publications/article/389525/facial-recognition-company-clearview-settles-priva.html) a class action lawsuit that alleged they violated privacy laws
- Web scraping at a massive scale has been a key ingredient in generating the training datasets for generative AI models like ChatGPT. Companies like OpenAI and Meta [have been sued](https://apnews.com/article/openai-lawsuit-authors-grisham-george-rr-martin-37f9073ab67ab25b7e6b2975b2a63bfe) by authors and other content creators for violating copyright laws.


## When is web scraping probably okay?

- .gov sites and, to a lesser degree, .edu sites
- Website owner has no business reason to protect the information
- Not prohibited in terms of use
- Limited number of requests
- Not too many requests all at once
- Done at night, when web traffic is low
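
One simple way to respect the "limited number of requests" and "not too many requests all at once" guidelines is to pause between requests. The sketch below is only an illustration, not a complete politeness policy. The helper name `fetch_politely`, its `delay_seconds` parameter, and the stand-in `fake_fetch` function are assumptions made for this example; in practice, `fetch` would be something like `requests.get`.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL (for example, requests.get).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Hypothetical stand-in for a real request, so the sketch runs offline
def fake_fetch(url):
    return f"content of {url}"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

A fixed delay is the simplest choice; an appropriate value depends on the site and on how many pages you need.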


## When is it less likely to be okay?

- Search engines
- E-commerce sites (e.g. Zillow, Expedia, Amazon)
- Social media
- Prohibited in terms of use
- Large number of requests
- High frequency of requests

**With that being said, let's do some web scraping...**

## A primer on Web definitions

So many acronyms:

- HTML
- CSS
- The DOM (Document Object Model; used for dynamic content, not covered this year)
  

## 1. HTML: HyperText Markup Language

- The language most websites are written in
- The browser knows how to read this language and renders the output for you
- HTML is what a web crawler will see

### HTML tags

- There is a standard set of tags to define the different structural components of a webpage
- For example: 
    - `<h1>`, `<h2>` tags define headers
    - `<p>` tags define paragraphs
    - `<ol>` and `<ul>` are ordered and unordered lists

### Jupyter notebooks can render HTML

Use the `%%html` magic cell command

In [39]:
%%html

<html>
  <head>
    <title>TITLE GOES HERE</title>
  </head>
  <body>
    <h3>MAIN CONTENT GOES IN THE BODY TAG</h3>
    <p>This is a paragraph tag</p>
    <p>This is a second paragraph tag</p>
  </body>
</html>

### Elements, tags, and attributes

Learning the notation:

In [40]:
%%html

<a id="my-link" style="color: orange;" href="https://www.design.upenn.edu" target="_blank">This is my link</a>

**The element:** ![](imgs/atag-1.png)

**The tag:**
![](imgs/atag-2.png)

**The attributes:**

![](imgs/atag-3.png)
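
The same three pieces can also be inspected from Python. The following is a minimal sketch, assuming the `beautifulsoup4` package (introduced later in these notes) is installed; the variable names `demo_soup` and `a_tag` are chosen just for this example.

```python
from bs4 import BeautifulSoup

html = (
    '<a id="my-link" style="color: orange;" '
    'href="https://www.design.upenn.edu" target="_blank">This is my link</a>'
)

demo_soup = BeautifulSoup(html, "html.parser")
a_tag = demo_soup.select_one("a")

print(a_tag.name)   # the tag name
print(a_tag.attrs)  # the attributes, stored as a dictionary
print(a_tag.text)   # the text content of the element
```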

### Some attributes have special meaning

- In particular: `id` and `class`
- Allows you to: 
    - select and manipulate specific elements
    - apply styling to specific types of elements

## 2. CSS: Cascading Style Sheets

- A language for styling HTML pages
- CSS styles are applied to HTML elements through *selectors*, which match elements by tag name, class, or ID

<center>
    <img src="imgs/css.png" width="1200"></img>
</center>

### Basic Web selectors

- Class
    - e.g., `.red`
- ID
    - e.g., `#some-id`
- Tag
    - e.g., `p`, `li`, `div`

- **IDs:** unique identifiers
    - no two elements on a page should share the same ID
- **Classes:** not unique
    - many elements will have the same class
    - a single element can have multiple classes
    
And many more: look up the syntax when you need it

[https://www.w3schools.com/cssref/css_selectors.asp](https://www.w3schools.com/cssref/css_selectors.asp)
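
To make the three basic selector types concrete, here is a small sketch that applies them to a made-up HTML snippet. It assumes the `beautifulsoup4` package (introduced later in these notes) is installed; the HTML content and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="some-id">
  <p class="red">First paragraph</p>
  <p class="red bold">Second paragraph</p>
  <p>Third paragraph</p>
  <ul><li class="red">A list item</li></ul>
</div>
"""

demo = BeautifulSoup(html, "html.parser")

by_class = demo.select(".red")         # every element with class "red"
by_id = demo.select("#some-id")        # the element with id "some-id"
by_tag = demo.select("p")              # every <p> element
tag_and_class = demo.select("p.red")   # <p> elements that also have class "red"

print(len(by_class), len(by_id), len(by_tag), len(tag_and_class))
```

Note that the second paragraph has two classes (`red` and `bold`) and is still matched by `.red`.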

### Inspecting a webpage

- Modern web browsers provide tools for inspecting the source HTML and DOM of websites
- These tools also show the data sources that the page has loaded
- Inspecting the page should be your first step when starting to scrape it

::: {.callout-tip}
To load the Web Inspector in most modern browsers, you can simply press F12
:::

![](imgs/web-inspector-1.png)

### The Elements tab

- Allows you to inspect the DOM directly
- The tool that will allow you to identify what data you want to scrape from a website

![](imgs/web-inspector-2.png)

## Web scraping demo: Philadelphia Health Inspections

Let's scrape data for restaurant inspections using the searchable database maintained by the Philadelphia Inquirer, available at: [https://data.inquirer.com/inspections/](https://data.inquirer.com/inspections/)

![](imgs/clean-plates.png)

### Getting the HTML content

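We'll use the "requests" package to request the content of the website and load it into Python. Note that "requests" is a third-party library rather than a built-in module, so it must be installed in your environment.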

In [1]:
import requests
import pandas as pd

Use a "get" request to get the content:

In [2]:
url = "https://data.inquirer.com/inspections/"
r = requests.get(url)

In [3]:
type(r)

requests.models.Response

In [4]:
r.status_code

200
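
A status code of 200 means the request succeeded; codes in the 400s and 500s signal client and server errors. Rather than checking the code by hand, one option is to let `requests` raise an exception on failure. The sketch below assumes a helper named `fetch_html` and a 10-second `timeout`, both chosen for illustration; `raise_for_status()` is the `requests` method that raises `requests.HTTPError` for error codes.

```python
import requests


def fetch_html(url, timeout=10):
    """Request a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text


# Example usage (commented out so the sketch does not hit the network):
# html = fetch_html("https://data.inquirer.com/inspections/")
```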

### BeautifulSoup makes this much more manageable

[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) makes it much easier to extract out different parts of a website.

In [8]:
from bs4 import BeautifulSoup

Initialize the "soup" object, using the content of our get request:

In [9]:
soup = BeautifulSoup(r.content, 'html.parser')

### Making the HTML "pretty"

In [10]:
#print(soup.prettify())

This is what you'll see if you use the Web Inspector

### How to extract the content we want?

**Two important functions**

1. `soup.select_one(selector)`: finds the first element matching the selector query and returns **one** element
1. `soup.select(selector)`: finds **all** elements matching the selector 
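
The difference between the two functions is easiest to see on a tiny example. This is a minimal sketch on made-up HTML, assuming `beautifulsoup4` is installed; the class names `item` and `nope` are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">one</li>
  <li class="item">two</li>
  <li class="item">three</li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

first = demo.select_one(".item")         # a single element (the first match)
everything = demo.select(".item")        # a list of all matching elements
missing_one = demo.select_one(".nope")   # None when nothing matches
missing_all = demo.select(".nope")       # an empty list when nothing matches

print(first.text)
print([tag.text for tag in everything])
print(missing_one, len(missing_all))
```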

**Recommended reading:** Note on beautiful soup and css selectors in [this week's repository](https://github.com/MUSA-550-Fall-2023/week-6/blob/master/css-selectors.md)

### To the Web Inspector!

We can use the web inspector to understand the structure of the website and identify the HTML tags that we want to extract content from.

### Let's select the first row

Web browsers will let us copy the CSS selector for individual elements.

Use: **Right Click > Copy > Copy Selector**

In [11]:
selector = "#inspection_unit_0"

In [12]:
# Select the first row
# NOTE: we are using "select_one()" to select only one matching element
first_row = soup.select_one(selector)

In [13]:
first_row

<div class="inspectionUnit inspectionUnitEven transitionBackground" id="inspection_unit_0"><a href="https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124"><div class="inspectionUnitInner"><div class="inspectionNameWrapper"><div class="inspectionUnitName transitionAll">Bamba and Saduty Produce Market</div><div class="inspectionUnitDate"><span class="inspectionUnitDateTitle">Inspection date:</span> Apr 18, 2024</div><div class="clearAll"></div></div><div class="inspectionUnitInfoWrapper"><div class="inspectionUnitAddress">4603 FRANKFORD AVE 19124</div><div class="inspectionUnitNeigborhood"></div><div class="clearAll"></div></div><div class="inspectionUnitCountWrapper"><span class="inspectionCountLabel">Violations</span><li class="inspectionUnitCount inspectionUnitCountFoodborne inspectionUnitCountFirst"><span class="inspectionCountNumber">5</span><span class="inspectionUnitInfoItemTitle"><span class="inspectionUnitInfo

### But we need all of the rows!

When you use **Copy > Copy Selector**, the copied CSS selector will match only the specific element you highlighted and no others!

#### Generalizing your selectors

We need to **generalize the selector** to just select all rows from the table, not just the first one. To do this, we'll need to go back to the web inspector and understand the structure of the website. 


When trying to identify a general selector, try to look for common patterns, like shared class names or id strings, across the tags you want to extract.


In our case, it looks like the "inspectionUnit" class is shared across all of the row div elements

![](imgs/clean-plates-2.png)
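
The idea of generalizing a selector can be sketched on a simplified stand-in for the page. The markup below is made up to imitate the structure seen in the inspector (the `results` wrapper and the `inspectionUnitOdd` class are assumptions); it is not the real page content.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that imitates the structure of the inspections page
html = """
<div id="results">
  <div class="inspectionUnit inspectionUnitEven" id="inspection_unit_0">Row A</div>
  <div class="inspectionUnit inspectionUnitOdd" id="inspection_unit_1">Row B</div>
  <div class="inspectionUnit inspectionUnitEven" id="inspection_unit_2">Row C</div>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# A copied selector targets one specific element
specific = demo.select("#inspection_unit_0")

# A shared class name matches every row
general = demo.select(".inspectionUnit")

# An attribute selector on the shared id prefix is another option
by_prefix = demo.select('div[id^="inspection_unit_"]')

print(len(specific), len(general), len(by_prefix))
```

Both the shared class and the shared `id` prefix are "common patterns" in the sense described above; the class name is usually the more readable choice.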

In [14]:
# Get all tags with the inspectionUnit class name
# Note we are using select() to select ALL elements
rows = soup.select('.inspectionUnit')

In [15]:
len(rows)

50

In [16]:
# get the first row
row = rows[0]

#print(row.prettify())

### Now, let's extract out the content from each row

We'll look for the following items:

1. The link to the full inspection report
1. The name of the restaurant
1. The restaurant address
1. The number of *food-borne violations*


#### 1. The report link

The link is stored as the "href" attribute of the first "a" element:


In [17]:
a = row.select_one("a")

a

<a href="https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124"><div class="inspectionUnitInner"><div class="inspectionNameWrapper"><div class="inspectionUnitName transitionAll">Bamba and Saduty Produce Market</div><div class="inspectionUnitDate"><span class="inspectionUnitDateTitle">Inspection date:</span> Apr 18, 2024</div><div class="clearAll"></div></div><div class="inspectionUnitInfoWrapper"><div class="inspectionUnitAddress">4603 FRANKFORD AVE 19124</div><div class="inspectionUnitNeigborhood"></div><div class="clearAll"></div></div><div class="inspectionUnitCountWrapper"><span class="inspectionCountLabel">Violations</span><li class="inspectionUnitCount inspectionUnitCountFoodborne inspectionUnitCountFirst"><span class="inspectionCountNumber">5</span><span class="inspectionUnitInfoItemTitle"><span class="inspectionUnitInfoItemTitleLabel">Foodborne Illness Risk Factors</span></span></li><li class="inspectionUnitC

Attributes can be extracted from the "attrs" attribute

In [18]:
a.attrs

{'href': 'https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124'}

In [19]:
link = a.attrs['href']

link

'https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124'
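
The `%20` sequences in the link are URL-encoded spaces. If you want to read the pieces packed into the link, the standard library can decode them. The sketch below assumes that the `detail` parameter holds the name and address separated by a `|` character, which is what this one example suggests; the variable names are chosen for illustration.

```python
from urllib.parse import urlparse, parse_qs

report_link = (
    "https://data.inquirer.com/inspections/philly/"
    "?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124"
)

# Split the URL into parts and decode the query string
query = parse_qs(urlparse(report_link).query)
detail = query["detail"][0]

# Assumption: the name and address are separated by a "|" character
name_part, address_part = detail.split("|", 1)

print(name_part)
print(address_part)
```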

#### 2. The restaurant name

Use the "inspectionUnitName" class name to identify the right element.

![](imgs/clean-plates-name.png)

In [20]:
# Use the . to specify class name
name_tag = row.select_one(".inspectionUnitName")

name_tag

<div class="inspectionUnitName transitionAll">Bamba and Saduty Produce Market</div>

In [21]:
name = name_tag.text

name

'Bamba and Saduty Produce Market'

#### 3. The restaurant address

Use the "inspectionUnitAddress" class name to identify the right element.

![](imgs/clean-plates-address.png)

In [22]:
# Use the . to specify class name
addr_tag = row.select_one(".inspectionUnitAddress")

addr_tag

<div class="inspectionUnitAddress">4603 FRANKFORD AVE 19124</div>

In [23]:
address = addr_tag.text

address

'4603 FRANKFORD AVE 19124'

#### 4. The number of food-borne violations

It looks like the count number is within an element with class "inspectionCountNumber". BUT: this class is repeated on the retail violations element as well as the food-borne violations element. So, we'll need to use *nested* selectors

First, select elements with the "inspectionUnitCountFoodborne" class name and then the "inspectionCountNumber" class name.

![](imgs/clean-plates-violations.png)

In [24]:
# The number of foodborne violations
count = row.select_one(".inspectionUnitCountFoodborne .inspectionCountNumber")

int(count.text)

5

If the violations count is zero, there won't be any element that matches the above selector (the website instead uses an "inspectionUnitCountZero" class).

If the element doesn't exist, the `select_one()` function will return `None`.

### Putting it all together

Now, we can put this code into a for loop and extract out the content from every row on the page:

In [25]:
# Store the data from each row
data = []

# Step 1: Get all rows
rows = soup.select(".inspectionUnit")

# Loop over all rows
for this_row in rows:
    
    # Step 2: Get the report link
    # Note: we are using the "this_row" variable from the for loop
    a = this_row.select_one("a")
    url = a.attrs["href"]

    # Step 3: Get the name
    name_tag = this_row.select_one(".inspectionUnitName")
    name = name_tag.text

    # Step 4: Get the address
    addr_tag = this_row.select_one(".inspectionUnitAddress")
    address = addr_tag.text

    # Step 5: Get the violation count
    count_tag = this_row.select_one(".inspectionUnitCountFoodborne .inspectionCountNumber")

    # If there were no matches (None was returned), it means the count was zero
    if count_tag is None:
        count = 0
    else:
        count = int(count_tag.text)

    # Step 6: Save it
    data.append(
        {
            "name": name,
            "address": address,
            "foodborne_count": count,
            "url": url,
        }
    )

In [26]:
#data

In [27]:
# Make a dataframe
scraped_df = pd.DataFrame(data)

Sort by violation count:

In [28]:
scraped_df.sort_values("foodborne_count", ascending=False, ignore_index=True)

|   | name | address | foodborne_count | url |
|---|------|---------|-----------------|-----|
| 0 | Bamba and Saduty Produce Market | 4603 FRANKFORD AVE 19124 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 1 | Fado Pub | 1500 LOCUST ST 19102 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 2 | David Gvinianidze Arts and Music Center | 716 RED LION RD 19115 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 3 | DiBruno Brothers | 1730 CHESTNUT ST 19103 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 4 | Hayashi Sushi & Poke | 814 S 47TH ST 19143 | 4 | https://data.inquirer.com/inspections/philly/?... |
| 5 | Asaad Halal Gyro/MC Crepes/V03396 | 7300 E ROOSEVELT BLVD 19149 | 4 | https://data.inquirer.com/inspections/philly/?... |
| 6 | A & S Deli | 2848 S 17TH ST 19145 | 4 | https://data.inquirer.com/inspections/philly/?... |
| 7 | Carangi Baking Company | 2655 S ISEMINGER ST 19148 | 3 | https://data.inquirer.com/inspections/philly/?... |
| 8 | Espinal and Ramos Grocery | 2000 MEDARY AVE 19138 | 2 | https://data.inquirer.com/inspections/philly/?... |
| 9 | DiBruno Brothers Events & Catering | 435 FAIRMOUNT AVE 19123 | 2 | https://data.inquirer.com/inspections/philly/?... |


### See any restaurants you recognize?

## Web scraping exercise

Use the Web Inspector to inspect the structure of the relevant web page, and identify the HTML content you will need to scrape with Python.


### How many millions of people are currently experiencing drought?

Relevant URL: [https://www.drought.gov/current-conditions](https://www.drought.gov/current-conditions)

**Hint:** We're interested in just a single HTML element so you can inspect the website, identify the right element, and copy the selector for the element.

In [29]:
# Make the request
url = "https://www.drought.gov/current-conditions"
response = requests.get(url)

# Initialize the soup for this page
soup2 = BeautifulSoup(response.content, "html.parser")

In [30]:
selector = "#block-uswds-drought-content > div > div > div.grid-container.grid-container--standard.padding-top-6 > div:nth-child(5) > div > div:nth-child(3) > div > div > div.u--color--accent.text-center.font-sans-xl.field.field--name-field-number-stat.field--type-string.field--label-hidden"

In [31]:
soup2.select_one(selector).text

'150.3 Million'
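
The scraped value is a string, so it needs one more step before it can be used as a number. Below is a minimal sketch of that conversion; the function name `parse_people_count` and the set of supported unit words are assumptions for illustration, and the function only handles text in the "number, space, unit" form shown above.

```python
def parse_people_count(text):
    """Convert a string like '150.3 Million' into a number of people."""
    multipliers = {
        "thousand": 1_000,
        "million": 1_000_000,
        "billion": 1_000_000_000,
    }
    # Split into the numeric part and the unit word
    number, unit = text.strip().split()
    return round(float(number) * multipliers[unit.lower()])


print(parse_people_count("150.3 Million"))
```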

### Scrape the Weitzman School directory

The Weitzman School lists its directory of people on this page: [https://www.design.upenn.edu/people/list](https://www.design.upenn.edu/people/list). From this site, let's extract the following information for each person:

- name
- title
- associated department

The info we want for each person is wrapped up in a `<div>` element. You can select all of those elements, loop over each one in a "for" loop, extract the three pieces of content we want from each `<div>`, and then save the result to a list.



In [32]:
# Make the request
url = "https://www.design.upenn.edu/people/list"
response = requests.get(url)

# Initialize the soup for this page
soup3 = BeautifulSoup(response.content, "html.parser")

In [33]:
# Select all rows
rows = soup3.select(".views-row")

In [34]:
len(rows)

578

In [35]:
# Extract out specific content from each row
print(rows[0].prettify())

<div class="views-row">
 <a class="list-item profile-item" href="/people/zhan-shi">
  <span class="text">
   <span class="title heading-5">
    Zhan Shi
   </span>
   <span class="meta body-small">
    Ph.D. student
   </span>
  </span>
  <span class="dept body-subhead">
   Thermal Architecture Lab
  </span>
  <span class="arrow">
   <span aria-hidden="true" class="fa fa-arrow-right">
   </span>
  </span>
 </a>
 <div class="views-field views-field-edit-node-1 edit">
 </div>
</div>



In [36]:
# Save content here
data = []

# Loop over all rows
for row in rows:
    # Person name
    person = row.select_one(".title").text

    # Title
    title = row.select_one(".meta").text

    # Department
    dept = row.select_one(".dept").text.strip()

    data.append({"person": person, "title": title, "dept": dept})


data = pd.DataFrame(data)

data

|   | person | title | dept |
|---|--------|-------|------|
| 0 | Zhan Shi | Ph.D. student | Thermal Architecture Lab |
| 1 | Tom Abel | External Faculty Collaborator | Center for Environmental Building & Design |
| 2 | Dr. Mostafa Akbari | PhD Architecture Alum, 2024 | Architecture |
| 3 | Masoud Akbarzadeh | Assistant Professor of Architecture | Architecture |
| 4 | Scott Aker | Lecturer | Architecture |
| ... | ... | ... | ... |
| 573 | Cynthia Zhou | Animator Researcher//Design and Fine Arts | Penn Animation as Research Lab |
| 574 | Emily Zimmerman | Guest Curator and Visiting Critic 2024-2025 | Fine Arts |
| 575 | Jessica Zofchak | Lecturer | Architecture |
| 576 | Syd Zolf | Artist in Residence CPCW & GSWS, FNAR Lecturer | Fine Arts |
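
Calling `.text` directly on the result of `select_one()` fails with an `AttributeError` if a row is missing one of the elements, because `select_one()` returns `None` in that case. A more defensive variant is sketched below on a small offline sample: the first row is taken from the output above, while the second row (a person with no department) and the helper name `text_or_none` are made up for illustration.

```python
from bs4 import BeautifulSoup

# The first row from the directory page, plus a made-up row with no department
html = """
<div class="views-row">
 <a class="list-item profile-item" href="/people/zhan-shi">
  <span class="text">
   <span class="title heading-5">Zhan Shi</span>
   <span class="meta body-small">Ph.D. student</span>
  </span>
  <span class="dept body-subhead">Thermal Architecture Lab</span>
 </a>
</div>
<div class="views-row">
 <a class="list-item profile-item" href="/people/example-person">
  <span class="text">
   <span class="title heading-5">Example Person</span>
   <span class="meta body-small">Lecturer</span>
  </span>
 </a>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


demo = BeautifulSoup(html, "html.parser")
people = [
    {
        "person": text_or_none(row, ".title"),
        "title": text_or_none(row, ".meta"),
        "dept": text_or_none(row, ".dept"),
    }
    for row in demo.select(".views-row")
]
print(people)
```

Missing values come through as `None`, which pandas will treat as missing data when the list is converted to a DataFrame.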
