# 0. Intro to "web scraping"

A primer to get us all on the same _page_.

## What is scraping?

Automating the steps of **gathering** information published on the internet and **processing** it into a convenient format for analysis.

 - **Gathering**: Could be as simple as downloading the results of running a query on an API, or as complicated as simulating human interaction with a web form to "type" search parameters, "click" the search button, and save the results.
 - **Processing**: Could be as simple as converting JSON to CSV, or as complicated as using CV to extract text from images and NLP to convert that text into tabular data.
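
To make the "simple" end of processing concrete, here is a minimal sketch of converting JSON to CSV with only the standard library. The JSON payload and its field names are made up for illustration; a real API response will look different.

```python
import csv
import io
import json

# Made-up API response, for illustration only.
raw_json = '[{"month": "April", "year": 2022, "units": 12}, {"month": "May", "year": 2022, "units": 7}]'

records = json.loads(raw_json)

# Write to an in-memory buffer so the example has no side effects.
# For a real file, use open("out.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["month", "year", "units"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```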

## Why do we scrape?

We scrape because we have questions, but the dataset that could help answer these questions is missing.

Sometimes the dataset is intentionally missing.
Courts, for example, have databases full of tantalizing information for researchers and journalists, but they are notoriously stubborn about sharing it.
But they are often obligated to publish some amount of information on the internet, at least for a short time.
We can scrape the court websites to build our dataset.
This can help uncover, for example, the major players in consumer debt collection lawsuits and how prevalent they are.
 
Sometimes the data are piecemeal.
Often we need to collect data from multiple sources to find a more compelling story.
In the court data example, consumer debt is a _national_ issue and there simply isn't a single national database of consumer debt lawsuits.
So we have to scrape one together ourselves.

Sometimes the data just aren't formatted well.
This is less true today, but in the past many websites served static HTML, possibly without a database at all backing them up.
In order to use that data you'd need to download and format it.

## Alternatives to scraping

Ideally, we **don't** scrape.

Scraping can be tedious and time-consuming.
Many websites use tactics to block programmatic access.
It's good to try other means of getting your underlying data first:

 - Check if there's a "download" button. (As of 2022, this is much more common than it used to be.)
 - Ask for access. For non-sensitive / non-antagonistic projects the owners might be happy to provide you an export of their data. They likely have more and better data than you realized just by looking at the site.
 - File a public records request. This is a class unto itself, and it may not pan out the way you hope, but it's worth a shot going through the sanctioned channels.
 
Think of scraping as a scrappy, **last resort** method of getting your dataset, when nothing else works.

## Is scraping legal?

[Yes!](https://techcrunch.com/2022/04/18/web-scraping-legal-court/)

But! Many websites actively discourage scraping, largely for three reasons:

1. Data can be sensitive. Keep this in mind whenever _you_ are scraping data, say from a court. You will have names and details about the court cases -- and just because this is technically "public" information doesn't mean it's widely known, and it can negatively impact individuals to be included in a dataset you build. Be mindful of and careful with personal information.
2. The web isn't free. Web servers cost money and have finite resources. With human users, web servers are cheap and have ample resources. But even a simple scraping bot is capable of making _dozens_ of page requests _per second_. This spike in traffic can make the website unavailable for human users and can also balloon the cost of running the server.
3. Data is gold. The business model of the richest companies in the world is to sell ~your~ their data.

Keep reasons (1) and (2) in mind as you scrape websites. Store your data ethically, censor and redact it where appropriate, and place a rate-limit on your scraping bot to keep the website available for others.

Do what you want with reason (3).

### What does "actively discourage" mean?

Often, some form of CAPTCHA.
Your IP is also logged and subject to banning.
There are ways around these techniques, but we will not cover them here.

# 1. Scoping the project

The first things to identify _before_ you start scraping are:
1. What's your question?
2. What data do you need to answer your question?
3. Where can you get this data?

For today's project, we're going to look at housing and evictions in San Francisco.

San Francisco is one of the most [expensive](https://www.dbresearch.com/PROD/RPS_EN-PROD/PROD0000000000494405.pdf?undefined&realload=IfbV/lNZWzJUGuuE7hHWFmrrGl3IRC7Wm1wixcHx0ltY2AZL6G3khovJo4kh22HV) rental markets in the world.
People often point to the basics of supply and demand to explain this: there aren't enough rental units.
While there are certainly multiple reasons why rental supply is limited, one factor that seems awfully dubious is known as the [_Ellis Act_](https://sftu.org/ellis/).

While California (and San Francisco in particular) typically has strong protections for tenants, under Ellis, landlords can unconditionally evict tenants if they are taking the building out of the rental market.
The law was intended to give landlords an out if they wanted to, say, have a family and stop renting out their downstairs unit.

However, there is a suspicion that real estate speculators have started abusing Ellis to take rental units off the market and flip them as houses.


### Question

How has the Ellis Act been used in San Francisco?

_Big question! We will only be able to begin to address this in this workshop!_


### What data do we want?

Today, we will start by gathering only one statistic:

 - How many evictions have been filed under the Ellis Act
 
To follow up on this work, we'd want to see also:

 - Who has been filing Ellis Act petitions
 - What are the addresses of units that have been Ellised
 - What's the transaction timeline of the houses in question
 
### Where can we get data?

Today we will look at:

 - The [SF Rent Board](https://sfrb.org/monthly-statistics) publishes monthly statistics, including how many units have been Ellised
 
In the future we might also want to see data from:

 - The [SF Assessor/Recorder](https://sfplanninggis.org/pim/?pub=true), who track information about property ownership for tax purposes, including who owns the property and what the property is used for.


## 1.2. Checking out the site

So today we will get data from the SF Rent Board.

Let's check out their statistics site: https://sfrb.org/monthly-statistics

Cool, about 20 years of data!!
Let's check out what format it's in. Click on one of the links, like: https://sfrb.org/sites/default/files/Workload%20Stats%20April%202022.pdf

_Gahhh, a PDF!_

Well, the good news is this form contains an _Ellis_ field, and it looks like it's been filled in every month since ... 2003?
Looks like from 2000-02 they actually published this in HTML! 🙃

### Sometimes there are easier ways

For this site, we're going to have to extract data from PDFs.
No way around it.

For newer sites, it's worth checking the "Network" tab in your browser - there may be a query API "under the hood" that you can interact with directly, without having to process HTML.
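
As a rough sketch of what that can look like: a request you spot in the "Network" tab is usually just a URL with query parameters that returns JSON. The endpoint, parameters, and response body below are all hypothetical, and the example uses only the standard library so it runs offline. Against a real endpoint, `requests.get(API_URL, params=params).json()` would do the fetching and parsing in one step.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, as you might copy them from the Network tab.
API_URL = "https://example.com/api/search"
params = {"q": "ellis", "page": 1}

# This is the URL the browser would have requested.
request_url = f"{API_URL}?{urlencode(params)}"
print(request_url)

# Made-up response body, standing in for what the server would return.
sample_body = '{"total": 2, "results": [{"id": 1, "type": "ellis"}, {"id": 2, "type": "ellis"}]}'

# JSON parses straight into dicts and lists, so there is no HTML to sift through.
payload = json.loads(sample_body)
ids = [item["id"] for item in payload["results"]]
print(ids)
```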

# 2. Setting up your environment

Python seems like a good choice here.
For highly dynamic websites in which you have to simulate "human" behavior, it's often easier to write scrapers in JavaScript.
But for this project, the Python code will end up being simpler and cleaner.

We'll install a few packages to help:

### [`requests`](https://requests.readthedocs.io/en/latest/)

Python includes built-in libraries to make HTTP requests, but they are very _low-level_, meaning they have lots of options and tend to be tedious to use. Most people use a library called `requests` to simplify this process.

### [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/)

There are many ways to process HTML. BeautifulSoup is a library developed specifically for scraping web-pages. It is able to parse messy (and potentially invalid) HTML and provides flexible ways to query for information in the parsed document.

### [`PyPDF2`](https://pypdf2.readthedocs.io/en/latest/)

Python does not include a built-in way to work with PDF files, so we use this 3rd-party library to help.

In [None]:
pip install requests beautifulsoup4 PyPDF2

# 3. Scraping the index

Time to start scraping!

Our first goal is to get a list of all the documents that contain the information we want.
Very commonly when scraping websites, we run into this kind of "index" page that contains links to the actual content.
On the SFRB site, this actual content is (mostly) in the form of PDFs.

### Index scraping goals:
1. Fetch the HTML of the "index" page
2. Find links to every PDF containing housing data
3. Download PDFs

## 3.1. Fetch the index page

Use `requests` to download the HTML of the index page.

In [None]:
import requests

URL = "https://sfrb.org/monthly-statistics"
response = requests.get(URL)
print(response.text)

That was easy! but ... what a **_soup!_**

## 3.2. Find PDF links

All that HTML is going to be a pain to sift through.
This is where BeautifulSoup comes in handy.

Let's take a look at some tools BeautifulSoup gives us.

In [None]:
# Import the `BeautifulSoup` class.
# Note that `beautifulsoup4` installs itself as `bs4` so you don't have to type as much.
from bs4 import BeautifulSoup

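# Creating a `BeautifulSoup` object will parse the HTML.
# We name the parser explicitly ('html.parser' ships with Python); otherwise
# BeautifulSoup guesses one and prints a warning.
soup = BeautifulSoup(response.text, 'html.parser')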

# Now `soup` has a few methods for interacting with the parsed HTML.

In [None]:
print("Query for all the `h1` elements:\n")
print(soup.find_all('h1'))

In [None]:
print("Query for all the elements with a `block-title` class:\n")
print(soup.select('.block-title'))

In [None]:
print("Extract the text content of all the `.block-title` elements:\n")
for el in soup.select('.block-title'):
    print(el.text)

`BeautifulSoup` has *lots* more features you may need at some point. For now, let's just find the PDF links.

Using the "inspect" tool in your browser, you want to figure out: what makes the links I'm interested in _unique_, compared with any other link on the page?

In [None]:
# Way too broad!
all_links = soup.select('a')

In [None]:
# This is better, but still too broad ...
li_links = soup.select('li a')

In [None]:
# Even better, but what about those non-PDF links?
table_links = soup.select('table li a')

In [None]:
# This is actually too narrow because they used a CMS at some point!!!
pdf_ext_links = soup.select('a[href*=pdf]')

In [None]:
# What if we use the fact that all the PDF link texts indicate they're PDFs with the `(pdf)`?
table = soup.find('table')
pdf_links = [a for a in table.select('a') if '(pdf)' in a.text]

In [None]:
# That worked! now let's pull the full URL:
urls = [a['href'] for a in pdf_links]

# Oops, except some of those are relative links. So let's normalize it.
base_url = 'https://sfrb.org'

def fix_url(href):
    if href.startswith('/'):
        return base_url + href
    return href

urls = [fix_url(a['href']) for a in pdf_links]

# Great!! But let's make sure we have a good name for the files, too.
urls = {a.text : fix_url(a['href']) for a in pdf_links}

# Well... that's not really a *good* name. Let's clean it up a bit more.
import re
def fix_name(text):
    # Split the text after the four digit year, since the end is just junk.
    parts = re.split(r'(\d{4})', text)
    # Recombine the month and the year parts from the split result.
    return parts[0] + parts[1] + '.pdf'

urls = {fix_name(a.text): fix_url(a['href']) for a in pdf_links}

print(f"Found {len(urls)} PDFs to download!")
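
The `fix_url` helper above only handles hrefs that start with `/`. As an alternative sketch, the standard library's `urljoin` resolves any kind of relative link against the page it was found on. The hrefs below are made up to cover the common cases; they are not taken from the real page.

```python
from urllib.parse import urljoin

PAGE_URL = "https://sfrb.org/monthly-statistics"

# Made-up hrefs covering the cases you might meet on an index page.
hrefs = [
    "/sites/default/files/example.pdf",            # root-relative
    "https://sfrb.org/sites/default/files/a.pdf",  # already absolute
    "files/b.pdf",                                 # relative to the current page
]

# `urljoin` leaves absolute URLs alone and resolves everything else.
absolute = [urljoin(PAGE_URL, href) for href in hrefs]
for url in absolute:
    print(url)
```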

## 3.3. Download the PDFs

Now we just need to loop through the PDF links and save them.

Remember, we are going to make over 200 requests to the server to download documents.
A burst of activity like this from a single IP address is a red flag for the site.
It's very possible they will ban your IP automatically.
It's also possible for you to degrade performance for other users.

You can mitigate this by spacing out your requests a bit.

In [None]:
import time
import os

# Make sure the output directory exists before we write into it.
os.makedirs('raw', exist_ok=True)

# Iterate over every URL we have.
for name, url in urls.items():
    # Get the output path for the data. This will use the `raw/` directory, so that
    # we don't clutter the root directory.
    file_name = os.path.join('raw', name)
    # Skip downloading files we already have (in case of running this multiple times!)
    if os.path.exists(file_name):
        print(f"Already have {name}!")
        continue

    print(f"Downloading {name} ...")
    # Request the URL from the server, and stop with an error if the request failed.
    # We do this *before* opening the file, so a failed download doesn't leave behind
    # an empty file that would be skipped on the next run.
    result = requests.get(url)
    result.raise_for_status()

    # Create a local file with the name we generated.
    #
    # Note that we open the file for *writing* in *binary* mode, since PDFs
    # are binary (not textual) data.
    with open(file_name, 'wb') as fh:
        # Save the raw response in the local file
        fh.write(result.content)

    # Sleep for a little bit to space out the requests.
    time.sleep(0.25)
    
print("Done!")
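
The fixed `time.sleep(0.25)` in the loop above always waits the full delay, even when the download itself was slow. One possible refinement, sketched here, is to enforce a minimum interval between request _starts_ instead. `Throttle` is a hypothetical helper written for this example, not part of `requests` or any other library, and the loop only pretends to download.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # The clock and sleep functions are injectable so the class is easy to test.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            # Only sleep for whatever part of the interval hasn't elapsed yet.
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.25)
for name in ["a.pdf", "b.pdf", "c.pdf"]:  # stand-ins for the real downloads
    throttle.wait()
    print(f"(pretend) downloading {name}")
```

In the real loop, you would call `throttle.wait()` right before `requests.get(url)` and drop the `time.sleep` at the end.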

# 4. Parsing the PDFs

Now that we have the PDFs, it's time to process them, similarly to how we processed the index page.

`BeautifulSoup` worked great for the HTML, but it won't work for PDFs.
Instead, we will use a library called `PyPDF2`.

The goals for this step are:
1. Extract text from the PDFs
2. Parse Ellis filings from the extracted text
3. Create a CSV file with Ellis info by date


## 4.1. Extracting text from PDFs

Let's look at what the PyPDF2 library gives us.

In [None]:
# Note: `PdfFileReader`, `documentInfo`, `getNumPages()` and `getPage()` are the old
# PyPDF2 names and were removed in PyPDF2 3.0. The names below work in PyPDF2 2.x and 3.x.
from PyPDF2 import PdfReader

reader = PdfReader(os.path.join('raw', 'April 2022.pdf'))

In [None]:
# The PDF contains some metadata about its creation.
# (`metadata` can be None if the PDF doesn't include any.)
print("Metadata:")
for key, value in (reader.metadata or {}).items():
    print(f"  {key}: {value.get_object()}")

In [None]:
# It also can tell us how many pages it has:
print(f"Pages: {len(reader.pages)}")

In [None]:
# There's a lot more information that can be helpful, but let's
# just look directly at the text of that single page in the document:
text = reader.pages[0].extract_text()
print("Text:")
print(text)

More soup!

## 4.2. Extracting Ellis filings

The text is kind of messy, but hey, we can see the word "Ellis" and some numbers.
That's very promising!

We can write a regular expression to extract the two numbers after we see the word "Ellis."

In [None]:
import re

# This pattern looks for the string "Ellis)" then finds the first two
# numbers after that.
#
# Notes:
#  1) We make sure to account for the right (closing) parenthesis which we see in the
#     document. It has to be escaped as `\)` because parentheses are special in a regex.
#     If you are working with OCRed documents, you need to be careful and creative about
#     these types of symbols! Our document was created digitally, so we don't have that issue.
#  2) We use a variable number of whitespace \s characters with the + sign. This literally
#     means "one or more," because it isn't clear how many spaces are expected to be there,
#     and it may change between documents.
#  3) We use capturing groups with the parentheses to enclose the "number" patterns: (\d+).
#     This will let us extract the numbers with the `.groups()` method on the match object.
match = re.search(r'Ellis\)\s+(\d+)\s+(\d+)', text)
print(match.groups())

Compare this to what we see in the "April 2022.pdf" document.
That looks correct!
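
To see how the pattern behaves without opening a PDF, here is a small sketch that runs it on two made-up strings. The sample text only imitates the shape of the extracted text; the real layout may differ. It also shows what happens when the pattern is not found, which matters once we run it over every document.

```python
import re

ellis_pattern = re.compile(r'Ellis\)\s+(\d+)\s+(\d+)')

# Made-up snippets imitating extracted PDF text; the real layout may differ.
good_text = "Withdrawal of Units (Ellis)   5     14\nOther   2   3"
bad_text = "No such row in this document"

match = ellis_pattern.search(good_text)
# The captured groups are strings, so convert them if you need numbers.
petitions, units = (int(n) for n in match.groups())
print(petitions, units)

# `search` returns None when nothing matches, so check before calling `.groups()`.
missing = ellis_pattern.search(bad_text)
print(missing)
```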

**Scaling up**

Now let's loop through all the documents and run this pattern.

In [None]:
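from PyPDF2 import PdfReader

# Create a dictionary to store the results of our extraction.
ellis_data = {}

# Compiling the pattern in advance is good practice, but not necessary.
ellis_pattern = re.compile(r'Ellis\)\s+(\d+)\s+(\d+)')

for name in urls.keys():
    print(f"Parsing {name} ...")
    reader = PdfReader(os.path.join('raw', name))
    text = reader.pages[0].extract_text()
    match = ellis_pattern.search(text)
    # `search` returns None if the pattern isn't found, so skip those files
    # instead of crashing on `.groups()`.
    if match is None:
        print(f"  No Ellis numbers found in {name}, skipping.")
        continue
    # Drop the '.pdf' extension. (`name.strip('.pdf')` would strip any of the
    # characters '.', 'p', 'd', 'f' from both ends, not the suffix itself.)
    date = os.path.splitext(name)[0]
    ellis_data[date] = match.groups()

print(ellis_data)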

Now it's looking a lot like the dataset we were seeking!!

## 4.3. Saving results

People usually save data as CSV files.
These are simple and portable.
You can share a CSV and immediately open it in Excel, Google Sheets, Apple Numbers, Python, R, or any old text editor.

**hot take**

The downside of CSV is that it does not store any explicit _type_ information.
So, if your data contains numbers, dates, null values, and so on, you (and anyone you give the data to) will have to figure that out themselves when they open the file.

I personally feel it's more useful to use a structured data format such as parquet, sqlite, or even JSON.
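
A quick sketch of the difference, using a made-up row: after a round trip through CSV every value comes back as a string (and the missing value as an empty string), while JSON keeps the integers and the null.

```python
import csv
import io
import json

# Made-up row with an integer, a string, and a missing value.
row = {"year": 2022, "month": "April", "petitions": 5, "units": None}

# Round-trip through CSV, using an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# Round-trip through JSON.
from_json = json.loads(json.dumps(row))

print(from_csv)   # every value is now a string
print(from_json)  # types are preserved
```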

But for now, we will just use CSV.



In [None]:
import csv

# Our data is currently a Dict that looks like {date: (petitions, units)}.
# To write a CSV, we need a List of rows.
#
# Each row should contain the following columns:
cols = ['year', 'month', 'petitions', 'units']

# Note that we are changing `date` to two columns `(year, month)`. This will
# make the file a bit more flexible, since we could aggregate it by year,
# or look at year-over-year trends.
rows = []
for date, data in ellis_data.items():
    month, year = date.split()
    # Note that all of the values are strings when we create the row.
    # Remember that CSV doesn't store any type information, so a number is
    # stored exactly the same as a string. The only point of trying to convert
    # here would be to help validate the data.
    #
    # If we were using a different file type, we should convert the year,
    # petitions, and units columns to integer types.
    rows.append({
        'year': year,
        'month': month,
        'petitions': data[0],
        'units': data[1],
    })

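# Make sure the output directory exists, then create a CSV file and write the data.
# Passing `newline=''` is what the `csv` module recommends, so that it controls
# line endings itself (otherwise you can get blank rows on Windows).
os.makedirs('data', exist_ok=True)
with open(os.path.join('data', 'ellis.csv'), 'w', newline='') as fh:
    writer = csv.DictWriter(fh, cols)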
    # First write a "header" row so anyone using the file knows what the
    # columns mean.
    writer.writeheader()
    # Now write each row
    for row in rows:
        writer.writerow(row)

print("Done!")

**_Now open `data/ellis.csv` and behold: tabular data!!_**

# 5. Working with the data

The scraping project is officially **done**, but we might as well take a look at what we found!
We have all sorts of power now that we have this table.
Let's just try to do a few basic things:

1. Read data from the CSV
2. Tally how many units have been taken off the market via the Ellis Act in our data
3. Make a chart of Ellis petitions over time

## 5.1. Reading the CSV

Reading the CSV is even easier than writing it, though, again, we need to be careful about our data types.
We know that we're working with integers and we don't have any missing values, so we'll convert the columns here.

In a larger analysis project you would want to use a fully-featured library like `pandas`.


In [None]:
import csv

data = []
with open(os.path.join('data', 'ellis.csv'), 'r', newline='') as fh:
    csv_reader = csv.DictReader(fh)
    for line in csv_reader:
        # Convert the numeric columns to integers.
        # This would fail if we were missing data, or the columns were invalid,
        # So be careful about that!
        line['year'] = int(line['year'])
        line['petitions'] = int(line['petitions'])
        line['units'] = int(line['units'])
        data.append(line)

print("Loaded data!")

## 5.2. Math!

In [None]:
# How many total Ellis petitions?
total_petitions = sum(d['petitions'] for d in data)
print("Total Ellis petitions:", total_petitions)

# How many total units have been Ellised?
total_units = sum(d['units'] for d in data)
print("Total Ellised units:", total_units)

# Average units per petition?
print("Average units per petition:", total_units / total_petitions)

# How many petitions were filed per year?
petitions_by_year = {}
for d in data:
    if d['year'] not in petitions_by_year:
        petitions_by_year[d['year']] = 0
    petitions_by_year[d['year']] += d['petitions']
    
print("Petitions by year:")
ordered_years = sorted(petitions_by_year.keys())
for year in ordered_years:
    print(f"  {year}: {petitions_by_year[year]}")
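
The "check if the key exists, then add" pattern above is common enough that the standard library has a shortcut for it. Here is the same tally sketched with `collections.Counter`, run on made-up rows in the same shape as `data` so it stands alone.

```python
from collections import Counter

# Made-up rows in the same shape as `data` above.
sample = [
    {"year": 2021, "month": "December", "petitions": 3, "units": 9},
    {"year": 2022, "month": "January", "petitions": 2, "units": 4},
    {"year": 2022, "month": "February", "petitions": 5, "units": 11},
]

# A Counter treats missing keys as 0, so no existence check is needed.
by_year = Counter()
for d in sample:
    by_year[d["year"]] += d["petitions"]

for year in sorted(by_year):
    print(f"  {year}: {by_year[year]}")
```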

## 5.3. Visualization!

The by-year table is interesting, but it would be easier to make sense of in a chart.

We'll use a charting library called `matplotlib`.

In [None]:
pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

# We can make a really quick plot by year, since we already have the data.
ordered_petitions = [petitions_by_year[year] for year in ordered_years]
plt.plot(ordered_years, ordered_petitions)

In [None]:
# To make an actual time series, we'll need to convert (year, month) to a date type.
from datetime import datetime

time_series = []
petitions_series = []
units_series = []
for d in data:
    # `strptime` parses dates in the given format.
    # %B means the full month name, like "January"
    # %Y means the 4-digit year, like "2002."
    date = datetime.strptime(f"{d['month']} {d['year']}", "%B %Y")
    # Store all the values in separate series, for the plotting library
    time_series.append(date)
    petitions_series.append(d['petitions'])
    units_series.append(d['units'])
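
One thing to watch for: a line chart connects points in the order they are given, so if the rows are not in chronological order the line will zigzag back and forth. A minimal sketch of parsing and sorting before plotting, using made-up rows that are deliberately out of order (note that `%B` matches month names in the current locale, assumed here to be English):

```python
from datetime import datetime

# Made-up rows, deliberately out of order.
sample = [
    {"year": 2022, "month": "April", "petitions": 5},
    {"year": 2021, "month": "December", "petitions": 3},
    {"year": 2022, "month": "January", "petitions": 2},
]

# Build (date, value) pairs and sort them by date.
points = sorted(
    (datetime.strptime(f"{d['month']} {d['year']}", "%B %Y"), d["petitions"])
    for d in sample
)
dates = [p[0] for p in points]
values = [p[1] for p in points]
print(dates[0], values)
```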

In [None]:
# Show the petitions by month
plt.plot(time_series, petitions_series)

In [None]:
# Show the units Ellised by month
plt.plot(time_series, units_series)