## What is web scraping?
Broadly, web scraping is the process of automatically pulling data from websites.
There are a few distinctions which define the problem:

-   *Static vs. dynamic* web pages require different packages to scrape.
    -   For example, Selenium is a common library for scraping dynamic pages, and Scrapy is a common library for scraping static pages.

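
As a rough sketch of the distinction, assuming a static page: the data already sits in the HTML the server returns, so parsing that text is enough. The snippet below uses Python's standard-library `html.parser` on a made-up HTML string (the table, the `CellCollector` class, and the values are hypothetical, not taken from any real site). On a dynamic page the same values would only appear after a browser runs the page's JavaScript, which is where a browser-automation tool such as Selenium comes in.

```python
from html.parser import HTMLParser

# Hypothetical static page: the data is already present in the HTML source.
STATIC_HTML = """
<table>
  <tr><td>State A</td><td>100</td></tr>
  <tr><td>State B</td><td>200</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only text that sits inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


parser = CellCollector()
parser.feed(STATIC_HTML)
print(parser.rows)
```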

## When is it worth it?

-   The data are produced regularly in a highly standardized format.

-   Getting the data quickly is valuable.

-   No download is available, and the data provider cannot be reached.

-   The page format stays consistent over time.

## What makes it hard?

-   Scrapers take time to code.

-   Sites might change and break the scraper.

-   Site maintainers do not always welcome scraping.

-   Difficulty varies from site to site and is hard to predict.

# Example 1. Scraping state education sites

## Demonstrative example {video-loop="true"}

Slides describing the Florida example

# Example 2. Scraping Medicaid enrollment data

## Why Medicaid enrollment data?

Since spring 2023, following the end of the Public Health Emergency, states have been disenrolling Medicaid beneficiaries who no longer qualify.

::: {layout-ncol="1"}
![](images/kff-fig.png)
:::

## Why are the data interesting?

In anticipation of ["the great unwinding,"](https://www.brookings.edu/articles/medicaid-and-the-great-unwinding-a-high-stakes-implementation-challenge/) many states implemented policy changes to smooth the transition.

To understand the success of these policies, we wanted **time-series enrollment data for all 50 states**... from a Medicaid data system that is largely decentralized.

## Unreadable PDFs abound!

::: {layout-ncol="2"}

![An example from Louisiana](images/louisiana-horrible.png)

![and another from Ohio](images/ohio-horrible.png)
:::

## A sigh of relief...

Why page through PDFs when another organization's RAs can do it for you?

::: {layout-ncol="1"}
![](images/kff-page.png)
:::

## 1. Identify that this is a scrapeable dynamic page

One URL with data you can only get by clicking each option!

::: {layout-ncol="1"}
![](images/kff-page-click.png)
:::

## 2. Confirm the HTML actually contains the data

::: {layout-ncol="1"}
![](images/html-plot.png)
:::
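
One quick way to run this check, sketched under assumptions: save the page source after the page has loaded (for example, from the browser's developer tools) and search it for a few values you can read off the chart. The function name `values_in_html` and the sample HTML are hypothetical; the real KFF markup is not reproduced here.

```python
def values_in_html(page_source, expected_values):
    """Map each expected value to whether it appears verbatim in the HTML."""
    return {value: value in page_source for value in expected_values}


# Hypothetical fragment standing in for saved page source.
sample_source = '<g class="point" aria-label="State A, 12,345 disenrolled"></g>'

# Values read by eye from the rendered chart (made up for this sketch).
found = values_in_html(sample_source, ["12,345", "67,890"])
print(found)
```

If the values you expect are missing, the chart may be drawn from a separate data request or from an image, and a different approach is likely needed.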

## 3. Code for 30 hours! {#fun-slide background-video="images/web-surfer.mp4" background-video-loop="true" background-size="50px"}


```{css, echo=FALSE}
#fun-slide,
#fun-slide h2{
 color: blue;
 font-size: 200px;
 font-style: italic;
 font-family: cursive;
}
```


## 4. Bask in the glow of automated scraping

Whenever new data were released over the following two months, I re-ran [the code](https://github.com/UI-Research/web-scraping/blob/master/kff_unwinding.py) and got a well-formatted Excel file as output.

::: {layout-ncol="1"}
![](images/example-scraped.png)
:::

## Little did I know, trouble was coming

::: {layout-ncol="1"}
![](images/trouble-in-paradise.png)
:::

## What happened?

Two months later, KFF **stopped updating** the dashboard and **changed how existing data were reported** in the graphs.

::: {layout-ncol="1"}
![](images/broken-egg.png)
:::
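
A change like this is easier to catch if each run checks the scraped output against the shape that earlier runs produced. The sketch below is one hypothetical guard, not part of the linked script: `check_scrape`, the column names, and the sample rows are all assumptions made for illustration.

```python
EXPECTED_COLUMNS = {"state", "month", "disenrolled"}  # hypothetical schema


def check_scrape(rows, expected_columns=EXPECTED_COLUMNS, min_rows=1):
    """Raise ValueError if the scraped rows no longer match the expected shape."""
    if len(rows) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = expected_columns - set(row)
        if missing:
            raise ValueError(f"row {i} is missing columns: {sorted(missing)}")
    return rows


# A run that matches the expected shape passes silently.
good = [{"state": "State A", "month": "2023-06", "disenrolled": 100}]
check_scrape(good)

# A run after the site changed what it reports fails loudly instead.
changed = [{"state": "State A", "month": "2023-06", "share_disenrolled": 0.1}]
try:
    check_scrape(changed)
except ValueError as err:
    print(f"Scrape needs attention: {err}")
```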

# Concluding remarks

## Core questions to explore before scraping anything

**Availability of data**

-   Are the data available through other routes?

-   Are the data produced by an organization that is invested in the problem long-term?

**Frequency of scraping**

-   Will I need to scrape the data multiple times?

-   What is the risk that the part of the site I scrape will change?

**Time-value tradeoffs**

-   Is the time spent coding worth the payoff?

-   Will collecting data automatically save time on quality assurance?
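
The first time-value question can be turned into rough break-even arithmetic. Apart from the 30 coding hours from the Medicaid example, the numbers below are placeholders; `manual_hours_per_update` in particular is an assumed figure, and `break_even_runs` is a hypothetical helper, so treat this as a back-of-the-envelope sketch.

```python
import math


def break_even_runs(coding_hours, manual_hours_per_update, scrape_hours_per_update=0.0):
    """Number of updates after which automated scraping has saved time overall."""
    saved_per_update = manual_hours_per_update - scrape_hours_per_update
    if saved_per_update <= 0:
        raise ValueError("scraping must be faster per update than manual collection")
    return math.ceil(coding_hours / saved_per_update)


# 30 coding hours (from the Medicaid example); 2 manual hours per update is assumed.
runs_needed = break_even_runs(coding_hours=30, manual_hours_per_update=2)
print(runs_needed)
```

If the source is likely to stop updating before that many runs, manual collection may be the better choice.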

## Discussion

The remainder of the time is reserved for group discussion!

- Have you ever wondered whether a specific site is scrapeable?

Please contact [Manu Alcala](mailto:malcalakovalski@urban.org) or [Jameson Carter](mailto:jamcarter@urban.org) if you would like to discuss either of these projects or scope whether a use-case is reasonable.