# Intro to Dynamic Web Scraping with R

A (mostly) pain-free way to collect data from Forbes, Amazon, Instagram, and other scraping-resistant sites.

## About Me
- Reporting intern at the [Chronicle of Higher Education](https://chronicle.com/)
- Elections scraper for the Associated Press
- Web developer for Behind the Badge
- Volunteer for the [Data Liberation Project](https://www.data-liberation-project.org/)
- Graduate of the [Lede Program](https://ledeprogram.com/) for data reporting at Columbia
- Longtime scraper with a passion for freeing data from poorly designed / hostile sites

[declanrjb.com](https://declanrjb.com) | [github.com/declanrjb](https://github.com/declanrjb)

## Thank Yous
- Hadley Wickham, creator of rvest and speaker @ NICAR. This talk was inspired by his NICAR session and uses some of the same examples. All code is my own.
- Jeremy Singer-Vine, Data Editor @ NY Times, teacher and mentor to years of Lede students
- Leon Yin, Data Reporter @ Bloomberg, king of undocumented APIs

## Following along
- This talk is available on GitHub
- Scan the QR or visit [tinyurl.com/live-scraping](https://tinyurl.com/live-scraping) to follow along with the slides now or later

<figure>
<img src="assets/live-scraping-qr.png"
     style="width:50%;" 
    />
</figure>

## What is this talk about?
- Webscraping
- Webscraping in the R language
- Webscraping dynamic sites
- All of the above!

## Webscraping 

**noun**

1) Using a computer to programmatically acquire / summarize data from one or more websites

2) The art of ignoring the Terms of Service in order to get a bunch of information someone else doesn't want you to have (for the public good!)

<figure>
<img src="assets/forbes-header.png"
     style="width:100%" 
    />
</figure>

<figure>
<img src="assets/forbes-code.png"
     style="width:100%" 
    />
</figure>

<figure>
<img src="assets/diagram1.png"
     style="width:100%" 
    />
</figure>

<figure>
<img src="assets/diagram2.png"
     style="width:100%" 
    />
</figure>

## Part the First

Convincing a server to just give us the page

| Code | Human | Output |
| ---- | ----- | ------ |
| `rank <- 4` | Remember the number 4 as 'rank'. When I ask you for rank, tell me it's 4. | |
| `rank %>% sqrt()` | Look up that rank I asked you to remember (it's 4). Take that and feed it into the calculation you know for taking the square root of a number. | 2 |
| `x <- rank %>% sqrt()` | Recall what rank is. Take the square root. Remember the result as 'x'. | |
| `x` | Remind me what x was again? | 2 |

## Installations
- Google Chrome browser (rvest drives it headlessly behind the scenes through the chromote package)
- If using R for the first time, the [latest version of R](https://www.r-project.org/)
- (Recommended) [RStudio](https://posit.co/download/rstudio-desktop/), an R-focused IDE.

## Writing the script (finally)

- Let's install the libraries we need
- Large packages of pre-written code that do things we want to do (first rule of coding: copy and paste with abandon)
- You can follow along with these steps at [tinyurl.com/live-scraping](https://tinyurl.com/live-scraping)

In [1]:
# General tools for the R language
install.packages("tidyverse")
# Web scraping tools
install.packages("rvest")


The downloaded binary packages are in
	/var/folders/7d/nwvg_sj134x6bj58g27_zxp40000gn/T//RtmpvQx5TZ/downloaded_packages


Don't forget to load the libraries at the start of your script

In [2]:
library(tidyverse)
library(rvest)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘rvest’

The following object is masked from ‘package:readr’:

    guess_encoding




In [3]:
# retrieve the page
page <- read_html("https://www.forbes.com/top-colleges/")

Did it work?

In [4]:
page

{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<div id="__next">\n<div class="ForbesHeader_mainHeader__XuFcZ"><h ...

Yes!

## Part the Second

Turning a page into data

<figure>
<img src="assets/forbes_table-selection.png"
     style="width:100%" 
    />
</figure>

## Choosing a good selector

- Typically best to use the most specific selector available (here, the class `ListTable_listTable__-N5U5`)
- Be wary of ids/classes that look randomly generated or altered
- Standard tags (`table`) make for more resilient scripts
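
To make the trade-off concrete, here is a small sketch that runs offline. The HTML fragment is made up to mimic the Forbes table and is not the real page. The attribute-prefix selector is one possible middle ground, not something the talk itself uses.

```r
library(rvest)

# A made-up page fragment that mimics the structure of the real table
page <- minimal_html('
  <table class="ListTable_listTable__-N5U5">
    <tr><th>Rank</th><th>Name</th></tr>
    <tr><td>1</td><td>Princeton University</td></tr>
  </table>
')

# Tag selector: matches any <table>, so it survives a class-name change
by_tag <- page |> html_elements("table")

# Class selector: more specific, but breaks if the generated suffix changes
by_class <- page |> html_elements(".ListTable_listTable__-N5U5")

# Attribute-prefix selector: matches on the stable start of the class name
by_prefix <- page |> html_elements('table[class^="ListTable_listTable"]')

# How many elements did each selector find?
length(by_tag)
length(by_class)
length(by_prefix)
```

All three find the same table here. They only start to differ when the site changes its generated class names or adds a second table.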

In [5]:
page <- read_html("https://www.forbes.com/top-colleges/")
page |> html_nodes("table")

{xml_nodeset (0)}

<figure>
<img src="assets/forbes_table-selection.png"
     style="width:100%" 
    />
</figure>

## This is a dynamic site

(shudder)

In [16]:
page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
page |> html_node("table")

{html_node}
<table class="ListTable_listTable__-N5U5">
[1] <thead><tr>\n<th class="ListTable_tableColumn__JG0zP TableHead_sortableCo ...
[2] <tbody>\n<tr class="ListTable_tableRow__P838D ListTable_activeRow__-1d4o" ...
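
The fixed `Sys.sleep(1)` in the cell above assumes the table renders within one second. A possible alternative is to poll until the content shows up, sketched below. `wait_until` is a hypothetical helper, not an rvest function, and the counter is only a stand-in for a slow page so the example runs without a browser.

```r
# Hypothetical helper (not part of rvest): call `check` repeatedly until it
# returns TRUE or `timeout` seconds have passed. Returns TRUE on success and
# FALSE if the deadline is reached first.
wait_until <- function(check, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in for a slow page: it only reports "ready" on the third look
looks <- 0
ready <- wait_until(function() {
  looks <<- looks + 1
  looks >= 3
}, timeout = 5, interval = 0.1)

ready
looks
```

With a live page, the check might be `function() length(page |> html_elements("table")) > 0`, assuming `html_elements()` behaves on the live page the way `html_node()` does in these cells.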

In [7]:
page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
page |> html_node("table") |> html_table()

| Rank | Name | State | Type | Av. Grant Aid | Av. Debt | Median 10-year Salary | Financial Grade |
| ---- | ---- | ----- | ---- | ------------- | -------- | --------------------- | --------------- |
| `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Princeton University | NJ | Private not-for-profit | $59,792 | $7,559 | $189,400 | A- |
| NA | Princeton University is a private Ivy League research university located in Princeton, New Jersey. As the fourth oldest college in the United States, Princeton has a deep history that spans 276 years. The university offers 37 degree concentrations and over 50 interdepartmental certificate programs, with some of the most popular majors being social sciences, engineering, public administration, and social service professions. Princeton provides generous financial aid, covering 100% of tuition and housing for families earning up to $65,000 for the class of 2026. In 2022, Princeton reported an endowment valued at $35.8 billion to help fund university-wide research, service programs, and financial aid to help students graduate debt free. Campus traditions allow Ivy League sports rivalries to live on—students celebrate defeating Harvard and Yale in the same season by creating a bonfire by burning crates decorated with the rival team’s colors and logos. Keeping with tradition, students conclude the end of each course with a round of applause in appreciation to the professor once the final lecture is given, writes Amélie Lemay, who described her 2021 freshman semester at Princeton University. View ProfileAV. GRANT AID$59,792 | (same description repeated) | (same description repeated) | (same description repeated) | (same description repeated) | (same description repeated) | (same description repeated) |
| 2 | Stanford University | CA | Private not-for-profit | $60,619 | $12,999 | $177,500 | A+ |
| 3 | Massachusetts Institute of Technology | MA | Private not-for-profit | $45,591 | $13,792 | $189,400 | A |
| 4 | Yale University | CT | Private not-for-profit | $63,523 | $4,926 | $168,300 | A+ |
| 5 | University of California, Berkeley | CA | Public | $21,669 | $7,238 | $167,000 | |
| 6 | Columbia University | NY | Private not-for-profit | $61,061 | $16,849 | $156,000 | A+ |
| 7 | University of Pennsylvania | PA | Private not-for-profit | $57,175 | $12,499 | $171,800 | A+ |
| 8 | Harvard University | MA | Private not-for-profit | $61,801 | $9,004 | $171,400 | A |
| 9 | Rice University | TX | Private not-for-profit | $51,955 | $10,818 | $152,100 | A |


In [10]:
library(tidyverse)
library(rvest)
page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
df <- page |> html_node("table") |> html_table()

## Part the Third

Data cleaning (eating your vegetables)

<figure>
<img src="assets/messy-dataframe.png"
     style="width:100%" 
    />
</figure>

In [12]:
df <- df |> filter(!is.na(Rank))
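
The filter above removes the description rows, but the money columns are still text (`<chr>`) and won't sort or sum as numbers. One possible next step is sketched below with two rows copied from the scraped output. `sample_df` and `clean_df` are names invented for this example, and `parse_number()` comes from readr, which is part of the tidyverse.

```r
library(dplyr)
library(readr)

# Two rows copied from the scraped output, with money stored as text
sample_df <- tibble(
  Rank = c(1L, 2L),
  Name = c("Princeton University", "Stanford University"),
  `Av. Grant Aid` = c("$59,792", "$60,619"),
  `Av. Debt` = c("$7,559", "$12,999")
)

# parse_number() drops the "$" and "," and returns a numeric value
clean_df <- sample_df |>
  mutate(across(c(`Av. Grant Aid`, `Av. Debt`), parse_number))

clean_df$`Av. Debt`
```

The same `mutate()` could be applied to `df` for any column that holds dollar amounts, as long as the column names match what Forbes serves.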

<figure>
<img src="assets/tidy-dataframe.png"
     style="width:100%" 
    />
</figure>

In [13]:
df |> write_csv('forbes-rankings.csv')

<figure>
<img src="assets/sheets.gif"
     style="width:100%" 
    />
</figure>

## Extra credit

Automatic page loops to retrieve rankings beyond the first 50

In [132]:
library(tidyverse)
library(rvest)

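# Pull the table out of the current page and drop the description rows
get_college_rankings <- function(page) {
    df <- page |> html_node("table") |> html_table()
    df <- df |> filter(!is.na(Rank))
    return(df)
}

# Load the first page in a headless browser and give it a second to render
page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
df <- get_college_rankings(page)

# Keep clicking "Next" until the button is disabled.
# html_attr() returns NA while the attribute is absent, i.e. more pages remain.
while (is.na(page |> html_node('button[aria-label="Next"]') |> html_attr("disabled"))) {
    page$click('button[aria-label="Next"]')
    Sys.sleep(1)
    df <- rbind(df, get_college_rankings(page))
}

# Save the combined rankings
write_csv(df, "forbes-rankings.csv")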
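
The `while` condition depends on how `html_attr()` reports a missing attribute. The sketch below isolates that test so it runs offline. The two buttons are made-up markup, not the real Forbes HTML, and `has_next_page` is a hypothetical helper introduced only for this illustration.

```r
library(rvest)

# Made-up markup: one enabled "Next" button and one disabled one
# (the real button's markup may differ)
enabled <- minimal_html('<button aria-label="Next">Next</button>')
disabled <- minimal_html('<button aria-label="Next" disabled="">Next</button>')

# Hypothetical helper wrapping the test used in the while() condition:
# html_attr() returns NA when the attribute is absent
has_next_page <- function(page) {
  attr_value <- page |>
    html_element('button[aria-label="Next"]') |>
    html_attr("disabled")
  is.na(attr_value)
}

has_next_page(enabled)
has_next_page(disabled)
```

A selector that matches nothing also yields `NA`, so if the site renames the button the loop would still try to click it. Capping the number of iterations is a cheap safeguard.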

Complete how-to available at [github.com/declanrjb/live-scraping-intro](https://github.com/declanrjb/live-scraping-intro)

## Questions?