In [2]:
library(tidyverse)
library(rvest)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘rvest’

The following object is masked from ‘package:readr’:

    guess_encoding




# Lecture 11: Web scraping

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand how to import data from online sources by scraping web pages.
</div>

These notes correspond to Chapter 26 of your book.


## Ethics of scraping data online
You should carefully read [Section 26.2](https://r4ds.hadley.nz/webscraping.html#scraping-ethics-and-legalities) of the book concerning various ethical and legal issues surrounding scraping web sites for data. In this class we will only look at large, public web sites like Wikipedia and IMDB, where there is no risk of anything bad happening. However, there are other situations where it may be unethical, or even illegal, to harvest data from a website, even if you are technically able. **As data scientists in the real world, it will be up to you to carefully weigh these concerns before using the tools discussed in today's lecture.**

## Reading data from the Internet
These days, it's increasingly common to pull data from online sources. For example, say I wanted to know the population of European countries. This is [easily found](https://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country) on Wikipedia. How can I get these data into R and analyze them?

## How do web pages work?

Web pages are written in a special language called HTML (**H**yper**t**ext **M**arkup **L**anguage). Here is a simple example of some HTML:

    <html>
    <head> 
      <title>Page title</title>
    </head>
    <body>
      <h1 id='first'>A heading</h1>
      <p>Some text &amp; <b>some bold text.</b></p>
      <img src='myimg.png' width='100' height='100'>
    </body>
    </html>

Web scraping is possible because most web pages have a consistent, hierarchical structure. For example, if I asked you how to navigate to the title of the web page shown above, you would follow the "path"

    html > head > title
    
to arrive at "Page title".
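
The same path can be written as a CSS selector. As a minimal sketch, assuming the `rvest` package loaded at the top of these notes, the example page can be parsed from a string and queried with that path (the names `page` and `page_title` are only for illustration):

```r
library(rvest)

# The example page from above, stored as a string instead of fetched from a URL
page <- read_html("<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>")

# Follow the path html > head > title and pull out the text it contains
page_title <- page %>%
  html_element("html > head > title") %>%
  html_text2()

page_title
```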

## HTML elements

There are a lot of HTML elements that might contain interesting information. Here are a few of the most common:
- Block tags that denote sections of text: `<h1>` (heading), `<p>` (paragraph), `<ul>`/`<ol>` (un)ordered list, etc.
- `<table>` (a table), `<tr>` (a table row), `<td>` (a table cell), etc.
- Each of these elements can contain attributes such as `id=` or `class=`. For example, `<table id="movies">` is probably a table that contains movie information.
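
To make the tags and attributes above concrete, here is a small sketch that selects elements by tag name and by `id`. It uses a made-up HTML fragment (the table contents are placeholders) and `rvest::minimal_html()`, which builds a tiny page from a string so that no network access is needed:

```r
library(rvest)

# A made-up fragment containing a heading, a paragraph, and a table with an id
doc <- minimal_html("
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <table id='movies'>
    <tr><th>title</th><th>year</th></tr>
    <tr><td>The Kid</td><td>1921</td></tr>
    <tr><td>Metropolis</td><td>1927</td></tr>
  </table>
")

# '#first' selects the element whose id is 'first'
heading <- doc %>% html_element("#first") %>% html_text2()

# 'b' selects every <b> element
bold <- doc %>% html_elements("b") %>% html_text2()

# 'table#movies' selects the <table> whose id is 'movies';
# html_table() converts it to a tibble
movies <- doc %>% html_element("table#movies") %>% html_table()
```

Note the difference between `html_element()` (one match per input) and `html_elements()` (all matches).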

The `rvest` package is used to load a web page and extract elements and tables based on their HTML tags. Let's see how it works by scraping the Wikipedia page mentioned earlier:

In [209]:
europop <- read_html("http://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country")

In this page there are many tables:

In [38]:
europop %>% html_elements("table")

{xml_nodeset (17)}
 [1] <table class="wikitable">\n<caption>Population of Europe, in millions, b ...
 [2] <table class="wikitable sortable" style="text-align:right;">\n<caption>P ...
 [3] <table class="wikitable sortable" style="text-align:right;">\n<caption>( ...
 [4] <table class="wikitable sortable" style="text-align: right;">\n<caption> ...
 [5] <table class="wikitable sortable static-row-numbers plainrowheaders srn- ...
 [6] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
 [7] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
 [8] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
 [9] <table class="nowraplinks hlist mw-collapsible autocollapse navbox-inner ...
[10] <table class="nowraplinks navbox-subgroup" style="border-spacing:0"><tbo ...
[11] <table class="nowraplinks navbox-subgroup" style="border-spacing:0"><tbo ...
[12] <table class="nowraplinks navbox-subgroup" style="border-spacing:0"><tbo .

How can we find the correct one? One option is to use our browser to find something that uniquely identifies the table that we want. Alternatively, since there are only 17, we can just look at each table until we find the one we want:

In [46]:
# find the table that contains the population for each country
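
One possible way to narrow down the tables, sketched on a made-up stand-in page rather than the live Wikipedia page: read each table's `<caption>` and keep the table whose caption mentions countries. The captions, country names, and numbers below are placeholders, and the real page may need a different test (for example, checking column names instead).

```r
library(rvest)

# Hypothetical stand-in for a page with several tables
page <- minimal_html("
  <table class='wikitable'><caption>Population by year</caption>
    <tr><th>Year</th><th>Population</th></tr>
    <tr><td>1950</td><td>549</td></tr>
  </table>
  <table class='wikitable sortable'><caption>Population by country</caption>
    <tr><th>Country</th><th>Population</th><th>Area</th></tr>
    <tr><td>Aland</td><td>100</td><td>10</td></tr>
    <tr><td>Borduria</td><td>300</td><td>20</td></tr>
  </table>
  <table class='navbox'><tr><td>Navigation links</td></tr></table>
")

tables <- page %>% html_elements("table")

# One caption per table; tables without a caption give NA
captions <- tables %>% html_element("caption") %>% html_text2()

# Index of the table whose caption mentions 'country'
idx <- which(grepl("country", captions, ignore.case = TRUE))

by_country <- tables[[idx]] %>% html_table()
by_country
```

With the real `europop` object, the same idea would start from `europop %>% html_elements("table")`.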

## 🤔 Quiz

What's the average population density ($\text{persons}/\text{km}^2$) for countries in Europe?

<ol style="list-style-type: upper-alpha;">
    <li>1234.5</li>
    <li>20000.0</li>
    <li>611.8</li>
    <li>6520.5</li>
    <li>101.1</li>
</ol>



In [76]:
# avg pop density
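
A sketch of the calculation, using a toy table with invented countries and numbers. Scraped columns often arrive as text with thousands separators, so `readr::parse_number()` is used to convert them. The column names `Population` and `Area` are assumptions; the real table's names may differ. Also note that "average density" is read here as the mean of the per-country densities, which is not the same as total population divided by total area.

```r
library(dplyr)
library(readr)

# Toy data standing in for the scraped table (all values are made up)
toy <- tibble(
  Country    = c("Aland", "Borduria", "Syldavia"),
  Population = c("1,000,000", "2,500,000", "500,000"),
  Area       = c("10,000", "50,000", "2,500")   # km^2
)

density <- toy %>%
  mutate(
    pop     = parse_number(Population),  # "1,000,000" -> 1000000
    area    = parse_number(Area),
    density = pop / area                 # persons per km^2
  ) %>%
  summarize(avg_density = mean(density))

density
```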

## 🤔 Quiz

Use the same Wikipedia page (Demographics of Europe) to answer the following question:

On average, how many people were born *each day* in Europe between 2010 and 2021 (inclusive)?

<ol style="list-style-type: upper-alpha;">
    <li>90210.10</li>
    <li>23043.97</li>
    <li>7710127</li>
    <li>21123.64</li>
    <li>21109.18</li>
</ol>



In [69]:
# average births per day

In [171]:
# number of days in 2010--2021
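
The number of days can be computed by subtracting dates, which accounts for leap years automatically. The sketch below then divides total births by that count; the yearly birth figures are placeholders, not values from the Wikipedia table.

```r
# Days from 1 Jan 2010 through 31 Dec 2021 (inclusive):
# subtracting two Dates gives a difference in days
n_days <- as.numeric(as.Date("2022-01-01") - as.Date("2010-01-01"))
n_days  # 4383 = 12 * 365 + 3 leap days (2012, 2016, 2020)

# Placeholder yearly births (made-up numbers, one row per year)
births <- data.frame(year = 2010:2021, births = rep(1e6, 12))

births_per_day <- sum(births$births) / n_days
births_per_day
```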

## The Simpsons

The Simpsons is a popular and long-running TV show. How many people still watch The Simpsons? What is its most popular episode?

In [211]:
simpsons <- read_html('https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes_(season_21–present)')

In [255]:
# parse simpsons
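
The episode list is spread over one table per season, so one approach is to convert every season table and stack the results. The sketch below runs on a made-up page. The class name `wikiepisodetable`, the column names, and the episode titles are all assumptions for illustration; inspect the real page to confirm what to select.

```r
library(rvest)
library(dplyr)
library(purrr)
library(readr)

# Hypothetical stand-in: two season tables plus an unrelated navigation table
page <- minimal_html("
  <table class='wikitable wikiepisodetable'>
    <tr><th>Title</th><th>Viewers (millions)</th></tr>
    <tr><td>Episode A</td><td>8.21[1]</td></tr>
    <tr><td>Episode B</td><td>5.93[2]</td></tr>
  </table>
  <table class='wikitable wikiepisodetable'>
    <tr><th>Title</th><th>Viewers (millions)</th></tr>
    <tr><td>Episode C</td><td>1.47[3]</td></tr>
  </table>
  <table class='navbox'><tr><td>links</td></tr></table>
")

episodes <- page %>%
  html_elements("table.wikiepisodetable") %>%
  # convert = FALSE keeps every column as text so the tables stack cleanly
  map(html_table, convert = FALSE) %>%
  bind_rows() %>%
  rename(title = Title, viewers = `Viewers (millions)`) %>%
  # parse_number() drops footnote markers such as "[1]"
  mutate(viewers = parse_number(viewers))

episodes
```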

## 🤔 Quiz

The episode with the largest number of viewers was **Once Upon a Time in Springfield**. Which episode of the Simpsons had the **smallest** number of viewers?


<ol style="list-style-type: upper-alpha;">
    <li>My Octopus and a Teacher</li>
    <li>Treehouse of Horror XXI</li>
    <li>Marge the Meanie</li>
    <li>The D'oh-cial Network</li>
    <li>The Devil Wears Nada</li>
</ol>



In [249]:
# smallest number of viewers
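
Once the episodes are in a single data frame with a numeric viewers column, `slice_min()` and `slice_max()` pick out the extremes. A sketch on placeholder data (the titles and numbers are invented, and `title` and `viewers` are assumed column names):

```r
library(dplyr)

# Placeholder episode data
episodes <- tibble(
  title   = c("Episode A", "Episode B", "Episode C"),
  viewers = c(8.21, 5.93, 1.47)   # millions
)

least_watched <- episodes %>% slice_min(viewers, n = 1)
most_watched  <- episodes %>% slice_max(viewers, n = 1)

least_watched
```

Both functions keep ties by default, so more than one row can come back.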

## IMDB top movies

Let's consider a well-known table: the [top 250 movies on IMDB](https://www.imdb.com/chart/top/).

In [16]:
imdb.250 <- read_html("https://www.imdb.com/chart/top/")

In [254]:
# parse imdb

## Exercise

"The Kid" came out in 1921 and has a rating of 8.2. Another movie that was rated at least as high didn't come out until 1927 (Metropolis), so we could say that The Kid reigned as the #1 film for six years. Metropolis reigned for four years until City Lights (rating 8.4) came out.

Which film reigned for the longest amount of time?

In [None]:
# longest reign
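
One way to approach this, as a sketch: sort the films by year, keep only the films whose rating is at least as high as everything released before them, and measure the gap in years to the next such film. The toy data below uses the three films named above plus a hypothetical lower-rated "Film X"; the rating for Metropolis is a placeholder, since the text only says it is at least 8.2.

```r
library(dplyr)

movies <- tibble(
  title  = c("The Kid", "Film X", "Metropolis", "City Lights"),
  year   = c(1921, 1925, 1927, 1931),
  rating = c(8.2, 8.0, 8.3, 8.4)   # Film X and the 8.3 are placeholders
)

reigns <- movies %>%
  arrange(year) %>%
  # a film starts a reign if its rating matches or beats the running maximum
  filter(rating >= cummax(rating)) %>%
  # the reign lasts until the next record holder is released
  mutate(reign = lead(year) - year) %>%
  arrange(desc(reign))

reigns
```

The most recent record holder gets `NA` because nothing has replaced it yet. Films released in the same year would need a tie-breaking rule that this sketch does not include.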

## Super Bowl TV ratings
We just had the Super Bowl. How have the TV ratings for the Super Bowl changed over the years?

In [126]:
sbtv <- read_html('https://en.wikipedia.org/wiki/Super_Bowl_television_ratings') %>%
    html_elements('table') %>%
    .[[1]] %>%
    html_table()

In [138]:
# viewers over time
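
A sketch of the cleaning and plotting steps, on invented data. The column names `Year` and `Viewers` are assumptions and the numbers are made up; the columns in `sbtv` need to be checked with `names(sbtv)` first. Scraped viewer counts typically contain commas and footnote markers, which `parse_number()` removes.

```r
library(dplyr)
library(readr)
library(ggplot2)

# Toy stand-in for the scraped ratings table (values are invented)
toy <- tibble(
  Year    = c("2001", "2002", "2003"),
  Viewers = c("10,000,000[1]", "12,500,000", "11,000,000[2]")
)

sb <- toy %>%
  mutate(
    year    = as.integer(Year),
    viewers = parse_number(Viewers)  # strips commas and trailing footnotes
  )

p <- ggplot(sb, aes(x = year, y = viewers)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Viewers")

p
```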

How does this compare with other major sports?

- https://en.wikipedia.org/wiki/World_Series_television_ratings
- https://en.wikipedia.org/wiki/NBA_Finals_television_ratings

In [None]:
# super bowl vs world series
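
To compare events, one option is to clean each scraped table into the same two columns and stack them with an identifying label. The sketch below assumes that cleaning has already been done; both data frames hold invented numbers, and the relevant table on each Wikipedia page may not be the first one.

```r
library(dplyr)
library(ggplot2)

# Toy cleaned tables, one per event (all values are invented)
sb <- tibble(year = c(2001L, 2002L), viewers = c(10.0, 12.5))
ws <- tibble(year = c(2001L, 2002L), viewers = c(4.0, 3.5))

# .id records which list element each row came from
ratings <- bind_rows(
  list(`Super Bowl` = sb, `World Series` = ws),
  .id = "event"
)

p <- ggplot(ratings, aes(x = year, y = viewers, color = event)) +
  geom_line() +
  geom_point()

p
```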

## Scraping other types of web data

Here are some examples of other types of web data we can scrape:

### The UofM Stats department
Let's say I wanted to make a table of all the [undergraduate stats courses](https://lsa.umich.edu/stats/undergraduate-students/statistics-courses.html) offered by the department. 

In [148]:
stats <- read_html('https://lsa.umich.edu/stats/undergraduate-students/statistics-courses.html')

How should we extract the data from this web page? We notice from inspecting the page that each course title is a `<b>` (bold) element:

In [161]:
# extract statistics courses
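
A sketch of pulling out the bold elements and splitting each into a course number and a title. It runs on a made-up fragment with placeholder course names, and it assumes each bold string has the form `number: title`, which may not hold for every entry on the real page.

```r
library(rvest)
library(tibble)
library(tidyr)

# Hypothetical fragment imitating the course listing
page <- minimal_html("
  <p><b>STATS 101: First Placeholder Course</b><br>Description text.</p>
  <p><b>STATS 202: Second Placeholder Course</b><br>More description.</p>
")

courses <- tibble(raw = page %>% html_elements("b") %>% html_text2()) %>%
  # split at the first ': '; extra = 'merge' keeps later colons in the title
  separate(raw, into = c("number", "title"), sep = ": ", extra = "merge")

courses
```

On the real page, the same pipeline would start from `stats %>% html_elements("b")`.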

### Reddit
Let's see how to scrape the [UofM Reddit site](https://old.reddit.com/r/uofm):

In [162]:
top.reddit <- read_html('https://old.reddit.com/r/uofm/top/?sort=top&t=all')

Let's plot the top-scoring posts, showing when each was posted and how many votes it has received.

In [121]:
# top posts on r/uofm
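
A sketch of one way to do this, on a made-up fragment. It assumes each post is a `div` with class `thing` that carries `data-score` and `data-timestamp` (milliseconds since 1970) attributes, with the title in an `a.title` link. These selectors are assumptions about the old Reddit markup and should be confirmed in the browser's inspector before relying on them.

```r
library(rvest)
library(tibble)
library(ggplot2)

# Hypothetical fragment imitating two posts (titles and numbers are invented)
page <- minimal_html("
  <div class='thing' data-score='120' data-timestamp='1600000000000'>
    <a class='title'>First placeholder post</a>
  </div>
  <div class='thing' data-score='45' data-timestamp='1650000000000'>
    <a class='title'>Second placeholder post</a>
  </div>
")

things <- page %>% html_elements("div.thing")

# Timestamps are in milliseconds; convert to seconds for as.POSIXct()
ms <- things %>% html_attr("data-timestamp") %>% as.numeric()

posts <- tibble(
  title  = things %>% html_element("a.title") %>% html_text2(),
  score  = things %>% html_attr("data-score") %>% as.numeric(),
  posted = as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
)

p <- ggplot(posts, aes(x = posted, y = score)) +
  geom_point()

p
```

With the real page, `top.reddit` would take the place of `page`.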