<img src="https://www.exegetic.biz/img/exegetic-banner-black.svg" width="35%" align="right">

# Web Scraping: Volcano Heights

Andrew B. Collier (@datawookie | andrew@exegetic.biz)<br>
Data Scientist / Founder<br>
[Exegetic Analytics](https://www.exegetic.biz)

<span style="color: #3498db;">**↯ Notebooks**</span> available from https://bit.ly/2kwWRvX.

## Introduction

![](../fig/volcano.png)

The [list of volcanoes by elevation](https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation) page on Wikipedia lists volcanoes along with their elevations in metres and feet.

**The Brief**: Our brief is to capture elevation data for all of the volcanoes on the page and store it in a relational database.

**The Challenge**: The data are divided into six tables, one for each elevation range.

**The Approach**: These are the steps that we'll take to achieve that goal:

1. Figure out how to scrape a single table.
2. Scrape all six tables.
3. Concatenate the data.

## Setup

Load some libraries.

In [76]:
suppressMessages(library(dplyr))
library(tidyr)
library(purrr)
suppressMessages(library(janitor))
library(stringr)
library(ggplot2)
suppressMessages(library(rvest))

Define the URL. Open [this link](https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation) in your browser to inspect the page before scraping it.

In [77]:
URL <- "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"

The first step will be to grab the HTML for the page.

In [78]:
html <- read_html(URL)

## Scrape First Table

Use a CSS selector to pick out the first table. `html_node()` returns only the first element that matches `table.sortable`.

In [79]:
(table <- html %>% html_node("table.sortable"))

{html_node}
<table class="sortable" border="0" cellspacing="3" cellpadding="1" style="border:1px solid #e7dcc3">
[1] <tbody>\n<tr>\n<th>Mountain\n</th>\n<th>Metres\n</th>\n<th>Feet\n</th>\n< ...
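The difference between `html_node()` and `html_nodes()` matters later on, when all six tables are needed. Here is a minimal offline sketch of that difference. The inline HTML and the mountain names `Alpha` and `Beta` are made up for illustration and are not taken from the Wikipedia page.

```r
library(rvest)

# A tiny stand-in page with two sortable tables (hypothetical data).
page <- read_html('
  <table class="sortable">
    <tr><th>Mountain</th><th>Metres</th></tr>
    <tr><td>Alpha</td><td>6,000</td></tr>
  </table>
  <table class="sortable">
    <tr><th>Mountain</th><th>Metres</th></tr>
    <tr><td>Beta</td><td>5,000</td></tr>
  </table>
')

# html_node() gives back a single node: the first match only.
first_table <- page %>% html_node("table.sortable")

# html_nodes() gives back every match as a node set.
all_tables <- page %>% html_nodes("table.sortable")

length(all_tables)

# The first data cell of the first table.
first_name <- first_table %>% html_node("td") %>% html_text()
first_name
```

Newer releases of rvest also offer `html_element()` and `html_elements()`, which behave the same way.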

There's a handy helper function, `html_table()`, for extracting the data from an HTML table into a data frame.

In [None]:
(table <- table %>% html_table())

Clean up the column names and drop the imperial measure.

In [None]:
table <- table %>%
    clean_names() %>%
    select(-feet)

head(table)

Now convert the `metres` column to numeric type.

In [None]:
table <- table %>%
    mutate(
        metres = metres %>% str_replace(",", "") %>% as.numeric()
    )

head(table)
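One detail worth knowing about the conversion: `str_replace()` only touches the first match in each string, whereas `str_replace_all()` touches every match. Volcano elevations are all below 10,000 metres, so each value has at most one comma and either function would do here. The sketch below uses made-up strings (not scraped values) to show where the two diverge.

```r
library(stringr)

# Made-up elevation strings; the third one has two thousands separators.
raw <- c("6,893", "5,033", "1,234,567", "850")

# Only the first comma in each string is removed.
first_only <- str_replace(raw, ",", "")
first_only

# Every comma is removed, so all strings convert cleanly to numbers.
cleaned <- as.numeric(str_replace_all(raw, ",", ""))
cleaned
```

If a string still contains a comma, `as.numeric()` returns `NA` with a warning, so it is worth checking for missing values after the conversion.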

## Scrape All Tables

We're going to want to apply the pre-processing above to all tables, so write a function.

<span style="color: #3498db;">**↯ Exercise**</span> Write a function, `prepare_elevations()`, which will accept an HTML node and return a data frame after

- cleaning up column names;
- dropping the imperial units column; and
- converting the metric column to numeric type.

In [None]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [None]:
prepare_elevations <- function(table) {
    table %>%
        html_table() %>%
        clean_names() %>%
        select(-feet) %>%
        mutate(
            metres = metres %>% str_replace(",", "") %>% as.numeric()
        )
}

Test the function on the first table.

In [None]:
prepare_elevations(html %>% html_node("table.sortable"))

Looks good! Now map it across all tables.

In [None]:
elevations <- map(
    html %>% html_nodes("table.sortable"),
    prepare_elevations
)

The result is a list with one element per table.

In [None]:
class(elevations)

In [None]:
length(elevations)

Now we just need to concatenate them.

In [None]:
elevations <- bind_rows(elevations)
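Once the tables are stacked, there is no record of which table each row came from. If that matters, `bind_rows()` accepts an `.id` argument which adds an identifier column. A minimal sketch, using two made-up data frames in place of the scraped tables (the column name `table_index` is an arbitrary choice):

```r
library(dplyr)

# Stand-ins for two scraped tables (hypothetical values).
tables <- list(
  data.frame(mountain = c("Alpha", "Beta"), metres = c(6000, 5500)),
  data.frame(mountain = "Gamma", metres = 4200)
)

# Stack the data frames and record each row's position in the list.
combined <- bind_rows(tables, .id = "table_index")
combined
```

For an unnamed list the identifier is just the position in the list. If the list has names, those are used instead.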

How many volcanoes?

In [None]:
nrow(elevations)

In [None]:
tail(elevations)
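The brief asks for the data to end up in a relational database. One possible way to do that, assuming that the DBI and RSQLite packages are installed (neither is loaded elsewhere in this notebook), is to write the data frame to a SQLite table. The sketch below uses a small made-up data frame, `elevations_demo`, and a hypothetical table name, `volcano`. In practice you would pass `elevations` and a file path instead.

```r
library(DBI)

# Stand-in for the scraped `elevations` data frame (made-up values).
elevations_demo <- data.frame(
  mountain = c("Alpha", "Beta", "Gamma"),
  metres = c(6000, 5500, 4200)
)

# In-memory SQLite database; use a file path instead to persist the data.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create the table and insert all rows of the data frame.
dbWriteTable(con, "volcano", elevations_demo)

# Read some of it back with SQL to confirm that the write worked.
high <- dbGetQuery(
  con,
  "SELECT mountain, metres FROM volcano WHERE metres > 5000 ORDER BY metres DESC"
)
high

dbDisconnect(con)
```

SQLite is convenient for a demonstration because it needs no server. The same DBI calls should work with other backends by changing the driver passed to `dbConnect()`.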

Finish off by creating a histogram of volcano heights.

In [None]:
ggplot(elevations, aes(x = metres)) +
    geom_histogram(binwidth = 200, fill = "#3498db", colour = "black") +
    labs(
        title = "Volcano Elevations",
        caption = "Data from Wikipedia",
        # labs() sets axis titles via `x` and `y`; `xlab` and `ylab` are not valid arguments here.
        x = "Elevation [metres]",
        y = "Count"
    ) +
    theme_classic()