---
title: "Part 2: Basic R Concepts"
author: Dan Nguyen
date: August 17, 2015
output:
  md_document:
    variant: markdown_github
    preserve_yaml: true
  html_document:
    self_contained: false
    toc: true
output_dir: builds
---

Chapter 2: Basic R and Data Visualization Concepts

This chapter is focused on R syntax and concepts for working with data frames and generating visualizations via ggplot2. For demonstration purposes, I use a subset of the U.S. Geological Survey's earthquake data (records for August 2015). Chapters 3 and 4 will repeat the same concepts except with earthquake data spanning back to 1995.

What is ggplot

Be explicit and repetitive

Loading the libraries and themes

library(ggplot2)
library(scales)
library(grid)
library(dplyr)
library(lubridate)
library(rgdal)
library(hexbin)
library(ggmap)
  • ggplot2 is the visualization framework.
  • dplyr is a library for manipulating, transforming, and aggregating data frames. It also utilizes the nice piping from the magrittr package, which warms the Unix programmer inside me (see the quick example after this list).
  • lubridate greatly eases working with time values when generating time series and time-based filters. Think of it as moment.js for R.
  • scales provides helpers for making scales for our ggplot2 charts.
  • grid contains some functionality that ggplot2 uses for chart styling and arrangement, including the unit() function.
  • hexbin is used with ggplot2 to make hexbin maps.
  • rgdal provides bindings for geospatial operations, including map projection and the reading of map shapefiles. It can be a bear to install due to a variety of dependencies. If you're on OS X, I recommend installing Homebrew and running brew install gdal before installing the rgdal package via R.
  • ggmap is another ggplot2 plugin that allows the plotting of geospatial data on top of map imagery from Google Maps, OpenStreetMap, and Stamen Maps.
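To give a quick taste of the piping convention mentioned in the dplyr bullet: the %>% operator passes the left-hand value in as the first argument of the right-hand function, so nested calls become left-to-right steps. A throwaway illustration using R's built-in mtcars data (not part of the walkthrough itself):

library(dplyr)
# without piping, the calls read inside-out:
summarise(filter(mtcars, mpg > 30), count = n())
# with piping, the same operation reads left to right:
mtcars %>% filter(mpg > 30) %>% summarise(count = n())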

Customizing the themes

This is entirely optional; in fact, my custom themes will likely add some annoying roadblocks for anyone trying to follow along. There's not much to theming in ggplot2: just memorize a shit-ton of syntax and have a lot of patience. You can view my ggplot theme script and copy it from here, then make it locally available to your scripts (e.g. ./myslimthemes.R) so that you can source it as I do in each of my scripts:

source("./myslimthemes.R")
theme_set(theme_dan())

Or, just don't use them when copying the rest of my code. All of the theme calls begin with theme_dan or dan. One dependency that might cause a few problems is the use of the extrafont package, which I use to set the default font to Gill Sans. I think I set it up so that systems without Gill Sans or extrafont will just throw a warning message. But I didn't spend too much time testing that.
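For illustration, here's the kind of guard I have in mind -- a sketch only; pick_base_font() is a hypothetical helper, not the actual contents of myslimthemes.R:

# fall back to a generic font when extrafont or Gill Sans isn't available
pick_base_font <- function() {
  if (requireNamespace("extrafont", quietly = TRUE) &&
      "Gill Sans" %in% extrafont::fonts()) {
    "Gill Sans"
  } else {
    warning("Gill Sans not available; falling back to 'sans'")
    "sans"
  }
}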

Downloading the data

For this chapter, we use two data files: the USGS earthquake feed for the month in CSV form, and a zip file of the U.S. Census's state boundary shapefiles; the download steps for both are below.

If you're following along many years from now and those URLs no longer work, the Github repo for this walkthrough contains copies of the raw files for you to practice on. You can either clone the repo to get the files, or download them from Github's raw file storage, though those URLs are subject to the whims of Github's framework and may change down the road. The easiest thing is to just clone the repo, or download and unpack the zip file of this repo.

First we create a data directory to store our files:

# showWarnings = FALSE suppresses the warning if ./data already exists
dir.create('./data', showWarnings = FALSE)

Download earthquake data into a data frame

Because the data files can get big, I've included an if statement so that if a file exists at ./data/all_month.csv, the download command won't attempt to re-download the file.

url <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
fname <- "./data/all_month.csv"
if (!file.exists(fname)){
  print(paste("Downloading: ", url))
  download.file(url, fname)
}

Alternatively, to download a specific interval of data, such as just the earthquakes for August 2015, use the USGS Archives endpoint:

base_url <- 'http://earthquake.usgs.gov/fdsnws/event/1/query.csv'
starttime_param <- 'starttime=2015-08-01T00:00:00'
endtime_param <- 'endtime=2015-08-31T23:59:59'
orderby_param <- 'orderby=time-asc'
# join the query-string parameters with ampersands
url <- paste0(base_url, '?',
              paste(starttime_param, endtime_param, orderby_param, sep = "&"))
print(paste("Downloading:", url))
fname <- 'data/2015-08.csv'
download.file(url, fname)

The standard read.csv() function can be used to convert the CSV file into a data frame, which I store into a variable named usgs_data:

usgs_data <- read.csv(fname, stringsAsFactors = FALSE)

Convert the time column to a Date-type object

The time column of usgs_data contains timestamps of the events:

head(usgs_data$time)

However, these values are strings; to work with them as measurements of time, e.g. in creating a time series chart, we convert them to time objects (more specifically, objects of class POSIXct). The lubridate package provides the useful ymd_hms() function for fairly robust parsing of strings.
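For instance, ymd_hms() happily parses the same instant written with different separators (lubridate assumes UTC by default); both of these calls return the same POSIXct value:

ymd_hms("2015-08-17 03:04:05")
ymd_hms("2015/08/17 03:04:05")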

The standard way to transform a data frame column looks like this:

usgs_data$time <- ymd_hms(usgs_data$time)

However, I'll more frequently use the mutate() function from the dplyr package:

usgs_data <- mutate(usgs_data, time = ymd_hms(time))

And because dplyr makes use of the piping convention from the magrittr package, I'll often write the above snippet like this:

usgs_data <- usgs_data %>% mutate(time = ymd_hms(time))

Download and read the map data

Let's download the U.S. state boundary data. This also comes in a zip file that we download and unpack:

url <- "http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_state_20m.zip"
fname <- "./data/cb_2014_us_state_20m.zip"
if (!file.exists(fname)){
  print(paste("Downloading: ", url))
  download.file(url, fname)
}
unzip(fname, exdir = "./data/shp")

The unzip command unpacks a list of files into the data/shp/ subdirectory. We only care about the one named cb_2014_us_state_20m.shp. To read that file into a SpatialPolygonsDataFrame-type object, we use the rgdal library's readOGR() command and assign the result to the variable us_map:

us_map <- readOGR("./data/shp/cb_2014_us_state_20m.shp", "cb_2014_us_state_20m")

At this point, we've created two variables, usgs_data and us_map, which point to data frames for the earthquakes and the U.S. state boundaries, respectively. We'll be building upon these data frames, or more specifically, creating subsets of their data for our analyses and visualizations.

Let's count how many earthquake records there are in a month's worth of USGS earthquake data:

nrow(usgs_data)

That's a nice big number with almost no meaning. We want to know how big these earthquakes were, where they hit, and how many of the big ones hit. To go beyond this summary number, we turn to data visualization.

A quick introduction to simple charts in ggplot2

The earthquake data contains several dimensions of interest: location (e.g. latitude and longitude), magnitude, and time. A scatterplot is frequently used to show the relationship between two or more of a dataset's variables. In fact, we can think of geographic maps as a sort of scatterplot showing longitude and latitude along the x- and y-axis, though maps don't imply that there's a "relationship", per se, between the two coordinates.

We'll cover mapping later in this chapter, so let's start off by plotting the earthquake data with time as the independent variable, i.e. along the x-axis, and mag as the dependent variable, i.e. along the y-axis.

ggplot(usgs_data, aes(x = time, y = mag)) +
  geom_point() +
  scale_x_datetime() +
  ggtitle("Worldwide seismic events for August 2015")

First of all, from a conceptual standpoint, one might argue that a scatterplot doesn't make much sense here, because how big an earthquake is seems to be totally unrelated to when it happens. This is not quite true: weaker earthquakes can presage a massive earthquake, or closely follow one as aftershocks.

But that insight is lost here because we're attempting to plot the entire world's earthquakes on a single chart, which ends up being impractical. There are so many earthquake events that we run into the problem of overplotting: dots representing earthquakes of similar magnitude and time overlap and obscure each other.

One way to fix this is to change the shape of the dots and increase their transparency, i.e. decrease their alpha:

ggplot(usgs_data, aes(x = time, y = mag)) +
  geom_point(alpha = 0.2) +
  scale_x_datetime() +
  ggtitle("Worldwide seismic events for August 2015")

That's a little better, but there are just too many points for us to visually process. For example, we can vaguely tell that there seem to be more earthquakes of magnitudes between 0 and 2. But we can't accurately quantify the difference.

{:#make-histogram}

Fun with histograms

Having too much data to sensibly plot is going to be a major and recurring theme for the entirety of this walkthrough. In chapter 3, our general strategy will be to divide and filter the earthquake records by location, so that not all of the points are plotted in a single chart.

However, sometimes it's best to just change the story you want to tell. With a scatterplot, we attempted (and failed) to show intelligible patterns in a month's worth of worldwide earthquake activity. Let's go simpler. Let's attempt to just show what kinds of earthquakes, and how many, have occurred around the world in a month's timespan.

A bar chart (histogram) is the most straightforward way to show how many earthquakes fall within a magnitude range.

Below, I simply organize the earthquakes into "buckets" by rounding their magnitudes to the nearest integer:

ggplot(usgs_data, aes(x = round(mag))) +
  geom_histogram(binwidth = 1)

The same chart, but with some polishing of its style. The color parameter for geom_histogram() allows us to create an outline to visually separate the bars:

ggplot(usgs_data, aes(x = round(mag))) +
  geom_histogram(binwidth = 1, color = "white") +
  # this removes the margin between the plot and the chart's bottom, and
  # formats the y-axis labels into something more human readable:
  scale_y_continuous(expand = c(0, 0), labels = comma) +
  ggtitle("Worldwide earthquakes by magnitude, rounded to nearest integer\nAugust 2015")

The tradeoff for the histogram's clarity is a loss in granularity. In some situations, it's important to see the characteristics of the data at a more granular level.

{:#aug_2015_histogram_quakes_bucketed_tenths-mark}

The geom_histogram()'s binwidth parameter can do the work of rounding and grouping the values. The following snippet groups earthquakes to the nearest tenth of a magnitude number. The size parameter in geom_histogram() controls the width of each bar's outline:

ggplot(usgs_data, aes(x = mag)) +
  geom_histogram(binwidth = 0.1, color = "white", size = 0.2) +
  scale_x_continuous(breaks = pretty_breaks()) +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Worldwide earthquakes grouped by tenths of magnitude number\nAugust 2015")

Time series

Another tradeoff between histogram and scatterplot is that the latter displays two facets of the data: magnitude and time. With a histogram, we can show only one or the other.

A histogram in which the count of records is grouped according to their time element is more commonly known as a time series. Here's how to chart the August 2015 earthquakes by day, which we derive from the time column using lubridate's day() function:

ggplot(usgs_data, aes(x = day(time))) +
  geom_histogram(binwidth = 1, color = "white") +
  scale_x_continuous(breaks = pretty_breaks()) +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Worldwide earthquakes grouped by day\nAugust 2015")

Stacked charts

We do have several options for showing two or more characteristics of a dataset as a bar chart. By setting the fill aesthetic of the data -- in this case, to the mag column -- we can see the breakdown of earthquakes by magnitude for each day:

ggplot(usgs_data, aes(x = day(time), fill = factor(round(mag)))) +
  geom_histogram(binwidth = 1, color = "white") +
  scale_x_continuous(breaks = pretty_breaks()) +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Worldwide earthquakes grouped by day\nAugust 2015")

Let's be a little nitpicky and rearrange the legend so that the order of the labels follows the order in which the bars are stacked:

ggplot(usgs_data, aes(x = day(time), fill = factor(round(mag)))) +
  geom_histogram(binwidth = 1, color = "white") +
  scale_x_continuous(breaks = pretty_breaks()) +
  scale_y_continuous(expand = c(0, 0)) +
  guides(fill = guide_legend(reverse = TRUE)) +
  ggtitle("Worldwide earthquakes grouped by day and magnitude\nAugust 2015")
Limit the color choices

Before we move on from the humble histogram/time series, let's address how unnecessarily busy the previous chart is. There are 9 gradations of color. Most of them, particularly at the extremes of the scale, are barely even visible because of how few earthquakes are in that magnitude range.

Is it worthwhile to devote so much space on the legend to M6.0+ earthquakes when in reality, so few earthquakes are in that range? Sure, you might say: those are by far the deadliest of the earthquakes. While that is true, the purpose of this chart is not to notify readers how many times their lives may have been rocked by a huge earthquake in the past 30 days -- a histogram is a very poor way to convey such information, period. Furthermore, the reality of the chart's visual constraints makes it next to impossible to even see those earthquakes on the chart.

So let's simplify the color range by making an editorial decision: earthquakes below M3.0 aren't terribly interesting. At M3.0+, they have the possibility of being notable depending on where they happen. Since the histogram already completely fails at showing the location of the earthquakes, we don't need to feel too guilty that our simplified, two-color chart obscures the quantity of danger from August 2015's earthquakes:

ggplot(usgs_data, aes(x = day(time), fill = mag >= 3)) +
  geom_histogram(binwidth = 1, color = "black", size = 0.2) +
  scale_x_continuous(breaks = pretty_breaks()) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#fee5d9", "#de2d26"), labels = c("Under M3.0", "M3.0+")) +
  guides(fill = guide_legend(reverse = TRUE)) +
  ggtitle("Worldwide earthquakes grouped by day and magnitude\nAugust 2015")

You probably knew that the majority of earthquakes were below M3.0. But did you underestimate just how big that majority is? The two-tone chart makes the disparity much clearer.

One more nitpicky design detail: the number of M3.0+ earthquakes, i.e. the red-colored bars, is more interesting than the number of weak earthquakes. But the jagged shape of the histogram makes it harder to count the M3.0+ earthquakes. So let's reverse the order of the stacking (and the legend). The easiest way to do that is to change the "phrasing" of the conditional statement for the fill aesthetic, from mag >= 3 to mag < 3:

ggplot(usgs_data, aes(x = day(time), fill = mag < 3)) +
  geom_histogram(binwidth = 1, color = "black", size = 0.2) +
  scale_x_continuous(breaks = pretty_breaks()) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#de2d26", "#fee5d9"), labels = c("M3.0+", "Under M3.0")) +
  guides(fill = guide_legend(reverse = TRUE)) +
  ggtitle("Worldwide earthquakes grouped by day and magnitude\nAugust 2015")

It's the story, stupid

Though it's easy to get caught up in the technical details of how to build and design a chart, remember the preeminent role of editorial insight and judgment. The details you choose to focus on and the story you choose to tell have the greatest impact on the success of your data visualizations.

Filtering the USGS data

TKTK

The USGS data feed contains more than just earthquakes, though. So we use dplyr's group_by() function on the type column and then summarise() the record counts:

usgs_data %>% group_by(type) %>% summarise(count = n())
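As an aside, dplyr also provides count() as a shorthand for this group-and-tally pattern; this one-liner is nearly equivalent (the tally column is named n rather than count):

usgs_data %>% count(type)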

For this particular journalistic endeavour, we don't care about explosions and quarry blasts. We also only care about events of a reasonable magnitude – remember that magnitudes under 3.0 are often not even noticed by the average person. Likewise, most of the stories about Oklahoma's earthquakes focus on earthquakes of at least magnitude 3.0. So let's filter usgs_data appropriately and store the result in a variable named quakes.

There are several ways to create a subset of a data frame:

  • Using bracket notation:

      quakes <- usgs_data[usgs_data$mag >= 3.0 & usgs_data$type == 'earthquake',]
    
  • Using subset():

      quakes <- subset(usgs_data, mag >= 3.0 & type == 'earthquake')
    
  • And my preferred convention: using dplyr's filter():

quakes <- usgs_data %>% filter(mag >= 3.0, type == 'earthquake')

The quakes data frame is now about a tenth the size of the data we originally downloaded, which will work just fine for our purposes:

nrow(quakes)

Plotting the earthquake data without a map

A geographical map can be thought of as a plain ol' scatterplot. In this case, each dot is plotted using the longitude and latitude values, which serve as the x and y coordinates, respectively:

ggplot(quakes, aes(x = longitude, y = latitude)) + geom_point() +
  ggtitle("M3.0+ earthquakes worldwide, August 2015")

Even without the world map boundaries, we can see in the locations of the earthquakes a rough outline of the world's fault lines. Compare the earthquakes plot to this Digital Tectonic Activity Map from NASA:

Digital Tectonic Activity Map of the Earth, via NASA

Even with just ~1,000 points (and, in the next chapter, 20,000+ points), we again run into a problem of overplotting. So I'll increase the size and transparency of each point and change the point shape. And just to add some variety, I'll change the color of the points from black to firebrick.

We can apply these styles in the geom_point() call:

ggplot(quakes, aes(x = longitude, y = latitude)) +
  geom_point(size = 3,
             alpha = 0.2,
             shape = 1,
             color = 'firebrick') +
  ggtitle("M3.0+ earthquakes worldwide, August 2015")

Varying the size by magnitude

Obviously, some earthquakes are more momentous than others. An easy way to show this would be to vary the size of the point by mag:

ggplot(quakes, aes(x = longitude, y = latitude)) +
  geom_point(aes(size = mag),
             shape = 1,
             color = 'firebrick')

However, this understates the difference between earthquakes, as their magnitudes are measured on a logarithmic scale; a M5.0 earthquake has 100 times the amplitude of a M3.0 earthquake. Scaling the circles accurately and fitting them on the map would be...a little awkward (and I also don't know enough about ggplot to map the legend's labels to the proper non-transformed values). In any case, for the purposes of this investigation, we mostly care about the frequency of earthquakes, rather than their actual magnitudes, so I'll leave out the size aesthetic in my examples.
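If you want to check the arithmetic yourself, here's a throwaway helper (mag_ratio() is mine, purely for illustration):

# each whole-number step in magnitude is a tenfold increase in amplitude,
# so the amplitude ratio between two magnitudes is 10^(m2 - m1)
mag_ratio <- function(m1, m2) 10 ^ (m2 - m1)
mag_ratio(3.0, 5.0)  # 100: an M5.0 has 100x the amplitude of an M3.0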

Let's move on to plotting the boundaries of the United States.

Plotting the map boundaries

The data contained in the us_map variable is actually a kind of data frame: a SpatialPolygonsDataFrame, a class provided by the sp package (which was loaded as a dependency of rgdal).

Since us_map is a data frame, it's pretty easy to plop it right into ggplot():

ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group)) +
   theme_dan_map()

By default, things look a little distorted because the longitude and latitude values are treated as values on a flat, 2-dimensional plane. As we've learned, the world is not flat. So to have the geographical points -- in this case, the polygons that make up state boundaries -- look more like we're accustomed to on a globe, we have to project the longitude/latitude values to a different coordinate system.

This is a fundamental cartographic concept that is beyond my ability to concisely and intelligibly explain here, so I direct you to Michael Corey, of the Center for Investigative Reporting, and his explainer, "Choosing the Right Map Projection". And Mike Bostock has a series of excellent interactive examples showing some of the complexities of map projection; I embed one of his D3 examples below:

<iframe src="http://bl.ocks.org/dannguyen/raw/36e0f357433dda000dc0/918717fa8db0d0f8d8460026b4a41815b82362de/" width="700" height="400" marginwidth="0" marginheight="0" scrolling="no"></iframe>

Once you understand map projections, or at least are aware of their existence, applying them to ggplot() is straightforward. In the snippet below, I apply the Albers projection, the standard projection for the U.S. Census (and Geological Survey), using the coord_map() function. Projecting in Albers requires a couple of parameters, lat0 and lat1, which I'm just going to copy and modify from this r-bloggers example, though I assume they have something to do with specifying the parallels needed for accurate proportions:

ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group)) +
  coord_map("albers", lat0 = 38, latl = 42) +
  theme_classic()

Filtering out Alaska and Hawaii

The U.S. Census boundaries contain data only for the United States. So why does our map span the entire globe? Because Alaska has the annoying property of containing both the westernmost and easternmost points of the United States, such that it wraps around to the other side of our coordinate system, i.e. from longitude -179 to 179.

There's obviously a more graceful, mathematically proper way of translating the coordinates so that everything fits nicely on our chart (see the sketch below). But for now, to keep things simple, let's just remove Alaska -- and Hawaii, and all the non-states -- as that gives us an opportunity to practice filtering SpatialPolygonsDataFrames.
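For the curious, one common quick fix (not used anywhere in this walkthrough) is to shift the few positive longitudes in Alaska's Aleutian chain westward by 360 degrees so the state stays in one piece; shift_lons() below is a hypothetical helper, sketched only to show the idea:

# remap positive longitudes to their negative equivalents, e.g. 179
# becomes -181, so Alaska's polygons don't wrap around the chart
shift_lons <- function(lons) ifelse(lons > 0, lons - 360, lons)
shift_lons(c(-179, 179))  # -179 -181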

First, we inspect the column names of us_map's data to see which one corresponds to the name of each polygon, e.g. Iowa or CA:

colnames(us_map@data)

Both NAME and STUSPS (which I'm guessing stands for U.S. Postal Service) will work:

head(select(us_map@data, NAME, STUSPS))

To filter out the data that corresponds to Alaska, i.e. STUSPS == "AK":

byebye_alaska <- us_map[us_map$STUSPS != 'AK',]

To filter out Alaska, Hawaii, and the non-states, e.g. Guam and Washington D.C., and then assign the result to the variable usc_map:

x_states <- c('AK', 'HI', 'AS', 'DC', 'GU', 'MP', 'PR', 'VI')
usc_map <- subset(us_map, !(us_map$STUSPS %in% x_states))

Now let's map the contiguous United States, and, while we're here, let's change the style of the map to be in dark outline with white fill:

ggplot() +
  geom_polygon(data = usc_map, aes(x = long, y = lat, group = group),
                 fill = "white", color = "#444444", size = 0.1) +
  coord_map("albers", lat0 = 38, latl = 42)

Plotting the quakes on the map

Plotting the earthquake data on top of the United States map is as easy as adding the two layers together. Notice how I plot the map boundaries before the points; otherwise the map (or rather, its white fill) would cover up the earthquake points:

ggplot(quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = usc_map, aes(x = long, y = lat, group = group),
                   fill = "white", color = "#444444", size = 0.1) +
  geom_point(size = 3,
               alpha = 0.2,
               shape = 4,
               color = 'firebrick') +
  coord_map("albers", lat0 = 38, latl = 42) +
  theme_dan_map()

Well, that turned out poorly. The viewpoint reverts to showing the entire globe. The problem is easy to understand: the plot has to account for both the United States boundary data and the worldwide locations of the earthquakes.

In the next section, we'll tackle this problem by converting the earthquakes data into its own spatial-aware data frame. Then we'll cross-reference it with the data in usc_map to remove earthquakes that don't originate from within the boundaries of the contiguous United States.

Working with and filtering spatial data points

To reiterate, usc_map is a SpatialPolygonsDataFrame, and quakes is a plain data frame. We want to use the geodata in usc_map to remove all earthquake records that don't take place within the boundaries of the U.S. contiguous states.

The first question to ask is: why don't we just filter quakes by one of its columns, like we did for mag and type? The problem is that while the USGS data has a place column, it is not U.S.-centric, i.e. there's not an easy way to say, "Just show me records that take place within the United States", because place doesn't always mention the country:

head(quakes$place)

So instead, we use the boundary polygons stored in usc_map to filter the earthquakes by their latitude/longitude values. The math to do this from scratch is quite...labor intensive. Luckily, the sp library can do this work for us; we just have to first convert the quakes data frame into one of sp's special data frames: a SpatialPointsDataFrame.

sp_quakes <- SpatialPointsDataFrame(data = quakes,
                                    coords = quakes[,c("longitude", "latitude")])

Then we assign it the same projection as us_map (note that usc_map also has this same projection). First let's inspect the actual projection of us_map:

us_map@proj4string

Now assign that value to sp_quakes (which, by default, has a proj4string attribute of NA):

sp_quakes@proj4string <- us_map@proj4string

Let's see what the map plot looks like. Note that in the snippet below, I don't use sp_quakes as the data set, but as.data.frame(sp_quakes). This conversion is necessary as ggplot2 doesn't know how to deal with the SpatialPointsDataFrame (and yet it does fine with SpatialPolygonsDataFrames...whatever...):

ggplot(as.data.frame(sp_quakes), aes(x = longitude, y = latitude)) +
  geom_polygon(data = usc_map, aes(x = long, y = lat, group = group),
                   fill = "white", color = "#444444", size = 0.1) +
  geom_point(size = 3,
               alpha = 0.2,
               shape = 4,
               color = 'firebrick') +
  coord_map("albers", lat0 = 38, latl = 42) +
  theme_dan_map()

No real change...we've only gone through the process of making a spatial points data frame. Creating that spatial data frame, then converting it back to a data frame to use in ggplot() has basically no effect -- though it would if the geospatial data in usc_map had a projection that significantly transformed its lat/long coordinates.

How to subset a spatial points data frame

To see the change that we want – just earthquakes in the contiguous United States – we subset the spatial points data frame, i.e. sp_quakes, using usc_map. This is actually quite easy and uses similar notation as when subsetting a plain data frame. I don't know enough about base R notation and S4 objects to fully explain why this works, but as far as I can tell, sp overloads the bracket operator so that indexing a points object by a polygons object performs a spatial overlay, keeping only the points that fall inside the polygons:

sp_usc_quakes <- sp_quakes[usc_map,]
usc_quakes <- as.data.frame(sp_usc_quakes)

ggplot(usc_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = usc_map, aes(x = long, y = lat, group = group),
                   fill = "white", color = "#444444", size = 0.1) +
  geom_point(size = 3,
               alpha = 0.5,
               shape = 4,
               color = 'firebrick')  +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in the contiguous U.S. during August 2015") +
  theme_dan_map()

Subsetting points by state

What if we want to just show earthquakes in California? We first subset usc_map:

ca_map <- usc_map[usc_map$STUSPS == 'CA',]

Then we use ca_map to filter sp_quakes:

ca_quakes <- as.data.frame(sp_quakes[ca_map,])

Mapping California and its quakes:

ggplot(ca_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = ca_map, aes(x = long, y = lat, group = group),
                   fill = "white", color = "#444444", size = 0.3) +
  geom_point(size = 3,
               alpha = 0.8,
               shape = 4,
               color = 'firebrick')  +
  coord_map("albers", lat0 = 38, latl = 42) +
  theme_dan_map() +
  ggtitle("M3.0+ earthquakes in California during August 2015")

Joining shape file attributes to a data frame

The process of subsetting usc_map for each state, then subsetting the sp_quakes data frame, is a little cumbersome. Another approach is to add a new column to the earthquakes data frame that specifies which state the earthquake was in.

As I mentioned previously, the USGS data has a place column, but it doesn't follow a structured taxonomy of geographical labels and serves primarily as a human-friendly label, e.g. "South of the Fiji Islands" and "Northern Mid-Atlantic Ridge".

So let's add the STUSPS column to sp_quakes. First, since we're done mapping just the contiguous United States, let's create a map that includes all 50 U.S. states and store it in the variable states_map:

x_states <- c('AS', 'DC', 'GU', 'MP', 'PR', 'VI')
states_map <- subset(us_map, !(us_map$STUSPS %in% x_states))

The sp package's over() function can be used to join the rows of sp_quakes to the STUSPS column of states_map. In other words, the resulting data frame of earthquakes will have a STUSPS column, and the quakes in which the geospatial coordinates overlap with polygons in states_map will have a value in it, e.g. "OK", "CA":

xdf <- over(sp_quakes, states_map[, 'STUSPS'])

Most of the STUSPS values in xdf will be <NA> because most of the earthquakes do not take place in the United States. Still, we can see that among the states, Oklahoma (i.e. OK) experienced by far the most M3.0+ earthquakes in August 2015, twice as many as Alaska:

xdf %>% group_by(STUSPS) %>%
        summarize(count = n()) %>%
        arrange(desc(count))

To get a data frame of U.S.-only quakes, we combine xdf with sp_quakes. We then filter the resulting data frame by removing all rows in which STUSPS is <NA>:

ydf <- cbind(sp_quakes, xdf) %>% filter(!is.na(STUSPS))
states_quakes <- as.data.frame(ydf)

In later examples, I'll plot just the contiguous United States, so I'm going to remake the usc_quakes data frame in the same fashion as states_quakes:

usc_map <- subset(states_map, !states_map$STUSPS %in% c('AK', 'HI'))
xdf <- over(sp_quakes, usc_map[, 'STUSPS'])
usc_quakes <- cbind(sp_quakes, xdf) %>% filter(!is.na(STUSPS))
usc_quakes <- as.data.frame(usc_quakes)

So how is this different than before, when we derived usc_quakes?

old_usc_quakes <- as.data.frame(sp_quakes[usc_map,])

Again, the difference in our latest approach is that the resulting data frame, in this case the new usc_quakes and states_quakes, has a STUSPS column:

head(select(states_quakes, place, STUSPS))

The STUSPS column makes it possible to aggregate the earthquakes data frame by state. Note that states with 0 earthquakes for August 2015 are omitted:

ggplot(states_quakes, aes(STUSPS)) + geom_histogram(binwidth = 1) +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle("M3.0+ earthquakes in the United States, August 2015")

Layering and arranging ggplot visuals

In successive examples, I'll be using a few more ggplot tricks and features to add a little more narrative and clarity to the basic data visualizations.

Highlighting and annotating data

To highlight the Oklahoma data in orange, I simply add another layer via geom_histogram(), except with data filtered for just Oklahoma:

ok_quakes <- states_quakes[states_quakes$STUSPS == 'OK',]
ggplot(states_quakes, aes(STUSPS)) +
  geom_histogram(binwidth = 1) +
  geom_histogram(data = ok_quakes, binwidth = 1, fill = '#FF6600') +
  ggtitle("M3.0+ earthquakes in the United States, August 2015")

We can add labels to the bars with a stat_bin() call:

ggplot(states_quakes, aes(STUSPS)) +
  geom_histogram(binwidth = 1) +
  geom_histogram(data = ok_quakes, binwidth = 1, fill = '#FF6600') +
  stat_bin(binwidth=1, geom="text", aes(label = ..count..), vjust = -0.5) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 60)) +
  ggtitle("M3.0+ earthquakes in the United States, August 2015")

Highlighting Oklahoma in orange on the state map also involves just making another layer call:

ok_map <- usc_map[usc_map$STUSPS == 'OK',]
ggplot(usc_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = usc_map, aes(x = long, y = lat, group = group),
                   fill = "white", color = "#444444", size = 0.1) +
  geom_polygon(data = ok_map, aes(x = long, y = lat, group = group),
                   fill = "#FAF2EA", color = "#FF6600", size = 0.4) +
  geom_point(size = 2,
               alpha = 0.5,
               shape = 4,
               color = 'firebrick')  +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in the United States, August 2015") +
  theme_dan_map()

Time series

These are not too different from the previous histogram examples, except that the x aesthetic is set to some function of the earthquake data frame's time column. We use the lubridate package, specifically floor_date(), to bin the records by a given unit of time, e.g. by day. Then we use scale_x_date() to scale the x-axis accordingly; the date_breaks() and date_format() functions come from the scales package.

For example, using the states_quakes data frame, here's a time series showing the daily counts of U.S. earthquakes over the month:

ggplot(states_quakes, aes(x = floor_date(as.Date(time), 'day'))) +
  geom_histogram(binwidth = 1) +
  scale_x_date(breaks = date_breaks("week"), labels = date_format("%m/%d")) +
  scale_y_continuous(expand = c(0, 0), breaks = pretty_breaks()) +
  ggtitle("Daily counts of M3.0+ earthquakes in the United States, August 2015")

To create a stacked time series, in which Oklahoma's portion of earthquakes is in orange, it's just a matter of adding another layer as before:

ggplot(states_quakes, aes(x = floor_date(as.Date(time), 'day'))) +
  geom_histogram(binwidth = 1, fill = "gray") +
  geom_histogram(data = filter(states_quakes, STUSPS == 'OK'), fill = "#FF6600", binwidth = 1) +
  scale_y_continuous(expand = c(0, 0), breaks = pretty_breaks()) +
  scale_x_date(breaks = date_breaks("week"), labels = date_format("%m/%d")) +
  ggtitle("Daily counts of M3.0+ earthquakes, August 2015\nOklahoma vs. all other U.S.")

Or, alternatively, we can set the fill aesthetic to the STUSPS column; I use the guides() and scale_fill_manual() functions to order the colors and labels as I want them to be:

ggplot(states_quakes, aes(x = floor_date(as.Date(time), 'day'),
                       fill = STUSPS != 'OK')) +
  geom_histogram(binwidth = 1) +
  scale_x_date(breaks = date_breaks("week"), labels = date_format("%m/%d")) +
  scale_fill_manual(values = c("#FF6600", "gray"), labels = c("OK", "Not OK")) +
  guides(fill = guide_legend(reverse = TRUE)) +
  ggtitle("Daily counts of M3.0+ earthquakes, August 2015\nOklahoma vs. all other U.S.")

Line and area charts

I've included these here because I keep forgetting how to make them. However, neither is particularly useful for our data.

For example, here's a line chart of the daily counts of M3.0+ earthquakes in the United States:
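This is a minimal sketch under the assumption that we aggregate the daily counts ourselves with dplyr (states_quakes comes from earlier in this chapter), rather than relying on a binning stat:

daily_counts <- states_quakes %>%
  mutate(date = as.Date(time)) %>%
  group_by(date) %>%
  summarise(count = n())

ggplot(daily_counts, aes(x = date, y = count)) +
  geom_line() +
  scale_x_date(breaks = date_breaks("week"), labels = date_format("%m/%d")) +
  scale_y_continuous(expand = c(0, 0), breaks = pretty_breaks()) +
  ggtitle("Daily counts of M3.0+ earthquakes in the United States, August 2015")

Swap geom_line() for geom_area() to get the area-chart version.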

Small multiples

ggplot2's facet_wrap() function provides a convenient way to generate a grid of visualizations, one for each value of a variable, i.e. Tufte's "small multiples", a.k.a. lattice/trellis charts:

ggplot(data = mutate(usc_quakes, week = floor_date(time, "week")),
       aes(x = longitude, y = latitude)) +
  geom_polygon(data = usc_map, aes(x = long, y = lat, group = group),
                   fill = "white", color = "#444444", size = 0.1) +
  geom_point(size = 1,
               alpha = 0.5,
               shape = 1,
               color = 'firebrick')  +
  coord_map("albers", lat0 = 38, lat1 = 42) +
  facet_wrap(~ week, ncol = 2) +
  ggtitle("M3.0+ Earthquakes in the contiguous U.S. by week, August 2015") +
  theme_dan_map()

Put your mappable dots into bins

We've seen how a histogram brings clarity to dense data by "binning" values, such as grouping earthquakes by the nearest tenth of magnitude, or by day for a time series.

The same concept can be applied to scatterplots and maps when overplotting is a problem:

ggplot() +
  geom_polygon(data = ok_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  geom_point(data = ok_quakes,
              aes(x = longitude, y = latitude), color = "red", alpha = 0.4, shape = 1, size = 2.5) +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in Oklahoma, August 2015") +
  theme_dan_map()

A few of the earthquakes are so close together that they appear to be in the exact same location at this zoom level. But how many exactly? There could be 2, 3, or 10 earthquakes represented by what appear to be a single marker.
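One way to find out is a quick throwaway tally of quakes that share (rounded) coordinates; the rounding precision here is an arbitrary choice of mine:

# count how many quakes fall on the same rounded coordinates
ok_quakes %>%
  mutate(lng = round(longitude, 1), lat = round(latitude, 1)) %>%
  group_by(lng, lat) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head()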

Binning a map (or scatterplot) replaces the dots with shapes, such as rectangles or hexagons, and uses each shape's fill to indicate the number of points within its "bin". Here's a rectangular binning of the data:

ggplot(data = ok_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = ok_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  stat_bin2d(color = "gray") +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in Oklahoma, August 2015 (rectangular-binned)") +
  theme_dan_map()

We control the granularity of the chart with the bins argument for stat_bin2d(); setting it to 10 will make the bins bigger than in the previous chart. In the snippet below, I also set the alpha aesthetic to vary by ..count..:

ggplot(data = ok_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = ok_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  stat_bin2d(color = "gray", bins = 10, fill = "blue", aes(alpha = ..count..)) +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in Oklahoma, August 2015 (rectangular-binned)") +
  theme_dan_map()

You can also set fill to NA and have the bins vary by color, which is helpful for seeing the underlying layers. In the example below, I've also included the earthquakes as points so that you can see how many points fall in each bin:

(This is not something I would do in practice because it's redundant and it increases the visual clutter.)

ggplot(data = ok_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = ok_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  stat_bin2d(aes(color = ..count..), bins = 10, fill = NA) +
  geom_point(color = "red", alpha = 0.5, shape = 1) +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in Oklahoma, August 2015 (rectangular-binned)") +
  theme_dan_map()

Contour and density plots

If you want to get fancy, you can use stat_density2d() to create a 2D density plot with contour lines, sometimes referred to as a continuous 2D-histogram:

ggplot(data = ok_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = ok_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  geom_point(data = ok_quakes, aes(x = longitude, y = latitude),
            color = "red") +
  stat_density2d(bins = 10) +
  coord_map("albers", lat0 = 38, latl = 42) +
  ggtitle("M3.0+ earthquakes in Oklahoma, August 2015 (density plot)") +
  theme_dan_map()

Density plots are neat but they confuse easily-confused people like myself, and I don't think the continuous binning is helpful in this situation. It also looks too much like a topographical map, which is particularly confusing when the topic involves earthquakes and geology.

Hexbins

So when it comes to two-dimensional binning, I prefer hexbins. I have a hard time justifying why, other than that boxes are boring because everyone else uses them. But Fabio Nelli has a nice technical explanation of why hexagons make for nice bins:

There are many reasons for using hexagons instead of squares for binning a 2D surface as a plane. The most evident is that hexagons are more similar to circle than square. This translates in more efficient data aggregation around the bin center. This can be seen by looking at some particular properties of hexagons and, especially, of the hexagonal tessellation.

See also: Chris Stucchio's fine demonstration of hexbins to bring clarity to a real-world scatterplot snafu.

But I admit that my hex-preference is primarily informed by how the hexbin package makes it so easy to plot hexagonal bins with R and ggplot2:

Note: In the snippet below, I use the coord_equal() function instead of coord_map("albers"). There might be a bug specific to hexbin and projections, so I leave out Albers for now and, further below, present a workaround.

ggplot() +
  geom_polygon(data = ok_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  stat_binhex(data = ok_quakes, bins = 20,
              aes(x = longitude, y = latitude), color = "#999999") +
  scale_fill_gradientn(colours = c("white", "red")) +
  coord_equal() +
  ggtitle("M3.0+ earthquakes in Oklahoma, August 2015 (Hexbin)") +
  theme_dan_map()

Using real-world map tiles with ggmap

One of my favorite R packages is ggmap, which provides a convenient interface for downloading map imagery and location data from Google Static Maps, OpenStreetMap, and Stamen Design, which can then be used as a layer for ggplot data visualizations.

The Census's geographic shapefiles are more than adequate -- and often ideal -- for representing familiar land masses and boundaries. But I'll let ggmap's authors, David Kahle and Hadley Wickham, describe the use case for ggmap:

In spatial statistics the ability to visualize data and models superimposed with their basic social landmarks and geographic context is invaluable. ggmap is a new tool which enables such visualization by combining the spatial information of static maps from Google Maps, OpenStreetMap, Stamen Maps or CloudMade Maps with the layered grammar of graphics implementation of ggplot2. In addition, several new utility functions are introduced which allow the user to access the Google Geocoding, Distance Matrix, and Directions APIs. The result is an easy, consistent and modular framework for spatial graphics with several convenient tools for spatial data analysis.

ggmap is such an obviously handy tool but I haven't seen it available in other non-web visualization frameworks – though obviously, plotting points on Google Maps/OpenStreetMaps via JavaScript is a ubiquitous feature in browser-based applications. While the maps created by ggmap aren't interactive, you get the full power of R's data processing and ggplot's visualization. An easy tradeoff, in my opinion.

ggmap's get_map()

The ggmap library has a general get_map() function. By default, it will fetch a map image from Google Static Maps, which passes the location value to Google's sophisticated location query dis-discombobulation parser:

chucky <- get_map(location = "Chucky Cheeses Ohkrahoma", zoom = 17)

By default, the get_map() function conveniently prints out the URLs that directly link to the Google Static Maps image and the API's JSON response; I've left in its output above as an example.

The object returned by get_map() is a raster image. Use the ggmap() function to plot it:

ggmap(chucky)

In lieu of shapefiles, we can also fetch map imagery that shows geographical features, such as Google's terrain maps:

goog_map <- get_googlemap(center = "Oklahoma",  size = c(500,500), zoom = 6, scale = 2, maptype = 'terrain')

We use ggmap() to draw it as a plot:

ggmap(goog_map) + theme_dan_map()

Use Google feature styles to remove administrative labels:

goog_map2 <- get_googlemap(center = "Oklahoma",  size = c(500,500),
  zoom = 6, scale = 2, maptype = 'terrain',
  style = c(feature = "administrative.province", element = "labels", visibility = "off"))
# plot the map
ggmap(goog_map2) + theme_dan_map()

With points

ggmap(goog_map2, extent = 'panel') +
  geom_point(data = states_quakes, aes(x = longitude, y = latitude), alpha = 0.2, shape = 4, colour = "red") + theme_dan_map()

With hexbins on a satellite (hybrid) map; note again that we don't use the Albers projection:

hybrid_goog_map <-  get_googlemap(center = "Oklahoma",  size = c(640,640), zoom = 6, scale = 2, maptype = 'hybrid',
style = c(feature = "administrative.province", element = "labels", visibility = "off"))

ggmap(hybrid_goog_map, extent = 'panel') +
  stat_binhex(data = states_quakes, aes(x = longitude, y = latitude), bins = 30, alpha = 0.3 ) +
  scale_fill_gradientn(colours=c("pink","red")) +
  guides(alpha = FALSE) + theme_dan_map()

At this point, we've covered the general range of data-munging and visualization techniques we need to effectively analyze and visualize the historical earthquake data for the United States.

Optional and fun exercises

{:#mark-hexbin-albers-workaround}

Projecting geocoordinates without coord_map()

We use the spTransform() function to apply the Albers coordinate system to usc_map and usc_quakes:

# store the Albers system in a variable as we need to apply it separately
# to usc_map and usc_quakes (this proj4 string is the standard
# contiguous-U.S. Albers equal-area definition)
albers_crs <- CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs")

# turn usc_quakes into a SpatialPointsDataFrame
sp_usc <-  SpatialPointsDataFrame(data = usc_quakes,
                          coords = usc_quakes[,c("longitude", "latitude")])
# give it the same default projection as usc_map
sp_usc@proj4string <- usc_map@proj4string
# Apply the Albers projection to the spatial quakes data and convert it
# back to a data frame. Note that the Albers coordinates are stored in
# latitude.2 and longitude.2
albers_quakes <- as.data.frame(spTransform(sp_usc, albers_crs))

# Apply the Albers projection to usc_map
albers_map <- spTransform(usc_map, albers_crs)

Note: in the x and y aesthetics for the stat_binhex() call, we refer to longitude.2 and latitude.2. This is because albers_quakes is the result of two SpatialPointsDataFrame conversions; each conversion creates a new longitude and latitude column.

After those extra steps, we can now hexbin our earthquake data on the Albers coordinate system:

ggplot() +
  geom_polygon(data = albers_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "#666666", size = 0.5) +
  stat_binhex(data = albers_quakes, bins = 50,
              aes(x = longitude.2, y = latitude.2), color = "#993333") +
  scale_fill_gradientn(colours = c("white", "red")) +
  coord_equal() +
  ggtitle("M3.0+ earthquakes in the contiguous U.S., August 2015\n(Albers projection)")

Again, this is purely an aesthetic fix. Your data analysis doesn't become less important because your map looks weird. But geographical/geopolitical maps, as flawed as they are for most data analyses, are liked because of our familiarity with them. So if you decide to plot on a geographical layer, not looking weird may actually be a high priority.

Query USGS earthquake historical data using Census spatial data

Analyzing earthquake data at the county level

shpname <- 'cb_2014_us_county_20m'
url <- paste0("http://www2.census.gov/geo/tiger/GENZ2014/shp/", shpname, ".zip")
fname <- paste0("./data/", shpname, ".zip")
if (!file.exists(fname)){
  print(paste("Downloading: ", url))
  download.file(url, fname)
}
unzip(fname, exdir = "./data/shp")
counties_map <- readOGR(paste0("./data/shp/", shpname, '.shp'), shpname,
                         stringsAsFactors = FALSE)
# give them a STUSPS column
counties_map <- merge(counties_map, states_map@data[c('STUSPS', 'STATEFP')], by = 'STATEFP')
# subset for only the counties with STUSPS values
counties_map <- subset(counties_map, !is.na(STUSPS))

Oklahoma counties

Reusing ok_quakes (TK: all_ok_quakes)

ok_counties <- subset(counties_map, STUSPS == 'OK')
ggplot(data = ok_quakes, aes(x = longitude, y = latitude)) +
  geom_polygon(data = ok_counties, aes(x = long, y = lat, group = group),
                color = "black", size = 0.2, fill = "white") +
  geom_point(data = ok_quakes, aes(x = longitude, y = latitude), color = "red", shape = 1) +
  coord_map('albers', lat0 = 39, lat1 = 42) +
  theme_dan_map()


Using ggmap to geolocate the most earthquake-prone Chuck E. Cheese in the state of Oklahoma

http://newsok.com/article/5423712

http://www.chuckecheese.com/experience/locations?zip=10001&miles=10000&c=1000
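As a starting point, here's a rough sketch using ggmap's geocode() function; the query string is hypothetical, and the Google Geocoding API's results can change over time:

library(ggmap)
# geocode() returns a data frame with lon/lat columns for the query
cec <- geocode("Chuck E. Cheese's, Oklahoma City, OK")
cec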