Basics of data visualization

Kasper Welbers & Wouter van Atteveldt 2018-09

This tutorial teaches the basics of data visualization using the `ggplot2` package (included in `tidyverse`). For more information, see R4DS Chapter 3: Data Visualization and R4DS Chapter 7: Exploratory Data Analysis.

For many cool visualization examples using `ggplot2` (with R code included!) see the R Graph Gallery. For inspiration (but unfortunately no R code), there is also a 538 blog post on data visualization from 2016. Finally, see the article on 'the grammar of graphics' published by Hadley Wickham for more insight into the ideas behind ggplot.

A Basic ggplot plot

Suppose that we want to see the relation between college education and household income, both included in the `county facts` data set published on the Houston Data Visualisation GitHub page. Since this data set contains a large number of columns, we keep only a subset of columns for now:

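```r
library(tidyverse)
csv_folder_url = "https://raw.githubusercontent.com/houstondatavis/data-jam-august-2016/master/csv"
facts = read_csv(paste(csv_folder_url, "county_facts.csv", sep = "/"))
# NOTE: the end of this select() call was cut off in the source. `college` and `income`
# are used below, but their source column names (Pop_college_grad_pct, Income_per_capita)
# are assumptions; check names(facts) before relying on them.
facts_subset = facts %>% select(fips, area_name, state_abbreviation, population=Pop_2014_count, pop_change=Pop_change_pct,
    over65=Age_over_65_pct, female=Sex_female_pct, white=Race_white_pct,
    college=Pop_college_grad_pct, income=Income_per_capita)
# Keep the state-level rows: no state abbreviation, and not the national total (fips 0)
facts_state = facts_subset %>% filter(is.na(state_abbreviation) & fips != 0) %>% select(-state_abbreviation)
facts_state
```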

Now, let's make a scatter plot with percentage college-educated on the x-axis and median income on the y-axis:

`ggplot(data=facts_state) + geom_point(mapping=aes(x=college, y=income))`

The command above consists of two parts that are added together. First, `ggplot` creates an empty canvas tied to the dataset `facts_state`. Then, `geom_point` adds a layer of information to the canvas. In the language of ggplot, each layer has a geometrical representation, in this case points, and an aesthetic mapping which maps the visual elements of the geometry to columns of the data. In this case, the "x" and "y" are mapped to the college and income columns.

The result is a plot where each point here represents a state, and we see a clear correlation between education level and income. There is one clear outlier on the top-right. Can you guess which state that is?
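The two-step structure described above (an empty canvas, then a layer with an aesthetic mapping) can be sketched on a small data frame. The values below are made up for illustration and are not taken from `county_facts.csv`. `layer_data()` is used only to show which values the layer will draw.

```r
library(ggplot2)

# Hypothetical toy data (made-up values)
toy = data.frame(college = c(20, 25, 30, 35),
                 income  = c(22000, 25000, 29000, 33000))

# Step 1: an empty canvas tied to the data
canvas = ggplot(data = toy)

# Step 2: add a point layer that maps columns to x and y
p = canvas + geom_point(mapping = aes(x = college, y = income))

# The x and y values of the layer are simply the mapped columns
drawn = layer_data(p)
drawn[, c("x", "y")]
```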

Important note on ggplot command syntax

For the plot to work, R needs to execute the whole ggplot call and all layers as a single statement. Practically, that means that if you combine a plot over multiple lines, the plus sign needs to be at the end of the line, so R knows more is coming.

So, the following is good:

```r
ggplot(data=facts_state) +
  geom_point(mapping=aes(x=college, y=income))
```

But this is not:

```r
ggplot(data=facts_state)
+ geom_point(mapping=aes(x=college, y=income))
```
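One way to see the difference is to ask R how many statements it reads in each version. This is a small sketch using base R's `parse()`, which only reads the code without running it, so the placeholder names `d`, `x` and `y` do not need to exist.

```r
# parse() reads code without evaluating it
good = parse(text = "ggplot(d) +\n  geom_point(aes(x, y))")
bad  = parse(text = "ggplot(d)\n  + geom_point(aes(x, y))")

length(good)  # a single statement: the layer is part of the plot
length(bad)   # two statements: the bare plot, then a stray `+ geom_point(...)`
```

In the second version R has already finished the `ggplot(d)` statement by the time it reaches the plus sign, so the layer is never added to the plot.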

Also note that `data` and `mapping` are the first arguments these functions expect, so you can also leave out the argument names:

`ggplot(facts_state) + geom_point(aes(x=college, y=income))`

Other aesthetics

To find out which visual elements can be used in a layer, use e.g. `?geom_point`. According to the help file, we can (among others) set the colour, alpha (transparency), and size of points. Let's first map the size of the points to the population of each state, creating a bubble plot:

`ggplot(data=facts_state) + geom_point(mapping=aes(x=college, y=income, size=population))`

Since it is difficult to see overlapping points, let's make all points somewhat transparent. Note: Since we want to set the alpha of all points to a single value, this is not a mapping (as it is not mapped to a column from the data frame), but a constant. These are set outside the mapping argument:

`ggplot(data=facts_state) + geom_point(mapping=aes(x=college, y=income, size=population), alpha=.5, colour="red")`
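The difference between a constant and a mapping can be checked on a small example. The sketch below uses made-up values and compares `colour="red"` outside `aes()` with the same text placed inside `aes()`, where ggplot treats it as a data value rather than as a colour name.

```r
library(ggplot2)

# Hypothetical toy data (made-up values)
toy = data.frame(college = c(20, 30, 40),
                 income  = c(22000, 28000, 35000))

# Constant: set outside aes(), applied to every point as-is
p_const = ggplot(toy) + geom_point(aes(x = college, y = income), colour = "red")

# Mapping: inside aes(), "red" is a one-level category that gets a palette colour
p_mapped = ggplot(toy) + geom_point(aes(x = college, y = income, colour = "red"))

const_colours  = unique(layer_data(p_const)$colour)   # the literal colour "red"
mapped_colours = unique(layer_data(p_mapped)$colour)  # a default palette colour instead
```

The second plot also gets a legend with a single entry labelled "red", which is usually a sign that a constant ended up inside `aes()` by mistake.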

Instead of setting colour to a constant value, we can also let it vary with the data. For example, we can colour the states by percentage of population above 65:

`ggplot(data=facts_state) + geom_point(mapping=aes(x=college, y=income, size=population, colour=over65), alpha=.9)`

Finally, you can map to a categorical value as well. Let's categorize states into whether population is growing (by more than 1%) or stable or declining. We use the `if_else(condition, iftrue, iffalse)` function, which assigns the `iftrue` value if the condition is true, and `iffalse` otherwise:

`facts_state = facts_state %>% mutate(growth=if_else(pop_change > 1, "Growing", "Stable"))`

Now, we can add the category color to the plot:

`ggplot(data=facts_state) + geom_point(mapping=aes(x=college, y=income, size=population, colour=growth), alpha=.9) `

As you can see in these examples, ggplot tries to be smart about the mapping you ask for. It automatically sets the x and y ranges to the values in your data. It mapped the size such that there are small and large points, but not e.g. a point so large that it would dominate the graph. For the colour, it created a continuous colour scale for the interval variable, while for the categorical variable it automatically assigned a colour to each group.

Of course, each of those choices can be customized, and sometimes it makes a lot of sense to do so. For example, you might wish to use red for Republicans and blue for Democrats, if your audience is used to those colours; or you may wish to use grayscale for an old-fashioned paper publication. We'll explore more options in a later tutorial, but for now let's be happy that ggplot does a lot of work for us!
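As a small preview of such customization, the sketch below overrides the automatic colours of a categorical mapping with `scale_colour_manual()`. The data values and the two colour choices are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data (made-up values)
toy = data.frame(college = c(20, 30, 40),
                 income  = c(22000, 28000, 35000),
                 growth  = c("Growing", "Stable", "Growing"))

# Assign a fixed colour to each category instead of the default palette
p = ggplot(toy) +
  geom_point(aes(x = college, y = income, colour = growth)) +
  scale_colour_manual(values = c(Growing = "darkgreen", Stable = "grey50"))

# The colours the layer will draw with
used_colours = sort(unique(as.character(layer_data(p)$colour)))
used_colours
```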

Bar plots

Another frequently used plot is the bar plot. By default, `geom_bar` assumes that you want to plot counts, i.e. the number of occurrences of each group. As a very simple example, the following plots the number of states that are growing or stable in population:

`ggplot(data=facts_state) + geom_bar(mapping=aes(x=growth))`
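To make the default counting behaviour concrete, here is a minimal sketch on four made-up rows. `layer_data()` shows the counts that `geom_bar` computes before drawing the bars.

```r
library(ggplot2)

# Hypothetical toy data: three growing states and one stable state
toy = data.frame(growth = c("Growing", "Stable", "Growing", "Growing"))

p = ggplot(toy) + geom_bar(aes(x = growth))

# One row per bar, with the number of rows that fell into each category
bar_counts = layer_data(p)$count
bar_counts
```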

For a more interesting plot, let's plot the votes per Republican candidate in the New Hampshire primary. First, we need to download the per-county data, summarize it per state, and filter to only get the NH results for the Republican party: (see the previous tutorials on Data Transformations and Joining data for more information if needed)

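```r
results = read_csv(paste(csv_folder_url, "primary_results.csv", sep = "/"))
# Sum the county-level votes per state, party and candidate
results_state = results %>% group_by(state, party, candidate) %>% summarize(votes=sum(votes))
nh_gop = results_state %>% filter(state == "New Hampshire" & party == "Republican")
nh_gop
```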

Now, let's make a bar plot with votes (y) per candidate (x). Since we don't want ggplot to summarize it for us (we already did that ourselves), we set `stat="identity"` so that the summary statistic is the identity function, i.e. each value is used as it is.

`ggplot(data=nh_gop) + geom_bar(mapping=aes(x=candidate, y=votes), stat='identity')`
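As an aside, ggplot2 also offers `geom_col()` as a shorthand for `geom_bar(stat="identity")`. The sketch below, using made-up candidates and vote counts, checks that both draw the same bar heights.

```r
library(ggplot2)

# Hypothetical toy data (made-up vote counts)
toy = data.frame(candidate = c("A", "B", "C"),
                 votes     = c(100, 250, 175))

p_identity = ggplot(toy) + geom_bar(aes(x = candidate, y = votes), stat = "identity")
p_col      = ggplot(toy) + geom_col(aes(x = candidate, y = votes))

# Both layers use the vote counts directly as bar heights
same_heights = isTRUE(all.equal(layer_data(p_identity)$y, layer_data(p_col)$y))
same_heights
```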

Setting graph options

Some options, like labels, legends, and the coordinate system are graph-wide rather than per layer. You add these options to the graph by adding extra functions to the call. For example, we can use `coord_flip()` to swap the x and y axes:

`ggplot(data=nh_gop) + geom_bar(mapping=aes(x=candidate, y=votes), stat='identity') + coord_flip()`

You can also reorder categories with the `reorder` function, for example to sort by number of votes. Also, let's add some colour (just because we can!):

`ggplot(data=nh_gop) + geom_bar(mapping=aes(x=reorder(candidate, votes), y=votes, fill=candidate), stat='identity') + coord_flip()`

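This is getting somewhere, but the label on the candidate axis (`reorder(candidate, votes)`) is not very pretty and we don't need a legend (guide) for the fill mapping. Both can be remedied with more graph-level options. Note that `coord_flip()` only changes how the plot is drawn, so the candidate axis is still relabelled with `xlab`. Also, we can use a `theme` to alter the appearance of the graph, for example using the minimal theme: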

`ggplot(data=nh_gop) + geom_bar(mapping=aes(x=reorder(candidate, votes), y=votes, fill=candidate), stat='identity') + coord_flip() + xlab("Candidate") + guides(fill="none") + theme_minimal()`
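To see what `reorder` does to the category order in the plots above, it can be tried on its own. This sketch uses three made-up candidates; `reorder` returns a factor whose levels are sorted by the second argument, and ggplot places categories in the order of the factor levels.

```r
# Hypothetical toy data (made-up vote counts)
candidate = c("A", "B", "C")
votes     = c(300, 100, 200)

# Without reorder, categories are placed alphabetically
alphabetical = levels(factor(candidate))

# With reorder, the levels follow increasing number of votes
by_votes = levels(reorder(candidate, votes))
by_votes  # "B" "C" "A"
```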

Grouped bar plots

We can also add groups to bar plots. For example, we can set the x category to state (taking only NH and IA to keep the plot readable), and then group by candidate:

```r
gop2 = results_state %>% filter(party == "Republican" & (state == "New Hampshire" | state == "Iowa"))
ggplot(data=gop2) + geom_bar(mapping=aes(x=state, y=votes, fill=candidate), stat='identity')
```

By default, the groups are stacked. This can be controlled with the position parameter, which can be `dodge` (for grouped bars) or `fill` (stacking to 100%):

```r
ggplot(data=gop2) + geom_bar(mapping=aes(x=state, y=votes, fill=candidate), stat='identity', position='dodge')
ggplot(data=gop2) + geom_bar(mapping=aes(x=state, y=votes, fill=candidate), stat='identity', position='fill')
```
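The effect of the position parameter can also be checked numerically. The sketch below uses made-up states, candidates and vote counts, and compares the top of the tallest bar for the default stacking and for `position="fill"`.

```r
library(ggplot2)

# Hypothetical toy data: two states, two candidates (made-up vote counts)
toy = data.frame(state     = c("X", "X", "Y", "Y"),
                 candidate = c("A", "B", "A", "B"),
                 votes     = c(30, 10, 20, 20))

m = aes(x = state, y = votes, fill = candidate)
stacked = layer_data(ggplot(toy) + geom_bar(mapping = m, stat = "identity"))
filled  = layer_data(ggplot(toy) + geom_bar(mapping = m, stat = "identity", position = "fill"))

# Stacked bars reach the total votes per state; filled bars always reach 1
top_stacked = max(stacked$ymax)
top_filled  = max(filled$ymax)
```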

You can also make the grouped bars add up to 100% by computing the proportion manually.

```r
gop2 = gop2 %>% group_by(state) %>% mutate(vote_prop=votes/sum(votes))
ggplot(data=gop2) + geom_bar(mapping=aes(x=state, y=vote_prop, fill=candidate), stat='identity', position='dodge') + ylab("Votes (%)")
```

Note that where `group_by %>% summarize` replaces the data frame by a summarization, `group_by %>% mutate` adds a column to the existing data frame, using the grouped values for e.g. sums.
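The difference between the two is easiest to see side by side. This is a minimal sketch on a made-up table with two states and two candidates.

```r
library(dplyr)

# Hypothetical toy data (made-up vote counts)
toy = tibble(state     = c("X", "X", "Y", "Y"),
             candidate = c("A", "B", "A", "B"),
             votes     = c(30, 10, 20, 20))

# summarize: one row per group, the original rows are gone
per_state = toy %>% group_by(state) %>% summarize(votes = sum(votes))

# mutate: all rows are kept, sum(votes) is computed within each state
with_prop = toy %>% group_by(state) %>% mutate(vote_prop = votes / sum(votes))

nrow(per_state)  # 2 rows, one per state
nrow(with_prop)  # 4 rows, one per state and candidate
```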

Line plots

Finally, another frequent graph is the line graph. For example, we can plot the ascendancy of Donald Trump by looking at his vote share over time. First, we combine the results per state with the primary schedule: (see the tutorial on Joining data)

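```r
schedule = read_csv(paste(csv_folder_url, "primary_schedule.csv", sep="/"))
schedule = schedule %>% mutate(date = as.Date(date, format="%m/%d/%y"))
# Trump's vote share per state (this step was missing in the source and is reconstructed)
trump = results_state %>% group_by(state, party) %>% mutate(vote_prop=votes/sum(votes)) %>% filter(candidate == "Donald Trump")
trump = left_join(trump, schedule)
# Average the per-state shares for each primary date
trump = trump %>% group_by(date) %>% summarize(vote_prop=mean(vote_prop))
trump
```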

Take a minute to inspect the code above, and try to understand what each line does! The best way to do this is to inspect the output of each line, and trace back how that output is computed based on the input data.
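As one illustration of inspecting a single step, the date conversion can be tried on its own. The input string below is a made-up example in the month/day/two-digit-year format that the `format` argument describes.

```r
# Hypothetical input in month/day/year format
d = as.Date("03/01/16", format = "%m/%d/%y")

class(d)   # a proper Date object rather than text
format(d)  # "2016-03-01"
```

Because the result is a real date, ggplot can place it on a continuous time axis instead of treating each date as a separate category.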

`ggplot(trump) + geom_line(aes(x=date, y=vote_prop))`

We can do the same for multiple candidates as well, for example for the Democratic candidates:

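```r
# Vote share per date (the vote_prop step was missing in the source and is reconstructed)
dems = results_state %>% filter(party=="Democrat") %>% left_join(schedule) %>%
  group_by(date, candidate) %>% summarize(votes=sum(votes)) %>% mutate(vote_prop=votes/sum(votes))
ggplot(dems) + geom_line(aes(x=date, y=vote_prop, colour=candidate))
```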

Bonus question: in the code for Trump, the proportion was calculated in two statements (first per state, then per date), but in this code it is calculated only per date. How does that matter? Is either calculation more correct than the other?

Multiple 'faceted' plots

Just to show off some of the possibilities of ggplot, let's make a plot of all Republican primary outcomes on Super Tuesday (March 1st):

```r
super = results_state %>% left_join(schedule) %>% filter(party=="Republican" & date=="2016-03-01") %>% group_by(state) %>% mutate(vote_prop=votes/sum(votes))
ggplot(super) + geom_bar(aes(x=candidate, y=vote_prop), stat='identity') + facet_wrap(~ state, nrow = 3) + coord_flip()
```

Note that `facet_wrap` creates one panel per value of a single variable and wraps the panels over multiple rows. You can also use `facet_grid()` to specify separate variables for rows and columns.
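The difference between the two faceting functions can be sketched with a made-up data set of two candidates, three states and two parties. The `PANEL` column of `layer_data()` tells us how many panels each plot has.

```r
library(ggplot2)

# Hypothetical toy data: every combination of candidate, state and party
toy = expand.grid(candidate = c("A", "B"),
                  state     = c("X", "Y", "Z"),
                  party     = c("P", "Q"))
toy$votes = seq_len(nrow(toy))  # made-up vote counts

bars = ggplot(toy) + geom_bar(aes(x = candidate, y = votes), stat = "identity")

# facet_wrap: one panel per state
wrapped = bars + facet_wrap(~ state)
# facet_grid: parties as rows, states as columns
gridded = bars + facet_grid(party ~ state)

n_wrapped = length(unique(layer_data(wrapped)$PANEL))  # 3 panels
n_gridded = length(unique(layer_data(gridded)$PANEL))  # 2 x 3 = 6 panels
```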

Themes

Customization of things like background colour, grid colour etc. is handled by themes. `ggplot` has several built-in themes, including `theme_grey` (default), `theme_bw` (white background with a border), and the `theme_minimal` and `theme_void` themes used elsewhere in this tutorial. The package ggthemes has some more themes, including an 'economist' theme (based on the newspaper). To use a theme, simply add it to the plot:

```r
library(ggthemes)
ggplot(trump) + geom_line(aes(x=date, y=vote_prop)) + theme_economist()
```

Plotting maps

Geographic information can be plotted in `ggplot` much like scatter plots, simply using longitude and latitude as x and y. Often, we want to plot data on an actual map of (part of) the world, for example to plot locations of tweets or colour a map with information per country or state.

In `ggplot` this is accomplished by plotting the shapes of the countries. Shape data is available for the US, the world, and some countries like France, but unfortunately not the EU or Germany. The maps originate from the `maps` package, so that package needs to be installed, and you can check its documentation to see which countries are included.

```r
library(ggplot2)
states = map_data('state')
```

This basically tells ggplot what lines to draw to form a state. If a state is not contiguous it will contain subregions resulting in multiple polygons.

We can immediately plot this data, using `geom_polygon` to plot shapes. We specify x and y as longitude and latitude, fill by state, and make the state borders white.

```r
ggplot(data = states) +
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") +
  coord_fixed(1.3) + guides(fill="none")
```

Note: the last line fixes the aspect ratio to 1.3 and prevents a per-state legend (guide) from being plotted.

This example coloured the states as a non-informative nominal variable. We can also colour by our own data, for example by percentage white ethnicity:

```r
states = facts_state %>% mutate(region=tolower(area_name)) %>% select(region, white) %>% inner_join(states)
ggplot(data = states) +
  geom_polygon(aes(x = long, y = lat, fill = white, group = group), color = "white") +
  coord_fixed(1.3) + theme_void() +
  ggtitle("Percentage white population per state")
```
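To see why the `group` aesthetic matters in the polygon layers above, consider a sketch with two made-up triangles instead of real map data. Each value of `group` marks the corner points that belong to one closed shape, just as each polygon of a state has its own group in the map data.

```r
library(ggplot2)

# Hypothetical shape data: two triangles, three corner points each
shapes = data.frame(long  = c(0, 1, 0,  2, 3, 2),
                    lat   = c(0, 0, 1,  0, 0, 1),
                    group = c(1, 1, 1,  2, 2, 2))

# group tells geom_polygon which points to connect into one shape
p = ggplot(shapes) +
  geom_polygon(aes(x = long, y = lat, group = group), fill = "grey70", colour = "white") +
  coord_fixed(1)

n_shapes = length(unique(layer_data(p)$group))  # 2 separate polygons
```

Without the `group` mapping, all six points would be connected into a single polygon, which on a real map shows up as stray lines between separate regions.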