01-intro.Rmd

Jenny Bryan

If I had one thing to tell biologists learning bioinformatics, it would be "write code for humans, write data for computers".

— Vince Buffalo (@vsbuffalo) July 20, 2013

An important aspect of "writing data for computers" is to make your data tidy (see White et al. and Wickham in the Resources). There's an emerging consensus on key features of tidy data:

  • Each column is a variable
  • Each row is an observation

If you are struggling to make a figure, for example, stop and think hard about whether your data is tidy. Untidiness is a common, often overlooked cause of agony in data analysis and visualization.

Lord of the Rings example

I will give you a concrete example of some untidy data I created from this data from the Lord of the Rings Trilogy.

The Fellowship Of The Ring

Race    Female  Male
Elf       1229   971
Hobbit      14  3644
Man          0  1995

The Two Towers

Race    Female  Male
Elf        331   513
Hobbit       0  2463
Man        401  3589

The Return Of The King

Race    Female  Male
Elf        183   510
Hobbit       2  2673
Man        268  2459

We have one table per movie. In each table, we have the total number of words spoken, by characters of different races and genders.

You could imagine finding these three tables as separate worksheets in an Excel workbook, hanging out in some cells on the side of a worksheet that contains the underlying raw data, or as tables on a webpage or in a Word document.

This data has been formatted for consumption by human eyeballs (paraphrasing Murrell; see Resources). The format makes it easy for a human to look up the number of words spoken by female elves in The Two Towers. But this format actually makes it pretty hard for a computer to pull out such counts and, more importantly, to compute on them or graph them.

Exercises

Look at the tables above and answer these questions:

  • What's the total number of words spoken by male hobbits?
  • Does a certain Race dominate a movie? Does the dominant Race differ across the movies?

How well does your approach scale if there were many more movies or if I provided you with updated data that includes all the Races (e.g. dwarves, orcs, etc.)?

Tidy Lord of the Rings data

Here's how the same data looks in tidy form:

Film                        Race    Gender  Words
The Fellowship Of The Ring  Elf     Female   1229
The Fellowship Of The Ring  Elf     Male      971
The Fellowship Of The Ring  Hobbit  Female     14
The Fellowship Of The Ring  Hobbit  Male     3644
The Fellowship Of The Ring  Man     Female      0
The Fellowship Of The Ring  Man     Male     1995
The Two Towers              Elf     Female    331
The Two Towers              Elf     Male      513
The Two Towers              Hobbit  Female      0
The Two Towers              Hobbit  Male     2463
The Two Towers              Man     Female    401
The Two Towers              Man     Male     3589
The Return Of The King      Elf     Female    183
The Return Of The King      Elf     Male      510
The Return Of The King      Hobbit  Female      2
The Return Of The King      Hobbit  Male     2673
The Return Of The King      Man     Female    268
The Return Of The King      Man     Male     2459

Notice that tidy data is generally taller and narrower. It doesn't fit nicely on the page. Certain elements get repeated a lot, e.g. Hobbit. For these reasons, we often instinctively resist tidy data as inefficient or ugly. But, unless and until you're making the final product for a textual presentation of data, ignore your yearning to see the data in a compact form.
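The code in the rest of this lesson works on a data frame called lotr_tidy, which is never shown being built. Here is one way you might type the tidy table in by hand. This is only a sketch: the use of expand.grid() and the choice of factor levels (films in the order listed above) are my assumptions, not necessarily how the original object was created.

```r
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")

# expand.grid() varies its first argument fastest, so Gender cycles within
# Race and Race cycles within Film, matching the row order of the tidy table.
lotr_tidy <- expand.grid(
  Gender = c("Female", "Male"),
  Race   = c("Elf", "Hobbit", "Man"),
  Film   = films
)

# Word counts, typed in the same row order as the tidy table.
lotr_tidy$Words <- c(
  1229, 971, 14, 3644, 0, 1995,   # The Fellowship Of The Ring
  331, 513, 0, 2463, 401, 3589,   # The Two Towers
  183, 510, 2, 2673, 268, 2459    # The Return Of The King
)

# Put the columns in the same order as the printed table.
lotr_tidy <- lotr_tidy[, c("Film", "Race", "Gender", "Words")]
str(lotr_tidy)
```

Because expand.grid() turns character vectors into factors with levels in order of appearance, the films stay in the order listed rather than being sorted alphabetically.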

Benefits of tidy data

With the data in tidy form, it's natural to get a computer to do further summarization or to make a figure. This assumes you're using a language that is "data-aware", which R certainly is. Let's answer the questions posed above.

What's the total number of words spoken by male hobbits?

aggregate(Words ~ Race * Gender, data = lotr_tidy, FUN = sum)
##     Race Gender Words
## 1    Elf Female  1743
## 2 Hobbit Female    16
## 3    Man Female   669
## 4    Elf   Male  1994
## 5 Hobbit   Male  8780
## 6    Man   Male  8043

Now it takes just one line of code to compute the word total for both genders of all Races across all Films. The total number of words spoken by male hobbits is 8780. It was important here to have all word counts in a single variable, within a data frame that also included variables for Race and Gender.
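If you only want the one number, you don't even need the full table of totals. A minimal sketch, with lotr_tidy rebuilt from the tidy table so the snippet runs on its own (the construction via expand.grid() is my assumption about how you might enter the data):

```r
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")

# Rebuild the tidy data frame; Gender varies fastest, then Race, then Film.
lotr_tidy <- expand.grid(
  Gender = c("Female", "Male"),
  Race   = c("Elf", "Hobbit", "Man"),
  Film   = films
)
lotr_tidy$Words <- c(
  1229, 971, 14, 3644, 0, 1995,
  331, 513, 0, 2463, 401, 3589,
  183, 510, 2, 2673, 268, 2459
)

# Logical indexing on the two grouping variables picks out the male hobbit rows.
male_hobbit_words <- with(lotr_tidy, sum(Words[Race == "Hobbit" & Gender == "Male"]))
male_hobbit_words  # 8780

# xtabs() lays the same totals out as a compact Race-by-Gender table.
xtabs(Words ~ Race + Gender, data = lotr_tidy)
```

The xtabs() call is a reminder that a compact, human-friendly layout is easy to produce from tidy data whenever you want one.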

Does a certain Race dominate a movie? Does the dominant Race differ across the movies?

First, we sum across Gender, to obtain word counts for the different races by movie.

(by_race_film <- aggregate(Words ~ Race * Film, data = lotr_tidy, FUN = sum))
##     Race                       Film Words
## 1    Elf The Fellowship Of The Ring  2200
## 2 Hobbit The Fellowship Of The Ring  3658
## 3    Man The Fellowship Of The Ring  1995
## 4    Elf             The Two Towers   844
## 5 Hobbit             The Two Towers  2463
## 6    Man             The Two Towers  3990
## 7    Elf     The Return Of The King   693
## 8 Hobbit     The Return Of The King  2675
## 9    Man     The Return Of The King  2727

We can stare hard at those numbers to answer the question. But even nicer is to depict the word counts we just computed in a barchart.

library(ggplot2)
p <- ggplot(by_race_film, aes(x = Film, y = Words, fill = Race))
p + geom_bar(stat = "identity", position = "dodge") +
  coord_flip() + guides(fill = guide_legend(reverse=TRUE))

Hobbits are featured heavily in The Fellowship of the Ring, whereas Men had a lot more screen time in The Two Towers. They were equally prominent in the last movie, The Return of the King.

Again, it was important to have all the data in a single data frame, all word counts in a single variable, and associated variables for Film and Race.
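You can also ask the computer to name the dominant Race in each Film, rather than reading it off the numbers or the chart. The following is a sketch in base R. The Share column and the top_race name are mine, introduced for illustration, and lotr_tidy is rebuilt here so the snippet stands alone.

```r
films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")

# Rebuild the tidy data frame; Gender varies fastest, then Race, then Film.
lotr_tidy <- expand.grid(
  Gender = c("Female", "Male"),
  Race   = c("Elf", "Hobbit", "Man"),
  Film   = films
)
lotr_tidy$Words <- c(
  1229, 971, 14, 3644, 0, 1995,
  331, 513, 0, 2463, 401, 3589,
  183, 510, 2, 2673, 268, 2459
)

# Sum across Gender, as in the lesson.
by_race_film <- aggregate(Words ~ Race * Film, data = lotr_tidy, FUN = sum)

# Share of each film's words spoken by each race.
by_race_film$Share <- by_race_film$Words /
  ave(by_race_film$Words, by_race_film$Film, FUN = sum)

# For each film, keep the row with the largest word count.
top_race <- do.call(
  rbind,
  lapply(split(by_race_film, by_race_film$Film),
         function(d) d[which.max(d$Words), ])
)
top_race
```

This picks Hobbit for the first film and Man for the other two. In The Return Of The King, though, the shares for Hobbit and Man differ by less than one percentage point, so a bare "winner" hides how close it was. The chart makes that much easier to see.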

Take home message

Having the data in tidy form was a key enabler for our data aggregations and visualization.

Tidy data is integral to efficient data analysis and visualization.

If you're skeptical about any of the above claims, it would be interesting to get the requested word counts, the barchart, or the insight gained from the chart without tidying or plotting the data. And imagine redoing all of that on the full dataset, which includes 3 more Races, e.g. Dwarves.
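To make that concrete, here is a sketch of what the male hobbit total might look like if you kept the data as three separate per-film tables. The object names fship, ttow and rking are hypothetical, chosen just for this illustration.

```r
# One data frame per film, mirroring the three untidy tables.
fship <- data.frame(Race = c("Elf", "Hobbit", "Man"),
                    Female = c(1229, 14, 0),
                    Male   = c(971, 3644, 1995))
ttow  <- data.frame(Race = c("Elf", "Hobbit", "Man"),
                    Female = c(331, 0, 401),
                    Male   = c(513, 2463, 3589))
rking <- data.frame(Race = c("Elf", "Hobbit", "Man"),
                    Female = c(183, 2, 268),
                    Male   = c(510, 2673, 2459))

# Every film has to be named explicitly, and Gender is baked into the
# column name, so this line must be edited for each new film or question.
male_hobbit_words <- fship$Male[fship$Race == "Hobbit"] +
  ttow$Male[ttow$Race == "Hobbit"] +
  rking$Male[rking$Race == "Hobbit"]
male_hobbit_words  # 8780
```

It works, but the film lives in the object name and the gender lives in the column name, so neither is available as a variable to group by or to map to an axis.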

Where to next?

In the next lesson, we'll show how to tidy this data.

Our summing over Gender to get word counts for Film * Race was an example of data aggregation. The base function aggregate() does simple aggregation. For more flexibility, check out the packages plyr and dplyr.
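For comparison, here is roughly how the same Race-by-Film aggregation could look in dplyr. Treat it as a sketch: it assumes dplyr 1.0.0 or later is installed (for the .groups argument), and lotr_tidy is rebuilt by hand so the snippet runs on its own.

```r
library(dplyr)

films <- c("The Fellowship Of The Ring", "The Two Towers", "The Return Of The King")

# Rebuild the tidy data frame; Gender varies fastest, then Race, then Film.
lotr_tidy <- expand.grid(
  Gender = c("Female", "Male"),
  Race   = c("Elf", "Hobbit", "Man"),
  Film   = films
)
lotr_tidy$Words <- c(
  1229, 971, 14, 3644, 0, 1995,
  331, 513, 0, 2463, 401, 3589,
  183, 510, 2, 2673, 268, 2459
)

# Group by the variables to keep, then sum Words within each group.
by_race_film <- lotr_tidy %>%
  group_by(Race, Film) %>%
  summarise(Words = sum(Words), .groups = "drop")
by_race_film
```

The totals are the same as those from aggregate(); only the row order differs, since dplyr sorts the result by the grouping variables in the order given.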

The figure was made with ggplot2, a popular package that implements the Grammar of Graphics in R.

Resources

  • Bad Data Handbook by Q. Ethan McCallum, published by O'Reilly.
    • Chapter 3: Data Intended for Human Consumption, Not Machine Consumption by Paul Murrell.
  • Nine simple ways to make it easier to (re)use your data by EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp. Ideas in Ecology and Evolution 6(2): 1–10, 2013. doi:10.4033/iee.2013.6b.6.f http://library.queensu.ca/ojs/index.php/IEE/article/view/4608
    • See the section "Use standard table formats"
  • Tidy data by Hadley Wickham. Journal of Statistical Software. Vol. 59, Issue 10, Sep 2014. http://www.jstatsoft.org/v59/i10
    • tidyr, an R package to tidy data.
    • R packages by the same author that do heavier lifting in the data reshaping and aggregation department include reshape2, plyr and dplyr.