Jenny Bryan
r format(Sys.time(), '%d %B, %Y')
Recall that an important aspect of "writing data for computers" is to make your data tidy (see White et al and Wickham in the Resources). There's an emerging consensus on key features of tidy data:
- Each column is a variable
- Each row is an observation
But unfortunately, untidy data abounds. In fact, we often inflict it on ourselves, because untidy formats are more attractive for data entry or examination. So how do you make untidy data tidy?
We will now import the untidy data that was presented in the three film-specific word count tables from the intro.
I assume that data can be found as three plain text, delimited files, one for each Film
. How to liberate data from spreadsheets or tables in word processing documents has been covered in [TODO] this other lesson ... LINK. We bring the data into data frames, one per Film
, and do a little inspection. Note: Data Carpentry stores data in the top level data
directory; the object data_dir
holds the path to that directory. You may have these files in your current working directory, which affects the commands below.
fship <- read.csv(file.path(data_dir, "The_Fellowship_Of_The_Ring.csv"))
ttow <- read.csv(file.path(data_dir, "The_Two_Towers.csv"))
rking <- read.csv(file.path(data_dir, "The_Return_Of_The_King.csv"))
## if the csv files are in current working directory, import like so:
## fship <- read.csv("The_Fellowship_Of_The_Ring.csv")
rking
## Film Race Female Male
## 1 The Return Of The King Elf 183 510
## 2 The Return Of The King Hobbit 2 2673
## 3 The Return Of The King Man 268 2459
str(fship)
## 'data.frame': 3 obs. of 4 variables:
## $ Film : Factor w/ 1 level "The Fellowship Of The Ring": 1 1 1
## $ Race : Factor w/ 3 levels "Elf","Hobbit",..: 1 2 3
## $ Female: int 1229 14 0
## $ Male : int 971 3644 1995
We now have one data frame per Film
, each with a common set of 4 variables. Step one in tidying this data is to glue them together into one data frame, stacking them up row wise. I'll call this row-binding and use the base function rbind()
.
lotr_untidy <- rbind(fship, ttow, rking)
str(lotr_untidy)
## 'data.frame': 9 obs. of 4 variables:
## $ Film : Factor w/ 3 levels "The Fellowship Of The Ring",..: 1 1 1 2 2 2 3 3 3
## $ Race : Factor w/ 3 levels "Elf","Hobbit",..: 1 2 3 1 2 3 1 2 3
## $ Female: int 1229 14 0 331 0 401 183 2 268
## $ Male : int 971 3644 1995 513 2463 3589 510 2673 2459
lotr_untidy
## Film Race Female Male
## 1 The Fellowship Of The Ring Elf 1229 971
## 2 The Fellowship Of The Ring Hobbit 14 3644
## 3 The Fellowship Of The Ring Man 0 1995
## 4 The Two Towers Elf 331 513
## 5 The Two Towers Hobbit 0 2463
## 6 The Two Towers Man 401 3589
## 7 The Return Of The King Elf 183 510
## 8 The Return Of The King Hobbit 2 2673
## 9 The Return Of The King Man 268 2459
We are still violating one of the fundamental principles of tidy data. "Word count" is a fundamental variable in our dataset and it's currently spread out over two variables, Female
and Male
. Conceptually, we need to gather up the word counts into a single variable and create a new variable, Gender
, to track whether each count refers to females or males. We use a function from the tidyr
package to do this.
## uncomment and submit this line to install the tidyr package
## install.packages("tidyr")
library(tidyr)
lotr_tidy <-
gather(lotr_untidy, key = 'Gender', value = 'Words', Female, Male)
lotr_tidy
## Film Race Gender Words
## 1 The Fellowship Of The Ring Elf Female 1229
## 2 The Fellowship Of The Ring Hobbit Female 14
## 3 The Fellowship Of The Ring Man Female 0
## 4 The Two Towers Elf Female 331
## 5 The Two Towers Hobbit Female 0
## 6 The Two Towers Man Female 401
## 7 The Return Of The King Elf Female 183
## 8 The Return Of The King Hobbit Female 2
## 9 The Return Of The King Man Female 268
## 10 The Fellowship Of The Ring Elf Male 971
## 11 The Fellowship Of The Ring Hobbit Male 3644
## 12 The Fellowship Of The Ring Man Male 1995
## 13 The Two Towers Elf Male 513
## 14 The Two Towers Hobbit Male 2463
## 15 The Two Towers Man Male 3589
## 16 The Return Of The King Elf Male 510
## 17 The Return Of The King Hobbit Male 2673
## 18 The Return Of The King Man Male 2459
Tidy data ... mission accomplished!
To explain our call to gather()
above: we gathered the values in variables Female
and Male
together into a single new variable Words
. A second new variable Gender
serves as a key, explaining whether each of the values in Words
refers to Female
or Male
. All other variables, such as Film
, remain unchanged and are simply replicated as needed. Read the documentation for gather()
for more information and examples.
Now we write this multi-film, tidy dataset to file for use in various downstream scripts for further analysis and visualization. This would make an excellent file to share on the web with others, providing a tool-agnostic, ready-to-analyze entry point for anyone wishing to play with this data.
write.table(lotr_tidy, file = file.path(data_dir, "lotr_tidy.csv"),
quote = FALSE, sep = ",", row.names = FALSE)
You can inspect this delimited file here: lotr_tidy.csv.
The word count data is given in these two untidy and gender-specific files:
Write an R script that reads them in and writes a single tidy data frame to file. Literally, reproduce the lotr_tidy
data frame and the lotr_tidy.csv
data file from above.
Write R code to compute the total number of words spoken by each Race
across the entire trilogy. Do it two ways:
- Using
Film
-specific orGender
-specific, untidy data frames as the input data. - Using the
lotr_tidy
data frame as input.
Reflect on the process of writing this code and on the code itself. Which is easier to write? Easier to read?
Write R code to compute the total number of words spoken in each Film
. Do this by copying and modifying your own code for totalling words by Race
. Which approach is easier to modify and repurpose -- the one based on multiple, untidy data frames or the tidy data?
It is untidy to have have data parcelled out across different files or data frames. We used rbind()
above to combine Film
-specific data frames into one large data frame.
It is untidy to have a conceptual variable, e.g. "word count", spread across multiple variables, such as word counts for males and word counts for females. We used the gather()
function from the tidyr
package to stack up all the word counts into a single variable, create a new variable to convey male vs. female, and do the replication needed for the other variables.
Many data analytic projects will benefit from a script that marshals data from different files, tidies the data, and writes a clean result to file for further analysis.
Watch out for how untidy data seduces you into working with it more than you should:
- Data optimized for consumption by human eyeballs is attractive, so it's hard to remember it's suboptimal for computation. How can something that looks so pretty be so wrong?
- Tidy data often has lots of repetition, which triggers hand-wringing about efficiency and aesthetics. Until you can document a performance problem, keep calm and tidy on.
In the optional bonus content, I show how to tidy this data using only base R functions. At the other extreme, I also show how to tidy with add-on packages that are capable of more advanced data manipulations.
- Bad Data Handbook by By Q. Ethan McCallum, published by O'Reilly.
- Chapter 3: Data Intended for Human Consumption, Not Machine Consumption by Paul Murrell.
- Nine simple ways to make it easier to (re)use your data by EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp. Ideas in Ecology and Evolution 6(2): 1–10, 2013. doi:10.4033/iee.2013.6b.6.f http://library.queensu.ca/ojs/index.php/IEE/article/view/4608
- Tidy data by Hadley Wickham. Journal of Statistical Software. Vol. 59, Issue 10, Sep 2014. http://www.jstatsoft.org/v59/i10