Overview and learning objectives
Students will work through activities that highlight the motivation for and value of literate programming, both as a concept and as implemented in RMarkdown. Along the way, students are introduced to the concepts of executable documentation and automation. Students will also learn best practices for structuring spreadsheet-type data files, and the importance of documenting every change made to data. Finally, students will combine these ideas to create automated, executable, and self-documenting data quality assurance and control reports.
At the beginning of the session, students should
- be familiar with R commands, and running
At the end of the session students will be able to
- Distinguish between a spreadsheet formatted properly for later analysis and one formatted improperly.
- Recognize and correct common data entry errors.
- Describe the concept of 'raw data', and its implications for reproducible and sound data management.
- Apply the concept of literate programming to produce executable documentation of data management and analysis.
Overview and recap
- Recap about markdown, knitr, and the virtues of literate programming from the demonstrations in the Intro lesson: slides
Activity - Using R and Rmarkdown to clean and plot data
Objective: through hands-on interaction and modification, develop familiarity with RMarkdown and knitting the output.
knit and modify. Using countryPick4.Rmd as a template, students learn how to import data, filter to one country, make a plot, write it to file, and comment on data choices. Then the activity will illustrate what happens when you
- Preview/Knit HTML, note what sorts of outputs are left behind.
- Discuss input and output files.
- Which files can we delete and reproduce? Which files are inputs, outputs, converters of inputs to outputs?
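The import, filter, plot, and write steps of this activity can be sketched in base R as follows. This is a minimal illustration, not the contents of countryPick4.Rmd: the toy table, its made-up values, and the file names are assumptions chosen so the example runs without any external data.

```r
# Toy stand-in for the lesson's input data; the values are made up for
# illustration and are NOT real Gapminder figures.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1952, 1957, 1962, 1952, 1957, 1962),
  lifeExp = c(50, 52, 54, 60, 61, 62)
)

# INPUT file (hypothetical name): written once here so the sketch is self-contained.
input_file <- file.path(tempdir(), "toy-gapminder.tsv")
write.table(toy, input_file, sep = "\t", row.names = FALSE, quote = FALSE)

# 1. Import the tab-separated data.
dat <- read.delim(input_file, stringsAsFactors = FALSE)

# 2. Filter to one country. The choice of country is a data decision,
#    so it is stated in the code instead of being made by hand.
one_country <- subset(dat, country == "A")

# 3. Make a plot and write it to file (an OUTPUT that can be regenerated).
plot_file <- file.path(tempdir(), "lifeExp-A.pdf")
pdf(plot_file)
plot(one_country$year, one_country$lifeExp, type = "b",
     xlab = "Year", ylab = "Life expectancy", main = "Country A")
invisible(dev.off())

# 4. Write the filtered subset to file (also a reproducible OUTPUT).
output_file <- file.path(tempdir(), "gapminder-A.csv")
write.csv(one_country, output_file, row.names = FALSE)
```

In this sketch only the input file and the script itself need to be kept. The plot and the filtered CSV are outputs that can be deleted and reproduced by running the code again.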
This section is meant for students to explore the power of writing reports in
Documenting data modifications
Activity - Cleaning up data in Excel
Students identify poor and good data formatting practices, and will learn the importance of documenting modifications. This will lead to making modifications in a self-documenting and executable way.
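As a contrast to editing cells by hand in Excel, the following sketch shows what self-documenting, executable modifications might look like in R. The small table and its entry errors are hypothetical examples, not data from the lesson. The raw data object is never changed, and each correction is a commented line of code.

```r
# Hypothetical raw data with typical entry errors: stray whitespace,
# inconsistent capitalization, and a letter "l" typed instead of the digit "1".
raw <- data.frame(
  country = c("Canada", "canada ", "Mexico", " Mexico"),
  year    = c("1952", "1957", "1952", "l957"),
  stringsAsFactors = FALSE
)

# Work on a copy so that the raw data stays untouched.
clean <- raw

# Modification 1: remove leading and trailing whitespace from country names.
clean$country <- trimws(clean$country)

# Modification 2: capitalize the first letter of each country name.
clean$country <- paste0(toupper(substring(clean$country, 1, 1)),
                        substring(clean$country, 2))

# Modification 3: correct the mistyped year ("l957" should read "1957").
clean$year[clean$year == "l957"] <- "1957"

# Modification 4: store year as an integer now that all entries are numeric.
clean$year <- as.integer(clean$year)

# Quality check: stop if any year failed to convert.
stopifnot(!anyNA(clean$year))
```

Because every change is a line of code, the script doubles as a record of what was modified and why, and it can be rerun on the raw data at any time.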
Applying literate programming to produce executable documentation
- Lesson: 02-literate-programming
Resources and useful links
Relevant scientific papers
- EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp (2013) "Nine simple ways to make it easier to (re)use your data." Ideas in Ecology and Evolution 6(2): 1–10. doi:10.4033/iee.2013.6b.6.f (in particular the section "Use standard table formats")
Best practices for spreadsheets
- Good practice guidance on releasing statistics in spreadsheets (UK Government)
People and credits
This lesson was first created as a part of the Organization1 lesson at the first Reproducible Science Curriculum Hackathon, and was later split out into its own lesson. The corresponding author is Hilmar Lapp (@hlapp). See the commit log for other contributors.
Please post feedback and issues with the lesson on the repository's issue tracker. For instructor questions about teaching this lesson, you can also contact the corresponding author directly.
License and Attribution
- Gapminder data
- Processed and subset (population size, life expectancy, and GDP per capita; every 5 years starting in 1952; complete records only) Gapminder data as an R package. The data-raw sub-directory reveals the journey from Gapminder.org's Excel workbooks to increasingly clean and tidy data.
- The clean dataset can be located in R in the following way (after installing the package):
pathToTsv <- system.file("gapminder.tsv", package = "gapminder")
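Once located, the file can be read like any other tab-separated file. The sketch below is hedged in two ways. The helper name locate_gapminder_tsv is hypothetical. The fallback to an extdata sub-directory is an assumption about where newer releases of the package may keep the file, so check your installed version.

```r
# Hypothetical helper: returns the path to gapminder.tsv, or "" if not found.
locate_gapminder_tsv <- function() {
  # Without the package installed there is nothing to locate.
  if (!requireNamespace("gapminder", quietly = TRUE)) {
    return("")
  }
  # Location used in this lesson.
  path <- system.file("gapminder.tsv", package = "gapminder")
  # Assumed fallback: some package versions may ship the file under extdata/.
  if (!nzchar(path)) {
    path <- system.file("extdata", "gapminder.tsv", package = "gapminder")
  }
  path
}

pathToTsv <- locate_gapminder_tsv()

if (nzchar(pathToTsv)) {
  # Read the clean dataset and print a compact summary of its structure.
  gap <- read.delim(pathToTsv, stringsAsFactors = FALSE)
  str(gap)
} else {
  message("gapminder.tsv not found; is the gapminder package installed?")
}
```

system.file() returns an empty string when a file cannot be found, which is why the sketch tests the result with nzchar() before reading.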