This repository provides a sample project organization using data on bird collisions in Chicago.
The bird collision data is from
Winger BM, Weeks BC, Farnsworth A, Jones AW, Hennen M, Willard DE (2019) Data from: Nocturnal flight-calling behaviour predicts vulnerability to artificial light in migratory birds. Dryad Digital Repository. https://doi.org/10.5061/dryad.8rr0498
The following is a brief description of our recommended workflow and organization.
Create an RStudio project
The first step is to create an RStudio project. The dcl package contains an RStudio project template with six folders and a makefile. To use our template, you'll need to install the dcl package.
# install.packages("devtools")
devtools::install_github("stanford-datalab/dcl")
Now, you can create a project using our template.
- Click on the project drop-down menu in the upper-right corner of RStudio.
- Select New Project.
- Select New Directory > DCL Project.
- Name your directory, then click Create Project.
You should now have a file called [your directory's name].Rproj. Now, every time you want to work on your project, open this RStudio project.
Note that if you created a directory for your project before creating an RStudio project, you can't use RStudio's project templates. It's therefore easiest to create a directory by using the RStudio Project wizard (by following the steps we just explained), instead of first creating a directory and then adding in a project.
If you used the DCL Project template, you'll have a directory with the following folders:
- data: cleaned data
- data-raw: raw data
- docs: data documentation and notes
- eda: exploratory data analysis on your cleaned data
- scripts: data-cleaning scripts
- reports: findings to present to others
The following sections explain the contents of each folder in more detail.
The data-raw folder is for your raw data (i.e., the data that you haven't touched yet).
For each data file, come up with a short, but descriptive, name. You'll use these names to name other files.
If you're worried about your files being too large to push to GitHub, you can adjust the .gitignore so that git won't track anything in this folder. The example .gitignore specifies that git should ignore everything in data-raw and data.
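As a rough sketch, .gitignore entries along the following lines would tell git to ignore the contents of both folders. This is an illustration of the idea, not a copy of the example .gitignore, which may be written differently.

```gitignore
# Ignore everything inside the raw and cleaned data folders
data-raw/*
data/*
```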
The scripts folder is for your data-cleaning scripts. Each script reads in raw data, cleans it, and writes the cleaned data to a .rds file in the data folder.
You should have one script for each raw data source. Give each script the same name as its raw data source. For example, one of our raw data files is named collisions.csv. The script that cleans collisions.csv is called collisions.R, and collisions.R writes the cleaned data to collisions.rds.
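Here is a minimal sketch of what a cleaning script such as collisions.R might look like, with parameters kept apart from the code. The column names and the single data row are made up, and the paths point into a temporary folder so the example runs anywhere. The repository's actual script may differ.

```r
# Sketch of a cleaning script: read raw data, clean it, write an .rds file.
# Assumption: a temporary folder stands in for the project directory.

# Parameters
dir_project <- file.path(tempdir(), "project-example")
file_raw <- file.path(dir_project, "data-raw", "collisions.csv")
file_out <- file.path(dir_project, "data", "collisions.rds")

# Demo set-up only: create the folders and a tiny made-up raw file
dir.create(dirname(file_raw), recursive = TRUE, showWarnings = FALSE)
dir.create(dirname(file_out), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Genus,Species,Date", "genus_a,species_a,2016-10-01"), file_raw)

# Code
collisions <- read.csv(file_raw, stringsAsFactors = FALSE)
names(collisions) <- tolower(names(collisions))   # consistent lower-case names
collisions$date <- as.Date(collisions$date)       # parse dates
saveRDS(collisions, file_out)
```

In a real project, file_raw and file_out would point at data-raw/collisions.csv and data/collisions.rds.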
If you want to join multiple data sources, create an additional script that joins the cleaned data files. For example, bird_collisions_light.R joins birds.rds, collisions.rds, and light_mp.rds, and writes to bird_collisions_light.rds. Joining scripts read the cleaned data from the data folder.
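The joining step could be sketched as follows. The three small data frames are made-up stand-ins for birds.rds, collisions.rds, and light_mp.rds. Their column names and the join keys are assumptions, and base R's merge() is used here only to keep the example dependency-free.

```r
# Made-up miniature stand-ins for the cleaned files (a real joining script
# would instead read them from the data folder with readRDS()).
birds <- data.frame(
  genus = c("genus_a", "genus_b"),
  species = c("species_a", "species_b"),
  flight_call = c("yes", "no"),
  stringsAsFactors = FALSE
)
collisions <- data.frame(
  genus = c("genus_a", "genus_a", "genus_b"),
  species = c("species_a", "species_a", "species_b"),
  date = as.Date(c("2016-10-01", "2016-10-02", "2016-10-02")),
  stringsAsFactors = FALSE
)
light_mp <- data.frame(
  date = as.Date(c("2016-10-01", "2016-10-02")),
  light_score = c(3, 14)
)

# Left joins keep every collision row and add matching bird and light columns.
bird_collisions_light <-
  merge(collisions, birds, by = c("genus", "species"), all.x = TRUE)
bird_collisions_light <-
  merge(bird_collisions_light, light_mp, by = "date", all.x = TRUE)

# Write to a file named after the joining script
# (a temporary folder here; the data folder in a real project).
file_out <- file.path(tempdir(), "bird_collisions_light.rds")
saveRDS(bird_collisions_light, file_out)
```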
We recommend having a clear, common format for your scripts. See scripts/template.R for our template. Specifically, we recommend always separating parameters from code and describing the purpose of each script in comments.
The data folder contains cleaned data (in .rds format) that is ready to analyze. Each .rds file should have the same name as its corresponding raw data file and cleaning script.
If you joined data, that cleaned data should be here too. The cleaned and joined data should have the same name as the script that carried out the joining.
The docs folder is for any documentation files you used to understand the data, as well as for any notes you have on the data or your plan for analysis.
The eda folder contains R Markdown files with your EDA work. Create one R Markdown file for each cleaned data set that you want to explore.
Again, match the names of these files to your data files and cleaning scripts. For example, birds.Rmd performs EDA on just birds.rds, bird_collisions_light.Rmd performs EDA on just bird_collisions_light.rds, etc.
The reports folder contains reports on your data. These don't need to be named according to the convention of the other files. For example, our reports folder just has one report called report.Rmd.
here::here() for file paths
Say you want to give the file path for collisions.csv in collisions.R. One way to specify the file would be to give the file path relative to the scripts folder: "../data-raw/collisions.csv". However, this will only work if you set your working directory to the scripts folder every time you run your script. It also means you have to think about where folders are located relative to each other.
The here package makes this process easier. The function here::here() allows you to specify a file path relative to the directory of your .Rproj file, no matter what folder you're in. For example, with here::here(), you give the file path of collisions.csv as here::here("data-raw/collisions.csv").
See the example scripts, EDA documents, and reports for examples.
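As a small sketch, assuming the here package is installed and the code is run from somewhere inside the project, the input and output paths of a cleaning script could be built like this. The variable names file_raw and file_out are illustrative.

```r
# Assumes the here package is installed, e.g. via install.packages("here").
# Build paths from the project root rather than from the current folder.
file_raw <- here::here("data-raw", "collisions.csv")
file_out <- here::here("data", "collisions.rds")

# Both are full paths, so the same code works whether it is run
# from scripts/, eda/, or reports/.
print(file_raw)
print(file_out)
```

Passing the folder and file name as separate arguments is equivalent to passing the single string "data-raw/collisions.csv".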
R Markdown template
The dcl package also contains an R Markdown template to use for your EDA files and reports. To use the template:
- Click on the new file button in the top-left corner of RStudio.
- Select R Markdown > From Template > DCL GitHub Document.
Our template is similar to the default GitHub document template, but includes a table of contents by default; formats the first R chunk to highlight places for libraries, parameters, and reading in code; and has example headers.
Imagine that birds.txt, our example raw data set, gets updated. Maybe the original owners added new birds or corrected a mistake that they noticed. birds.rds, our cleaned version of this data, depends on birds.txt. bird_collisions_light.rds, our cleaned and joined data, also depends on birds.txt, as do some of our EDA files and reports. To update all these files, we could rerun birds.R to regenerate birds.rds and bird_collisions_light.R to regenerate bird_collisions_light.rds. Then, we could re-knit the relevant EDA files and reports so that they use the updated data. However, manually updating all our files can get tedious. It also requires remembering which files depend on each other, which can get complicated.
Makefiles are a better way to keep track of dependencies and update files when there are changes. We've created a makefile for this example project. It specifies which files depend on each other, as well as what to do when certain files change (e.g., run the script or knit the R Markdown file).
Makefiles are read by a program called Make. Make looks for changes in the files specified in the makefile. Then, it rebuilds the files that depend on the files that changed, based on the dependency structure given in the makefile.
Importantly, Make will only rebuild files affected by a change. For example, say birds.txt changes. Because of how our makefile is set up, Make will rerun birds.R and bird_collisions_light.R, which will rewrite birds.rds and bird_collisions_light.rds. Then, Make will re-knit the EDA files birds.Rmd and bird_collisions_light.Rmd, as well as our report report.Rmd. However, it will not rerun collisions.R, re-knit collisions.Rmd, etc., because these other files do not depend on birds.txt.
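To see this behavior in isolation, here is a small self-contained sketch. It is not the project's makefile: the file names in.txt and out.txt and the cp recipe are made up for illustration, and it assumes GNU Make is installed.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"

# A one-rule makefile: out.txt depends on in.txt and is built by copying it.
# printf writes the tab character that must start a recipe line.
printf 'out.txt : in.txt\n\tcp in.txt out.txt\n' > Makefile

echo "v1" > in.txt
make            # out.txt does not exist yet, so Make builds it
make            # nothing changed, so Make leaves out.txt alone

sleep 1         # make sure the next edit gets a later timestamp
echo "v2" > in.txt
make            # in.txt is now newer than out.txt, so Make rebuilds it
cat out.txt
```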
GNU Make is free software and comes installed on Macs and most Unix machines. If you're a Windows user, you might need to install Make yourself. The following will explain how to use the makefile we have provided in this repo. The GNU Make manual is a reference where you can learn more.
To run Make, navigate to your project directory from the command line. Then, type make and hit enter. Note that this won't work for your specific project until you create your own makefile by following the instructions below.
Creating a makefile
You'll need to edit the example makefile in order for Make to work for you. If you're using our recommended folder organization, you should be able to re-use a lot of the example file.
VPATH = data data-raw eda reports scripts
This variable provides the names of all the folders where Make should look for your files. If you used our recommended folder organization, you shouldn't have to change anything. If you used different folder names (or have additional folders), just change the names. Make sure to separate the folders with a single space.
all : $(DATA) $(EDA) $(REPORTS) on line 14 defines a target called all. This tells Make to, by default, consider all the files defined by DATA, EDA, and REPORTS.
If you used the recommended folder organization, you won't need to change line 14. However, you will need to change the definitions of DATA, EDA, and REPORTS on lines 4-11 to contain the names of the files you'd like Make to monitor and automatically update.
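As a sketch, the variable definitions and the all target could look something like the following. The file lists are inferred from the files named in this document, so treat them as illustrative rather than as the exact contents of the example makefile.

```make
# Cleaned data files
DATA = birds.rds collisions.rds light_mp.rds bird_collisions_light.rds

# Knitted EDA documents
EDA = birds.md collisions.md bird_collisions_light.md

# Knitted reports
REPORTS = report.md

# Default target: bring every listed file up to date
all : $(DATA) $(EDA) $(REPORTS)
```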
Now, you need to specify the dependencies of your project. Lines 16-27 define our dependencies.
File A depends on File B if changing File B can change File A. For example, birds.md depends on birds.rds because changing the cleaned data in birds.rds could change the analysis, visualizations, etc. in birds.md.
Create a line for each file in your project that depends on at least one other file. Specify the dependencies by using the following syntax:
[target file] : [dependency file 1] [dependency file 2] [dependency file 3]
Your files can have any number of dependencies, but make sure to separate the dependencies with a single space. If you need more than one line for your dependencies, end all lines except the last with a "\".
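For instance, a few dependency lines in this syntax might look like the sketch below. The first four follow from relationships described in this document. The dependencies listed for report.md are an assumption, included only to show a line continued with a backslash.

```make
# Cleaned data depends on raw data
birds.rds : birds.txt
collisions.rds : collisions.csv

# Joined data depends on the cleaned files it joins
bird_collisions_light.rds : birds.rds collisions.rds light_mp.rds

# A knitted EDA document depends on the data it explores
birds.md : birds.rds

# A long dependency list split across lines (assumed dependencies)
report.md : birds.rds \
  bird_collisions_light.rds
```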
Finally, you need to tell Make how to update different types of files. We want Make to run a script if raw data changes, but knit an R Markdown document if cleaned data changes. Lines 30-33 define our rules.
You probably won't need to update these rules, but it's useful to understand them.
The first rule (lines 30-31) tells Make how to update a .rds file. For example, say birds.txt changes. Make knows that birds.rds depends on birds.txt because of our specified dependencies. Make then looks to our first rule to figure out how to update birds.rds. The rule says to run the R script with the same name as the .rds file. In our example, that script is birds.R, so Make will run birds.R.
The second rule (lines 32-33) tells Make how to update a .md file. The rule tells Make to knit the .Rmd version of the relevant .md file. For example, if birds.md needs updating (because birds.rds changed), Make will knit birds.Rmd.
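Two pattern rules of this kind could be sketched as follows. This is one plausible way to write them, not necessarily the exact text of the example makefile. It assumes Rscript is on your path and that the rmarkdown package is installed. Recipe lines must begin with a tab character.

```make
# Build any .rds file by running the R script with the same name.
# $< expands to the first prerequisite, here the matching .R file.
%.rds : %.R
	Rscript $<

# Build any .md file by knitting the .Rmd file with the same name.
%.md : %.Rmd
	Rscript -e 'rmarkdown::render(input = "$<")'
```

Because VPATH lists the project folders, Make can find birds.R in scripts and birds.Rmd in eda without the rules spelling out the folder names.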