Data Management and Reproducible Research - 2016.10.27 - Dan Lurie and Chris Gagne
Another scientist should be able to reproduce your entire research pipeline, from data collection to final figures, without having to email you with questions. It sounds intimidating, but it doesn't have to be, and in practice it's usually not that much extra work. You already know all the information they would need; it's largely just a matter of being mindful of how you do things and keeping a record. More selfishly, working in a reproducible way will make your own life easier, especially when you have to come back to a project months or years later.
Practical Guidelines for Reliable and Reproducible Research
- Protect the sanctity of your raw data.
  - Always keep a hard-copy backup (e.g. CD/DVD, jump drive).
  - Thou shalt not modify. Make copies instead.
  - If working on a shared computer, lock down file permissions.
- Keep things organized!
  - Folder and file names should be self-explanatory.
  - Be descriptive! You’re not paying per character.
  - If possible, use a standardized file structure (e.g. Brain Imaging Data Structure).
  - Track data provenance. Where did this file come from?
  - Use README files to store notes about the contents of each folder.
- Keep a lab notebook.
  - Track everything you do in your research. You're not going to remember later.
  - What manual edits or manipulations did you make to the data (and why)?
  - What commands did you run, in what order, and with what options?
  - What did you try that didn’t work?
- Have a backup plan.
  - Dropbox and Google Drive are crazy easy to use.
  - Make sure data is properly anonymized before uploading anywhere!
  - Think about long-term storage options. What happens when you leave the lab?
- Use a version control system (e.g. Git, with GitHub for hosting) to track changes.
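As one possible way to act on the raw-data guidelines above, here is a minimal Python sketch that copies a raw-data folder to a working folder, records a checksum for each raw file, and then removes write permission from the originals. The function names (`sha256_of`, `protect_raw_data`), the `CHECKSUMS.txt` file name, and the folder layout are illustrative assumptions, not part of any standard. It also assumes a POSIX-style system and that the working folder does not exist yet.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_raw_data(raw_dir, working_dir) -> Path:
    """Copy raw files to a working folder, record checksums, lock the originals.

    Hypothetical helper: the manifest name and layout are assumptions.
    """
    raw_dir, working_dir = Path(raw_dir), Path(working_dir)
    files = sorted(p for p in raw_dir.rglob("*") if p.is_file())

    # Work on copies so the originals are never edited.
    # copytree raises an error if working_dir already exists.
    shutil.copytree(raw_dir, working_dir)

    # Record one checksum per raw file so later changes can be detected.
    manifest = working_dir / "CHECKSUMS.txt"
    lines = [f"{sha256_of(p)}  {p.relative_to(raw_dir).as_posix()}" for p in files]
    manifest.write_text("\n".join(lines) + "\n")

    # Remove write permission from the raw files (read-only for everyone).
    for p in files:
        p.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return manifest
```

Calling `protect_raw_data("raw", "working")` once, right after data collection, leaves all later edits to happen in `working/`. The manifest doubles as a simple provenance record: rerunning `sha256_of` on a raw file and comparing against it shows whether the file has changed.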
Using GitHub for Version Control
- Git Basics - What is version control? (video)
- The Git Parable
- Git/Github: A primer for researchers
- A quick introduction to version control with Git and GitHub
- Git for scientists: A tutorial
- Version control with Git
Basic GitHub workflow:
1. Initialize a project repository.
2. Add files.
3. Commit changes.
4. Push changes.
5. Repeat steps 2-4 as you work.
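The workflow above maps onto a handful of Git commands. The sketch below lists them in order and offers a small helper to run them from Python. The function names, the remote name `origin`, and the branch name `main` are assumptions for illustration (older repositories often use `master`). The push step only works once a remote has been configured, for example with `git remote add origin <url>`.

```python
import subprocess
from pathlib import Path


def git_workflow(message, files=(".",), remote="origin", branch="main"):
    """Return the Git commands for one pass through the workflow, in order."""
    return [
        ["git", "init"],                        # initialize (first time only)
        ["git", "add", *files],                 # add files
        ["git", "commit", "-m", message],       # commit changes
        ["git", "push", "-u", remote, branch],  # push changes
    ]


def run_commands(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=Path(repo_dir), check=True)
```

For day-to-day work, dropping the first command (`git_workflow("Describe the change")[1:]`) gives the add, commit, push cycle that gets repeated. Most people type these commands directly in a terminal; the Python wrapper is only there to keep the example in one language.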
Digital Lab Notebooks
For some reason, many in psychology and neuroscience seem to have forgotten about lab notebooks. In almost every other area of science, lab notebooks sit at the very core of the entire research enterprise. A notebook is not just a log of what you've done; it's also a place to make plans, consider hypotheses, jot down half-formed ideas, etc. Science isn't just the stuff we do when working with subjects or running statistical tests; everything that happens in between is often just as (or even more) important. In recent years, many people have been moving from paper and pencil to digital lab notebooks, and some scientists are now keeping completely open notebooks that anyone can view. We focus here on digital notebooks, but paper notebooks are still going strong. The most important thing is finding a reliable system that works for you.
We both (Dan and Chris) tend to work in two parallel notebook-type places:
- A text-based system for research notes, brainstorming, planning, interpretation, daily logs, code snippets, etc.
- A computational notebook for doing and tracking our analysis.
Popular text-based systems:
- Google Docs
- Microsoft OneNote
The concept of a computational lab notebook gained popularity in large part due to the success of IPython Notebook (now part of Jupyter), but there are now notebook options for most analysis environments.
- Jupyter (supports Python, R, and MATLAB)
- RStudio and R Markdown
- MATLAB Live Editor
- SPSS Syntax Editor (not really a notebook, but it will do in a pinch if you have no choice but to use SPSS)
Benefits of Keeping a Digital Lab Notebook
A log of what you did and how you did it. This is historically the most important purpose of a lab notebook, and has become even more critical in light of the "reproducibility crisis". If you're good about keeping all your analysis in a notebook, your science will be immediately reproducible and your methods sections will write themselves. Trust us, you're not going to remember all the details later, and one of those little details might make all the difference.
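If your text-based notebook is a plain file, even a tiny helper can lower the cost of logging what you did, why, and with which command. The following is a rough sketch, not a recommended tool: the function name `log_entry`, the entry layout, and the example file name are all made up for illustration.

```python
from datetime import datetime
from pathlib import Path


def log_entry(notebook, what, why="", command=""):
    """Append one timestamped entry to a plain-text notebook and return it."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    lines = [f"[{stamp}] {what}"]
    if why:
        lines.append(f"  why: {why}")
    if command:
        lines.append(f"  command: {command}")
    entry = "\n".join(lines) + "\n"
    # Open in append mode so earlier entries are never overwritten.
    with Path(notebook).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

A call such as `log_entry("notebook.txt", "Excluded one subject", why="excessive head motion")` adds a dated line plus the reason. Passing `command=` stores the exact command and options you ran.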
A way of sharing results with collaborators. Most science is a team sport, and that often means sharing data, results, and analyses with people who may not be in the same place as you. If you use one of the computational notebooks listed above, it is super easy to generate nice looking PDFs or even interactive webpages where collaborators can see the data, analysis code, results and figures.
A part of your published research. More and more journals (even Nature and Science) are requiring authors to publish or provide their raw data and analysis scripts. This is great for science, but can be a pain for scientists if they haven't been doing things in a reproducible way. If you've already got everything in a digital lab notebook and tracked through a GitHub repository, all you have to do is make things public.
Here are some examples of notebooks/repositories for published/in-press papers:
- Adaptive engagement of cognitive control in context-dependent decision-making.
- Frontoparietal representations of task context support the flexible control of goal-directed cognition.
- Choosing prediction over explanation in psychology: Lessons from machine learning
- Adolescence is associated with genomically patterned consolidation of the hubs of the human brain connectome
- GitHub will automatically render any Jupyter and R Markdown notebooks you have uploaded.
- Use nbviewer to display Jupyter notebooks hosted on your own server.
- Turn a Jupyter notebook into a PDF or HTML website using nbconvert.
- You can optionally hide your code by using this Jupyter cell or this nbconvert template.
- Easily create interactive figures with Plot.ly (Python, R, and MATLAB) and MPLD3 (Python).
- Share your entire computational environment (data, software, scripts, notebooks, etc) using Binder!
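As a small illustration of the nbconvert tips above, this sketch builds the `jupyter nbconvert` command line for turning a notebook into HTML or PDF, optionally hiding the code cells. The helper names are hypothetical. The `--no-input` flag is assumed to be available, which is the case in recent nbconvert releases, and PDF export additionally needs a working LaTeX installation.

```python
import subprocess


def nbconvert_command(notebook, to="html", hide_code=False):
    """Build the argument list for a `jupyter nbconvert` call."""
    argv = ["jupyter", "nbconvert", "--to", to]
    if hide_code:
        # Leave code cells out of the output, keeping text and results.
        argv.append("--no-input")
    argv.append(str(notebook))
    return argv


def convert(notebook, **options):
    """Run the conversion; requires Jupyter and nbconvert to be installed."""
    subprocess.run(nbconvert_command(notebook, **options), check=True)
```

For example, `convert("analysis.ipynb", hide_code=True)` would write `analysis.html` next to a (hypothetical) notebook named `analysis.ipynb`, which can be handy for collaborators who only want the figures and text.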
Open and reproducible science resources at UC Berkeley:
- Berkeley Institute for Data Science (BIDS)
- Berkeley Initiative for Transparency in the Social Sciences (BITSS)
- Stat 159/259: Reproducible and Collaborative Statistical Data Science
- Dan is also always happy to chat about this stuff.