Automatically Reproducible Paper Template

This repository is a minimal template for creating a dynamically generated, automatically reproducible paper using DataLad, Python, Makefiles, and LaTeX.

In this example, DataLad is used to link the manuscript to code and data. Within a Python script that computes results and figures, DataLad's Python API is used to retrieve data automatically. The LaTeX manuscript does not hard-code results or tables, but embeds external files or variables that contain them. The Python script saves its results and figures into the dataset, and also writes its results in a form that can be embedded into the manuscript as variables. A Makefile orchestrates code execution and manuscript compilation. With this setup, generating a manuscript with freshly computed results is a matter of running a single make command in a cloned repository.
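As a rough sketch of that pattern (all names below are illustrative assumptions, not the template's actual identifiers): the analysis script fetches its inputs via DataLad's Python API and then writes a small .tex file of variable definitions that the manuscript can embed.

```python
# In the real script, inputs are fetched first via DataLad's Python API, e.g.:
#   import datalad.api as dl
#   dl.get("input/data.csv")

def results_to_latex(results):
    """Render computed results as LaTeX newcommand definitions."""
    lines = []
    for name, value in results.items():
        # \newcommand\meanAge{42.0} can later be used as \meanAge in main.tex
        lines.append("\\newcommand\\%s{%s}" % (name, value))
    return "\n".join(lines) + "\n"

# hypothetical computed statistics
stats = {"meanAge": 42.0, "nSubjects": 96}
tex = results_to_latex(stats)
print(tex)
```

Writing results as LaTeX macros instead of pasting numbers into the text is what keeps the manuscript and the computation from drifting apart.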

The template is meant to be used in workshops or tutorials and is therefore a simplified example. However, it is based on an actually published, reproducible paper that can be found in this repository.

If you found this repository outside of a DataLad teaching event, do note that this repository is a DataLad dataset. You can find out what DataLad datasets are in the short introduction at the end of this README, and at handbook.datalad.org.

Requirements

To run and adjust this template to your own manuscript or research project, you will need to have the following software tools installed:

  • A Python installation
  • latexmk to render the PDF from main.tex, and make to run the Makefile
  • (Optionally) Inkscape, to render figures that are in SVG format

How to build the paper

  • git clone the repository:
# don't copy the leading $ - it only distinguishes a command from a comment.
$ git clone https://github.com/datalad-handbook/repro-paper-sketch.git
# one way to create a virtual environment (called repro in this example):
$ virtualenv --python=python3 ~/env/repro
$ . ~/env/repro/bin/activate
  • run make

The resulting PDF will be called main.pdf and you will find it in the root of the dataset. If things go wrong during make, run make clean to clean up any left-behind clutter.

How the setup works

This repository is a DataLad dataset that links data to scripts which compute numerical results and figures from it. The script saves figures and outputs numerical results. Orchestration with a Makefile ensures that these results are collected and that a manuscript embedding the freshly created results and figures is generated on the fly. If SVG figures exist in the img/ directory, they will be rendered with Inkscape so that they, too, can be embedded. Comments in the relevant sections of the Makefile shed light on what each line does. While the simple Makefile included in this template should get you started, an introduction to Makefiles can also be found in The Turing Way handbook for reproducible research.
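A minimal Makefile in this spirit could look as follows; the file names and rules here are assumptions based on the description above, not the template's exact Makefile:

```makefile
# build the PDF from freshly computed results
all: main.pdf

# run the analysis; it fetches inputs via DataLad and writes
# figures plus a results.tex with LaTeX variable definitions
results.tex: code/mk_figuresnstats.py
	python code/mk_figuresnstats.py

# compile the manuscript once the results exist
main.pdf: main.tex results.tex
	latexmk -pdf main.tex

# remove generated build artifacts
clean:
	latexmk -C
	rm -f results.tex
```

Because make tracks prerequisites, editing only the manuscript recompiles the PDF without rerunning the analysis, while editing the script triggers a full recomputation.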

Quick start tutorial

To give you a quick idea of how to use this template, make the manuscript, and then adjust the script in a way that would minimally affect the results. Afterwards, run make again to see your changes embedded in the manuscript.

# We will assume you have the relevant software (Python, latexmk) set up
# create a fresh virtual env and activate it
$ virtualenv --python=python3 ~/env/repro
$ . ~/env/repro/bin/activate
# generate the manuscript with unchanged code
$ make
# open the resulting PDF with a PDF viewer (for example evince)
$ evince main.pdf
# open code/mk_figuresnstats.py with an editor of your choice. Adjust the color palette
# in the function plot_relationships() from "muted" to "Blues". You can also do
# this from the terminal with this line of code:
# macOS
$ sed -i '' 's/muted/Blues/g' code/mk_figuresnstats.py
# Linux
$ sed -i 's/muted/Blues/g' code/mk_figuresnstats.py
# run make again
$ make
# take another look at the PDF to see how the figure was dynamically updated

How to adjust the template

If this setup with LaTeX and Makefiles suits your workflow, adjust code, manuscript, and data to your own research project.

You can:

  • Change the contents of requirements.txt to Python modules of your choice. They will be installed before executing the script.
  • Change the code, or add new code. In the current template, the Python script (code/mk_figuresnstats.py) is executed.
  • Change the data. In the current template, code operates on the linked input/ DataLad dataset. You can link any dataset of your choice with DataLad (installation instructions and further info)
  • Change the manuscript. Adjust main.tex to your text of choice, add new figures, tables, or contents.
  • Install the data that you need as a subdataset using
$ datalad clone -d . <url>
  • Adjust the dl.get() call in the script to retrieve the data your analysis needs from it,
  • Write analysis code that saves its results either in LaTeX variables or in files that can be imported into a .tex file, and
  • Write your manuscript, embedding all of your results as figures, tables, or variables.
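On the LaTeX side, such generated results can then be embedded rather than hard-coded. A minimal sketch, assuming the script wrote a results.tex with \newcommand definitions and a figure into img/ (the file and macro names are assumptions):

```latex
% main.tex (excerpt): embed generated results instead of hard-coding them
\documentclass{article}
\usepackage{graphicx}
\input{results.tex}  % variable definitions written by the analysis script
\begin{document}
We analyzed \nSubjects{} participants (mean age \meanAge{} years).
\includegraphics[width=\linewidth]{img/figure1}
\end{document}
```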

Done. :)

If you are curious about how the manuscript PDF is actually built, make sure you read its content once you have successfully created it!

DISCLAIMER: This is not the only way to generate a reproducible research object, and there are many tools out there that can achieve the same. This template is just one demonstration of one way to write a reproducible manuscript.


DataLad datasets and how to use them

This repository is a DataLad dataset. It provides fine-grained data access down to the level of individual files, and allows for tracking future updates. In order to use this repository for data retrieval, DataLad is required. It is a free and open source command line tool, available for all major operating systems, and builds on top of Git and git-annex to allow sharing, synchronizing, and version controlling collections of large files. You can find information on how to install DataLad at handbook.datalad.org/en/latest/intro/installation.html.

Get the dataset

A DataLad dataset can be cloned by running

datalad clone <url>

Once a dataset is cloned, it is a light-weight directory on your local machine. At this point, it contains only small metadata and information on the identity of the files in the dataset, but not actual content of the (sometimes large) data files.

Retrieve dataset content

After cloning a dataset, you can retrieve file contents by running

datalad get <path/to/directory/or/file>

This command will trigger a download of the files, directories, or subdatasets you have specified.

DataLad datasets can contain other datasets, so-called subdatasets. If you clone the top-level dataset, subdatasets do not yet contain metadata and information on the identity of files, but appear to be empty directories. In order to retrieve file availability metadata in subdatasets, use the -n flag like so:

datalad get -n <path/to/subdataset>

Afterwards, you can browse the retrieved metadata to find out about subdataset contents, and use datalad get once again (no flag this time) to retrieve individual files. If you use datalad get <path/to/subdataset>, all contents of the subdataset will be downloaded at once.

Stay up-to-date

DataLad datasets can be updated. The command datalad update will fetch updates and store them on a different branch (by default remotes/origin/master). Running

datalad update --merge

will pull available updates and integrate them in one go.

Find out what has been done

DataLad datasets contain their history in the git log. By running git log (or a tool that displays Git history) in the dataset or on specific files, you can find out what has been done to the dataset or to individual files by whom, and when.

More information

More information on DataLad and how to use it can be found in the DataLad Handbook at handbook.datalad.org. The chapter "DataLad datasets" can help you to familiarize yourself with the concept of a dataset.

About

A template to create a reproducible paper with LaTeX, Makefiles, Python, and DataLad
