contain-R

Apptainer/Singularity container for reproducible R environments.

What you need for this to work

Apptainer or Singularity CE
An install.R or renv.lock file (examples below) that defines the environment
An R script/project/command that you want to run in that environment
No need to install R itself (R 4.3.0 is provided by the container)
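
To check the first requirement, a small sketch like the following reports which container runtime is on your PATH. The function name find_runtime is made up for this example and is not part of the container.

```shell
# Print the name of the first container runtime found on PATH.
# Returns non-zero if neither apptainer nor singularity is installed.
find_runtime() {
    for candidate in apptainer singularity; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if found=$(find_runtime); then
    echo "found container runtime: $found"
else
    echo "neither apptainer nor singularity found on PATH" >&2
fi
```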

Motivation and big picture

For reproducibility it is important to:

document dependencies
isolate dependencies from dependencies of other projects

This container:

creates a per-project renv-environment and isolates dependencies
uses pak under the hood to speed up installation
lets you configure a user- or group-wide cache that can be reused across projects
prevents the accidental "I will just quickly install it into my system and document it later", since it is a container
forces you to document your dependencies, which is good for reproducibility and for your future self

Dependencies are not installed into the container but only managed by the container.

Quick start on your computer

Create a new directory.
In the new directory create a file install.R which contains:

renv::install('ggplot2')

Download the container:

$ singularity pull https://github.com/bast/contain-R/releases/download/0.1.0/contain-R.sif

Run the following in your terminal (it starts installing stuff; this takes 1-2 minutes on my computer):

$ ./contain-R.sif R --quiet -e 'library(ggplot2)'

Run the above again (now it will only take a second).
Run some R script which depends on that environment:

$ ./contain-R.sif Rscript somescript.R

Or if you want the R interactive shell:

$ ./contain-R.sif R

Quick start on a cluster

Same as above, but instead of downloading the container and running it for the first time, use the following and adapt the paths to your situation:

# work outside your home folder to avoid filling your disk quota
cd /cluster/work/users/myself/experiment

# download the container
singularity pull https://github.com/bast/contain-R/releases/download/0.1.0/contain-R.sif

# you decide where these should go
export RENV_CACHE=/cluster/work/users/myself/renv-cache
export PAK_CACHE=/cluster/work/users/myself/pak-cache

# you need only one of the two
export SINGULARITY_BIND="/cluster"
export APPTAINER_BIND="/cluster"

./contain-R.sif R --quiet -e 'library(ggplot2)'
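
If you would rather set only the bind variable that matches your runtime, something like this sketch could go into your job script. The helper name export_bind is hypothetical, and "/cluster" is just the example path from above.

```shell
# Export the bind variable that matches the given runtime.
# $1: runtime name (apptainer or singularity), $2: path to bind
export_bind() {
    case "$1" in
        apptainer)   export APPTAINER_BIND="$2" ;;
        singularity) export SINGULARITY_BIND="$2" ;;
        *)           echo "unknown runtime: $1" >&2; return 1 ;;
    esac
}

# Use whichever runtime is installed; do nothing if neither is found.
for candidate in apptainer singularity; do
    if command -v "$candidate" >/dev/null 2>&1; then
        export_bind "$candidate" "/cluster"
        break
    fi
done
```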

install.R or renv.lock or both?

You need something to define the environment you want, either install.R or renv.lock.

An install.R file looks like this:

renv::install('ggplot2')
renv::install('vcfR')
renv::install('hierfstat')
renv::install('poppr')

List as many packages as you need. You can pin them to specific versions, if needed:

renv::install("digest@0.6.18")
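
If you prefer to create install.R from the command line, a here-document works. This is only a sketch: the package selection reuses names from the examples above, and it overwrites any install.R already present in the current directory.

```shell
# Write install.R into the current directory, one renv::install() call per package.
# Quoting 'EOF' keeps the shell from expanding anything inside the file body.
cat > install.R <<'EOF'
renv::install('ggplot2')
renv::install("digest@0.6.18")
EOF

# Show the result
cat install.R
```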

Alternatively, you can create your environment from an renv.lock file, which is typically generated by renv and looks like this:

{
  "R": {
    "Version": "3.6.1",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.7",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "908d95ccbfd1dd274073ef07a7c93934"
    }
  }
}

For more information about lock files, please see https://rstudio.github.io/renv/reference/lockfiles.html.

The container will process them in this order:

If there is only install.R, it uses that file, creates an renv environment, and locks the dependencies in renv.lock.
If there is only renv.lock, it uses that file and creates an renv environment.
If install.R is more recent than the renv directory, it installs from install.R again.
If renv.lock is more recent than the renv directory, it installs from renv.lock again.

In practice you will probably do either of these two:

You arrive with install.R and it will create renv.lock and renv. You can then take the renv.lock and use it to share an environment with your friend. Maybe you modify install.R later and refresh renv.lock and renv.
Or you arrive with renv.lock that you got from somebody and it will create renv.
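
The processing order described above can be sketched as a small shell function. This is one reading of the rules, not the container's actual implementation: decide_source is a made-up name, and comparing modification times with -nt is an assumption about how "more recent" is determined.

```shell
# Print which file an installation would start from for the project in $1:
# "install.R", "renv.lock", "nothing" (environment is up to date), or "none".
decide_source() {
    has_install=false
    has_lock=false
    if [ -f "$1/install.R" ]; then has_install=true; fi
    if [ -f "$1/renv.lock" ]; then has_lock=true; fi

    if ! $has_install && ! $has_lock; then
        echo "none"            # nothing defines the environment
        return 1
    elif $has_install && ! $has_lock; then
        echo "install.R"       # only install.R exists
    elif $has_lock && ! $has_install; then
        echo "renv.lock"       # only renv.lock exists
    elif [ "$1/install.R" -nt "$1/renv" ]; then
        echo "install.R"       # install.R changed after renv was built
    elif [ "$1/renv.lock" -nt "$1/renv" ]; then
        echo "renv.lock"       # renv.lock changed after renv was built
    else
        echo "nothing"
    fi
}

# Example: inspect the current directory.
echo "would install from: $(decide_source .)"
```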

Generated paths

Running the container creates the following files and directories in the same place where you run the container (but you can configure some of them if you want them somewhere else):

renv - holding the environment
renv.lock - created or updated if you installed from install.R
.Rprofile - created or modified; renv adds the line source("renv/activate.R")
renv-cache - renv package cache; you can change its location by defining environment variable RENV_CACHE
pak-cache - pak package cache; you can change its location by defining environment variable PAK_CACHE
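
To see which of these paths a run has left behind, a quick check like the following may help. It is a sketch that assumes the default locations (RENV_CACHE and PAK_CACHE not set); report_generated is a made-up name.

```shell
# List which of the generated paths exist in the project directory given as $1.
report_generated() {
    for name in renv renv.lock .Rprofile renv-cache pak-cache; do
        if [ -e "$1/$name" ]; then
            echo "present: $name"
        else
            echo "missing: $name"
        fi
    done
}

# Example: inspect the current directory.
report_generated .
```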

Installation takes too long?

Running a script for the first time may take time since it needs to set up the environment and download and install dependencies.

However, re-running the script requires no installation, and setting up an environment whose dependencies are already in the cache is fast as well.

Pak and renv use different caches and methods

For historical reasons pak and renv work slightly differently, but their developers are working on smoothing things out between the two. You will notice the difference if you start from install.R and then restore from the generated renv.lock: the two steps use different methods.

Relevant GitHub issues:

You have the option to turn off pak like this:

export USE_PAK=false

Pros and cons of turning it off:

Advantage: Only one cache location, and everything is consistent. You could install from install.R, then even remove it and run from renv.lock without re-installing anything.
Disadvantage: The first installation from install.R might take longer without pak.

How to configure location for package caches

You can change the location of the package caches:

export RENV_CACHE=/home/user/R/renv-cache
export PAK_CACHE=/home/user/R/pak-cache
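
If you keep both caches under one base directory, a small helper can set both variables at once. This is a sketch: setup_caches is a made-up name, and creating the directories up front is a precaution that assumes the container does not mind them already existing.

```shell
# Point both caches at subdirectories of the base directory given as $1
# and make sure the directories exist.
setup_caches() {
    export RENV_CACHE="$1/renv-cache"
    export PAK_CACHE="$1/pak-cache"
    mkdir -p "$RENV_CACHE" "$PAK_CACHE"
}

# Usage (for example in ~/.bashrc):
#   setup_caches "$HOME/R"
```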

Recommendations on where to place package caches

On your own computer it will make sense to reuse the same cache(s) across all projects. This way, when installing dependencies, renv will first look whether you already have the package on your computer.

On a shared cluster it might make sense to have one common cache for your group/allocation since your research group might use similar dependencies in their work. This way you can save space and install time.
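
A group-wide cache only works if all group members can write to it. Here is a minimal sketch using standard Unix permissions, assuming the directory belongs to your group. The name make_shared_cache and the path in the usage comment are made up for illustration, and your cluster may have its own conventions for shared project space.

```shell
# Create a cache directory that all members of its group can use.
make_shared_cache() {
    mkdir -p "$1"
    # g+rwx: group members may read the cache and add packages to it
    # g+s:   files and directories created inside inherit the directory's group
    chmod g+rwx,g+s "$1"
}

# Usage (hypothetical path):
#   make_shared_cache /cluster/projects/mygroup/renv-cache
#   export RENV_CACHE=/cluster/projects/mygroup/renv-cache
```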

Known problems / ideas for later

Maybe you need a different version of R than 4.3.0. I guess we should at some point have several containers for different versions? Or you build your own from the definition file.
It could be good to let the user configure where renv itself should be located. Currently it is placed in the same folder where the container is run.

Resources

I have used these resources when writing/testing:

https://rstudio.github.io/renv/
https://rstudio.github.io/renv/articles/docker.html
https://pak.r-lib.org/
https://rstudio.github.io/packrat/ (deprecated)
https://sites.google.com/nyu.edu/nyu-hpc/hpc-systems/greene/software/r-packages-with-renv
https://raps-with-r.dev/repro_intro.html
https://www.youtube.com/watch?v=N7z1K4FhVFE (stream recording on how to use renv)
https://github.com/singularityhub/singularity-deploy
