contain-R

Apptainer/Singularity container for reproducible R environments.

What you need for this to work

  • Apptainer or Singularity CE
  • An install.R or renv.lock file (examples below) that defines the environment
  • An R script/project/command that you want to run in that environment
  • No need to install R itself (R 4.3.0 is provided by the container)

Motivation and big picture

For reproducibility it is important to:

  • document dependencies
  • isolate dependencies from dependencies of other projects

This container:

  • creates a per-project renv-environment and isolates dependencies
  • uses pak under the hood to speed up installation
  • lets you configure a user- or group-wide cache that can be reused across projects
  • does not allow accidental "I will just quickly install it into my system and document it later" since it is a container
  • forces you to document your dependencies which is good for reproducibility and your future self

Dependencies are not installed into the container but only managed by the container.

Quick start on your computer

  1. Create a new directory.
  2. In the new directory create a file install.R which contains:
renv::install('ggplot2')
  3. Download the container:
$ singularity pull https://github.com/bast/contain-R/releases/download/0.1.0/contain-R.sif
  4. Run the following in your terminal (it starts installing stuff; this takes 1-2 minutes on my computer):
$ ./contain-R.sif R --quiet -e 'library(ggplot2)'
  5. Run the above again (now it will only take a second).
  6. Run some R script which depends on that environment:
$ ./contain-R.sif Rscript somescript.R
  7. Or if you want the R interactive shell:
$ ./contain-R.sif R

Quick start on a cluster

Same as above but instead of steps 3 and 4, use the following and adapt paths to your situation:

# you probably do not want to work in your home folder, to avoid filling your disk quota
cd /cluster/work/users/myself/experiment

# download the container
singularity pull https://github.com/bast/contain-R/releases/download/0.1.0/contain-R.sif

# you decide where these should go
export RENV_CACHE=/cluster/work/users/myself/renv-cache
export PAK_CACHE=/cluster/work/users/myself/pak-cache

# you need only one of the two
export SINGULARITY_BIND="/cluster"
export APPTAINER_BIND="/cluster"

./contain-R.sif R --quiet -e 'library(ggplot2)'

install.R or renv.lock or both?

You need something to define the environment you want, either install.R or renv.lock.

An install.R file looks like this:

renv::install('ggplot2')
renv::install('vcfR')
renv::install('hierfstat')
renv::install('poppr')

List as many packages as you need. You can pin them to specific versions, if needed:

renv::install("digest@0.6.18")

Alternatively, you can create your environment from renv.lock which looks like this example and typically has been generated by renv:

{
  "R": {
    "Version": "3.6.1",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.7",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "908d95ccbfd1dd274073ef07a7c93934"
    }
  }
}

For more information about lock files, please see https://rstudio.github.io/renv/reference/lockfiles.html.

The container will process them in this order:

  • If there is only install.R, it will use that one and create an renv environment and lock dependencies in renv.lock.
  • If there is only renv.lock, it will use that one and create an renv environment.
  • If install.R is more recent than the renv directory, it will install from it (again).
  • If renv.lock is more recent than the renv directory, it will install from it (again).
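
The rules above can be sketched as a small shell function. This is only an illustration, not the container's actual code: the function name decide_source and its output labels are hypothetical, and giving install.R precedence when both files qualify is an assumption, since the rules above do not say which one wins in that case.

```shell
# Hypothetical helper: print which file the environment would be (re)built from.
# Runs in a subshell, so its variables do not leak into the calling shell.
decide_source() (
    dir="${1:-.}"
    install="$dir/install.R"
    lock="$dir/renv.lock"
    env="$dir/renv"

    if [ ! -e "$install" ] && [ ! -e "$lock" ]; then
        # Neither definition file exists: nothing to build from.
        echo "missing-definition"
    elif [ -e "$install" ] && { [ ! -d "$env" ] || [ "$install" -nt "$env" ]; }; then
        # install.R exists and the renv directory is missing or older.
        # Assumption: install.R takes precedence over renv.lock.
        echo "install.R"
    elif [ -e "$lock" ] && { [ ! -d "$env" ] || [ "$lock" -nt "$env" ]; }; then
        # renv.lock exists and the renv directory is missing or older.
        echo "renv.lock"
    else
        # The renv directory is at least as recent as the definition files.
        echo "up-to-date"
    fi
)
```

Calling decide_source in a project directory prints one of the four labels, which can help explain why a run did or did not trigger a re-install.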

In practice you will probably do either of these two:

  • You arrive with install.R and it will create renv.lock and renv. You can then take the renv.lock and use it to share an environment with your friend. Maybe you modify install.R later and refresh renv.lock and renv.
  • Or you arrive with renv.lock that you got from somebody and it will create renv.

Generated paths

Running the container creates the following files and directories in the same place where you run the container (but you can configure some of them if you want them somewhere else):

  • renv - holding the environment
  • renv.lock - created or updated if you installed from install.R
  • .Rprofile - created or modified; renv adds the line source("renv/activate.R")
  • renv-cache - renv package cache; you can change its location by defining environment variable RENV_CACHE
  • pak-cache - pak package cache; you can change its location by defining environment variable PAK_CACHE
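
To see which of these paths exist in a project directory, a check along the following lines may help. It is a minimal sketch: the function name report_generated is hypothetical, and it only looks inside the given directory, so caches relocated elsewhere are reported as absent.

```shell
# Hypothetical helper: report which generated paths exist in a directory
# and whether .Rprofile contains the line that renv adds.
report_generated() (
    dir="${1:-.}"
    for path in renv renv.lock .Rprofile renv-cache pak-cache; do
        if [ -e "$dir/$path" ]; then
            echo "present: $path"
        else
            echo "absent: $path"
        fi
    done
    # -F treats the pattern as a fixed string, so the parentheses are literal.
    if grep -qF 'source("renv/activate.R")' "$dir/.Rprofile" 2>/dev/null; then
        echo "activated: yes"
    else
        echo "activated: no"
    fi
)
```

Running report_generated before and after the first container run shows what the run created.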

Installation takes too long?

Running a script for the first time may take time since it needs to set up the environment and download and install dependencies.

However, re-running the script skips installation entirely. A first run in a new project is also fast if its dependencies are already in the cache.
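
One way to observe the difference between a first run and a repeated run is to time the same command twice. The sketch below assumes the container sits at ./contain-R.sif as in the quick start; the function name time_run and the CONTAINER override variable are hypothetical and not part of the container.

```shell
# Hypothetical helper: run the quick-start command once and print
# its exit status and wall-clock duration in whole seconds.
time_run() (
    # Assumption: the container is in the current directory unless
    # CONTAINER points somewhere else.
    container="${CONTAINER:-./contain-R.sif}"
    start=$(date +%s)
    "$container" R --quiet -e 'library(ggplot2)' >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "exit=$status seconds=$((end - start))"
)
```

Calling time_run twice in a row in a fresh project should show a long first run and a short second one, if the behaviour described above holds on your machine.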

Pak and renv use different caches and methods

For historical reasons the two work slightly differently, but their developers are working on smoothing things out. You will notice the difference if you start from install.R and then try to restore from the generated renv.lock: the two steps use different methods.

You have the option to turn off pak like this:

export USE_PAK=false

Pros and cons of turning it off:

  • Advantage: Only one cache location, and everything is consistent. You could install from install.R, then even remove it and run from renv.lock, and nothing would need to be re-installed.
  • Disadvantage: The first installation from install.R might take longer without pak.

How to configure location for package caches

You can change the location of the package caches:

export RENV_CACHE=/home/user/R/renv-cache
export PAK_CACHE=/home/user/R/pak-cache

Recommendations on where to place package caches

On your own computer it will make sense to reuse the same cache(s) across all projects. This way, when installing dependencies, renv will first look whether you already have the package on your computer.

On a shared cluster it might make sense to have one common cache for your group/allocation since your research group might use similar dependencies in their work. This way you can save space and install time.
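
A minimal sketch of pointing both caches at one shared root follows. The function name setup_caches is hypothetical, and the sketch does not handle permissions: making a group-wide cache writable for all members depends on how your cluster manages groups.

```shell
# Hypothetical helper: create both cache directories under one root
# and export the two variables described above.
setup_caches() {
    cache_root="$1"
    if [ -z "$cache_root" ]; then
        echo "usage: setup_caches <cache-root>" >&2
        return 1
    fi
    # -p creates missing parent directories and leaves existing caches untouched.
    mkdir -p "$cache_root/renv-cache" "$cache_root/pak-cache" || return 1
    export RENV_CACHE="$cache_root/renv-cache"
    export PAK_CACHE="$cache_root/pak-cache"
}
```

For example, setup_caches /cluster/work/users/myself sets the same RENV_CACHE and PAK_CACHE values as the cluster quick start. The function has to run in the shell that later starts the container, because the exported variables only apply to that shell and its children.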

Known problems / ideas for later

  • Maybe you need a different version of R than 4.3.0. I guess we should at some point have several containers for different versions? Or you build your own from the definition file.
  • It could be good to let the user configure where renv itself should be located. Currently it is placed in the same folder where the container is run.
