Apptainer/Singularity container for reproducible R environments.
- Apptainer or Singularity CE
- An install.R or renv.lock file (examples below) that defines the environment
- An R script/project/command that you want to run in that environment
- No need to install R itself (R 4.3.0 is provided by the container)
For reproducibility it is important to:
- document dependencies
- isolate dependencies from dependencies of other projects
This container:
- creates a per-project renv-environment and isolates dependencies
- uses pak under the hood to speed up installation
- lets you configure a user- or group-wide cache that can be reused across projects
- does not allow accidental "I will just quickly install it into my system and document it later" since it is a container
- forces you to document your dependencies which is good for reproducibility and your future self
Dependencies are not installed into the container but only managed by the container.
- Create a new directory.
- In the new directory create a file install.R which contains:
renv::install('ggplot2')
- Download the container:
$ singularity pull https://github.com/bast/contain-R/releases/download/0.1.0/contain-R.sif
- Run the following in your terminal (it starts installing stuff; this takes 1-2 minutes on my computer):
$ ./contain-R.sif R --quiet -e 'library(ggplot2)'
- Run the above again (now it will only take a second).
- Run some R script which depends on that environment:
$ ./contain-R.sif Rscript somescript.R
- Or if you want the R interactive shell:
$ ./contain-R.sif R
Same as above, but instead of downloading the container and running it for the first time (steps 3 and 4), use the following and adapt the paths to your situation:
# you probably do not want to work in your home folder, to avoid filling your disk quota
cd /cluster/work/users/myself/experiment
# download the container
singularity pull https://github.com/bast/contain-R/releases/download/0.1.0/contain-R.sif
# you decide where these should go
export RENV_CACHE=/cluster/work/users/myself/renv-cache
export PAK_CACHE=/cluster/work/users/myself/pak-cache
# you need only one of the two
export SINGULARITY_BIND="/cluster"
export APPTAINER_BIND="/cluster"
./contain-R.sif R --quiet -e 'library(ggplot2)'
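Since only one of the two bind variables is needed, the choice can be made automatically. The following is a minimal sketch and not part of the container: the helper names bind_var_for and detect_runtime are hypothetical, and it assumes that the apptainer command reads APPTAINER_BIND and the singularity command reads SINGULARITY_BIND.

```shell
# Print the name of the bind variable that matches the runtime given as $1.
# Returns non-zero for an unknown runtime name.
bind_var_for() {
    case "$1" in
        apptainer)   echo "APPTAINER_BIND" ;;
        singularity) echo "SINGULARITY_BIND" ;;
        *)           return 1 ;;
    esac
}

# Print the first runtime found on PATH, preferring apptainer.
detect_runtime() {
    for candidate in apptainer singularity; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Export only the variable that the installed runtime uses.
# "/cluster" is the example path from above; adapt it to your system.
if runtime=$(detect_runtime); then
    export "$(bind_var_for "$runtime")=/cluster"
else
    echo "neither apptainer nor singularity found on PATH" >&2
fi
```

Exporting both variables, as shown above, also works; the sketch only avoids setting a variable that nothing reads.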
You need something to define the environment you want, either install.R or renv.lock.
An install.R file looks like this:
renv::install('ggplot2')
renv::install('vcfR')
renv::install('hierfstat')
renv::install('poppr')
List as many packages as you need. You can pin them to specific versions, if needed:
renv::install("digest@0.6.18")
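If the package list is kept somewhere else, for example in a project setup script, install.R can be generated from it. This is a minimal sketch with a hypothetical helper write_install_r: it prints one renv::install(...) line per argument in the format shown above and does not validate package names or versions.

```shell
# Print one renv::install("...") line for each package spec given as an argument.
# A spec is either a package name or name@version, as in the examples above.
write_install_r() {
    for spec in "$@"; do
        printf 'renv::install("%s")\n' "$spec"
    done
}
```

For example, write_install_r ggplot2 digest@0.6.18 > install.R would create a two-line install.R with ggplot2 unpinned and digest pinned to 0.6.18.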
Alternatively, you can create your environment from renv.lock which looks like this example and typically has been generated by renv:
{
  "R": {
    "Version": "3.6.1",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.7",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "908d95ccbfd1dd274073ef07a7c93934"
    }
  }
}
For more information about lock files, please see https://rstudio.github.io/renv/reference/lockfiles.html.
The container will process them in this order:
- If there is only install.R, it will use that one and create an renv environment and lock dependencies in renv.lock.
- If there is only renv.lock, it will use that one and create an renv environment.
- If install.R is more recent than renv, it will install from it (again).
- If renv.lock is more recent than renv, it will install from it (again).
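The rules above can be sketched as a small shell function. This is an illustration of the described order and not the container's actual implementation: the name decide_source and its output labels are hypothetical, and the sketch assumes that install.R wins when both files exist and no renv directory has been created yet.

```shell
# Print which file the environment would be (re)built from in directory $1:
# "install.R", "renv.lock", "up-to-date", or "none".
decide_source() {
    dir="$1"
    if [ ! -d "$dir/renv" ]; then
        # No environment yet: use whichever definition file exists.
        # Assumption: install.R takes precedence if both are present.
        if [ -f "$dir/install.R" ]; then
            echo "install.R"
        elif [ -f "$dir/renv.lock" ]; then
            echo "renv.lock"
        else
            echo "none"
        fi
    elif [ "$dir/install.R" -nt "$dir/renv" ]; then
        # install.R was modified after the environment was created.
        echo "install.R"
    elif [ "$dir/renv.lock" -nt "$dir/renv" ]; then
        # renv.lock was modified after the environment was created.
        echo "renv.lock"
    else
        echo "up-to-date"
    fi
}
```

Calling decide_source . in a project directory prints the file that a run would install from under these assumptions. The -nt comparison is supported by bash and dash but is not required by POSIX.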
In practice you will probably do either of these two:
- You arrive with install.R and it will create renv.lock and renv. You can then take the renv.lock and use it to share an environment with your friend. Maybe you modify install.R later and refresh renv.lock and renv.
- Or you arrive with renv.lock that you got from somebody and it will create renv.
Running the container creates the following files and directories in the same place where you run the container (but you can configure some of them if you want them somewhere else):
- renv - holding the environment
- renv.lock - created or updated if you installed from install.R
- .Rprofile - created or modified; renv adds the line source("renv/activate.R")
- renv-cache - renv package cache; you can change its location by defining environment variable RENV_CACHE
- pak-cache - pak package cache; you can change its location by defining environment variable PAK_CACHE
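To see what a run has left behind, the list above can be checked with a short script. This is a minimal sketch with hypothetical helper names (report_artifacts, is_activated); it assumes the default locations, so caches moved with RENV_CACHE or PAK_CACHE will show up as missing.

```shell
# Report which of the files and directories listed above exist in directory $1.
report_artifacts() {
    for name in renv renv.lock .Rprofile renv-cache pak-cache; do
        if [ -e "$1/$name" ]; then
            echo "$name: present"
        else
            echo "$name: missing"
        fi
    done
}

# Succeed if .Rprofile in directory $1 contains the activation line added by renv.
is_activated() {
    grep -qF 'source("renv/activate.R")' "$1/.Rprofile" 2>/dev/null
}
```

Running report_artifacts . in the project directory prints one line per entry.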
Running a script for the first time may take a while since the container needs to set up the environment and download and install dependencies.
However, re-running the script takes no installation time, and a first run is also quick if the dependencies are already in the cache.
For historical reasons renv and pak are slightly different, but their developers are working on smoothing things out between the two.
You will notice the difference if you start from install.R and then try to restore from the generated renv.lock: the two will use different methods.
Relevant GitHub issues:
You have the option to turn off pak like this:
export USE_PAK=false
Pros and cons of turning it off:
- Advantage: Only one cache location and everything is nicely consistent. You could install from install.R, then even remove it and run from renv.lock, and it would all be consistent and not need to re-install anything.
- Disadvantage: First installation from install.R might take longer without pak.
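If you only want to try a single run without pak, the setting can be limited to one command instead of being exported for the whole shell session. This is a minimal sketch with a hypothetical wrapper run_without_pak; it assumes the container picks up USE_PAK from the environment of the calling process, just as it does after export.

```shell
# Run the given command with USE_PAK=false set for this one invocation only;
# the surrounding shell session is left unchanged.
run_without_pak() {
    env USE_PAK=false "$@"
}
```

For example, run_without_pak ./contain-R.sif R --quiet -e 'library(ggplot2)' would do the first installation without pak, while later plain invocations keep the default.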
You can change the location of the package caches:
export RENV_CACHE=/home/user/R/renv-cache
export PAK_CACHE=/home/user/R/pak-cache
On your own computer it will make sense to reuse the same cache(s) across all projects. This way, when installing dependencies, renv will first look whether you already have the package on your computer.
On a shared cluster it might make sense to have one common cache for your group/allocation since your research group might use similar dependencies in their work. This way you can save space and install time.
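Setting up a shared cache location could look like the following. This is a minimal sketch with a hypothetical helper use_shared_caches; creating the directories up front is an assumption made here for clarity, and the directory names simply mirror the defaults mentioned above.

```shell
# Create both cache directories under the base directory $1 and export
# RENV_CACHE and PAK_CACHE so that they point there.
use_shared_caches() {
    base="$1"
    mkdir -p "$base/renv-cache" "$base/pak-cache" || return 1
    export RENV_CACHE="$base/renv-cache"
    export PAK_CACHE="$base/pak-cache"
}
```

On a cluster you would call it with a path that all group members can write to, for example use_shared_caches /cluster/work/users/myself, and make sure that path is also bound into the container.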
- Maybe you need a different version of R than 4.3.0. I guess we should at some point have several containers for different versions? Or you build your own from the definition file.
- It could be good to let the user configure where renv itself should be located. Currently it is placed in the same folder where the container is run.
I have used these resources when writing/testing:
- https://rstudio.github.io/renv/
- https://rstudio.github.io/renv/articles/docker.html
- https://pak.r-lib.org/
- https://rstudio.github.io/packrat/ (deprecated)
- https://sites.google.com/nyu.edu/nyu-hpc/hpc-systems/greene/software/r-packages-with-renv
- https://raps-with-r.dev/repro_intro.html
- https://www.youtube.com/watch?v=N7z1K4FhVFE (stream recording on how to use renv)
- https://github.com/singularityhub/singularity-deploy