Scripts to perform analysis for the cvanmf manuscript. These are a mix of
Jupyter notebooks and R scripts.
The analysis environment is provided as a Docker container, which contains all the software needed to run the analysis. Broadly, the intention is that the analysis is run by starting the container with this repository mounted, so scripts can write results outside the container. While other container engines could be used, we provide instructions for Docker Engine.
Docker Engine is, at the time of writing, under the Apache License 2.0, which allows academic (and many other) uses. Note that Docker Desktop is not, and may require a license. You can run Docker Engine without Docker Desktop on Windows via WSL2, on macOS via Colima, or natively on Linux.
An image for the analysis has been uploaded to the GitHub Container Registry. The instructions given here are for Docker Engine. If you are using a non-amd64 architecture, you will instead need to build the image (see below).
```shell
docker pull ghcr.io/apduncan/cvanmf_analysis:submission
```
This will load the image ghcr.io/apduncan/cvanmf_analysis:submission.
You can check the image is available using `docker image list`:

```
(base) kam24goz@N140910:~$ docker image list
REPOSITORY                         TAG          IMAGE ID       CREATED        SIZE
[...]
ghcr.io/apduncan/cvanmf_analysis   submission   f24fc0a79480   21 hours ago   4.32GB
```
Clone this repository:

```shell
git clone https://github.com/apduncan/cvanmf_analysis.git
```
These scripts were developed using VS Code connected to a devcontainer, and can be run using that method following these steps:
- Install the devcontainers extension in VS Code
- Ensure you have a container engine installed (e.g. docker engine, Docker desktop, podman)
- Open your local copy of the repository in VS Code
- Use `Dev Containers: Reopen in Container` from the command palette (F1 / Ctrl+Shift+P).
Your clone of the repository will now be open and scripts and notebooks will run in the container.
If you don't want to use VS Code, you can run the container and mount the repository. From the root of the cloned repository:

```shell
docker run \
  -it \
  --mount type=bind,src="$(pwd)",dst=/cvanmf_analysis \
  -p 8888:8888 \
  --name cvanmf_analysis \
  --volume /etc/passwd:/etc/passwd:ro \
  --volume /etc/group:/etc/group:ro \
  --user $(id -u) \
  ghcr.io/apduncan/cvanmf_analysis:submission && \
docker attach cvanmf_analysis
```
The repository will be mounted as /cvanmf_analysis in the container,
so you should `cd /cvanmf_analysis`. This enters the container as your current
user. If you have issues running the container with these options, you
can remove the lines relating to /etc/passwd, /etc/group and --user, which will
instead run as root. However, this will cause result files to be written as
root, which can make them frustrating to work with later.
The analysis is divided into numbered subdirectories for each topic. Within
those are the analysis scripts, which are either R scripts, or Jupyter
notebooks.
Data is distributed in compressed format, and can be decompressed
as explained below.
Data used as input to the analysis will typically be in data/, and results
written to results/, with subdirectories for figures, tables and notebooks.
Each topic has its own Readme.md explaining any specifics.
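As a hypothetical illustration (the exact contents vary by topic), a topic directory might look like this:

```
02_enterosignatures/
├── Readme.md        # topic-specific notes
├── run.sh           # runs all analyses for this topic
├── data/            # input data (distributed as .tar.gz archives)
└── results/
    ├── figures/
    ├── tables/
    └── notebooks/   # executed notebooks exported as HTML
```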
Data is included in the repo in tar.gz format. You should first decompress
all of it. We provide a script to do this:

```shell
./extract_data.sh
```
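A minimal sketch of what the extraction script does (the actual extract_data.sh in the repository may differ): find every .tar.gz archive in the repository and extract each one next to its archive.

```shell
# Hypothetical sketch of extract_data.sh: decompress every .tar.gz
# archive under the given directory (default: current directory),
# extracting each one into the directory that contains it.
extract_all() {
  find "${1:-.}" -name '*.tar.gz' -print0 |
    while IFS= read -r -d '' archive; do
      tar -xzf "$archive" -C "$(dirname "$archive")"
    done
}
```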
The script `run_all_analysis.sh` will run all the included analysis scripts.
R scripts will be run using `Rscript`, and notebooks will be executed and
written as HTML using nbconvert.
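The dispatch logic can be sketched as follows. This is a hypothetical dry-run version (it echoes the commands rather than executing them, and `run_topic` is an illustrative name, not a function from the repository): R scripts go through `Rscript`, notebooks through nbconvert.

```shell
# Hypothetical dry-run sketch of how a topic's scripts are dispatched:
# print the command that would run each R script and notebook in a topic
# directory, rather than executing them.
run_topic() {
  local topic="$1"
  for script in "$topic"/*.R; do
    [ -e "$script" ] || continue
    echo Rscript "$script"
  done
  for nb in "$topic"/*.ipynb; do
    [ -e "$nb" ] || continue
    echo jupyter nbconvert --to html --execute "$nb"
  done
}
```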
There are some steps which take a long time to run. These are:

- 03_global_diversity: producing the PCoA, ~20 minutes

Some steps are not run automatically at all. These are:

- Rank selection benchmarking. See https://github.com/apduncan/cvanmf_benchmark
- Rank selection execution time benchmark. This is implemented in a small Nextflow pipeline in 01_rank_selection/time_benchmarks/, and will take several days to run.
- Random Forest training and evaluation
To avoid these, you can run each script individually or each notebook interactively as explained below.
Each subdirectory has a `run.sh` which will run all analyses for that topic.
You can also run individual scripts and notebooks as below.
If you are using VS Code, you can run or source each R script as normal through
the interface. However, in the interactive R terminal, you must set the working
directory to the script's directory, e.g.

```r
setwd("02_enterosignatures")
```

To run from the command line, use `Rscript script.R` from the script's directory.
If you are using VS Code, you should be able to open the Jupyter notebooks directly, and run cells as normal.
If not, you can start a Jupyter server in the container and access it in your browser. There are a few flags you must set to make it accessible from outside the container:

```shell
jupyter lab --allow-root --ip 0.0.0.0 --no-browser
```
Then you should be able to access http://localhost:8888/lab?token=providedtoken (substituting the token printed by the server) and interactively run the notebooks.
In some situations you might need to take some additional steps to run the analysis.
If you are unable to load the saved image, for instance because you are on a
different architecture (non-amd64), you will need to build it from the
Dockerfile.
This may produce different results, as the versions of the R packages are
not pinned in the build process. You should use the saved image if at all
possible.
To build with Docker, run the following from the project root:

```shell
docker build -t cvanmf_analysis:custom -f .devcontainer/Dockerfile .
```

You can change the tag (cvanmf_analysis:custom) if you want.
To use this in the devcontainer, change .devcontainer/devcontainer.json to contain:

```json
{
    "name": "cvanmf paper",
    "image": "cvanmf_analysis:custom",
    [... rest of the file ...]
}
```