cvanmf analysis

Scripts to perform analysis for the cvanmf manuscript. These are a mix of Jupyter notebooks and R scripts.

Setting Up Environment

The analysis environment is provided as a Docker container, which contains all the software needed to run the analysis. The intended workflow is to start the container with this repository mounted, so that scripts can write results outside the container. Other container engines could be used, but we only provide instructions for Docker Engine.

At the time of writing, Docker Engine is licensed under the Apache License 2.0, which allows academic (and many other) uses. Note that Docker Desktop is not, and may require a license. You can run Docker Engine without Docker Desktop on Windows via WSL2, on macOS via Colima, or on Linux.

Loading Image

An image for analysis has been uploaded to the GitHub Container Registry. Instructions given here are for Docker Engine. If you are using a non-amd64 architecture, you will need to build the image instead.

docker pull ghcr.io/apduncan/cvanmf_analysis:submission

This will load the image ghcr.io/apduncan/cvanmf_analysis:submission. You can check the image is available using docker image list:

$ docker image list
REPOSITORY                  TAG          IMAGE ID       CREATED        SIZE
[...]
apduncan/cvanmf_analysis    submission   f24fc0a79480   21 hours ago   4.32GB

Clone This Repo

Clone this repository:

git clone https://github.com/apduncan/cvanmf_analysis.git

Option 1: Devcontainer & VS Code

These scripts were developed using VS Code connected to a devcontainer, and can be run using that method following these steps:

  1. Install the Dev Containers extension in VS Code
  2. Ensure you have a container engine installed (e.g. Docker Engine, Docker Desktop, Podman)
  3. Open your local copy of the repository in VS Code
  4. Use Dev Containers: Reopen in Container from the command palette (F1 or Ctrl+Shift+P).

Your clone of the repository will now be open and scripts and notebooks will run in the container.

Option 2: Docker & Mount Repository

If you don't want to use VS Code, you can run the container and mount the repository. From the root of the cloned repository, run:

docker run \
    -it \
    --mount type=bind,src="$(pwd)",dst=/cvanmf_analysis \
    -p 8888:8888 \
    --name cvanmf_analysis \
    --volume /etc/passwd:/etc/passwd:ro \
    --volume /etc/group:/etc/group:ro \
    --user $(id -u) \
    ghcr.io/apduncan/cvanmf_analysis:submission

The repository will be mounted as /cvanmf_analysis in the container, so you should cd /cvanmf_analysis. This enters the container as your current user. If you have problems running the container this way, you can remove the lines relating to /etc/passwd, /etc/group and --user, which will run the container as root instead. However, result files will then be written as root, which can make them frustrating to work with later.
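If the container has been run as root, it can help to see which files ended up owned by another user. The following is a minimal sketch run on the host; files_not_mine is a hypothetical helper name and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): list everything under a
# directory that is not owned by the current user.
files_not_mine() {
    local root="${1:-.}"
    find "$root" ! -user "$(id -u)" -print
}

# Example: files_not_mine results
# Anything listed can be handed back to your user from the host with:
#   sudo chown -R "$(id -u):$(id -g)" results
```

An empty listing means every file under the directory already belongs to you.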

Running Analysis

Note on Structure

The analysis is divided into numbered subdirectories, one for each topic. Within those are the analysis scripts, which are either R scripts or Jupyter notebooks. Data is distributed in compressed format, and can be decompressed as explained below. Data used as input to the analysis will typically be in data/, and results are written to results/, with subdirectories for figures, tables and notebooks. Each topic has its own Readme.md explaining any specifics.

Extracting Data

Data is included in the repo in tar.gz format. You should first decompress all of this. We provide a script to do this:

./extract_data.sh
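If the script cannot be run for some reason, the archives can also be unpacked by hand. The sketch below is not the contents of extract_data.sh; it assumes each archive should be unpacked into the directory that holds it, which should be checked against the script itself. extract_archives is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: extract_archives is a hypothetical helper, not extract_data.sh.
# It unpacks every .tar.gz found under a root directory, each one into the
# directory that contains it (an assumption about the intended layout).
extract_archives() {
    local root="${1:-.}"
    find "$root" -type f -name '*.tar.gz' -print0 |
    while IFS= read -r -d '' archive; do
        tar -xzf "$archive" -C "$(dirname "$archive")"
        echo "extracted: $archive"
    done
}

# Usage, from the root of the repository:
#   extract_archives .
```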

Run All Analysis

The script run_all_analysis.sh will run all the included analysis scripts. R scripts will be run using Rscript and notebooks will be run and written as HTML using nbconvert.
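To illustrate the two kinds of command involved, here is a minimal dry-run sketch that prints what would be run for each file in the current directory. It is not the contents of run_all_analysis.sh: command_for is a hypothetical helper, and the real script may pass further options (for example an output directory for the HTML).

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the contents of run_all_analysis.sh: print the
# command that would be used for one analysis file, chosen by extension.
command_for() {
    local file="$1"
    case "$file" in
        *.R)     echo "Rscript $file" ;;
        *.ipynb) echo "jupyter nbconvert --to html --execute $file" ;;
        *)       return 1 ;;   # not an analysis file
    esac
}

# Dry run over the current directory: show what would be executed.
for f in *.R *.ipynb; do
    [ -e "$f" ] || continue    # skip glob patterns that matched nothing
    command_for "$f"
done
```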

There are some steps which take a long time to run. These are:

  • 03_global_diversity - producing PCoA, ~20 minutes

Some steps are not run automatically at all. These are:

  • Rank selection benchmarking. See https://github.com/apduncan/cvanmf_benchmark
  • Rank selection execution time benchmark. This is implemented as a small Nextflow pipeline in 01_rank_selection/time_benchmarks/ and will take several days to run.
  • Random Forest training and evaluation

To avoid these, you can run each script individually or each notebook interactively as explained below.

Run Individual Analyses

Each subdirectory has a run.sh which will run all analyses for that topic. You can also run individual scripts and notebooks as below.

R Scripts

If you are using VS Code, you can run or source each R script as normal through the interface. However, in the interactive R terminal, you must first set the working directory to the script's directory, e.g. setwd("02_enterosignatures").

To run from command line, use Rscript script.R from the script's directory.
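Because the scripts expect to be run from their own directory, a small wrapper can save repeated cd commands. This is a sketch only; run_r_script is a hypothetical helper name, and the path in the usage comment is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one R script from its own directory. The subshell
# (parentheses) keeps the caller's working directory unchanged.
run_r_script() {
    local script="$1"
    ( cd "$(dirname "$script")" && Rscript "$(basename "$script")" )
}

# Usage, from the root of the repository (illustrative path):
#   run_r_script 02_enterosignatures/script.R
```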

Jupyter Notebooks

If you are using VS Code, you should be able to open the Jupyter notebooks directly, and run cells as normal.

If not, you can start a Jupyter server in the container and access that using your browser. There are a few flags you must set to make it accessible outside the container:

jupyter lab --allow-root --ip 0.0.0.0 --no-browser

Then you should be able to access http://localhost:8888/lab?token=providedtoken and interactively run the notebooks.
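The token is printed as part of a URL when the server starts. That URL may show a container hostname or 127.0.0.1 rather than localhost, so the sketch below rebuilds the address to open on the host. Both function names are hypothetical, and port 8888 is assumed from the -p 8888:8888 mapping used when starting the container.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the token out of a URL printed by the Jupyter
# server at startup.
token_from_url() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]token=\([A-Za-z0-9]*\).*/\1/p'
}

# Build the address to open on the host, assuming port 8888 is published.
host_url() {
    echo "http://localhost:8888/lab?token=$(token_from_url "$1")"
}

# Usage: paste the URL from the server output as the argument, e.g.
#   host_url 'http://127.0.0.1:8888/lab?token=providedtoken'
```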

Potential Steps

In some situations you might need to take some additional steps to run the analysis.

Building the Image

If you are unable to load the saved image, for instance because you are on a different architecture (non-amd64), you will need to build it from the Dockerfile. This may produce different results, as the versions of the R packages are not pinned in the build process. You should use the saved image if at all possible.
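If you are unsure which architecture your host uses, a quick check along the following lines may help. It is a sketch: docker_arch is a hypothetical helper that maps the common machine names reported by uname -m onto the names Docker uses, and it does not cover every platform.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the kernel's machine name onto the architecture
# names Docker uses. Unrecognised names are passed through unchanged.
docker_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)             echo "$1" ;;
    esac
}

arch="$(docker_arch "$(uname -m)")"
if [ "$arch" = "amd64" ]; then
    echo "Host is amd64: the published image should run as-is."
else
    echo "Host is $arch: build the image locally instead of pulling it."
fi
```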

To build with Docker, in the project root, run

docker build -t cvanmf_analysis:custom -f .devcontainer/Dockerfile .

You can change the name (cvanmf_analysis:custom) if you want. To use this image in the devcontainer, edit .devcontainer/devcontainer.json so that it contains:

{
    "name": "cvanmf paper",
    "image": "cvanmf_analysis:custom",
    [... rest of the file ...]
}
