anpan_tutorial
This online version of the tutorial is provided for convenience only. It is recommended instead to install the package with build_vignettes = TRUE, then run vignette("anpan_tutorial", package = "anpan") in R to view a guaranteed up-to-date version of the tutorial with better formatting.
It's difficult to find associations between microbial strains and host health outcomes because strains are defined at fine resolution and rarely recur across individuals. This package, anpan, aims to make inferring those relationships a bit easier by providing an interface to our strain analysis functionality. This functionality covers three main points:
- Modeling the association between outcome variables and microbial gene presence while accounting for covariates - Per-species gene models
  - Including adaptive sample filtering of per-bug microbial gene profiles to identify and discard samples in which the bug is poorly covered.
- Modeling the association between outcome variables and the phylogeny of strains within a given microbial species - Phylogenetic modeling
- Modeling the difference in microbial pathways between experimental groups while controlling for species abundance - Pathway random effects model
Each bullet above contains a link to the relevant section of the tutorial. The libraries needed to run them are given below. There is also an Advanced topics section with some additional information on a handful of more complicated diagnostics/methods/techniques.
The goal of anpan is to consolidate statistical methods for strain analysis. This includes automated filtering of metagenomic functional profiles, testing genetic elements for association with outcomes, phylogenetic association testing, and pathway-level random effects models.
anpan depends on R ≥4.1.0 and the following R packages, most of which are available through CRAN (the exception being cmdstanr):
install.packages(c("ape",
"data.table",
"dplyr",
"fastglm",
"furrr",
"ggdendro",
"ggnewscale",
"ggplot2",
"loo",
"patchwork",
"phylogram",
"posterior",
"progressr",
"purrr",
"R.utils",
"remotes",
"stringr",
"tibble",
"tidyselect")) # add Ncpus = 4 to go faster
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
If the cmdstanr installation doesn't work you can find more detailed instructions at this link.
Once you've installed cmdstanr, you will need to use it to install CmdStan itself:
library(cmdstanr)
check_cmdstan_toolchain()
install_cmdstan(cores = 2)
On some servers it may also be necessary to install / load the GNU MPFR library prior to installing CmdStan.
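To confirm that the CmdStan installation succeeded, a quick check along these lines can help. This is a minimal sketch rather than part of the original tutorial; it assumes cmdstanr is installed and relies on its cmdstan_path() and cmdstan_version() helpers.

```r
library(cmdstanr)

# Both calls raise an error if cmdstanr cannot locate a CmdStan installation.
cmdstan_path()     # directory where CmdStan was installed
cmdstan_version()  # version string of that installation
```

If either call errors, re-running check_cmdstan_toolchain() is a reasonable first step before reinstalling.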
Once you have the dependencies, you can install anpan from GitHub with:
remotes::install_github("biobakery/anpan")
We'll start by loading the library in this code chunk:
library(anpan)
#> This is anpan version 0.3.0
#> - Get help: Visit the biobakery help forum at https://forum.biobakery.org/
#> - Parallelize: Before calling anpan, run future::plan() in a way that's appropriate for your system.
#> - Show progress: Before calling anpan, run library(progressr); handlers(global=TRUE)
The startup message points out that you can easily parallelize / show progress bars for most long-running computations in anpan by setting plan() and handlers() after loading the furrr and progressr packages. For most users plan(multisession, workers = 4) and handlers(global = TRUE) are probably close to what you want, though both the parallelization strategy and progress reporting are highly customizable.
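Written out as a code chunk, the setup described above might look like the following. This is a sketch under a few assumptions: the furrr and progressr packages are installed, and the worker count is capped at the number of available cores (the cap and the n_workers name are additions for illustration, not something the tutorial prescribes).

```r
library(furrr)      # attaches the future package, which provides plan()
library(progressr)

# Use up to 4 background R sessions, capped at the cores available here.
# The cap is an assumption added for this sketch; the text suggests workers = 4.
n_workers <- min(4, future::availableCores())
plan(multisession, workers = n_workers)

# Report progress for all subsequent progressr-aware calls.
handlers(global = TRUE)
```

handlers(global = TRUE) is meant to be called at the top level of a session or script rather than inside a function. Calling plan(sequential) afterwards switches back to non-parallel evaluation.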
A couple points to know about anpan function names:
- All of the modeling functions in anpan are prefixed with anpan_, so if you type that into the RStudio console the auto-complete prompt should show a list of all the modeling functions. This can help you find the function you need quickly without having to dig through the documentation.
- Plotting functions are prefixed with plot_.
- Several modeling functions have a _batch() version, which applies a given model to each bug present in a user-specified input directory.
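Outside of RStudio's auto-complete, the same naming conventions can be used to list functions programmatically. The sketch below uses a hypothetical helper, list_exports(), which is not part of anpan; it only relies on base R's getNamespaceExports() and assumes anpan is installed for the last three calls.

```r
# Hypothetical helper (not part of anpan): list a package's exported
# names that match a regular expression.
list_exports <- function(pkg, pattern) {
  sort(grep(pattern, getNamespaceExports(pkg), value = TRUE))
}

# Only query anpan if it is actually installed.
if (requireNamespace("anpan", quietly = TRUE)) {
  print(list_exports("anpan", "^anpan_"))  # modeling functions
  print(list_exports("anpan", "^plot_"))   # plotting functions
  print(list_exports("anpan", "_batch$"))  # batch versions
}
```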
This code chunk loads some other packages we'll use:
library(data.table)
library(ggplot2)
library(tibble)
library(dplyr)
library(ape)
Now that you have all the required libraries loaded, you can progress to any of the model sections:
Working with this package requires some background statistical knowledge, specifically on HMC and probability models. If you'd like some background on this material, I recommend Richard McElreath's fantastic Statistical Rethinking course.