
anpan_tutorial

Andrew Ghazi edited this page Jan 27, 2023 · 21 revisions

Overview


This online version of the tutorial is provided for convenience only. We recommend installing the package with build_vignettes = TRUE and then running vignette("anpan_tutorial", package = "anpan") in R to view a guaranteed up-to-date version of the tutorial with better formatting.

Introduction

It's difficult to find associations between microbial strains and host health outcomes because strains are defined at a fine resolution and rarely recur across individuals. This package, anpan, aims to make inferring those relationships a bit easier by providing an interface to our strain analysis functionality. This functionality covers three main points:

  • Modeling the association between outcome variables and microbial gene presence while accounting for covariates - Per-species gene models
    • Including adaptive sample filtering of per-bug microbial gene profiles to identify and discard samples in which the bug is poorly covered.
  • Modeling the association between outcome variables and the phylogeny of strains within a given microbial species - Phylogenetic modeling
  • Modeling the difference in microbial pathways between experimental groups while controlling for species abundance - Pathway random effects model

Each bullet above contains a link to the relevant section of the tutorial. The libraries needed to run them are given below. There is also an Advanced topics section with some additional information on a handful of more complicated diagnostics/methods/techniques.

Install anpan

The goal of anpan is to consolidate statistical methods for strain analysis. This includes automated filtering of metagenomic functional profiles, testing genetic elements for association with outcomes, phylogenetic association testing, and pathway-level random effects models.

Dependencies

anpan depends on R ≥4.1.0 and the following R packages, most of which are available through CRAN (the exception being cmdstanr):

install.packages(c("ape", 
                   "data.table",
                   "dplyr", 
                   "fastglm",
                   "furrr", 
                   "ggdendro",
                   "ggnewscale",
                   "ggplot2",
                   "loo",
                   "patchwork",
                   "phylogram",
                   "posterior",
                   "progressr",
                   "purrr", 
                   "R.utils",
                   "remotes",
                   "stringr",
                   "tibble",
                   "tidyselect")) # add Ncpus = 4 to go faster

install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))

If the cmdstanr installation doesn't work you can find more detailed instructions at this link.
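Before moving on, it can help to confirm that the requirements above are met. The following is a minimal sketch using only base R; the variable names (deps, r_ok, missing_deps) are illustrative and not part of anpan, and the package list is simply copied from the install.packages() calls above.

```r
# Package names copied from the install.packages() calls above
deps <- c("ape", "data.table", "dplyr", "fastglm", "furrr", "ggdendro",
          "ggnewscale", "ggplot2", "loo", "patchwork", "phylogram",
          "posterior", "progressr", "purrr", "R.utils", "remotes",
          "stringr", "tibble", "tidyselect", "cmdstanr")

# TRUE if the running R session meets the stated minimum version
r_ok <- getRversion() >= "4.1.0"

# Dependencies that are not found in any of the current library paths
missing_deps <- setdiff(deps, rownames(installed.packages()))

r_ok
missing_deps
```

If missing_deps is empty (character(0)) and r_ok is TRUE, the R-level dependencies are in place.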

Once you've installed cmdstanr, you will need to use it to install CmdStan itself:

library(cmdstanr)
check_cmdstan_toolchain()
install_cmdstan(cores = 2)

On some servers it may also be necessary to install / load the GNU MPFR library prior to installing CmdStan.
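To check that CmdStan was installed where cmdstanr can find it, something like the following should work. This is a small sketch that assumes the install_cmdstan() call above completed without error; both functions are part of cmdstanr.

```r
library(cmdstanr)

# Directory of the CmdStan installation that cmdstanr will use
cmdstan_path()

# Version string of that installation
cmdstan_version()
```

If either call raises an error, CmdStan was probably not installed successfully and the toolchain check above is the first thing to revisit.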

Installation

Once you have the dependencies, you can install anpan from GitHub with:

remotes::install_github("biobakery/anpan")
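If you would rather read the vignette version of this tutorial recommended in the overview, the install call can be adjusted as sketched below. This assumes the vignette builds on your system, which may require additional tools (for example, a working pandoc); that requirement is an assumption and not something stated here.

```r
# Install anpan and build its vignettes at the same time
remotes::install_github("biobakery/anpan", build_vignettes = TRUE)

# Open the tutorial vignette in the R help viewer
vignette("anpan_tutorial", package = "anpan")
```

Building vignettes makes the installation slower, so the plain install_github() call above is enough if you only plan to follow this online version.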

Libraries

We'll start by loading the library in this code chunk:

library(anpan)

#> This is anpan version 0.3.0
#> - Get help: Visit the biobakery help forum at https://forum.biobakery.org/
#> - Parallelize: Before calling anpan, run future::plan() in a way that's appropriate for your system.
#> - Show progress: Before calling anpan, run library(progressr); handlers(global=TRUE)

The startup message points out that you can easily parallelize / show progress bars for most long-running computations in anpan by setting plan() and handlers() after loading the furrr and progressr packages. For most users plan(multisession, workers = 4) and handlers(global = TRUE) are probably close to what you want, though both the parallelization strategy and progress reporting are highly customizable.
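A minimal sketch of that setup is shown below. The choice of at most 4 workers follows the suggestion above; capping it at the number of available cores, and the variable name n_workers, are additions for illustration.

```r
library(furrr)      # attaches future, which provides plan()
library(progressr)  # provides handlers()

# Use up to 4 background R sessions, but no more than the cores available
n_workers <- min(4, availableCores())
plan(multisession, workers = n_workers)

# Show progress bars for subsequent computations that report progress
handlers(global = TRUE)
```

Calling plan(sequential) afterwards returns to ordinary, non-parallel execution.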

A few points to know about anpan function names:

  • All of the modeling functions in anpan are prefixed with anpan_, so if you type that into the RStudio console the auto-complete prompt should show a list of all the modeling functions. This can help you find the function you need quickly without having to dig through the documentation.
  • Plotting functions are prefixed with plot_.
  • Several modeling functions have a _batch() version, which applies a given model to each bug present in a user-specified input directory.
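Outside of RStudio, the same naming conventions can be used to list the functions directly. The sketch below relies only on base R's ls() and grep(); the variable names are illustrative, and the exact set of functions returned will depend on your installed version of anpan.

```r
library(anpan)

# Exported modeling functions (names starting with "anpan_")
modeling_functions <- ls("package:anpan", pattern = "^anpan_")

# Exported plotting functions (names starting with "plot_")
plotting_functions <- ls("package:anpan", pattern = "^plot_")

# Modeling functions that have a batch version (names ending in "_batch")
batch_functions <- grep("_batch$", modeling_functions, value = TRUE)

modeling_functions
```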

This code chunk loads some other packages we'll use:

library(data.table)
library(ggplot2)
library(tibble)
library(dplyr)
library(ape)

Now that you have all the required libraries loaded, you can proceed to any of the model sections: per-species gene models, phylogenetic modeling, or the pathway random effects model.

Working with this package requires some background statistical knowledge, specifically on Hamiltonian Monte Carlo (HMC) and probability models. If you'd like some background on this material, I recommend Richard McElreath's fantastic Statistical Rethinking course.
