Skip to content

ANPAN_StrainPhlAn

kelsthom13 edited this page Jul 27, 2023 · 7 revisions

Make sure you have loaded the libraries from the main anpan_tutorial page!

ANPAN's Phylogenetic Model with StrainPhlAn trees

PGLMMs

Phylogenetic generalized linear mixed models (PGLMMs) are probabilistic models that account for phylogenetic structure.


Table of Contents


For a PGLMM with a linear regression as the base model, the model structure is as follows:

$$ y = X \beta + (1|\text{leaf}) + \epsilon $$

$$(1|\text{leaf}) \sim \text{MVNormal}(0, σ_p^2\Omega)$$

$$\epsilon \sim \text{Normal}(0, σ_R^2)$$

The outcome $y$ is modeled as a function of some familiar terms: covariates $X$, coefficients $\beta$, and residual noise $\epsilon$. The key addition is the phylogenetic term $(1|\text{leaf})$, which contains a "random effect" for each observation in $y$. Unlike typical random effects (which are usually independent between different levels of the random effect variable), the values in the phylogenetic term follow a pre-specified correlation structure $\Omega$, which is derived from a tree. The variability of the phylogenetic term is scaled by the "phylogenetic noise" parameter $\sigma_{phylo}$.

To understand how a tree implies a correlation structure, consider the plot below showing a phylogenetic tree and the lower triangle of the correlation matrix it implies. You can see that the clade on the left (with low inter-leaf distances between its members) forms a bright sub-block of high correlation in the matrix. The tiny clade on the far right has high inter-leaf distance between its members and the rest of the tree, so its members generally have low correlation with the rest of the leaves, exhibited by the dark band on the right of the triangle. Note that the matrix is derived solely from the structure of the tree (the black lines) -- it is not influenced by the colored outcome dots. The correlation matrix numerically quantifies the "higher inter-leaf distance → lower correlation" principle.

We can see in the synthetic example above that the tree clearly does impact the outcome: the two well-separated clades clearly show very different outcome values. A PGLMM evaluated on this data would easily detect such a blazing signal, so in the remainder of this section, we'll simulate a tree with an outcome that's realistically noisier and less visually obvious. After that, we'll fit the PGLMM with anpan_pglmm() and then examine the results.

Input data

Note: If you are running this tutorial as part of a Huttenhower lab run course on the VM, the files do not need to be downloaded.

Only use the next few lines of code if you are running this tutorial in the guacamole VMs.

#Move to the directory for this tutorial
mkdir ~/Tutorials/anpan
cd ~/Tutorials/anpan

# Create input and output directories and move files from the StrainPhlAn tutorial into the input
mkdir input
mkdir output
cp ../strainphlan4/pre-baked/output/RAxML_bestTree.t__SGB4933_group.StrainPhlAn4.tre ./input
cp ../strainphlan4/metadata.txt ./input

#Launch R
R

Phylogenetic tree

For this tutorial, we will be using the output from our StrainPhlAn 4 Tutorial. StrainPhlAn is a tool for reconstructing the dominant clade of a species within a sample. Briefly, StrainPhlAn takes all the sequences that mapped to the markers for a species in MetaPhlAn and assesses the reads for well-covered, consistent single nucleotide level substitutions in the samples sequence compared to the reference marker. It then builds a phylogenetic tree comparing the dominant species-level clade from one sample to the entire dataset's sample-specific species, building a within-species phylogenetic tree.

For the StrainPhlAn tutorial, a tree was constructed from SGB4933_group which falls into the definition for the species Eubacterium rectale. The trees we created and visualized indicated clear clustering by geographic region of metagenome collection. Through the phylogenetic model within ANPAN we are going to test if that clustering is significant. You can download the tree file here.

Read the tree into the ecosystem using the ape library.

#Load the needed libraries
library(data.table)
library(ggplot2)
library(tibble)
library(dplyr)
library(ape)
library(anpan)

#read in the tree generated by StrainPhlAn - we usually work with the `*_bestTree_* file. 
tree = read.tree("./input/RAxML_bestTree.t__SGB4933_group.StrainPhlAn4.tre")

Metadata

The samples that were used in StrainPhlAn were sourced from several studies from different places across the globe. Below is the metadata table from the tutorial. It can also be downloaded from here.

Sample Study Country
CMD64776337ST-21-0 ZellerG_2014 DEU
G78505 VatanenT_2016 RUS
G88884 SchirmerM_2016 NLD
G88970 SchirmerM_2016 NLD
G89027 SchirmerM_2016 NLD
H2M514903 LiJ_2017 CHN
H3M518116 LiJ_2017 CHN
HD-1 QinN_2014 CHN
HD-5 QinN_2014 CHN
HV-6 QinN_2014 CHN
LD-48 QinN_2014 CHN
M1.42.ST BritoIL_2016 FJI
M2.48.ST BritoIL_2016 FJI
M2.58.ST BritoIL_2016 FJI
N021 WenC_2017 CHN
RA023 WenC_2017 CHN
S353 KarlssonFH_2013 SWE
SID530054 FengQ_2015 AUT
SRS011302 HMP_2012 USA
SZAXPI003417-4 YuJ_2015 CHN
T2D-025 QinJ_2012 CHN
T2D-063 QinJ_2012 CHN
T2D-105 QinJ_2012 CHN
W1.27.ST BritoIL_2016 FJI
YSZC12003_36795 XieH_2016 GBR

For ANPAN to run correctly we will need to have a binary outcome variable, which does not exist in this metadata yet, here we are going to break the metadata into the metagenomes sourced from the Asian continent (A) vs. all others (other). We are doing this to test if there is a geographic signal within this clade, which has already been well researched as shown in this manuscript, previously. Typically the outcome from this model will be something like case/control.

Next, we will read in and format the metadata to work with ANPAN.

#read in the metadata 
meta = read.csv("./input/metadata.txt", header = T, sep = "\t")

#Add a column for our binary outcome
meta$continent = ifelse(meta$Country == "CHN" | meta$Country == "FJI", "A", "other")
meta$continent = factor(meta$continent, levels = c("A", "other"))
#We do not want to include the referenced in the analysis since our metadata does not include where those isolates where collected from. 
meta = subset(meta, Country != "Ref")



#check that your tip.labels are in the metadata
intersect(meta$SampleID, tree$tip.label)
names(meta) = c("sample_id", "country", "continent")
#Our sample labels do match our metadata! 

Fit the PGLMM

The code chunk below shows how to use anpan_pglmm() to fit a PGLMM that examines these data for phylogenetic patterns while adjusting for (possible) covariates, here we do not have any covariates to provide but you could use the covariates parameter. It will take a couple of minutes to run. By default anpan_pglmm regularizes the noise ratio sigma_phylo / sigma_resid with a Gamma(1,2) prior and runs a leave-one-out model comparison against a "base" model that doesn't have the phylogenetic component. The default family = "gaussian" argument means the residual error is normally distributed. For this example, we switch to family = "binomial" since we have a binary outcome and thus run a phylogenetic logistic regression.

#Running ANPAN 
##Required: 
###column called "sample_id" with matching labels to the tips. We set this up in the metadata section. If you have strings you want to remove the tip labels (e.g. the _bowtie2 that is added by the bioBakery workflows) you can add that string to the parameter `trim_pattern =` (`trim_patter = "_bowtie2`).
###the `reg_noise` parameter has to be set to FALSE (F) if you are working with a binomial model. 
###For a binary outcome the outcome variable has to be a factor with two levels in your metadata. (e.g. use the `factor` function in R to set that column and the `levels` for the variables)

results_anpan = anpan_pglmm(meta, tree, out_dir = "./test_anpan", outcome = "continent",
            family = 'binomial',
            bug_name        = "E.recatle",
            reg_noise       = F,
            loo_comparison  = TRUE,
            run_diagnostics = FALSE,
            refresh         = 500,
            show_plot_tree  = F,
            show_post       = F)

This model finds that there is a significant improvement in predicting the outcome after the addition of a phylogenetic signal, given the sample size the improvement is modest (less than 4). Thus, within this species, there are distinct sub-clades or strains of E. rectale by geographic collection site, but the model can't fully differentiate between the strains by location.

See the tutorial with synthetic data for a more thorough explanation of how the model is working/diagnostics.

The results should look like this - where you can visually observe where the posterior effect is dipping toward one of the binary inputs max.

Clone this wiki locally