-
Notifications
You must be signed in to change notification settings - Fork 72
maaslin1
MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data. MaAsLin performs boosted additive general linear models between one group of data (metadata/the predictors) and another group (in our case relative taxonomic abundances/the response).
MaAsLin is available as a Galaxy module, a Docker image, and included in bioBakery (VM and cloud). For additional information, refer to the manuscript: Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower C. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012 Apr 16;13(9):R79..
We provide support for MaAsLin users. Please join our Google group designated specifically for MaAsLin users. Feel free to post any questions on the google group by posting directly or emailing maaslin-users@googlegroups.com.
Table of contents
The following figure shows the workflow for MaAsLin.
MaAsLin can be run from the command line, as an R module, or through Galaxy. Follow these steps to install MaAsLin locally. Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install MaAsLin because the tool and its dependencies are already installed. For detailed information on how to run MaAsLin in Galaxy see the Galaxy section at the end of this tutorial.
MaAsLin requires the following R packages: agricolae, gam (version 1.14), gamlss, gbm, glmnet, inlinedocs, logging, MASS, nlme (version 3.1-127), optparse, outliers, penalized, pscl, robustbase. If you are installing MaAsLin with Docker, these dependencies will be automatically installed. If you are installing MaAsLin from source, install these packages before installing MaAsLin.
To install with Docker: $ docker run -it biobakery/maaslin bash
To install from source, first download the latest version of
MaAsLin.
Next install MaAsLin, replacing X.Y.Z with the version number:
$ R CMD INSTALL Maaslin_X.Y.Z.tar.gz
Once you have obtained the microbial abundance tables through MetaPhlAn2 (See MetaPhlAn2 tutorial for details) or other tools, MaAsLin allows you to determine associations between the microbial abundances (or functions) and the metadata.
MaAsLin requires a single input file that contains the metadata and the microbial abundances. Most likely you will have these as separate files. A utility script is included in MaAsLin to merge these into a single file.
The metadata file contains all of the clinical metadata you would like to analyze to determine if there exist any significant associations with any of the microbial abundances (or functions). This file should be formatted like the example below with the sample name as the first column and additional columns containing the metadata for each of the samples. This file is tab-delimited. Download this file from the following link: HMP.metadata.tsv .
sid RANDSID GENDER STSite
SRS043001 550534656 female Stool
SRS017127 159551223 male Buccal_mucosa
SRS021473 158479027 male Buccal_mucosa
Bash command to download the file:
$ curl -O https://github.com/biobakery/maaslin/tree/master/inst/extdata/HMP.metadata.tsv
The microbial abundance (or function) file contains values normalized to relative abundances. It includes data for all of the samples that are described in the metadata file. This file is tab-delimited with the first row the sample name and rows that follow include abundances. The file should be formatted like the example below. Download this file from the following link: HMP.abundances.tsv.
sid SRS043001 SRS017127 SRS021473
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Porphyromonadaceae 2.57633 0.0 0.0
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus 0.09916 0.0 0.00175
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae 0.0 0.07358 0.71281
k__Bacteria|p__Proteobacteria|c__Epsilonproteobacteria 0.0 0.0049 1.10502
Bash command to download the file:
$ curl -O https://github.com/biobakery/maaslin/tree/master/inst/extdata/HMP.abundances.tsv
Follow the instructions below to attach the metadata to the microbial abundance table. This section is optional if you are starting with a single input file of merged metadata and abundance data (relative abundance in the range of 0 to 1 is required). Or if you are using a separate method to attach the metadata, you may skip to the next section (All that is necessary is that the metadata needs to be listed first in the file (before the microbial abundance table), in the format shown by the sample input data). Please ensure that the format of your input files follows that of the demo files above.
Run the following command to create the merged file:
$ merge_metadata.py HMP.metadata.tsv -l < HMP.abundances.tsv > HMP.merged.metadata.abundances.species.renormed.pcl
The resulting file is
HMP.merged.metadata.abundances.species.renormed.pcl
. By default this
MaAsLin script will normalize the values to relative abundances that sum
to 1 for each sample. The abundances are required to be in the range of
0 to 1 in order to apply the arcsine square root transformation
(default). If you are using a different method to merge your data and
metadata, make sure your abundance data is normalized to proportions or
relative abundances. Adding the option -l
only the abundances at a
species-level will be output. You may now use this file as an input for
MaAsLin.
This file is of PCL format. It is the expected input file format for MaAsLin. This file format has the following requirements.
- Rows represent metadata and features (bugs), columns represent samples.
- The first row by default should be the sample ids.
- Metadata rows should be next.
- Lastly, rows containing features (bugs) measurements (like abundance) should be after metadata rows.
- The first column should contain the ID describing the column. For metadata, this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
- By default, the file is expected to be TAB delimited.
- If a consensus lineage or hierarchy of taxonomy is contained in the feature name, the default delimiter between clades is the pipe ("|").
The read config file determines which rows/columns to process without
modifying the input metadata-merged-microbial abundance table. A sample
.read.config
file
(HMP.read.config)
is shown below:
Matrix: Metadata
Read_PCL_Rows: GENDER,STSite
Matrix: Abundance
Read_PCL_Rows: k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_viscosus-
Bash command to download the file:
$ curl -O https://github.com/biobakery/maaslin/tree/master/inst/extdata/HMP.read.config
The text above dictates that the Metadata matrix includes the rows GENDER and STSite, while the Abundance matrix starts with the row the clade appears in the row (as the row name). For more examples, please refer to the MaAsLin documentation.
Once you have the metadata-merged-microbial abundance table, and the .read.config file (see the samples to ensure the format), you are ready to run MaAsLin. Use the demo input files provided in the prior sections to generate a demo output set.
$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output -i HMP.read.config
The above command will create a directory: output
, which will contain
the results.
- Looking at the output analysis files there is a pdf of plots generated for the STSite but not the Gender metadata. What do you think this indicates? What file would you look at to confirm your hypothesis? If you need a hint, read the section of the tutorial about output files.
The first example shows how to run a data set with one sample per subject. If you were to have multiple samples per subject, you would want to run a mixed or random effects model, typically with the subject as random effects, although multiple random effects can also be specified.
Looking at the metadata and the merged PCL file, notice there is one
additional metadata row that is not used in this run named RANDSID
.
This row indicates how the sample ids relate to each other with each
subject having more than one sample. Modify the read config file to read
in this metadata row. The new read config file will now look like the
example below.:
Matrix: Metadata
Read_PCL_Rows: -STSite
Matrix: Abundance
Read_PCL_Rows: k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_viscosus-
- There is another way the read config file could be written to also include this metadata row. What does that read config file look like?
Now lets rerun with this new read config file and add an option to run with this new row as a random co-variate.
$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output_random -i HMP.read.config --random RANDSID
The above command will create a directory: output_random
, which will
contain the results. These results are different than the prior run.
- How do these results differ from the prior results? Why do you think they differ?
Each MaAsLin run will generate multiple output files. The output files
in the QC subfolder are all generated by the initial quality control
steps that filter the data to remove clades and metadata that are
sparse, remove outliers, and arcsine transforms the proportional
abundance data. The file QC/*cleaned.tsv
will contain the original
input file after it has been run through all of the QC steps. The main
output files of interest are the analysis files and the log file.
Each metadata row will have at most one txt output file. If significant associations were found for at least one taxon, a pdf of figures will also be generated.
Running the demo input files will generate a pair of analysis files for the STSite metadata. The txt output file will contain all of the significant associations (taxon-metadata pairs) along with the corresponding regression coefficient, number of observations, number of non-zero observations, P-value, and Q-value (FDR-adjusted P-value using Benjamini–Hochberg (BH) as default). The first few lines of the demo output file are included below (with the full taxonomy reduced to just the species to save space). Download the complete file from the following link, HMP-STSite.txt.
Variable Feature Value Coefficient N N not 0 P-value Q-value
STSite s__Streptococcus_mitis STSiteStool -0.258471709381982 297 204 3.45919838845225e-166 8.02534026120922e-164
STSite s__Streptococcus_mitis STSitePosterior_fornix -0.261049166635981 297 204 4.22958832642797e-135 4.90632245865644e-133
STSite s__Streptococcus_oralis STSiteStool -0.0336438278067034 297 112 3.11171830769747e-85 2.39063208284818e-83
The corresponding pdf file of plots includes only those taxa which pass the significance threshold. An example plot from the demo output set is shown below.
The log file includes the formulas and values for all of the taxons processed. Each taxon has its own section including formulas and values. An example section from the demo outputs follows with the samples and values removed to save space. Download the complete file from the following link, HMP_log.txt .
#taxon s__Actinomyces_viscosus [full taxonomy removed to save space]
#metadata GENDER STSite
#samples SRS011061 [... removed to save space]
#Boost formula adCur ~ `GENDER` + `STSite`
#model-gbm
[values ... removed to save space]
adCur ~ GENDER + STSite
attr(,"variables")
list(adCur, GENDER, STSite)
attr(,"factors")
GENDER STSite
adCur 0 0
GENDER 1 0
STSite 0 1
attr(,"term.labels")
[1] "GENDER" "STSite"
attr(,"order")
[1] 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: 0x8079ad8>
attr(,"predvars")
list(adCur, GENDER, STSite)
attr(,"dataClasses")
adCur GENDER STSite
"numeric" "factor" "factor"
#summary-gbm
var rel.inf
STSite STSite 100
GENDER GENDER 0
#model
Call:
lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action)
Coefficients:
(Intercept) STSitePosterior_fornix STSiteStool
0.01745 -0.01737 -0.01744
#summary
Call:
lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action)
Residuals:
Min 1Q Median 3Q Max
-0.0174529 -0.0000848 -0.0000121 -0.0000121 0.0214007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0174529 0.0005423 32.18 <2e-16 ***
STSitePosterior_fornix -0.0173681 0.0009546 -18.20 <2e-16 ***
STSiteStool -0.0174407 0.0007215 -24.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.00561 on 294 degrees of freedom
Multiple R-squared: 0.6919, Adjusted R-squared: 0.6898
F-statistic: 330.1 on 2 and 294 DF, p-value: < 2.2e-16
- If running with random effects, how does the log differ from that shown?
Although we recommend using the default options, arguments exist to
modify both the MaAsLin methodology and output files. To see an
up-to-date listing of argument usage, type $ Maaslin.R --help
(from
the command line) or > help(Maaslin)
(within the R environment).
Example args:
strVerbosity = "DEBUG"
dSignificanceLevel = 0.1
In the above example, MaAsLin is modified to produce verbose output for debugging and to change the threshold for reporting associations only when the corresponding q-values are less than or equal to 0.1.
All versus All
The 'all versus all' analysis is a way of manipulating how metadata are
used. When this option is specified, a univariate analysis is run, i.e.
each metadata is separately evaluated for associations with the feature
relative abundances, as opposed to the default multivariable association
testing that models all the metadata at once. To give a more concrete
example, you may have metadata age, sex, and diet. Rather than fitting a
full multivariable model (such as features ~ age + sex + diet
), you
may want to evaluate each metadata separately (i.e. features ~ age
;
features ~ sex
; features ~ diet
). This can be specified in the
command line environment as an additional option as
$ Maaslin.R --fAllvAll
or within the R environment inside the
Maaslin()
function call as fAllvAll = TRUE
.
Forcing covariates
strForcedPredictors
indicates which metadata are forced in the
analysis. In this method, there is a group of metadata that is always
evaluated, in addition to another group of metadata that are not forced.
Following up on the example above, you may want to evaluate the
association between a set of features (i.e. relative abundances of
microbes) and a set of metadata (age and diet), while always adjusting
for sex. In this specific case, the sex metadata is the forced part of
the evaluation while the others are not forced. In addition, you may
also want to evaluate the non-forced set of metadata one at a time (i.e.
'all versus all'). This allows you to individually associate age and
diet while adjusting for sex. This does not, however, bypass the feature
selection method so the metadata that are not forced are subject to
feature selection and may be removed before coming to the evaluation. If
you want all the metadata that are not forced to be evaluated in serial
you will need to turn off feature selection and will have final combined
options (either within the R environment inside the Maaslin()
function
call or as additional arguments to the Maaslin.R
function from the
command line environment) as seen here:
fAllvAll = TRUE
strForcedPredictors = "sex"
strModelSelection = "none"
MaAsLin can be run from the Huttenhower lab Galaxy instance. With this method, you do not need to install MaAsLin on your local machine.
To run on Galaxy, MaAsLin requires a microbial abundance table with metadata attached (e.g.: maaslin_demo2.pcl).
Once you have your feature table with metadata ready, you can run it through MaAsLin on Galaxy by following these steps (these steps use the demo data as input):
- Go to the Huttenhower Galaxy Server: http://huttenhower.sph.harvard.edu/galaxy/.
- Click on the
Get Data -> Upload File
link on the left pane and upload the demo file maaslin_demo2.pcl. You can do this by clicking on theBrowse
button, selecting the demo file, and then pressing theStart
button. Select formattabular
.
- Click on the
MaAsLin
link on the left pane.- Select the input data from the
Input file
drop-down menu. - Specify the last metadata row in the sample, after which the microbial species are listed (this is Weight in our sample dataset).
- Click on Execute
- Select the input data from the
The results will appear on the right pane. You may proceed with viewing it (by clicking on the Eye symbol) or downloading it on your computer (by clicking on the Save symbol).
- HUMAnN 2.0
- HUMAnN 3.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- PARATHAA