MTXmodel_tutorial
The MTXmodel R package was built for metatranscriptomics (MTX) differential gene expression analysis using MaAsLin2 (Microbiome Multivariable Association with Linear Models). It integrates feature-specific covariates to determine multivariable associations between metadata and microbial MTX features. RNA expression changes within a microbial community are highly affected by the underlying differences in metagenomic abundances (i.e. gene copy number or the abundance of a given microbe). This package can adjust for the community DNA abundance as a continuous covariate for a given feature in the model, allowing for robust differential expression analysis in microbial communities.
If you use the MTXmodel package, please cite our manuscript: Yancong Zhang, Kelsey N. Thompson, Huttenhower C, Eric A. Franzosa. "Statistical approaches for differential expression analysis in metatranscriptomics." Bioinformatics, 37.Supplement_1: i34-i41 (2021).
And feel free to link it to your Methods: https://github.com/biobakery/MTX_model
- 1. Description
- 2. Install the library
- 3. Running the MTXmodel
- 4. Running on command line
- 5. Session Info
The R library MTXmodel keeps all the features and functions of MaAsLin2 and adds new functionality which can be used for adjusting feature-specific covariates. In this tutorial, we will walk through most of the steps from the manuscript. We will compare the output of unadjusted MaAsLin2 runs of MTX data with ratio-adjusted MTX data using DNA copy number or MTX abundance data adjusted with DNA abundances.
MTXmodel is an R package that can be run on the command line or as an R function.
Skip this step if you are doing this tutorial as part of a Huttenhower Lab run course.
If only running from the command line, you do not need to install the MTXmodel package but you will need to install the MTXmodel dependencies.
NOTE: You only need to do one of the following options to install the MTX_model package.
- Download the source: MTX_model.master.zip
- Decompress the download:
$ tar xzvf MTX_model-master.zip
- Install the Bioconductor dependencies edgeR and metagenomeSeq.
- Install the CRAN dependencies:
$ R -q -e "install.packages(c('lmerTest','pbapply','car','dplyr','vegan','chemometrics','ggplot2','pheatmap','hash','logging','data.table','MuMIn','glmmTMB','MASS','cplm','pscl'), repos='http://cran.r-project.org')"
- Install the MTXmodel package (only required if running as an R function):
$ R CMD INSTALL MTX_model-master
library(devtools)
install_github('biobakery/MTX_model')
MTXmodel can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings.
MTXmodel requires three input files:
- A feature table of RNA abundances - we generated this with HUMAnN 2
- This file is tab-delimited.
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible features in this file include RNA abundance of genes, enzymes, or pathways.
- Covariate DNA data of features file
- This file is tab-delimited.
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible data in this file include DNA abundance of genes, enzymes, or pathways.
- Metadata file
- This file is tab-delimited.
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible metadata in this file include pH, disease status, or age.
The data file can contain samples not included in the metadata file (as true for the reverse case of more samples in the metadata). For both cases, those samples that are not included in both files will be removed prior to model construction. Additionally, The sample order within the files does not need to match as MTXModel will double check this for us.
NOTE: If running MTXmodel as an R
function, the data and metadata
inputs can be of type data.frame
instead of a path to a file.
Example input files can be found in the inst/extdata
folder
of the MTXmodel source. The files provided were generated from
the HMP2 data which can be downloaded from https://ibdmdb.org/ .
HMP2_pwyRNA.tsv
: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples.
HMP2_pwyDNA.tsv
: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway DNA abundances for all samples.
HMP2_pwy.RNA_DNA_ratio.tsv
: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples and it has been normalized as a ratio of the underlying (matched) DNA abundances.
HMP2_metadata.tsv
: is a tab-delimited file with samples as rows and metadata as columns. It is a subset of the metadata file so that it just includes some of the fields.
MTXmodel generates the same types of output files with MaAsLin2: data and visualization. See more details in MaAsLin2 manual. For this tutorial, we also will run two models using MaAsLin 2
instead of the wrapper.
Set up the working directory - this example is specific to the Huttenhower VMs - if you are running this somewhere else try setwd("./Desktop")
or similar in your own environment
getwd()
setwd("./Tutorials/MTX_model/") # change this line to your specific environment
dir.create("R_MTXmodel_Tuorial") # Create a new directory
setwd("./R_MTXmodel_Tuorial") # Change the current working directory
getwd() #check if the directory has been successfully changed
Next, we will read-in the files downloaded with MTXmodel library in R. Here we are first setting their path to an object and then reading them in as data.frames for MaAsLin 2 alternatively, you can use the path to the file natively.
library(MTXmodel) #add the library to the working path
#read-in each file from the package then convert it to a data.frame
### RNA abundances
input_data <- system.file(
'extdata','HMP2_pwyRNA.tsv', package="MTXmodel")
df_input_data = read.table(file = input_data,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_data[1:5, 1:5]
# RNA/DNA ratio data
#input_dataratio <- system.file(
# 'extdata','HMP2_pwy.RNA_DNA_ratio.tsv', package="MTXmodel")
df_input_dataratio = read.table(file = "~/Tutorials/MTX_model/inst/extdata/HMP2_pwy.RNA_DNA_ratio.tsv",
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_dataratio[1:5, 1:5]
# Metadata from the HMP2
input_metadata <-system.file(
'extdata','HMP2_metadata.tsv', package="MTXmodel")
df_input_metadata = read.table(file = input_metadata,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_metadata[1:5, 1:5]
# DNA data
input_dnadata <- system.file(
'extdata','HMP2_pwyDNA.tsv', package="MTXmodel")
df_input_dnadata = read.table(file = input_dnadata,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_dnadata[1:5, 1:5]
Next, we are going to run the model in three different ways:
- The raw RNA abundances from HUMAnN with MaAsLin 2
- The RNA/DNA ratios - which were created using a helper script from HUMAnN on paired MGX/MTX data with MaAsLin 2
- Using the MTXmodel library to further adjust the raw RNA abundance by the underlying DNA abundances
In this first example run on MTX data, we will run MaAsLin 2 on the RNA pathway abundances as characterized by HUMAnN 2 but not normalized by the matched DNA of these samples.
library(Maaslin2)
fit_rna <- Maaslin2(
df_input_data, df_input_metadata, 'demo_output_rna', transform = "LOG",
fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
random_effects = c('site', 'subject'),
reference = "diagnosis,nonIBD",
normalization = 'NONE',
standardize = FALSE
)
Next, we will run the same model changing the input to the RNA/DNA ratio data.frame
.
fit_rna_ratio <- Maaslin2(
df_input_dataratio, df_input_metadata, 'demo_output_rna_ratio', transform = "LOG",
fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
random_effects = c('site', 'subject'),
reference = "diagnosis,nonIBD",
normalization = 'NONE',
standardize = FALSE
)
Finally, we run the MTXmodel. Here we will switch the input_data
back to the raw RNA abundance file (CPM normalized), and add a flag for the DNA abundance file, input_dnadata = input_dnadata
, which will add the covariate of the DNA abundance to the linear model as it runs within MaAsLin 2. The other MTXmodel specific flag is --rna_dna_flt=lenient
which applies different levels of filtering for the zeros in the data. This flag can be tuned depending on how zero-inflated your data is, the default is lenient which filters out any features that are absent in either the DNA or the RNA (must have at least one sample with some abundance in the RNA and DNA).
library(MTXmodel)
fit_rna_dna <- MTXmodel(
df_input_data, df_input_metadata, 'demo_output_rna_dna', transform = "LOG",
fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
random_effects = c('site', 'subject'),
reference = "diagnosis,nonIBD",
normalization = 'NONE',
standardize = FALSE,
input_dnadata = df_input_dnadata
)
Finally, let's use some simple R scripts to compare the results from each model:
First, we will just look at the raw counts of features associated with the CD term in the diagnosis variable. To do this we will use the base R function subset
to subset the results to just the ones from the diagnosis comparisons and table
to count the number of pathways that were associated with UC/CD in each model.
#compare the raw counts of features associated with CD
results_rna = subset(fit_rna$results, metadata == "diagnosis")
table(results_rna$value)
#CD/UC = 186
results_rna_ratio = subset(fit_rna_ratio$results, metadata == "diagnosis")
table(results_rna_ratio$value)
#CD/UC = 37
results_rna_dna = subset(fit_rna_dna$results, metadata == "diagnosis")
table(results_rna_dna$value)
#CD/UC = 178
As you can tell, this number was highly dependent on the model, with most results coming from the raw RNA abundances through MaAsLin and the least results from the RNA/DNA ratio data once again with the MaAsLin function.
Next, let's look at which (if any) features overlapped between the models. We can do this with the intersect
call in R:
# features called by the RNA/DNA ratios compared to the Raw RNA abundances
intersect(results_rna_ratio$feature, results_rna$feature) # 37 features overlapped
# features called by the Raw RNA abundances compared to the RNA abundances adjusted by the DNA abundances
intersect(results_rna$feature, results_rna_dna$feature) #178 features overlapped between the models
# features called by the RNA/DNA ratios compared to the RNA abundances adjusted by the DNA abundances
intersect(results_rna_ratio$feature, results_rna_dna$feature) # 37 features overlapped
From this, we can see that while the models are calling different total numbers of pathways, the ones that they are calling are overlapping between all the models.
We can also look at the top pathways that each model called. The MaAsLin results are sorted by FDR-qvalue, so here we can use head(x, number)
to get the number
of top rows.
For the Raw RNA Abundance:
#compare the raw counts of features associated with CD
head(results_rna$feature, 3)
results_rna$model = "RNA" # create a column that describes the data
[1] "PWY_5690_TCA_cyc_II" "PWY_5690_TCA_cyc_II" "PWY_7199_pyrimidine_dns_salvage"
#RNA/DNA ratios
#compare the raw counts of features associated with CD
head(results_rna_ratio$feature, 3)
results_rna_ratio$model = "RNA/DNA Ratio" # create a column that describes the data
[[1] "PWY_6700_queuosine_biosyn" "PWY_6317_galactose_deg_I" "PWY66_422_D_galactose_deg_V"
#MTXmodel
#compare the raw counts of features associated with CD
head(results_rna_dna$feature, 3)
results_rna_dna$model = "RNA adjusted by DNA" # create a column that describes the data
list = unique(results_rna_dna$feature)[c(1:10)] # select the top pathway results for the next step
[1] "PWY_3841_folate_transformations_II" "X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn" "PWY_3841_folate_transformations_II"
Finally, let's plot the top results in the MTX_model, across all the models. Here we first create one object that includes all the results, then subset it to the list we created above which is the top 10 pathways in the MTXmodel results.
library(ggplot2)
results = rbind(results_rna, results_rna_dna, results_rna_ratio) # create one object from the results
results = results[results$feature %in% list, ] # subset to just the top features in the MTXmodel
ggplot(results, aes(x = coef, y = feature, color = model)) + geom_point(aes(shape = value)) + theme_classic()
ggsave("model_comparison.png", height = 5, width = 8) # save the plot to the working directory
Here, you can tell that model choice does matter quite a bit on the effect size. So even if the models are calling the same features the underlying effect size does vary.
$ MTXmodel.R inst/extdata/HMP2_pwyRNA.tsv inst/extdata/HMP2_metadata.tsv demo_output --min_abundance 0 --min_prevalence 0.0 --max_significance 0.25 --min_variance 0.0 --correction BH --standardize TRUE --normalization NONE --transform LOG --analysis_method LM --cores 1 --fixed_effects diagnosis,dysbiosisCD,dysbiosisUC,dysbiosisnonIBD,antibiotics,age --random_effects subject,site --plot_heatmap FALSE --plot_scatter FALSE --reference diagnosis,nonIBD --rna_dna_flt semi_strict --input_dnadata inst/extdata/HMP2_pwyDNA.tsv
- Make sure to provide the full path to the MTXmodel executable (i.e., ./R/MTXmodel.R).
- In the demo command:
-
HMP2_pwyRNA.tsv
is the path to your MTX-feature data file -
HMP2_pwyDNA.tsv
is the path to your paired MGX-feature data file -
HMP2_metadata.tsv
is the path to your metadata file -
demo_output
is the path to the folder to write the output
-
Session info from running the demo in R can be displayed with the following command.
sessionInfo()
Run MTXmodel help to print a list of the options and the default settings.
$ MTXmodel.R --help
Usage: ./R/MTXmodel.R [options] <data.tsv> <metadata.tsv> <output_folder>
Options:
-h, --help
Show this help message and exit
-a MIN_ABUNDANCE, --min_abundance=MIN_ABUNDANCE
The minimum abundance for each feature [ Default: 0 ]
-p MIN_PREVALENCE, --min_prevalence=MIN_PREVALENCE
The minimum percent of samples for which a feature is detected at minimum abundance [ Default: 0.1 ]
-b MIN_VARIANCE, --min_variance=MIN_VARIANCE
Keep features with variances greater than value [ Default: 0 ]
-s MAX_SIGNIFICANCE, --max_significance=MAX_SIGNIFICANCE
The q-value threshold for significance [ Default: 0.25 ]
-n NORMALIZATION, --normalization=NORMALIZATION
The normalization method to apply [ Default: TSS ] [ Choices: TSS, CLR, CSS, NONE, TMM ]
-t TRANSFORM, --transform=TRANSFORM
The transform to apply [ Default: LOG ] [ Choices: LOG, LOGIT, AST, NONE, PA ]
-m ANALYSIS_METHOD, --analysis_method=ANALYSIS_METHOD
The analysis method to apply [ Default: LM ] [ Choices: LM, CPLM, NEGBIN, ZINB, LOGIT ]
-r RANDOM_EFFECTS, --random_effects=RANDOM_EFFECTS
The random effects for the model, comma-delimited for multiple effects [ Default: none ]
-f FIXED_EFFECTS, --fixed_effects=FIXED_EFFECTS
The fixed effects for the model, comma-delimited for multiple effects [ Default: all ]
-c CORRECTION, --correction=CORRECTION
The correction method for computing the q-value [ Default: BH ]
-z STANDARDIZE, --standardize=STANDARDIZE
Apply z-score so continuous metadata are on the same scale [ Default: TRUE ]
-l PLOT_HEATMAP, --plot_heatmap=PLOT_HEATMAP
Generate a heatmap for the significant associations [ Default: TRUE ]
-i HEATMAP_FIRST_N, --heatmap_first_n=HEATMAP_FIRST_N
In heatmap, plot top N features with significant associations [ Default: 50 ]
-o PLOT_SCATTER, --plot_scatter=PLOT_SCATTER
Generate scatter plots for the significant associations [ Default: TRUE ]
-e CORES, --cores=CORES
The number of R processes to run in parallel [ Default: 1 ]
-d REFERENCE, --reference=REFERENCE
The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables [ Default: NA ]
-x INPUT_DNADATA, --input_dnadata=INPUT_DNADATA
The DNA abundance for each feature [ Default: none ]
-y RNA_DNA_FLT, --rna_dna_flt=RNA_DNA_FLT
Filtering features/samples based on the detectable abundance of RNA and covariate DNA per feature [ Default: lenient ]
[ Choices: none, lenient, semi_strict, strict]
none: do not apply filtering
lenient: ignore features that are not detected at both DNA and RNA levels across all samples
semi_strict: ignore the samples where both feature's DAN and feature's RNA are not detected
strict: ignore the samples where either feature's DAN or feature's RNA is not detected
- HUMAnN 2.0
- HUMAnN 3.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- PARATHAA