Skip to content

MTXmodel_tutorial

kelsthom13 edited this page Jun 14, 2023 · 13 revisions

MTXmodel Tutorial

This R package is built for metatranscriptomics (MTX) modeling based on MaAsLin2. It integrates feature-specific covariates to determine multivariable association between metadata and microbial MTX features. MTX abundance changes are highly affected by underlying differences in metagenomic abundances (i.e. gene copy number). This package can adjust the DNA abundance as a continuous covariate for a given feature in the models for differential expression analysis in microbial communities.

If you use the MTXmodel package, please cite our manuscript: Yancong Zhang, Kelsey N. Thompson, Huttenhower C, Eric A. Franzosa. "Statistical approaches for differential expression analysis in metatranscriptomics." Bioinformatics, 37.Supplement_1: i34-i41 (2021).

And feel free to link it to your Methods: https://github.com/biobakery/MTX_model


Contents

1. Description

MTXmodel keeps all the features and functions of MaAsLin2 and additionally adds new functions which can be used for adjusting feature-specific covariates. In this tutorial, we will walk through most of the steps from the manuscript and compare the output of the unadjusted MaAsLin runs of MTX data, ratio adjusted RNA compared DNA copies, and then use the MTXmodel library to adjust for the underlying DNA abundance.

--- Add Fig. 1 from the manuscript

2. Install the library

2.1 Requirements

MTXmodel is an R package that can be run on the command line or as an R function.

2.2 Installation

Skip this step if you are doing this tutorial as part of a Huttenhower Lab run course.

If only running from the command line, you do not need to install the MTXmodel package but you will need to install the MTXmodel dependencies.

You only need to do one of the following options to install the MTX_model package.

Option 1: Installing from command line

  1. Download the source: MTX_model.master.zip
  2. Decompress the download:
    • $ tar xzvf MTX_model-master.zip
  3. Install the Bioconductor dependencies edgeR and metagenomeSeq.
  4. Install the CRAN dependencies:
    • $ R -q -e "install.packages(c('lmerTest','pbapply','car','dplyr','vegan','chemometrics','ggplot2','pheatmap','hash','logging','data.table','MuMIn','glmmTMB','MASS','cplm','pscl'), repos='http://cran.r-project.org')"
  5. Install the MTXmodel package (only required if running as an R function):
    • $ R CMD INSTALL MTX_model-master

Option 2: Installing using devtools

library(devtools)
install_github('biobakery/MTX_model')

3. Running the MTXmodel

MTXmodel can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings.

3.1 Input Files

MTXmodel requires three input files.

  1. Data (or features) file
    • This file is tab-delimited.
    • Formatted with features as columns and samples as rows.
    • The transpose of this format is also okay.
    • Possible features in this file include taxonomy or genes or pathways.
  2. Pre-normalized Data (or features) file - features are ratio of RNA data compared to underlying DNA abundance
    • This file is tab-delimited.
    • Formatted with features as columns and samples as rows.
    • The transpose of this format is also okay.
    • Possible features in this file include taxonomy or genes or pathways.
  3. Metadata file
    • This file is tab-delimited.
    • Formatted with features as columns and samples as rows.
    • The transpose of this format is also okay.
    • Possible metadata in this file include gender or age.
  4. Covariate data of features file
    • This file is tab-delimited.
    • Formatted with features as columns and samples as rows.
    • The transpose of this format is also okay.
    • Possible data in this file include DNA abundance of genes or pathways.

The data file can contain samples not included in the metadata file (as true for the reverse case more samples in the metadata). For both cases, those samples are not included in both files will be removed prior to constructing the model analysis. The samples order within the files does not need to match.

NOTE: If running MTXmodel as an R function, the data and metadata inputs can be of type data.frame instead of a path to a file.

3.2.1 Output files

MTXmodel generates the same types of output files with MaAsLin2: data and visualization. See more details in MaAsLin2 manual. For this tutorial we also will run two models using `MaAsLin 2' instead of the wrapper.

3.2.2 Input files

In this first example run on MTX data we will run MaAsLin 2 on the RNA pathway abundances as characterized by HUMAnN 2 but not normalized by the matched DNA of these samples.

Example input files can be found in the inst/extdata folder of the MTXmodel source. The files provided were generated from the HMP2 data which can be downloaded from https://ibdmdb.org/ .

HMP2_pwyRNA.tsv: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples.

HMP2_pwy.RNA_DNA_ratio.tsv: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples and it has been normalized as a ratio of the underlying (matched) DNA abundances.

HMP2_pwyDNA.tsv: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway DNA abundances for all samples.

HMP2_ratioDNA.tsv: is a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway DNA abundances for all samples.

HMP2_metadata.tsv: is a tab-delimited file with samples as rows and metadata as columns. It is a subset of the metadata file so that it just includes some of the fields.

3.3 Setting up environment

Set up the working directory - this is specific to the Huttenhower VMs - if you are running this somewhere else try setwd("./Desktop") or similar in your own environment

getwd()
setwd("./Tuorials/MTX_model/")
dir.create("R_MTXmodel_Tuorial") # Create a new directory
setwd("./R_Maaslin_tutorial") # Change the current working directory 
getwd() #check if directory has been successfully changed

Next we will read in the files downloaded with MTXmodel library in R. Here we are are first setting their path and then reading them in as data.frames for MaAsLin 2 alternatively you can use the path to the file natively.

library(MTXmodel) #add library to working path

#read in each file from the package then convert to a data.frame
### RNA abundances
input_data <- system.file(
  'extdata','HMP2_pwyRNA.tsv', package="MTXmodel")

df_input_data = read.table(file             = input_data,
                           header           = TRUE,
                           sep              = "\t", 
                           row.names        = 1,
                           stringsAsFactors = FALSE)
df_input_data[1:5, 1:5]

# RNA/DNA ratio data 
input_dataratio <- system.file(
  'extdata','HMP2_pwy.RNA_DNA_ratio.tsv', package="MTXmodel")
df_input_dataratio = read.table(file             = input_dataratio,
                                header           = TRUE,
                                sep              = "\t", 
                                row.names        = 1,
                                stringsAsFactors = FALSE)
df_input_dataratio[1:5, 1:5]

# Metadata from the HMP2
input_metadata <-system.file(
  'extdata','HMP2_metadata.tsv', package="MTXmodel")
df_input_metadata = read.table(file             = input_metadata,
                               header           = TRUE,
                               sep              = "\t", 
                               row.names        = 1,
                               stringsAsFactors = FALSE)
df_input_metadata[1:5, 1:5]

# DNA data 
input_dnadata <- system.file(
  'extdata','HMP2_pwyDNA.tsv', package="MTXmodel")
df_input_dnadata = read.table(file             = input_dnadata,
                               header           = TRUE,
                               sep              = "\t", 
                               row.names        = 1,
                               stringsAsFactors = FALSE)
df_input_dnadata[1:5, 1:5]

Next we are going to run the model in three different ways:

  1. The raw RNA abundances from HUMAnN
  2. The RNA/DNA ratios - which were created using a helper script from HUMAnN on paired MGX/MTX data
  3. Using the MTXmodel library to futher adjust the raw RNA abundance by the underlying DNA abundances

3.4 Raw RNA Abundances MaAsLin 2

library(Maaslin2)
fit_rna <- Maaslin2(
  input_data, input_metadata, 'demo_output_rna', transform = "LOG",
  fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
  random_effects = c('site', 'subject'),
  reference = "diagnosis,nonIBD",
  normalization = 'NONE'
)

---ADD heatmap from run

3.5 RNA/DNA Ratios MaAsLin 2

fit_rna_ratio <- Maaslin2(
  input_dataratio, input_metadata, 'demo_output_rna_ratio', transform = "LOG",
  fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
  random_effects = c('site', 'subject'),
  reference = "diagnosis,nonIBD",
  normalization = 'NONE'
)

--- ADD heatmap from run

3.6 RNA abundance adjusted by DNA abundance MTXmodel

library(MTXmodel)
fit_rna_dna <- MTXmodel(
  input_data, input_metadata, 'demo_output_rna_dna', transform = "LOG",
  fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
  random_effects = c('site', 'subject'),
  reference = "diagnosis,nonIBD",
  normalization = 'NONE',
  standardize = FALSE,
  input_dnadata = input_dnadata
)

--- Add heatmap from run

3.7 Compare output

Finally let's use some simple R scripts to compare the results from each model

Frist we will just look at the raw counts of features assocaited with the CD term in the diagnosis variable. To do this we will use the base R functions subset to subset the results to just the ones from the daignosis comparisons and table to counts the number of pathways that were assocaited with UC/CD in each model.

#compare the raw counts of features associted with CD
results_rna = subset(fit_rna$results, metadata == "diagnosis")
table(results_rna$value)
#CD/UC = 186

results_rna_ratio = subset(fit_rna_ratio$results, metadata == "diagnosis")
table(results_rna_ratio$value)
#CD/UC = 37

results_rna_dna = subset(fit_rna_dna$results, metadata == "diagnosis")
table(results_rna_dna$value)
#CD/UC = 178

As you can tell, this number was highly dependent on the model, withe most results coming from the raw RNA abundances through MaAsLin and the least results from the RNA/DNA ratio data once again with the MaAsLin function.

Next, let's look at which (if any) features overlapped between the models. We can do this with the interect call in R

# features called by the RNA/DNA ratios compared to the Raw RNA abundances 
intersect(results_rna_ratio$feature, results_rna$feature) # 37 features overlapped 

# features called by the Raw RNA abundances compared to the RNA abundances adjusted by the DNA abundances 
intersect(results_rna$feature, results_rna_dna$feature) #178 features overlapped between the models

# features called by the RNA/DNA ratios compared to the RNA abundances adjusted by the DNA abundances 
intersect(results_rna_ratio$feature, results_rna_dna$feature) # 37 features overlapped 

From this we can see that while the models are calling different total numbers of pathways, the ones that they are calling are the same.

We can also look at the top pathways that each model called. The MaAsLin results are sorted by FDR-qvalue, so here we can use head(x, number) to get the number, number of top rows.

For the Raw RNA Abundance:

#compare the raw counts of features associted with CD
head(results_rna$feature, 3)
results_rna$model = "RNA"
[1] "PWY_5690_TCA_cyc_II"             "PWY_5690_TCA_cyc_II"             "PWY_7199_pyrimidine_dns_salvage"
#compare the raw counts of features associted with CD
head(results_rna_ratio$feature, 3)
results_rna_ratio$model = "RNA/DNA Ratio"
[[1] "PWY_6700_queuosine_biosyn"   "PWY_6317_galactose_deg_I"    "PWY66_422_D_galactose_deg_V"
#compare the raw counts of features associted with CD
head(results_rna_dna$feature, 3)
results_rna_dna$model = "RNA adjusted by DNA"
list = unique(results_rna_dna$feature)[c(1:10)]
[1] "PWY_3841_folate_transformations_II"             "X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn" "PWY_3841_folate_transformations_II"  
library(ggplot2)
results = rbind(results_rna, results_rna_dna, results_rna_ratio)
results = results[results$feature %in% list, ]
ggplot(results, aes(x = coef, y = feature, color = model)) + geom_point(aes(shape = value)) + theme_classic()
ggsave("model_comparison.png", height = 5, width = 5)

---insert plot

4. Running on command line

$ MTXmodel.R inst/extdata/HMP2_pwyRNA.tsv inst/extdata/HMP2_metadata.tsv demo_output --min_abundance 0 --min_prevalence 0.0 --max_significance 0.25 --min_variance 0.0 --correction BH --standardize TRUE --normalization NONE --transform LOG --analysis_method LM --cores 1 --fixed_effects diagnosis,dysbiosisCD,dysbiosisUC,dysbiosisnonIBD,antibiotics,age --random_effects subject,site --plot_heatmap FALSE --plot_scatter FALSE --reference diagnosis,nonIBD --rna_dna_flt semi_strict --input_dnadata inst/extdata/HMP2_pwyDNA.tsv

  • Make sure to provide the full path to the MTXmodel executable (i.e., ./R/MTXmodel.R).
  • In the demo command:
    • HMP2_pwyRNA.tsv is the path to your MTX-feature data file
    • HMP2_pwyDNA.tsv is the path to your paired MGX-feature data file
    • HMP2_metadata.tsv is the path to your metadata file
    • demo_output is the path to the folder to write the output

5. Session Info

Session info from running the demo in R can be displayed with the following command.

sessionInfo()

Options

Run MTXmodel help to print a list of the options and the default settings.

$ MTXmodel.R --help

Usage: ./R/MTXmodel.R [options] <data.tsv> <metadata.tsv> <output_folder>

Options:

-h, --help
	Show this help message and exit
	
-a MIN_ABUNDANCE, --min_abundance=MIN_ABUNDANCE
	The minimum abundance for each feature [ Default: 0 ]

-p MIN_PREVALENCE, --min_prevalence=MIN_PREVALENCE
	The minimum percent of samples for which a feature is detected at minimum abundance [ Default: 0.1 ]

-b MIN_VARIANCE, --min_variance=MIN_VARIANCE
	Keep features with variances greater than value [ Default: 0 ]

-s MAX_SIGNIFICANCE, --max_significance=MAX_SIGNIFICANCE
	The q-value threshold for significance [ Default: 0.25 ]

-n NORMALIZATION, --normalization=NORMALIZATION
	The normalization method to apply  [ Default: TSS ] [ Choices: TSS, CLR, CSS, NONE, TMM ]

-t TRANSFORM, --transform=TRANSFORM
	The transform to apply [ Default: LOG ] [ Choices: LOG, LOGIT, AST, NONE, PA ]

-m ANALYSIS_METHOD, --analysis_method=ANALYSIS_METHOD
	The analysis method to apply [ Default: LM ] [ Choices: LM, CPLM, NEGBIN, ZINB, LOGIT ]

-r RANDOM_EFFECTS, --random_effects=RANDOM_EFFECTS
	The random effects for the model,  comma-delimited for multiple effects  [ Default: none ]

-f FIXED_EFFECTS, --fixed_effects=FIXED_EFFECTS
	The fixed effects for the model,  comma-delimited for multiple effects  [ Default: all ]

-c CORRECTION, --correction=CORRECTION
	The correction method for computing  the q-value [ Default: BH ]

-z STANDARDIZE, --standardize=STANDARDIZE
	Apply z-score so continuous metadata are on  the same scale [ Default: TRUE ]

-l PLOT_HEATMAP, --plot_heatmap=PLOT_HEATMAP
	Generate a heatmap for the significant  associations [ Default: TRUE ]

-i HEATMAP_FIRST_N, --heatmap_first_n=HEATMAP_FIRST_N
	In heatmap, plot top N features with significant  associations [ Default: 50 ]

-o PLOT_SCATTER, --plot_scatter=PLOT_SCATTER
	Generate scatter plots for the significant  associations [ Default: TRUE ]

-e CORES, --cores=CORES
	The number of R processes to  run in parallel [ Default: 1 ]

-d REFERENCE, --reference=REFERENCE
	The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables [ Default: NA ]

-x INPUT_DNADATA, --input_dnadata=INPUT_DNADATA
	The DNA abundance for each feature [ Default: none ]

-y RNA_DNA_FLT, --rna_dna_flt=RNA_DNA_FLT
	Filtering features/samples based on the detectable abundance of RNA and covariate DNA per feature [ Default: lenient ]
	[ Choices: none, lenient, semi_strict, strict]
	none: do not apply filtering
	lenient: ignore features that are not detected at both DNA and RNA levels across all samples
	semi_strict: ignore the samples where both feature's DAN and feature's RNA are not detected
	strict: ignore the samples where either feature's DAN or feature's RNA is not detected
Clone this wiki locally