-
Notifications
You must be signed in to change notification settings - Fork 75
metaphlan2
MetaPhlAn2 is a tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data.
- For information on the first generation of MetaPhlAn, see the MetaPhlAn1 tutorial.
- Interested in trying out the latest version?
- Please see MetaPhlAn 3.0 tutorial.
- Please direct questions to the MetaPhlAn2 biobakery support forum.
- For additional information, see the MetaPhlAn2 User Manual.
If you're using this tutorial as part of a workshop, or if you'd like to think about the algorithm and implementation details a bit, you'll occasionally see "discussion questions" at the end of tutorial sections, formatted like this:
- What is your quest?
- What is your favorite color?
Table of contents
- Overview
- Prerequisites
- Installation
- Create taxonomic profiles
- Visualize results
- Create a strain-level marker-based heatmap (PanPhlAn)
- Additional tools
The basic steps of MetaPhlAn2 are:
- Python (version >= 2.7)
- Bowtie2
- Numpy
- Pandas (optional, only required by utility scripts)
- BioPython (optional, only required by utility scripts)
- SciPy (optional, only required by utility scripts)
- Matplotlib (optional, only required by utility scripts)
-
biom (optional, only required for
biom
format input/output)
If you're using this tutorial from the bioBakery VM (locally or on the cloud), MetaPhlAn2 is already installed! Otherwise, you can install MetaPhlAn2 and other bioBakery tools automatically with Conda. :
$ conda install -c bioconda metaphlan2=2.7
This will also install all MetaPhlAn2 dependencies.
Alternatively, you can manually install MetaPhlAn2 from source and then manually install the dependencies.
Step 1: Download MetaPhlAn2 and unpack the software:
$ tar xzvf biobakery-metaphlan2-<versionid>.tar.gz
$ cd biobakery-metaphlan2-<versionid>/
$ ls
Step 2: Add all of the scripts to your PATH:
- Add this command
export PATH=$MDIR:$MDIR/utils/:$PATH
to your$HOME/.bashrc
file a. Replace$MDIR
with the full path to the MetaPhlAn2 folder b. To get the value of$MDIR
run$ pwd
from inside the MetaPhlAn2 folder - Source your
$HOME/.bashrc
file. a.$ source $HOME/.bashrc
b. This only needs to be done once to update your PATH in your current session. c. All future sessions will automatically source this file when you login.
Alternatively, instead of adding the scripts to your PATH, you can
provide the full path to the script each time you run. For example, the
following command prints the MetaPhlAn2 help screen (replace $MDIR
with the full path to the MetaPhlAn2 folder): :
$ $MDIR/metaphlan2.py --help
Step 3: Install the MetaPhlAn2 dependencies.
Install bowtie2 in a folder
in your PATH or specify the path to the bowtie2 executable with each run
of MetaPhlAn2 using the option --bowtie2_exe <bowtie2>
.
MetaPhlAn2 accepts as input short reads from a single shotgun metagenomic sequencing experiment and outputs the list of detected microbes and their relative abundances.
MetaPhlAn2 accepts metagenomic sequence in several formats, including
.fasta
, .fastq
, and .tar.bz2
.
To see all the input formats type metaphlan2.py -h | less
. Use the
arrow key to move up and down. Type q
to quit back to the prompt.
- Which command line arguments are required?
- Which optional command line arguments seem most important or most commonly used?
For the purpose of this tutorial, we will use the following set of six input files that have been subsampled for rapid analysis:
- SRS014476-Supragingival_plaque.fasta.gz
- SRS014494-Posterior_fornix.fasta.gz
- SRS014459-Stool.fasta.gz
- SRS014464-Anterior_nares.fasta.gz
- SRS014470-Tongue_dorsum.fasta.gz
- SRS014472-Buccal_mucosa.fasta.gz
The original files, and many others, can be downloaded from the HMP DACC.
If you're running this tutorial locally, click on the links to download
each file. NOTE that if you are using the Google Chrome browser, it
may automatically decompress (gunzip) the gzip-ed files. The files
should download to your default Downloads
folder location.
If you're running this tutorial on a server or cloud instance, you can
use the curl
or wget
programs to download each file. Right click on
each link, choose "Copy Link Address" (or equivalent), and in a terminal
type:
$ curl -LO https://github.com/biobakery/biobakery/raw/master/test_suite/biobakery_tests/data/metaphlan2/input/SRS014476-Supragingival_plaque.fasta.gz
$ curl -LO https://github.com/biobakery/biobakery/raw/master/test_suite/biobakery_tests/data/metaphlan2/input/SRS014494-Posterior_fornix.fasta.gz
$ curl -LO https://github.com/biobakery/biobakery/raw/master/test_suite/biobakery_tests/data/metaphlan2/input/SRS014459-Stool.fasta.gz
...
Once you have downloaded all six input files, if you did not already place them in your tutorial working directory, make a new folder there and move the downloaded files to it using the following commands at the terminal prompt.
-
Make a new folder in your current working directory :
$ mkdir metaphlan2_analysis
-
Move the six input files from the Downloads folder (or wherever they were placed) to the metaphlan2_analysis folder :
$ mv ~/Downloads/SRS*.fasta.gz metaphlan2_analysis/
-
Now change directories into the metaphlan2_analysis folder and list the directory contents to see the six files :
$ cd metaphlan2_analysis $ ls
We are now ready to run the MetaPhlAn2 analysis.
- If you're familiar with the command line and file structure, does it matter where you place MetaPhlAn2's input files, or where you run it from?
- What about these input files might make them particularly appropriate for a short demonstration?
Here is the basic example to profile a single metagenome from raw reads. :
$ metaphlan2.py SRS014476-Supragingival_plaque.fasta.gz --input_type fasta > SRS014476-Supragingival_plaque_profile.txt
Running MetaPhlAn2, following the example in the prior section, will create two output files.
File 1: SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt
This file contains the intermediate mapping results to unique sequence markers.
Alignments are listed one per line in tab-separated columns of read and reference marker: :
$ less -S SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt
File 2: SRS014476-Supragingival_plaque_profile.txt
This file contains the final computed organism abundances.
Organism abundances are listed one clade per line, tab-separated from the clade's percent abundance: :
$ less -S SRS014476-Supragingival_plaque_profile.txt
The first column lists clades, ranging from taxonomic kingdoms
(Bacteria, Archaea, etc.) through species. The taxonomic level of each
clade is prefixed to indicate its level:
Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__
.
Since sequence-based profiling is relative and does not provide absolute
cellular abundance measures, clades are hierarchically summed. Each
level will sum to 100%; that is, the sum of all kindom-level clades is
100%, the sum of all genus-level clades (including unclassified) is also
100%, and so forth. OTU equivalents can be extracted by using only the
species-level s__
clades from this file (again, making sure to include
clades unclassified at this level).
- What do the parts of the read identifiers in the first column of the first, per-read/marker output file indicate?
- What do the parts of the gene identifiers in the second column of the first, per-read/marker output file indicate?
- Since the second, per-clade abundance output file is already normalized, you never need to sum-normalize these relative abundances. However, if you tried to, what would the sum of all clades' relative abundances be?
When available, MetaPhlAn2 can take advantage of multiple cores using
the nproc
flag: :
$ metaphlan2.py SRS014459-Stool.fasta.gz --input_type fasta --nproc 4 > SRS014459-Stool_profile.txt
- Under typical circumstances, how many cores (threads) should you ask MetaPhlAn2 to use?
Each MetaPhlAn2 execution processes exactly one sample, but the resulting single-sample analyses can easily be combined into an abundance table spanning multiple samples. Let's finish the last four samples from the input files tutorial section: :
$ metaphlan2.py SRS014464-Anterior_nares.fasta.gz --input_type fasta --nproc 4 > SRS014464-Anterior_nares_profile.txt
$ metaphlan2.py SRS014470-Tongue_dorsum.fasta.gz --input_type fasta --nproc 4 > SRS014470-Tongue_dorsum_profile.txt
$ metaphlan2.py SRS014472-Buccal_mucosa.fasta.gz --input_type fasta --nproc 4 > SRS014472-Buccal_mucosa_profile.txt
$ metaphlan2.py SRS014494-Posterior_fornix.fasta.gz --input_type fasta --nproc 4 > SRS014494-Posterior_fornix_profile.txt
Alternatively, if you're familiar with bash
shell syntax, you can loop
over the entire sample set: :
$ for f in SRS*.fasta.gz
> do
> metaphlan2.py $f --input_type fasta --nproc 4 > ${f%.fasta.gz}_profile.txt
> done
Either way, you will now have a complete set of six profile output files and six intermediate mapping files. If you'd like to skip this step to speed things up, the 12 demo file outputs can be downloaded from the following links (click the link and then right-click on the preview page to select "Save as...", or copy the URL to download on a server).
Profile output files:
- SRS014459-Stool_profile.txt
- SRS014464-Anterior_nares_profile.txt
- SRS014470-Tongue_dorsum_profile.txt
- SRS014472-Buccal_mucosa_profile.txt
- SRS014476-Supragingival_plaque_profile.txt
- SRS014494-Posterior_fornix_profile.txt
Intermediate mapping output files:
- SRS014459-Stool.fasta.gz.bowtie2out.txt
- SRS014464-Anterior_nares.fasta.gz.bowtie2out.txt
- SRS014470-Tongue_dorsum.fasta.gz.bowtie2out.txt
- SRS014472-Buccal_mucosa.fasta.gz.bowtie2out.txt
- SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt
- SRS014494-Posterior_fornix.fasta.gz.bowtie2out.txt
Finally, the MetaPhlAn2 distribution includes a utility script that will create a single tab-delimited table from these files: :
$ merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt
The resulting table can be opened in Excel, any gene expression analysis
program, less
(example below), or visualized graphically as per
subsequent tutorial sections. :
$ less -S merged_abundance_table.txt
- How might you convert these relative abundance measures to pseudo-RPKs that are sensitive to each sample's read depth?
- In what important ways does analysis of a relative abundance table differ from that of a gene expression (microarray or RNA-seq) transcript table? In what ways are they similar?
- Under what circumstances is this tab-delimited text data format particularly efficient or inefficient? Is this likely to be a problem for species-level taxonomic profiles?
A heatmap is one way to visualize tabular abundance results such as
those from MetaPhlAn2. The plotting tool we'll use here, hclust2
, is a
convenience script that can show any, some, or all of the microbes or
samples in a MetaPhlAn2 table. In this tutorial we will plot the heatmap
for all of the samples.
If you're using this tutorial from the bioBakery VM (locally or on the cloud), hclust2 is already installed! Otherwise, you can install hclust2 and other bioBakery tools automatically with Conda. :
$ conda install -c biobakery hclust2
This will install hclust2 and all of its dependencies.
Alternatively, you can manually install hclust2 from source by downloading hclust2 and then installing the hclust2 dependencies (numpy, pandas, biopython, scipy, and matplotlib).
Step 1: Generate the species only abundance table
Run the following command to create a species only abundance table, providing the abundance table ( merged_abundance_table.txt ) created in prior tutorial steps: :
$ grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt
There are three parts to this command. The first grep searches the file
for the regular expression "(s__)|(^ID)"
which matches to those lines
with species information and also to the header. The second grep does
not print out any lines with strain information (labeled as t__
). The
sed removes the full taxonomy from each line so the first column only
includes the species name.
The new abundance table (merged_abundance_table_species.txt) will contain only the species abundances with just the species names (instead of the full taxonomy).
The first few lines of the file will look like: :
ID SRS014459-Stool_profile SRS014464-Anterior_nares_profile SRS014470-Tongue_dorsum_profile SRS014472-Buccal_mucosa_profile SRS014476-Supragingival_plaque_profile SRS014494-Posterior_fornix_profile
Actinomyces_graevenitzii 0.0 0.0 0.85102 0.0 0.0 0.0
Corynebacterium_matruchotii 0.0 0.0 0.0 0.0 58.21595 0.0
Corynebacterium_pseudodiphtheriticum 0.0 14.07759 0.0 0.0 0.0 0.0
Rothia_dentocariosa 0.0 0.0 0.0 0.0 32.10966 0.0
Bacteroides_cellulosilyticus 3.82206 0.0 0.0 0.0 0.0 0.0
Bacteroides_massiliensis 10.61295 0.0 0.0 0.0 0.0 0.0
Bacteroides_ovatus 4.08051 0.0 0.0 0.0 0.0 0.0
- Why is it useful to visualize (and particularly to cluster) only the "tips" of the taxonomic tree?
- Why does the grep / sed command work the way that it does?
- What assumptions does it make about the taxonomy?
Step 2: Generate the heatmap
Next generate the species only heatmap by running the following command: :
$ hclust2.py -i merged_abundance_table_species.txt -o abundance_heatmap_species.png --ftop 25 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 0.5 -l --flabel_size 6 --slabel_size 6 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 300
The command above includes options to select the top 25 features, use Bray-Curtis as the distance measure both between samples and between features (microbes), set the ratio between the width/height of cells to 0.5, use a log scale for assigning heatmap colors, set the sample and feature label size to 6, set the max sample and feature label length to 100, select the minimum value to display as 0.1, and select an image resolution of 300.
Open the resulting heatmap
(abundance_heatmap_species.png)
to take a look. If you generated it on your local computer, just double
click. If you're using a server with just a terminal interface, you
might have to transfer the file locally first using a tool like scp
.
If you're using a server with a graphical interface, you can open the
file using a command like see abundance_heatmap_species.png
). Using
any of these methods, the results should look like:
Notice that due to the very large differences between body site communities in the human microbiome, we can still easily see site-specific species despite the small demonstration input files (each is subsampled to only 10,000 reads for efficiency).
- Which microbes are most abundant at each body site in these demonstration data?
- Under what circumstances is log-scaling the heatmap abundance colors good? When might it be bad (i.e. visually deceptive)?
- Why do we show only the 25 most abundant features in this example?
You can also visualize microbial abundances on a tree of life (also referred to as a phylogeny or cladogram) that captures their taxonomic (or phylogenetic) relatedness. Here, we'll use a tool called GraPhlAn that can render trees and annotate them with microbial names or data such as abundances. The instructions here assume that you will run GraPhlAn from the command line, but if you'd like to use an online Galaxy module instead, see the section on GraPhlAn in Galaxy. For more information on this tool, refer to the GraPhlAn tutorial.
If you're using this tutorial from the bioBakery VM (locally or on the cloud), GraPhlAn is already installed! Otherwise, you can install GraPhlAn and other bioBakery tools automatically with Conda. :
$ conda install -c biobakery graphlan
This will install GraPhlAn and all of its dependencies.
Alternatively, you can manually install GraPhlAn from source by downloading GraPhlAn and then install the GraPhlAn dependencies (numpy, pandas, biopython, scipy, and matplotlib).
Step 1: Create the GraPhlAn input files
GraPhlAn requires two inputs: (i) a tree structure to represent and (ii) graphical annotation options for the tree.
Run the following command to generate the two input files for GraPhlAn (the tree and annotation files) providing the abundance table (merged_abundance_table.txt) created in prior tutorial steps: :
$ export2graphlan.py --skip_rows 1,2 -i merged_abundance_table.txt --tree merged_abundance.tree.txt --annotation merged_abundance.annot.txt --most_abundant 100 --abundance_threshold 1 --least_biomarkers 10 --annotations 5,6 --external_annotations 7 --min_clade_size 1
The command above has options to skip rows 1 and 2, select the top 100
most abundance clades to highlight, set a minimum abundance threshold
for clades to be annotated, extract a minimum of 10 biomarkers, select
taxonomic levels 5 and 6 to be annotated within the tree, select
taxonomic level 7 to be used in the external legend, and set the minimum
size of clades annotated as biomarkers to 1. The output files created
are merged_abundance.tree.txt
and merged_abunance.annot.txt
.
- What are the contents and structure of the "tree" file?
- What are the contents and structure of the "annot" file?
Step 2: Create a cladogram
Run the following commands to generate the cladogram providing the tree ( merged_abundance.tree.txt ) and annotation ( merged_abundance.annot.txt ) files from the prior step :
$ graphlan_annotate.py --annot merged_abundance.annot.txt merged_abundance.tree.txt merged_abundance.xml
$ graphlan.py --dpi 300 merged_abundance.xml merged_abundance.png --external_legends
The first command creates an xml file from the tree and annotation inputs. The second command creates the image, setting the image resolution to 300 and requesting external legends.
The first few lines of the xml file are: :
<phyloxml xmlns="http://www.phyloxml.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd">
<phylogeny rooted="true">
<clade>
<clade>
<name>k__Viruses</name>
<branch_length>1.0</branch_length>
The generated cladogram (merged_abundance.png) is:
The annotation (merged_abundance_annot.png ) should be:
And the legend (merged_abundance_legend.png ) is:
As above, if you generated these images on your local computer, open
them by simply double clicking. If you're using a server with only a
terminal interface, transfer the file locally first using a tool like
scp
. If you're using a server with a graphical interface, you can open
the file using a command like see merged_abundance.png
).
- What is the PhyloXML format? Why might it be used in this context?
- Why is it often particularly useful to plot circular, rather than linear, cladograms?
- What other types of annotations might be useful on such a tree (either different graphical formats, or different types of data to take advantage of them)?
PanPhlAn provides one way to identify, track, and phylogenetically place individual strains from metagenomes. It uses phylogenomics, or the co-occurrence of gene presence / absence across strains in a metagenome collection. Here, we'll show a simple example of identifying strains in demonstration data and visualizing them as a marker-based heatmap.
PanPhlAn uses MetaPhlAn markers to identify the dominant strain within each (sufficiently abundant) species across one or more metagenomes. You will thus use a different sample set for this tutorial than in prior sections, since we provide multiple samples that all share the same abundant species (unlike the differing species across body sites above).
The files you will start this section with were generated by running MetaPhlAn2. They are intermediate mapping results output files (named "bowtie2out.txt"). If you would like to learn how to generate these files from the input fasta files, see the first step in the StrainPhlAn Tutorial.
For this tutorial section, please download the following MetaPhlAn intermediate bowtie2 output files (click the link and then right-click on the preview page to select "Save as...", or copy the URL to download as above):
- 13530241_SF05.fasta.gz.bowtie2out.txt
- 13530241_SF06.fasta.gz.bowtie2out.txt
- 19272639_SF05.fasta.gz.bowtie2out.txt
- 19272639_SF06.fasta.gz.bowtie2out.txt
- 40476924_SF05.fasta.gz.bowtie2out.txt
- 40476924_SF06.fasta.gz.bowtie2out.txt
- Different human individuals generally don't gain and lose different genes. Why do different microbial strains?
- What pros and cons are there in using gene presence / absence patterns for strain identification, as opposed to nucleotide polymorphisms or other types of variants?
Step 1: Create a file of abundances for all markers in the selected species
Run the following commands to create marker abundance files for all
samples. For this tutorial we will be selecting the species
s__Eubacterium_siraeum
and requiring at least 1% abundance. :
$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 13530241_SF05.fasta.gz.bowtie2out.txt > 13530241_SF05.siraeum.txt
$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 13530241_SF06.fasta.gz.bowtie2out.txt > 13530241_SF06.siraeum.txt
$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 19272639_SF05.fasta.gz.bowtie2out.txt > 19272639_SF05.siraeum.txt
$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 19272639_SF06.fasta.gz.bowtie2out.txt > 19272639_SF06.siraeum.txt
$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 40476924_SF05.fasta.gz.bowtie2out.txt > 40476924_SF05.siraeum.txt
$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 40476924_SF06.fasta.gz.bowtie2out.txt > 40476924_SF06.siraeum.txt
The commands above set options to indicate the input file is of type
bowtie2 output, the analysis is of type clade_specific_strain_tracker
,
the clade to track, and the minimum abundance is 1.0.
The following output files will be created:
- 13530241_SF05.siraeum.txt
- 13530241_SF06.siraeum.txt
- 19272639_SF05.siraeum.txt
- 19272639_SF06.siraeum.txt
- 40476924_SF05.siraeum.txt
- 40476924_SF06.siraeum.txt
Step 2: Merge the marker abundance files
Merge the output files by running the following command: :
$ merge_metaphlan_tables.py *.siraeum.txt > siraeum_tracker.txt
This will generate a merged marker abundance file ( siraeum_tracker.txt ).
Step 3: Create a heatmap
Run the following command to create the heatmap: :
$ hclust2.py -i siraeum_tracker.txt -o siraeum_tracker.png --skip_rows 1 --f_dist_f hamming --no_flabels --dpi 300 --cell_aspect_ratio 0.01
The command above sets options to skip row 1 in the input file (it skips reading the sample id header), the feature distance is computed with hamming, no feature labels will be included, the image resolution is 300, and the ratio between the width/height of cells is 0.01.
For more information on hclust2, see the section on how to create a heatmap with hclust2.
The resulting figure ( siraeum_tracker.png ) will show the E. siraeum markers present (yellow) and absent (black) in the clustered heatmap:
The strain level marker heatmap is an example of PanPhlAn results. For additional information on strain-level profiling, see the PanPhlAn Tutorial.
The LEfSe and MaAsLin tutorials describe additional tools to analyze MetaPhlAn2 output.
- HUMAnN 2.0
- HUMAnN 3.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MaAsLin 3.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- MTX model 3
- PARATHAA