Skip to content


ljmciver edited this page May 23, 2022 · 32 revisions
Clone this wiki locally

MetaPhlAn 3.0 Tutorial

MetaPhlAn is a tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data.

If you're using this tutorial as part of a workshop, or if you'd like to think about the algorithm and implementation details a bit, you'll occasionally see "discussion questions" at the end of tutorial sections, formatted like this:

  • What is your quest?
  • What is your favorite color?

Table of contents


The basic steps of MetaPhlAn are:


Create taxonomic profiles

MetaPhlAn accepts as input short reads from a single shotgun metagenomic sequencing experiment and outputs the list of detected microbes and their relative abundances.

Input files

MetaPhlAn accepts metagenomic sequence in several formats, including .fasta, .fastq, bowtie2out and sam.

To see all the input formats (and other arguments), type metaphlan -h | less. Use the arrow key to move up and down. Type q to quit back to the prompt.

usage: metaphlan --input_type {fastq,fasta,bowtie2out,sam} [--force]
             [--bowtie2db METAPHLAN_BOWTIE2_DB] [-x INDEX]
             [--bt2_ps BowTie2 presets] [--bowtie2_exe BOWTIE2_EXE]
             [--bowtie2_build BOWTIE2_BUILD] [--bowtie2out FILE_NAME]
             [--min_mapq_val MIN_MAPQ_VAL] [--no_map] [--tmp_dir]
             [--tax_lev TAXONOMIC_LEVEL] [--min_cu_len]
             [--min_alignment_len] [--add_viruses] [--ignore_eukaryotes]
             [--ignore_bacteria] [--ignore_archaea] [--stat_q]
             [--perc_nonzero] [--ignore_markers IGNORE_MARKERS]
             [--avoid_disqm] [--stat] [-t ANALYSIS TYPE]
             [--nreads NUMBER_OF_READS] [--pres_th PRESENCE_THRESHOLD]
             [--clade] [--min_ab] [-o output file] [--sample_id_key name]
             [--use_group_representative] [--sample_id value]
             [-s sam_output_file] [--legacy-output] [--CAMI_format_output]
             [--unknown_estimation] [--biom biom_output] [--mdelim mdelim]
             [--nproc N] [--install] [--force_download]
             [--read_min_len READ_MIN_LEN] [-v] [-h]
             [INPUT_FILE] [OUTPUT_FILE]

 MetaPhlAn version 3.0 (20 Mar 2020): 
 METAgenomic PHyLogenetic ANalysis for metagenomic taxonomic profiling.

AUTHORS: Nicola Segata (, Duy Tin Truong, Francesco Asnicar (, 
Francesco Beghini (

  • Which command line arguments are required?
  • Which optional command line arguments seem most important or most commonly used?

For the purpose of this tutorial, we will use the following set of six input files that have been subsampled for rapid analysis:

The original files, and many others, can be downloaded from the HMP DACC.

  • IF you are running this tutorial as a part of short course on a cloud instance, the input files have been pre-downloaded. Open a terminal and enter:

      cd Tutorials/metaphlan3
      ls input
  • Copy the input files into the current working directory

      cp input/*fasta.gz .

Use the ls command to check if the files have been correctly copied/downloaded. We are now ready to run MetaPhlAn.

Please proceed to Run a single sample section below.

  • IF you're running this tutorial locally (your PC/laptop), make a new folder in your current working directory with mkdir and access it with cd. Use this folder for all MetaPhlAn analysis.

      mkdir metaphlan_analysis
      cd metaphlan_analysis
  • Click on the links to download each file. NOTE that if you are using the Google Chrome browser, it may automatically decompress (gunzip) the gzip-ed files. The files should download to your default Downloads folder location. Copy files to this location from your Downloads folder:

     cp ~/Downloads/SRS*.fasta.gz .
  • OR, you can use the curl or wget programs to download each file directly into the folder you created. Make sure in the correct folder i.e. metaphlan_analysis with pwd. Right click on each link, choose "Copy Link Address" (or equivalent), and in a terminal type:

     curl -LO

Use the ls command to check if the files have been correctly copied/downloaded. We are now ready to run MetaPhlAn.

  • If you're familiar with the command line and file structure, does it matter where you place MetaPhlAn's input files, or where you run it from?
  • What about these input files might make them particularly appropriate for a short demonstration?

Run a single sample

Here is the basic example to profile a single metagenome from raw reads:

 metaphlan SRS014476-Supragingival_plaque.fasta.gz --input_type fasta > SRS014476-Supragingival_plaque_profile.txt

Output files

Running MetaPhlAn, following the example in the prior section, will create two output files. Check what files have been created with ls -ltr.

File 1: SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt

This file contains the intermediate mapping results to unique sequence markers.

Alignments are listed one per line in tab-separated columns of read and reference marker.

 less -S SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt


HWUSI-EAS1568_102539179:1:100:10001:7882/1      712117__F3PCC2__HMPREF9056_02717
HWUSI-EAS1568_102539179:1:100:10007:17628/1     712357__A0A0K2JD54__AMK43_09330
HWUSI-EAS1568_102539179:1:100:10017:5224/1      2047__E3H3C1__HMPREF0733_12099
HWUSI-EAS1568_102539179:1:100:10023:7402/1      712122__A0A0M4H4P0__AM609_02120
HWUSI-EAS1568_102539179:1:100:10025:16605/1     43768__C0E1K0__murJ
HWUSI-EAS1568_102539179:1:100:10033:18381/1     43768__E0DI09__HMPREF0299_5319
HWUSI-EAS1568_102539179:1:100:10083:4412/1      2047__E3H2T8__coaBC
HWUSI-EAS1568_102539179:1:100:10091:12482/1     505__C4GHX6__GCWU000324_00464
HWUSI-EAS1568_102539179:1:100:10094:10442/1     544581__U1RE23__HMPREF1979_00736
HWUSI-EAS1568_102539179:1:100:10103:1753/1      28133__F9DGE7__CBG57_05925
HWUSI-EAS1568_102539179:1:100:10109:14464/1     43768__E0DGH3__HMPREF0299_6971
HWUSI-EAS1568_102539179:1:100:10112:17904/1     43768__E0DEQ2__HMPREF0299_6337

File 2: SRS014476-Supragingival_plaque_profile.txt

This file contains the final computed organism abundances.

Organism abundances are listed one clade per line, tab-separated from the clade's percent abundance:

 less -S SRS014476-Supragingival_plaque_profile.txt


#/n/huttenhower_lab/tools/metaphlan3/bin/metaphlan SRS014476-Supragingival_plaque.fasta.gz --input_type fasta
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       100.0   
k__Bacteria|p__Actinobacteria   2|201174        100.0   
k__Bacteria|p__Actinobacteria|c__Actinobacteria 2|201174|1760   100.0   
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales    2|201174|1760|85007     65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales        2|201174|1760|85006     34.74319        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae      2|201174|1760|85007|1653        65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae      2|201174|1760|85006|1268        34.74319        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae|g__Corynebacterium   2|201174|1760|85007|1653|1716   65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae|g__Rothia    2|201174|1760|85006|1268|32207  34.74319        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae|g__Corynebacterium|s__Corynebacterium_matruchotii    2|201174|1760|85007|1653|1716|43768     65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae|g__Rothia|s__Rothia_dentocariosa     2|201174|1760|85006|1268|32207|2047     34.74319        k__Bacteria|p__Actinobacteria|
  • The file has a 4-line header. The first line lists the reference marker genes database that MetaPhlAn uses. There are ~1.1M unique clade-specific marker genes identified from ~100k reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic). The second line lists the path to the tool, the name of the input file and the arguments that mentioned. The fourth line has the column headers for the columns below.
  • The first column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. The taxonomic level of each clade is prefixed to indicate its level: Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__. Let us examine these more clearly by listing them out as per taxonomic hierarchy.

We will look for (grep) lines which contain the pattern s__ that is associated with species and print the first match with the -m1 argument. Remember that this file have 4 tab-separated columns and the taxonomy is listed in the first; so we will use cut -f1 to get (cut out) the first column only (the field at position 1). Finally, the taxonomic levels are separated by the | character which we will replace with the new line character \n.

  grep s__ -m1 SRS014476-Supragingival_plaque_profile.txt | cut -f1 | sed 's/|/\n/g'

Output (The taxonomy of the microbe C. matruchotii):

  • The second column lists the respective NCBI Taxon IDs.
  • The third column lists relative abundances. Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth. OTU equivalents can be extracted by using only the species-level s__ clades from this file (again, making sure to include clades unclassified at this level).

Let us check if all orders sum to 100% using grep. The orders 'Corynebacteriales' and 'Micrococcales' are in the class 'Actinobacteria'.

  grep o__ SRS014476-Supragingival_plaque_profile.txt | grep -v f__


k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales	2|201174|1760|85007	65.25681	
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales	2|201174|1760|85006	34.74319	

Similarly, the families must sum to 100%. In this example, let us also display the fields of interest i.e. taxonomy names and percentages for ease of viewing.

 grep f__ SRS014476-Supragingival_plaque_profile.txt | grep -v g__ | cut -f1,3


k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae	65.25681
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae	34.74319
  • The fourth column lists additional species for cases where the metagenome profile contains clades that represent multiple species. The species listed in column 1 is the representative species in such cases.

  • What do the parts of the read identifiers in the first column of the first, per-read/marker output file indicate?
  • What do the parts of the gene identifiers in the second column of the first, per-read/marker output file indicate?
  • Since the second, per-clade abundance output file is already normalized, you never need to sum-normalize these relative abundances. However, if you tried to, what would the sum of all clades' relative abundances be?

bowtie2out file as input

If available, it is recommended to use the bowtie2out file as an input to MetaPhlAn as it significantly speeds up metagenomic profiling. Let us delete the File 2 we created in the previous step and use the bowtie2out file (File1) to regenerate it. Notice that we will have to change the --input_type argument.

 rm -f SRS014476-Supragingival_plaque_profile.txt
 metaphlan SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt --input_type bowtie2out > SRS014476-Supragingival_plaque_profile.txt
 ls -ltr

Run on multiple cores

When available, MetaPhlAn can take advantage of multiple cores using the nproc flag:

 metaphlan SRS014459-Stool.fasta.gz --input_type fasta --nproc 4 > SRS014459-Stool_profile.txt

Note: nproc is used by bowtie2 which processes 10K reads per second per thread. Since we have a very small number of reads in this demo, the difference in speed up is negligible.

Run multiple samples

  • Each MetaPhlAn execution processes exactly one sample, but the resulting single-sample analyses can easily be combined into an abundance table spanning multiple samples. Let's finish the last four samples from the input files tutorial section:
     metaphlan SRS014464-Anterior_nares.fasta.gz --input_type fasta --nproc 4 > SRS014464-Anterior_nares_profile.txt
     metaphlan SRS014470-Tongue_dorsum.fasta.gz --input_type fasta --nproc 4 > SRS014470-Tongue_dorsum_profile.txt
     metaphlan SRS014472-Buccal_mucosa.fasta.gz --input_type fasta --nproc 4 > SRS014472-Buccal_mucosa_profile.txt
     metaphlan SRS014494-Posterior_fornix.fasta.gz --input_type fasta --nproc 4 > SRS014494-Posterior_fornix_profile.txt
  • Alternatively, if you are familiar with shell syntax, you can loop over all input files (make sure you have deleted previously generated output files to prevent errors):
     for i in SRS*.fasta.gz; do metaphlan $i --input_type fasta --nproc 4 > ${i%.fasta.gz}_profile.txt; done

Either way, you will now have a complete set of six profile output files and six intermediate mapping files. If you'd like to skip this step to speed things up, the 12 demo file outputs can be downloaded from the following links (right-click on the link and pick 'Save Link as ..' or click on the link and then right-click on the preview page and select "Save Page as...", or copy the URL to download on a server).

  1. Profile output files

  2. Intermediate mapping output files

  • If you are running this tutorial as part of a short course, the output bowtie2out and profile files are in the output folder. To copy them to the current directory, enter:
     cp output/*_profile.txt .

Merge outputs

Finally, the MetaPhlAn distribution includes a utility script that will create a single tab-delimited table from these files: *_profile.txt > merged_abundance_table.txt

The resulting table can be opened in Excel, any gene expression analysis program, less (example below), or visualized graphically as per subsequent tutorial sections:

 less -S merged_abundance_table.txt

The first few lines look like:

clade_name      NCBI_tax_id     SRS014494-Posterior_fornix_profile      SRS014476-Supragingival_plaque_profile  SRS014472-Buccal_mucosa_profile SRS014470-Tongue_dorsum_profile SRS014464-Anterior_nares_profile        SRS014459-Sto
k__Bacteria     2       100.0   100.0   100.0   100.0   100.0   100.0
k__Bacteria|p__Actinobacteria   2|201174        0       100.0   0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria 2|201174|1760   0       100.0   0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales    2|201174|1760|85007     0       65.25681        0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae      2|201174|1760|85007|1653        0       65.25681        0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae|g__Corynebacterium   2|201174|1760|85007|1653|1716   0       65.25681        0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae|g__Corynebacterium|s__Corynebacterium_matruchotii    2|201174|1760|85007|1653|1716|43768     0       65.25681        0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales        2|201174|1760|85006     0       34.743190000000006      0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae      2|201174|1760|85006|1268        0       34.743190000000006      0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae|g__Rothia    2|201174|1760|85006|1268|32207  0       34.743190000000006      0       0       0       0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae|g__Rothia|s__Rothia_dentocariosa     2|201174|1760|85006|1268|32207|2047     0       34.743190000000006      0       0       0       0

  • How might you convert these relative abundance measures to pseudo-RPKs that are sensitive to each sample's read depth?
  • In what important ways does analysis of a relative abundance table differ from that of a gene expression (microarray or RNA-seq) transcript table? In what ways are they similar?
  • Under what circumstances is this tab-delimited text data format particularly efficient or inefficient? Is this likely to be a problem for species-level taxonomic profiles?

Visualize results

Create a heatmap with hclust2

A heatmap is one way to visualize tabular abundance results such as those from MetaPhlAn. The plotting tool we'll use here, hclust2, is a convenience script that can show any, some, or all of the microbes or samples in a MetaPhlAn table. In this tutorial we will plot the heatmap for all of the samples.

If you're using this tutorial from the bioBakery VM (locally or on the cloud), hclust2 is already installed! Otherwise, you can install hclust2 and other bioBakery tools automatically with Conda. :

 conda install -c biobakery hclust2

This will install hclust2 and all of its dependencies.

Alternatively, you can manually install hclust2 from source by downloading hclust2 and then installing the hclust2 dependencies (numpy, pandas, biopython, scipy, and matplotlib).

Step 1: Generate the species only abundance table

Run the following command to create a species only abundance table, providing the abundance table ( merged_abundance_table.txt ) created in prior tutorial steps: :

  grep -E "s__|clade" merged_abundance_table.txt | sed 's/^.*s__//g'\
| cut -f1,3-8 | sed -e 's/clade_name/body_site/g' > merged_abundance_table_species.txt

There are four parts to this command. The first grep searches the file for the regular expression "s__|clade" which matches to those lines with species information and also to the header which contains the names of the body sites. The sed removes the full taxonomy from each line so the first column only includes the species name. The cut gives us all columns except the NCBI Taxon ID (column 2) and the last sed helps us replace clade_name to body_site.

The new abundance table (merged_abundance_table_species.txt) will contain only the species abundances with just the species names (instead of the full taxonomy).

The first few lines of the file will look like:

body_site	SRS014494-Posterior_fornix_profile	SRS014476-Supragingival_plaque_profile	SRS014472-Buccal_mucosa_profile	SRS014470-Tongue_dorsum_profile	SRS014464-Anterior_nares_profile	SRS014459-Stool_profile
Corynebacterium_matruchotii	0	65.25681	0	0	0	0
Rothia_dentocariosa	0	34.743190000000006	0	0	0	0
Bacteroides_stercoris	0	0	0	0	0	31.62003
Prevotella_histicola	0	0	0	51.36481	0	0
Prevotella_pallens	0	0	0	5.10895	0	0
Gemella_haemolysans	0	0	19.17145	0	0	0
Dolosigranulum_pigrum	0	0	0	0	2.7635099999999997	0
Lactobacillus_crispatus	80.56118000000001	0	0	0	0	0
Lactobacillus_iners	19.43882	0	0	0	0	0

  • Why is it useful to visualize (and particularly to cluster) only the "tips" of the taxonomic tree?

Step 2: Generate the heatmap

Next generate the species only heatmap by running the following command: -i merged_abundance_table_species.txt -o abundance_heatmap_species.png --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 0.5 -l --flabel_size 10 --slabel_size 10 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 300

We have only 16 microbes in this demo file but typically, for ease of viewing, one can show the top 25 species using the --ftop 25 argument. This script uses Bray-Curtis as the distance measure both between samples (s) and between features (f) (microbes), sets the ratio between the width/height of cells to 0.5, uses a log scale for assigning heatmap colors, sets the sample and feature label size to 10, sets the max sample and feature label length to 100, selects the minimum value to display as 0.1, and selects an image resolution of 300 (in that order!).

Open the resulting heatmap (abundance_heatmap_species.png) to take a look. If you generated it on your local computer, just double click. If you're using a server with just a terminal interface, you might have to transfer the file locally first using a tool like scp. If you're using a server with a graphical interface, you can open the file using a command like see abundance_heatmap_species.png). Using any of these methods, the results should look like:

Notice that due to the very large differences between body site communities in the human microbiome, we can still easily see site-specific species despite the small demonstration input files (each is subsampled to only 10,000 reads for efficiency).

  • Which microbes are most abundant at each body site in these demonstration data?
  • Under what circumstances is log-scaling the heatmap abundance colors good? When might it be bad (i.e. visually deceptive)?

Create a cladogram with GraPhlAn

You can also visualize microbial abundances on a tree of life (also referred to as a phylogeny or cladogram) that captures their taxonomic (or phylogenetic) relatedness. Here, we'll use a tool called GraPhlAn that can render trees and annotate them with microbial names or data such as abundances. The instructions here assume that you will run GraPhlAn from the command line, but if you'd like to use an online Galaxy module instead, see the section on GraPhlAn in Galaxy. For more information on this tool, refer to the GraPhlAn tutorial. If you're using this tutorial from the bioBakery VM (locally or on the cloud), GraPhlAn is already installed! Otherwise, you can install GraPhlAn and other bioBakery tools automatically with Conda:

 conda install -c biobakery graphlan
 conda install -b bioconda export2graphlan

This will install GraPhlAn, export2graphlan, and all of its dependencies.

Alternatively, you can manually install them from source by downloading GraPhlAn and downloading export2graphlan and then install the GraPhlAn dependencies (numpy, pandas, biopython, scipy, and matplotlib).

Step 1: Create the GraPhlAn input files

GraPhlAn requires two inputs: (i) a tree structure to represent and (ii) graphical annotation options for the tree.

Run the following command to generate the two input files for GraPhlAn (the tree and annotation files) providing the abundance table (merged_abundance_table.txt) created in the prior tutorial steps reformatted to remove the version header and the NCBI taxon id in the second column.

 tail -n +2 merged_abundance_table.txt | cut -f1,3- > merged_abundance_table_reformatted.txt --skip_rows 1 -i merged_abundance_table_reformatted.txt --tree merged_abundance.tree.txt --annotation merged_abundance.annot.txt --most_abundant 100 --abundance_threshold 1 --least_biomarkers 10 --annotations 5,6 --external_annotations 7 --min_clade_size 1

The command above has options to skip rows 1 and 2 (headers), select the top 100 most abundance clades to highlight, set a minimum abundance threshold for clades to be annotated, extract a minimum of 10 biomarkers, select taxonomic levels 5 and 6 to be annotated within the tree, select taxonomic level 7 to be used in the external legend, and set the minimum size of clades annotated as biomarkers to 1. The output files created are merged_abundance.tree.txt and merged_abundance.annot.txt.

  • What are the contents and structure of the "tree" file?
  • What are the contents and structure of the "annot" file?

Step 2: Create a cladogram

Run the following commands to generate the cladogram providing the tree ( merged_abundance.tree.txt ) and its annotation ( merged_abundance.annot.txt ) files from the prior step : --annot merged_abundance.annot.txt merged_abundance.tree.txt merged_abundance.xml --dpi 300 merged_abundance.xml merged_abundance.png --external_legends

The first command creates an xml file from the tree and annotation inputs. The second command creates the image, sets the image resolution to 300 and requests external legends.

The first few lines of the xml file are:

<phyloxml xmlns="" xmlns:xsi="" xsi:schemaLocation="">
  <phylogeny rooted="true">
        <property applies_to="clade" datatype="xsd:string" id_ref="clade_marker_size" ref="A:1">10.0</property>

The generated cladogram (merged_abundance.png) is:

The annotation (merged_abundance_annot.png ) should be:

And the legend (merged_abundance_legend.png ) is:

As above, if you generated these images on your local computer, open them by simply double clicking. If you're using a server with only a terminal interface, transfer the file locally first using a tool like scp. If you're using a server with a graphical interface, you can open the file using a command like see merged_abundance.png).

  • What is the PhyloXML format? Why might it be used in this context?
  • Why is it often particularly useful to plot circular, rather than linear, cladograms?
  • What other types of annotations might be useful on such a tree (either different graphical formats, or different types of data to take advantage of them)?