Skip to content

MetaPhlAn 4

abmiguez edited this page Aug 25, 2022 · 19 revisions

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level. With StrainPhlAn, it is possible to perform accurate strain-level microbial profiling.

MetaPhlAn 4 relies on ~5.1M unique clade-specific marker genes (the latest marker information file can be found here) identified from ~1M microbial genomes (~236,600 references and 771,500 metagenomic assembled genomes) spanning 26,970 species-level genome bins (SGBs, http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html), 4,992 of them taxonomically unidentified at the species level (the full list of the species included in the latest database can be found here), allowing:

  • unambiguous taxonomic assignments;
  • an accurate estimation of organismal relative abundance;
  • SGB-level resolution for bacteria, archaea and eukaryotes;
  • strain identification and tracking
  • orders of magnitude speedups compared to existing methods.
  • metagenomic strain-level population genomics

Pre-requisites

MetaPhlAn requires python 3 or newer with numpy, and Biopython libraries installed. Python libraries are automatically installed by pip. MetaPhlAn relies on BowTie2 (version 2.3 or higher) to map reads against marker genes. Check that bowtie2 is present in the system path with execute and read permissions.

If MetaPhlAn is installed using conda, no pre-requisites are needed.

MetaPhlAn is integrated with advanced heatmap plotting with hclust2 and cladogram visualization with GraPhlAn. If you use such visualization tools please refer to their prerequisites.

The best way to install MetaPhlAn is through conda via the Bioconda channel. If you have not configured your Anaconda installation to fetch packages from Bioconda, please follow these steps to set up the channels.

You can install MetaPhlAn by running

$ conda install -c bioconda metaphlan

It is recommended to create an isolated conda environment and install MetaPhlAn into it.

$ conda create --name mpa -c bioconda python=3.7 metaphlan

If during the installation you encounter an incompatibility error with the glibc package, we suggest you to add the conda-forge channel to conda or run one of the following commands.

$ conda install -c conda-forge -c bioconda metaphlan
$ conda create --name mpa -c conda-forge -c bioconda python=3.7 metaphlan

This allows having the correct version of all the dependencies isolated from the system's python installation.

Before using MetaPhlAn, you should activate the mpa environment:

$ conda activate mpa

MetaPhlAn is also available in PyPi

$ pip install metaphlan

Alternatively, you can manually download from GitHub or clone the repository using the following command

$ git clone https://github.com/biobakery/MetaPhlAn.git

and install MetaPhlAn by running

$ pip install .

If you choose this way, you'll need to install manually some dependencies!

MetaPhlAn needs the clade markers and the database to be downloaded locally. To obtain them:

$ metaphlan --install 

Important! The MetaPhlAn 4 database has been substantially increased in comparison with the previous 3.1 version. Thus, for running MetaPhlAn 4, a minimum of 15GB or memory is needed.

If you have installed MetaPhlAn using Anaconda, it is advised to install the database in a folder outside the Conda environment. To do this, run

$ metaphlan --install --bowtie2db <database folder>

If you install the database in a different location, remember to run MetaPhlAn using --bowtie2db <database folder>!

By default, the latest MetaPhlAn database is downloaded and built. You can download a specific version with the --index parameter

$ metaphlan --install --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db <database folder>

When --index is specified, MetaPhlAn skips the check for the latest database version and run the analysis using the database version provided by --index located in --bowtie2db.

This option is recommended when MetaPhlAn is run on HPC clusters or containerized

If you have issues in downloading the database, you can get it from:

Just download the .tar, .md5, and the mpa_latest files and place them in the metaphlan_databases folder.


Basic Usage

Important! The MetaPhlAn 4 database has been substantially increased in comparison with the previous 3.1 version. Thus, for running MetaPhlAn 4, a minimum of 15GB or memory is needed.

$ metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt

It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:

$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):

$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out -o profiled_metagenome.txt

bowtie2out files generated with MetaPhlAn versions below 3.0 are not compatible. Starting from MetaPhlAn 3, the BowTie2 output now includes the size of the profiled metagenome.

You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn with the obtained SAM:

$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103  -U metagenome.fastq
$ metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt

MetaPhlAn can also natively handle paired-end metagenomes (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):

$ metaphlan metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

Starting from version 3, MetaPhlAn can estimate the unclassified fraction of the metagenome. The relative abundance profile is scaled according to the percentage of reads mapping to a clade in the database.

$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq --unclassified_estimation -o profiled_metagenome.txt

If you want to estimate the unknown fraction of a metagenome and your input file is a SAM file, remember to specify the metagenome size using --nreads. You can get easily get the metagenome size from a SAM file if you have run bowtie2 without the --no-unal parameter by running

$ wc -l metagenome.sam

Otherwise, read_fastx.py is your choice: this will print on the standard error the metagenome size.

$ read_fastx.py metagenome.fastq > /dev/null

You can provide the specific database version with --index.

By default MetaPhlAn is run with --index latest: the latest version of the database is used; if it is not available, MetaPhlAn will try to download it.

When --index is specified, MetaPhlAn skips the check for the latest database version and run the analysis using the database version provided by --index located in --bowtie2db.

For advanced options and other analysis types (such as strain tracking) please refer to the full command-line options metaphlan --help.

Utility Scripts

MetaPhlAn's repository features a few utility scripts to aid in the manipulation of sample output and its visualization. These scripts can be found under the utils folder in the MetaPhlAn directory.

Merging Tables

The script merge_metaphlan_tables.py allows to combine MetaPhlAn output from several samples to be merged into one table Bugs (rows) vs Samples (columns) with the table enlisting the relative normalized abundances per sample per bug.

To merge multiple output files, run the script as below

$ merge_metaphlan_tables.py metaphlan_output1.txt metaphlan_output2.txt > metaphlan_output3.txt output/merged_abundance_table.txt

Wildcards can be used as needed:

$ merge_metaphlan_tables.py metaphlan_output*.txt > output/merged_abundance_table.txt

Output files can be merged only if the profiling was performed with the same version of the MetaPhlAn database.

There is no limit to how many files you can merge.

Converting SGB profiles to the GTDB taxonomy

The script sgb_to_gtdb_profile.py allows to convert a SGB-based MetaPhlAn 4 output into a GTDB-taxonomy-based profile.

To do so, run the script as below

$ sgb_to_gtdb_profile.py -i metaphlan_output.txt -o metaphlan_output_gtdb.txt

Alpha and beta diversity calculation

The script calculate_diversity.R allows to compute alpha and/or beta diversity, with different metrics of choice, starting from a merged MetaPhlAn table. Available alpha-diversity metrics are richness, shannon, simpson, and gini. Available beta-diversity distance functions are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, centered log-ratio, and aitchison. For example, to generate a beta diversity distance matrix with bray-curtis, you need to run the script as below:

Rscript calculate_diversity.R -f merged_mpa4_profiles.tsv -d beta -m bray-curtis

To compute UniFrac distances, the SGB tree in the Newick format (available here) must be provided.

For the full list of options, please run:

Rscript calculate_diversity.R

Heatmap Visualization

The hclust2 script generates a hierarchically-clustered heatmap from MetaPhlAn abundance profiles. To generate the heatmap for a merged MetaPhlAn output table (as described above), you need to run the script as below:

hclust2.py \
  -i HMP.species.txt \
  -o HMP.sqrt_scale.png \
  --skip_rows 1 \
  --ftop 50 \
  --f_dist_f correlation \
  --s_dist_f braycurtis \
  --cell_aspect_ratio 9 \
  -s --fperc 99 \
  --flabel_size 4 \
  --metadata_rows 2,3,4 \
  --legend_file HMP.sqrt_scale.legend.png \
  --max_flabel_len 100 \
  --metadata_height 0.075 \
  --minv 0.01 \
  --no_slabels \
  --dpi 300 \
  --slinkage complete

GraPhlAn Visualization

The tutorial of using GraPhlAn can be found from the GraPhlAn wiki.

Customizing the database

In order to add a marker to the database, the user needs the following steps:

  • Reconstruct the marker sequences (in fasta format) from the MetaPhlAn BowTie2 database by:
bowtie2-inspect metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103 > metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_markers.fasta
  • Add the marker sequence stored in a file new_marker.fasta to the marker set:
cat new_marker.fasta >> metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_markers.fasta
  • Rebuild the bowtie2 database:
bowtie2-build metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_markers.fasta metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_NEW
  • Assume that the new marker was extracted from genome1, genome2. Update the taxonomy file from the Python console as follows:
import pickle
import bz2

db = pickle.load(bz2.open('metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl', 'r'))

# Add the taxonomy of the new genomes
db['taxonomy']['7-levels taxonomy with clade names of genome1'] = ('7-levels NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['7-levels taxonomy with clade names of genome2'] = ('7-levels NCBI taxonomy id of genome1', length of genome2)

# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
                                   'clade': the clade that the marker belongs to,
                                   'ext': {the GCA of the first external genome where the marker appears,
                                           the GCA of the second external genome where the marker appears,
                                          },
                                   'len': length of the marker,
                                   'taxon': the taxon of the marker
                                }
                                   
To see an example, try to print the first marker information:
 print list(db['markers'].items())[0]

# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_NEW.pkl', 'w') as ofile:
    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)

  • To use the new database, remember to run metaphlan.py with the "--index mpa_vJan21_CHOCOPhlAnSGB_NEW" parameter.