Skip to content

StrainPhlAn 3

abmiguez edited this page Jul 25, 2022 · 1 revision

StrainPhlAn: metagenomic strain-level population genomics

Description

StrainPhlAn is a computational tool for tracking individual strains across a large set of samples. The input of StrainPhlAn is a set of metagenomic samples and for each species, the output is a multiple sequence alignment (MSA) file of all species strains reconstructed directly from the samples. From this MSA, StrainPhlAn calls (PhyloPhlAn 3)[http://segatalab.cibio.unitn.it/tools/phylophlan3/index.html] to build the phylogenetic tree showing the strain evolution of the sample strains.

For each sample, StrainPhlAn is able to identify strains of a species by merging and concatenating all reads mapped against species-specific MetaPhlAn markers.

In detail, let us start from a toy example with 6 HMP gut metagenomic samples (SRS055982-subjectID_638754422, SRS022137-subjectID_638754422, SRS019161-subjectID_763496533, SRS013951-subjectID_763496533, SRS014613-subjectID_763840445, SRS064276-subjectID_763840445) from 3 three subjects (each was sampled at two time points) and one Bacteroides caccae genome G000273725. We would like to:

  • extract the Bacteroides caccae strains from these samples and compare them with the reference genome in a phylogenetic tree.
  • know how many SNPs between those strains and the reference genome.

By running StrainPhlAn on these samples, we will obtain the Bacteroides caccae phylogenetic tree and its multiple sequence alignment in the following figure (produced with ete2 and Jalview):

tree_alignment.png

We can see that the strains from the same subject are grouped together. The tree also highlights that the strains from each subject did not evolve between the two sampling time points. From the tree, we also know that the strains of subject "763496533" are closer to the reference genome than those of the others.

In addition, the table below shows the number of SNPs between the sample strains and the reference genome based on the strain alignment returned by StrainPhlAn.

svn_distances

In the next sections, we will illustrate step by step how to run StrainPhlAn on this toy example to reproduce the above figures.

Installation

StrainPhlAn is included in MetaPhlAn. You can get more info about MetaPhlAn installation here.

If you have installed MetaPhlAn using pip, you'll need to install some dependencies first

## Please run the commands in this order!
$ pip install numpy cython pysam pandas h5py
$ pip install biom-format cmseq phylophlan

StrainPhlAn is available also via Docker, we provide an image with all the needed dependencies already configured:

$ docker pull biobakery/strainphlan

Usage

Let's reproduce the toy example result in the introduction section. The steps are as follows:

Step 1. Download the 6 HMP gut metagenomic samples, the metadata.txt file and one reference genome from the folder "fastqs" and "reference_genomes" in this link and put these folders under the "strainer_tutorial" folder.

Step 2. Obtain the sam files from these samples by mapping them against MetaPhlAn database:

This step will run MetaPhlAn: the metagenomic samples will be mapped against the MetaPhlAn marker database and then SAM files (*.sam.bz2) will produced. Each SAM file (in SAM format) contains the reads mapped against the marker database of MetaPhlAn. The commands to run are:

mkdir -p sams
mkdir -p bowtie2
mkdir -p profiles
for f in fastq/SRS*
do
    echo "Running MetaPhlAn on ${f}"
    bn=$(basename ${f})
    metaphlan ${f} --input_type fastq -s sams/${bn}.sam.bz2 --bowtie2out bowtie2/${bn}.bowtie2.bz2 -o profiles/${bn}_profiled.tsv
done

After this step, you will have a folder "sams" containing the SAM files (*.sam.bz2) and other MetaPhlAn output files in the "bowtie2" and "profiles" folders. If you want to skip this step, you can download the SAM files from the folder "sams" in this link.

Step 3. Produce the consensus-marker files which are the input for StrainPhlAn:

For each sample, this step will reconstruct all species strains found in it and store them in a pickle file (*.pkl). Those strains are referred as sample-reconstructed strains. The commands to run are:

mkdir -p consensus_markers
sample2markers.py -i sams/*.sam.bz2 -o consensus_markers -n 8

The result is the same if you want to run several sample2markers.py scripts in parallel with each run for a sample (this may be useful for some cluster-system settings). After this step, you will have a folder consensus_markers containing all sample-marker files (*.pkl).

If you want to skip this step, you can download the consensus marker files from the folder "consensus_markers" in this link.

Step 4. Extract the markers of Bacteroides_caccae from MetaPhlAn database (to add its reference genome later):

This step will extract the markers of Bacteroides_caccae in the database and then StrainPhlAn will identify the sequences in the reference genomes that are closet to them (in the next step by using blast). Those will be concatenated and referred as reference-genome-reconstructed strains. The commands to run are:

mkdir -p db_markers
extract_markers.py -c s__Bacteroides_caccae -o db_markers/

After this step, you should have one file in folder "db_markers": "s__Bacteroides_caccae.fna" containing the markers of Bacteroides caccae. Those markers can be found in the folder "db_markers" in this link.

Step 5. Build the multiple sequence alignment and the phylogenetic tree:

This step will filter the selected clade markers based on their presence in the sample-reconstructed strains (stored in the marker files produced in step 3) and reference-genomes (if specified). Also, the sample-reconstructed strains and reference-genomes will be filtered based on the presence of the selected clade markers. From this filtered markers and samples, StrainPhlAn will call PhyloPhlAn to produce a multiple sequence alignment (MSA) to then build the phylogenetic tree.

The commands to run are:

mkdir -p output
strainphlan -s consensus_markers/*.pkl -m db_markers/s__Bacteroides_caccae.fna -r reference_genomes/G000273725.fna.bz2 -o output -n 8 -c s__Bacteroides_caccae --mutation_rates

After this step, you will find the tree "output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre". All the output files can be found in the folder "output" in this link. You can view it by Archaeopteryx or any other viewers.

In order to add the metadata, we provide outside the conda installation a script called add_metadata_tree.py which can be used as follows:

add_metadata_tree.py -t output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre -f fastq/metadata.txt -m subjectID --string_to_remove .fastq.bz2

The script "add_metadata_tree.py" can accept multiple metadata files (space-separated, wild card can also be used) and multiple trees. A metadata file is a tab-separated file where the first row is the meta-headers, and the following rows contain the metadata for each sample. Multiple metadata files are used in the case where your samples come from more than one dataset and you do not want to merge the metadata files.

For more details of using "add_metadata_tree.py", please see its help (with option "-h"). An example of a metadata file is the "fastqs/metadata.txt" file with the below content:

sampleID        subjectID
SRS055982       638754422
SRS022137       638754422
SRS019161       763496533
SRS013951       763496533
SRS014613       763840445
SRS064276       763840445
G000273725  ReferenceGenomes

Note that "sampleID" is a compulsory field.

After adding the metadata, you will obtain the tree files "*.tre.metadata" with metadata and view them by Archaeopteryx as in the previous step.

If you have installed graphlan, you can plot the tree with the additional plot_tree_graphlan.py script:

plot_tree_graphlan.py -t output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre.metadata -m subjectID

and obtain the following figure (output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre.metadata.png):

graphlan_tree.png

Note that this Script must be executed using Python2.

Full command-line options

usage: strainphlan [-h] [-d DATABASE] [-m CLADE_MARKERS]
                       [-s SAMPLES [SAMPLES ...]]
                       [-r REFERENCES [REFERENCES ...]] [-c CLADE]
                       [-o OUTPUT_DIR] [-n NPROCS]
                       [--secondary_samples SECONDARY_SAMPLES [SECONDARY_SAMPLES ...]]
                       [--secondary_references SECONDARY_REFERENCES [SECONDARY_REFERENCES ...]]
                       [--trim_sequences TRIM_SEQUENCES]
                       [--marker_in_n_samples MARKER_IN_N_SAMPLES]
                       [--sample_with_n_markers SAMPLE_WITH_N_MARKERS]
                       [--secondary_sample_with_n_markers SECONDARY_SAMPLE_WITH_N_MARKERS]
                       [--phylophlan_configuration PHYLOPHLAN_CONFIGURATION]
                       [--mutation_rates]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        The input MetaPhlAn database
  -m CLADE_MARKERS, --clade_markers CLADE_MARKERS
                        The clade markers as FASTA file
  -s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
                        The reconstructed markers for each sample
  -r REFERENCES [REFERENCES ...], --references REFERENCES [REFERENCES ...]
                        The reference genomes
  -c CLADE, --clade CLADE
                        The clade to investigate
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        The output directory
  -n NPROCS, --nprocs NPROCS
                        The number of threads to use
  --secondary_samples SECONDARY_SAMPLES [SECONDARY_SAMPLES ...]
                        The reconstructed markers for each secondary sample
  --secondary_references SECONDARY_REFERENCES [SECONDARY_REFERENCES ...]
                        The secondary reference genomes
  --trim_sequences TRIM_SEQUENCES
                        The number of bases to remove from both ends when trimming 
                        markers. Default 50
  --marker_in_n_samples MARKER_IN_N_SAMPLES
                        Theshold defining the minimum percentage of samples to
                        keep a marker. Default 80 (%)
  --sample_with_n_markers SAMPLE_WITH_N_MARKERS
                        Threshold defining the minimun number of markers to
                        keep a sample. Default 20
  --secondary_sample_with_n_markers SECONDARY_SAMPLE_WITH_N_MARKERS
                        Threshold defining the minimun number of markers to
                        keep a secondary sample. Default 20
  --phylophlan_configuration PHYLOPHLAN_CONFIGURATION
                        The PhyloPhlAn configuration file
  --phylophlan_mode PHYLOPHLAN_MODE
                        The precision of the phylogenetic analysis {fast,
                        normal [default], accurate}
  --mutation_rates      If specified will produced a mutation rates table for
                        each of the aligned markers and a summary table for
                        the concatenated MSA. This operation can take long
                        time to finish  
  --print_clades_only   If specified only print the potential clades and stop
                        without building any tree

Some other useful output files

In the output folder, you can find the following files:

  1. clade_name.info: this file shows the general information like the total length of the concatenated markers (full sequence length), number of used markers, etc.
  2. clade_name.polymorphic: this file shows the statistics on the polymorphic site, where "sample" is the sample name, "percentage_of_polymorphic_sites" is the percentage of sites that are suspected to be polymorphic, "avg_freq" is the average frequency of the dominant alleles on all polymorphic sites, "avg_coverage" is the average coverage at all polymorphic sites.
  3. clade_name.StrainPhlAn3_concatenated.aln: the alignment file of all metagenomic strains.
  4. clade_name.mutation: this file contains a mutation rates summary table for the concatenated MSA. This file will be generated if --mutation_rates param is specified.
  5. clade_name_mutation_rates: this folder contains the mutation rates table for each of the aligned markers. This folder will be generated if --mutation_rates param is specified.