-
Notifications
You must be signed in to change notification settings - Fork 72
parathaa
PARATHAA - Preserving and Assimilating Region-specific Ambiguities in Taxonomic Hierarchical Assignments for Amplicons
PARATHAA is a tooled used for the taxonomic assignment of representative amplicon sequences that takes into account the uncertainty associated with using specific variable regions/primers. PARATHAA does this by 1) generating new primer-trimmed phylogenetic trees from reference marker genes such as the 16S rRNA gene and 2) determining the optimal phylogenetic distances within that tree for taxonomic labeling. PARATHAA then can use this tree to assign taxonomy to query sequences by aligning and placing those sequences into the new primer-trimmed reference database.
This tutorial will go over the basic usage of PARATHAA when applied to 16S rRBA genes and is split into two sections:
The first step in PARATHAA is to generate a phylogenetic tree of primer-trimmed sequences with taxonomically-labeled internal nodes. PARATHAA accomplishes this in the following way:
- Takes an input reference alignment set and trims those alignments to only the regions amplified by the given primer set.
- Uses the newly trimmed sequences to generate a phylogenetic tree using FastTree.
- Determines the optimal phylogenetic distance cutoffs for each taxonomic group based on the original full length reference taxonomy.
- This is a key step in the PARATHAA workflow and is accomplished by following:
Optimal cophenetic distance thresholds based on a GTR model of nucleotide evolution (as implemented in FastTree) are identified, which will define taxa at each level. For instance, any node whose underlying tips have pairwise distances less than the genus-defining threshold are considered to be of the same genus. To find the optimal thresholds at each taxonomic level, a threshold-finding step tries a range of distance cutoffs for each taxonomic level, and calculates an associated error based on misclassification of sequences (whose taxonomy is known). The chosen threshold is one that minimizes (1) grouping tips of multiple taxa under a taxon-defining node (“over-merging”), and (2) splitting of sequences from the same taxonomic group into multiple taxon-defining nodes (“over-splitting”).
- Uses these optimal phylogenetic distances to relabel the internal nodes of the trimmed phylogenetic tree.
All of these steps can be completed using the parathaa_run_tree_analysis
command with the proper inputs.
To generate the phylogenetic tree of primer-trimmed sequences using parathaa_run_tree_analysis
, three different inputs are required:
- Primers
- A reference multiple-sequence alignment
- A reference taxonomy file
For this tutorial, we have provided input files which are located in the input directory of this repository.
Let's first take a look at the primer files
ls input/primers
Results:
EMPV4.oligos V1V2.oligos V4V5.oligos
Here we see that there are three different .oligos
files each containing a different primer set. We can take a look at the primer sequences using the less
command.
less input/primers/V4V5.oligos
Results:
primer GTGYCAGCMGCCGCGGTAA CCGYCAATTYMTTTRAGTTT V4V5
Lets now take a look at the reference MSA
less input/testing/tree_construction/subset_silva_seed_v138_1.align
Results:
>CP011827.XtnCi232 100 Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Xanthomonas;
...........................................................................................................................................................>
>LT992463.F3C41489 100 Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;
...........................................................................................................................................................>
>AJ009499.UncB6820 100 Bacteria;Spirochaetota;Spirochaetia;Spirochaetales;Spirochaetaceae;uncultured;
...........................................................................................................................................................>
>ACOD01000398.CllGram3 100 Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Sordariomycetes;Glomerellales;Gl>
...........................................................................................................................................................>
>AY328677.Unc0fxxb 100 Bacteria;Proteobacteria;Alphaproteobacteria;Caulobacterales;Hyphomonadaceae;uncultured;
...........................................................................................................................................................>
>AJ408961.Unc57378 100 Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Blautia;
...........................................................................................................................................................>
In the MSA file we have the silva v138.1 accession numbers for each sequence along with their accompanying taxonomy.
Finally the last input file we need is a separate taxonomy file for the accession numbers found in the MSA file. First we will need to decompress the file.
gunzip input/silva_v138/taxmap_slv_ssu_ref_138.1.txt.gz
less -S input/silva_v138/taxmap_slv_ssu_ref_138.1.txt
Results:
primaryAccession start stop path organism_name taxid
AX664486 3 1987 Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Protostomia;Ec>
BD307583 1 1747 Eukaryota;SAR;Alveolata;Apicomplexa;Conoidasida;Coccidia;Eimeriorina;Toxoplasma; Neospora sp. 8805
BD359735 3 2145 Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium; Plasmodium malariae 8775
BD359736 3 2150 Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium; Plasmodium malariae 8775
CS214259 13 1872 Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Deuterostomia;>
In this file we can see that we have the accession number, the start and stop position of the sequence, its taxonomic assignment up to genus in the path
column, its species in the organism_name
column and then final its NCBI taxID.
Now that we took a look at all the input files lets run the command:
parthaa_run_tree_analysis --primers input/primers/V4V5.oligos \
--database input/testing/tree_construction/subset_silva_seed_v138_1.align \
--output output \
--taxonomy input/silva_v138/taxmap_slv_ssu_ref_138.1.txt
This will create a new primer-trimmed, taxonomically labelled, phylogenetic tree based on the database alignment and taxonomy files we inspected above. Note that for the purposes of this tutorial the database file has been significantly shortened.
Lets take a look at the output
ls output
Result:
optimal_scores.RData
optimal_scores.png
region_specific.tree
resultTree_bestThresholds.RData
subset_silva_seed_v138_1.bad.accnos
subset_silva_seed_v138_1.pcr.align
subset_silva_seed_v138_1.scrap.pcr.align
treelog.txt
-
optimal_scores.RData
this is an RData file containing the optimal phylogenetic distances for taxonomic assignment at each taxonomic level. -
optimal_scores.png
this plot shows the optimal distance threshold (on the x-axis) and the penalty score (on the y-axis). The penalty score is based on whether each distance resulted in over-splitting of taxonomic groups or over-merging of those groups. The weighting of these scores can be adjusted using the--sweight
and--mweight
options. -
region_specific.tree
this is the primer trimmed tree in newick format -
resultTree_bestThresholds.RData
this is an RData file containing the primer-trimmed tree along with taxonomically-labelled internal nodes based on the optimal distance thresholds calculated in optimal_scores.RData. -
subset_silva_seed_v138_1.pcr.align
this is the new primer trimmed MSA. -
treelog.txt
this is the tree log file output by FastTree indicating the optimal settings for tree construction and future sequence placement.
These files are used in the step 2 when we go to assign taxonomy to query sequences!
The second step of PARATHAA uses the phylogenetic tree, multiple sequence alignment (MSA) and tree data files we just generated to create taxonomic assignments to 16S rRNA gene sequences. This step is running using the following command:
parathaa_run_taxa_assignment
Briefly this step takes in the newly created primer trimmed tree, MSA, tree reference files, and thresholding information to assign taxonomy to query 16S sequences by the following steps.
- Aligns query sequences to primer trimmed MSA generated in step 1
- Places those sequences into the primer trimmed tree generated in step 1 using pplacer
- Assigns taxonomy to those query sequences based on their placement and their distance from taxonomically labelled interior nodes. Note the thresholding distance computed in step one determines the appropriate distance away a node can be to be assigned to a taxonomic level.
PARATHAA will come with a number of pre-computed trees (see user manual) for this step so that users do not need to generate their own primer trimmed trees for commonly used reference databases.
For this tutorial we will download and use the pre-computed V4V5 SILVA v138.1 tree.
We can download this tree using the following command:
wget http://huttenhower.sph.harvard.edu/parathaa_db/SILVA_V4V5.tar.gz
tar -xvf SILVA_V4V5.tar.gz
rm SILVA_V4V5.tar.gz
The input files for taxonomic assignment using PARATHAA are those generated during step 1. For most users the pre-computed-files will be sufficient, however, those that use uncommon primer choices or want to use a different database will need to run step 1 to create the required files for taxonomic assignment.
This command takes the following inputs:
-
trimmedDatabase
the primer trimmed MSA generated from step 1 -
trimmedTree
the trimmed phylogenetic tree generated from step 1 -
treeLog
the treelog.txt file generated from step 1 -
query
the 16S sequences that you want to assign taxonomy -
thresholds
an RData file containing the optimal phylogenetic distances for taxonomic assignment -
namedTree
an RData file containing the annotated trimmed phylogenetic tree
For the demo run of PARATHAA we will use the pre-computed-files for silva v138 using V4V5 primers (found in input/primers/
).
parathaa_run_taxa_assignment --trimmedDatabase SILVA_V4V5/silva.seed_v138_1.pcr.align \
--treeLog SILVA_V4V5/treelog.txt \
--query input/testing/taxa_assignment/SRR3225703_V4V5_subset.fasta \
--thresholds SILVA_V4V5/optimal_scores.RData \
--namedTree SILVA_V4V5/resultTree_bestThresholds.RData \
--output output_taxa_test \
--trimmedTree SILVA_V4V5/region_specific.tree
Let's take a look at the output folder and see what files were generated.
ls output_taxa_test
Result:
SRR3225703_V4V5_subset.align
SRR3225703_V4V5_subset.flip.accnos
SRR3225703_V4V5_subset.align_report
merged.fasta
merged_sub.fasta
anadama.log
ref.refpkg
merged_sub.jplace
taxonomic_assignments.tsv
-
SRR3225703_V4V5_subset.align
the aligned query sequences to the trimmedDatabase input file -
SRR3225703_V4V5_subset.flip.accnos
sequences removed due to poor alignment -
SRR3225703_V4V5_subset.align_report
alignment report -
anadama.log
anadama workflow log -
merged.fasta
andmerged_sub.fasta
are alignment files combined with the original database that are used for sequence placement by pplacer. The former of the two is the same file with some formatting changes to meet pplacer requirements. -
ref.refpkg
the reference settings for sequence placement by pplacer -
merged_sub.jplace
placement file containing the location of the placed sequences -
taxonomic_assignments.tsv
the taxonomic assignment of the query sequences.
Lets take a look at the taxonomic assignment profile
column -t output_taxa_test/taxonomic_assignments.tsv| less -S
Result
query.name Kingdom Phylum Class Order Family Genus Species maxDist
SRR3225703.1.1 Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA 0.00536901463742
SRR3225703.10.1 Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA 0.00366493185698465
SRR3225703.100.1 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus Lactobacillus gasseri;Lactobacillus johnsonii 0.0016927384
This is the main output of PARATHAA and contains the taxonomic assignment of the 16S query sequences. The headers of this file indicate the query sequence name under query.name
.
Under each taxonomic level there is a taxonomic label. NA
represents the inability to assign a sequence to that taxonomic level based on the optimal thresholds calculated in step 1. You will also notice that at some taxonomic levels there may be multiple assignments. For example in the above output we can see that the third query sequence was assigned at the species level to Lactobacillus gasseri;Lactobacillus johnsonii
. This is because PARATHAA found that both of these species are within the optimal distance threshold for taxonomic classification and therefore cannot be certain of the specific species that this 16S rRNA gene sequence belongs to. This allows parathaa to make ambiguous classifications of multiple taxonomic labels.
Finally the maxDist column represents the maximum distance the query sequence is from any of its parents underlying tips.
- HUMAnN 2.0
- HUMAnN 3.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- PARATHAA