-
Notifications
You must be signed in to change notification settings - Fork 81
parathaa
PARATHAA - Preserving and Assimilating Region-specific Ambiguities in Taxonomic Hierarchical Assignments for Amplicons
PARATHAA is a tooled used for the taxonomic assignment of representative amplicon sequences that takes into account the uncertainty associated with using specific variable regions/primers. PARATHAA does this by 1) generating new primer-trimmed phylogenetic trees from reference marker genes such as the 16S rRNA gene and 2) determining the optimal phylogenetic distances within that tree for taxonomic labeling. PARATHAA then can use this tree to assign taxonomy to query sequences by aligning and placing those sequences into the new primer-trimmed reference database.
This tutorial will go over the basic usage of PARATHAA when applied to 16S rRBA genes and is split into two sections:
The first step in PARATHAA is to generate a phylogenetic tree of primer-trimmed sequences with taxonomically-labeled internal nodes. PARATHAA accomplishes this in the following way:
- Takes an input reference alignment set and trims those alignments to only the regions amplified by the given primer set.
- Uses the newly trimmed sequences to generate a phylogenetic tree using FastTree.
- Determines the optimal phylogenetic distance cutoffs for each taxonomic group based on the original full length reference taxonomy.
- This is a key step in the PARATHAA workflow and is accomplished by following:
Optimal cophenetic distance thresholds based on a GTR model of nucleotide evolution (as implemented in FastTree) are identified, which will define taxa at each level. For instance, any node whose underlying tips have pairwise distances less than the genus-defining threshold are considered to be of the same genus. To find the optimal thresholds at each taxonomic level, a threshold-finding step tries a range of distance cutoffs for each taxonomic level, and calculates an associated error based on misclassification of sequences (whose taxonomy is known). The chosen threshold is one that minimizes (1) grouping tips of multiple taxa under a taxon-defining node (“over-merging”), and (2) splitting of sequences from the same taxonomic group into multiple taxon-defining nodes (“over-splitting”).
- Uses these optimal phylogenetic distances to relabel the internal nodes of the trimmed phylogenetic tree.
All of these steps can be completed using the parathaa_run_tree_analysis command with the proper inputs.
To generate the phylogenetic tree of primer-trimmed sequences using parathaa_run_tree_analysis, three different inputs are required:
- Primers
- A reference multiple-sequence alignment
- A reference taxonomy file
For this tutorial, we have provided input files which are located in the input directory of this repository.
Let's first take a look at the primer files
ls input/primers
Results:
EMPV4.oligos V1V2.oligos V4V5.oligos
Here we see that there are three different .oligos files each containing a different primer set. We can take a look at the primer sequences using the less command.
less input/primers/V4V5.oligos
Results:
primer GTGYCAGCMGCCGCGGTAA CCGYCAATTYMTTTRAGTTT V4V5
Lets now take a look at the reference MSA
less input/testing/tree_construction/subset_silva_seed_v138_1.align
Results:
>CP011827.XtnCi232 100 Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Xanthomonas;
...........................................................................................................................................................>
>LT992463.F3C41489 100 Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;
...........................................................................................................................................................>
>AJ009499.UncB6820 100 Bacteria;Spirochaetota;Spirochaetia;Spirochaetales;Spirochaetaceae;uncultured;
...........................................................................................................................................................>
>ACOD01000398.CllGram3 100 Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Sordariomycetes;Glomerellales;Gl>
...........................................................................................................................................................>
>AY328677.Unc0fxxb 100 Bacteria;Proteobacteria;Alphaproteobacteria;Caulobacterales;Hyphomonadaceae;uncultured;
...........................................................................................................................................................>
>AJ408961.Unc57378 100 Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Blautia;
...........................................................................................................................................................>
In the MSA file we have the silva v138.1 accession numbers for each sequence along with their accompanying taxonomy.
Finally the last input file we need is a separate taxonomy file for the accession numbers found in the MSA file. First we will need to decompress the file.
gunzip input/silva_v138/taxmap_slv_ssu_ref_138.1.txt.gz
less -S input/silva_v138/taxmap_slv_ssu_ref_138.1.txt
Results:
primaryAccession start stop path organism_name taxid
AX664486 3 1987 Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Protostomia;Ec>
BD307583 1 1747 Eukaryota;SAR;Alveolata;Apicomplexa;Conoidasida;Coccidia;Eimeriorina;Toxoplasma; Neospora sp. 8805
BD359735 3 2145 Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium; Plasmodium malariae 8775
BD359736 3 2150 Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium; Plasmodium malariae 8775
CS214259 13 1872 Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Deuterostomia;>
In this file we can see that we have the accession number, the start and stop position of the sequence, its taxonomic assignment up to genus in the path column, its species in the organism_name column and then final its NCBI taxID.
Now that we took a look at all the input files lets run the command:
parathaa_run_tree_analysis --primers input/primers/V4V5.oligos \
--database input/testing/tree_construction/subset_silva_seed_v138_1.align \
--output output \
--taxonomy input/silva_v138/taxmap_slv_ssu_ref_138.1.txt
This will create a new primer-trimmed, taxonomically labelled, phylogenetic tree based on the database alignment and taxonomy files we inspected above. Note that for the purposes of this tutorial the database file has been significantly shortened.
Lets take a look at the output
ls output
Result:
anadama.log
optimal_scores.RData
optimal_scores.png
region_specific.tree
resultData.RData
resultTree_bestThresholds.RData
subset_silva_seed_v138_1.pcr.align
treelog.txt
-
anadama.logthis is the log file with information on all of the tasks run including standard out and standard error from every command line task. It also includes the values of all workflow arguments. -
optimal_scores.RDatathis is an RData file containing the optimal phylogenetic distances for taxonomic assignment at each taxonomic level. -
optimal_scores.pngthis plot shows the optimal distance threshold (on the x-axis) and the penalty score (on the y-axis). The penalty score is based on whether each distance resulted in over-splitting of taxonomic groups or over-merging of those groups. The weighting of these scores can be adjusted using the--sweightand--mweightoptions.
-
region_specific.treethis is the primer trimmed tree in newick format -
resultTree_bestThresholds.RDatathis is an RData file containing the primer-trimmed tree along with taxonomically-labelled internal nodes based on the optimal distance thresholds calculated in optimal_scores.RData. -
subset_silva_seed_v138_1.pcr.alignthis is the new primer trimmed MSA. -
treelog.txtthis is the tree log file output by FastTree indicating the optimal settings for tree construction and future sequence placement.
These files are used in the step 2 when we go to assign taxonomy to query sequences!
The second step of PARATHAA uses the phylogenetic tree, multiple sequence alignment (MSA) and tree data files we just generated to create taxonomic assignments to 16S rRNA gene sequences. This step is running using the following command:
parathaa_run_taxa_assignment
Briefly this step takes in the newly created primer trimmed tree, MSA, tree reference files, and thresholding information to assign taxonomy to query 16S sequences by the following steps.
- Aligns query sequences to primer trimmed MSA generated in step 1
- Places those sequences into the primer trimmed tree generated in step 1 using pplacer
- Assigns taxonomy to those query sequences based on their placement and their distance from taxonomically labelled interior nodes. Note the thresholding distance computed in step one determines the appropriate distance away a node can be to be assigned to a taxonomic level.
PARATHAA will come with a number of pre-computed trees (see user manual) for this step so that users do not need to generate their own primer trimmed trees for commonly used reference databases.
For this tutorial we will download and use the pre-computed V4V5 SILVA v138.1 tree.
We can download this tree using the following command:
wget http://huttenhower.sph.harvard.edu/parathaa_db/SILVA_V4V5.tar.gz
tar -xvf SILVA_V4V5.tar.gz
rm SILVA_V4V5.tar.gz
The input files for taxonomic assignment using PARATHAA are those generated during step 1. For most users the pre-computed-files will be sufficient, however, those that use uncommon primer choices or want to use a different database will need to run step 1 to create the required files for taxonomic assignment.
This command takes the following inputs:
-
trimmedDatabasethe primer trimmed MSA generated from step 1 -
trimmedTreethe trimmed phylogenetic tree generated from step 1 -
treeLogthe treelog.txt file generated from step 1 -
querythe 16S sequences that you want to assign taxonomy -
thresholdsan RData file containing the optimal phylogenetic distances for taxonomic assignment -
namedTreean RData file containing the annotated trimmed phylogenetic tree
For the demo run of PARATHAA we will use the pre-computed-files for silva v138 using V4V5 primers (found in input/primers/).
parathaa_run_taxa_assignment --trimmedDatabase SILVA_V4V5/silva.seed_v138_1.pcr.align \
--treeLog SILVA_V4V5/treelog.txt \
--query input/testing/taxa_assignment/SRR3225703_V4V5_subset.fasta \
--thresholds SILVA_V4V5/optimal_scores.RData \
--namedTree SILVA_V4V5/resultTree_bestThresholds.RData \
--output output_taxa_test \
--trimmedTree SILVA_V4V5/region_specific.tree
Let's take a look at the output folder and see what files were generated.
ls output_taxa_test
Result:
SRR3225703_V4V5_subset.align
SRR3225703_V4V5_subset.align_report
anadama.log
merged.fasta
merged_filt.fasta
merged_sub.fasta
merged_sub.jplace
poor_query_alignments.txt
ref.refpkg
taxonomic_assignments.tsv
-
SRR3225703_V4V5_subset.alignthe aligned query sequences to the trimmedDatabase input file -
SRR3225703_V4V5_subset.align_reportalignment report -
anadama.loganadama workflow log -
merged.fastaalignment file combined with the original database used for sequence placement by pplacer -
merged_filt.fastaintermediate file of the query and reference sequences needed to build the phylotree -
merged_sub.fastamerged alignment file with some formatting changes to meet pplacer requirements -
merged_sub.jplaceplacement file containing the location of the placed sequences -
poor_query_alignments.txtsequences removed due to poor alignment -
ref.refpkgthe reference settings for sequence placement by pplacer -
taxonomic_assignments.tsvthe taxonomic assignment of the query sequences.
Let's take a look at the taxonomic assignment profile.
column -t output_taxa_test/taxonomic_assignments.tsv| less -S
Result:
query.name Kingdom Phylum Class Order Family Genus Species maxDist
SRR3225703.1.1 Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA 0.00538859891159
SRR3225703.10.1 Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA 0.00355971899189465
SRR3225703.100.1 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus NA 0.00175279329997
SRR3225703.101.1 Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA 0.00355971899189465
SRR3225703.102.1 Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus NA 0.00176216098145
SRR3225703.103.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA 0.00178509943753875
SRR3225703.104.1 Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA 0.00533614143408
SRR3225703.105.1 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus Lactobacillus gasseri;Lactobacillus johnsonii 1.564476501465e-05
This is the main output of PARATHAA and contains the taxonomic assignment of the 16S query sequences. The headers of this file indicate the query sequence name under query.name.
Under each taxonomic level there is a taxonomic label. NA represents the inability to assign a sequence to that taxonomic level based on the optimal thresholds calculated in step 1. You will also notice that at some taxonomic levels there may be multiple assignments. For example in the above output we can see that the eighth query sequence was assigned at the species level to Lactobacillus gasseri;Lactobacillus johnsonii. This is because PARATHAA found that both of these species are within the optimal distance threshold for taxonomic classification and therefore cannot be certain of the specific species that this 16S rRNA gene sequence belongs to. This allows parathaa to make ambiguous classifications of multiple taxonomic labels.
Finally the maxDist column represents the maximum distance the query sequence is from any of its parents underlying tips.
The placement in the tree of any taxonomic assignment can be plotted using the following command:
parathaa_plot_assignment.R
This command takes the following inputs:
-
parathaa_db_treethe primer-trimmed tree generated by parathaa_run_tree_analysis -
assignmentsthe taxonomic assignments generated by parathaa_run_taxa_assignment -
jplacethe location of the placed sequences generated by parathaa_run_taxa_assignment -
levelthe name of the taxonomic level to plot -
steps_backthe number of steps back from the tip to show in the tree -
othe output directory -
idthe query sequence name
Let's try this for the query sequence assigned as Lactobacillus gasseri;Lactobacillus johnsonii in the previous section.
parathaa_plot_assignment.R --parathaa_db_tree SILVA_V4V5/resultTree_bestThresholds.RData \
--assignments output_taxa_test/taxonomic_assignments.tsv \
--jplace output_taxa_test/merged_sub.jplace \
--level "Species" \
--steps_back 3 \
-o output_visualization \
--id SRR3225703.105.1

- HUMAnN 2.0
- HUMAnN 3.0
- HUMAnN 4.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE v1.5
- WAAFLE v1.0
- MetaWIBELE
- MACARRoN
- FUGAsseM
- BAQLaVa
- Assembly workflow
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MaAsLin 3.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- MTX model 3
- PARATHAA