Skip to content

parathaa

Thomas Kuntz edited this page Apr 30, 2025 · 14 revisions

PARATHAA 0.1.1 tutorial

PARATHAA - Preserving and Assimilating Region-specific Ambiguities in Taxonomic Hierarchical Assignments for Amplicons

PARATHAA is a tooled used for the taxonomic assignment of representative amplicon sequences that takes into account the uncertainty associated with using specific variable regions/primers. PARATHAA does this by 1) generating new primer-trimmed phylogenetic trees from reference marker genes such as the 16S rRNA gene and 2) determining the optimal phylogenetic distances within that tree for taxonomic labeling. PARATHAA then can use this tree to assign taxonomy to query sequences by aligning and placing those sequences into the new primer-trimmed reference database.

image

This tutorial will go over the basic usage of PARATHAA when applied to 16S rRBA genes and is split into two sections:


Generation of primer-trimmed phylogenetic trees

The first step in PARATHAA is to generate a phylogenetic tree of primer-trimmed sequences with taxonomically-labeled internal nodes. PARATHAA accomplishes this in the following way:

  1. Takes an input reference alignment set and trims those alignments to only the regions amplified by the given primer set.
  2. Uses the newly trimmed sequences to generate a phylogenetic tree using FastTree.
  3. Determines the optimal phylogenetic distance cutoffs for each taxonomic group based on the original full length reference taxonomy.
  • This is a key step in the PARATHAA workflow and is accomplished by following:

Optimal cophenetic distance thresholds based on a GTR model of nucleotide evolution (as implemented in FastTree) are identified, which will define taxa at each level. For instance, any node whose underlying tips have pairwise distances less than the genus-defining threshold are considered to be of the same genus. To find the optimal thresholds at each taxonomic level, a threshold-finding step tries a range of distance cutoffs for each taxonomic level, and calculates an associated error based on misclassification of sequences (whose taxonomy is known). The chosen threshold is one that minimizes (1) grouping tips of multiple taxa under a taxon-defining node (“over-merging”), and (2) splitting of sequences from the same taxonomic group into multiple taxon-defining nodes (“over-splitting”).

  1. Uses these optimal phylogenetic distances to relabel the internal nodes of the trimmed phylogenetic tree.

All of these steps can be completed using the parathaa_run_tree_analysis command with the proper inputs.

Input files

To generate the phylogenetic tree of primer-trimmed sequences using parathaa_run_tree_analysis, three different inputs are required:

  1. Primers
  2. A reference multiple-sequence alignment
  3. A reference taxonomy file

For this tutorial, we have provided input files which are located in the input directory of this repository.

Primers

Let's first take a look at the primer files

ls input/primers

Results:

EMPV4.oligos V1V2.oligos  V4V5.oligos

Here we see that there are three different .oligos files each containing a different primer set. We can take a look at the primer sequences using the less command.

less input/primers/V4V5.oligos

Results:

primer GTGYCAGCMGCCGCGGTAA CCGYCAATTYMTTTRAGTTT V4V5

Reference MSA

Lets now take a look at the reference MSA

less input/testing/tree_construction/subset_silva_seed_v138_1.align

Results:

>CP011827.XtnCi232      100     Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Xanthomonas;    
...........................................................................................................................................................>
>LT992463.F3C41489      100     Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;    
...........................................................................................................................................................>
>AJ009499.UncB6820      100     Bacteria;Spirochaetota;Spirochaetia;Spirochaetales;Spirochaetaceae;uncultured;    
...........................................................................................................................................................>
>ACOD01000398.CllGram3  100     Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Sordariomycetes;Glomerellales;Gl>
...........................................................................................................................................................>
>AY328677.Unc0fxxb      100     Bacteria;Proteobacteria;Alphaproteobacteria;Caulobacterales;Hyphomonadaceae;uncultured;    
...........................................................................................................................................................>
>AJ408961.Unc57378      100     Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Blautia;    
...........................................................................................................................................................>

In the MSA file we have the silva v138.1 accession numbers for each sequence along with their accompanying taxonomy.

Reference Taxonomy

Finally the last input file we need is a separate taxonomy file for the accession numbers found in the MSA file. First we will need to decompress the file.

gunzip input/silva_v138/taxmap_slv_ssu_ref_138.1.txt.gz
less -S input/silva_v138/taxmap_slv_ssu_ref_138.1.txt

Results:

primaryAccession        start   stop    path    organism_name   taxid
AX664486        3       1987    Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Protostomia;Ec>
BD307583        1       1747    Eukaryota;SAR;Alveolata;Apicomplexa;Conoidasida;Coccidia;Eimeriorina;Toxoplasma;        Neospora sp.    8805
BD359735        3       2145    Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium;     Plasmodium malariae     8775
BD359736        3       2150    Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium;     Plasmodium malariae     8775
CS214259        13      1872    Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Deuterostomia;>

In this file we can see that we have the accession number, the start and stop position of the sequence, its taxonomic assignment up to genus in the path column, its species in the organism_name column and then final its NCBI taxID.

Demo Run

Now that we took a look at all the input files lets run the command:

parathaa_run_tree_analysis --primers input/primers/V4V5.oligos \
 --database input/testing/tree_construction/subset_silva_seed_v138_1.align \
  --output output \
 --taxonomy input/silva_v138/taxmap_slv_ssu_ref_138.1.txt

This will create a new primer-trimmed, taxonomically labelled, phylogenetic tree based on the database alignment and taxonomy files we inspected above. Note that for the purposes of this tutorial the database file has been significantly shortened.

Examining Output

Lets take a look at the output

ls output

Result:

anadama.log
optimal_scores.RData
optimal_scores.png
region_specific.tree
resultData.RData
resultTree_bestThresholds.RData
subset_silva_seed_v138_1.pcr.align
treelog.txt
  • anadama.log this is the log file with information on all of the tasks run including standard out and standard error from every command line task. It also includes the values of all workflow arguments.

  • optimal_scores.RData this is an RData file containing the optimal phylogenetic distances for taxonomic assignment at each taxonomic level.

  • optimal_scores.png this plot shows the optimal distance threshold (on the x-axis) and the penalty score (on the y-axis). The penalty score is based on whether each distance resulted in over-splitting of taxonomic groups or over-merging of those groups. The weighting of these scores can be adjusted using the --sweight and --mweight options. optimal_scores

  • region_specific.tree this is the primer trimmed tree in newick format

  • resultTree_bestThresholds.RData this is an RData file containing the primer-trimmed tree along with taxonomically-labelled internal nodes based on the optimal distance thresholds calculated in optimal_scores.RData.

  • subset_silva_seed_v138_1.pcr.align this is the new primer trimmed MSA.

  • treelog.txt this is the tree log file output by FastTree indicating the optimal settings for tree construction and future sequence placement.

These files are used in the step 2 when we go to assign taxonomy to query sequences!


Taxonomic assignment of query sequences

The second step of PARATHAA uses the phylogenetic tree, multiple sequence alignment (MSA) and tree data files we just generated to create taxonomic assignments to 16S rRNA gene sequences. This step is running using the following command:

parathaa_run_taxa_assignment

Briefly this step takes in the newly created primer trimmed tree, MSA, tree reference files, and thresholding information to assign taxonomy to query 16S sequences by the following steps.

  1. Aligns query sequences to primer trimmed MSA generated in step 1
  2. Places those sequences into the primer trimmed tree generated in step 1 using pplacer
  3. Assigns taxonomy to those query sequences based on their placement and their distance from taxonomically labelled interior nodes. Note the thresholding distance computed in step one determines the appropriate distance away a node can be to be assigned to a taxonomic level.

PARATHAA will come with a number of pre-computed trees (see user manual) for this step so that users do not need to generate their own primer trimmed trees for commonly used reference databases.

For this tutorial we will download and use the pre-computed V4V5 SILVA v138.1 tree.

We can download this tree using the following command:

wget  http://huttenhower.sph.harvard.edu/parathaa_db/SILVA_V4V5.tar.gz
tar -xvf SILVA_V4V5.tar.gz
rm SILVA_V4V5.tar.gz

Input Files

The input files for taxonomic assignment using PARATHAA are those generated during step 1. For most users the pre-computed-files will be sufficient, however, those that use uncommon primer choices or want to use a different database will need to run step 1 to create the required files for taxonomic assignment.

This command takes the following inputs:

  • trimmedDatabase the primer trimmed MSA generated from step 1
  • trimmedTree the trimmed phylogenetic tree generated from step 1
  • treeLog the treelog.txt file generated from step 1
  • query the 16S sequences that you want to assign taxonomy
  • thresholds an RData file containing the optimal phylogenetic distances for taxonomic assignment
  • namedTree an RData file containing the annotated trimmed phylogenetic tree

Demo run:

For the demo run of PARATHAA we will use the pre-computed-files for silva v138 using V4V5 primers (found in input/primers/).

parathaa_run_taxa_assignment --trimmedDatabase SILVA_V4V5/silva.seed_v138_1.pcr.align \
 --treeLog SILVA_V4V5/treelog.txt \
 --query input/testing/taxa_assignment/SRR3225703_V4V5_subset.fasta \
 --thresholds SILVA_V4V5/optimal_scores.RData \
 --namedTree SILVA_V4V5/resultTree_bestThresholds.RData \
 --output output_taxa_test \
 --trimmedTree SILVA_V4V5/region_specific.tree

Examination of output files

Let's take a look at the output folder and see what files were generated.

ls output_taxa_test

Result:

SRR3225703_V4V5_subset.align
SRR3225703_V4V5_subset.align_report
anadama.log
merged.fasta
merged_filt.fasta
merged_sub.fasta
merged_sub.jplace
poor_query_alignments.txt
ref.refpkg
taxonomic_assignments.tsv
  • SRR3225703_V4V5_subset.align the aligned query sequences to the trimmedDatabase input file
  • SRR3225703_V4V5_subset.align_report alignment report
  • anadama.log anadama workflow log
  • merged.fasta alignment file combined with the original database used for sequence placement by pplacer
  • merged_filt.fasta intermediate file of the query and reference sequences needed to build the phylotree
  • merged_sub.fasta merged alignment file with some formatting changes to meet pplacer requirements
  • merged_sub.jplace placement file containing the location of the placed sequences
  • poor_query_alignments.txt sequences removed due to poor alignment
  • ref.refpkg the reference settings for sequence placement by pplacer
  • taxonomic_assignments.tsv the taxonomic assignment of the query sequences.

Let's take a look at the taxonomic assignment profile.

column -t output_taxa_test/taxonomic_assignments.tsv| less -S

Result:

query.name        Kingdom   Phylum            Class                Order              Family             Genus                       Species         maxDist
SRR3225703.1.1    Bacteria  Bacteroidota      Bacteroidia          Bacteroidales      Bacteroidaceae     Bacteroides                 NA              0.00538859891159
SRR3225703.10.1   Bacteria  Bacteroidota      Bacteroidia          Bacteroidales      Bacteroidaceae     Bacteroides                 NA              0.00355971899189465
SRR3225703.100.1  Bacteria  Firmicutes        Bacilli              Lactobacillales    Lactobacillaceae   Lactobacillus               NA              0.00175279329997
SRR3225703.101.1  Bacteria  Bacteroidota      Bacteroidia          Bacteroidales      Bacteroidaceae     Bacteroides                 NA              0.00355971899189465
SRR3225703.102.1  Bacteria  Firmicutes        Bacilli              Staphylococcales   Staphylococcaceae  Staphylococcus              NA              0.00176216098145
SRR3225703.103.1  Bacteria  Firmicutes        Bacilli              Lactobacillales    Streptococcaceae   Streptococcus               NA              0.00178509943753875
SRR3225703.104.1  Bacteria  Bacteroidota      Bacteroidia          Bacteroidales      Bacteroidaceae     Bacteroides                 NA              0.00533614143408
SRR3225703.105.1  Bacteria  Firmicutes        Bacilli              Lactobacillales    Lactobacillaceae   Lactobacillus               Lactobacillus   gasseri;Lactobacillus     johnsonii                   1.564476501465e-05

This is the main output of PARATHAA and contains the taxonomic assignment of the 16S query sequences. The headers of this file indicate the query sequence name under query.name.

Under each taxonomic level there is a taxonomic label. NA represents the inability to assign a sequence to that taxonomic level based on the optimal thresholds calculated in step 1. You will also notice that at some taxonomic levels there may be multiple assignments. For example in the above output we can see that the eighth query sequence was assigned at the species level to Lactobacillus gasseri;Lactobacillus johnsonii. This is because PARATHAA found that both of these species are within the optimal distance threshold for taxonomic classification and therefore cannot be certain of the specific species that this 16S rRNA gene sequence belongs to. This allows parathaa to make ambiguous classifications of multiple taxonomic labels.

Finally the maxDist column represents the maximum distance the query sequence is from any of its parents underlying tips.

Visualization of taxonomic assignment

The placement in the tree of any taxonomic assignment can be plotted using the following command:

parathaa_plot_assignment.R

This command takes the following inputs:

  • parathaa_db_tree the primer-trimmed tree generated by parathaa_run_tree_analysis
  • assignments the taxonomic assignments generated by parathaa_run_taxa_assignment
  • jplace the location of the placed sequences generated by parathaa_run_taxa_assignment
  • level the name of the taxonomic level to plot
  • steps_back the number of steps back from the tip to show in the tree
  • o the output directory
  • id the query sequence name

Let's try this for the query sequence assigned as Lactobacillus gasseri;Lactobacillus johnsonii in the previous section.

parathaa_plot_assignment.R --parathaa_db_tree SILVA_V4V5/resultTree_bestThresholds.RData \ 
 --assignments output_taxa_test/taxonomic_assignments.tsv \
 --jplace output_taxa_test/merged_sub.jplace \
 --level "Species" \
 --steps_back 3 \
 -o output_visualization \
 --id SRR3225703.105.1

SRR3225703 105 1_placement1

Clone this wiki locally