Skip to content

parathaa

Jacob T. Nearing edited this page Dec 14, 2023 · 8 revisions

PARATHAA 0.1.1 tutorial

PARATHAA - Preserving and Assimilating Region-specific Ambiguities in Taxonomic Hierarchical Assignments for Amplicons

PARATHAA is a tooled used for the taxonomic assignment of representative amplicon sequences that takes into account the uncertainty associated with using specific variable regions/primers. PARATHAA does this by 1) generating new primer-trimmed phylogenetic trees from reference marker genes such as the 16S rRNA gene and 2) determining the optimal phylogenetic distances within that tree for taxonomic labeling. PARATHAA then can use this tree to assign taxonomy to query sequences by aligning and placing those sequences into the new primer-trimmed reference database.

image

This tutorial will go over the basic usage of PARATHAA when applied to 16S rRBA genes and is split into two sections:


Generation of primer-trimmed phylogenetic trees

The first step in PARATHAA is to generate a phylogenetic tree of primer-trimmed sequences with taxonomically-labeled internal nodes. PARATHAA accomplishes this in the following way:

  1. Takes an input reference alignment set and trims those alignments to only the regions amplified by the given primer set.
  2. Uses the newly trimmed sequences to generate a phylogenetic tree using FastTree.
  3. Determines the optimal phylogenetic distance cutoffs for each taxonomic group based on the original full length reference taxonomy.
  • This is a key step in the PARATHAA workflow and is accomplished by following:

Optimal cophenetic distance thresholds based on a GTR model of nucleotide evolution (as implemented in FastTree) are identified, which will define taxa at each level. For instance, any node whose underlying tips have pairwise distances less than the genus-defining threshold are considered to be of the same genus. To find the optimal thresholds at each taxonomic level, a threshold-finding step tries a range of distance cutoffs for each taxonomic level, and calculates an associated error based on misclassification of sequences (whose taxonomy is known). The chosen threshold is one that minimizes (1) grouping tips of multiple taxa under a taxon-defining node (“over-merging”), and (2) splitting of sequences from the same taxonomic group into multiple taxon-defining nodes (“over-splitting”).

  1. Uses these optimal phylogenetic distances to relabel the internal nodes of the trimmed phylogenetic tree.

All of these steps can be completed using the parathaa_run_tree_analysis command with the proper inputs.

Input files

To generate the phylogenetic tree of primer-trimmed sequences using parathaa_run_tree_analysis, three different inputs are required:

  1. Primers
  2. A reference multiple-sequence alignment
  3. A reference taxonomy file

For this tutorial, we have provided input files which are located in the input directory of this repository.

Primers

Let's first take a look at the primer files

ls input/primers

Results:

EMPV4.oligos V1V2.oligos  V4V5.oligos

Here we see that there are three different .oligos files each containing a different primer set. We can take a look at the primer sequences using the less command.

less input/primers/V4V5.oligos

Results:

primer GTGYCAGCMGCCGCGGTAA CCGYCAATTYMTTTRAGTTT V4V5

Reference MSA

Lets now take a look at the reference MSA

less input/testing/tree_construction/subset_silva_seed_v138_1.align

Results:

>CP011827.XtnCi232      100     Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Xanthomonas;    
...........................................................................................................................................................>
>LT992463.F3C41489      100     Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;    
...........................................................................................................................................................>
>AJ009499.UncB6820      100     Bacteria;Spirochaetota;Spirochaetia;Spirochaetales;Spirochaetaceae;uncultured;    
...........................................................................................................................................................>
>ACOD01000398.CllGram3  100     Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Sordariomycetes;Glomerellales;Gl>
...........................................................................................................................................................>
>AY328677.Unc0fxxb      100     Bacteria;Proteobacteria;Alphaproteobacteria;Caulobacterales;Hyphomonadaceae;uncultured;    
...........................................................................................................................................................>
>AJ408961.Unc57378      100     Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Blautia;    
...........................................................................................................................................................>

In the MSA file we have the silva v138.1 accession numbers for each sequence along with their accompanying taxonomy.

Reference Taxonomy

Finally the last input file we need is a separate taxonomy file for the accession numbers found in the MSA file. First we will need to decompress the file.

gunzip input/silva_v138/taxmap_slv_ssu_ref_138.1.txt.gz
less -S input/silva_v138/taxmap_slv_ssu_ref_138.1.txt

Results:

primaryAccession        start   stop    path    organism_name   taxid
AX664486        3       1987    Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Protostomia;Ec>
BD307583        1       1747    Eukaryota;SAR;Alveolata;Apicomplexa;Conoidasida;Coccidia;Eimeriorina;Toxoplasma;        Neospora sp.    8805
BD359735        3       2145    Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium;     Plasmodium malariae     8775
BD359736        3       2150    Eukaryota;SAR;Alveolata;Apicomplexa;Aconoidasida;Haemosporoidia;Plasmodium;     Plasmodium malariae     8775
CS214259        13      1872    Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;Metazoa;Animalia;BCP clade;Bilateria;Deuterostomia;>

In this file we can see that we have the accession number, the start and stop position of the sequence, its taxonomic assignment up to genus in the path column, its species in the organism_name column and then final its NCBI taxID.

Demo Run

Now that we took a look at all the input files lets run the command:

parthaa_run_tree_analysis --primers input/primers/V4V5.oligos \
 --database input/testing/tree_construction/subset_silva_seed_v138_1.align \
  --output output \
 --taxonomy input/silva_v138/taxmap_slv_ssu_ref_138.1.txt

This will create a new primer-trimmed, taxonomically labelled, phylogenetic tree based on the database alignment and taxonomy files we inspected above. Note that for the purposes of this tutorial the database file has been significantly shortened.

Examining Output

Lets take a look at the output

ls output

Result:

optimal_scores.RData
optimal_scores.png
region_specific.tree
resultTree_bestThresholds.RData
subset_silva_seed_v138_1.bad.accnos
subset_silva_seed_v138_1.pcr.align
subset_silva_seed_v138_1.scrap.pcr.align
treelog.txt
  • optimal_scores.RData this is an RData file containing the optimal phylogenetic distances for taxonomic assignment at each taxonomic level.

  • optimal_scores.png this plot shows the optimal distance threshold (on the x-axis) and the penalty score (on the y-axis). The penalty score is based on whether each distance resulted in over-splitting of taxonomic groups or over-merging of those groups. The weighting of these scores can be adjusted using the --sweight and --mweight options. optimal_scores

  • region_specific.tree this is the primer trimmed tree in newick format

  • resultTree_bestThresholds.RData this is an RData file containing the primer-trimmed tree along with taxonomically-labelled internal nodes based on the optimal distance thresholds calculated in optimal_scores.RData.

  • subset_silva_seed_v138_1.pcr.align this is the new primer trimmed MSA.

  • treelog.txt this is the tree log file output by FastTree indicating the optimal settings for tree construction and future sequence placement.

These files are used in the step 2 when we go to assign taxonomy to query sequences!


Taxonomic assignment of query sequences

The second step of PARATHAA uses the phylogenetic tree, multiple sequence alignment (MSA) and tree data files we just generated to create taxonomic assignments to 16S rRNA gene sequences. This step is running using the following command:

parathaa_run_taxa_assignment

Briefly this step takes in the newly created primer trimmed tree, MSA, tree reference files, and thresholding information to assign taxonomy to query 16S sequences by the following steps.

  1. Aligns query sequences to primer trimmed MSA generated in step 1
  2. Places those sequences into the primer trimmed tree generated in step 1 using pplacer
  3. Assigns taxonomy to those query sequences based on their placement and their distance from taxonomically labelled interior nodes. Note the thresholding distance computed in step one determines the appropriate distance away a node can be to be assigned to a taxonomic level.

PARATHAA will come with a number of pre-computed trees (see user manual) for this step so that users do not need to generate their own primer trimmed trees for commonly used reference databases.

For this tutorial we will download and use the pre-computed V4V5 SILVA v138.1 tree.

We can download this tree using the following command:

wget  http://huttenhower.sph.harvard.edu/parathaa_db/SILVA_V4V5.tar.gz
tar -xvf SILVA_V4V5.tar.gz
rm SILVA_V4V5.tar.gz

Input Files

The input files for taxonomic assignment using PARATHAA are those generated during step 1. For most users the pre-computed-files will be sufficient, however, those that use uncommon primer choices or want to use a different database will need to run step 1 to create the required files for taxonomic assignment.

This command takes the following inputs:

  • trimmedDatabase the primer trimmed MSA generated from step 1
  • trimmedTree the trimmed phylogenetic tree generated from step 1
  • treeLog the treelog.txt file generated from step 1
  • query the 16S sequences that you want to assign taxonomy
  • thresholds an RData file containing the optimal phylogenetic distances for taxonomic assignment
  • namedTree an RData file containing the annotated trimmed phylogenetic tree

Demo run:

For the demo run of PARATHAA we will use the pre-computed-files for silva v138 using V4V5 primers (found in input/primers/).

parathaa_run_taxa_assignment --trimmedDatabase SILVA_V4V5/silva.seed_v138_1.pcr.align \
 --treeLog SILVA_V4V5/treelog.txt \
 --query input/testing/taxa_assignment/SRR3225703_V4V5_subset.fasta \
 --thresholds SILVA_V4V5/optimal_scores.RData \
 --namedTree SILVA_V4V5/resultTree_bestThresholds.RData \
 --output output_taxa_test \
 --trimmedTree SILVA_V4V5/region_specific.tree

Examination of output files

Let's take a look at the output folder and see what files were generated.

ls output_taxa_test

Result:

SRR3225703_V4V5_subset.align
SRR3225703_V4V5_subset.flip.accnos
SRR3225703_V4V5_subset.align_report
merged.fasta
merged_sub.fasta           
anadama.log                          
ref.refpkg
merged_sub.jplace   
taxonomic_assignments.tsv
  • SRR3225703_V4V5_subset.align the aligned query sequences to the trimmedDatabase input file
  • SRR3225703_V4V5_subset.flip.accnos sequences removed due to poor alignment
  • SRR3225703_V4V5_subset.align_report alignment report
  • anadama.log anadama workflow log
  • merged.fasta and merged_sub.fasta are alignment files combined with the original database that are used for sequence placement by pplacer. The former of the two is the same file with some formatting changes to meet pplacer requirements.
  • ref.refpkg the reference settings for sequence placement by pplacer
  • merged_sub.jplace placement file containing the location of the placed sequences
  • taxonomic_assignments.tsv the taxonomic assignment of the query sequences.

Lets take a look at the taxonomic assignment profile

column -t output_taxa_test/taxonomic_assignments.tsv| less -S

Result

query.name      Kingdom Phylum  Class   Order   Family  Genus   Species maxDist
SRR3225703.1.1  Bacteria        Bacteroidota    Bacteroidia     Bacteroidales   Bacteroidaceae  Bacteroides     NA      0.00536901463742
SRR3225703.10.1 Bacteria        Bacteroidota    Bacteroidia     Bacteroidales   Bacteroidaceae  Bacteroides     NA      0.00366493185698465
SRR3225703.100.1        Bacteria        Firmicutes      Bacilli Lactobacillales Lactobacillaceae        Lactobacillus   Lactobacillus gasseri;Lactobacillus johnsonii   0.0016927384

This is the main output of PARATHAA and contains the taxonomic assignment of the 16S query sequences. The headers of this file indicate the query sequence name under query.name.

Under each taxonomic level there is a taxonomic label. NA represents the inability to assign a sequence to that taxonomic level based on the optimal thresholds calculated in step 1. You will also notice that at some taxonomic levels there may be multiple assignments. For example in the above output we can see that the third query sequence was assigned at the species level to Lactobacillus gasseri;Lactobacillus johnsonii. This is because PARATHAA found that both of these species are within the optimal distance threshold for taxonomic classification and therefore cannot be certain of the specific species that this 16S rRNA gene sequence belongs to. This allows parathaa to make ambiguous classifications of multiple taxonomic labels.

Finally the maxDist column represents the maximum distance the query sequence is from any of its parents underlying tips.

Clone this wiki locally