Tutorial

Chris Jackson edited this page Jul 25, 2023 · 78 revisions

Documentation current for ParaGone version 0.0.11rc


Welcome to the ParaGone tutorial!

The purpose of this tutorial is to familiarize you with the input format ParaGone requires and the output you should expect from a run. The tutorial uses a test dataset of paralog sequences for eight genes from eleven taxa, as output by the command hybpiper paralog_retriever from the program HybPiper, together with a fasta file of outgroup sequences.

This tutorial assumes that you have some experience executing programs from the command line on a UNIX-like system.

Test dataset

Click this link to download the files for the test dataset. If you installed ParaGone by cloning the repository, the file test_dataset.tar.gz will already be in the repository directory. Extract the file by either double-clicking on it, or via a terminal using the command:

tar -zxvf test_dataset.tar.gz

The extracted directory test_dataset contains the following folder and files:

  • paralog_input is a folder containing *.fasta files for eight genes from the Angiosperms353 data set, named 4527, 4932, 4992, 5620, 6139, 6462, 6717 and 7128.

    Each file contains one or more paralog sequences for up to eleven taxa, as output by the HybPiper command hybpiper paralog_retriever. These taxa include ten ingroup samples, as well as one outgroup sample (see below for more details on the outgroup sample). The taxon names are: 79678, 79679, 79682, 79683, 79684, 79685, 79686, 79687, 79688, 79689 and 80974.

    For example, the file 4527_paralogs_all.fasta contains the sequences:

    >79688
    GAGAGAGTGGCTGTGTTGGTTATTGGAGGAGGGGGAAGGGAGCACGCGCTTTGTTACGCTTTGAAACGGTCTCCTTCCTG
    TGACGCTGTGTTCTGCGCACCTGGAAACGCTGGAATTTCAAACTCTGGGGATGCCACGTGTATTGAGGACCTCGACATCT
    ...
    >79687.1
    GAGAGAGTGGCTGTGTTGGTTATTGGAGGAGGGGGAAGGGAGCACGCCCTTTGTTACGCTTTGAAACGGTCTCCTTCCTG
    TGACGCTGTGTTTTGCGCACCTGGAAATGCTGGAATTTCAAACTCTGGGGATGCGACGTGTATTGAGGACCTCAACATAT
    ...
    >79687.main
    GAGAGTGTGTCTGTGTTGGTTATTGGAGGAGGGGGAAGGGAGCACGCGCTTTGTTACGCTTTGAAACGGTCTCCTTCCTG
    TGACGCTGTGTTCTGCGCACCTGGAAACGCTGGAATTTCAAACTCTGGGGACGCGACGTGTATTGAGGACCTCGACATCT
    ...
    

    As can be seen, only a single sequence is present for taxon 79688, whereas there are two sequences (putative paralogs) for taxon 79687. The *.fasta headers of these two sequences have the suffixes .main and .1; see the HybPiper wiki for more details.

  • external_outgroups.fasta is a *.fasta file containing 'outgroup' sequences for the eight genes in this test dataset. There is a single outgroup sequence for each gene, originating from taxon 80973. For example:

    >80973-4527
    GAAAGAGTCAACGTGTTGGTTATTGGAGGTGGTGGAAGAGAACACGCCCTTTGTTACGCTCTAAAGCGATCTCCTTCGTG
    TGATGCTGTGTTTTGTGCCCCGGGAAATGCCGGAATTTCAAGTTCAGGTGATGCGACTTGCATCGAGAATGTAAACATCT
    ...
    >80973-4932
    GACAACTCTGTTTCTGACATGCTAATGGATTCATTCGGGAGATTACACACATATTTAAGAATTTCATTGACAGAGCGATG
    CAATTTGAGATGCAAGTATTGTATGCCCGATGAAGGCGTACAACTCACTCCAAAGCCCGAGCTTCTCTCCACTAATGAGA
    ...
    

    Note that when running ParaGone on your own data, it's fine if the external outgroups file contains multiple outgroup sequences for each gene, originating from different outgroup taxa (e.g. sequences with names taxon1-gene01, taxon2-gene01 etc.).

  • run_paragone_test_dataset.sh is a shell script. It contains the example commands we will cover in this tutorial. The script can be run from within the test_dataset directory simply by typing:

    ./run_paragone_test_dataset.sh
    

    ...followed by enter/return.

    All of the commands in the bash script will be covered in more detail in the following sections.
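To see at a glance which taxa have more than one sequence in a paralog file, the header suffixes described above (.main, .1, etc.) can be stripped and the remaining taxon names counted. The following is a minimal sketch using standard UNIX tools; it is not part of ParaGone. The function name count_copies_per_taxon and the truncated demo sequences are hypothetical, and the sketch assumes that taxon names themselves contain no dots.

```shell
# Count how many sequences each taxon has in a paralog FASTA file.
# ASSUMPTION: headers look like ">79688", ">79687.main" or ">79687.1",
# optionally followed by a space and a description.
count_copies_per_taxon() {
    grep '^>' "$1" |
        sed -e 's/^>//' -e 's/[[:space:]].*$//' -e 's/\.[^.]*$//' |
        sort | uniq -c |
        awk '{ print $2 "\t" $1 }'   # taxon <TAB> number of sequences
}

# Small demo file with made-up, truncated sequences.
demo_fasta=$(mktemp)
printf '>79688\nGAGAGAG\n>79687.1\nGAGAGAG\n>79687.main\nGAGAGTG\n' > "$demo_fasta"
count_copies_per_taxon "$demo_fasta"
```

For the demo file this reports two sequences for taxon 79687 and one for taxon 79688, matching the description of 4527_paralogs_all.fasta above.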

Running ParaGone

The full ParaGone pipeline can be run with a single command (see here). Alternatively, if you would like to run the pipeline in stages, you can use six consecutive commands. This latter approach can be useful e.g. when running the pipeline on an HPC cluster, where wall time is a limitation. This tutorial will describe the step-by-step method of running the pipeline.

ParaGone requires a folder of *.fasta files as input. Each *.fasta file should contain paralog sequences for a single gene, for all ingroup samples of interest.

In addition to ingroup samples, some of the paralogy resolution algorithms implemented by ParaGone (as originally implemented by Yang and Smith 2014) require designated outgroups. These outgroup sequences can be supplied in two ways:

  1. 'Internal' outgroups. If your paralog *.fasta files already contain your outgroup taxa along with your ingroups, these outgroup sequences (called 'internal outgroups' for the purposes of the ParaGone pipeline) can be specified using the parameter --internal_outgroup, see here. Note that if you have more than one 'internal' outgroup sample, they each need to be specified e.g. --internal_outgroup <taxon_name_1> --internal_outgroup <taxon_name_2>.

  2. 'External' outgroups. If you have outgroup sequences from a different source (e.g., mined from NCBI, a different run of HybPiper, etc.) you can supply these sequences for each gene via a single *.fasta file. ParaGone calls these 'external' outgroups (i.e., not included in your input paralog *.fasta files).

Either 'internal' or 'external' outgroup sequences can be specified, or both. This tutorial uses a file of 'external' outgroup sequences (from a single taxon, 80973) called external_outgroups.fasta. In addition, taxon 80974 in the input paralog files is specified as an 'internal' outgroup.
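Because --internal_outgroup must be repeated once per taxon, it can be convenient to build the flags from a list of names. The sketch below is one possible way to do this in shell; the helper internal_outgroup_flags is hypothetical, it assumes taxon names contain no spaces, and it only prints the resulting command rather than running it.

```shell
# Build one --internal_outgroup flag per taxon name given as an argument.
internal_outgroup_flags() {
    flags=""
    for taxon in "$@"; do
        flags="$flags --internal_outgroup $taxon"
    done
    # Trim the leading space before printing.
    printf '%s\n' "${flags# }"
}

# Print (rather than run) a step-one command for the tutorial dataset,
# where 80974 is the single 'internal' outgroup.
echo "paragone check_and_align paralog_input --external_outgroups_file external_outgroups.fasta $(internal_outgroup_flags 80974)"
```

Calling internal_outgroup_flags with two names, e.g. taxon_1 and taxon_2, yields --internal_outgroup taxon_1 --internal_outgroup taxon_2.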

Reducing ParaGone run time: For the time-consuming alignment and tree generation stages of the pipeline, ParaGone can process multiple alignments/trees concurrently. This is controlled by the parameters --pool and --threads, both of which take an integer as an argument (e.g. --pool 2). The values should be chosen based on the number of CPUs/threads available on your computer, or the proportion of available CPUs/threads you would like to use. The --pool value corresponds to the number of alignments (MAFFT) or trees (IQ-TREE or FastTree) to run concurrently, whereas the --threads value corresponds to the number of CPUs/threads used by each concurrent process. For example, if you would like to use a maximum of 10 CPUs/threads, you could provide the options --pool 5 --threads 2. This will run 5 alignments/trees concurrently, with each alignment/tree process using 2 threads. Note that in the tutorial commands below we are specifying --pool 1 --threads 1, meaning that only one alignment/tree will be run at a time, using a single thread. These are also the default values.
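The relationship between the two values is simply pool × threads ≤ the number of CPUs/threads you want to use. As a rough sketch (the helper pool_for_budget and its argument names are hypothetical, and nothing here is read from ParaGone itself), a --pool value can be derived from a CPU budget and a chosen --threads value like so:

```shell
# Pick a --pool value so that pool * threads stays within a CPU budget.
# $1 = maximum CPUs/threads to use, $2 = threads per alignment/tree job.
pool_for_budget() {
    max_cpus=$1
    threads_per_job=$2
    pool=$((max_cpus / threads_per_job))   # integer division rounds down
    if [ "$pool" -lt 1 ]; then
        pool=1                             # always run at least one job
    fi
    echo "$pool"
}

threads=2
pool=$(pool_for_budget 10 "$threads")
echo "--pool $pool --threads $threads"
```

With a budget of 10 and 2 threads per job this prints --pool 5 --threads 2, the example given above. If the threads value is larger than the budget, the sketch still returns a pool of 1, so the budget is exceeded in that case.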

Step 1: checking input files and aligning paralogs

In the first step of the pipeline the input files are checked, file and sequence names are sanitised, and outgroup coverage is assessed for each gene. Then, the paralog *.fasta files are aligned using MAFFT, and the terminal ends of each alignment are trimmed using trimal to remove poorly aligned regions. Note that trimal is run with the default parameters -gapthreshold 0.12 -terminalonly -gw 1. Finally, any poorly aligned regions within each sequence (e.g. due to assembly errors) are removed using HmmCleaner.pl.

From within the test_dataset directory, run the command:

paragone check_and_align paralog_input --external_outgroups_file external_outgroups.fasta --internal_outgroup 80974 --pool 1 --threads 1

The following folders will be produced:

  • 00_logs_and_reports

  • 01_input_paralog_fasta_with_sanitised_filenames

  • 02_alignments

  • 03_alignments_trimmed

  • 04_alignments_trimmed_hmmcleaned

...along with the file

  • external_outgroups_sanitised.fasta

...a log file under the 00_logs_and_reports/logs directory:

  • check_and_align_<date_time>.log

...and two report files under the 00_logs_and_reports/reports directory:

  • outgroup_taxon_list.tsv
  • outgroup_coverage_report.tsv

If desired, you can check the alignments by opening them in an alignment viewer (Geneious, AliView, etc).

Take note of the output folder 00_logs_and_reports - this is created here in step one of the pipeline, and is also used by later steps to store log and report files. It contains two subfolders: logs and reports. The logs folder contains text *.log files that provide more detail and debugging information on a given pipeline step, whereas the reports folder contains reports generated by a given pipeline step in tab-separated-values (*.tsv) format. Note that not all pipeline steps write a report file; those that do are described below.

Importantly, after running step one you should check the outgroup coverage report, located at 00_logs_and_reports -> reports -> outgroup_coverage_report.tsv. Open the file in a spreadsheet program (Excel, Numbers, Calc, etc.) and check that the outgroup coverage for each gene is as expected. For this test dataset, the outgroup_coverage_report.tsv file contains:

Gene_name Internal_outgroup_taxa External_outgroup_taxa
4527 80974 80973
4932 80974 80973
4992 80974 80973
5620 80974 80973
6139 80974 80973
6462 80974 80973
6717 80974 80973
7128 80974 80973

As can be seen, each gene has an outgroup sequence from the 'internal' taxon specified (80974) and from the external fasta file (80973).

Note: for the paralogy resolution algorithms MO, RT, and 1to1, any gene tree containing more than one tip for at least one taxon (i.e., putative paralogs) but no designated outgroup(s) will be skipped.
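For datasets with many genes, checking the coverage report by eye becomes tedious. The sketch below lists genes that have no outgroup at all. It rests on an assumption that is not confirmed by this tutorial: that a missing outgroup appears in the report as an empty field or as the word None. Check a report from your own data before relying on it. The function name genes_without_outgroup and the gene_x row are hypothetical.

```shell
# List genes with neither an internal nor an external outgroup.
# ASSUMPTION: columns are tab-separated, and a missing outgroup is written
# as an empty field or as "None" (not confirmed by the tutorial).
genes_without_outgroup() {
    awk -F '\t' 'NR > 1 {
        has_internal = ($2 != "" && $2 != "None")
        has_external = ($3 != "" && $3 != "None")
        if (!has_internal && !has_external) print $1
    }' "$1"
}

# Demo report: one real row from the tutorial plus one hypothetical gene.
demo_report=$(mktemp)
printf 'Gene_name\tInternal_outgroup_taxa\tExternal_outgroup_taxa\n' > "$demo_report"
printf '4527\t80974\t80973\n' >> "$demo_report"
printf 'gene_x\tNone\tNone\n' >> "$demo_report"
genes_without_outgroup "$demo_report"
```

For the tutorial's own outgroup_coverage_report.tsv the function prints nothing, since every gene has both outgroups.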

The outgroup_taxon_list.tsv file produced in the 00_logs_and_reports/reports directory lists the outgroups specified at the command line, and whether they are 'internal' or 'external'. For this test dataset, the file contains:

INTERNAL_OUTGROUP 80974
EXTERNAL_OUTGROUP 80973

This information is read during step four below (align_selected_and_tree).

Step 2: phylogenetic trees from alignments

In the second step of the pipeline a phylogenetic tree is produced for each alignment generated in step one. By default, trees are produced using IQ-TREE. Optionally, FastTree can be used, as in this tutorial, by providing the flag --use_fasttree.

From within the test_dataset directory, run the command:

paragone alignment_to_tree 04_alignments_trimmed_hmmcleaned --use_fasttree --pool 1 --threads 1

Note: if you had chosen not to trim or clean the alignments in step one (using optional flags --no_trimming and/or --no_cleaning), your alignments folder might be called 02_alignments, 03_alignments_trimmed or 04_alignments_cleaned. In these cases, provide this folder name in place of 04_alignments_trimmed_hmmcleaned in the command above.

The following folder will be generated:

  • 05_trees_pre_quality_control

...along with a log file under the existing 00_logs_and_reports/logs directory:

  • alignment_to_tree_<date_time>.log

The trees can be viewed in a viewing program such as FigTree. Before they are analysed using paralogy-resolution algorithms, we need to apply some cleaning and quality control steps.

Step 3: tree QC and sequence extraction

In the third step of the pipeline a number of quality control and cleaning steps are applied to each phylogenetic tree. These are, in the order applied:

Trimming tree tips

This step uses the program TreeShrink to remove long tip branches in trees, under the assumption that such branches indicate sequence assembly or alignment errors. The stringency of the tip removal is adjusted using the TreeShrink parameter --quantile; ParaGone uses the default value of 0.05, but this can be adjusted with the parameter --treeshrink_q_value.

Masking tree tips

This step removes ('masks' using Yang and Smith terminology) all but one tip from monophyletic clades that contain multiple tips with the same taxon name. This is designed to retain only a single representative tip in cases where a tree contains alleles or close paralogs, as these would interfere with identification of paralogs later in the pipeline. The tip corresponding to the sequence with the greatest number of unambiguous characters is kept.
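The selection rule for the retained tip can be illustrated with a short sketch. It treats A, C, G and T (in either case) as the unambiguous characters, which is an assumption about nucleotide data rather than a statement of how ParaGone counts them; the function names and the demo sequences are hypothetical, and tie-breaking is not addressed.

```shell
# For each sequence in a FASTA file, print "name<TAB>count", where count is
# the number of A/C/G/T characters (ASSUMPTION: these are the
# "unambiguous" characters for nucleotide data).
count_unambiguous() {
    awk '
        /^>/ { if (name != "") print name "\t" n
               name = substr($0, 2); n = 0; next }
        { seq = toupper($0); n += gsub(/[ACGT]/, "", seq) }
        END  { if (name != "") print name "\t" n }
    ' "$1"
}

# Name of the sequence with the most unambiguous characters.
best_tip() {
    count_unambiguous "$1" | sort -t "$(printf '\t')" -k2,2nr | head -n 1 | cut -f1
}

# Demo: two made-up sequences for the same taxon.
demo_aln=$(mktemp)
printf '>80974.0\nACGTACGTNN\n>80974.main\nACGT--NNNN\n' > "$demo_aln"
best_tip "$demo_aln"
```

Here 80974.0 has eight unambiguous characters and 80974.main has four, so 80974.0 would be the tip kept.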

Cutting deep paralogs

This step involves identifying any putative deep paralogs in each tree, and cutting/splitting the tree at these nodes to generate two or more subtrees. Internal branches above a given threshold length are cut. This threshold value can be changed using the parameter --cut_deep_paralogs_internal_branch_length_cutoff <float>. The default value is 0.3.
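To get a feel for how a given cutoff would affect a tree, you can list the internal branch lengths that exceed it. The sketch below is a rough text-based approximation, not ParaGone's own logic: it treats any branch length written directly after a closing bracket in a Newick string (optionally preceded by a support value) as an internal branch. The function name and the four-tip demo tree are hypothetical, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
# Print internal branch lengths in a Newick string that exceed a cutoff.
# $1 = Newick string, $2 = cutoff (e.g. 0.3).
long_internal_branches() {
    newick=$1
    cutoff=$2
    printf '%s\n' "$newick" |
        grep -oE '\)[^:,;()]*:[0-9.eE+-]+' |
        awk -F ':' -v cutoff="$cutoff" '$2 + 0 > cutoff + 0 { print $2 }'
}

# Hypothetical tree with internal branches of length 0.05 and 0.4.
demo_tree='((A:0.1,B:0.2):0.05,(C:0.1,D:0.3):0.4);'
long_internal_branches "$demo_tree" 0.3
```

With the default cutoff of 0.3 only the 0.4 branch is reported; lowering the cutoff to 0.04 reports both internal branches.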

After these three steps have been applied, the fasta sequences corresponding to the tips in each output tree are recovered from the alignment files from step one.

From within the test_dataset directory, run the command:

paragone qc_trees_and_extract_fasta 04_alignments_trimmed_hmmcleaned --treeshrink_q_value 0.20 --cut_deep_paralogs_internal_branch_length_cutoff 0.04

The following folders will be produced:

  • 06_trees_trimmed

  • 07_trees_trimmed_masked

  • 08_trees_trimmed_masked_cut

  • 09_sequences_from_qc_trees

...along with a log file under the existing 00_logs_and_reports/logs directory:

  • qc_trees_and_extract_fasta_<date_time>.log

...and several report files under the existing 00_logs_and_reports/reports directory:

  • fasta_from_qc_trees_report.tsv

  • trees_trimmed_report.tsv

  • trees_trimmed_masked_report.tsv

  • trees_trimmed_masked_cut_report.tsv

If you open the trees_trimmed_report.tsv in a spreadsheet program, you'll see the following:

Tree name	Tips removed by TreeShrink with quantile 0.2	Tip names	Trimmed trees > 4 taxa	Trimmed trees < 4 taxa
4527.aln.trimmed.hmm.fasta.treefile	0	N/A	Y	N
4932.aln.trimmed.hmm.fasta.treefile	2	80974.0; 80974.main	Y	N
4992.aln.trimmed.hmm.fasta.treefile	1	80974	Y	N
5620.aln.trimmed.hmm.fasta.treefile	1	80974	Y	N
6139.aln.trimmed.hmm.fasta.treefile	0	N/A	Y	N
6462.aln.trimmed.hmm.fasta.treefile	0	N/A	Y	N
6717.aln.trimmed.hmm.fasta.treefile	0	N/A	Y	N
7128.aln.trimmed.hmm.fasta.treefile	1	80974	Y	N

Here, you can see that tips for taxon 80974 were removed for genes 4932, 4992, 5620 and 7128. Note that 80974 is an outgroup taxon; don't worry about it being removed here, as the corresponding sequence will be added back in to alignments in step four below.

The trees_trimmed_masked_report.tsv file contains:

Tree name	Monophyletic tips removed ("masked")	Removed tip names	Masked trees < 4 taxa after masking mono
4527.aln.trimmed.hmm.fasta.treefile.tt	1	80974.main	N
4932.aln.trimmed.hmm.fasta.treefile.tt	0	N/A	N
4992.aln.trimmed.hmm.fasta.treefile.tt	0	N/A	N
5620.aln.trimmed.hmm.fasta.treefile.tt	0	N/A	N
6139.aln.trimmed.hmm.fasta.treefile.tt	2	80974.0; 80974.main	N
6462.aln.trimmed.hmm.fasta.treefile.tt	1	80974.main	N
6717.aln.trimmed.hmm.fasta.treefile.tt	1	80974.main	N
7128.aln.trimmed.hmm.fasta.treefile.tt	0	N/A	N

You can see that the tree for gene 6717 contained a monophyletic clade with multiple tips for taxon 80974, and that tip 80974.0 was retained because the corresponding sequence has the greatest number of non-ambiguous nucleotide characters. Again, note that taxon 80974 is an 'internal' outgroup in this tutorial dataset; because the sequences for 80974 were produced by the HybPiper command hybpiper paralog_retriever, there can be multiple 80974 sequences for a given gene. This is more clearly demonstrated in the case of gene 6139, where two 80974 sequences were removed from a monophyletic 80974 clade.

The trees_trimmed_masked_cut_report.tsv file contains:

Tree_name	Num subtrees retained after cutting	Num subtrees discarded after cutting	Number of subtrees discarded after cutting as < 4 taxa
4527.aln.trimmed.hmm.fasta.treefile.tt.mm	1	0	0
4932.aln.trimmed.hmm.fasta.treefile.tt.mm	2	0	0
4992.aln.trimmed.hmm.fasta.treefile.tt.mm	1	0	0
5620.aln.trimmed.hmm.fasta.treefile.tt.mm	2	0	0
6139.aln.trimmed.hmm.fasta.treefile.tt.mm	1	0	0
6462.aln.trimmed.hmm.fasta.treefile.tt.mm	1	0	0
6717.aln.trimmed.hmm.fasta.treefile.tt.mm	1	0	0
7128.aln.trimmed.hmm.fasta.treefile.tt.mm	1	0	0

Tree_name	Subtree discarded after cutting	Reason

As can be seen, for genes 4932 and 5620 each tree was cut at a putative deep paralog, producing two subtrees; these trees can be found at e.g. 08_trees_trimmed_masked_cut -> 5620_1.subtree and 08_trees_trimmed_masked_cut -> 5620_2.subtree.

Step 4: phylogenetic trees from selected sequences

In the fourth step of the pipeline, outgroup fasta sequences are added to each of the fasta files generated at the end of step three (i.e., sequences corresponding to tips in the final subtrees produced).

  • 'External' outgroup sequences will be obtained from the file external_outgroups_sanitised.fasta generated in step one, and specified here using the parameter --external_outgroups_file.

  • 'Internal' outgroup sequences will be added to each corresponding gene fasta file, if necessary (e.g. if they were removed during tree quality control measures in step three above). As discussed above, 'internal' outgroup taxa might contain paralogs for a given gene; this interferes with the paralogy resolution steps below, and so in such cases only a single paralog sequence for each outgroup taxon is added to the gene fasta file. The paralog sequence most distant from the ingroup taxa sequences is chosen.

Next, all fasta files (now containing outgroup sequences) are aligned, and a phylogenetic tree is produced for each alignment. By default, trees are produced using IQ-TREE. Optionally, FastTree can be used, as in this tutorial, by providing the flag --use_fasttree.

From within the test_dataset directory, run the command:

paragone align_selected_and_tree 04_alignments_trimmed_hmmcleaned --use_fasttree --pool 1 --threads 1

The following folders will be produced:

  • 10_sequences_from_qc_outgroups_added

  • 11_pre_paralog_resolution_alignments

  • 12_pre_paralog_resolution_alignments_trimmed

  • 13_pre_paralog_resolution_trees

...along with three report files under the existing 00_logs_and_reports/reports directory:

  • in_and_outgroups_list.txt

  • per_locus_paralogy_report_post_tree_qc.tsv

  • per_taxon_paralogy_report_post_tree_qc.tsv

The in_and_outgroups_list.txt file records which taxon names belong to specified outgroups (i.e., they are 'external' outgroups that originate from the external_outgroups.fasta file, or are 'internal' outgroups that were specified using the --internal_outgroup parameter), and which taxon names are ingroups. For example:

OUT	80974
OUT	80973
IN	79684
IN	79689
IN	79688
IN	79685
IN	79682
IN	79687
IN	79678
IN	79686
IN	79683
IN	79679

This list is required by some of the paralogy resolution algorithms as implemented by Yang and Smith.

The per_locus_paralogy_report_post_tree_qc.tsv and per_taxon_paralogy_report_post_tree_qc.tsv reports contain summaries of putative paralogy across the dataset on a per-locus and per-taxon basis. These reports are generated by parsing the post-QC phylogenetic trees.

The per_locus_paralogy_report_post_tree_qc.tsv report contains:

locus num_taxa_total num_taxa_>1_tip >1_tip_taxa_names
4527_1 18 6 79678; 79679; 79682; 79685; 79686; 79687
4932_1 12 0 None
4932_2 19 7 79678; 79679; 79682; 79685; 79686; 79687; 79689
4992_1 17 5 79678; 79683; 79685; 79686; 79689
5620_1 11 0 None
5620_2 12 0 None
6139_1 16 4 79678; 79679; 79684; 79685
6462_1 17 5 79678; 79679; 79682; 79686; 79687
6717_1 19 7 79678; 79679; 79682; 79684; 79685; 79686; 79687
7128_1 16 4 79678; 79679; 79682; 79687

The per_taxon_paralogy_report_post_tree_qc.tsv report contains:

taxon loci_where_>1_tip
79678 4527_1; 4932_2; 4992_1; 6139_1; 6462_1; 6717_1; 7128_1
79679 4527_1; 4932_2; 6139_1; 6462_1; 6717_1; 7128_1
79682 4527_1; 4932_2; 6462_1; 6717_1; 7128_1
79683 4992_1
79684 6139_1; 6717_1
79685 4527_1; 4932_2; 4992_1; 6139_1; 6717_1
79686 4527_1; 4932_2; 4992_1; 6462_1; 6717_1
79687 4527_1; 4932_2; 6462_1; 6717_1; 7128_1
79688 None
79689 4932_2; 4992_1
80973 None
80974 None
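The per-taxon report can be condensed into a simple count of affected loci per taxon. The following sketch assumes the file is tab-separated with locus names separated by a semicolon and a space, as in the listing above, and that None marks a taxon with no multi-tip loci; the function name and the three-row demo file are hypothetical.

```shell
# Print "taxon<TAB>number of loci with more than one tip" for each row of
# the per-taxon paralogy report.
# ASSUMPTION: tab-separated columns, "; " between loci, "None" for no loci.
loci_count_per_taxon() {
    awk -F '\t' 'NR > 1 {
        n = ($2 == "None" || $2 == "") ? 0 : split($2, loci, /; */)
        print $1 "\t" n
    }' "$1"
}

# Demo file using three rows from the report above.
demo_taxon_report=$(mktemp)
printf 'taxon\tloci_where_>1_tip\n' > "$demo_taxon_report"
printf '79683\t4992_1\n79684\t6139_1; 6717_1\n79688\tNone\n' >> "$demo_taxon_report"
loci_count_per_taxon "$demo_taxon_report"
```

For these rows the counts are 1 for 79683, 2 for 79684 and 0 for 79688.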

Step 5: resolving paralogs

In the fifth step of the pipeline, each of the trees produced in step four is processed using the paralogy resolution algorithms described in Yang and Smith 2014. ParaGone uses modified versions of the original Yang and Smith scripts for this step. The three algorithms implemented are Monophyletic Outgroups (MO), Maximum Inclusion (MI), and Rooted subTrees (RT); please see the 2014 manuscript for a more detailed description. Note that you can select one or more of the algorithms to use - in the command below, we are using all three.

From within the test_dataset directory, run the command:

paragone prune_paralogs --mo --rt --mi

The following folders will be produced:

  • 14_pruned_MO

  • 15_pruned_MI

  • 16_pruned_RT

...along with several report files under the existing 00_logs_and_reports/reports directory:

  • MO_report.tsv

  • MI_report.tsv

  • RT_report.tsv

These reports are useful for tracking how each paralogy resolution algorithm has processed each of the input trees. In each report, the Number of trees row contains top-level stats for the total number of input trees, and the following rows contain stats for each individual tree.

Monophyletic Outgroups (MO)

Open the output directory 14_pruned_MO. Here you will see a number of tree types output by the Monophyletic Outgroups paralogy resolution approach:

  • *.reroot. These trees are produced if a given input tree had monophyletic outgroups. Each tree has been re-rooted on the outgroups clade.
  • *.1to1ortho.tre. These trees are produced if there are no putative paralogs in the input tree. For this tutorial dataset, you can see that both subtrees produced by pruning deep paralogs for gene 5620 (see pipeline step three) contain no paralogs.
  • *.ortho.tre. These trees are produced by extracting the clade with the greatest number of non-repeating taxa from a tree rooted on a monophyletic outgroup.

For example, in the input tree for gene 6139 below (13_pre_paralog_resolution_trees -> 6139_1.outgroup_added.aln.trimmed.fasta.treefile), there are two clades containing non-repeating taxon names (i.e., putative paralog clades), as shown by the red and green boxes:

[Figure: 6139_pre]

The red clade contains five taxa, whereas the green clade contains eight taxa. The MO algorithm extracts the larger green clade, and writes the output tree 6139_1.ortho.tre:

[Figure: 6139_MO]

Maximum Inclusion (MI)

Open the output directory 15_pruned_MI. Here you will see a number of tree types output by the Maximum Inclusion paralogy resolution approach:

  • *.1to1ortho.tre. These trees are produced if there are no putative paralogs in the input tree. For this tutorial dataset, you can see that both subtrees produced by pruning deep paralogs for gene 5620 (see pipeline step three) contain no paralogs.

  • *.MIortho1.tre, *.MIortho2.tre, etc. These trees are produced by iteratively cutting out the subtree with the highest number of taxa without taxon duplication.

For example, for the input tree for gene 6139 as shown above for the MO approach, the Maximum Inclusion method has extracted both the clade in the red box and the green box, producing the trees 6139_1.MIortho1.tre (green box) and 6139_1.MIortho2.tre (red box). Note that the outgroup sequences 80974.0 and 80973 are present in tree 6139_1.MIortho1.tre:

[Figure: 6139_MIortho1]

...but not the tree 6139_1.MIortho2.tre:

[Figure: 6139_MIortho2]

Rooted subTrees (RT)

Open the output directory 16_pruned_RT. Here you will see a number of tree types output by the Rooted subTrees paralogy resolution approach:

  • *.inclade1. These trees are produced by iteratively searching for the subtree with the highest number of ingroup taxa, and cutting it out as a rooted tree based on the position of one or more of the defined outgroup taxa.

  • *.inclade1.ortho1.tre, *.inclade1.ortho2.tre, etc. These trees are produced by examining the inclade*.tre subtrees described above, and inferring gene duplications from root to tips. When duplicated taxa are found between two child clades at a bifurcating node, the clade containing the smaller number of taxa is cut off.

Because the outgroups are monophyletic for all trees in this tutorial dataset, the *.inclade*.tre trees are simply the input tree with the outgroup sequences removed. For example, for gene 6462 the tree 6462_1.inclade1 looks like:

[Figure: 6462_RT inclade1]

You'll see that the RT paralogy resolution approach has recovered a single tree called 6462_1.inclade1.ortho1.tre, containing nine tips (compared to the fifteen tips in tree 6462_1.inclade1):

[Figure: 6462_RT inclade1 ortho1]

To understand why these tips were removed, imagine starting at the root of tree 6462_1.inclade1 and moving node-by-node towards the tips, trimming according to the following reasoning:

  • Node 1: taxon 79688 occurs only once in the tree, and so it is not removed.
  • Node 2: taxon 79682 occurs in the tree twice (79682.main and 79682.0); for the current node, the child clade containing the fewest taxa is removed. In this case the removed 'clade' contains only one taxon (79682.main); as this is below the default minimum number of taxa for a clade to be retained (four), the 'clade' is discarded rather than being written to file. Note that this minimum taxa value can be changed using the parameter --minimum_taxa.
  • Node 3: one of the child clades at this node contains taxa 79687 and 79686; both of these taxa also occur in the other child clade, and so the smaller child clade is removed. As above, it contains fewer than the minimum required number of taxa, so it is discarded.
  • Node 4: one child clade contains the single taxon 79679, which also occurs in the other child. The former child clade is removed and discarded.
  • Node 5: the smaller child clade contains taxa 79689 and 79678; taxon 79678 occurs in the larger child clade, and so the smaller clade is removed and discarded.
  • Node 6: The remaining clade does not contain any repeated taxa, and is retained.
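The decision made at each node in the walkthrough above can be sketched for a single bifurcation. This is a simplified illustration, not the Yang and Smith implementation: it only compares the taxa of two child clades, ignores the minimum-taxa check, and cuts the second clade when both are the same size. The function names are hypothetical, as are the tip lists (the source does not give the full contents of each clade).

```shell
# Turn a space-separated list of tip names into sorted, unique taxon names
# by stripping a trailing paralog suffix such as ".main" or ".0".
taxa_of() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed 's/\.[^.]*$//' | sort -u
}

# Given the tips of the two child clades at a node, print "keep_both" if no
# taxon occurs in both clades, otherwise "left" or "right" to name the
# smaller clade that would be cut off.
clade_to_cut() {
    left=$(taxa_of "$1")
    right=$(taxa_of "$2")
    shared=$(printf '%s\n%s\n' "$left" "$right" | sort | uniq -d)
    if [ -z "$shared" ]; then
        echo "keep_both"
        return
    fi
    n_left=$(printf '%s\n' "$left" | wc -l | tr -d ' ')
    n_right=$(printf '%s\n' "$right" | wc -l | tr -d ' ')
    if [ "$n_left" -lt "$n_right" ]; then echo "left"; else echo "right"; fi
}

# A case like Node 5: the smaller clade shares taxon 79678 with the larger.
clade_to_cut "79689 79678.main" "79678.0 79683 79684 79685"
```

The first call prints left, because taxon 79678 occurs in both clades and the first clade is smaller. A case like Node 1, e.g. clade_to_cut "79688" "79682.main 79682.0 79687", prints keep_both because no taxon is shared between the two clades.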

Information on the tips/clades that were removed can also be found in the log file for the step, located at 00_logs_and_reports -> logs -> prune_paralogs_<date_time>.log.

Step 6: final alignments

In the sixth and final step of the pipeline, fasta sequences corresponding to tips in the 'resolved' paralog trees (step five) are recovered from the fasta files produced in step four (i.e., with outgroup sequences added). For each sequence file produced, the fasta headers (i.e., sequence names) are stripped to remove paralog IDs (for HybPiper, the ID is a suffix such as .main, .1 etc), and the sequences are aligned. Note that removal of the paralog IDs from sequence names allows concatenation of the individual paralog alignments to create a supermatrix, if desired for phylogenetic analyses. A trimmed version of each alignment is also produced, in which the terminal ends are trimmed using trimal with the default parameters -gapthreshold 0.12 -terminalonly -gw 1.
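ParaGone performs the header stripping itself, but the effect is easy to picture with a one-line sed command. The sketch below is only an illustration of the idea: it assumes HybPiper-style suffixes (.main or a dot followed by digits) at the very end of the header line, and the function name and demo sequences are hypothetical.

```shell
# Remove a trailing paralog ID (".main", ".0", ".1", ...) from FASTA header
# lines, leaving sequence lines untouched.
# ASSUMPTION: the suffix is the last thing on the header line.
strip_paralog_ids() {
    sed -E '/^>/ s/\.(main|[0-9]+)$//' "$1"
}

# Demo file with made-up, truncated sequences.
demo_seqs=$(mktemp)
printf '>79687.main\nGAGA\n>79688\nGAGA\n>79682.0\nACGT\n' > "$demo_seqs"
strip_paralog_ids "$demo_seqs"
```

After stripping, the three headers read >79687, >79688 and >79682, so the same taxon carries the same name in every gene alignment.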

From within the test_dataset directory, run the command:

paragone final_alignments --mo --rt --mi --pool 1 --threads 1 --keep_intermediate_files

The following folders will be produced:

  • 23_MO_final_alignments

  • 24_MI_final_alignments

  • 25_RT_final_alignments

  • 26_MO_final_alignments_trimmed

  • 27_MI_final_alignments_trimmed

  • 28_RT_final_alignments_trimmed

...along with several report files under the existing 00_logs_and_reports/reports directory:

  • fasta_from_tree_mo_report.tsv

  • fasta_from_tree_mi_report.tsv

  • fasta_from_tree_rt_report.tsv

Note the flag --keep_intermediate_files in the command above. Without this flag, the default behaviour of the paragone final_alignments command is to delete all intermediate folders and files from the pipeline run except the following:

  • 00_logs_and_reports
  • 23_MO_final_alignments
  • 24_MI_final_alignments
  • 25_RT_final_alignments
  • 26_MO_final_alignments_trimmed
  • 27_MI_final_alignments_trimmed
  • 28_RT_final_alignments_trimmed

This can be useful when you're running the pipeline on an HPC with a limit on the number of files that can be produced.

The pipeline run is now complete! The final paralog alignments can be used in single-gene coalescent phylogenetic analyses (e.g. using Astral), or concatenated into a supermatrix for analyses with e.g. IQ-TREE, BEAST etc.