Tutorial
Documentation current for ParaGone version 0.0.11rc
The purpose of this tutorial is to familiarize you with the format of the input ParaGone needs and the output you should expect from running it. The tutorial uses a test dataset of paralog sequences for eight genes from eleven taxa, as output by the command hybpiper paralog_retriever
from the program HybPiper, together with a fasta file of outgroup sequences.
This tutorial assumes that you have some experience executing programs from the command line on a UNIX-like system.
Click this link to download the files for the test dataset. If you installed ParaGone by cloning the repository, the file test_dataset.tar.gz
will already be in the repository directory. Extract the file by either double-clicking on it, or via a terminal using the command:
tar -zxvf test_dataset.tar.gz
The extracted directory test_dataset
contains the following folder and files:
-
paralog_input is a folder containing *.fasta files for eight genes from the Angiosperms353 data set, named 4527, 4932, 4992, 5620, 6139, 6462, 6717 and 7128. Each file contains one or more paralog sequences for up to eleven taxa, as output by the HybPiper command hybpiper paralog_retriever. These taxa include ten ingroup samples, as well as one outgroup sample (see below for more details on the outgroup sample). The taxon names are: 79678, 79679, 79682, 79683, 79684, 79685, 79686, 79687, 79688, 79689 and 80974. For example, the file 4527_paralogs_all.fasta contains the sequences:
>79688
GAGAGAGTGGCTGTGTTGGTTATTGGAGGAGGGGGAAGGGAGCACGCGCTTTGTTACGCTTTGAAACGGTCTCCTTCCTG
TGACGCTGTGTTCTGCGCACCTGGAAACGCTGGAATTTCAAACTCTGGGGATGCCACGTGTATTGAGGACCTCGACATCT
...
>79687.1
GAGAGAGTGGCTGTGTTGGTTATTGGAGGAGGGGGAAGGGAGCACGCCCTTTGTTACGCTTTGAAACGGTCTCCTTCCTG
TGACGCTGTGTTTTGCGCACCTGGAAATGCTGGAATTTCAAACTCTGGGGATGCGACGTGTATTGAGGACCTCAACATAT
...
>79687.main
GAGAGTGTGTCTGTGTTGGTTATTGGAGGAGGGGGAAGGGAGCACGCGCTTTGTTACGCTTTGAAACGGTCTCCTTCCTG
TGACGCTGTGTTCTGCGCACCTGGAAACGCTGGAATTTCAAACTCTGGGGACGCGACGTGTATTGAGGACCTCGACATCT
...
As can be seen, only a single sequence is present for taxon 79688, whereas there are two sequences (putative paralogs) for taxon 79687. The *.fasta headers of these two sequences have the suffixes .main and .1; see the HybPiper wiki for more details.
-
external_outgroups.fasta is a *.fasta file containing 'outgroup' sequences for the eight genes in this test dataset. There is a single outgroup sequence for each gene, originating from taxon 80973. For example:
>80973-4527
GAAAGAGTCAACGTGTTGGTTATTGGAGGTGGTGGAAGAGAACACGCCCTTTGTTACGCTCTAAAGCGATCTCCTTCGTG
TGATGCTGTGTTTTGTGCCCCGGGAAATGCCGGAATTTCAAGTTCAGGTGATGCGACTTGCATCGAGAATGTAAACATCT
...
>80973-4932
GACAACTCTGTTTCTGACATGCTAATGGATTCATTCGGGAGATTACACACATATTTAAGAATTTCATTGACAGAGCGATG
CAATTTGAGATGCAAGTATTGTATGCCCGATGAAGGCGTACAACTCACTCCAAAGCCCGAGCTTCTCTCCACTAATGAGA
...
Note that when running ParaGone on your own data, it's fine if the external outgroups file contains multiple outgroup sequences for each gene, originating from different outgroup taxa (e.g. sequences with names taxon1-gene01, taxon2-gene01 etc.).
-
run_paragone_test_dataset.sh is a shell script. It contains the example commands we will cover in this tutorial. The script can be run from within the test_dataset directory simply by typing:
./run_paragone_test_dataset.sh
...followed by enter/return.
All of the commands in the bash script will be covered in more detail in the following sections.
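To check which taxa carry putative paralogs in one of the input files, the fasta headers can be tallied per taxon. The following is a minimal sketch and not part of ParaGone: the function name count_seqs_per_taxon is hypothetical, and it assumes that paralog IDs are a trailing .main or .<number> suffix on the header, as in the HybPiper example above.

```shell
# Hypothetical helper (not part of ParaGone): print "<taxon><TAB><count>" for
# each taxon in a paralog FASTA file.
# Assumes paralog IDs are a trailing ".main" or ".<number>" header suffix.
count_seqs_per_taxon() {
    # $1 = path to a paralog FASTA file
    grep '^>' "$1" |
        sed -E -e 's/^>//' -e 's/[[:space:]].*$//' -e 's/\.(main|[0-9]+)$//' |
        sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Example usage, from within the test_dataset directory:
#   count_seqs_per_taxon paralog_input/4527_paralogs_all.fasta
```

A count greater than one marks a taxon with more than one sequence for that gene; for the three headers in the excerpt above, taxon 79687 would have a count of 2 and taxon 79688 a count of 1.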
The full ParaGone pipeline can be run with a single command (see here). Alternatively, if you would like to run the pipeline in stages, you can use six consecutive commands. This latter approach can be useful e.g. when running the pipeline on an HPC cluster, where wall time is a limitation. This tutorial will describe the step-by-step method of running the pipeline.
ParaGone requires a folder of *.fasta
files as input. Each *.fasta
file should contain paralog sequences for a single gene, for all ingroup samples of interest.
In addition to ingroup samples, some of the paralogy resolution algorithms implemented by ParaGone (as originally implemented by Yang and Smith 2014) require designated outgroups. These outgroup sequences can be supplied in two ways:
-
'Internal' outgroups. If your paralog *.fasta files already contain your outgroup taxa along with your ingroups, these outgroup sequences (called 'internal outgroups' for the purposes of the ParaGone pipeline) can be specified using the parameter --internal_outgroup, see here. Note that if you have more than one 'internal' outgroup sample, each one needs to be specified, e.g. --internal_outgroup <taxon_name_1> --internal_outgroup <taxon_name_2>.
-
'External' outgroups. If you have outgroup sequences from a different source (e.g., mined from NCBI, a different run of HybPiper, etc.) you can supply these sequences for each gene via a single *.fasta file. ParaGone calls these 'external' outgroups (i.e., not included in your input paralog *.fasta files).
Either 'internal' or 'external' outgroup sequences can be specified, or both. This tutorial uses a file of 'external' outgroup sequences (from a single taxon, 80973
) called external_outgroups.fasta
. In addition, taxon 80974
in the input paralog files is specified as an 'internal' outgroup.
Reducing ParaGone run time: For the time-consuming alignment and tree generation stages of the pipeline, ParaGone can process multiple alignments/trees concurrently. This is controlled by the parameters --pool
and --threads
, both of which take an integer as an argument (e.g. --pool 2
). The integer corresponds to the number of CPUs/threads to use, and should be calculated based on the number of CPUs/threads available on your computer, or the proportion of available CPUs/threads you would like to use. The --pool
value corresponds to the number of alignments (MAFFT) or trees (IQ-TREE or FastTree) to run concurrently, whereas the --threads
value corresponds to the number of CPUs/threads used by each concurrent process. For example, if you would like to use a maximum of 10 CPUs/threads, you could provide the options --pool 5 --threads 2
. This will run 5 alignments/trees concurrently, with each alignment/tree process using 2 threads. Note that in the tutorial commands below we are specifying --pool 1 --threads 1
, meaning that only one alignment/tree will be run at a time, using a single thread. These are also the default values.
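The arithmetic behind these two options can be sketched as follows. This is an illustrative helper only, not a ParaGone feature: the function name suggest_pool_threads is hypothetical, and it simply divides a CPU budget by the threads wanted per job to get the number of concurrent jobs.

```shell
# Hypothetical helper (not part of ParaGone): given a CPU budget and the
# number of threads per alignment/tree job, print matching ParaGone options.
suggest_pool_threads() {
    # $1 = maximum CPUs/threads to use, $2 = threads per concurrent job
    pool=$(( $1 / $2 ))          # integer division: concurrent jobs that fit
    if [ "$pool" -lt 1 ]; then
        pool=1                   # always run at least one job
    fi
    echo "--pool ${pool} --threads $2"
}

suggest_pool_threads 10 2   # prints: --pool 5 --threads 2
```

The product of the two values is the most CPUs/threads that will be in use at any one time, so it should not exceed what your machine or job allocation provides.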
In the first step of the pipeline the input files are checked, file and sequence names are sanitised, and outgroup coverage is assessed for each gene. Then, the paralog *.fasta
files are aligned using MAFFT
, and the terminal ends of each alignment are trimmed using trimal
to remove poorly aligned regions. Note that trimal is run with the default parameters -gapthreshold 0.12 -terminalonly -gw 1
. Finally, any poorly aligned regions within each sequence (e.g. due to assembly errors) are removed using HmmCleaner.pl
.
From within the test_dataset
directory, run the command:
paragone check_and_align paralog_input --external_outgroups_file external_outgroups.fasta --internal_outgroup 80974 --pool 1 --threads 1
The following folders will be produced:
-
00_logs_and_reports
-
01_input_paralog_fasta_with_sanitised_filenames
-
02_alignments
-
03_alignments_trimmed
-
04_alignments_trimmed_hmmcleaned
...along with the file
external_outgroups_sanitised.fasta
...a log file under the 00_logs_and_reports/logs
directory:
check_and_align_<date_time>.log
...and two report files under the 00_logs_and_reports/reports
directory:
outgroup_taxon_list.tsv
outgroup_coverage_report.tsv
If desired, you can check the alignments by opening them in an alignment viewer (Geneious, AliView, etc).
Take note of the output folder 00_logs_and_reports
- this is created here in step one of the pipeline, and is also used by later steps to store log and report files. It contains two subfolders: logs
and reports
. The logs
folder contains text *.log
files that provide more detail and debugging information on a given pipeline step, whereas the reports
folder contains reports generated by a given pipeline step in tab-separated-values (*.tsv
) format. Note that not all pipeline steps write a report file; those that do are described below.
Importantly, after running step one you should check the outgroup coverage report, located at 00_logs_and_reports -> reports -> outgroup_coverage_report.tsv
. Open the file in a spreadsheet program (Excel, Numbers, Calc, etc.) and check that the outgroup coverage for each gene is as expected. For this test dataset, the outgroup_coverage_report.tsv
file contains:
Gene_name | Internal_outgroup_taxa | External_outgroup_taxa |
---|---|---|
4527 | 80974 | 80973 |
4932 | 80974 | 80973 |
4992 | 80974 | 80973 |
5620 | 80974 | 80973 |
6139 | 80974 | 80973 |
6462 | 80974 | 80973 |
6717 | 80974 | 80973 |
7128 | 80974 | 80973 |
As can be seen, each gene has an outgroup sequence from the 'internal' taxon specified (80974
) and from the external fasta file (80973
).
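For datasets with many genes, the same check can be scripted instead of done by eye. The sketch below is not part of ParaGone: the function name genes_missing_outgroup is hypothetical, and it assumes that the report is tab-separated with a single header row and the three columns shown above. It lists genes whose row does not mention a given outgroup taxon in either outgroup column, using a simple substring match.

```shell
# Hypothetical helper (not part of ParaGone): list genes whose outgroup
# coverage report row does not mention a given outgroup taxon.
# Assumes a tab-separated file with one header row and the three columns
# shown above; taxon matching is a simple substring test.
genes_missing_outgroup() {
    # $1 = path to outgroup_coverage_report.tsv, $2 = outgroup taxon name
    awk -F '\t' -v taxon="$2" \
        'NR > 1 && index($2 ";" $3, taxon) == 0 { print $1 }' "$1"
}

# Example usage, from within the test_dataset directory:
#   genes_missing_outgroup 00_logs_and_reports/reports/outgroup_coverage_report.tsv 80974
```

For the test dataset report shown above, no gene names should be printed for either 80974 or 80973.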
Note: for the paralogy resolution algorithms MO, RT, and 1to1, any gene tree containing more than one tip for at least one taxon (i.e., putative paralogs) but no designated outgroup(s) will be skipped.
The outgroup_taxon_list.tsv
file produced in the 00_logs_and_reports/reports
directory lists the outgroups specified at the command line, and whether they are 'internal' or 'external'. For this test dataset, the file contains:
INTERNAL_OUTGROUP | 80974 |
EXTERNAL_OUTGROUP | 80973 |
This information is read during step four below (align_selected_and_tree
).
In the second step of the pipeline a phylogenetic tree is produced for each alignment generated in step one. By default, trees are produced using IQ-TREE. Optionally, FastTree can be used, as in this tutorial, by providing the flag --use_fasttree
.
From within the test_dataset
directory, run the command:
paragone alignment_to_tree 04_alignments_trimmed_hmmcleaned --use_fasttree --pool 1 --threads 1
Note: if you had chosen not to trim or clean the alignments in step one (using optional flags --no_trimming
and/or --no_cleaning
), your alignments folder might be called 02_alignments
, 03_alignments_trimmed
or 04_alignments_cleaned
. In these cases, provide this folder name in place of 04_alignments_trimmed_hmmcleaned
in the command above.
The following folder will be generated:
05_trees_pre_quality_control
...along with a log file under the existing 00_logs_and_reports/logs
directory:
alignment_to_tree_<date_time>.log
The trees can be viewed in a viewing program such as FigTree. Before they are analysed using paralogy-resolution algorithms, we need to apply some cleaning and quality control steps.
In the third step of the pipeline a number of quality control and cleaning steps are applied to each phylogenetic tree. These are, in the order applied:
This step uses the program TreeShrink to remove long tip branches in trees, under the assumption that such branches indicate sequence assembly or alignment errors. The stringency of the tip removal is adjusted using the TreeShrink parameter --quantile
; ParaGone uses the default value of 0.05
, but this can be adjusted with the parameter --treeshrink_q_value
.
This step removes ('masks' using Yang and Smith terminology) all but one tip from monophyletic clades that contain multiple tips with the same taxon name. This is designed to retain only a single representative tip in cases where a tree contains alleles or close paralogs, as these would interfere with identification of paralogs later in the pipeline. The tip corresponding to the sequence with the greatest number of unambiguous characters is kept.
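To see which tip this rule would favour, the unambiguous characters in each aligned sequence can be counted. The following is a rough sketch and not ParaGone's own code: the function name count_unambiguous is hypothetical, and it assumes that 'unambiguous' means A, C, G or T in either case, which may differ from the definition ParaGone uses internally.

```shell
# Hypothetical helper (not part of ParaGone): print "<name><TAB><count>" for
# each sequence in a FASTA alignment, where count is the number of
# A/C/G/T characters (case-insensitive). Gaps, N and IUPAC ambiguity
# codes are not counted.
count_unambiguous() {
    # $1 = path to a FASTA alignment
    awk '
        /^>/ {
            if (name != "") print name "\t" n
            name = substr($1, 2); n = 0; next
        }
        { seq = toupper($0); n += gsub(/[ACGT]/, "", seq) }
        END { if (name != "") print name "\t" n }
    ' "$1"
}

# Example usage:
#   count_unambiguous 04_alignments_trimmed_hmmcleaned/<alignment>.fasta
```

Among tips of the same taxon in a monophyletic clade, the sequence with the highest count corresponds to the tip that is kept.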
This step involves identifying any putative deep paralogs in each tree, and cutting/splitting the tree at these nodes to generate two or more subtrees. Internal branches above a given threshold length are cut. This threshold value can be changed using the parameter --cut_deep_paralogs_internal_branch_length_cutoff <float>
. The default value is 0.3.
After these three steps have been applied, the fasta sequences corresponding to the tips in each output tree are recovered from the alignment files from step one.
From within the test_dataset
directory, run the command:
paragone qc_trees_and_extract_fasta 04_alignments_trimmed_hmmcleaned --treeshrink_q_value 0.20 --cut_deep_paralogs_internal_branch_length_cutoff 0.04
The following folders will be produced:
-
06_trees_trimmed
-
07_trees_trimmed_masked
-
08_trees_trimmed_masked_cut
-
09_sequences_from_qc_trees
...along with a log file under the existing 00_logs_and_reports/logs
directory:
qc_trees_and_extract_fasta_<date_time>.log
...and several report files under the existing 00_logs_and_reports/reports
directory:
-
fasta_from_qc_trees_report.tsv
-
trees_trimmed_report.tsv
-
trees_trimmed_masked_report.tsv
-
trees_trimmed_masked_cut_report.tsv
If you open the trees_trimmed_report.tsv
in a spreadsheet program, you'll see the following:
Tree name | Tips removed by TreeShrink with quantile 0.2 | Tip names | Trimmed trees > 4 taxa | Trimmed trees < 4 taxa |
---|---|---|---|---|
4527.aln.trimmed.hmm.fasta.treefile | 0 | N/A | Y | N |
4932.aln.trimmed.hmm.fasta.treefile | 2 | 80974.0; 80974.main | Y | N |
4992.aln.trimmed.hmm.fasta.treefile | 1 | 80974 | Y | N |
5620.aln.trimmed.hmm.fasta.treefile | 1 | 80974 | Y | N |
6139.aln.trimmed.hmm.fasta.treefile | 0 | N/A | Y | N |
6462.aln.trimmed.hmm.fasta.treefile | 0 | N/A | Y | N |
6717.aln.trimmed.hmm.fasta.treefile | 0 | N/A | Y | N |
7128.aln.trimmed.hmm.fasta.treefile | 1 | 80974 | Y | N |
Here, you can see that tips for taxon 80974
were removed for genes 4932
, 4992
, 5620
and 7128
. Note that 80974
is an outgroup taxon; don't worry about it being removed here, as the corresponding sequence will be added back into the alignments in step four below.
The trees_trimmed_masked_report.tsv
file contains:
Tree name | Monophyletic tips removed ("masked") | Removed tip names | Masked trees < 4 taxa after masking mono |
---|---|---|---|
4527.aln.trimmed.hmm.fasta.treefile.tt | 1 | 80974.main | N |
4932.aln.trimmed.hmm.fasta.treefile.tt | 0 | N/A | N |
4992.aln.trimmed.hmm.fasta.treefile.tt | 0 | N/A | N |
5620.aln.trimmed.hmm.fasta.treefile.tt | 0 | N/A | N |
6139.aln.trimmed.hmm.fasta.treefile.tt | 2 | 80974.0; 80974.main | N |
6462.aln.trimmed.hmm.fasta.treefile.tt | 1 | 80974.main | N |
6717.aln.trimmed.hmm.fasta.treefile.tt | 1 | 80974.main | N |
7128.aln.trimmed.hmm.fasta.treefile.tt | 0 | N/A | N |
You can see that the tree for gene 6717
contained a monophyletic clade with multiple tips for taxon 80974
, and that tip 80974.0
was retained because the corresponding sequence has the greatest number of non-ambiguous nucleotide characters. Again, note that taxon 80974
is an 'internal' outgroup in this tutorial dataset; because the sequences for 80974
were produced by the HybPiper command hybpiper paralog_retriever
, there can be multiple 80974
sequences for a given gene. This is more clearly demonstrated in the case of gene 6139
, where two 80974
sequences were removed from a monophyletic 80974
clade.
The trees_trimmed_masked_cut_report.tsv
file contains:
Tree_name | Num subtrees retained after cutting | Num subtrees discarded after cutting | Number of subtrees discarded after cutting as < 4 taxa |
---|---|---|---|
4527.aln.trimmed.hmm.fasta.treefile.tt.mm | 1 | 0 | 0 |
4932.aln.trimmed.hmm.fasta.treefile.tt.mm | 2 | 0 | 0 |
4992.aln.trimmed.hmm.fasta.treefile.tt.mm | 1 | 0 | 0 |
5620.aln.trimmed.hmm.fasta.treefile.tt.mm | 2 | 0 | 0 |
6139.aln.trimmed.hmm.fasta.treefile.tt.mm | 1 | 0 | 0 |
6462.aln.trimmed.hmm.fasta.treefile.tt.mm | 1 | 0 | 0 |
6717.aln.trimmed.hmm.fasta.treefile.tt.mm | 1 | 0 | 0 |
7128.aln.trimmed.hmm.fasta.treefile.tt.mm | 1 | 0 | 0 |
Tree_name | Subtree discarded after cutting | Reason |
As can be seen, for genes 4932
and 5620
each tree was cut at a putative deep paralog, producing two subtrees; these trees can be found at e.g. 08_trees_trimmed_masked_cut -> 5620_1.subtree
and 08_trees_trimmed_masked_cut -> 5620_2.subtree
.
In the fourth step of the pipeline, outgroup fasta sequences are added to each of the fasta files generated at the end of step three (i.e., sequences corresponding to tips in the final subtrees produced).
-
'External' outgroup sequences will be obtained from the file
external_outgroups_sanitised.fasta
generated in step one, and specified here using the parameter --external_outgroups_file
. -
'Internal' outgroup sequences will be added to each corresponding gene fasta file, if necessary (e.g. if they were removed during tree quality control measures in step three above). As discussed above, 'internal' outgroup taxa might contain paralogs for a given gene; this interferes with the paralogy resolution steps below, and so in such cases only a single paralog sequence for each outgroup taxon is added to the gene fasta file. The paralog sequence most distant from the ingroup taxa sequences is chosen.
Next, all fasta files (now containing outgroup sequences) are aligned, and a phylogenetic tree is produced for each alignment. By default, trees are produced using IQ-TREE. Optionally, FastTree can be used, as in this tutorial, by providing the flag --use_fasttree
.
From within the test_dataset
directory, run the command:
paragone align_selected_and_tree 04_alignments_trimmed_hmmcleaned --use_fasttree --pool 1 --threads 1
The following folders will be produced:
-
10_sequences_from_qc_outgroups_added
-
11_pre_paralog_resolution_alignments
-
12_pre_paralog_resolution_alignments_trimmed
-
13_pre_paralog_resolution_trees
...along with three report files under the existing 00_logs_and_reports/reports
directory:
-
in_and_outgroups_list.txt
-
per_locus_paralogy_report_post_tree_qc.tsv
-
per_taxon_paralogy_report_post_tree_qc.tsv
The in_and_outgroups_list.txt
file records which taxon names belong to specified outgroups (i.e., they are 'external' outgroups that originate from the external_outgroups.fasta
file, or are 'internal' outgroups that were specified using the --internal_outgroup
parameter), and which taxon names are ingroups. For example:
OUT 80974
OUT 80973
IN 79684
IN 79689
IN 79688
IN 79685
IN 79682
IN 79687
IN 79678
IN 79686
IN 79683
IN 79679
This list is required by some of the paralogy resolution algorithms as implemented by Yang and Smith.
The per_locus_paralogy_report_post_tree_qc.tsv
and per_taxon_paralogy_report_post_tree_qc.tsv
reports contain summaries of putative paralogy across the dataset on a per-locus and per-taxon basis. These reports are generated by parsing the post-QC phylogenetic trees.
The per_locus_paralogy_report_post_tree_qc.tsv
report contains:
locus | num_taxa_total | num_taxa_>1_tip | >1_tip_taxa_names |
---|---|---|---|
4527_1 | 18 | 6 | 79678; 79679; 79682; 79685; 79686; 79687 |
4932_1 | 12 | 0 | None |
4932_2 | 19 | 7 | 79678; 79679; 79682; 79685; 79686; 79687; 79689 |
4992_1 | 17 | 5 | 79678; 79683; 79685; 79686; 79689 |
5620_1 | 11 | 0 | None |
5620_2 | 12 | 0 | None |
6139_1 | 16 | 4 | 79678; 79679; 79684; 79685 |
6462_1 | 17 | 5 | 79678; 79679; 79682; 79686; 79687 |
6717_1 | 19 | 7 | 79678; 79679; 79682; 79684; 79685; 79686; 79687 |
7128_1 | 16 | 4 | 79678; 79679; 79682; 79687 |
The per_taxon_paralogy_report_post_tree_qc.tsv
report contains:
taxon | loci_where_>1_tip |
---|---|
79678 | 4527_1; 4932_2; 4992_1; 6139_1; 6462_1; 6717_1; 7128_1 |
79679 | 4527_1; 4932_2; 6139_1; 6462_1; 6717_1; 7128_1 |
79682 | 4527_1; 4932_2; 6462_1; 6717_1; 7128_1 |
79683 | 4992_1 |
79684 | 6139_1; 6717_1 |
79685 | 4527_1; 4932_2; 4992_1; 6139_1; 6717_1 |
79686 | 4527_1; 4932_2; 4992_1; 6462_1; 6717_1 |
79687 | 4527_1; 4932_2; 6462_1; 6717_1; 7128_1 |
79688 | None |
79689 | 4932_2; 4992_1 |
80973 | None |
80974 | None |
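The kind of summary shown in these reports can be approximated for a single tree with standard command-line tools. This is a minimal sketch and not how ParaGone builds its reports: the function name taxa_with_multiple_tips is hypothetical, and it assumes a Newick tree with unquoted tip names that contain no whitespace, and paralog IDs given as a trailing .main or .<number> suffix.

```shell
# Hypothetical helper (not part of ParaGone): list taxa that have more than
# one tip in a Newick tree.
# Assumes unquoted tip names and ".main" / ".<number>" paralog ID suffixes.
taxa_with_multiple_tips() {
    # $1 = path to a Newick tree file
    # Tip names directly follow "(" or ","; labels after ")" are node labels.
    grep -oE '[(,][^(),:;]+' "$1" |
        sed -E -e 's/^[(,]//' -e 's/\.(main|[0-9]+)$//' |
        sort | uniq -d
}

# Example usage:
#   taxa_with_multiple_tips 13_pre_paralog_resolution_trees/<tree file>
```

An empty result corresponds to a locus reported with 0 in the num_taxa_>1_tip column.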
In the fifth step of the pipeline, each of the trees produced in step four is processed using the paralogy resolution algorithms described in Yang and Smith 2014. ParaGone uses modified versions of the original Yang and Smith scripts for this step. The three algorithms implemented are MonoPhyletic Outgroups (MO), Maximum Inclusion (MI), and Rooted subTrees (RT); please see the 2014 manuscript for a more detailed description. Note that you can select one or more of the algorithms to use - in the command below, we are using all three.
From within the test_dataset
directory, run the command:
paragone prune_paralogs --mo --rt --mi
The following folders will be produced:
-
14_pruned_MO
-
15_pruned_MI
-
16_pruned_RT
...along with several report files under the existing 00_logs_and_reports/reports
directory:
-
MO_report.tsv
-
MI_report.tsv
-
RT_report.tsv
These reports are useful to track how each paralogy resolution algorithm has processed each of the input trees. In each report, the Number of trees
row contains top-level stats for the total number of input trees, and the following rows contain stats for each individual tree.
Open the output directory 14_pruned_MO
. Here you will see a number of tree types output by the Monophyletic Outgroups paralogy resolution approach:
-
*.reroot
. These trees are produced if a given input tree had monophyletic outgroups. Each tree has been re-rooted on the outgroups clade. -
*.1to1ortho.tre
. These trees are produced if there are no putative paralogs in the input tree. For this tutorial dataset, you can see that both subtrees produced by pruning deep paralogs for gene 5620
(see pipeline step three) contain no paralogs. -
*.ortho.tre
. These trees are produced by extracting the clade with the greatest number of non-repeating taxa from a tree rooted on a monophyletic outgroup.
For example, in the input tree for gene 6139
below (13_pre_paralog_resolution_trees -> 6139_1.outgroup_added.aln.trimmed.fasta.treefile
), there are two clades containing non-repeating taxon names (i.e., putative paralog clades), as shown by the red and green boxes:
The red clade contains five taxa, whereas the green clade contains eight taxa. The MO algorithm extracts the larger green clade, and writes the output tree 6139_1.ortho.tre
:
Open the output directory 15_pruned_MI
. Here you will see a number of tree types output by the Maximum Inclusion paralogy resolution approach:
-
*.1to1ortho.tre
. These trees are produced if there are no putative paralogs in the input tree. For this tutorial dataset, you can see that both subtrees produced by pruning deep paralogs for gene 5620
(see pipeline step three) contain no paralogs. -
*.MIortho1.tre
,*.MIortho2.tre
, etc. These trees are produced by iteratively cutting out the subtree with the highest number of taxa without taxon duplication.
For example, for the input tree for gene 6139
as shown above for the MO approach, the Maximum Inclusion method has extracted both the clade in the red box and the green box, producing the trees 6139_1.MIortho1.tre
(green box) and 6139_1.MIortho2.tre
(red box). Note that the outgroup sequences 80974.0
and 80973
are present in tree 6139_1.MIortho1.tre
:
...but not in the tree 6139_1.MIortho2.tre
:
Open the output directory 16_pruned_RT
. Here you will see a number of tree types output by the Rooted subTrees paralogy resolution approach:
-
*.inclade1
. These trees are produced by iteratively searching for the subtree with the highest number of ingroup taxa, and cutting it out as a rooted tree based on the position of one or more of the defined outgroup taxa. -
*.inclade1.ortho1.tre
,*.inclade1.ortho2.tre
, etc. These trees are produced by examining the inclade*.tre
subtrees described above, and inferring gene duplications from root to tips. When duplicated taxa are found between two child clades at a bifurcating node, the clade containing the smaller number of taxa is cut off.
Because the outgroups are monophyletic for all trees in this tutorial dataset, the *.inclade*.tre
trees are simply the input tree with the outgroup sequences removed. For example, for gene 6462
the tree 6462_1.inclade1
looks like:
You'll see that the RT paralogy resolution approach has recovered a single tree called 6462_1.inclade1.ortho1.tre
, containing nine tips (compared to the fifteen tips in tree 6462_1.inclade1
):
To understand why these tips were removed, imagine starting at the root of tree 6462_1.inclade1
and moving node-by-node towards the tips, trimming according to the following reasoning:
-
Node 1: taxon
79688
occurs only once in the tree, and so it is not removed. -
Node 2: taxon
79682
occurs in the tree twice (79682.main and 79682.0); for the current node, the child clade containing the fewest taxa is removed. In this case the removed 'clade' contains only one taxon (79682.main); as this is below the default minimum number of taxa for a clade to be retained (four), the 'clade' is discarded rather than being written to file. Note that this minimum taxa value can be changed using the parameter --minimum_taxa
. -
Node 3: one of the child clades at this node contains taxa
79687 and 79686; both of these taxa also occur in the other child clade, and so the smaller child clade is removed. As above, it contains fewer than the minimum required number of taxa, so it is discarded.
-
Node 4: one child clade contains the single taxon
79679
, which also occurs in the other child. The former child clade is removed and discarded. -
Node 5: the smaller child clade contains taxa
79689 and 79678; taxon 79678 occurs in the larger child clade, and so the smaller clade is removed and discarded.
-
Node 6: The remaining clade does not contain any repeated taxa, and is retained.
Information on the tips/clades that were removed can also be found in the log file for the step, located at 00_logs_and_reports -> logs -> prune_paralogs_<date_time>.log
.
In the sixth and final step of the pipeline, fasta sequences corresponding to tips in the 'resolved' paralog trees (step five) are recovered from the fasta files produced in step four (i.e., with outgroup sequences added). For each sequence file produced, the fasta headers (i.e., sequence names) are stripped to remove paralog IDs (for HybPiper, the ID is a suffix such as .main
, .1
etc.), and the sequences are aligned. Note that removal of the paralog IDs from sequence names allows concatenation of the individual paralog alignments to create a supermatrix, if desired for phylogenetic analyses. Versions of each alignment with the terminal ends trimmed using trimal
are also produced, using the default parameters -gapthreshold 0.12 -terminalonly -gw 1
.
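The header renaming described here can be illustrated with a single sed expression. This is a sketch of the idea only, not ParaGone's implementation: the function name strip_paralog_ids is hypothetical, and it assumes HybPiper-style paralog IDs, i.e. a .main or .<number> suffix at the end of the sequence name.

```shell
# Hypothetical helper (not part of ParaGone): remove HybPiper-style paralog
# IDs (".main" or ".<number>") from the sequence names in a FASTA file.
# Only header lines (starting with ">") are modified.
strip_paralog_ids() {
    # $1 = path to a FASTA file; the renamed FASTA is written to standard output
    sed -E '/^>/ s/\.(main|[0-9]+)([[:space:]]|$)/\2/' "$1"
}

# Example usage:
#   strip_paralog_ids <input>.fasta > <output>.fasta
```

After renaming, every header in a resolved alignment carries only the taxon name, which is what makes concatenation across genes by taxon possible.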
From within the test_dataset
directory, run the command:
paragone final_alignments --mo --rt --mi --pool 1 --threads 1 --keep_intermediate_files
The following folders will be produced:
-
23_MO_final_alignments
-
24_MI_final_alignments
-
25_RT_final_alignments
-
26_MO_final_alignments_trimmed
-
27_MI_final_alignments_trimmed
-
28_RT_final_alignments_trimmed
...along with several report files under the existing 00_logs_and_reports/reports
directory:
-
fasta_from_tree_mo_report.tsv
-
fasta_from_tree_mi_report.tsv
-
fasta_from_tree_rt_report.tsv
Note the flag --keep_intermediate_files
in the command above. Without this flag, the default behaviour of the paragone final_alignments
command is to delete all intermediate folders and files from the pipeline run except the following:
00_logs_and_reports
23_MO_final_alignments
24_MI_final_alignments
25_RT_final_alignments
26_MO_final_alignments_trimmed
27_MI_final_alignments_trimmed
28_RT_final_alignments_trimmed
This can be useful when you're running the pipeline on an HPC with a limit on the number of files that can be produced.
The pipeline run is now complete! The final paralog alignments can be used in single-gene coalescent phylogenetic analyses (e.g. using Astral), or concatenated into a supermatrix for analyses with e.g. IQ-TREE, BEAST etc.
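For reference, the six commands used in this tutorial can be collected in one shell function that stops at the first step that fails. This is a minimal sketch: the function name run_paragone_steps is hypothetical, the supplied run_paragone_test_dataset.sh script may be organised differently, and the commands and options are exactly those shown in the steps above.

```shell
# Hypothetical wrapper: run the six tutorial steps in order, from within the
# test_dataset directory, stopping at the first step that fails.
run_paragone_steps() {
    paragone check_and_align paralog_input \
        --external_outgroups_file external_outgroups.fasta \
        --internal_outgroup 80974 --pool 1 --threads 1 || return 1
    paragone alignment_to_tree 04_alignments_trimmed_hmmcleaned \
        --use_fasttree --pool 1 --threads 1 || return 1
    paragone qc_trees_and_extract_fasta 04_alignments_trimmed_hmmcleaned \
        --treeshrink_q_value 0.20 \
        --cut_deep_paralogs_internal_branch_length_cutoff 0.04 || return 1
    paragone align_selected_and_tree 04_alignments_trimmed_hmmcleaned \
        --use_fasttree --pool 1 --threads 1 || return 1
    paragone prune_paralogs --mo --rt --mi || return 1
    paragone final_alignments --mo --rt --mi --pool 1 --threads 1 \
        --keep_intermediate_files || return 1
}

# Example usage, from within the test_dataset directory:
#   run_paragone_steps
```

Stopping on the first failure avoids running later steps against missing or incomplete output folders.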