Skip to content

6: Sequence Selection

Daniel Portik edited this page Oct 2, 2019 · 20 revisions

Sequence Selection


Overview

The sequence selection stage of SuperCRUNCH allows you to construct different types of datasets, including traditional species-level supermatrices as well as population-level datasets. There are a variety of methods for sorting and filtering sequences. These methods are perhaps most important for constructing supermatrices, where a single representative sequence often must be chosen from a pool of available sequences for each taxon for each locus. For example, the longest sequence can be selected or a random selection can be made, and a minimum length filter can be applied. Additional filters include requiring an error-free reading frame to retain a coding sequence, or only including sequences which have associated voucher information. The latter option, which only retains vouchered sequences, is particularly useful for constructing phylogeographic datasets. Extensive documentation for sequence selection is available in the Selecting Sequences section below.

Following the sequence selection step, there will likely be a reduction in the number of sequences in the filtered locus-specific fasta files (especially for supermatrices). At this point, there may be loci for which there simply aren't enough data to make them worth including. For a supermatrix, this translates to the number of species present for a locus, but for population-level datasets this is the actual number of retained sequences. It is easy to set a minimum threshold for the number of taxa/sequences and apply it to all loci, removing those which fall below the threshold. More details are provided in the Enforcing Minimum Sequence/Taxon Requirement section below.

A major challenge of working with GenBank data is organizing the accession numbers for publication, which allows the dataset to be reproduced. This process can be performed automatically using SuperCRUNCH, which will produce a table containing all taxa and their corresponding accession numbers for all included sequences. More information is provided in the Creating a Table of Accession Numbers From All Loci below.

Finally, one of the main motivations behind SuperCRUNCH was to construct supermatrix datasets from GenBank sequences in an objective, repeatable manner. For most organisms, it would be impossible to build identical supermatrix datasets using subjective criteria. When multiple sequences are available for taxa across loci, the number of possible supermatrices is staggering. As a fun exercise, I created a tool which calculates this number based on the dataset used to generate the supermatrix. I encourage you to try it out! Details are provided in the Calculating the Possible Number of Supermatrix Combinations below.


Selecting Sequences

The selection of sequences ultimately determines the type of dataset produced. There are two fundamentally different datasets produced by this step:

Species-level supermatrices

This type of dataset results from selecting one representative sequence per taxon per locus. The taxa in these datasets can include species and/or subspecies. For a given taxon, this requires selecting a representative sequence for each locus if multiple sequences are available. For a complete supermatrix, this would mean each locus would contain the same number of sequences, which is equal to the number of species included. Species-level supermatrices are typically used for phylogenetic analyses, to reconstruct the evolutionary relationships among the species included in the supermatrix.

Population-level datasets

This type of dataset results from retaining all available sequences per taxon. It is population-level or phylogeographic sampling. These datasets can contain high variation in the number of sequences per locus. A population-level dataset can also be composed of vouchered samples of a single species or a group of closely related species. Although the samples may represent the same species, they come from distinct vouchered specimens. The voucher code will result in an identical label for that sample across loci. This allows sequences to be assigned to the same sample, which can be used to produce phylogeographic supermatrices. Population-level datasets can be used for phylogenetic, phylogeographic, and population genetic analyses.

The Filter_Seqs_and_Species.py module can be used to produce both of the above datasets. The filtering options associated with both types of datasets is explained in greater detail below.

Back to top


Filter_Seqs_and_Species.py

The Filter_Seqs_and_Species.py module is used to filter sequences to produce distinct types of datasets, including species-level supermatrices and population-level datasets. This decision is set using the -s flag, with -s oneseq producing a species-level supermatrix and -s allseqs producing a population-level dataset. A list of taxa to include must be provided (-t). By default, subspecies names are included in the searches, but subspecies can be 'turned off' using the --no_subspecies flag. For more information on how taxon searches with and without subspecies work, please see the section here. Filter_Seqs_and_Species.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores. A subset of files in the input directory can be analyzed using the --onlyinclude flag, which requires the full path to a simple text file containing the names of the files to include (one file name per line).

The available filtering options differ depending on which dataset is selected, so they will be discussed separately to prevent confusion.

Species-level supermatrices

For the species-level supermatrix, a representative sequence must be selected for each taxon for each locus. There are several options for choosing a sequence, which are controlled by a combination of the required -f and -m flags, and the optional --randomize and --vouchered flags.

  • -f : The main options for choosing a sequence are by length (-f length) or by translation (-f translate). For any type of locus, the sequences can be sorted by length and the longest sequence selected (-f length). For coding loci, the sequences can be sorted by length and then translated into all reading frames (-f translate). The longest sequence with a valid reading frame (no internal stop codons) will be selected. If no sequences have a valid reading frame, the longest sequence will be selected (rather than exclude the taxon for that locus). For the -f translate option, the translation table can be specified using the --table flag. If your dataset contains a mix of loci that need to be filtered differently (e.g., coding mtDNA vs. coding nucDNA vs. noncoding), the --onlyinclude flag can be used. This requires the full path to a text file containing the names of the fasta files (one file name per line) to be processed for a particular run. A separate output directory (-o ) should be specified for each distinct run.

  • -m : The required -m flag sets a minimum base pair length required to keep sequences. That is, for any sequence to be selected using the -f length or -f translate method, it must be longer than the base pair value set by the -m flag. If no sequences are long enough, the taxon will be excluded for that locus.

  • --randomize: This option will change how sequences are initially sorted, resulting in a random choice for sequence selection. When used with -f length, a random sequence will be selected from the set available, but that sequence must be greater than the minimum base pair length set by the -m flag. When used with -f translate, a random sequence will be selected from only the sequences which pass translation, but that sequence must be greater than the minimum base pair length set by the -m flag. The randomize option can be used to generate different permutations of the supermatrix, otherwise the same supermatrix will be recovered every time using a specific combination of the -f and -m flags. These different sequence combinations for the supermatrix can be used to test the robustness of species relationships.

  • --vouchered: If this flag is used, only sequences that have a voucher tag will be considered for the above filtering options. The voucher field is present in the original description line (voucher, isolate, or strain), and produces a Voucher_[ID] tag in the description line during the Parse_Loci.py module. This is generally not recommended for creating species-level supermatrices, but could be useful under certain circumstances.

Population-level datasets

For population-level datasets, there is no need to select one sequence per taxon per locus - all sequences are retained. However, there are still some filters that can be used to ensure only high-quality sequences are retained. The required -f and -m flags, as well as the optional --vouchered flag, can be used to control this filtering. The optional --randomize flag will have no effect on population-level datasets.

  • -f : The main options for choosing a sequence are by length (-f length) or by translation (-f translate). For population-level datasets, only the translation option will have any effect. If the -f length option is used, all sequences will be retained. If the -f translate option is used, only sequences that pass translation will be retained. For the -f translate option, the translation table can be specified using the --table flag. If your dataset contains a mix of loci that need to be filtered differently (e.g., coding mtDNA vs. coding nucDNA vs. noncoding), the --onlyinclude flag can be used. This requires the full path to a text file containing the names of the fasta files (one file name per line) to be processed for a particular run. A separate output directory (-o ) should be specified for each distinct run.

  • -m : The required -m flag sets a minimum base pair length required to keep sequences. For population-level datasets, this will simply eliminate sequences that are shorter than the base pair value set by the -m flag.

  • --vouchered: If this flag is used, only sequences that have a voucher tag will be retained. The voucher field is present in the original description line (voucher, isolate, or strain), and produces a Voucher_[ID] tag in the description line during the Parse_Loci.py module. This option is extremely useful for reconstructing phylogeographic datasets, especially if the starting sequences come from a single study. The voucher code will result in an identical label for that sample across loci. This allows sequences to be assigned to the same sample, which can be used to produce cohesive phylogeographic datasets and phylogeographic supermatrices.

Basic Usage:

python Filter_Seqs_and_Species.py -i <input directory> -o <output directory> -s <choice> -f <dataset choice> -m <minimum base pair length> -t <taxon file>

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory which contains the fasta files to filter by species and sequence.

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write output files.

-s <choice> or --seq_selection <choice>

Required: Select one representative sequence per taxon or select all sequences available per taxon. Choices = oneseq, allseqs.

-f <choice> or --seq_filter <choice>

Required: Strategy for filtering sequence data, particularly for selecting one representative sequence using the -s 'oneseq' option. If the -s allseqs option is desired, use -f length. Choices = translate, length.

-m <integer> or --min_length <integer>

Required: An integer for the minimum number of base pairs required to keep a sequence (ex. 150).

-t <path-to-file> or --taxa <path-to-file>

Required: The full path to a text file containing all taxon names to cross-reference in the fasta file(s).

--no_subspecies

Optional: Ignore subspecies labels in both the taxon names file and the fasta file.

--randomize

Optional: For taxa with multiple sequences, shuffle order randomly. Overrides initial sorting by length for sequence selection step.

--vouchered

Optional: Select only sequences that contain a Voucher_[ID] tag in the description line (inserted by Parse_Loci.py module).

--table <choice>

Required for -f translate: Specifies translation table. Choices = standard, vertmtdna, invertmtdna, yeastmtdna, plastid, or any integer 1-31.

--onlyinclude

Optional: The full path to a text file containing the names of the fasta files to process for this run.

--quiet

Optional: Show less output while running (useful when filtering many, many loci).

Example Usage:

python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s oneseq -f length -m 400 -t Filter-seqs/Taxon-List.txt 

Above command will create a species-level supermatrix using the length method with a minimum base pair length of 400, using the taxon list Taxon-List.txt, with subspecies allowed.

python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s oneseq -f length -m 400 -t Filter-seqs/Taxon-List.txt --randomize --no_subspecies

Above command will create a species-level supermatrix using the length + randomize method (a true random selection) with a minimum base pair length of 400, using the taxon list Taxon-List.txt with the --no_subspecies option.

python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s oneseq -f translate -m 200 -t Filter-seqs/Taxon-List.txt --no_subspecies --table standard --onlyinclude Filter-seqs/Nuc_File_List.txt --quiet 

Above command will create a species-level supermatrix using the translation method with standard code and a minimum base pair length of 200, using the taxon list Taxon-List.txt with the --no_subspecies option. It will only run for the fasta files included in the Nuc_File_List.txt file. The --quiet flag will make the screen output less verbose.

python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s allseqs -f length -m 400 -t Filter-seqs/Taxon-List.txt 

Above command will create a population-level dataset using the length method with a minimum base pair length of 400, using the taxon list Taxon-List.txt, with subspecies allowed.

python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s allseqs -f translate -m 200 -t Filter-seqs/Taxon-List.txt --no_subspecies --table vertmtdna --onlyinclude Filter-seqs/mtDNA_File_List.txt

Above command will create a population-level dataset using the translation method with vertebrate mitochondrial code and a minimum base pair length of 200, using the taxon list Taxon-List.txt with the --no_subspecies option. It will only run for the fasta files included in the mtDNA_File_List.txt file.

Outputs:

Several outputs are created in the output directory specified. A directory called /Results is created, which contains several additional directories.

  • Directory /Results/Accession-Number-Files - For each locus, this directory contains two files: NAME_accession_list_by_species.txt and NAME_accessions_for_BatchEntrez.txt.

The NAME_accession_list_by_species.txt file contains a list of accession numbers for all the sequences available for a given taxon. Here are the partial contents of such a file:

Acanthosaura armata	AB266452.1, NC_014175.1
Acanthosaura crucigera	AB031963.1
Acanthosaura lepidogaster	KR092427.1
Agama impalearis	AB271238.1, KT005791.1
Amblyrhynchus cristatus	KR350814.1, KR350815.1, KR350816.1, KR350817.1, KR350823.1, KR350824.1, KR350825.1, KR350826.1, KR350827.1, KR350828.1, KR350829.1, KR350830.1, KR350831.1, KR350832.1, KR350833.1, KR350834.1, KR350835.1, KR350836.1, KR350837.1, KT277937.1, NC_028031.1

The LOCUS_accessions_for_BatchEntrez.txt file contains a simple list of all accession numbers for all the sequences available for all taxa. Here are the partial contents of such a file:

AB266452.1
NC_014175.1
AB031963.1
KR092427.1
AB271238.1
KT005791.1
KR350814.1
KR350815.1
KR350816.1
KR350817.1
KR350823.1
KR350824.1
KR350825.1
KR350826.1
KR350827.1
KR350828.1
KR350829.1
KR350830.1
KR350831.1
KR350832.1
KR350833.1
KR350834.1
KR350835.1
KR350836.1
KR350837.1
KT277937.1
NC_028031.1
  • Directory /Results/Taxon-Log-Files - For each locus, this directory contains a summary file NAME_species_log.txt that contains the following columns:

    • Taxon - The taxon name for this sequence.

    • Acession - The accession number of the sequence selected for this taxon (species-level supermatrix) or simply the sequence in that row (population-level dataset).

    • SeqLength - The length of the sequence corresponding to this accession number.

    • Vouchered - A yes or no as to whether the sequence contained a Voucher_[ID] tag.

    • PassedTranslation - A yes or no as to whether the sequence passed translation for -f translate or a NA for -f length.

    • SeqsAvailable - The number of alternative sequences available for this taxon.

Here is an example from using the -s oneseq and -f length options:

Taxon	Acession	SeqLength	Vouchered	PassedTranslation	SeqsAvailable
Acanthosaura armata	AB266452.1	833	no	NA	2
Acanthosaura crucigera	AB031963.1	360	no	NA	1
Acanthosaura lepidogaster	KR092427.1	834	no	NA	1
Agama impalearis	KT005791.1	338	yes	NA	2
Amblyrhynchus cristatus	KT277937.1	944	no	NA	21

Here is an example from using the -s oneseq and -f length options:

Taxon	Acession	SeqLength	Vouchered	PassedTranslation	SeqsAvailable
Acanthosaura lepidogaster	JF806003.1	630	no	yes	1
Agama agama	DQ340698.1	688	yes	yes	4
Amblyrhynchus cristatus	KR350716.1	620	yes	yes	3
Amphibolurus muricatus	KF871470.1	718	yes	yes	61
Amphibolurus norrisi	KF871494.1	718	yes	yes	1

Here is an example from using the -s allseqs and -f length options:

Callisaurus draconoides bogerti	DQ001804.1	1087	yes	NA	2
Callisaurus draconoides bogerti	EU543757.1	1087	no	NA	2
Callisaurus draconoides brevipes	DQ001803.1	1082	yes	NA	1
Callisaurus draconoides carmenensis	DQ001780.1	1087	yes	NA	17
Callisaurus draconoides carmenensis	DQ001782.1	1087	yes	NA	17
Callisaurus draconoides carmenensis	DQ001784.1	1087	yes	NA	17
Callisaurus draconoides carmenensis	DQ001785.1	1087	yes	NA	17
Callisaurus draconoides carmenensis	DQ001790.1	1087	yes	NA	17
Callisaurus draconoides carmenensis	DQ001793.1	1087	yes	NA	17
Callisaurus draconoides carmenensis	DQ001781.1	1086	yes	NA	17
Callisaurus draconoides carmenensis	EU543755.1	1083	no	NA	17
Callisaurus draconoides carmenensis	DQ001783.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001786.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001787.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001788.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001789.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001791.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001792.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001794.1	1076	yes	NA	17
Callisaurus draconoides carmenensis	DQ001795.1	1076	yes	NA	17
  • Directory /Results/Filtered-Fasta-Files - For each locus, this directory contains a fasta file with the sequences that passed the sequence selection filtering. If the -s oneseq was used, the file will be called NAME_oneseq.fasta. If the -s allseqs option was used, the file will be called NAME_allseqs.fasta. These files should be used for the downstream steps.

Back to top


Enforcing Minimum Sequence/Taxon Requirement

Following the sequence selection step, there will likely be a reduction in the number of sequences in the filtered locus-specific fasta files (especially for supermatrices). At this point, there may be loci for which there simply aren't enough data to make them worth including. For a supermatrix, this translates to the number of species present for a locus, but for population-level datasets this is the actual number of retained sequences. It is easy to set a minimum threshold for the number of taxa/sequences and apply it to all loci, removing those which fall below the threshold.

Back to top


Fasta_Filter_by_Min_Seqs.py

The purpose of the Fasta_Filter_by_Min_Seqs.py module is to examine fasta files for the number of sequences contained within. If taxa are represented by a single sequence, this is equivalent to the number of taxa. Running the script with only the -i and --min_seqs flags will simply print the number of fasta files passing and failing the minimum sequence filter. The minimum number of sequences is an integer specified using the --min_seqs flag. Running this module with the -i , --min_seqs, and the optional -o flags will copy the fasta files that pass the minimum sequence filter to the output directory specified with the -o flag.

Fasta_Filter_by_Min_Seqs.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

Basic Usage:

To simply show results on screen (no files copied):

python Fasta_Filter_by_Min_Seqs.py -i <input directory> --min_seqs <minimum sequences required>

To show results and copy files passing filter to output directory:

python Fasta_Filter_by_Min_Seqs.py -i <input directory> --min_seqs <minimum sequences required> -o <output directory>

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory of fasta files.

--min_seqs <integer>

Required: The minimum number of taxa/sequences required.

-o <path-to-directory> or --outdir <path-to-directory>

Optional: The full path to an existing directory to copy files passing the filter.

Example Usage:

python Fasta_Filter_by_Min_Seqs.py -i bin/Min-Seqs/Input --min_seqs 50

Above command will assess how many fasta files present in the input directory bin/Min-Seqs/Input have greater than or equal to 50 sequences, and print this to the screen.

python Fasta_Filter_by_Min_Seqs.py -i bin/Min-Seqs/Input --min_seqs 50 -o bin/Min-Seqs/Output

Above command will assess how many fasta files present in the input directory bin/Min-Seqs/Input have greater than or equal to 50 sequences, print this to the screen, and copy the files that passed to the designated output directory bin/Min-Seqs/Output.

Outputs:

If the module is run without the output directory flag (-o ), no outputs are produced. The information is simply printed to the screen. If the output directory flag (-o ) is specified, the following directory will be created in the output directory:

  • Directory /Filtered-Fasta-Files - This directory will contain all the fasta files that passed the minimum sequence threshold.

Here is an example of the information printed to the screen:

python Fasta_Filter_by_Min_Seqs.py -i bin/07-Filter-seqs/Output/Results/Filtered-Fasta-Files --min_seqs 100


Found 67 total fasta files.
Processing...


56 fasta files passed the minimum sequence filter (>=100 seqs).
11 fasta files failed the minimum sequence filter (>=100 seqs).




--------------------------------------------------------------------------------------

Finished. Elapsed time : 0:00:02.740741 (H:M:S)

--------------------------------------------------------------------------------------

If the number of loci retained is too low, it is trivial to re-run the module with different minimum sequence thresholds to find an appropriate value. Here is the same dataset with a lower threshold:

python Fasta_Filter_by_Min_Seqs.py -i bin/07-Filter-seqs/Output/Results/Filtered-Fasta-Files --min_seqs 30


Found 67 total fasta files.
Processing...


61 fasta files passed the minimum sequence filter (>=30 seqs).
6 fasta files failed the minimum sequence filter (>=30 seqs).




--------------------------------------------------------------------------------------

Finished. Elapsed time : 0:00:00.030574 (H:M:S)

--------------------------------------------------------------------------------------

After the ideal value is found, the -o flag can be used to copy the fasta files passing the filter to the relevant output directory.

Back to top


Creating a Table of Accession Numbers From All Loci

Organizing GenBank accession numbers for large projects can be exceptionally challenging. SuperCRUNCH includes a module called Make_Acc_Table.py that automates this process, resulting in a formatted table with rows representing species, subspecies, or vouchered specimens, and the columns representing loci. This table will only be created from a dataset which contains one sequence per locus for each taxon. If multiple sequences exist for a given taxon (species or vouchered specimen) for a given locus (as with some population-level datasets), this will result in an error.

Back to top


Make_Acc_Table.py

The Make_Acc_Table.py module is a tool for extracting taxon and accession number information from a directory of fasta files, for the purpose of creating a table of GenBank accession numbers.

A comprehensive list of taxa is collected across all fasta files in the directory. The taxon names recovered are two-part by default (e.g., Genus species). To include three-part names (e.g., Genus species subspecies), the optional -s flag can be used. This requires the full path to a taxon names file. This taxon names file can contain a mix of two-part and three-part names, and can be the same file used for earlier steps, but only the three-part names are used to aid the searches in Make_Acc_Table.py. If this is a vouchered data set (produced from the Filter_Seqs_and_Species.py module using the --voucherize option), the --voucherize flag should also be used here to construct the taxon labels with the voucher information present. This will produce an accession table where each entry is a taxon + voucher label, rather than just a taxon label. Examples of both types are shown below. If a vouchered data set is run without the --voucherize flag, an error will be thrown indicating duplicate taxon labels are present within a single fasta file, or that multiple sequences are available for the taxon.

The final table is written with species, subspecies, or vouchered specimen labels in the first column (sorted alphabetically) and the remaining columns represent the loci (also sorted alphabetically by fasta file name). The rows will either be filled with accession numbers if the sequence is present, or with dashes if there is missing data.

This tool can be used on unaligned or aligned fasta files, as long as they contain full description lines for sequences.

Make_Acc_Table.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

Basic Usage:

python -i <input directory> -o <output directory> 

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory which contains the input fasta files.

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write output files.

-s <path-to-file> or --subspecies <path-to-file>

Optional: If subspecies names are to be included, this should be the full path to a text file containing all subspecies names to cross-reference in the fasta file. This can be the same taxon list used in earlier steps.

--voucherize

Optional: If a 'Voucher_[ID]' entry is included in the sequence description, append the ID component to the end of the species/subspecies label (e.g., Hyperolius_nitidulus_MVZ236473). The Voucher_[ID] field is generated by the Parse_Loci.py module.

Example Usage:

python Make_Acc_Table.py -i bin/Acc-table/Input -o bin/Acc-table/Output

Above command will create an accession table from all fasta files present in bin/Acc-table/Input and write the outputs to the designated output directory. By default, subspecies labels will be ignored.

python Make_Acc_Table.py -i bin/Acc-table/Input -o bin/Acc-table/Output -s bin/Acc-table/Taxon-list.txt

Above command will create an accession table from all fasta files present in bin/Acc-table/Input and write the outputs to the designated output directory. The taxon labels will include any subspecies labels present in the file Taxon-list.txt.

python Make_Acc_Table.py -i bin/Acc-table/Input -o bin/Acc-table/Output --voucherize

Above command will create an accession table from all fasta files present in bin/Acc-table/Input and write the outputs to the designated output directory. The taxon labels will include the voucher codes, so that the same species can be represented by many uniquely labeled samples.

Outputs:

Several outputs are created in the output directory specified:

  • File Taxon_Summary.txt - A two-column file which contains the taxon label and the number of sequences present in the supermatrix. In other words, the taxon and the number of loci for that taxon.

Example output for a species-level supermatrix dataset:

Taxon	Loci
Acanthocercus_adramitanus	4
Acanthocercus_annectens	6
Acanthocercus_atricollis	6
Acanthocercus_cyanogaster	6
Acanthocercus_yemensis	8
Acanthosaura_armata	7
Acanthosaura_capra	2
Acanthosaura_crucigera	5
Acanthosaura_lepidogaster	44

Example output for a population-level vouchered dataset:

Taxon	Loci
Trachylepis_aurata_ZFMK75837	4
Trachylepis_aurata_ZFMK84085	4
Trachylepis_punctulata_MBUR00322	4
Trachylepis_punctulata_MCZA38906	4
Trachylepis_punctulata_MCZA38924	4
Trachylepis_sulcata_AMB4288	4
Trachylepis_sulcata_AMB4291	1
Trachylepis_sulcata_AMB4596	4
Trachylepis_sulcata_AMB4620	4
Trachylepis_sulcata_AMB4693	3
Trachylepis_sulcata_AMB4766	4
Trachylepis_sulcata_AMB4767	4
Trachylepis_sulcata_AMB4782	4
  • File GenBank_Accession_Table.txt - The actual table of taxa and accession numbers.

Example output for a species-level supermatrix dataset:

Taxon	12S	16S	ADNP	AHR	AKAP9	BACH1
Acanthocercus adramitanus	-	KU097507.1	-	-	-	-
Acanthocercus annectens	-	JX668129.1	-	-	-	-
Acanthocercus atricollis	-	JX668132.1	-	-	-	-
Acanthocercus cyanogaster	-	JX668135.1	-	-	-	-
Acanthocercus yemensis	-	JX668140.1	-	-	-	-
Acanthosaura armata	AB266452.1	AB266452.1	-	-	-	-
Acanthosaura capra	-	-	-	-	-	-
Acanthosaura crucigera	AB031963.1	MG935713.1	-	-	-	-
Acanthosaura lepidogaster	KR092427.1	KR092427.1	JF818219.1	JF818274.1	JF805815.1	JF805886.1

Example output for a population-level vouchered dataset:

Taxon	EXPH5_extracted_allseqs	KIF24_extracted_allseqs	ND2_extracted_allseqs	RAG1_extracted_allseqs	
Trachylepis aurata ZFMK75837	GU931406.1	GU931476.1	GU931545.1	GU931613.1
Trachylepis aurata ZFMK84085	GU931407.1	GU931477.1	GU931546.1	GU931614.1
Trachylepis punctulata MBUR00322	GU931414.1	GU931543.1	GU931612.1	GU931679.1
Trachylepis punctulata MCZA38906	GU931413.1	GU931542.1	GU931611.1	GU931678.1
Trachylepis punctulata MCZA38924	GU931415.1	GU931541.1	GU931610.1	GU931677.1
Trachylepis sulcata AMB4288	HQ829667.1	HQ829699.1	HQ829735.1	HQ829771.1
Trachylepis sulcata AMB4291	-	-	HQ829736.1	-
Trachylepis sulcata AMB4596	GU931439.1	GU931483.1	GU931552.1	GU931620.1
Trachylepis sulcata AMB4620	GU931440.1	GU931484.1	GU931553.1	GU931621.1
Trachylepis sulcata AMB4693	-	HQ829700.1	HQ829737.1	HQ829772.1
Trachylepis sulcata AMB4766	GU931437.1	GU931485.1	GU931554.1	GU931622.1
Trachylepis sulcata AMB4767	GU931438.1	GU931486.1	GU931555.1	GU931623.1
Trachylepis sulcata AMB4782	GU931442.1	GU931487.1	GU931556.1	GU931624.1

Back to top


Calculating the Possible Number of Supermatrix Combinations

When creating a supermatrix, if each taxon had only one sequence per locus then the possible number of supermatrix combinations is one. If there are multiple sequences available per locus, then the number of possible combinations can be incredibly high. The Infer_Supermatrix_Combinations.py module can be used to calculate the exact number of sequence combinations for a given supermatrix, given a set of sequences available for filtering. In other words, it will tell you how many possible supermatrices can be constructed based on the starting sequences.

Back to top


Infer_Supermatrix_Combinations.py

A tool for calculating how many supermatrix combinations are available, given the number of filtered sequences available for each taxon for each locus. If all taxa have only one sequence available, the answer is one, but if taxa have multiple sequences available, this number will be extremely large. Relies on the NAME_species_log.txt files produced from the Filter_Seqs_and_Species.py module, which are located in the Results/Taxon-Log-Files/ output directory, to calculate the number of sequences available per taxon. The log files for all loci should be present in the input directory for the calculation to be accurate. No output files are created, rather the information is logged to the screen. Note - Infer_Supermatrix_Combinations.py is unable to process population-level vouchered datasets, it only works for species/subspecies level supermatrices.

Basic Usage:

python -i <input directory>

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory which contains all the NAME_species_log.txt files. This should be the Results/Taxon-Log-Files/ output directory from the Filter_Seqs_and_Species.py module.

Example Usage:

python Infer_Supermatrix_Combinations.py -i bin/07-Filter-seqs/Output/Results/Taxon-Log-Files 

Above command will use the NAME_species_log.txt files in the input directory to calculate the number of supermatrix combinations.

Outputs:

No outputs are created - the results are written to the screen.

Here is an example from a very large supermatrix:

Found 67 loci to examine.


	Parsing information in 12S_extracted_contfiltered_species_log.txt
	Parsing information in 16S_extracted_contfiltered_species_log.txt
	Parsing information in ADNP_extracted_species_log.txt
	Parsing information in AHR_extracted_species_log.txt
	Parsing information in AKAP9_extracted_species_log.txt
	Parsing information in AMEL_extracted_species_log.txt
	Parsing information in BACH1_extracted_species_log.txt
	Parsing information in BACH2_extracted_species_log.txt
	Parsing information in BDNF_extracted_species_log.txt
	Parsing information in BHLHB2_extracted_species_log.txt
	Parsing information in BMP2_extracted_species_log.txt
	Parsing information in CAND1_extracted_species_log.txt
	Parsing information in CARD4_extracted_species_log.txt
	Parsing information in CILP_extracted_species_log.txt
	Parsing information in CMOS_extracted_species_log.txt
	Parsing information in CO1_extracted_species_log.txt
	Parsing information in CXCR4_extracted_species_log.txt
	Parsing information in CYTB_extracted_species_log.txt
	Parsing information in DLL1_extracted_species_log.txt
	Parsing information in DNAH3_extracted_species_log.txt
	Parsing information in ECEL1_extracted_species_log.txt
	Parsing information in ENC1_extracted_species_log.txt
	Parsing information in EXPH5_extracted_species_log.txt
	Parsing information in FSHR_extracted_species_log.txt
	Parsing information in FSTL5_extracted_species_log.txt
	Parsing information in GALR1_extracted_species_log.txt
	Parsing information in GHSR_extracted_species_log.txt
	Parsing information in GPR37_extracted_species_log.txt
	Parsing information in HLCS_extracted_species_log.txt
	Parsing information in INHIBA_extracted_species_log.txt
	Parsing information in KIAA2018_extracted_species_log.txt
	Parsing information in KIF24_extracted_species_log.txt
	Parsing information in LRRN1_extracted_species_log.txt
	Parsing information in LZTS1_extracted_species_log.txt
	Parsing information in MC1R_extracted_species_log.txt
	Parsing information in MKL1_extracted_species_log.txt
	Parsing information in MLL3_extracted_species_log.txt
	Parsing information in MSH6_extracted_species_log.txt
	Parsing information in MXRA5_extracted_species_log.txt
	Parsing information in ND1_extracted_species_log.txt
	Parsing information in ND2_extracted_species_log.txt
	Parsing information in ND4_extracted_species_log.txt
	Parsing information in NGFB_extracted_species_log.txt
	Parsing information in NKTR_extracted_species_log.txt
	Parsing information in NOS1_extracted_species_log.txt
	Parsing information in NT3_extracted_species_log.txt
	Parsing information in PDC_extracted_species_log.txt
	Parsing information in PNN_extracted_species_log.txt
	Parsing information in PRLR_extracted_species_log.txt
	Parsing information in PTGER4_extracted_species_log.txt
	Parsing information in PTPN_extracted_species_log.txt
	Parsing information in R35_extracted_species_log.txt
	Parsing information in RAG1p1_extracted_species_log.txt
	Parsing information in RAG1p2_extracted_species_log.txt
	Parsing information in RAG2_extracted_species_log.txt
	Parsing information in REV3L_extracted_species_log.txt
	Parsing information in RHO_extracted_species_log.txt
	Parsing information in SLC30A1_extracted_species_log.txt
	Parsing information in SLC8A1_extracted_species_log.txt
	Parsing information in SLC8A3_extracted_species_log.txt
	Parsing information in SNCAIP_extracted_species_log.txt
	Parsing information in SOCS5_extracted_species_log.txt
	Parsing information in TRAF6_extracted_species_log.txt
	Parsing information in UBN1_extracted_species_log.txt
	Parsing information in VCPIP_extracted_species_log.txt
	Parsing information in ZEB2_extracted_species_log.txt
	Parsing information in ZFP36L1_extracted_species_log.txt


Found 59,219 total sequences for 1,426 taxa.


Number of possible supermatrix combinations (unwieldy integer) = 
9,155,211,937,028,784,139,400,729,807,279,775,488,020,851,133,423,680,100,069,848,449,615,
916,157,495,787,793,718,608,810,311,404,937,233,000,214,247,836,252,455,768,476,628,861,
064,620,200,711,350,158,661,811,743,694,197,397,688,387,145,249,835,198,884,996,018,769,
083,230,608,987,218,568,872,312,142,336,848,250,842,612,365,321,530,772,903,030,234,569,
416,127,858,213,723,379,748,832,537,328,917,920,565,791,037,157,172,944,443,004,391,099,
462,043,175,267,678,305,264,769,042,131,976,141,366,968,096,832,712,391,298,765,975,927,
494,157,940,235,330,313,905,291,769,437,463,196,103,934,323,818,989,284,191,519,710,411,
868,781,740,939,402,237,791,189,661,891,854,367,398,490,891,577,453,810,697,124,756,345,
828,950,561,902,614,727,247,462,339,434,248,981,829,916,772,790,684,231,918,401,408,679,
804,593,518,264,530,597,506,699,156,559,180,838,083,697,452,342,693,019,916,934,095,427,
926,284,239,186,778,760,537,631,648,689,778,685,028,875,780,934,811,195,903,383,297,325,
404,674,321,132,173,191,772,706,976,318,676,694,222,478,165,122,633,462,198,760,623,367,
676,347,049,039,942,766,779,185,599,670,242,800,590,746,443,028,479,628,344,624,129,788,
986,052,267,044,293,545,467,128,252,676,303,437,334,660,672,457,382,059,030,736,971,425,
711,493,163,346,019,711,361,032,544,725,739,717,915,551,166,357,417,654,655,771,689,510,
332,660,513,586,187,033,518,591,298,795,741,174,413,719,973,333,787,649,349,315,999,314,
507,241,608,217,798,993,354,659,849,179,712,089,626,837,565,629,250,942,040,077,001,508,
191,915,175,670,077,378,115,091,776,926,928,482,860,747,377,191,043,304,997,916,940,735,
024,551,373,227,109,328,366,639,344,090,890,543,448,467,140,380,695,265,233,745,430,615,
945,648,409,727,981,469,031,706,286,879,836,149,901,114,757,957,593,600,527,674,727,143,
180,796,022,929,116,545,586,123,339,554,137,359,960,085,564,193,538,396,534,200,757,633,
184,200,669,679,266,085,615,357,765,708,228,346,907,859,684,495,445,211,346,974,768,841,
479,305,103,785,350,888,613,151,991,051,292,954,691,705,073,714,299,347,256,083,000,951,
211,085,722,425,097,834,216,271,773,699,005,376,686,144,260,411,771,234,840,626,714,359,
270,153,124,657,096,088,661,332,547,307,732,121,596,668,803,070,074,096,449,199,006,045,
247,706,986,924,718,606,717,585,336,204,217,255,444,742,327,718,898,432,886,194,233,760,
704,248,970,931,681,940,763,022,191,757,491,596,631,731,120,222,921,779,768,634,166,891,
835,935,791,116,498,217,263,382,315,692,521,267,910,732,306,133,095,900,014,405,539,127,
600,704,170,287,276,712,651,906,937,442,866,872,818,288,101,368,989,453,287,779,969,318,
819,898,191,343,359,904,988,879,067,810,763,926,073,380,355,658,042,061,087,634,040,016,
432,163,443,052,335,929,991,158,397,475,416,986,689,641,305,337,653,559,321,620,263,933,
413,094,202,049,855,727,193,018,278,271,867,141,665,326,736,449,502,598,153,467,584,350,
105,687,056,319,894,338,714,617,001,215,119,927,747,047,282,846,350,464,508,644,061,072,
057,356,052,375,610,897,167,524,000,976,693,378,953,815,577,177,018,947,825,769,466,688,
174,297,878,405,414,929,748,731,885,698,125,477,079,134,891,381,149,885,821,579,355,476,
251,275,796,728,042,602,310,223,150,781,666,995,693,268,999,220,016,697,340,239,326,593,
524,625,965,198,899,500,437,113,743,375,571,212,933,192,821,896,601,719,186,689,705,541,
886,040,385,798,403,663,116,845,223,343,872,067,947,547,542,713,178,802,957,834,364,493,
489,291,999,989,116,740,589,281,208,790,782,373,145,513,713,699,734,975,346,637,798,772,
073,453,971,876,231,105,322,349,886,484,465,571,212,532,870,605,116,947,646,621,165,966,
290,219,238,774,388,367,531,756,483,014,845,991,343,688,549,851,631,099,727,693,531,221,
675,011,436,766,560,256,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000.


Number of possible supermatrix combinations (scientific notation) = 9.16E+3219, or 9.16*10^3219.

That's an insane number.

Back to top


Last updated: September, 2019

For SuperCRUNCH v1.2

Clone this wiki locally