Skip to content

2: Starting Sequences

Daniel Portik edited this page Jun 1, 2021 · 18 revisions

Starting Sequences


Overview

To run SuperCRUNCH, you will need to provide a fasta file of nucleotide sequence records. This fasta file can contain records downloaded directly from GenBank, local (unpublished) sequence records, or a combination of GenBank and local sequences.

The naming conventions for records is described for both GenBank and local sequence data in the Obtaining GenBank Sequence Data and Using Local Sequences sections.

Sometimes, there are problems with downloaded data. For example, duplicate accessions will cause errors in the pipeline. In other cases, large shotgun genome sequences have resulted in a massive file size. These can be filtered out with a cleaning step, as they are not used in the analysis anyways. Both pre-cleaning steps are described in the Resolving Duplicate Records and Decreasing GenBank File Size sections.


Obtaining GenBank Sequence Data

Starting sequences for SuperCRUNCH can be easily downloaded from the NCBI nucleotide database (GenBank). The simplest way to obtain published sequence data is to search for a taxonomic term on the nucleotide database and download all the available records in fasta format. However, for larger taxonomic groups (ex. all birds or frogs) this may not be possible as there will be too much data to download directly. In this case, it is better to split searches using the taxonomy of the group (order, clade, family, etc), download each record set in fasta format, and then combine the resulting fasta files. The NCBI taxonomy browser is particularly useful for identifying the relevant taxonomic search terms. Advanced users may choose to use additional search terms to filter the results, which could help reduce the resulting file size, but this is not necessary.

The downloaded fasta files may be quite large in size (several GB) and it is unlikely you can use a text editor to open and edit them. Combining large files requires a different approach. The fasta files downloaded from GenBank will be called sequence.fasta by default. You can rename the files as they are downloaded, and if all have the extension .fasta intact, they can be quickly combined using Unix. If the fasta files are all in the same working directory, you can use the following command to accomplish this:

cat *.fasta > Combined.fasta

Fasta file structure

The starting fasta file can be in the interleaved format or sequential format, as demonstrated below:

Interleaved format:

>KP820543.1 Callisaurus draconoides voucher MVZ265543 aryl hydrocarbon receptor (anr) gene, partial cds
TAAAATTTCCTTTGAAAGGAACCTTTTTGTGGACACCAGGGATGAATTGGGTAATGTAATGGCGAGCGAT
TGGCAGGAAAATATTTTGCCAGTGAGGAATAACAGCATTCTCAAACAAGAGCAGACAGAGTGCCCCCAGG
AAAATAATTTAATGCTTCCTGAAGACAGCATGGGCATTTTTCAGGATAACAAAAATAATGAACTGTACAA

>KP820544.1 Urosaurus ornatus voucher UWBM7587 aryl hydrocarbon receptor (anr) gene, partial cds
TAAAATCTCCTTTGAAAGGAACCTTTTTGTGGACACCAGGGATGAATTAGGTAATGTAATGGCCAGCGAT
TGGCAGGAAAATATTTTGCCAGTGAGGAATAACAGCATCCTCAAACAAGAACAGACTGAGTGCCCCCAGG
TAAAATTTCCTTTGAAAGGAACCTTTTTGTGGACACCAGGGATGAATTGGGTAATGTAATGGCGAGCGAT
AAACTAATTTAATGCTTCCTGAAGACAGTATGGGTATTTTTCAGGATAACAAAAATAATGAACTGTACAA

Sequential format:

>KP820543.1 Callisaurus draconoides voucher MVZ265543 aryl hydrocarbon receptor (anr) gene, partial cds
TAAAATTTCCTTTGAAAGGAACCTTTTTGTGGACACCAGGGATGAATTGGGTAATGTAATGGCGAGCGATTGGCAGGAAAATATTTTGCCAGTGAGGAATAACAGCATTCTCAAACAAGAGCAGACAGAGTGCCCCCAGGAAAATAATTTAATGCTTCCTGAAGACAGCATGGGCATTTTTCAGGATAACAAAAATAATGAACTGTACAA

>KP820544.1 Urosaurus ornatus voucher UWBM7587 aryl hydrocarbon receptor (anr) gene, partial cds
TAAAATCTCCTTTGAAAGGAACCTTTTTGTGGACACCAGGGATGAATTAGGTAATGTAATGGCCAGCGATTGGCAGGAAAATATTTTGCCAGTGAGGAATAACAGCATCCTCAAACAAGAACAGACTGAGTGCCCCCAGGAAACTAATTTAATGCTTCCTGAAGACAGTATGGGTATTTTTCAGGATAACAAAAATAATGAACTGTACAA

Sequence record labels

The sequence record labels must have a > followed by a unique accession number, followed by a sequence description. If curated properly, the description line will contain a taxon name, gene/locus information, a voucher/isolate/strain identity, and possibly several other identifiers, all separated by whitespace. The general format would appear like this:

>[Accession] [Taxon label (two-part or three-part name)] [Remaining Description (gene abbreviation, gene description, voucher/isolate/strain identity, other identifiers)]

This general format can be seen in the above record labels:

>[Accession][----Taxon label------] [-voucher code--] [gene description/abbreviation]
>KP820543.1 Callisaurus draconoides voucher MVZ265543 aryl hydrocarbon receptor (anr)

>[Accession][--Taxon label--] [-voucher code-] [gene description/abbreviation]
>KP820544.1 Urosaurus ornatus voucher UWBM7587 aryl hydrocarbon receptor (anr) gene, partial cds

Of course there will be variations in what is present in the record label, depending on the content of the sequence and how much information was provided when it was uploaded. To work with all SuperCRUNCH steps the minimum following must be present in a record label:

  • accession
  • taxon label
  • gene description/abbreviation

There should not be any vertical bars (|) separating information contained in the sequence label. SuperCRUNCH cannot properly parse this format. For example, the following vertical bar fasta format is NOT supported by SuperCRUNCH:

>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTG
AATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGG

>gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAGAACATACGATCGAGTG
AATCCGGAGGACCCGTGGTTACACGGCTCACCGTGGCTTTGCTCTCGTGGTGAACCCGGTTTGCGACCGG

This format cannot be used with SuperCRUNCH, and because the elements of the description line are out of order it cannot be easily reformatted to run in SuperCRUNCH.

Back to top


Using Local Sequences

SuperCRUNCH was designed to work with GenBank and local (unpublished) sequence records. This allows you to incorporate your own sequences with GenBank data, or to complete an analysis using just local sequences. To take advantage of all the available features, the format of sequence records should be similar to NCBI sequence records. That is, the general layout for a custom file would involve the following record format:

>[Accession] [Taxon label (two-part or three-part name)] [Remaining Description (gene abbreviation, gene description, voucher/isolate/strain identity, other identifiers)]

This is similar to a typical NCBI sequence record:
>KP820543.1 Callisaurus draconoides voucher MVZ265543 aryl hydrocarbon receptor (anr) gene, partial cds
CACAATGAGAAAGCCTTGATAAACCGTGATCGGACTTTGCCACTCGTTGAAGAAATAGATCAGAGGTATT
  • Each sequence requires a unique 'accession' number, which can be any type of identifier. Each accession can appear only once in a set of sequences. Duplicate accessions will cause errors in SuperCRUNCH (see Resolving Duplicate Records section). The accessions can contain a mix of letters, numbers, underscores, and periods. However, they cannot include any spaces.

  • Following this should be the taxon label, which should contain a two-part name (Genus species) or a three-part name (Genus species subspecies). Note that the three-part name does not require an actual subspecies name (e.g., Hyperolius balfouri viridistriatus). It could also consist of a numerical code in the third position (e.g., Hyperolius balfouri CAS256943). Both types are considered valid three-part names, and using the subspecies options in various steps would allow you to search for either type.

  • The description line should contain information about the gene from which the sequence was derived. This is necessary for parsing loci. For the best results, a gene abbreviation (RAG1) and description (recombination activating protein 1) should be included.

  • To generate a vouchered dataset, there must be a voucher, strain, or isolate identifier followed by a code. For example, voucher MVZ249370, strain MVZ249370, and isolate MVZ249370 in the description line would all produce a voucher code of Voucher_MVZ249370. See Parsing Loci for more details on this topic.

Given these requirements, a local sequence record could look like this:

>MVZ249370 Hyperolius nitidulus RAG1
CACAATGAGAAAGCCTTGATAAACCGTGATCGGACTTTGCCACTCGTTGAAGAAATAGATCAGAGGTATT

Note that all of the above components should be separated by spaces.

The accession component can be any unique identifier using a combination of letters and numbers. In the above example, I used a museum voucher number. This would work if there is only one sequence for this sample. However, you are probably using SuperCRUNCH because you have multiple sequences for each sample (from many loci). If this is the case, you can easily create accession numbers by adding an extension. Let's assume this sample (MVZ249370) actually has sequence data for 4 loci. The following is an example of how to label the accessions to ensure each is unique:

>MVZ249370.RAG1 Hyperolius nitidulus recombination activating protein 1 (RAG-1) gene
TTGATAGCTGAAAGAGAAGCCATGAAGTCCAGTGAGCTCATGCTTGAAATCGGCGGAATTCTCAGAAG...

>MVZ249370.TYR Hyperolius nitidulus tyrosinase (TYR) gene
GGGGAAGCCTCAGCTAGAGGAACGTGCCAAGATGTAGTCCTCTCCACTGCTCCTGTAGGTGCTCAATT...

>MVZ249370.POMC Hyperolius nitidulus proopiomelanocortin (POMC) gene
AAAGCATGCAAGATGGACTTATCTGCAGAATCACCTGTATTTCCTGGCAACGGGCACATGCAGCCCCT...

>MVZ249370.FICD Hyperolius nitidulus FIC domain-containing protein (FICD) gene
CACAATGAGAAAGCCTTGATAAACCGTGATCGGACTTTGCCACTCGTTGAAGAAATAGATCAGAGGTA...

This labeling scheme (Sample.Locus) will make it clear that all sequences are derived from MVZ249370, and also ensure the accessions to be unique. The accessions can contain a mix of letters, numbers, underscores, and periods.

Although the voucher information is part of the accession number, it won't be recognized in these sequences as they are written. To use the voucher features across various steps, an additional component needs to be added. The keywords voucher, isolate, and strain can be used, followed by the voucher code. Examples of how the voucher codes are embedded into sequences are provided at the bottom of the description for the Parse_Loci.py module. We can ammend the records above as follows:

>MVZ249370.RAG1 Hyperolius nitidulus recombination activating protein 1 (RAG-1) gene voucher MVZ249370
TTGATAGCTGAAAGAGAAGCCATGAAGTCCAGTGAGCTCATGCTTGAAATCGGCGGAATTCTCAGAAG...

>MVZ249370.TYR Hyperolius nitidulus tyrosinase (TYR) gene voucher MVZ249370
GGGGAAGCCTCAGCTAGAGGAACGTGCCAAGATGTAGTCCTCTCCACTGCTCCTGTAGGTGCTCAATT...

>MVZ249370.POMC Hyperolius nitidulus proopiomelanocortin (POMC) gene voucher MVZ249370
AAAGCATGCAAGATGGACTTATCTGCAGAATCACCTGTATTTCCTGGCAACGGGCACATGCAGCCCCT...

>MVZ249370.FICD Hyperolius nitidulus FIC domain-containing protein (FICD) gene voucher MVZ249370
CACAATGAGAAAGCCTTGATAAACCGTGATCGGACTTTGCCACTCGTTGAAGAAATAGATCAGAGGTA...

This will allow the voucher codes to be detected in these records. We could also have used isolate or strain instead of voucher, with the same effect.

Combinations of local sequences and GenBank sequences

If local sequences follow the general format described above, they can be combined with GenBank sequences and easily analyzed together.

For one phylogenomics project, I extracted UCE data from multiple published genomes. This created a local sequence set. These sequences did not contain typical GenBank record descriptions. In addition, I downloaded UCE sequence data directly from GenBank. Given that there were close to 5,000 loci to parse, filter, and align, I wanted to analyze these data together.

Using a relabeling script for this project (available here), I relabeled the genome-extracted UCE sequences. They were labeled like this:

>Vipera_berus.genome.uce-3517 Vipera berus ultra conserved element uce-3517
TAAGAATGTCGATACCGCTGAACAATCTAAGGGGCTTGGGAAGTGGGGAGCATAAAGTTTAATCTACTTATGCTGGAGTACTTACCT...

>Vipera_berus.genome.uce-322 Vipera berus ultra conserved element uce-322
ACAAATATGTGATAAGTAATAAATAAATATCTATTAGTAATCAAAGATCAATAATCAGCCACTGAGCTTACACATTATTCACCATTG...

>Vipera_berus.genome.uce-3307 Vipera berus ultra conserved element uce-3307
ACTCACCACTGGCCAATGAGAGTTGAAGTTCATCAACATCTAGTACTTCATTAATTTCCCAACTAAAAGGTATTATGTTTTATGGAA...

The main features of these records are a unique accession identifier, the taxon label, and the appropriate UCE abbreviation/description. In this case, the accession number was a combination of the taxon label, the phrase 'genome', and uce locus, separated by periods (e.g., Taxon.genome.uce-number). This ensured all the accession numbers were unique.

After relabeling the local sequence records in this fashion, I combined them with the GenBank records I had downloaded. The GenBank records looked like this:

>KR354177.2 Phrynosoma hernandesi voucher MVZ245875 ultra conserved element uce-1980 genomic sequence
TATCTTCACTGCTACAATTAAAGTAAATTGCAGTTATCTTGTGAGTATCATGTTAGTCTGGTGTTAGAAC
AAACATTTCAATCCTGTTAGGTATAGGACTTCACATTTATTTCAAGTTGCATCACCACATAAATCTG...

>KR354176.2 Phrynosoma modestum voucher MVZ238583 ultra conserved element uce-1980 genomic sequence
TATCCTCACTGCTACAATTAAAGTAAATTGGAGTTATCTTGTGAGTATCATATTAGTTTGGTGTTAGAAC
AAACATTTCAATCCTGTTAGGTATAGGACTTCACATTTATTTCAAGTTGTATCACCACATAAATCTG...

>KR354175.2 Phrynosoma cornutum voucher MVZ238582 ultra conserved element uce-1980 genomic sequence
GTTATCTTGTGAGTATCATGTTAGTCTGGTGTTAGAACAAACATTTCAATCCTGTTAGGTATAGGACTTC
ACATTTATTTCAAGTTGCATCACCACATAAATCTGAGCATGCTAAAATGTATTTCTTTTCCTTATAG...

There are clear differences between the labels from the two datasets. However, both include all the key features of the record description line (accession, taxon label, gene abbreviation/description). The resulting combined fasta file ran perfectly in SuperCRUNCH. Also, the custom records were in sequential format, and the GenBank records were in interleaved format - this did not affect the ability of SuperCRUNCH to correctly read the fasta file.

Back to top


Resolving Duplicate Records

Sometimes combining multiple fasta files can result in the presence of duplicate entries, in other words records that have identical accession numbers. In some steps, SuperCRUNCH relies on BioPython to load sequence records (via a dictionary data structure), and because of this the presence of duplicate accessions will cause the module to crash. You will know this is the case if you try to run a module and it produces the following error:

ValueError: Duplicate key 'AF443291.2'

In this instance, there are two records each containing the accession number AF443291.2 somewhere in this fasta file. In order to use this particular fasta file with SuperCRUNCH, you will need to remove all instances of duplicate accession numbers - keeping in mind there may be more than just one duplicate. For large files, this is not easily performed without an automated method. The module below, Remove_Duplicate_Accessions.py, can be used to easily accomplish this.

Back to top


Remove_Duplicate_Accessions.py

The purpose of this module is to search through a fasta file and remove any duplicate records, based on identical accession numbers. This prevents errors when SuperCRUNCH loads fasta files. The input fasta file is cleaned and the resulting fasta file is written to the output directory specified.

Basic Usage:

python Remove_Duplicate_Accessions.py -i <fasta file> -o <output directory>

Argument Explanations:

-i <full-path-to-file> or --input <full-path-to-file>

Required: The full path to a fasta file of sequence data to filter.

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write the output fasta file.

Example Usage:

python Remove_Duplicate_Accessions.py -i /bin/starting_material/Iguania_GenBank.fasta -o /bin/starting_material/cleaned/

The above command will remove duplicate sequence records in Iguania_GenBank.fasta and write the output file to the directory /bin/starting_material/cleaned/.

Outputs

A filtered fasta file (free of any duplicate accessions) is written to the output directory specified using the -o flag. The output file is named based on the input file. If the input file is called Hyperoliid-seqs.fasta, the output will be called Hyperoliid-seqs-Cleaned.fasta. That is, the NAME component of NAME.fasta is used to create an output file called NAME-Cleaned.fasta.

Back to top


Decreasing GenBank File Size

Sometimes very large sequence records are downloaded as part of a GenBank search and retrieval. Most represent shotgun genome sequences, which are not useful for SuperCRUNCH. These can be easily removed using the Remove_Long_Accessions.py module.

Back to top


Remove_Long_Accessions.py

The purpose of this module is to search through a fasta file and remove any records that exceed the maximum sequence length specified by the -s/--seqlength argument.

Basic Usage:

python Remove_Long_Accessions.py -i <fasta file> -o <output directory> -s <max sequence length>

Argument Explanations:

-i <full-path-to-file> or --input <full-path-to-file>

Required: The full path to a fasta file of sequence data to filter.

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write the output fasta file.

-s <integer> or --seqlength <integer>

Required: The maximum sequence length to retain.

Example Usage:

python Remove_Long_Accessions.py -i /bin/starting_material/Iguania_GenBank.fasta -o /bin/starting_material/lengthfiltered/ -s 50000

The above command will remove all sequences longer than 50,000 bp in Iguania_GenBank.fasta and write the output file (called Iguania_GenBank-LengthFiltered.fasta) to the existing directory /bin/starting_material/lengthfiltered/.

Outputs

A filtered fasta file (free of sequences longer than -s value) is written to the output directory specified using the -o flag. The output file is named based on the input file. If the input file is called Hyperoliid-seqs.fasta, the output will be called Hyperoliid-seqs-LengthFiltered.fasta. That is, the NAME component of NAME.fasta is used to create an output file called NAME-LengthFiltered.fasta.

Back to top


Last updated: January, 2021