Steve Bond edited this page Mar 14, 2017 · 160 revisions

___

A friend to take care of your sequence files

SeqBuddy is a command line program and Python3 API for quickly and easily reading, writing, analyzing, and manipulating sequence files in common formats including FASTA, GenBank, and NEXUS. There is an emphasis on simplicity and interoperability: formats are automatically detected, and input can be file paths or handles, pipes, or even plain text typed right into your terminal window. The SeqBuddy tools can be broadly grouped into two classes: tools that manipulate your data and return a new sequence file, and tools that perform some analysis and return a non-sequence result. Each of the 50+ tools currently implemented in the command line UI has been documented in these wiki pages, including use cases to demonstrate the tools in action. The flags chosen are hopefully rational, and care has been taken to minimize the number of positional arguments to make the learning curve as shallow as possible.

Generalized usage

$: sb file(s) <tool> <args> <modifier(s)>

sb: Alias for SeqBuddy used throughout the wiki (see Creating Aliases)

file(s): One or more sequence files in any combination of supported formats. Note that the files argument must be left blank if piping data into SeqBuddy.

tool: A single flag that specifies which SeqBuddy tool is being run. All tools are listed in the table below.

args: If the tool being called accepts any additional arguments, they must be supplied here. All arguments are explained in detail in each tool's wiki page.

modifier(s): These are additional flags that may be passed into SeqBuddy to modify general behavior, irrespective of the tool being called. All modifiers are listed in the table below.
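
The generalized usage pattern above can be sketched as a small Python helper that assembles the argument list in the same order (files, tool flag, tool arguments, modifiers). This is only an illustrative sketch: `build_sb_command` and `run_sb` are hypothetical helper names, not part of SeqBuddy, and the `seqbuddy` executable name is an assumption (the `sb` alias used in this wiki is a shell alias, so it would not be visible to `subprocess`).

```python
import shlex
import subprocess


def build_sb_command(files, tool, args=(), modifiers=(), executable="seqbuddy"):
    """Assemble an argument list in the order: file(s) <tool> <args> <modifier(s)>.

    `executable` is an assumed command name; adjust it to match your install.
    """
    if not tool.startswith("-"):
        raise ValueError("tool must be a single flag such as '-tr'")
    return [executable, *map(str, files), tool, *map(str, args), *map(str, modifiers)]


def run_sb(command, piped_text=None):
    """Run the command and return its stdout.

    If `piped_text` is given it is sent on stdin, in which case the command
    should have been built with an empty `files` list.
    """
    result = subprocess.run(command, input=piped_text, capture_output=True,
                            text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Delete sequences shorter than 100 residues and request GenBank output.
    # The file name and the 'genbank' format string are illustrative assumptions.
    cmd = build_sb_command(["my_seqs.fa"], "-dsm", args=[100], modifiers=["-o", "genbank"])
    print(shlex.join(cmd))
```

Keeping the builder separate from the runner means the command can be inspected or logged before anything is executed.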

Tools

| Tool | Flag | Parameters | Brief Description |
| --- | --- | --- | --- |
| annotate | `-ano` | `<name> <location> [strand] [qualifiers] [regex_pattern]` | Add a feature (annotation) to selected sequences. |
| ave_seq_length | `-asl` | `['clean']` | Find the average length of all sequences in an input file |
| back_translate | `-btr` | None | Convert amino acid sequences into codons. Select mode/species with -p flag `[{'random', 'optimized'}] [{'human', 'mouse', 'yeast', 'ecoli'}]` |
| bl2seq | `-bl2s` | None | All-by-all BLAST among sequences using bl2seq. Only returns the top hit from each search |
| blast | `-bl` | `<BLAST database>` | BLAST your sequence file using common blast settings, return the hits from blastdb |
| clean_seq | `-cs` | `['strict'] [replacement character]` | Strip out non-sequence characters, such as stops (*) and gaps (-) |
| complement | `-cmp` | None | Return complement of nucleotide sequence |
| concat_seqs | `-cts` | `['clean']` | Concatenate a bunch of sequences into a single solid string |
| count_codons | `-cc` | `['concatenate']` | Return codon frequency statistics. |
| count_residues | `-cr` | None | Generate a table of sequence compositions. |
| degenerate_sequence | `-dgn` | `<table (int)>` | Convert unambiguous codons to ambiguous degenerate codons |
| delete_features | `-df` | `<regex> [regex ...]` | Remove specified features from all records |
| delete_large | `-dlg` | `<threshold (int)>` | Delete sequences with length above threshold |
| delete_metadata | `-dm` | None | Remove meta-data from file (only IDs are retained) |
| delete_records | `-dr` | `<regex> [regex ...] [path] [cols (int)]` | Remove records from a file (deleted IDs are sent to stderr) |
| delete_repeats | `-drp` | `[scope {'all', 'ids', 'seqs'}] [columns (int)]` | Strip out repeat records (ids and/or identical sequences) |
| delete_small | `-dsm` | `<threshold (int)>` | Delete sequences with length below threshold |
| extract_feature_sequences | `-efs` | `<regex> [regex ...]` | Pull out specific features from annotated sequences |
| extract_regions | `-er` | `<positions (str)> [positions] ...` | Pull out sub-sequences |
| find_CpG | `-fcpg` | None | Predict regions under strong purifying selection based on high CpG content |
| find_orfs | `-orf` | `[Min size (int)] [RevComp (false)]` | Finds all the open reading frames in the sequences and their reverse complements. |
| find_pattern | `-fp` | `<regex> [regex ...] ['ambig']` | Search for sub-sequences, returning match start positions. |
| find_repeats | `-frp` | `[columns (int)]` | Identify whether a file contains repeat sequences and/or sequence ids |
| find_restriction_sites | `-frs` | `[enzymes {'commercial', 'all', <specific>} ...] [min cuts (int)] [max cuts (int)] [order {'position', 'alpha'}]` | Returns a dictionary of all of the restriction sites and their indices for each sequence in the file |
| group_by_prefix | `-gbp` | `[Split Pattern [Split pattern ...]] [length (int)] [out dir]` | Sort sequences into separate files based on prefix |
| group_by_regex | `-gbr` | `<regex> [regex ...] [Out dir (path)]` | Group sequences by ID into new files based on some search criteria |
| guess_alphabet | `-ga` | None | Return the alphabet type found in the input file |
| guess_format | `-gf` | None | Guess the flatfile format of the input file |
| hash_seq_ids | `-hsi` | `[hash length (int)]` | Rename all identifiers to random hashes |
| insert_seq | `-is` | `<sequence> <location {front, rear, index (int)}>` | Insert a sequence at the desired location |
| in_silico_digest | `-isd` | `[enzymes {<name>} ...]` | Cut DNA with specific restriction enzymes |
| isoelectric_point | `-ip` | None | Calculate isoelectric points |
| list_features | `-lf` | None | Print a pretty list of sequence annotations |
| list_ids | `-li` | `[columns (int)]` | Output list of sequence identifiers in one (default) or more columns |
| lowercase | `-lc` | None | Convert sequences to lowercase |
| make_ids_unique | `-miu` | `[separator (string)] [padding (int)]` | Add a number at the end of replicate ids to make them unique |
| map_features_nucl2prot | `-fn2p` | None | Transfer annotations from cDNA/mRNA sequences onto protein sequences |
| map_features_prot2nucl | `-fp2n` | None | Transfer annotations from protein sequences onto cDNA/mRNA sequences |
| max_recs | `-max` | None | Find the longest record(s) |
| merge | `-mrg` | None | Combine records with the same ID |
| min_recs | `-min` | None | Find the shortest record(s) |
| molecular_weight | `-mw` | None | Computes the molecular weight of each sequence |
| num_seqs | `-ns` | None | Counts how many sequences are present |
| order_features_alphabetically | `-ofa` | `['rev']` | Change the output order of sequence features, sorting them alphabetically |
| order_features_by_position | `-ofp` | `['rev']` | Change the output order of sequence features, based on sequence position |
| order_ids | `-oi` | `['rev']` | Sort all sequences by id in alpha-numeric order (reverse with 'rev') |
| order_ids_randomly | `-oir` | None | Randomly reorder the position of records in the file |
| prosite_scan | `-psc` | None | Annotate a DNA, RNA or protein sequence using ExPASy PROSITE website |
| pull_random_record | `-prr` | `[number (int)]` | Extract random sequence(s) |
| pull_records | `-pr` | `<regex> [regex ...] ['full'] [path]` | Get all the records with ids containing a given string |
| pull_record_ends | `-pre` | `<amount (int)>` | Get the ends of all sequences |
| pull_records_with_feature | `-prf` | `<regex> [regex ...]` | Get all the records with feature names/IDs containing a given string |
| purge | `-prg` | `<Max BLAST bit-score (int)>` | Delete sequences with high similarity |
| rename_ids | `-ri` | `<regex> <subs (str)> [num] ['store']` | Replace a pattern in IDs with a new string |
| replace_subseq | `-rs` | `<regex> [regex ...] [replacement]` | Replace a sequence pattern with something new |
| reverse_complement | `-rc` | None | Return reverse complement of nucleotide sequences |
| reverse_transcribe | `-r2d` | None | Convert RNA sequences to DNA |
| screw_formats | `-sf` | `<new format>` | Change the file format to something else |
| select_frame | `-sfr` | `<frame {1, 2, 3}>` | Change the reading frame of sequences by deleting characters from the front |
| shuffle_seqs | `-ss` | None | Randomly reorder primary sequence |
| translate | `-tr` | None | Convert coding sequences into amino acid sequences |
| translate6frames | `-tr6` | None | Translate nucleotide sequences into all six reading frames |
| transcribe | `-d2r` | None | Convert DNA sequences to RNA |
| transmembrane_domains | `-tmd` | `[Job ID]` | Identify and annotate transmembrane domains using the TOPCONS web service |
| uppercase | `-uc` | None | Convert sequences to uppercase |
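
Because SeqBuddy accepts piped input, tools that return sequences can in principle be chained, with only the first command naming input files. The sketch below builds such a shell pipeline string; `sb_pipeline` is a hypothetical helper, the file name is made up, and it assumes each step writes its sequence output to stdout so the next step can read it.

```python
import shlex


def sb_pipeline(first_files, steps, executable="sb"):
    """Join several tool invocations with shell pipes.

    `steps` is a list of argument lists, one per tool call, for example
    [["-dsm", "300"], ["-tr"]]. Only the first command receives file
    arguments; later commands leave them blank and read from the pipe.
    """
    commands = []
    for index, step in enumerate(steps):
        files = list(first_files) if index == 0 else []
        commands.append(shlex.join([executable, *files, *step]))
    return " | ".join(commands)


if __name__ == "__main__":
    # Drop sequences under 300 residues, translate the rest, then convert to FASTA.
    print(sb_pipeline(["genes.gb"], [["-dsm", "300"], ["-tr"], ["-sf", "fasta"]]))
```

The resulting string is meant to be pasted into a shell where the `sb` alias is defined; `shlex.join` quotes any argument (such as a regex) that the shell would otherwise interpret.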

Modifying flags

| Flag | Brief Description |
| --- | --- |
| `-f --format` | Force read a specific BioPython format. This may allow you to use some of the tools on some formats not auto-read by SeqBuddy (no promises) |
| `-i --in_place` | Rewrites the FIRST input file with the final output. Be careful! |
| `-k --keep_temp` | Specify a directory to store any temporary files produced during execution |
| `-o --out_format` | Specify the supported format you want the output returned in |
| `-q --quiet` | Suppress stderr messages (not fully implemented yet) |
| `-t --test` | Run the function and return any stderr/stdout other than the sequences. |
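
Since `-i` overwrites the first input file, one cautious approach is to copy that file before running an in-place command. The following is a minimal sketch of that idea only; `backup_first_input` and the `.bak` suffix are hypothetical choices, not SeqBuddy features.

```python
import shutil
import tempfile
from pathlib import Path


def backup_first_input(files, suffix=".bak"):
    """Copy the first input file next to itself and return the backup path.

    Only the first file is copied, because that is the one an in-place
    run would overwrite.
    """
    first = Path(files[0])
    backup = first.with_name(first.name + suffix)
    shutil.copy2(first, backup)
    return backup


if __name__ == "__main__":
    # Demonstrate on a throwaway file so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        seq_file = Path(tmp) / "my_seqs.fa"
        seq_file.write_text(">seq1\nATGC\n")
        backup = backup_first_input([seq_file, "other.fa"])
        print(backup.name)
```

Running with `-t` first is another way to check a command before committing to `-i`.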
