SeqBuddy
SeqBuddy is a command-line program and Python 3 API for quickly and easily reading, writing, analyzing, and manipulating sequence files in common formats, including FASTA, GenBank, and NEXUS. The emphasis is on simplicity and interoperability: formats are detected automatically, and input can be file paths or handles, pipes, or even plain text typed right into your terminal window. The SeqBuddy tools fall broadly into two classes: tools that manipulate your data and return a new sequence file, and tools that perform some analysis and return a non-sequence result. Each of the 50+ tools currently implemented in the command-line UI is documented in these wiki pages, including use cases that demonstrate the tools in action. The flags chosen are hopefully rational, and care has been taken to minimize the number of positional arguments to keep the learning curve as shallow as possible.
$: sb file(s) <tool> <args> <modifier(s)>
sb: Alias for SeqBuddy used throughout the wiki (see Creating Aliases)
file(s): One or more sequence files in any combination of supported formats. Note that the files argument must be left blank if piping data into SeqBuddy.
tool: A single flag that specifies which SeqBuddy tool is being run. All tools are listed in the table below.
args: If the tool being called accepts any additional arguments, they must be supplied here. All arguments are explained in detail in each tool's wiki page.
modifier(s): These are additional flags that may be passed into SeqBuddy to modify general behavior, irrespective of the tool being called. All modifiers are listed in the table below.
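The argument order described above can be sketched in Python as a small helper that assembles the command line. This is only an illustration of the layout, not part of SeqBuddy: `build_sb_command` is a hypothetical name, `seqs.fa` is a placeholder file, and `genbank` as the value passed to `-o` is an assumed format name. The flags themselves (`-dsm`, `-o`) come from the tables below.

```python
import shlex
from typing import List, Sequence


def build_sb_command(
    tool_flag: str,
    files: Sequence[str] = (),
    tool_args: Sequence[str] = (),
    modifiers: Sequence[str] = (),
    piped: bool = False,
) -> List[str]:
    """Hypothetical helper: order tokens as sb file(s) <tool> <args> <modifier(s)>."""
    if piped and files:
        # The files argument must be left blank when data is piped in.
        raise ValueError("leave file(s) blank when piping data into SeqBuddy")
    if not piped and not files:
        raise ValueError("at least one input is required unless piping")
    return ["sb", *files, tool_flag, *tool_args, *modifiers]


# Delete sequences shorter than 100 residues and request GenBank output
# ('genbank' is an assumed format name used only for illustration).
cmd = build_sb_command(
    "-dsm", files=["seqs.fa"], tool_args=["100"], modifiers=["-o", "genbank"]
)
print(shlex.join(cmd))  # sb seqs.fa -dsm 100 -o genbank
```

The helper only builds the token list; handing it to something like `subprocess.run` would additionally require the `sb` alias (or the SeqBuddy executable) to be available on your system.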
Tool | Flag | Parameters | Brief Description |
---|---|---|---|
amend_metadata | -amd | <attribute> [substitution] [regex_pattern] | Add/Delete/Modify record metadata |
annotate | -ano | <name> <location> [strand] [qualifiers] [regex_pattern] | Add a feature (annotation) to selected sequences |
ave_seq_length | -asl | ['clean'] | Find the average length of all sequences in an input file |
back_translate | -btr | None | Convert amino acid sequences into codons. Select mode/species with -p flag [{'random', 'optimized'}] [{'human', 'mouse', 'yeast', 'ecoli'}] |
bl2seq | -bl2s | None | All-by-all BLAST among sequences using bl2seq. Only returns the top hit from each search |
blast | -bl | <BLAST database> | BLAST your sequence file using common blast settings, return the hits from blastdb |
clean_seq | -cs | ['strict'] [replacement character] | Strip out non-sequence characters, such as stops (*) and gaps (-) |
complement | -cmp | None | Return complement of nucleotide sequence |
concat_seqs | -cts | ['clean'] | Concatenate a bunch of sequences into a single solid string |
count_codons | -cc | ['concatenate'] | Return codon frequency statistics |
count_residues | -cr | None | Generate a table of sequence compositions |
degenerate_sequence | -dgn | <table (int)> | Convert unambiguous codons to ambiguous degenerate codons |
delete_features | -df | <regex> [regex ...] | Remove specified features from all records |
delete_large | -dlg | <threshold (int)> | Delete sequences with length above threshold |
delete_metadata | -dm | None | Remove meta-data from file (only IDs are retained) |
delete_records | -dr | <regex> [regex ...] ['full'] [path] [cols (int)] | Remove records from a file (deleted IDs are sent to stderr) |
delete_recs_with_feature | -drf | <regex> [regex ...] | Remove all the records with feature names/IDs containing a given string |
delete_repeats | -drp | [scope {'all', 'ids', 'seqs'}] [columns (int)] | Strip out repeat records (ids and/or identical sequences) |
delete_small | -dsm | <threshold (int)> | Delete sequences with length below threshold |
delete_taxa | -dt | <taxon> [taxon ...] | Remove records from a given set of taxa |
extract_feature_sequences | -efs | <regex> [regex ...] | Pull out specific features from annotated sequences |
extract_regions | -er | <positions (str)> [positions] ... | Pull out sub-sequences |
find_CpG | -fcpg | None | Predict regions under strong purifying selection based on high CpG content |
find_orfs | -orf | [Min size (int)] [RevComp (false)] | Finds all the open reading frames in the sequences and their reverse complements |
find_pattern | -fp | <regex> [regex ...] ['ambig'] | Search for sub-sequences, returning match start positions |
find_repeats | -frp | [columns (int)] | Identify whether a file contains repeat sequences and/or sequence ids |
find_restriction_sites | -frs | [enzymes {'commercial', 'all', <specific>} ...] [min cuts (int)] [max cuts (int)] [order {'position', 'alpha'}] | Returns a dictionary of all of the restriction sites and their indices for each sequence in the file |
group_by_prefix | -gbp | [Split Pattern [Split pattern ...]] [length (int)] [out dir] | Sort sequences into separate files based on prefix |
group_by_regex | -gbr | <regex> [regex ...] [Out dir (path)] | Group sequences by ID into new files based on some search criteria |
guess_alphabet | -ga | None | Return the alphabet type found in the input file |
guess_format | -gf | None | Guess the flatfile format of the input file |
head | -hd | [number (int)] | Extract record(s) from the top |
hash_ids | -hi | [hash length (int)] | Rename all identifiers to random hashes |
insert_seq | -is | <sequence> <location {front, rear, index (int)}> | Insert a sequence at the desired location |
in_silico_digest | -isd | [enzymes {<name>} ...] | Cut DNA with specific restriction enzymes |
isoelectric_point | -ip | None | Calculate isoelectric points |
keep_taxa | -kt | <taxon> [taxon ...] | Pull records from a given set of taxa |
list_features | -lf | None | Print a pretty list of sequence annotations |
list_ids | -li | [columns (int)] | Output list of sequence identifiers in one (default) or more columns |
lowercase | -lc | None | Convert sequences to lowercase |
make_ids_unique | -miu | [separator (string)] [padding (int)] | Add a number at the end of replicate ids to make them unique |
map_features_nucl2prot | -fn2p | None | Transfer annotations from cDNA/mRNA sequences onto protein sequences |
map_features_prot2nucl | -fp2n | None | Transfer annotations from protein sequences onto cDNA/mRNA sequences |
max_recs | -max | [number (int)] | Find the longest record(s) |
merge | -mrg | None | Combine records with the same ID |
min_recs | -min | [number (int)] | Find the shortest record(s) |
molecular_weight | -mw | None | Computes the molecular weight of each sequence |
num_seqs | -ns | None | Counts how many sequences are present |
order_features_alphabetically | -ofa | ['rev'] | Change the output order of sequence features, based on feature name |
order_features_by_position | -ofp | ['rev'] | Change the output order of sequence features, based on sequence position |
order_ids | -oi | ['rev'] | Sort all sequences by id in alpha-numeric order (reverse with 'rev') |
order_ids_randomly | -oir | None | Randomly reorder the position of records in the file |
order_recs_by_len | -obl | ['rev'] | Sort records by sequence length |
prepend_organism | -ppo | [length (int)] | Prefix all IDs with unique organism identifier |
prosite_scan | -psc | ['strict'] | Annotate a DNA, RNA or protein sequence using ExPASy PROSITE website |
pull_random_record | -prr | [number (int)] | Extract random sequence(s) |
pull_records | -pr | <regex> [regex ...] ['full'] [path] | Get all the records with ids containing a given string |
pull_record_ends | -pre | <amount (int)> | Get the ends of all sequences |
pull_records_with_feature | -prf | <regex> [regex ...] | Get all the records with feature names/IDs containing a given string |
purge | -prg | <Max BLAST bit-score (int)> | Delete sequences with high similarity |
rename_ids | -ri | <regex> <subs (str)> [num] ['store'] | Replace a pattern in IDs with a new string |
replace_subseq | -rs | <regex> [regex ...] [replacement] | Replace a sequence pattern with something new |
reverse_complement | -rc | None | Return reverse complement of nucleotide sequences |
reverse_transcribe | -r2d | None | Convert RNA sequences to DNA |
screw_formats | -sf | <new format> | Change the file format to something else |
select_frame | -sfr | <frame {1, 2, 3}> | Change the reading frame of sequences by deleting characters from the front |
shuffle_seqs | -ss | None | Randomly reorder primary sequence |
tail | -tl | [number (int)] | Extract record(s) from the bottom |
taxonomic_breakdown | -tb | [depth (int)] | Show taxonomic spread of sequences |
translate | -tr | None | Convert coding sequences into amino acid sequences |
translate6frames | -tr6 | None | Translate nucleotide sequences into all six reading frames |
transcribe | -d2r | None | Convert DNA sequences to RNA |
transmembrane_domains | -tmd | [Job ID] | Identify and annotate transmembrane domains using the TOPCONS web service |
uppercase | -uc | None | Convert sequences to uppercase |
Flag | Brief Description |
---|---|
-a --alpha | Force read as dna, rna, or protein |
-f --format | Force read a specific BioPython format. This may allow you to use some of the tools on some formats not auto-read by SeqBuddy (no promises) |
-i --in_place | Rewrites the FIRST input file with the final output. Be careful! |
-k --keep_temp | Specify a directory to store any temporary files produced during execution |
-o --out_format | Specify the supported format you want the output returned in |
-q --quiet | Suppress stderr messages (not fully implemented yet) |
-r --restrict | Specify which records are modified (regular expressions are understood) |
-s --random_seed | Specify a random seed |
-t --test | Run the function and return any stderr/stdout other than the sequences |