SeqBuddy

A friend to take care of your sequence files

SeqBuddy is a command-line program and Python 3 API for quickly and easily reading, writing, analyzing, and manipulating sequence files in common formats, including FASTA, GenBank, and NEXUS. The emphasis is on simplicity and interoperability: formats are detected automatically, and input can be file paths or handles, pipes, or even plain text typed right into your terminal window. The SeqBuddy tools fall broadly into two classes: tools that manipulate your data and return a new sequence file, and tools that perform some analysis and return a non-sequence result. Each of the 50+ tools currently implemented in the command-line UI is documented in these wiki pages, with use cases that demonstrate the tool in action. The flags are intended to be intuitive, and care has been taken to minimize the number of positional arguments so the learning curve is as shallow as possible.

Generalized usage

$: sb file(s) <tool> <args> <modifier(s)>

sb: Alias for SeqBuddy used throughout the wiki (see Creating Aliases)

file(s): One or more sequence files in any combination of supported formats. Note that the files argument must be left blank if piping data into SeqBuddy.

tool: A single flag that specifies which SeqBuddy tool is being run. All tools are listed in the table below.

args: If the tool being called accepts any additional arguments, they must be supplied here. All arguments are explained in detail in each tool's wiki page.

modifier(s): These are additional flags that may be passed into SeqBuddy to modify general behavior, irrespective of the tool being called. All modifiers are listed in the table below.
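As a rough illustration of the argument order described above, the following Python sketch assembles an `sb` command line as a list of strings. The helper `build_sb_command`, the file names, and the `fasta` value passed to `-o` are assumptions made for this example and are not part of SeqBuddy; only the flags themselves (`-rc`, `-hd`, `-o`) come from the tables on this page. The sketch builds and prints the commands without running them.

```python
import shlex


def build_sb_command(tool_flag, files=(), args=(), modifiers=(), alias="sb"):
    """Hypothetical helper: order the pieces as `sb file(s) <tool> <args> <modifier(s)>`."""
    if not tool_flag.startswith("-"):
        raise ValueError("the tool must be given as a flag, e.g. '-rc'")
    # Files come first, then the single tool flag, its arguments, and any modifiers.
    return [alias, *files, tool_flag, *map(str, args), *modifiers]


# File input: reverse complement two (hypothetical) files and request FASTA output.
cmd = build_sb_command("-rc", files=["seqs1.gb", "seqs2.fa"], modifiers=["-o", "fasta"])
print(shlex.join(cmd))

# Piped input: the file(s) slot is left blank, as noted above.
piped = build_sb_command("-hd", args=[5])
print("cat seqs.fa | " + shlex.join(piped))
```

The resulting list could be handed to `subprocess.run` on a machine where the `sb` alias (or the full SeqBuddy command) is available.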

Tools

Tool	Flag	Parameters	Brief Description
amend_metadata	-amd	<attribute> [substitution] [regex_pattern]	Add/Delete/Modify record metadata
annotate	-ano	<name> <location> [strand] [qualifiers] [regex_pattern]	Add a feature (annotation) to selected sequences
ave_seq_length	-asl	['clean']	Find the average length of all sequences in an input file
back_translate	-btr	None	Convert amino acid sequences into codons. Select mode/species with -p flag [{'random', 'optimized'}] [{'human', 'mouse', 'yeast', 'ecoli'}]
bl2seq	-bl2s	None	All-by-all BLAST among sequences using bl2seq. Only returns the top hit from each search
blast	-bl	<BLAST database>	BLAST your sequence file using common blast settings, return the hits from blastdb
clean_seq	-cs	['strict'] [replacement character]	Strip out non-sequence characters, such as stops (*) and gaps (-)
complement	-cmp	None	Return complement of nucleotide sequence
concat_seqs	-cts	['clean']	Concatenate a bunch of sequences into a single solid string
count_codons	-cc	['concatenate']	Return codon frequency statistics
count_residues	-cr	None	Generate a table of sequence compositions
degenerate_sequence	-dgn	<table (int)>	Convert unambiguous codons to ambiguous degenerate codons
delete_features	-df	<regex> [regex ...]	Remove specified features from all records
delete_large	-dlg	<threshold (int)>	Delete sequences with length above threshold
delete_metadata	-dm	None	Remove metadata from file (only IDs are retained)
delete_records	-dr	<regex> [regex ...] ['full'] [path] [cols (int)]	Remove records from a file (deleted IDs are sent to stderr)
delete_recs_with_feature	-drf	<regex> [regex ...]	Remove all the records with feature names/IDs containing a given string
delete_repeats	-drp	[scope {'all', 'ids', 'seqs'}] [columns (int)]	Strip out repeat records (ids and/or identical sequences)
delete_small	-dsm	<threshold (int)>	Delete sequences with length below threshold
delete_taxa	-dt	<taxon> [taxon ...]	Remove records from a given set of taxa
extract_feature_sequences	-efs	<regex> [regex ...]	Pull out specific features from annotated sequences
extract_regions	-er	<positions (str)> [positions] ...	Pull out sub-sequences
find_CpG	-fcpg	None	Predict regions under strong purifying selection based on high CpG content
find_orfs	-orf	[Min size (int)] [RevComp (false)]	Finds all the open reading frames in the sequences and their reverse complements
find_pattern	-fp	<regex> [regex ...] ['ambig']	Search for sub-sequences, returning match start positions
find_repeats	-frp	[columns (int)]	Identify whether a file contains repeat sequences and/or sequence ids
find_restriction_sites	-frs	[enzymes {'commercial', 'all', <specific>} ...] [min cuts (int)] [max cuts (int)] [order {'position', 'alpha'}]	Returns a dictionary of all of the restriction sites and their indices for each sequence in the file
group_by_prefix	-gbp	[Split Pattern [Split pattern ...]] [length (int)] [out dir]	Sort sequences into separate files based on prefix
group_by_regex	-gbr	<regex> [regex ...] [Out dir (path)]	Group sequences by ID into new files based on some search criteria
guess_alphabet	-ga	None	Return the alphabet type found in the input file
guess_format	-gf	None	Guess the flatfile format of the input file
head	-hd	[number (int)]	Extract record(s) from the top
hash_ids	-hi	[hash length (int)]	Rename all identifiers to random hashes
insert_seq	-is	<sequence> <location {front, rear, index (int)}>	Insert a sequence at the desired location
in_silico_digest	-isd	[enzymes {<name>} ...]	Cut DNA with specific restriction enzymes
isoelectric_point	-ip	None	Calculate isoelectric points
keep_taxa	-kt	<taxon> [taxon ...]	Pull records from a given set of taxa
list_features	-lf	None	Print a pretty list of sequence annotations
list_ids	-li	[columns (int)]	Output list of sequence identifiers in one (default) or more columns
lowercase	-lc	None	Convert sequences to lowercase
make_ids_unique	-miu	[separator (string)] [padding (int)]	Add a number at the end of replicate ids to make them unique
map_features_nucl2prot	-fn2p	None	Transfer annotations from cDNA/mRNA sequences onto protein sequences
map_features_prot2nucl	-fp2n	None	Transfer annotations from protein sequences onto cDNA/mRNA sequences
max_recs	-max	[number (int)]	Find the longest record(s)
merge	-mrg	None	Combine records with the same ID
min_recs	-min	[number (int)]	Find the shortest record(s)
molecular_weight	-mw	None	Computes the molecular weight of each sequence
num_seqs	-ns	None	Counts how many sequences are present
order_features_alphabetically	-ofa	['rev']	Change the output order of sequence features, based on feature name
order_features_by_position	-ofp	['rev']	Change the output order of sequence features, based on sequence position
order_ids	-oi	['rev']	Sort all sequences by id in alpha-numeric order (reverse with 'rev')
order_ids_randomly	-oir	None	Randomly reorder the position of records in the file
order_recs_by_len	-obl	['rev']	Sort records by sequence length
prepend_organism	-ppo	[length (int)]	Prefix all IDs with unique organism identifier
prosite_scan	-psc	['strict']	Annotate a DNA, RNA or protein sequence using ExPASy PROSITE website
pull_random_record	-prr	[number (int)]	Extract random sequence(s)
pull_records	-pr	<regex> [regex ...] ['full'] [path]	Get all the records with ids containing a given string
pull_record_ends	-pre	<amount (int)>	Get the ends of all sequences
pull_records_with_feature	-prf	<regex> [regex ...]	Get all the records with feature names/IDs containing a given string
purge	-prg	<Max BLAST bit-score (int)>	Delete sequences with high similarity
rename_ids	-ri	<regex> <subs (str)> [num] ['store']	Replace a pattern in IDs with a new string
replace_subseq	-rs	<regex> [regex ...] [replacement]	Replace a sequence pattern with something new
reverse_complement	-rc	None	Return reverse complement of nucleotide sequences
reverse_transcribe	-r2d	None	Convert RNA sequences to DNA
screw_formats	-sf	<new format>	Change the file format to something else
select_frame	-sfr	<frame {1, 2, 3}>	Change the reading frame of sequences by deleting characters from the front
shuffle_seqs	-ss	None	Randomly reorder primary sequence
tail	-tl	[number (int)]	Extract record(s) from the bottom
taxonomic_breakdown	-tb	[depth (int)]	Show taxonomic spread of sequences
translate	-tr	None	Convert coding sequences into amino acid sequences
translate6frames	-tr6	None	Translate nucleotide sequences into all six reading frames
transcribe	-d2r	None	Convert DNA sequences to RNA
transmembrane_domains	-tmd	[Job ID]	Identify and annotate transmembrane domains using the TOPCONS web service
uppercase	-uc	None	Convert sequences to uppercase
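
To make two of the table entries concrete, here is a minimal plain-Python sketch of what `reverse_complement` (-rc) and `select_frame` (-sfr) do to a single DNA string. It illustrates the operations only and is not SeqBuddy's implementation. It assumes unambiguous A/C/G/T input, and it assumes that frame 1 deletes nothing, frame 2 deletes one leading character, and frame 3 deletes two.

```python
# Illustration only: plain-Python stand-ins, not SeqBuddy's own code.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string (cf. the -rc tool)."""
    return seq.translate(COMPLEMENT)[::-1]


def select_frame(seq, frame):
    """Drop `frame - 1` characters from the front (assumed reading of -sfr)."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    return seq[frame - 1:]


print(reverse_complement("ATGCCG"))  # CGGCAT
print(select_frame("ATGCCG", 2))     # TGCCG
```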

Modifying flags

Flag	Brief Description
-a --alpha	Force read as dna, rna, or protein
-f --format	Force read a specific BioPython format. This may allow you to use some of the tools on some formats not auto-read by SeqBuddy (no promises)
-i --in_place	Rewrites the FIRST input file with the final output. Be careful!
-k --keep_temp	Specify a directory to store any temporary files produced during execution
-o --out_format	Specify the supported format you want the output returned in
-q --quiet	Suppress stderr messages (not fully implemented yet)
-r --restrict	Specify which records are modified (regular expressions are understood)
-s --random_seed	Specify a random seed
-t --test	Run the function and return any stderr/stdout other than the sequences.
