This document is based on the original [BioJava in Anger](http://www.biojava.org/docs/bj_in_anger/) by Mark Schreiber et al. "BioJava in Anger, A Tutorial and Recipe Book for Those in a Hurry".

# Introduction

BioRuby can be both big and intimidating. For those of us who are in a hurry, there is a whole lot to get your head around. This document is designed to help you develop BioRuby programs that do 99% of common tasks without needing to read and understand 99% of the BioRuby API.
The page was inspired by various programming cookbooks and follows a "How do I...?" approach. Each "How do I...?" is paired with example code that does what you want and sometimes more. If you find the code you want and copy and paste it into your program, you should be up and running quickly. I have endeavoured to over-document the code to make it more obvious what I am doing, so some of the code might look a bit bloated.
'BioRuby in Anger' is maintained by Toshiaki Katayama and Pjotr Prins. If you have any suggestions, questions or comments contact the BioRuby mailing list.

# Alphabets and Symbols

In BioRuby, the Sequence class inherits from String, so you can treat a Sequence object as a String and use all the powerful methods implemented in Ruby's String class. You can easily generate DNA and/or amino acid sequences, edit them, extract subsequences, and match regular expression patterns against them with the usual String methods. The Sequence class also has methods for splicing, translation, calculating statistical values, window search, etc.
There is nothing equivalent to BioJava's Alphabet and/or Symbols in BioRuby. However, BioRuby provides lists of nucleic acids, amino acids and codon tables, and uses them transparently in the appropriate methods as needed.

## How can I make an ambiguous Symbol like Y or R?

The IUB defines standard codes for symbols that are ambiguous, such as Y to indicate C or T, R to indicate A or G, and N to indicate any nucleotide. BioRuby stores these symbols in the same Bio::Sequence::NA object as unambiguous bases, and that object can easily be converted to a regular expression that matches the components of each ambiguous symbol. In other words, a Bio::Sequence::NA object can contain symbols that stand for one or more component symbols of the same alphabet, and can therefore be ambiguous.
Generally, an ambiguity symbol is converted to a Regexp object by calling the to_re method on the Bio::Sequence::NA object that contains the symbol. You don't need to make the symbol 'Y' yourself because it is already built into the Bio::NucleicAcid class as a hash named Bio::NucleicAcid::Names.

In [1]:
require 'bio'
 
# creating a Bio::Sequence::NA object containing ambiguous alphabets
ambiguous_seq = Bio::Sequence::NA.new("atgcyrwskmbdhvn")
 
# show the contents and class of the DNA sequence object
p ambiguous_seq              # => "atgcyrwskmbdhvn"
p ambiguous_seq.class        # => Bio::Sequence::NA
 
# convert the sequence to a Regexp object
# (each character class also lists the ambiguity codes it covers)
p ambiguous_seq.to_re        # => /atgc[tcy][agr][atw][gcs][tgk][acm][tgcyskb][atgrwkd][atcwmyh][agcmrsv][atgcyrwskmbdhvn]/
p ambiguous_seq.to_re.class  # => Regexp
 
# example to match an ambiguous sequence to the rigid sequence
att_or_atc = Bio::Sequence::NA.new("aty").to_re
puts "match" if att_or_atc.match(Bio::Sequence::NA.new("att"))
puts "match, too" if Bio::Sequence::NA.new("att").match(att_or_atc)
if Bio::Sequence::NA.new("atc") =~ att_or_atc
  puts "also match"
end

"atgcyrwskmbdhvn"
Bio::Sequence::NA
/atgc[tcy][agr][atw][gcs][tgk][acm][tgcyskb][atgrwkd][atcwmyh][agcmrsv][atgcyrwskmbdhvn]/
Regexp
match
match, too
also match
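
To make the idea behind to_re more concrete, here is a minimal sketch in plain Ruby that expands ambiguity codes into character classes. It does not use BioRuby, so it runs without the gem. The AMBIGUITY table and the ambiguous_to_regexp helper are names I made up for illustration and are not part of the BioRuby API. Unlike the BioRuby output shown above, this sketch lists only the four plain bases in each class.

```ruby
# Hypothetical lookup table (not BioRuby's own data structure):
# each IUB ambiguity code mapped to the plain bases it stands for.
AMBIGUITY = {
  "y" => "ct", "r" => "ag", "w" => "at", "s" => "cg",
  "k" => "gt", "m" => "ac", "b" => "cgt", "d" => "agt",
  "h" => "act", "v" => "acg", "n" => "acgt"
}.freeze

# Build a Regexp in which every ambiguity code becomes a character class
# and every other character is kept literally.
def ambiguous_to_regexp(pattern)
  source = pattern.downcase.each_char.map do |base|
    AMBIGUITY.key?(base) ? "[#{AMBIGUITY[base]}]" : Regexp.escape(base)
  end.join
  Regexp.new(source)
end

re = ambiguous_to_regexp("aty")
p re                # => /at[ct]/
p re.match?("att")  # => true
p re.match?("atc")  # => true
p re.match?("ata")  # => false
```

Keeping the table as a frozen Hash makes it easy to see, and to test, exactly which bases each code covers.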


# Basic Sequence Manipulation
## How do I make a Sequence from a String or make a Sequence Object back into a String?

A lot of the time we see a sequence represented as a String of characters, e.g. "atgccgtggcatcgaggcatatagc". It's a convenient method for viewing and succinctly representing a more complex biological polymer. BioRuby uses Ruby's String class to represent these biological polymers as objects. Unlike BioJava's SymbolList, BioRuby's Bio::Sequence inherits from String and provides extra methods for sequence manipulation. There is no container class like BioJava's Sequence class to store things like the name of the sequence and any features it might have. For now, you can use other container classes such as Bio::FastaFormat, Bio::GFF, Bio::Features, etc. (we plan to prepare a general container class for this that is compatible with the Sequence class in other Open Bio* projects).
The Bio::Sequence class has the same capabilities as Ruby's String class, so it is simple and easy to use. You can represent a DNA sequence with the Bio::Sequence::NA class and a protein sequence with the Bio::Sequence::AA class. You can translate a DNA sequence into a protein sequence with a single method call, and you can concatenate sequences with the same '+' method that the String class uses.

### String to Bio::Sequence object
Simply pass the sequence string to the constructor.

In [2]:
require 'bio'
 
# create a DNA sequence object from a String
dna = Bio::Sequence::NA.new("atcggtcggctta")
 
# create an RNA sequence object from a String
rna = Bio::Sequence::NA.new("auugccuacauaggc")
 
# create a Protein sequence from a String
aa = Bio::Sequence::AA.new("AGFAVENDSA")
 
# you can check whether the sequence contains illegal characters,
# i.e. characters that are not accepted IUB codes for that alphabet
# (a Bio::Sequence::AA#illegal_symbols method should also be prepared)
puts dna.illegal_bases
 
# translate and concatenate a DNA sequence to Protein sequence
newseq = aa + dna.translate
puts newseq      # => "AGFAVENDSAIGRL"

[]
AGFAVENDSAIGRL


### String to Sequence with comments
Yes, we should prepare a better container class for this. For now, you can do it like this:

In [3]:
require 'bio'
 
# format a DNA sequence with the name dna_1 as a FASTA-formatted String
dna = Bio::Sequence::NA.new("atgctg").to_fasta("dna_1")
 
# format an RNA sequence with the name rna_1 as a FASTA-formatted String
rna = Bio::Sequence::NA.new("augcug").to_fasta("rna_1")
 
# format a protein sequence with the name prot_1 as a FASTA-formatted String
prot = Bio::Sequence::AA.new("AFHS").to_fasta("prot_1")
 
# parse the FASTA String into a Bio::FastaFormat object to extract
# the name and the sequence stored in it
fasta_seq = Bio::FastaFormat.new(dna)
puts fasta_seq.entry_id  # => "dna_1"
puts fasta_seq.naseq     # => "atgctg"

dna_1
atgctg


### Bio::Sequence to String
You don't need to call any method to use a Bio::Sequence object as a String because it already behaves as one, although you can call the to_s method to convert it explicitly.

In [4]:
# you can use Bio::Sequence object as a String object to print, seamlessly
dna = Bio::Sequence::NA.new("atgc")
puts dna        # => "atgc"
str = dna.to_s
puts str        # => "atgc"

atgc
atgc


## How do I get a subsection of a Sequence?
Given a Sequence object, we might only be interested in examining the first 10 bases, or we might want to get a region between two points. You might also want to print a subsequence to a file or to STDOUT. How could you do this?
BioRuby partly uses a biological coordinate system for identifying bases. The Bio::Sequence#subseq method extracts a subsequence using coordinates in which the first base is numbered 1 and the last base index is equal to the length of the sequence. Other methods, inherited from the String class, use normal String indexing, which starts at 0 and proceeds to length - 1. If you attempt to access a region outside of 1..length with the subseq method, nil will be returned. The methods inherited from String behave as they do for any String.
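
The relationship between the two coordinate systems can be sketched in plain Ruby. The one_based_slice helper below is a made-up name, not a BioRuby method. It only mimics the behaviour described above (1-based, inclusive coordinates, nil outside 1..length) on top of an ordinary String, and I am assuming that an inverted range should also give nil.

```ruby
# Hypothetical helper, not part of BioRuby: 1-based, inclusive
# coordinates implemented with Ruby's 0-based String indexing.
def one_based_slice(str, from, to)
  # Reject coordinates outside 1..length, or an inverted range (assumption).
  return nil if from < 1 || to > str.length || from > to
  str[(from - 1)..(to - 1)]
end

seq = "atgcatgc"
p one_based_slice(seq, 1, 3)                        # => "atg"
p one_based_slice(seq, seq.length - 2, seq.length)  # => "tgc"
p one_based_slice(seq, 0, 3)                        # => nil
```

Subtracting 1 from both ends is all it takes, because both the biological coordinates and a Ruby `a..b` range are inclusive.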

### Getting a Subsequence

In [5]:
# sample DNA sequence
seq = Bio::Sequence::NA.new("atgcatgc")
 
# get the first symbol
sym1 = seq.subseq(1,1)  # => "a"
sym2 = seq[0]           # => "a" (Ruby 1.8 returned 97, the ASCII code for "a")
 
# get the first three bases
seq1 = seq.subseq(1,3)  # => "atg"
seq2 = seq[0,3]         # => "atg"
seq3 = seq[0..2]        # => "atg"
 
# get the last three bases
seq4 = seq.subseq(seq.length - 2, seq.length)  # => "tgc"
seq5 = seq[-3,3]                               # => "tgc", this is probably the most elegant solution
seq6 = seq[-3..-1]                             # => "tgc"

"tgc"

### Printing a Subsequence

In [7]:
# print the last three bases of a sequence
puts seq.subseq(seq.length - 2, seq.length)

tgc


### Complete Listing

In [8]:
require 'bio'
 
# generate an RNA sequence
seq = Bio::Sequence::NA.new("auggcaccguccagauu")
 
# get the first Symbol
sym = seq.subseq(1,1)    # => "a"
 
# get the first three bases
seq2 = seq.subseq(1,3)
 
# get the last three bases
seq3 = seq.subseq(seq.length - 2, seq.length)
 
# print the last three bases
s = seq.subseq(seq.length - 2, seq.length)
puts s

auu


### Iteration
You can easily iterate over every subsequence with BioRuby.
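
The introduction mentions window search among the Sequence methods (to the best of my knowledge the BioRuby method is called window_search, but check the API documentation before relying on it). As a minimal sketch of the same idea in plain Ruby, the hypothetical windows helper below collects every window of a given size. It works on an ordinary String and is not part of BioRuby.

```ruby
# Hypothetical helper, not part of BioRuby: return every window of
# `size` characters, moving `step` characters at a time.
def windows(str, size, step = 1)
  (0..str.length - size).step(step).map { |start| str[start, size] }
end

seq = "atgcatgc"

# every overlapping 3-mer, one per line
windows(seq, 3).each { |subseq| puts subseq }

# non-overlapping codons: window of 3, step of 3
p windows(seq, 3, 3)  # => ["atg", "cat"]
```

A trailing fragment shorter than the window (here the final "gc") is simply dropped.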

## How do I transcribe a DNA Sequence to a RNA Sequence?
In BioRuby, DNA and RNA sequences are stored in the same Bio::Sequence::NA class, just using different alphabets. You can convert from DNA to RNA or from RNA to DNA using the rna or dna methods, respectively.

In [17]:
require 'bio'
 
# make a DNA sequence
dna = Bio::Sequence::NA.new("atgccgaatcgtaa")

# transcribe it to RNA
rna = dna.rna

# just to prove it worked
puts dna       # => "atgccgaatcgtaa"
puts rna       # => "augccgaaucguaa"

# revert to the DNA again
puts rna.dna   # => "atgccgaatcgtaa"

atgccgaatcgtaa
augccgaaucguaa
atgccgaatcgtaa


## How do I reverse complement a DNA or RNA Sequence?
To reverse complement a DNA sequence simply use the complement method.

In [18]:
require 'bio'
 
# make a DNA sequence
seq = Bio::Sequence::NA.new("atgcacgggaactaa")
 
# reverse complement it
rev = seq.complement
 
# prove that it worked
puts seq  # => "atgcacgggaactaa"
puts rev  # => "ttagttcccgtgcat"

atgcacgggaactaa
ttagttcccgtgcat
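
If you are curious what happens to the letters, the operation can be sketched in plain Ruby with String#reverse and String#tr. This is only an illustration under the assumption of a lowercase, unambiguous DNA sequence. The reverse_complement helper below is defined here for the example and does not show how BioRuby implements it.

```ruby
# Illustration only: reverse the sequence, then swap a<->t and g<->c.
# Assumes lowercase, unambiguous DNA (no n, y, r, ...).
def reverse_complement(dna)
  dna.reverse.tr("atgc", "tacg")
end

p reverse_complement("atgcacgggaactaa")  # => "ttagttcccgtgcat"
```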


## Sequences are immutable so how can I change its name?
Sequences are not immutable in BioRuby. If you want a sequence to be unchangeable, call the freeze method on it.
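
Because a sequence behaves as a String, freeze should work on it as it does on any Ruby object. The sketch below uses a plain String as a stand-in, so that it runs without the gem. That a Bio::Sequence object behaves the same way is an assumption based on the inheritance described earlier.

```ruby
seq = "atgc".dup   # .dup guarantees a mutable copy of the literal
seq << "tt"        # editing in place works
p seq              # => "atgctt"

seq.freeze         # from now on the object rejects modification
p seq.frozen?      # => true

begin
  seq << "aa"
rescue RuntimeError => e  # FrozenError (Ruby 2.5+) is a RuntimeError subclass
  puts "cannot modify a frozen sequence (#{e.class})"
end
```

Note that freeze protects the object, not the variable: you can still assign a new sequence to seq.
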
## How can I edit a Sequence or SymbolList?
Sometimes you will want to modify the order of symbols in a sequence. For example, you may wish to delete some bases, insert some bases or overwrite some bases in a DNA sequence. BioRuby's Bio::Sequence object can be edited with any method inherited from Ruby's String class.

In [2]:
require 'bio'
 
# create a DNA sequence
seq = Bio::Sequence::NA.new("atggct")
 
# add "cc" to the end
seq += Bio::Sequence::NA.new("cc")
 
# should now be atggctcc
puts seq  # => "atggctcc"
 
# insert "tt" at the start
seq = Bio::Sequence::NA.new("tt") + seq
 
# should now be ttatggctcc
puts seq  # => "ttatggctcc"
 
# insert "aca" at position 4
seq[3,0] = Bio::Sequence::NA.new("aca")
 
# should now be ttaacatggctcc
puts seq  # => "ttaacatggctcc"
 
# overwrite at position 2, 3 bases with "ggg"
seq[1,3] = Bio::Sequence::NA.new("ggg")
 
# should now be tgggcatggctcc
puts seq  # => "tgggcatggctcc"
 
# delete from the start 5 bases (overwrite 5 bases with nothing)
seq[0,5] = ""
 
# should now be atggctcc
puts seq  # => "atggctcc"
 
# now a more complex example
 
# overwrite positions two and three with aa and then insert tt
seq[1,2] = Bio::Sequence::NA.new("aa") + Bio::Sequence::NA.new("tt")
 
# should now be aaattgctcc
puts seq  # => "aaattgctcc"

atggctcc
ttatggctcc
ttaacatggctcc
tgggcatggctcc
atggctcc
aaattgctcc


# Translation

## How do I translate a DNA or RNA Sequence or SymbolList to Protein?
All you need to do is call the translate method on a Bio::Sequence::NA object. In BioRuby, you don't need to convert DNA to RNA before translation.

In [4]:
require 'bio'
 
# create a DNA sequence
seq = Bio::Sequence::NA.new("atggccattgaatga")
 
# translate to protein
prot = seq.translate
 
# prove that it worked
puts seq   # => "atggccattgaatga"
puts prot  # => "MAIE*"

atggccattgaatga
MAIE*


## How do I translate a single codon to a single amino acid?
The general translation example shows how to use the translate method of Bio::Sequence::NA object but most of what goes on is hidden behind the convenience method. If you only want to translate a single codon into a single amino acid you get exposed to a bit more of the gory detail but you also get a chance to figure out more of what is going on under the hood.

In [5]:
require 'bio'
 
# make a 'codon'
codon = Bio::Sequence::NA.new("uug")
 
# you can translate the codon as described in the previous section.
puts codon.translate  # => "L"

L


Here is the other way, looking the codon up in a codon table directly:

In [6]:
require 'bio'
 
# make a 'codon'
codon = Bio::Sequence::NA.new("uug")
 
# select the standard codon table
codon_table = Bio::CodonTable[1]
 
# You need to convert the RNA codon to the DNA alphabet because the
# CodonTable in BioRuby is implemented as a static Hash with keys
# expressed in the DNA alphabet (not the RNA alphabet).
codon2 = codon.dna
 
# get the representation of that codon and translate to amino acid.
amino_acid = codon_table[codon2]
puts amino_acid        # => "L"

L
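
To see why the RNA codon has to be converted first, here is a minimal sketch of a Hash lookup keyed by DNA codons. PARTIAL_CODON_TABLE is a deliberately tiny, made-up excerpt of the standard genetic code, and lookup_codon is a hypothetical helper. Neither is part of BioRuby.

```ruby
# Hypothetical four-entry excerpt of the standard genetic code;
# the keys are DNA codons, as in the table described above.
PARTIAL_CODON_TABLE = {
  "atg" => "M", "ttg" => "L", "gcc" => "A", "tga" => "*"
}.freeze

def lookup_codon(codon)
  dna_codon = codon.downcase.tr("u", "t")  # RNA "uug" becomes DNA "ttg"
  PARTIAL_CODON_TABLE[dna_codon]           # nil if the codon is not in this excerpt
end

p lookup_codon("uug")  # => "L"
p lookup_codon("ATG")  # => "M"
```

Looking up "uug" without the conversion would return nil, because no key in the Hash contains a "u".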


## How do I use a non standard translation table?
The convenient translate method in Bio::Sequence::NA, used in the general translation example, is not limited to the "Universal" translation table. The translate method also accepts a translation starting frame and a codon table number as its arguments.
The following translation tables are available:

- 1: Standard (Eukaryote)
- 2: Vertebrate Mitochondrial
- 3: Yeast Mitochondrial
- 4: Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
- 5: Invertebrate Mitochondrial
- 6: Ciliate Macronuclear and Dasycladacean
- 9: Echinoderm Mitochondrial
- 10: Euplotid Nuclear
- 11: Bacteria
- 12: Alternative Yeast Nuclear
- 13: Ascidian Mitochondrial
- 14: Flatworm Mitochondrial
- 15: Blepharisma Macronuclear
- 16: Chlorophycean Mitochondrial
- 21: Trematode Mitochondrial
- 22: Scenedesmus obliquus Mitochondrial
- 23: Thraustochytrium Mitochondrial Code

The following program shows the use of the Euplotid Nuclear translation table (where UGA = Cys).

In [7]:
require 'bio'
 
# make a DNA sequence including the 'tga' codon
seq = Bio::Sequence::NA.new("atgggcccatgaaaaggcttggagtaa")
 
# translate from the frame 1 with codon table 10 (Euplotid Nuclear)
protein = seq.translate(1, 10)
 
# print out the protein
puts protein        # => "MGPCKGLE*"
 
# compared to the universal translation table
puts seq.translate  # => "MGP*KGLE*"

MGPCKGLE*
MGP*KGLE*


## How do I translate a nucleotide sequence in all six frames?

In [8]:
require 'bio'
 
# Create an empty hash to place the results within
data = Hash.new
 
seq = Bio::Sequence::NA.new("atgggcccatgaaaaggcttggagtaa")
 
# frames 1 to 3 read the sequence as given,
# frames 4 to 6 read its reverse complement
(1..6).each do |frame|
  # Store the results within the data hash
  data[frame] = seq.translate(frame)
end
 
# Print the data hash
puts data # => {1=>"MGP*KGLE*", 2=>"WAHEKAWS", 3=>"GPMKRLGV", 4=>"LLQAFSWAH", 5=>"YSKPFHGP", 6=>"TPSLFMGP"}

{1=>"MGP*KGLE*", 2=>"WAHEKAWS", 3=>"GPMKRLGV", 4=>"LLQAFSWAH", 5=>"YSKPFHGP", 6=>"TPSLFMGP"}
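
To check which codons each frame actually reads, the frame numbering can be sketched in plain Ruby. The codons_in_frame helper is hypothetical and not part of BioRuby. It assumes lowercase, unambiguous DNA, and it assumes that frames 4 to 6 are offsets 0 to 2 on the reverse complement, which is consistent with the output above.

```ruby
# Hypothetical helper, not part of BioRuby: list the codons read in a
# given frame (1-3 forward strand, 4-6 reverse complement).
def codons_in_frame(dna, frame)
  strand = frame <= 3 ? dna : dna.reverse.tr("atgc", "tacg")
  offset = (frame - 1) % 3           # bases skipped before the first codon
  strand[offset..-1].scan(/.{3}/)    # an incomplete trailing codon is dropped
end

seq = "atgggcccatgaaaaggcttggagtaa"
p codons_in_frame(seq, 2).first(3)  # => ["tgg", "gcc", "cat"]
p codons_in_frame(seq, 4).first(3)  # => ["tta", "ctc", "caa"]
```

The first three codons of frame 2 translate to W, A and H, which matches the start of "WAHEKAWS" in the output above.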


# Sequence I/O

## How do I write Sequences in Fasta format?
FASTA format is a fairly standard bioinformatics output format that is convenient and easy to read. BioRuby's Sequence class has a to_fasta method for formatting a sequence in FASTA format.
The following example prints a Bio::Sequence object in FASTA format.

In [9]:
require 'bio'
 
# Generates a sample 100bp sequence.
seq1 = Bio::Sequence::NA.new("aatgacccgt" * 10)
 
# Name this sequence "testseq" and print it in FASTA format
# (folded at 60 chars per line).
puts seq1.to_fasta("testseq", 60)

>testseq
aatgacccgtaatgacccgtaatgacccgtaatgacccgtaatgacccgtaatgacccgt
aatgacccgtaatgacccgtaatgacccgtaatgacccgt
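
The format itself is simple enough to sketch by hand: a header line starting with ">" followed by the sequence folded at a fixed width. The format_fasta helper below is a hypothetical plain-Ruby illustration of that layout. It is not the BioRuby implementation.

```ruby
# Hypothetical helper, not part of BioRuby: build a FASTA record with
# the sequence folded at `width` characters per line.
def format_fasta(name, seq, width = 60)
  lines = seq.scan(/.{1,#{width}}/)          # chunks of at most `width` chars
  ([">#{name}"] + lines).join("\n") + "\n"
end

puts format_fasta("testseq", "aatgacccgt" * 10, 60)
```

For the 100 bp sample sequence this prints the header, one line of 60 bases and one line of 40 bases, the same layout as the output above.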



# Counts and Distributions

## How do I count the residues in a Sequence?

In [11]:
require 'bio'
 
seq = Bio::Sequence::NA.new("atgcatgcaaaa")
seq.composition.each do |nuc, count|
  puts "#{nuc}:#{count}"
end

a:6
t:2
g:2
c:2


{"a"=>6, "t"=>2, "g"=>2, "c"=>2}
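
If you want to see what such a count involves, here is a minimal sketch in plain Ruby that tallies the characters of an ordinary String. The count_residues helper is hypothetical and is not how BioRuby's composition method is implemented.

```ruby
# Hypothetical helper, not part of BioRuby: count each character of a String.
def count_residues(seq)
  seq.each_char.each_with_object(Hash.new(0)) do |residue, counts|
    counts[residue] += 1
  end
end

# prints a:6, t:2, g:2 and c:2, one pair per line
count_residues("atgcatgcaaaa").each do |residue, count|
  puts "#{residue}:#{count}"
end
```

Hash.new(0) gives every new key a starting count of zero, so no explicit initialisation is needed.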

## How can I turn a Count into a Distribution?
Using a function.

In [12]:
require 'bio'
 
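# return a hash mapping each residue to its fraction of the sequence
def distr(seq)
  length = seq.length.to_f  # Float, so the division below is not integer division
  dist = {}
  seq.composition.each do |nuc, count|
    dist[nuc] = count / length
  end
  dist
end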
 
seq = Bio::Sequence::NA.new("atgcatgcaaaa")
 
p seq
p seq.composition
p distr(seq)

"atgcatgcaaaa"
{"a"=>6, "t"=>2, "g"=>2, "c"=>2}
{"a"=>0.5, "t"=>0.16666666666666666, "g"=>0.16666666666666666, "c"=>0.16666666666666666}



Using an instance method added to the Bio::Sequence::NA class.

In [13]:
require 'bio'
 
class Bio::Sequence::NA
  def distribution
    length=self.length.to_f
    dist={}
    self.composition.each do |nuc, count|
      dist[nuc]=count/length
    end
    dist
  end
end
 
seq = Bio::Sequence::NA.new("atgcatgcaaaa")
 
p seq
p seq.composition
p seq.distribution

"atgcatgcaaaa"
{"a"=>6, "t"=>2, "g"=>2, "c"=>2}
{"a"=>0.5, "t"=>0.16666666666666666, "g"=>0.16666666666666666, "c"=>0.16666666666666666}



# Disclaimer

This code is generously donated by people who probably have better things to do. Where possible we test it, but errors may have crept in. As such, all code and advice herein has no warranty or guarantee of any sort. You didn't pay for it, and if you use it we are not responsible for anything that goes wrong. Be a good programmer and test it yourself before unleashing it on your corporate database.

# Copyright

The documentation on this site is the property of the people who contributed it. If you wish to use it in a publication, please make a request through the BioRuby mailing list. The original version was based on the 'BioJava in Anger' document by Mark Schreiber et al.
The code is open-source. A good definition of open-source can be found here. If you agree with that definition then you can use it.