A biological sequence file (fasta, fastq, qseq) parser for Ruby

DNA

A biological sequence file parser for Ruby

Austin G. Davis-Richardson

Features

Installation

Tested on Ruby 1.9.3 and 2.0.0

$ (sudo) gem install dna

Usage

require 'dna'

# Automatic Format Detection 

File.open('sequences.fasta') do |handle|
  records = Dna.new handle

  records.each do |record|
    puts record.length
  end
end

File.open('sequences.fastq') do |handle|
  records = Dna.new handle

  records.each do |record|
    puts record.quality
  end
end

File.open('sequences.qseq') do |handle|
  records = Dna.new handle
  puts records.first.inspect
end

# **caveat:** If you are reading from a compressed file
# or `stdin` you MUST specify the sequence format:

require 'zlib'

Zlib::GzipReader.open('sequences.fasta.gz') do |handle|
  records = Dna.new handle, :format => :fasta

  records.each do |record|
    puts record.length
  end
end
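How the gem detects the format is not described in this readme. As a rough illustration only, one common approach is to look at the first character of the first line: `>` starts a FASTA header, `@` starts a FASTQ header, and anything else is treated here as tab-delimited qseq. The helper name `guess_format` is hypothetical and is not part of the gem's API.

```ruby
require 'stringio'

# Hypothetical helper, NOT the dna gem's implementation:
# guess the sequence format from the first line of a stream.
def guess_format(io)
  first = io.gets
  io.rewind # requires a seekable stream; a pipe cannot be rewound
  case first
  when nil   then nil    # empty input, nothing to detect
  when /\A>/ then :fasta # FASTA headers start with '>'
  when /\A@/ then :fastq # FASTQ headers start with '@'
  else            :qseq  # assumption: everything else is qseq
  end
end

puts guess_format(StringIO.new(">1\nGAGAGATC\n"))       # prints "fasta"
puts guess_format(StringIO.new("@1\nTGAA\n+1\n5888\n")) # prints "fastq"
```

An approach like this has to peek at the stream and then rewind it, which only works on seekable input. Passing `:format` explicitly, as in the gzip example above, avoids the need to peek.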

Support for PHRED score parsing

# Illumina (> 1.3)

record.illumina_qualities # => [31, ..., 37]

# Error probabilities

record.illumina_probabilities
# => [1.0, 0.7943282347242815, ...,  0.3981071705534972]

# Solexa and Illumina (<= 1.3)

record.solexa_qualities
record.solexa_probabilities

# Sanger

record.sanger_qualities
record.sanger_probabilities
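For readers unfamiliar with quality encodings, the sketch below shows the standard arithmetic behind scores like these. It is not the gem's implementation, and the helper names are hypothetical. Sanger quality strings use an ASCII offset of 33, while Illumina 1.3+ and Solexa strings use an offset of 64. Phred-scaled scores map to error probabilities as P = 10^(-Q/10), and Solexa-scaled scores as P = 1 / (10^(Q/10) + 1).

```ruby
# Hypothetical helpers, NOT the dna gem's implementation.

# Decode a quality string into integer scores: Q = ASCII code - offset.
# Offset 33 for Sanger, 64 for Illumina 1.3+ and Solexa.
def decode_qualities(quality_string, offset)
  quality_string.each_byte.map { |byte| byte - offset }
end

# Phred-scaled scores (Sanger, Illumina 1.3+): P = 10^(-Q/10)
def phred_probabilities(qualities)
  qualities.map { |q| 10.0**(-q / 10.0) }
end

# Solexa-scaled scores: Q = -10 * log10(P / (1 - P)),
# which rearranges to P = 1 / (10^(Q/10) + 1)
def solexa_probabilities(qualities)
  qualities.map { |q| 1.0 / (10.0**(q / 10.0) + 1.0) }
end

sanger = decode_qualities('5888', 33)
p sanger # prints [20, 23, 23, 23]
p phred_probabilities(sanger)
```

Under the Phred formula a score of 0 gives a probability of 1.0 and a score of 1 gives roughly 0.794, which is consistent with the sample `illumina_probabilities` values shown above.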

Bonus Feature

The DNA gem is also a command-line tool with grep-like capabilities. It prints the records whose header matches a (Ruby) regular expression.

$ dna spec/data/input.fastq "[1-2]"

@1
TGAAACTTATTGATCACCCCGCTTGGCGTTGGGGAGAAATTCAGAAAAGAGTGCTTGATGGGGCGCCACATGCCGTGCAACCCACTCTCTTTCACGCAGCGCGCCCCA
+1
5888.6778888650/-//&,(,./*-11'//0&,-0.(.,,,,/2/&-,,,,,.(.,(,..&---&-,,,((*-----*+.&,,,,,(//&,,,-(,,+(,,,--&(
@2
GTCGCGGCTTACCACCCAACGATTTTTTTTAGAGGTGCTGGTTTCA
+2
2550//*-1./4.--/'+.2.,,,,,,,,&(/00.11426554+13

$ dna spec/data/test.fasta "\d"

>1
GAGAGATCTCATGACACAGCCGAAG
>2
GAGACAUAUCCNNNAA