Home
Bio::Faster is a BioRuby gem that implements a fast and simple parser for FastQ files. The new version drops support for FastA files to focus on the more resource-demanding FastQ parsing. It is a rewrite of the old one: the C extension has been written from scratch, and the parser now also checks for formatting problems in FastQ files. A full set of RSpec tests has been defined, based on the test files available in the official FastQ paper. The gem uses Ruby-FFI to bind to the C extension and is also compatible with JRuby. For a full list of supported Rubies, check Travis-CI.
A Bio::Faster object is created by passing a FastQ file name, and the each_record method is then used to parse the whole file. The method yields a simple array for each sequence in the file. The array includes the complete sequence header (ID and comments), the sequence itself and, by default, an array with the quality values as integers. The default quality encoding is expected to be Sanger (Phred33), and the conversion is done directly during parsing.
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record do |sequence_header, sequence, quality|
  puts sequence_header, sequence, quality
end
If the quality values are in Phred64 format (e.g. Solexa), you need to specify this in the each_record call:
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record(:quality => :solexa) do |sequence_header, sequence, quality|
  puts sequence_header, sequence, quality
end
The each_record method can also return the raw qualities as a string of ASCII characters, without doing any conversion. To do this, pass :quality => :raw when calling the method.
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record(:quality => :raw) do |sequence_header, sequence, quality|
  puts sequence_header, sequence, quality
end
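The relationship between the raw quality string and the integer values can be sketched in plain Ruby. This is only an illustration of the offset arithmetic, assuming Phred33 means "ASCII code minus 33" and Phred64 means "ASCII code minus 64"; decode_quality and error_probability are hypothetical helpers, not part of Bio::Faster, and the gem performs its conversion inside the C extension.

```ruby
# Hypothetical helper, not part of Bio::Faster: turn a raw quality string
# into integer scores by subtracting the encoding offset from each byte.
# The :sanger / :solexa names mirror the options used on this page.
def decode_quality(raw, encoding = :sanger)
  offsets = { :sanger => 33, :solexa => 64 } # assumed ASCII offsets
  offset = offsets.fetch(encoding)           # raises KeyError on unknown encodings
  raw.each_byte.map { |byte| byte - offset }
end

# Error probability implied by a Phred score Q: P = 10^(-Q/10).
def error_probability(score)
  10.0**(-score / 10.0)
end

p decode_quality("II5+!")           # Phred33-encoded string
p decode_quality("hhTJ@", :solexa)  # the same scores with a 64 offset
p error_probability(20)
```

Both calls decode to the scores 40, 40, 20, 10 and 0, since "I" is ASCII 73 and "h" is ASCII 104. A score of 20 corresponds to an error probability of about 0.01.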
The each_record method can also read directly from STDIN, which is useful when dealing with compressed FastQ files.
Just specify :stdin as the input:
Bio::Faster.new(:stdin).each_record do |seq|
...
You can then call the Ruby script through a pipe in a standard Unix terminal:
zcat sequences.fastq.gz | ruby my_parser.rb
This way you can read gzipped files without any drop in parsing performance.
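To make the streaming pattern concrete without depending on the gem, here is a minimal pure-Ruby stand-in that reads records from any IO object, with $stdin as the default. It is a sketch under stated assumptions: each record occupies exactly four lines, the leading "@" is stripped from the header, and the quality is left as a raw string. each_fastq_record is a hypothetical name, and this is not how Bio::Faster is implemented (the gem parses in C through FFI).

```ruby
require 'stringio'

# Hypothetical helper, not part of Bio::Faster: yields
# header, sequence and raw quality for each four-line FastQ record.
def each_fastq_record(io = $stdin)
  return enum_for(:each_fastq_record, io) unless block_given?

  io.each_line.each_slice(4) do |lines|
    header, sequence, separator, quality = lines.map(&:chomp)
    # Basic format checks, in the spirit of a validating parser.
    raise ArgumentError, "truncated record" if quality.nil?
    raise ArgumentError, "header must start with '@'" unless header.start_with?("@")
    raise ArgumentError, "separator must start with '+'" unless separator.start_with?("+")
    unless sequence.length == quality.length
      raise ArgumentError, "sequence and quality lengths differ"
    end
    yield header[1..-1], sequence, quality
  end
end

# Demo on an in-memory stream; in a piped script the argument would be omitted
# so that records are read from $stdin.
sample = StringIO.new("@read1 first read\nACGT\n+\nIIII\n")
each_fastq_record(sample) do |header, sequence, quality|
  puts header, sequence, quality
end
```

Because the helper only relies on each_line, the same code works for a file handle, a pipe or a StringIO, which is why piping decompressed data into the script needs no special handling.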
Bio::Faster is about 3 to 4 times faster than the standard object-oriented FastQ parsing method (and even faster with JRuby).
The following compares the time needed to parse a 5.4 Gb Illumina 1.8+ FastQ file.
Bio::Faster.new("test_file.fastq").each_record {|sequence_header, sequence, quality|}
Ruby 1.9.3-p194
real 4m1.337s
user 3m56.447s
sys 0m4.339s
JRuby 1.6.7 OpenJDK 64-Bit Server VM 1.6.0_18
real 3m12.023s
user 3m9.040s
sys 0m4.277s
Bio::FlatFile.open(Bio::Fastq,File.open("test_file.fastq")).each_entry {|seq|}
Ruby 1.9.3-p194
real 11m35.946s
user 11m26.762s
sys 0m7.764s
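The speedup implied by these wall-clock ("real") times can be worked out with a few lines of Ruby. The timings are copied from the figures above; to_seconds is a hypothetical helper that assumes the "XmY.YYYs" format printed by the shell's time command.

```ruby
# Convert a `time`-style stamp such as "4m1.337s" into seconds.
def to_seconds(stamp)
  match = stamp.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unrecognised time: #{stamp}" unless match
  match[1].to_i * 60 + match[2].to_f
end

# "real" times taken from the benchmark listed above.
timings = {
  "Bio::Faster, Ruby 1.9.3-p194"   => "4m1.337s",
  "Bio::Faster, JRuby 1.6.7"       => "3m12.023s",
  "Bio::FlatFile, Ruby 1.9.3-p194" => "11m35.946s"
}

baseline = to_seconds(timings["Bio::FlatFile, Ruby 1.9.3-p194"])
timings.each do |label, stamp|
  seconds = to_seconds(stamp)
  puts format("%-32s %8.3f s  %.2fx", label, seconds, baseline / seconds)
end
```

On these numbers, Bio::Faster comes out at roughly 2.9 times faster than Bio::FlatFile on Ruby 1.9.3 and roughly 3.6 times faster on JRuby.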