Simple BioC parser/builder for ruby
HTML CSS Ruby JavaScript
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
html
lib
samples
spec
xml
.gitignore
Gemfile
LICENSE
LICENSE.txt
README.md
Rakefile
simple_bioc.gemspec

README.md

SimpleBioC

SimpleBioC is a simple parser / builder for BioC data format. BioC is a simple XML format to share text documents and annotations. You can find more information about BioC from the official BioC web site (http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/BioC/)

Feature:

  • Parse & convert a BioC document to an object instance compatible to BioC DTD
  • Use plain ruby objects for simplicity
  • Build a BioC document from an object instance
  • Convert BioC to PubAnnotation JSON

Installation

Add this line to your application's Gemfile:

gem 'simple_bioc'

And then execute:

$ bundle

Or install it yourself as:

$ gem install simple_bioc

Simple Usages

Include library

require 'simple_bioc'

Parse with a file name (path)

collection = SimpleBioC::from_xml(filename)

Traverse & Manipulate Data. Data structure are almost the same as the DTD. Please refer library documents and the BioC DTD.

    puts collection.documents[2].passages[0].text

Build XML text from data

    puts SimpleBioC::to_xml(collection)

Convert PubAnnotation JSON from data

    puts SimpleBioC::to_pubann(collection, {
      sourcedb: 'PubMed', 
      target: 'http://pubannotation.org/docs/sourcedb/PubMed/sourceid/18034444', 
      project: 'Ab3P-abbreviations'
    }))

Options

Specify set of <document>s to parse

You can parse only a set of document elements in a large xml document instead of parsing all the document elements. It may decrease the processing time. For example, the following code will return a collection with two documents ("1234", "4567").

    collection = SimpleBioc::from_xml(filename, {documents: ["1234", "4567"]})

No whitespace in output

By default, outputs of SimpleBioC::to_xml() will be formatted with whitespace. If you do not want this whitespace, you should pass 'save_with' option with 0 to the to_xml() function.

    puts SimpleBioC::to_xml(collection, {save_with:0})

Sample

More samples can be found in Samples directory

    require 'simple_bioc'

    # Sample1: parse, traverse, manipulate, and build BioC data
    require 'simple_bioc'
    
    # parse BioC file
    collection = SimpleBioC.from_xml("../xml/everything.xml")
    
    # the returned object contains all the information in the BioC file
    # traverse & read information
    collection.documents.each do |document|
      puts document
      document.passages.each do |passage|
        puts passage
      end
    end
    
    # manipulate 
    doc = SimpleBioC::Document.new(collection)
    doc.id = "23071747"
    doc.infons["journal"] = "PLoS One"
    collection.documents << doc
    
    p = SimpleBioC::Passage.new(doc)
    p.offset = 0
    p.text = "TRIP database 2.0: a manually curated information hub for accessing TRP channel interaction network."
    p.infons["type"] = "title"
    doc.passages << p
    
    
    # build BioC document from data
    xml = SimpleBioC.to_xml(collection)
    puts xml

Sample2: PubAnnotation Converter (convert_pubann.rb)

# convert document to PubAnnotation JSON
require 'simple_bioc'

if ARGF.argv.size < 1
  puts "usage: ruby convert_pubann.rb {filepath}"
  exit
end

collection = SimpleBioC::from_xml(ARGF.argv[0])
puts SimpleBioC::to_pubann(collection, {
  sourcedb: 'PubMed', 
  target: 'http://pubannotation.org/docs/sourcedb/PubMed/sourceid/18034444', 
  project: 'Ab3P-abbreviations'
})

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

LICENSE

Copyright © 2013, Dongseop Kwon

Released under the MIT License.