OpenBio

ktym edited this page Sep 7, 2011 · 97 revisions

The membership of the group fluctuated during the Sunday afternoon session. Part of the discussion was quite broad, looking for interesting "use cases" for RDF, and at how the Open Bio* projects (BioPerl, BioRuby, Biopython, BioJava, etc) might contribute.

One potential idea was RDF serialisation of Bio* objects, which would require agreeing on consistent model output across the projects. This could indirectly facilitate things like GenBank or UniProt to RDF.

Our plan for the Hackathon is to produce several RDF web services which could be integrated into SADI. These would use the existing strengths of the Bio* projects to call and parse the output from standalone tools - and use this to construct RDF.

##Members##

  • Peter Cock (Biopython) (leader)
  • Christian Zmasek (BioRuby)
  • Erick Antezana
  • Raoul J.P. Bonnal (BioRuby)
  • Mitsuteru Nakao (BioRuby)
  • Matúš Kalaš (BioXSD, EDAM)
  • Naohisa Goto (BioRuby)
  • Rutger Vos
  • Toshiaki Katayama (BioRuby)
  • Hilmar Lapp (BioPerl)
  • Brad Chapman (Biopython)

#Existing Libraries

We noted that once data is available as RDF or SPARQL end points, there are existing open source libraries for working with this (in Python, C, etc) which are not biologically focussed. We should build on these rather than reimplementing this core functionality. Examples:

#RDF Examples from SADI (input/output)

#Private Data

Referring to published (open) data in RDF is fairly straightforward as the sequences etc have URIs.

However, for generic analyses the user will have their own sequences which are unpublished/private. How does one assign a URI for these? Use the sequence itself? Generate URLs on the local server?
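The two options raised above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base URL `LOCAL_BASE` and both function names are hypothetical, and nothing here was agreed by the group.

```python
import hashlib
from urllib.parse import quote

# Hypothetical base URL of a resolver running on the user's own server.
LOCAL_BASE = "http://localhost:8080/sequence/"


def normalise(sequence):
    """Drop whitespace and upper-case, so equal sequences compare equal."""
    return "".join(sequence.split()).upper()


def checksum_uri(sequence):
    """Option 1: derive the URI from the sequence itself (via a digest)."""
    digest = hashlib.sha256(normalise(sequence).encode("ascii")).hexdigest()
    return LOCAL_BASE + "sha256/" + digest


def local_uri(name):
    """Option 2: mint a URL on the local server from a user-chosen name."""
    return LOCAL_BASE + quote(name, safe="")


print(checksum_uri("acgt\nACGT"))
print(local_uri("my seq 1"))
```

A content-derived URI has the advantage that two users analysing the same unpublished sequence get the same identifier path without coordinating, whereas a name-based URL is easier to read but only meaningful on the server that minted it.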

#Webservices to build

The following RDF service ideas are generally about making existing, commonly used tools available for use within large linked-data analyses. For example, you might take your protein sequence of interest, run BLAST against the NR database, then cross-reference the results with mentions of the matches in the literature (e.g. PubMed), perhaps combining this with taxonomic resources.

##BLAST to RDF

The first example takes a list of sequences (or sequence identifiers), runs BLAST, parses the output and returns it as RDF.

Mark W. explained later in the afternoon that SADI already has a number of prototypes doing this, and that the hard part was deciding on the best RDF representation. As of Wednesday this has been agreed between Mark W., Luke M., Jerven B., etc. as the BLAST RDF model.

In discussion it was noted that restricting this to databases of public sequences (for which we have a URI) makes referencing the BLAST matches in RDF output much easier, e.g. we might initially just want to use some of the NCBI standard BLAST databases like NR.
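As a rough illustration of the parse-and-emit step, the sketch below converts BLAST+ tabular output (`-outfmt 6`, default columns) into N-Triples using only the Python standard library. The `http://example.org/` vocabulary, the base URIs, the accession in the example row and the function name are all made up for this sketch; it does not reproduce the BLAST RDF model agreed at the Hackathon.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical vocabulary for illustration only; this is NOT the BLAST RDF
# model agreed at the Hackathon.
VOCAB = "http://example.org/blast#"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def blast_tab_to_ntriples(handle, query_base, subject_base):
    """Turn each tabular BLAST row (one HSP) into four N-Triples lines."""
    triples = []
    for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) < len(FIELDS) or row[0].startswith("#"):
            continue  # skip blank, short or comment lines
        hit = dict(zip(FIELDS, row))
        hsp = "_:hsp%d" % number  # blank node standing for this HSP
        query = query_base + quote(hit["qseqid"], safe="")
        subject = subject_base + quote(hit["sseqid"], safe="")
        triples.append("%s <%squery> <%s> ." % (hsp, VOCAB, query))
        triples.append("%s <%ssubject> <%s> ." % (hsp, VOCAB, subject))
        for name in ("evalue", "bitscore"):
            triples.append('%s <%s%s> "%s"^^<%s> .'
                           % (hsp, VOCAB, name, hit[name], XSD_DOUBLE))
    return triples


# A made-up result row; the identifiers are placeholders.
example = "query1\tNP_000001.1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380\n"
for line in blast_tab_to_ntriples(io.StringIO(example),
                                  "http://example.org/query/",
                                  "http://example.org/hit/"):
    print(line)
```

Restricting the search to public databases matters here because `subject_base` can then be a stable, resolvable prefix; for private queries `query_base` would have to come from whatever local URI scheme is chosen.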

##HMMER3 output to RDF/XML

The output from HMMER is conceptually quite similar to that from BLAST, so once an RDF model for BLAST is agreed it should be relatively straightforward to design something similar for HMMER.

In particular, this would be useful for a web service to take query sequences (or their IDs) as RDF, call HMMER3 to search them for PFAM domains, and return the domain matches as RDF.

Note that this example deliberately uses PFAM domains which can be described with public URIs, rather than a more general service where you might want to use arbitrary private/unpublished domain models.

Note that BioRuby currently contains only a HMMER2 parser. Chris Z. worked on a HMMER3 wrapper and output parser for BioRuby. The HMMER3 output parser and application wrapper are complete; tests and documentation are still in progress: see https://github.com/cmzmasek/bioruby/tree/master/lib/bio/.

The HMMER3-RDF/XML support in BioRuby has been shown to convert HMMER3 output to RDF/XML within the TogoWS service.

HMMER3 RDF/XML

gist - Ruby

gist - RDF/XML output

Usage

Source: https://github.com/cmzmasek/bioruby

Executing hmmscan or hmmsearch and converting the result to RDF/XML

Per domain

require 'bio'
factory = Bio::HMMER3.new('hmmscan',
                          '/path/to/Pfam-A.hmm',
                          'my_query.fasta',
                          '--domtblout' )
report = factory.query
puts report.to_rdf(:xml)

Per protein

require 'bio'
factory = Bio::HMMER3.new('hmmscan',
                          '/path/to/Pfam-A.hmm',
                          'my_query.fasta',
                          '--tblout' )
report = factory.query
puts report.to_rdf(:xml)

Parsing existing HMMER output and converting it to RDF/XML

Input is String

require 'bio'
data = String.new
data         << '#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord'
data << "\n" << '# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target'
data << "\n" << '#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------'
data << "\n" << 'Bcl-2                PF00452.13   101 sp|P10415|BCL2_HUMAN -            239   3.7e-30  103.7   0.1   1   1   7.9e-34   4.9e-30  103.3   0.0     1   101    97   195    97   195 0.99 Apoptosis regulator proteins, Bcl-2 family'
data << "\n" << 'BH4                  PF02180.11    27 sp|P10415|BCL2_HUMAN -            239   3.9e-15   54.6   0.1   1   1   1.3e-18   8.2e-15   53.6   0.1     2    26     8    32     7    33 0.94 Bcl-2 homology region 4'
data << "\n" 

report = Bio::Hmmer3Report.new(data)

puts report.to_rdf(:xml)

# Example of printing some values (no RDF):
report.hits.each do |hit|
  puts hit.target_name
  puts hit.query_name
  puts hit.full_sequence_e_value
end

Input is File

require 'bio'
report = Bio::Hmmer3Report.new('/path/to/hmmer_domtblout.output')
puts report.to_rdf(:xml)
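For readers working in one of the other Bio* projects, the column layout of the "domtblout" rows used above can be illustrated in plain Python. This is a hedged sketch, not a port of the BioRuby parser: `DomainHit` and `parse_domtblout` are hypothetical names, and only a handful of the 22 fixed columns are kept.

```python
from collections import namedtuple

# Hypothetical record type holding a subset of the domtblout columns.
DomainHit = namedtuple(
    "DomainHit",
    "target_name target_accession query_name full_e_value "
    "domain_i_e_value ali_from ali_to description")


def parse_domtblout(text):
    """Parse HMMER3 --domtblout text into a list of DomainHit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # header and separator lines start with '#'
        # 22 whitespace-separated columns, then a free-text description.
        cols = line.split(None, 22)
        hits.append(DomainHit(
            target_name=cols[0],
            target_accession=cols[1],
            query_name=cols[3],
            full_e_value=float(cols[6]),
            domain_i_e_value=float(cols[12]),
            ali_from=int(cols[17]),
            ali_to=int(cols[18]),
            description=cols[22] if len(cols) > 22 else ""))
    return hits


row = ("Bcl-2 PF00452.13 101 sp|P10415|BCL2_HUMAN - 239 3.7e-30 103.7 0.1 "
       "1 1 7.9e-34 4.9e-30 103.3 0.0 1 101 97 195 97 195 0.99 "
       "Apoptosis regulator proteins, Bcl-2 family")
for hit in parse_domtblout(row):
    print(hit.target_name, hit.query_name, hit.ali_from, hit.ali_to)
```

Splitting at most 22 times keeps the description intact even though it contains spaces and commas.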

To Do

  • Examples, documentation, and tutorial
  • Integration into main BioRuby distribution
  • OWL ontology of hmmsearch and hmmscan terms
  • Get community input

##Multiple Sequence Alignment

Taking as input sequences (or their IDs?) and producing a classic multiple sequence alignment (as RDF).

##Multiple Sequence Alignment to Tree

Another example was taking a multiple sequence alignment (as RDF), building a phylogenetic tree and returning this as RDF.

##Multi-Protein Trees to annotated tree

This was a more detailed phylogenetics example: taking a tree for multiple proteins and combining it with a taxonomic tree to infer gene duplication/deletion/insertion, producing an annotated tree. The discussion was on using the SDI (Speciation Duplication Inference) algorithm (or something like it) to infer gene duplications by reconciling gene trees with species trees. Again, part of the challenge will be extending the RDF for the tree to add these annotations.
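To make the reconciliation idea concrete, here is a minimal sketch of the LCA-mapping rule that SDI-style methods build on: each gene tree node is mapped to the smallest species tree clade containing all species below it, and a node is called a duplication when it maps to the same clade as one of its children. The nested-tuple tree encoding, the `gene|species` leaf naming and all function names are assumptions made for this example; it is not the SDI implementation discussed by the group and it ignores efficiency.

```python
def species_of(leaf):
    """Hypothetical leaf naming convention: 'gene|species'."""
    return leaf.split("|")[1]


def clades(tree):
    """List the leaf set of every node in a nested-tuple tree (post-order)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result, union = [], frozenset()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        union |= sub[-1]  # the last entry is the child's own leaf set
    result.append(union)
    return result


def lca(species_clades, wanted):
    """Smallest species clade containing every species in 'wanted'."""
    return min((c for c in species_clades if wanted <= c), key=len)


def reconcile(gene_tree, species_clades, events):
    """Map a gene tree node to a species clade, recording events post-order."""
    if isinstance(gene_tree, str):
        return frozenset([species_of(gene_tree)])
    child_maps = [reconcile(c, species_clades, events) for c in gene_tree]
    here = lca(species_clades, frozenset().union(*child_maps))
    kind = "duplication" if here in child_maps else "speciation"
    events.append((kind, sorted(here)))
    return here


species_tree = (("human", "mouse"), "chicken")
gene_tree = ((("a|human", "a|mouse"), ("b|human", "b|mouse")), "c|chicken")
events = []
reconcile(gene_tree, clades(species_tree), events)
for kind, species in events:
    print(kind, species)
```

In the example the node joining the `a` and `b` copies maps to the human/mouse clade, the same clade as both of its children, so it is reported as a duplication. Event labels like these are the annotations that the tree RDF would need to carry.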

#RDF Serialisation

A related desired ability is using the Bio* libraries to serialise the parsed versions of various biological file formats to RDF. For instance, we can already parse various sequence file formats (e.g. FASTA, GenBank, UniProt) into objects, so can we output them as RDF? Simplistically this enables the Bio* tools to be used to convert these formats into RDF, but of course the objects can also be modified or generated in code.

A short list of particular interest was BLAST+ (see BLAST RDF model), HMMER3, GenBank, UniProt, and FASTA as RDF. Clearly this first requires an RDF model for pairwise alignments (see notes above), and for annotated sequences.

overview

GenBank/RefSeq to RDF

BioRuby

We created a BioRuby plugin to convert objects to RDF. The main idea is to pass through the Bio::Sequence object. For testing we used the sequence AB000100 from DDBJ. In the output RDF we borrowed the ddbj prefix and schema. The following formats are supported in input:

and in output:

Note: we are testing the performance of the pure-Ruby RDF.rb library and the consistency between input and output.

Development on GitHub

Installation

 git clone git://github.com/helios/bioruby-rdf.git
 cd bioruby-rdf
 bundle install

Note: soon on rubygems

 gem install bio-rdf

Usage

  #don't forget to include the cloned repository in the ruby path or irb (-I)
  require 'bio-rdf'
  gb = Bio::GenBank.new(File.read(your_file_name))
  bio_seq = gb.to_biosequence
  bio_seq.output(:rdf) #Note default output is in NTriples
                       #pass :type=>(:XML|:NTRIPLES|:N3|:JSON) to select a specific output format

ToDo

  • Refactoring
  • Create a Bio::Sequence#to_rdf to create a RDF graph
  • Try to create another plugin, bio-rdf_ext, using the Ruby binding http://librdf.org/docs/ruby.html to improve performance and to share a common API with other OpenBio projects (Python, Perl)
    • Update: Installing the latest librdf (git version) under OSX seems buggy. With a packaging system like Fink and the unstable repository enabled, it seems possible to install raptor 1.4 and rasqal

Web service integration of HMMER3 and GenBank/DDBJ to RDF converters in TogoWS

In our BioRuby group, we developed converters from Bio::Sequence (DDBJ, GenBank/RefSeq, EMBL) to RDF (XML, N3, N-Triples) and from HMMER3 results in "tblout/domtblout" format to RDF/XML. These were also made available through the TogoWS http://togows.dbcls.jp/convert/ service. You can test it without scripting (replace "ddbj" with "genbank" or "embl" in the endpoint URL as appropriate):

% wget http://togows.dbcls.jp/entry/ddbj/AB000100
% wget --post-file AB000100 http://togows.dbcls.jp/convert/ddbj.n3
% wget --post-file AB000100 http://togows.dbcls.jp/convert/ddbj.ntriples
% wget --post-file AB000100 http://togows.dbcls.jp/convert/ddbj.rdfxml
% wget --post-file tblfile.txt http://togows.dbcls.jp/convert/hmmer3tbl.rdfxml
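For use from a script, the same POST can be prepared with the Python standard library. The sketch below only builds the request; the helper name `build_convert_request` is hypothetical, and it assumes the service accepts the raw entry text as the POST body, as the wget calls above suggest, and is still reachable at this address.

```python
import urllib.request

# Endpoint taken from the wget examples above.
TOGOWS_CONVERT = "http://togows.dbcls.jp/convert/"


def build_convert_request(entry_text, source, target):
    """Prepare a POST of a flat-file entry to <source>.<target>."""
    url = "%s%s.%s" % (TOGOWS_CONVERT, source, target)
    # Supplying 'data' makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=entry_text.encode("utf-8"))


request = build_convert_request("LOCUS       AB000100 ...", "ddbj", "n3")
print(request.get_method(), request.full_url)

# Sending it is left commented out, since it needs network access:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```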

In addition, BioRuby 1.4.2 was released. You can install it by

% sudo gem install bio

although the above RDF converters are not yet included in this version.

##A case study of RDF I/O using Perl##

The Bio::Phylo package for phyloinformatics (publication: [1], CPAN release version: [2], development version [3]) reads and writes NeXML [4], an XML format for phylogenetic data that uses RDFa [5] for additional semantic annotation. Using an XSLT sheet [6], the produced NeXML can be transformed into RDF/XML that instantiates classes from the CDAO ontology [7]. Bio::Phylo uses this facility to generate RDF using its Bio::Phylo::IO architecture.

An example of this is shown here:

#!/usr/bin/perl
use strict;
use Bio::Phylo::IO qw(parse unparse);

my $nexus_file = shift;

# this parses a NEXUS file and returns a 
# Bio::Phylo::Project object, which encapsulates
# a data set including OTUs, character state 
# matrices (or alignments) and phylogenetic trees
my $project = parse(
	-format     => 'nexus',
	-file       => $nexus_file,
	-as_project => 1
);

# this unparses the project object 
# into a set of triples
my $rdf = unparse(
	-format => 'cdao',
	-phylo  => $project,
);

# now the RDF/XML string can be written to a
# file, over CGI, etc.
print $rdf;

This was fairly simple to implement in Bio::Phylo [8] because the heavy lifting was done by the XSLT style sheet (which needed quite a bit of work during the hackathon). However, this assumes that the XML that is transformed is reasonably aware of semantic web requirements, and that the API that produces the XML can meet those requirements. For example, to generate absolute URIs, the API will probably need some facility for storing base URIs and identifiers such that these can be concatenated to generate URIs for the subjects of the produced triples.

Going in the other direction is a little more complicated, especially because a great many assumptions are made about what is present in a given RDF graph. In the implementation here, the assumption was that instances of certain OWL classes (those defined by the CDAO) are present in the graph. These instances are then treated as roughly equivalent to objects. For non-semantic web-aware APIs this causes problems because there is a mismatch between object-oriented classes and RDF types: in RDF, the subject URIs that are assumed to map onto object-oriented classes may occur in triples where unexpected properties and values are used.

The workaround in Bio::Phylo is that objects inherit from an annotatable superclass to which an arbitrary number of metadata annotations can be attached. These metadata objects have getters and setters for (predicate) namespaces, getters and setters for the predicates themselves and for the annotation objects. The metadata objects themselves inherit from the same annotatable superclass so that metadata can be attached to metadata (recursively), such that metadata becomes a subject node in the resulting RDF graph.

The general pattern for querying the RDF graph then became:

  1. instantiate an object for each URI that is defined as being of a known CDAO rdf:type;
  2. keep a tally of these instantiated objects (using a hash keyed on URIs) so that objects that refer to each other (e.g. edges in trees that refer to child and parent nodes) can be resolved and linked up in the object representation;
  3. query the graph for additional triples where the instantiated Perl objects are the subject of triples with unexpected predicates and values;
  4. store those triples using the annotatable architecture.
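The four steps can be sketched independently of Perl. The Python below works on plain (subject, predicate, object) tuples rather than a real RDF store, and its vocabulary (`VOCAB`, `NODE`, `PARENT`) and the `TreeNode` class are placeholders invented for this illustration; they are not CDAO terms or Bio::Phylo classes.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical stand-ins for ontology terms; real code would use CDAO URIs.
VOCAB = "http://example.org/vocab#"
NODE = VOCAB + "Node"
PARENT = VOCAB + "has_parent"
KNOWN_TYPES = {NODE}


class TreeNode:
    """Minimal 'annotatable' object: known links plus free-form metadata."""

    def __init__(self, uri):
        self.uri = uri
        self.parent = None
        self.annotations = []  # (predicate, value) pairs we did not expect


def graph_to_objects(triples):
    # Steps 1 and 2: one object per subject of a known type, keyed on URI.
    objects = {s: TreeNode(s) for s, p, o in triples
               if p == RDF_TYPE and o in KNOWN_TYPES}
    for s, p, o in triples:
        if s not in objects:
            continue  # subjects of unknown type go unseen
        if p == RDF_TYPE and o in KNOWN_TYPES:
            continue  # already consumed in step 1
        if p == PARENT and o in objects:
            objects[s].parent = objects[o]  # link up cross-references
        else:
            # Steps 3 and 4: keep unexpected triples as annotations.
            objects[s].annotations.append((p, o))
    return objects


DATA = "http://example.org/data/"
triples = [
    (DATA + "root", RDF_TYPE, NODE),
    (DATA + "tip", RDF_TYPE, NODE),
    (DATA + "tip", PARENT, DATA + "root"),
    (DATA + "tip", VOCAB + "comment", "sampled in 2011"),
    (DATA + "other", RDF_TYPE, VOCAB + "Unknown"),
]
objects = graph_to_objects(triples)
print(objects[DATA + "tip"].parent.uri)
print(objects[DATA + "tip"].annotations)
```

Keying the tally on URIs is what lets the parent link be resolved regardless of the order in which triples arrive.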

The result is that part of the input graph can be accessed as Perl objects, but triples with subjects of unknown types go unseen. Although this approach has these limitations, the resulting API is still useful for cases where services need to be implemented that wrap around "legacy" programs that expect flat-file or non-semantic XML input. Here is an example of what that looks like:

#!/usr/bin/perl
use strict;
use RDF::Trine::Store::Memory;
use RDF::Trine::Parser;
use RDF::Trine::Model;
use Bio::Phylo::IO 'parse';

# for relative rdf:ID URIs, a base URI needs to be defined
# for subsequent SPARQL queries that search for known types
my $base   = 'http://example.org/';
my $store  = RDF::Trine::Store::Memory->new; # holds the graph, can be db
my $model  = RDF::Trine::Model->new( $store ); # represents the graph
my $parser = RDF::Trine::Parser->new('rdfxml'); # parses rdf

$parser->parse_file_into_model( $base, 'data.rdf', $model );

# this returns a Bio::Phylo project object, which can be
# serialized to a variety of flat and xml file formats
print parse(
	'-format' => 'cdao',
	'-file'   => 'characters.rdf',
	'-model'  => $model,
	'-base'   => $base,
	'-as_project' => 1,
)->to_nexus;

The parser that queries the RDF graph is still a work in progress, though it successfully records phylogenetic topologies and taxa (with annotations). Support for character state matrices is not yet complete [9]. The NeXML website shows a number of translations between NeXML instance documents and their equivalent RDF/XML representation [10].

  1. http://www.biomedcentral.com/1471-2105/12/63
  2. http://search.cpan.org/dist/Bio-Phylo/
  3. https://github.com/rvosa/bio-phylo
  4. http://www.nexml.org
  5. http://www.w3.org/TR/xhtml-rdfa-primer/
  6. http://nexml.svn.sourceforge.net/viewvc/nexml/trunk/nexml/xslt/nexml2cdao.xsl
  7. http://www.evolutionaryontology.org/cdao/1.0/cdao.owl
  8. https://github.com/rvosa/bio-phylo/blob/master/lib/Bio/Phylo/Unparsers/Cdao.pm
  9. https://github.com/rvosa/bio-phylo/blob/master/lib/Bio/Phylo/Parsers/Cdao.pm
  10. http://nexml.org/nexml/examples/

Python SPARQL API

Updates to the Python SPARQL API, originally started at BioHackathon 2010, include:

  • Updating the Biogateway wrapper to handle the latest schema changes. It's worth noting that this required no high level API modifications, which reflects the utility of layers on top of SPARQL.

  • Implementing a wrapper for the new BioMart SPARQL interface, using the test server.

The underlying code creates SPARQL, queries the server, and returns results as an array. This provides a unified interface between services and simplifies coding to SPARQL for end users. The API uses a Builder class to create the query, which involves setting attributes of interest to return and key/value pairs to filter on. A server is then defined, and passed the builder to return results. For the BioMart interface, this looks like:

builder = BioMartQueryBuilder("snp", "jpNCCLiver")
builder.add_attributes(["chromosome", "chromosome_start", "chromosome_end",
                        "aa_mutation", "gene_affected", "probability", "mutation"])
builder.add_filter("consequence_type", "non_synonymous_coding")
builder.add_filter("validation_status", "validated")
icgc_server = SematicBioMart("bm-test.res.oicr.on.ca:9085")
results = icgc_server.search(builder)
print results[0]

[Result(chromosome_start='6529637', probability='0.49',
        aa_mutation='D>Y', gene_affected='ENSG00000171680',
        mutation='C>A', chromosome='1', chromosome_end='6529637')]

The Biogateway interface logic is identical:

builder = UniProtGOQueryBuilder("Homo sapiens")
builder.add_attributes(["protein_name", "interactor", "gene_name"])
builder.add_filter("GO_term", "insulin")
builder.add_filter("disease_description", "diabetes")
server = Biogateway()
results = server.search(builder)
print results[0]

HNF1A  CALM1 insulin secretion

These generalized interfaces provide a simpler way to query SPARQL endpoints while still providing a large set of functionality. Much like API abstractions on top of SQL databases, users can always revert to the full SPARQL syntax for custom queries.
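To show how such a builder can turn attributes and filters into SPARQL, here is a deliberately small sketch. `SimpleQueryBuilder`, its `to_sparql` method and the `http://example.org/vocab#` namespace are hypothetical; the real wrappers map names onto each service's own schema, which this example does not attempt.

```python
class SimpleQueryBuilder:
    """Hypothetical builder; not the API of the library described above."""

    def __init__(self, vocab):
        self.vocab = vocab      # namespace used for every predicate
        self.attributes = []    # values to return
        self.filters = []       # (name, required value) pairs

    def add_attributes(self, names):
        self.attributes.extend(names)

    def add_filter(self, name, value):
        self.filters.append((name, value))

    def to_sparql(self):
        select = " ".join("?" + name for name in self.attributes)
        lines = ["PREFIX v: <%s>" % self.vocab,
                 "SELECT %s WHERE {" % select]
        for name in self.attributes:
            lines.append("  ?item v:%s ?%s ." % (name, name))
        for name, value in self.filters:
            # Escape backslashes and quotes so the literal stays well-formed.
            safe = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append('  ?item v:%s "%s" .' % (name, safe))
        lines.append("}")
        return "\n".join(lines)


builder = SimpleQueryBuilder("http://example.org/vocab#")
builder.add_attributes(["chromosome", "mutation"])
builder.add_filter("validation_status", "validated")
print(builder.to_sparql())
```

Because the query text is only produced in `to_sparql`, the same builder calls can target different endpoints by swapping the namespace or, in a fuller design, a per-service mapping of names to predicates.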

The goal is to define a useful abstraction and query logic shared between the OpenBio projects. The code is available at:

Databases in RDF

Conversion of DDBJ (whose format is mostly identical to GenBank) to RDF has been available since BioHackathon 2010 (BH10). The full dumps are available at

and each entry in RDF can be obtained as

in which "location" is stored "as is" and not divided into sub-pieces with fine resolution for now.

##Links

Tools

OSX

If you are using Fink as your package manager, you can install the Redland libraries in a few steps:

  • enable unstable repository in file /sw/etc/fink.conf, and add unstable/main and unstable/crypto
  • sudo fink selfupdate; fink index -f; fink scanpackages;
  • sudo fink install redland-bin redland-dev redland-shlibs raptor-bin raptor-dev raptor-shlibs librasqal-dev librasqal-shlibs rasqal-bin

Using MacPorts

  • sudo port sync
  • sudo port install redland redland-bindings raptor2 rasqal

##Progress

#Monday

...

#Tuesday

...

#Wednesday

Mark/Luke/Jerven completed a BLAST RDF model.

...

#Thursday

Mark/Luke/Jerven polished the BLAST RDF model.

Raoul has Sequence Record (e.g. DDBJ GenBank) to RDF working with BioRuby, following existing DDBJ RDF model.

Chris has HMMER3 to RDF working, also using BioRuby (HMMER3 RDF/XML).

Peter has a TogoWS REST API wrapper in Biopython, working with Toshiaki to fix some bugs identified.

Matus has partly re-annotated BioXSD with EDAM, RDFS, and DC concepts in order to support hypothetical auto-generation of good-enough RDF from any XML described by a SAWSDL:modelReference-annotated XML Schema (optionally with, so far, 2 additional xs:appinfo elements purely for convenience of the resulting RDF; no SAWSDL:liftingSchemaMapping, no XSLT). Conclusion: worth investing my (& volunteers':) time into this in future. Prerequisites: 1. Additions to EDAM, 2. Better URIs of EDAM concepts, 3. Refinement of annotation of BioXSD (BioXSD types will serve as test cases), 4. Sophisticated runtime XSD (& SAWSDL) resolver (XSLT very likely not enough). Example RDF

#Friday

Toshiaki got Chris' HMMER3 to RDF code integrated into TogoWS, now available online to test (see http://togows.dbcls.jp/site/en/rest.html for details).

Toshiaki also got Raoul's sequence record to RDF converter installed on TogoWS for GenBank, EMBL and DDBJ to RDF.

Peter extended the test coverage of his Biopython wrapper for TogoWS, and reported a few more issues to Toshiaki.

##Questions [E. Antezana]

  • I don't really see the interest in translating BLAST results into RDF :-( Could you please elaborate a bit more on the prospective applications?

  • Chris Z: (Parts of) the BLAST result could be submitted to further analysis (e.g. via SADI). For example, the species distribution of all hits with an E-value below 10E-3 could be analysed.

  • A BLAST result/output will change each time you change your input sequences and/or your DB (target) so you might end up with huge collection of RDF triples...which will be too specific for a given application...

  • Which elements from a BLAST output are going to be captured and translated?

  • Peter C: At very least the matched sequences identifiers, and likely some information about the alignment(s) between the query and the match (i.e. the HSPs) such as the e-value and bitscore.

  • What do you mean by "biologically focussed" in section "Existing Libraries"? What functionality do they fail to provide?

  • Maybe things like user-friendly wrappers for biological data providers, or converters from biological file formats to RDF for sharing data.

  • What are the challenges in this project? Maybe the mapping BLAST output to RDF? (I mean the representation)

  • In the BLAST to RDF case, yes, according to Mark and Luke the hardest part is the RDF model. They plan to meet up Tuesday morning to go over this (see email list), done as of Wednesday morning - BLAST RDF model.

  • @cjfields - just a note: the BioPerl Bio::SearchIO BLAST parser (as well as other SearchIO parsers) use an event handler system, so it's feasible to have a specific handler that generates RDF or other relevant data. Sorry I couldn't be there!

  • Milestones?
