ktym edited this page Jun 4, 2014 · 14 revisions

During the 1st RDF summit for genomics, we developed and agreed on the following:

RDF model for genomics

Common RDF model for representing gene/transcript|mRNA|CDS/exon relationships, based on SIO and FALDO:

<gene> sio:SIO_010080 <transcript> .                 # sio:is-transcribed-into
<transcript> sio:SIO_000974 <part1>, <part2>, ... .  # sio:has-ordered-part
<part1> sio:SIO_000628 <exon1> .                     # sio:refers-to
<part1> sio:SIO_000300 1 .                           # sio:has-value
<exon1> faldo:location <region> .
  :

Given the intensive use of SIO terms in Ensembl RDF, we'll keep using SIO. This may also lead to better integration of genetic variations, which will be represented with the GFVO/SIO ontologies in the future.
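The model above can be sketched as a small generator that emits the triples for one gene, one transcript, and its ordered exons. This is only an illustration under stated assumptions: the function name `transcript_triples` and the `#partN` naming of the part nodes are hypothetical and are not taken from the converters in the summit repository.

```python
SIO = "http://semanticscience.org/resource/"
FALDO = "http://biohackathon.org/resource/faldo#"


def transcript_triples(gene, transcript, exons):
    """Return Turtle lines linking a gene, one transcript, and its ordered exons.

    `exons` is a list of (exon_uri, region_uri) pairs in transcript order.
    Part nodes are minted as <transcript>#part<N>; that naming scheme is an
    assumption made for this sketch only.
    """
    lines = [
        "@prefix sio: <%s> ." % SIO,
        "@prefix faldo: <%s> ." % FALDO,
        "",
        # sio:is-transcribed-into
        "<%s> sio:SIO_010080 <%s> ." % (gene, transcript),
    ]
    for rank, (exon, region) in enumerate(exons, start=1):
        part = "%s#part%d" % (transcript, rank)
        # sio:has-ordered-part
        lines.append("<%s> sio:SIO_000974 <%s> ." % (transcript, part))
        # sio:refers-to
        lines.append("<%s> sio:SIO_000628 <%s> ." % (part, exon))
        # sio:has-value carries the position of the exon in the transcript
        lines.append("<%s> sio:SIO_000300 %d ." % (part, rank))
        lines.append("<%s> faldo:location <%s> ." % (exon, region))
    return lines
```

Calling `print("\n".join(transcript_triples(gene_uri, transcript_uri, exon_pairs)))` with your own URIs prints a Turtle document that follows the same pattern as the example triples.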

Identifiers

  • Use identifiers.org URIs for cross references (at least true for taxon IDs so far)

  • Updates on the Identifiers.org service

  • Requested content negotiation and bug fix in the RDF representation

  • Requested/reviewed additional xref DBs used in the INSDC entries

  • Newly developed ID conversion layer as a virtual SPARQL endpoint

Variations

  • Updated GFVO/SIO for future development of variation RDF and BioInterchange

Integration

  • Trial of on-the-fly RDF conversion of NoSQL/file data for integration of private data and public endpoints by SPARQL SERVICE query
  • in-house HyperEstraier instance holding private experimental data + public Ensembl/RefSeq annotations

Converters

The code we developed is available at https://github.com/dbcls/rdfsummit

SPARQL endpoint

A SPARQL endpoint is temporarily available at http://ep.dbcls.jp/sparql71hg for testing.

The following query is confirmed to work with both graphs:

PREFIX sio: <http://semanticscience.org/resource/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>

SELECT ?gene ?transcript ?exon ?seq
FROM <http://togogenome.org/hg19/refseq>  # or FROM <http://togogenome.org/hg19/ensembl>
WHERE {
 ?gene sio:SIO_010080 ?transcript .
 ?transcript sio:SIO_000974/sio:SIO_000628 ?exon .
 ?exon faldo:location [
   faldo:begin/faldo:position ?begin ;
   faldo:end/faldo:position ?end ;
   faldo:begin/faldo:reference ?seq
 ]
 FILTER (?begin >= 45409039 && ?end <= 45412650)
}
ORDER BY ?begin
LIMIT 100
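To run the same query against either graph without editing it by hand, the query text can be generated from a template. The sketch below only builds the query string and a GET URL using the standard `query` parameter of the SPARQL protocol; it does not contact the endpoint, which is temporary and may no longer respond. The helper names `build_query` and `request_url` are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "http://ep.dbcls.jp/sparql71hg"  # temporary test endpoint named above
GRAPHS = {
    "refseq": "http://togogenome.org/hg19/refseq",
    "ensembl": "http://togogenome.org/hg19/ensembl",
}

# Same query as above; doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """\
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>

SELECT ?gene ?transcript ?exon ?seq
FROM <{graph}>
WHERE {{
  ?gene sio:SIO_010080 ?transcript .
  ?transcript sio:SIO_000974/sio:SIO_000628 ?exon .
  ?exon faldo:location [
    faldo:begin/faldo:position ?begin ;
    faldo:end/faldo:position ?end ;
    faldo:begin/faldo:reference ?seq
  ]
  FILTER (?begin >= {begin} && ?end <= {end})
}}
ORDER BY ?begin
LIMIT {limit}
"""


def build_query(source, begin, end, limit=100):
    """Fill the template for the 'refseq' or 'ensembl' graph."""
    return QUERY_TEMPLATE.format(
        graph=GRAPHS[source], begin=int(begin), end=int(end), limit=int(limit)
    )


def request_url(source, begin, end, limit=100):
    """Return a GET URL carrying the query in the 'query' parameter."""
    params = urlencode({"query": build_query(source, begin, end, limit)})
    return ENDPOINT + "?" + params
```

For example, `build_query("ensembl", 45409039, 45412650)` returns the query above with the Ensembl graph in the FROM clause.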

Remaining issues

How to represent gene names

Currently, in our RDF model:

  • Ensembl uses skos:altLabel for gene names and rdfs:label and dc:identifier for Ensembl IDs
  • RefSeq uses rdfs:label for gene names (as features in INSDC records don't have IDs)

Since gene/mRNA/CDS features in RefSeq carry a /gene_synonym=".." qualifier, I'd like to use skos:prefLabel for gene names and skos:altLabel for synonyms. It would be great if Ensembl used skos:prefLabel instead of skos:altLabel for canonical gene names. Using dct:title would be another option.

Chromosome URIs

We discussed generic URIs for chromosomes (e.g., how to represent human chromosome 19 of GRCh37) and failed to reach an agreement, because a generic URI can't be relied on to represent (FALDO) coordinates in personal, cancer or single-cell genome sequencing, even when the genomes come from the same species.

However, I found that without a common RDF representation it is inconvenient to make the same SPARQL query work for both Ensembl and RefSeq data, for example to say "give me all annotations located between 45409039 and 45412650 on chromosome 19" in the SPARQL query example above.

Currently,

for human GRCh37 chromosome 19 and I'm happy to keep those as is.

My point is: because the human Ensembl and RefSeq entries are made from the same data source, GRCh (37 or later), can we agree on a common way to represent chromosome IDs for those reference sequences? Something like:

<ensembl/refseq chromosome 19> sio:has-identifier "19" .
<ensembl/refseq chromosome 19> sio:has-source "GRCh37" .  # (optional)

or

<ensembl/refseq chromosome 19> sio:has-identifier [
    a sio:physical-entity-identifier;
    sio:has-value "19"
  ]

or

<ensembl/refseq chromosome 19> dct:identifier "19"

Alternatively, if Bio2RDF or Identifiers.org can provide an aggregation of those URIs:

<ensembl/refseq chromosome 19> sio:is-identical-to <id.org/bio2rdf chromosome 19>
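As an illustration of what a shared identifier would buy, the sketch below builds the "all annotations between two positions on chromosome 19" query under the assumption that the dct:identifier variant were adopted and that the node reached via faldo:reference is the chromosome resource carrying it. None of this is agreed or deployed; the function name `region_query` is hypothetical, and the other proposals would need a different triple pattern.

```python
import re


def region_query(chromosome, begin, end, graphs):
    """Build a SPARQL query for features within [begin, end] on one chromosome.

    Assumes (hypothetically) that each reference sequence has
    dct:identifier "<chromosome>", as in one of the proposals above.
    """
    # Reject anything that could break out of the quoted literal.
    if not re.fullmatch(r"[A-Za-z0-9_.]+", chromosome):
        raise ValueError("unexpected chromosome identifier: %r" % chromosome)
    # Several FROM clauses merge the named graphs into the default graph.
    from_clauses = "\n".join("FROM <%s>" % g for g in graphs)
    return """\
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>

SELECT ?feature ?begin ?end ?seq
%s
WHERE {
  ?feature faldo:location [
    faldo:begin/faldo:position ?begin ;
    faldo:end/faldo:position ?end ;
    faldo:begin/faldo:reference ?seq
  ] .
  ?seq dct:identifier "%s" .   # hypothetical common chromosome ID
  FILTER (?begin >= %d && ?end <= %d)
}
ORDER BY ?begin
""" % (from_clauses, chromosome, int(begin), int(end))
```

With both graph URIs passed in `graphs`, a single query text would cover Ensembl and RefSeq without hard-coding either source's chromosome URI.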

Variations

The Ensembl group has started to represent information on genomic variations in RDF, but the work is still ongoing. We'll continue working to make it available in collaboration with BioInterchange/GFVO.