ConstructionOfLinkedDataDB

synobu edited this page Sep 26, 2011 · 103 revisions

Aims

  • Construct a new LinkedData DB that links to the existing major LinkedData DBs.
  • Create a Linked Data HOWTO for the construction of LinkedData DBs.
  • Write up "Reasons why we should dive into RDF and Linked Data" as a manuscript: Why RDF?

Some notes

  • You don't have to write RDF in the RDF/XML format, which is quite complex. Start with the N-Triples format, which is very easy: http://www.w3.org/2001/sw/RDFCore/ntriples/ (by AK)
  • The Turtle format is also easy to handle with Google Refine. If your table is not big (<100,000 lines?), start with Google Refine and the Turtle format.
  • You can convert between various RDF formats with the rapper tool provided by the Redland RDF libraries (Raptor in particular): http://librdf.org/ (by AK)
  • Example: rapper -i ntriples -o rdfxml file.nt > file.rdf
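
As a minimal sketch of why N-Triples is easy to generate, the following Python functions write one triple per line. The http://example.org/ namespace and the sample row are hypothetical placeholders, not taken from any dataset on this page, and only URIs and plain ASCII literals are handled.

```python
def escape_literal(text):
    """Escape backslashes, quotes and control characters inside a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def to_ntriple(subject, predicate, obj, obj_is_uri=False):
    """Return one N-Triples line: <s> <p> <o> . or <s> <p> "literal" ."""
    if obj_is_uri:
        obj_term = "<%s>" % obj
    else:
        obj_term = '"%s"' % escape_literal(obj)
    return "<%s> <%s> %s ." % (subject, predicate, obj_term)


# Hypothetical namespace and table row, for illustration only.
BASE = "http://example.org/"
row = {"id": "probe1", "gene_symbol": "APP", "organ": "brain"}

lines = [
    to_ntriple(BASE + row["id"], BASE + "gene_symbol", row["gene_symbol"]),
    to_ntriple(BASE + row["id"], BASE + "organ", row["organ"]),
]
print("\n".join(lines))
```

Saving the printed lines to a file gives input that can be passed to rapper as in the note above.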

Data to publish in LinkedData DB

Alzheimer gene expression data (BH11Ujicha)

LinkedData DB for gene expression analysis on Alzheimer's disease sample (by SO)

Sample ruby code: https://gist.github.com/1170203#file_sample_sparql.rb (by MN)

To be generalized into a framework for gene expression analysis using a Facet view of LinkedData (by MN).

Glyco-data

Toxicogenomics data (gene expression data)

  • gene expressions (expression value, probe set ID, gene symbol/title, transcript_id, etc.)
  • metadata (dose, dose level, time, organ, in vivo/in vitro, etc.)

Proteomics data (Peptide Mass data)

  • peptides mapping to protein (peptide mass, peptide sequence, protein ID, position, coverage, etc.)
  • metadata (sample, species, analytical instrument, processing software, parameters, searched DB, etc.)
  • proteomics data repositories: PRIDE, PeptideAtlas, Peptidome (site closed; only archived data are available)
  • HUPO-PSI ontologies: PRO, MOD, ProPreO, OBI, SepCV, MSCV

Membrane Proteins of Known Structure (ID Mapping Data)

  • mapping of PDB ID, OPM ID, TC ID

Genome Metadata (Habitat, Ontology alignment)

Setup procedure of LinkedData DB

Setup of RDF store - Virtuoso (open source)

Download the Virtuoso Open-Source Edition from http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSDownload.

RDF (triple) stores were surveyed at https://github.com/dbcls/bh11/wiki/Triplestoresurvey.

CentOS

Installation

$ sudo yum install gcc gmake autoconf automake libtool flex \
  bison gperf gawk m4 make openssl-devel readline-devel wget
$ tar xvfz virtuoso-opensource-6.1.3.tar.gz
$ cd virtuoso-opensource-6.1.3
$ ./configure --prefix=/usr/local/ --with-readline
$ nice make
# make install

Starting server

$ cd /usr/local/var/lib/virtuoso/db/
$ ls
virtuoso.ini
$ virtuoso-t -df

Configuration

Open the Conductor menu by pointing your web browser at http://localhost:8890/conductor/. Before accessing the Conductor menu, you need to open port 8890 with system-config-securitylevel.
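
Whether port 8890 is reachable can be checked before opening the browser. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is an illustration, and the host and port simply follow the Conductor URL above.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # 8890 is the port used by the Conductor URL above.
    if is_port_open("localhost", 8890):
        print("Port 8890 is reachable")
    else:
        print("Port 8890 is not reachable (server down or port closed)")
```

Running the check from another machine, with the server's host name in place of localhost, tests the firewall setting rather than only the local server.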

References

Mac OS X (OS 10.6 Snow Leopard)

Install Gawk

Download Gawk version 3.1.1 (gawk-3.1.1.tar.gz) from http://www.gnu.org/software/gawk/.

$ cd gawk-3.1.1
$ ./configure
$ make
$ sudo make install

Install Virtuoso Open-Source Edition

Download Virtuoso (the OpenLink Virtuoso source code) from the SourceForge project page (http://sourceforge.net/projects/virtuoso/files/). The latest version at the time was 6.1.3 (virtuoso-opensource-6.1.3).

$ cd virtuoso-opensource-6.1.3
$ ./configure
$ make
$ sudo make install

Starting server

$ cd /usr/local/virtuoso-opensource/var/lib/virtuoso/db
$ ls
virtuoso.ini
$ sudo /usr/local/virtuoso-opensource/bin/virtuoso-t -f &

Configuration

Open the Conductor menu by pointing your web browser at http://localhost:8890/conductor/.

References

  • INSTALL and README files in the virtuoso-opensource-6.1.3/ directory.

Setup of RDF store - OWLIM

Setup of RDF store - Sesame

QuickStart - Virtuoso

http://docs.openlinksw.com/virtuoso/quicktours.html

Upload RDF data

Log in as user "dba".

Virtuoso Web interface > RDF (on the top menu) > RDF Store Upload > Choose File and set the appropriate "Named Graph IRI"

Execute SPARQL query

Virtuoso Web interface > RDF (on the top menu) > SPARQL > Set "Default Graph IRI", write your SPARQL query and Execute

Execute SPARQL query via terminal emulator

You can execute a SPARQL query with an HTTP GET request.

Quick guide:

  1. Access the Virtuoso SPARQL Query Form (http://localhost:8890/sparql)
  2. Execute your favorite SPARQL query
  3. Copy the URL of the resulting page
  4. Access the URL with the GET method: $ curl "URL" (Don't forget the double quotation marks!)

Like this:

$ curl "http://localhost:8890/sparql?default-graph-uri=&query=SELECT+%3Fpdb_id%0D%0AWHERE%0D%0A%7B%0D%0A%3Fid+%3Chttp%3A%2F%2Flocalhost%3A3333%2Fpdb_id%3E+%3Fpdb_id%0D%0A%7D%0D%0A&format=text%2Fhtml"

Notes:

  • You can choose several output formats (&format=***): HTML, XML, JSON, CSV, RDF/XML, …
  • When you want to get the result in JSON format, replace &format=text%2Fhtml with &format=application%2Fjson
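
Instead of copying the URL from the browser, the same kind of URL can be assembled in a script. The sketch below uses only the Python standard library; the function name build_sparql_url is hypothetical, and the parameter names (default-graph-uri, query, format) are the ones visible in the curl example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sparql_url(endpoint, query, result_format="application/json",
                     default_graph_uri=""):
    """Build a GET URL with the same parameters as the curl example above."""
    params = {
        "default-graph-uri": default_graph_uri,
        "query": query,
        "format": result_format,
    }
    # urlencode percent-encodes each value (spaces become '+').
    return endpoint + "?" + urlencode(params)


query = "SELECT ?pdb_id WHERE { ?id <http://localhost:3333/pdb_id> ?pdb_id }"
url = build_sparql_url("http://localhost:8890/sparql", query)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(decoded["query"][0] == query)
```

Passing result_format="text/html" instead gives a URL with &format=text%2Fhtml, as in the curl example.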

Reference for SPARQL via scripting languages:

Virtuoso through scripting languages

Please add code written in your favorite scripting language.

Execute SPARQL query with Python (by MM)

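# Written for Python 3.
import json
import urllib.error
import urllib.parse
import urllib.request


class Connection:

    def __init__(self, url):
        self.base_url = url

    def query(self, q):
        # Build the GET request for the SPARQL endpoint, asking for JSON results.
        q = "sparql?default-graph-uri=" + \
            "&query=" + urllib.parse.quote_plus(q) + \
            "&format=application%2Fjson"
        return self._exec_sparql(q)

    def _exec_sparql(self, sparql):
        try:
            data = urllib.request.urlopen(self.base_url + sparql).read().decode("utf-8")
        except urllib.error.URLError as e:
            # Server not reachable, or the server rejected the query.
            return [{"error": str(e)}]
        try:
            return json.loads(data)["results"]["bindings"]
        except (ValueError, KeyError):
            # The response was not SPARQL JSON results; return it as an error.
            return [{"error": data}]


def main():
    c = Connection("http://localhost:8890/")
    response = c.query("""SELECT ?pdb_id WHERE {
                          ?id <http://localhost:3333/pdb_id> ?pdb_id .
                          }""")
    for r in response:
        if "error" in r:
            print(r["error"])
        else:
            print(r["pdb_id"]["value"])


if __name__ == "__main__":
    main()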

Retrieve data from PDBj with Python (by MM)

# Written for Python 3; requires the rdflib package.
import urllib.request

import rdflib


class PDBjRDF:

    def __init__(self, pdb_id):
        self.pdb_uri = "http://pdbj.org/rdf/" + pdb_id
        self.PDBo = rdflib.Namespace("http://pdbj.org/schema/pdbx-v40.owl#")

    def get_pdbx_descriptor(self):
        # Follow PDB entry -> structCategory -> struct -> struct.pdbx_descriptor.
        descriptors = []
        struct_categories = self._get_object(self.pdb_uri, "has_structCategory")
        for uri_s in struct_categories:
            has_structs = self._get_object(uri_s, "has_struct")
            for uri_h in has_structs:
                pdbx_descriptors = self._get_object(uri_h, "struct.pdbx_descriptor")
                for descriptor in pdbx_descriptors:
                    descriptors.append(descriptor)
        return descriptors

    def _get_object(self, uri, predicate):
        # Fetch the RDF/XML document at uri and return the objects of all
        # triples whose predicate is PDBo:<predicate>.
        g = rdflib.Graph()

        objects = []
        try:
            response = urllib.request.urlopen(uri)
            g.parse(response, format="xml")
        except Exception:
            # Network or parse failure: treat as "no objects found".
            return []

        for s, p, o in g.triples((None, self.PDBo[predicate], None)):
            objects.append(o)

        return objects

    def get_uniprot(self):
        # Follow PDB entry -> struct_refCategory -> struct_ref -> link_to_uniprot.
        links = []
        struct_refcategories = self._get_object(self.pdb_uri, "has_struct_refCategory")
        for uri_s in struct_refcategories:
            has_structrefs = self._get_object(uri_s, "has_struct_ref")
            for uri_h in has_structrefs:
                link_to_uniprots = self._get_object(uri_h, "link_to_uniprot")
                for link in link_to_uniprots:
                    links.append(link)
        return links


def main():
    pdb_ids = ["1AP9", "1C8S", "1FBB", "1M0L", "1X0K"]
    for pdb_id in pdb_ids:
        p = PDBjRDF(pdb_id)
        descriptors = p.get_pdbx_descriptor()
        link_to_uniprots = p.get_uniprot()

        print(descriptors, link_to_uniprots)


if __name__ == "__main__":
    main()

Execute SPARQL query with Ruby (by YI)

require 'rubygems'
require 'sparql/client'

sparql = SPARQL::Client.new("http://localhost:8890/sparql")

predicate = "<http://localhost:3333/something>"
object = "?o"

result = sparql.query("SELECT ?o WHERE { ?s #{predicate} #{object} }")
result.each do |i|
  p i
end

Execute SPARQL query with Perl (by YAC)

#!perl

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

my $query = "select ?p, ?o where{ <http://pdbj.org/rdf/7TIM> ?p ?o}";
my $baseURL = "http://sw07.dbcls.jp/sparql/";

# Percent-encode the query so that it is safe to embed in the URL.
my $sparql_query = "query=" . uri_escape($query) . "&debug=on&format=text/csv&save=display";
my $sparqlURL = "$baseURL?$sparql_query";

my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");
my $req = HTTP::Request->new(GET => $sparqlURL);
my $res = $ua->request($req);

print $res->content;

LinkedData generation

Conversion of table data to RDF data

Reference for LinkedData design (type/predicate)

Construction of OWL

  • Protege

LinkedData DB for end user

SPARQL (Search)

  • Searching multiple SPARQL endpoints with one query: SPARQL 1.1
  • See page 98 in Bob DuCharme (2011) Learning SPARQL. O'Reilly.
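
In SPARQL 1.1, a federated query uses the SERVICE keyword to send part of a graph pattern to another endpoint. As a rough sketch, the Python helper below assembles such a query as a string. The function name, the endpoint URL and the graph patterns are hypothetical placeholders, and whether a given store accepts SERVICE depends on its SPARQL 1.1 support.

```python
def federated_query(variables, local_pattern, remote_patterns):
    """Assemble a SPARQL 1.1 query with one SERVICE block per remote endpoint.

    remote_patterns maps an endpoint URL to the graph pattern sent to it.
    """
    parts = ["SELECT " + " ".join(variables), "WHERE {", "  " + local_pattern]
    for endpoint, pattern in remote_patterns.items():
        parts.append("  SERVICE <%s> { %s }" % (endpoint, pattern))
    parts.append("}")
    return "\n".join(parts)


# Hypothetical endpoint and predicates, for illustration only.
q = federated_query(
    ["?id", "?pdb_id", "?label"],
    "?id <http://localhost:3333/pdb_id> ?pdb_id .",
    {"http://example.org/sparql":
        "?id <http://www.w3.org/2000/01/rdf-schema#label> ?label ."},
)
print(q)
```

The resulting string can be sent to a single endpoint in the same way as any other query on this page.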

Facet (Aspect view)

Sub-network visualization (Link view)

ChEMBL and Drug Data links

  1. ChEMBL-RDF - Data Packages - the Data Hub http://ckan.net/package/farmbio-chembl
  2. Pharmaceutical Knowledge retrieval through Reasoning of ChEMBL RDF http://www.slideshare.net/annzi/pharmaceutical-knowledge-retrieval-through-reasoning-of-chembl-rdf
  3. http://annziproject.blogspot.com/
  4. chem-bla-ics: ChEMBL RDF #1:SPARQL end point http://chem-bla-ics.blogspot.com/2010/02/chembl-rdf-1sparql-end-point.html
  5. egonw/chembl.rdf - GitHub https://github.com/egonw/chembl.rdf
  6. ChEMBL-RDF | Kasabi http://beta.kasabi.com/dataset/chembl-rdf
  7. ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_11/chembl_11_release_notes.txt
  8. https://www.ebi.ac.uk/chembl/
  9. http://chem-bla-ics.blogspot.com/2011/03/bioclipse-cdk-chembl-sparql.html
  10. Journal of Biomedical Semantics | Full text | Linking the Resource Description Framework to cheminformatics and proteochemometrics http://www.jbiomedsem.com/content/2/S1/S6
  11. http://chem-bla-ics.blogspot.com/search/label/chembl
  12. W3C HCLSIG LODD http://www.w3.org/wiki/HCLSIG/LODD
  13. LODD Data http://www.w3.org/wiki/HCLSIG/LODD/Data
  14. Journal of Cheminformatics | Full text | Linked open drug data for pharmaceutical research and development http://www.jcheminf.com/content/3/1/19 (Fig 2 shows ChEMBL data used in TripleMap)
  15. Original SPARQL end point http://rdf.farmbio.uu.se/chembl/sparql and SNORQL http://rdf.farmbio.uu.se/chembl/snorql/
  16. https://github.com/bioclipse/bioclipse.chembl/ (has example SPARQL queries)
  17. http://chembl.blogspot.com/2010/08/from-one-of-our-collaborators.html
  18. Kasabi blog post http://blog.kasabi.com/2011/08/23/featured-dataset-chembl-rdf-with-egon-willighagen/

Unsolved questions

Members

  • Soichi Ogishima (SO)
  • Mizuki Morita (MM)
  • Yoshinobu Igarashi (YI)
  • Yi-an Chen (YAC)
  • Shin Kawano
  • Kiyoko Kinoshita
  • Shinobu Okamoto (ShO)
  • Mitsuteru Nakao
  • Anna Kokubu
  • Takaaki Mori
  • Erick Antezana
  • Chisato Yamasaki (also interested in BioDBCore)
  • Yukie Akune
  • Yasunori Yamamoto (YY)