# egonw/chembl-rdf-paper

\documentclass[10pt]{bmc_article}

\usepackage{a4wide} % Formatting web addresses
\usepackage{url} % Formatting web addresses
\usepackage{ifthen} % Conditional
\usepackage{multicol} %Columns
\usepackage[utf8]{inputenc} %unicode support
%\usepackage[applemac]{inputenc} %applemac support if unicode package fails
%\usepackage[latin1]{inputenc} %UNIX support if unicode package fails
\urlstyle{rm}

\usepackage[citestyle=numeric-comp,sorting=none]{biblatex}
\setlength{\bibitemsep}{0cm}
\renewbibmacro{in:}{%
  \ifentrytype{article}{}{%
  \printtext{\bibstring{in}\intitlepunct}}}
\addbibresource{article.bib}

%\usepackage{amsmath}
%\usepackage{endnotes}
\usepackage{graphicx}
%\usepackage{tikz}

\newboolean{publ}

%Review style settings
\newenvironment{bmcformat}{\begin{raggedright}\baselineskip20pt\sloppy\setboolean{publ}{false}}{\end{raggedright}\baselineskip20pt\sloppy}

%Publication style settings
%\newenvironment{bmcformat}{\fussy\setboolean{publ}{true}}{\fussy}

\begin{document}
\begin{bmcformat}

%\pretitle{}
\title{The ChEMBL database as Linked Open Data}

\author{Egon~L.~Willighagen$^{1}$\email{Egon~L.~Willighagen - egon.willighagen@maastrichtuniversity.nl}\and
  Andra~Waagmeester$^1$\email{Andra Waagmeester - FIXME}\and
  Ola Spjuth$^2$\email{Ola Spjuth - FIXME}\and
  Peter~Ansell$^3$\email{Peter Ansell - p\_ansell@yahoo.com}\and
  Antony~Williams$^4$\email{Antony Williams - FIXME}\and
  Valery~Tkachenko$^4$\email{Valery Tkachenko - FIXME}\and
  Janna~Hastings$^5$\email{Janna Hastings - FIXME}\and
  John~Overington$^6$\email{John Overington - FIXME}\and
  Anna~Gaulton$^6$\email{Anna Gaulton - FIXME}\and
  Mark~Davies$^6$\email{Mark Davies - FIXME}\and
  Bin~Chen$^7$\email{Bin Chen - FIXME}\and
  David~Wild$^7$\email{David Wild - FIXME}
}

\address{\iid(1)Department of Bioinformatics - BiGCaT, Maastricht University, P.O.
Box 616, UNS50 Box 19, NL-6200 MD, Maastricht, The Netherlands \\
  \iid(2)Department of Pharmaceutical Biosciences, Uppsala University, PO Box 591, SE-751 24, Uppsala, Sweden \\
  \iid(3)University of Queensland, St Lucia, Qld 4072, Australia \\
  \iid(4)Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC 27587, U.S.A. \\
  \iid(5)Chemoinformatics and Metabolism, European Bioinformatics Institute, POSTAL CODE, Hinxton, United Kingdom \\
  \iid(6)EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom \\
  \iid(7)School of Informatics and Computing, Indiana University, Bloomington, IN, U.S.A.
}

\maketitle

\begin{abstract}
\paragraph*{Background:} Making data available as Linked Data using RDF is advantageous because it promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using URIs. RDF makes the data machine-readable, using extensible vocabularies for additional information. This combination provides a new, open interface for scientists to use ChEMBL.
\paragraph*{Results:} This paper describes the continued conversion of data from the ChEMBL database into RDF triples. This updated version of ChEMBL-RDF now uses recently introduced ontologies, including CHEMINF and CiTO, exposes more information from the database, and is now available as dereferenceable Linked Data. To demonstrate the new features, we present new use cases showing the benefits of integration with other web resources using semantic web technologies.
\paragraph*{Conclusions:}
\end{abstract}

%%%%%%%%%%% The article body starts:

\section*{Introduction}\label{s1}

The current scientific data deluge, in which datasets are growing faster than scientists can analyse and curate them, provides several new challenges for information systems. Scientists wish to discover new, unique, and significant patterns in datasets that explain biological phenomena not yet understood.
This process requires integration of datasets that are both sparse and growing independently of one another. The discovery of new patterns showing the causes and effects of various biological phenomena is beyond the scope of single datasets. For example, systems biologists integrate micro-array differential expression datasets with biological pathways, using various other datasets to provide evidence for the links~\cite{}. Another prominent example is drug discovery, where a new unique chemical entity is discovered based on descriptions of its biological properties. This process requires the effective linkage of many scientific datasets~\cite{Samwald2011,OpenPHACTS}.

ChEMBL contains descriptions of biological activities involving over a million chemical entities, extracted from literature. This provides a unique resource for drug researchers~\cite{Gaulton2012,Warr2009}. It is updated on a fairly frequent basis as the existing data is further curated and new data is added. The ChEMBL dataset is available for download and can be browsed using a web interface. The former requires scientists to import the data into a relational database, while the latter limits machine access to the data.

Two independent teams have previously mapped the ChEMBL dataset to RDF: Chem2Bio2RDF~\cite{Chen2010} and ChEMBL-RDF~\cite{Willighagen2011}. This paper presents the current state of the ChEMBL-RDF dataset and details of the latest structures and ontologies used to map version 13 of ChEMBL to RDF, along with new example use cases showing links to other datasets using their RDF URIs to further support research in the life sciences.

\section*{Methods}\label{s2}

The ChEMBL version used in this paper is ChEMBL 13, which was released on 29 February 2012. The SQL data dump provided by EMBL was inserted into a local MySQL database. A set of SQL queries was then executed to generate RDF files using a set of PHP scripts.
These scripts are Open Source and are available from the source code hosting service GitHub~\cite{ChEMBLRDFGitHub}. This process has been outlined in a best practices note by the W3C Health Care and Life Sciences XXX Working Group~\cite{Marshall2012}.

In the previous RDF conversion, the RDF triples in ChEMBL-RDF were represented using properties and classes from a custom, ad-hoc ontology~\cite{Willighagen2011}. The current version uses community-proposed ontologies, making the RDF more interoperable. The ChEMBL-RDF 13 dataset uses ontology standards such as the Bibliography Ontology~\cite{Giasson2011} and the Citation Typing Ontology~\cite{Shotton2010} for literature references, and domain ontologies like the Protein Ontology~\cite{Sidhu2006} and the Chemical Information Ontology~\cite{Hastings2011}.

Throughout this paper, various shorthand prefixes are used to denote different ontology namespaces to simplify the RDF examples. These prefixes are outlined in Table~\ref{namespaces}.

\begin{table*}
\caption{Prefixes and their matching namespaces used in this paper.}
\label{namespaces}
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Common Vocabularies}} \\
bibo & Bibliography Ontology~\cite{Giasson2011} \\
 & http://purl.org/ontology/bibo/ \\
chebi & Chemical Entities of Biological Interest~\cite{DeMatos2010} \\
 & http://purl.org/obo/owl/CHEBI\# \\
cheminf & Chemical Information Ontology~\cite{Hastings2011} \\
 & http://semanticscience.org/resource/ \\
cito & Citation Typing Ontology~\cite{Shotton2010} \\
 & http://purl.org/spar/cito/ \\
obo / pro & OBO \& PRotein Ontology~\cite{Sidhu2006} \\
 & http://purl.obolibrary.org/obo/ \\
\multicolumn{2}{l}{\textbf{ChEMBL-RDF Namespace}} \\
chembl & http://rdf.farmbio.uu.se/chembl/onto/\# \\
\multicolumn{2}{l}{\textbf{ChEMBL-RDF Prefixes}} \\
act & http://data.kasabi.com/dataset/chembl-rdf/activity/ \\
assay & http://data.kasabi.com/dataset/chembl-rdf/assay/ \\
mol & http://data.kasabi.com/dataset/chembl-rdf/molecule/ \\
res
& http://data.kasabi.com/dataset/chembl-rdf/resource/ \\
\hline
\end{tabular}
\end{table*}

To expose the ChEMBL-RDF data, two approaches have been adopted. First, a SPARQL end point is hosted at Uppsala University, using the Open Source Virtuoso software. Use is free, but querying is capped, based on the estimated computational effort. Second, resources have been made dereferenceable using the Kasabi platform~\cite{kasabi}. The Kasabi hosting provides a SPARQL end point which requires a user account. A user account can be free for casual usage, or paid for consistent usage.

Although many parts of the ChEMBL-RDF 13 dataset now use community-based ontologies, some terms from the previous ad-hoc ontology are still used. These terms were created under the URI namespace \textit{http://rdf.farmbio.uu.se/chembl/onto/\#}, and are referenced here using the prefix \textit{chembl}.

%As a method to further test the access and possibilities of this Linked Data version of
%ChEMBL, new uses cases have been developed for this paper.

\section*{Results}\label{s3}

We here present the updated ChEMBL-RDF, including a description of the mapping process and the identification of links to other datasets.

\subsection{ChEMBL-RDF}

\subsubsection{Data Structure}

For each of the common resource classes, a triple pattern was defined, following the data available in the relational database. Figure~\ref{f1} shows how the various resource classes are linked together. This section shows how data in the ChEMBL database is exposed at the triple level.

\begin{figure}[t]
\includegraphics[width=0.45\textwidth]{figs/relations}
\caption{The various resource types found in the ChEMBL triples. Some entities are subclasses of common classes, while others are instances.}\label{f1}
\end{figure}

The core concept in the ChEMBL database is that of the biological activity. This is the type of information that the ChEMBL database extracts from literature.
In ChEMBL-RDF, activities are mostly represented using the custom ChEMBL-RDF vocabulary, where the type, fields, and links to other resources use the \textit{chembl} namespace. While ChEMBL provides both the original activities as found in literature and standardized values allowing comparison between studies, the triples only make the latter available. The CiTO ontology is used to point to the paper from which the data was extracted.

\begin{footnotesize}
\begin{verbatim}
act:a31863 a chembl:Activity ;
  cito:citesAsDataSource res:r6424 ;
  chembl:forMolecule mol:m180094 ;
  chembl:onAssay assay:a54505 ;
  chembl:relation ">" ;
  chembl:standardUnits "nM" ;
  chembl:standardValue "100000"^^xsd:float ;
  chembl:type "IC50" .
\end{verbatim}
\end{footnotesize}

More than five thousand different activity types are captured by the ChEMBL database. The top five types are Potency (43\%), IC50 (13\%), MIC (4.6\%), Inhibition (3.7\%), and Ki (3.6\%). The activity types in ChEMBL-RDF are currently not available as, or using, an ontology.

The activities themselves are measured against assays, which make up a second important resource type. Various assay types are found in the database: chembl:ADMET, chembl:Binding, chembl:Functional, chembl:Property, and chembl:Unassigned. The assay information links activities to a target (see below) and has a short description. The confidence information assigned by ChEMBL is exposed too, and was previously used to improve activity prediction using Bayesian statistics~\cite{Willighagen2011}.

\begin{footnotesize}
\begin{verbatim}
assay:a17 a chembl:Assay ;
  cito:citesAsDataSource res:r11347 ;
  chembl:hasAssayType chembl:ADMET ;
  chembl:hasConfScore "7"^^xsd:int ;
  chembl:hasDescription
    "Inhibition of ... hydroxylase" ;
  chembl:hasTarget target:t100122 .
\end{verbatim}
\end{footnotesize}

The assays measure activities against particular biological targets.
The ChEMBL database recognizes various types: pro:PR\_000000001, chembl:ADMET, chembl:CELL-LINE, chembl:NUCLEIC-ACID, chembl:ORGANISM, chembl:SUBCELLULAR, chembl:TISSUE, chembl:UNCHECKED, and chembl:UNKNOWN. The latter two are currently defined as explicit types.

\begin{footnotesize}
\begin{verbatim}
target:t1 a chembl:Target ;
  rdfs:subClassOf pro:PR_000000001 ;
  rdfs:label "Glucoamylase" ,
    "Maltase-glucoamylase, intestinal" ;
  dc:identifier "uniprot:O43451" , "3.2.1.3" ;
  dc:title "Maltase-glucoamylase" ;
  chembl:classL1 "Enzyme" ;
  chembl:hasDescription
    "Maltase-glucoamylase, intestinal" ;
  chembl:hasKeyword "Glycosidase" ,
    "Membrane" , "Sulfation" ;
  chembl:organism "Homo sapiens" .
\end{verbatim}
\end{footnotesize}

For drug discovery, the drugs themselves are the main topic of study. ChEMBL contains many different drug types, mostly small molecules, but also peptides, proteins, antibodies, oligosaccharides, oligonucleotides, and even cells. All compounds are represented as classes, following the design of the CHEMINF ontology. These entities do not share a common superclass, but can easily be identified as having the role of being a drug. This is triplified using the OBO and ChEBI ontologies:

\begin{footnotesize}
\begin{verbatim}
mol:m4 obo:has_role chebi:CHEBI_23888 .
\end{verbatim}
\end{footnotesize}

The entity typing is expressed by subclassing other classes. For example, proteins subclass the Protein concept from the PRotein Ontology (PR\_000000001), small molecules subclass the Chemical Entity concept from the CHEMINF ontology (CHEMINF\_000000), and oligosaccharides and oligonucleotides subclass their respective matches in the ChEBI ontology (CHEBI\_50699 and CHEBI\_7754).

For each drug compound, the name and synonyms are provided as labels. When an InChI and InChIKey are available for a drug, these are provided via the CHEMINF formalism (here abbreviated):

\begin{footnotesize}
\begin{verbatim}
mol:m41 rdfs:label "ChEMBL406142" ,
    "Bis(3-[1 .... yl]-propionamide)" ;
  rdfs:subClassOf cheminf:CHEMINF_000000 ;
  cheminf:CHEMINF_000200 m41:inchikey ,
    m41:smiles , m41:inchi .
m41:inchikey a cheminf:CHEMINF_000059 ;
  cheminf:SIO_000300
    "LMCOMIDLRGMFCZ-RIPOXUOASA-N" .
m41:smiles a cheminf:CHEMINF_000018 ;
  cheminf:SIO_000300
    "CC(C)C[C@@H]1N2C= .... O)C7=O)NC1=O)C2=O" .
m41:inchi a cheminf:CHEMINF_000113 ;
  cheminf:SIO_000300
    "InChI=1S/C82H102N .... 66+,67+,68+/m1/s1" .
chemblid:ChEMBL406142 owl:equivalentClass mol:m41 .
\end{verbatim}
\end{footnotesize}

For small molecules, molecular properties are often available from ChEMBL, and as of ChEMBL-RDF 13 these too are exposed in triple format. Like the InChI and InChIKey, these are provided using the CHEMINF ontology approach. Here is, for example, the ALogP value:

\begin{footnotesize}
\begin{verbatim}
mol:m1 cheminf:CHEMINF_000200 m1:alogp .
m1:alogp a cheminf:CHEMINF_000305 ;
  cheminf:SIO_000300 "3.344"^^xsd:double .
\end{verbatim}
\end{footnotesize}

For documents, little information is replicated from the database, taking advantage of PubMed Identifiers (PMIDs) and Digital Object Identifiers (DOIs). URIs for the latter can be resolved online, providing curated information on the identified documents. For each paper, basic properties are provided using the BIBO ontology:

\begin{footnotesize}
\begin{verbatim}
journal:j6c706049c2e08871b7c46a6528065736
  a bibo:Journal ;
  dc:title "J. Med. Chem." .
res:r1 a bibo:Article ;
  dc:date "2004" ;
  dc:isPartOf
    journal:j6c706049c2e08871b7c46a6528065736 ;
  bibo:issue "1" ;
  bibo:pageEnd "9" ;
  bibo:pageStart "1" ;
  bibo:pmid "14695813" ;
  bibo:volume "47" .
\end{verbatim}
\end{footnotesize}

\subsubsection{Data Statistics}

\# triples \# links out \# problems

The full set of triples is available from the website http://semantics.bigcat.unimaas.nl/chembl12/. FIXME!

In contrast to earlier releases of ChEMBL-RDF, it now contains the chemical properties provided by the ChEMBL database as calculated by ACD/Labs and XXXXX.
The data includes chemical properties like polar surface area, pKa and logP, counts of hydrogen bond donors and acceptors, and rotatable bonds. This data is provided for XXXXX structures.

\subsubsection{Data Statistics and Validation}

Each release of the ChEMBL database is accompanied by a set of release counts, which provide a concise content overview of the release in question. The ChEMBL 13 release counts are as follows:

\begin{itemize}
\item 8,845 targets
\item 1,143,682 distinct molecules
\item 6,933,068 activities
\item 617,681 assays
\item 44,682 documents
\end{itemize}

These counts are important for two reasons. First, they show continued growth in ChEMBL data, as seen in all previous releases. Second, they can help validate any transformations that have been applied to the relational data model traditionally used to store the ChEMBL data. The following SPARQL queries attempt to provide a high-level validation of ChEMBL-RDF by regenerating the ChEMBL 13 release statistics.

The following SPARQL query returns 8,845 ChEMBL targets:

\begin{tiny}
\begin{verbatim}
select count(*) WHERE {
  ?s ?p chembl:Target .
}
\end{verbatim}
\end{tiny}

The following SPARQL query returns 1,143,682 ChEMBL molecules:

%Mark Davies notes:
%NOTE: Full set of ChEMBL molecules have been created by querying molecule\_dictionary table
%NOTE: I have added extra triple for each molecule: .
%NOTE: Modified molecule compound structure can be queried here: https://wwwdev.ebi.ac.uk/chembl/sparql
\begin{tiny}
\begin{verbatim}
select count(*) WHERE {
  ?s ?p chembl:Molecule .
}
\end{verbatim}
\end{tiny}

The following SPARQL query returns 617,681 ChEMBL assays:

\begin{tiny}
\begin{verbatim}
select count(*) WHERE {
  ?s ?p chembl:Assay .
}
\end{verbatim}
\end{tiny}

The following SPARQL query returns the 6,933,068 ChEMBL activities:

\begin{tiny}
\begin{verbatim}
select count(*) WHERE {
  ?s ?p chembl:Activity .
}
\end{verbatim}
\end{tiny}

The following SPARQL query returns the 44,681 ChEMBL documents. This is one fewer than appears in the ChEMBL 13 release notes, as it does not include doc\_id $-1$, which refers to the unpublished dataset.

\begin{tiny}
\begin{verbatim}
select count(*) WHERE {
  ?s ?p bibo:Article .
}
\end{verbatim}
\end{tiny}

Recreating the ChEMBL release statistics by querying ChEMBL-RDF with SPARQL gives end users a high degree of confidence that ChEMBL-RDF contains the same core content as the source relational database.

\subsection{Linked Open Data}

To integrate ChEMBL-RDF with other RDF versions of scientific datasets, we link out to various resources. These links are shown in Figure~\ref{2}. Triples for compounds link out to ChemSpider using the complementary index and to OpenMolecules RDF using InChI values. Protein links are given to Bio2RDF using the UniProt identifier. Literature references are directed to CrossRef and Bio2RDF using the DOIs and PubMed identifiers, respectively.

\begin{figure}[t]
\includegraphics[width=0.45\textwidth]{figs/lodgraph}
\caption{The links out of the ChEMBL-RDF data into the Linked Open Data cloud. Edges are labeled by the predicates making the links.}\label{2}
\end{figure}

\subsubsection{Linking out to Bio2RDF}

The Bio2RDF project provides resolvable Linked Data URIs, using a generic Linked Data server~\cite{Ansell2011} to access SPARQL endpoints for a range of scientific databases~\cite{Belleau2008}. A number of these databases are referenced in ChEMBL, including ChEBI, PubMed, and both the UniProt protein and taxonomy databases~\cite{TheUniProtConsortium2010}. These links are vital to provide context for use cases that require a correlation between chemical structures and other scientific data.

%FIXME: For protein links out are based on the FOOBAR for the species, the EC code for proteins, as well as UniProt identifiers.

FIXME: should we link out to the new UniProt RDF?
PA: Code is available on the OpenPHACTS branch to link to http://purl.uniprot.org/uniprot/

\begin{tiny}
\begin{verbatim}
target:t101191 chembl:hasTaxonomy ;
  owl:sameAs ;
  owl:sameAs .
\end{verbatim}
\end{tiny}

For papers with PubMed identifiers we also link out to Bio2RDF:

\begin{tiny}
\begin{verbatim}
res:r23 skos:exactMatch .
\end{verbatim}
\end{tiny}

\subsubsection{Linking out to ChemSpider}

ChemSpider~\cite{Pence2010} is a freely accessible, online database provided by the Royal Society of Chemistry (RSC). It contains over 26 million unique chemical compounds aggregated from over 400 data sources, as well as chemical data extracted from RSC scientific articles and databases. Since its inception, efforts have been made to use it both as a deposition platform for the community to contribute novel data and as a platform for annotation and curation of existing data. Studies have shown that there are data quality issues in many of the public compound databases~\cite{Williams2011}. ChemSpider has become a valuable resource for curated data, especially chemical-compound name mappings.

ChemSpider presently provides the chemical structure, substructure, and similarity searching services underpinning the Open PHACTS semantic web project~\cite{Williams2012}. Specific chemical data sources containing data mappings between ChemSpider identifiers (CSIDs) and the original data source identifiers have been provided to the triple store, together with chemical identifiers including validated chemical names (systematic, generic, and trivial), SMILES, and InChIs. The data mappings between the CSIDs and ChEMBL IDs are released to the community under the Creative Commons Attribution-ShareAlike license (CC-BY-SA 3.0). Attribution should be made to the original ChEMBL database, Open PHACTS, and ChemSpider.

Link mappings are provided with skos:exactMatch predicates, while the ChemSpider identifiers are also available via a CHEMINF representation:

\begin{tiny}
\begin{verbatim}
skos:exactMatch .
\end{verbatim}
\end{tiny}

\subsubsection{Linking out to OpenMolecules RDF}

The InChI is a unique identifier for (small) organic molecules, and has previously been used to define unique IRIs for molecules~\cite{Bradley2009,Willighagen2011}. While IRIs are theoretically unlimited in length, in practice web browsers and servers limit the length of IRIs. Virtuoso, unfortunately, only supports IRIs up to a certain length. Therefore, InChI-based links are only created for smaller molecules. Almost 1.3 million links to OpenMolecules RDF were created in a similar manner to:

\begin{tiny}
\begin{verbatim}
mol:m62687 owl:equivalentClass .
\end{verbatim}
\end{tiny}

Notice here the use of owl:equivalentClass to match the formalism in CHEMINF that defines molecules as classes, rather than instances.

\subsubsection{Linking out to CrossRef}

In addition to the PubMed identifiers used to link from literature references to Bio2RDF, ChEMBL provides DOIs, which we use to link out to the RDF provided by CrossRef~\cite{Bilder2011}:

\begin{tiny}
\begin{verbatim}
res:r2032 owl:sameAs .
\end{verbatim}
\end{tiny}

\section*{Applications}

ChEMBL-RDF can be used to explore interrelated scientific datasets, and we here present a few applications. The first application describes how the SPARQL end point can be used to make the data available as Linked Data via the Bio2RDF platform. The second application takes advantage of the bibliographic information exposed as machine-readable data to calculate citation statistics. The third application shows an integration of ChEMBL-RDF with ChemSpider to provide an extension for the decision support platform in Bioclipse. The last application shows how Chem2Bio2RDF can combine the ChEMBL data with other life sciences databases, showing the power of the Linked Data approach.
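
The applications below all start from SPARQL queries over the triple patterns described in the Results section. As a minimal sketch only (it is not taken from any of the applications), the following query lists molecules with a standardized IC50 value measured against the target identified by UniProt accession O43451, the identifier used in the target example above. It assumes that the \textit{dc} prefix denotes the Dublin Core elements namespace, which the prefix table does not list, and that activity types and units are stored as plain literals, as in the activity example. The LIMIT clause is only there to keep the query cheap on a capped end point.

\begin{tiny}
\begin{verbatim}
# Assumption: dc is the Dublin Core elements namespace.
PREFIX chembl: <http://rdf.farmbio.uu.se/chembl/onto/#>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?molecule ?value WHERE {
  # target -> assay -> activity -> molecule
  ?target   dc:identifier "uniprot:O43451" .
  ?assay    chembl:hasTarget ?target .
  ?activity chembl:onAssay ?assay ;
            chembl:forMolecule ?molecule ;
            chembl:type "IC50" ;
            chembl:standardUnits "nM" ;
            chembl:standardValue ?value .
}
LIMIT 10
\end{verbatim}
\end{tiny}

The query follows the chain from target to assay to activity to molecule, so swapping the identifier literal or the activity type is enough to adapt it to another target or measurement.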
\subsection{Bio2RDF}

The Linked Data server used by Bio2RDF has been reconfigured for ChEMBL to provide URL-based services for standard URI resolution, along with text and link searches~\cite{Ansell2011}. A Java Web Archive (WAR) file, along with the configuration files and build scripts for the ChEMBL-RDF Linked Data server, is available on GitHub~\cite{WebAppGitHub}. It proxies the standard ChEMBL-RDF URIs by translating URLs between those requested by users and the URIs that are available in SPARQL endpoints.

For example, if the ChEMBL web application is running on the user's local machine, e.g. \url{http://localhost:8080/chembl/}, then a request for the article with identifier ``a31863'', \url{http://localhost:8080/-chembl/article/a31863}, will be resolved from the database using the full original URI, \url{http://data.kasabi.com/-dataset/chembl-rdf/13/activity/a31863}. If the user requested an RDF document using content negotiation, the original URIs will be unchanged; however, if the user requested an HTML document, the results will contain the original RDF triples, represented using RDFa, with links that resolve using the user's local machine.

The link services enable the ChEMBL application to derive both forward links, originating in ChEMBL, e.g. \url{http://localhost:8080/chembl/linkstonamespace/-targetns/originalns:identifier}, and backward links, originating in other databases, such as LODD, Bio2RDF\footnote{\url{http://bio2rdf.org/linksns/targetns/originalns:identifier}}, and Chem2Bio2RDF. These services are vital to efficiently navigate the Linked Data web, as it is both impractical and inefficient to require users to crawl the entire web before they can discover relevant resources. These services are currently only supplied as web services from ChEMBL and Bio2RDF, but it is hoped that similar services will be provided by other scientific Linked Data providers in the future.
Datasets that are available in SPARQL endpoints can be queried for links efficiently using simple queries, as demonstrated in the ChEMBL web application.

\subsection{CitedIn}

Andra: use the SPARQL end point to find which entries in ChEMBL cite what papers

The link between data and formal publication is important in many areas of attribution, scientist ranking, etc., as outlined in~\cite{Waagmeester2012}. ChEMBL contains many literature references, and we wish to query this data for CitedIn.

\subsection{Bioclipse Decision Support}

% Egon,Antony/Valery,Ola: develop and write up the Bioclipse Decision Support use case

Bioclipse Decision Support (Bioclipse DS)~\cite{Spjuth:2011uq} is a user-oriented tool based on the Bioclipse workbench for providing on-time and on-demand information on chemical structures. Such information can include calculated properties, data from database queries, and results from predictive models. Bioclipse DS has previously been demonstrated on predictive modeling in drug safety assessment~\cite{Spjuth:2011uq} and has also been linked to the OpenTox infrastructure to invoke and present results from distributed toxicity predictions~\cite{Willighagen:2011kx}.

In this study we extended Bioclipse DS with remote access to ChEMBL-RDF and ChemSpider. This enables users browsing chemical structures in Bioclipse to look up near neighbors in ChemSpider via the ChemSpider Web API (SOAP), and to query ChEMBL-RDF for available interaction data on the compounds found. The results are presented alongside predictive models in Bioclipse (see Figure~\ref{fig:bioclipse-ds}) and can be used for decision support when evaluating chemical structures and considering strategies for optimization.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=16cm]{bioclipse-ds.png}
\newline
\caption[wee]{Screenshot from Bioclipse Decision Support with results from a ChemSpider + ChEMBL-RDF search.
The top middle canvas contains the query structure, the top right canvas shows the near neighbors in ChemSpider (via a similarStructure search), the lower right shows the chemical structure for the selected compound in the top right canvas, and the lower center canvas shows the found interactions for this compound.}
\label{fig:bioclipse-ds}
\end{center}
\end{figure*}

\subsection{Chem2Bio2RDF}

David/Bin: Chem2Bio2RDF

\subsection{Compound Selectivity}

NOTE: May want to exclude this Use Case as the query has not successfully run yet; it keeps causing Virtuoso to time out or run out of memory. Currently playing with local server setup to get query to run.

Designing a molecule that is selective to one target over another is often considered a successful outcome in a drug design process. If the target in question is a protein, it is easy to understand that when sequence identity among the residues that contribute to potential small molecule binding sites remains high, so does the issue of selectivity. That is, the molecule being designed may also bind to the equivalent binding site in the closely related protein target, which may lead to undesirable consequences.

The ChEMBL data model links molecules to targets using different activity types recorded in the literature. Using activity types that act as a measure of binding affinity, such as IC50 or Ki, and applying activity value cut-offs, it is possible to identify molecules which have higher binding affinity to certain targets compared to others. Taking this one step further, it is possible to identify a set of molecules which have a high affinity to protein A (e.g. IC50 value $<$ 50 nM) and low affinity to protein B (e.g. IC50 value $>$ 200 nM).

The following SPARQL query identifies a set of molecules which, based on data curated from the literature and stored in the ChEMBL-RDF data model, selectively bind Human Cyclin-Dependent Kinase 2 (UniProt: P24941) over Human Cyclin-Dependent Kinase 4 (UniProt: P11802).
NOTE: Query should return 65 molecules

\begin{tiny}
\begin{verbatim}
select distinct ?molecule WHERE {
  ?assay ?p chembl:Assay ;
    chembl:hasTarget ?target .
  ?activity ?p chembl:Activity ;
    chembl:forMolecule ?molecule ;
    chembl:onAssay ?assay ;
    chembl:type ?type ;
    chembl:standardUnits ?unit ;
    chembl:standardValue ?value ;
    chembl:relation ?relation
    FILTER ( ?type = "IC50" && ?unit = "nM" &&
             ?value < 50 && ?relation = "=" ) .
  ?target ?p chembl:Target ;
    dc:identifier ?id
    FILTER ( ?id = "uniprot:P24941" ) .
  {
    select ?molecule WHERE {
      ?assay ?p chembl:Assay ;
        chembl:hasTarget ?target .
      ?activity ?p chembl:Activity ;
        chembl:forMolecule ?molecule ;
        chembl:onAssay ?assay ;
        chembl:type ?type ;
        chembl:standardUnits ?unit ;
        chembl:standardValue ?value ;
        chembl:relation ?relation
        FILTER ( ?type = "IC50" && ?unit = "nM" &&
                 ?value > 200 && ?relation = "=" ) .
      ?target ?p chembl:Target ;
        dc:identifier ?id
        FILTER ( ?id = "uniprot:P11802" ) .
    }
  }
}
\end{verbatim}
\end{tiny}

Analysis of the set of molecules returned by this query can help identify small molecule features which may increase target-binding specificity.

For queries which link small molecules to targets by traversing bioactivity data in the ChEMBL database, it is also important to consider the parameters associated with the assay-to-target mappings. These additional parameters include a relationship type, a multi flag (for poorly defined targets), a complex flag (for protein complex targets), and a curation level. These different factors are summarized in the ChEMBL confidence score, which ranges from 9 (direct single protein target) to 0 (uncurated). In order to return the largest possible dataset, the confidence score has been ignored in this example compound selectivity use case.

\section*{Discussion}

While we show here that the RDF version of the ChEMBL data is very useful, the current RDF triples are by no means static: ChEMBL-RDF will keep evolving, following new developments in the Linked Open Data world.
For example, it is likely that over time we will link out to more Linked Data resources. More importantly, we wish to adopt more common ontologies, of which the BioAssay Ontology is planned to be the next~\cite{Visser2011}. These steps will make the ChEMBL-RDF triples even more interoperable.

\section*{Authors' contributions}

EW initiated the project, created the initial RDF version of the ChEMBL data, and encouraged the use cases. AW extended CitedIn to support citation info in ChEMBL-RDF. OS, AW, and VT developed the nearest neighbor application of Bioclipse and ChemSpider. JH supported the project with CHEMINF representations. PA integrated ChEMBL-RDF into Bio2RDF. JO, AG, and MD critically validated and summarized the ChEMBL data statistics. All authors contributed to the continued development of ChEMBL-RDF, as well as to the writing of the paper, and approved the final version.

\section*{Acknowledgements}

We thank Tim Hodson and Zach Beauvais of Kasabi and A. L\"ovgren at the BMC Computing Department at Uppsala University for their support. OS acknowledges funding from the Swedish VR (2011-6129) and eSSENCE. The data in ChEMBL is made available by funding from the Wellcome Trust [086151/Z/08/Z].

{\ifthenelse{\boolean{publ}}{\footnotesize}{\small}
\printbibliography
}

\ifthenelse{\boolean{publ}}{\end{multicols}}{}

\end{bmcformat}
\end{document}