- Kazuharu Arakawa
- Kazuki Oshita
Restauro-G v.2 - yet another ID conversion service
A large proportion of bioinformatics research consists of data integration tasks: fetching, parsing, and merging data sets and functional annotations from dispersed databases with the support of database cross-references and hyperlinks. A centralized server containing curated links among these databases would therefore expedite such tasks. To this end, we have been developing an identifier conversion web service named Restauro-G version 2. Based on the cross-referencing information available from UniProt and KEGG, this service retrieves all identifiers and their PURLs (Persistent Uniform Resource Locators) related to an identifier provided by the user, via a REST interface. Users may also supply nucleotide or amino acid sequences in place of an identifier, for rapid annotation of sequences. To comply with the recent Semantic Web and Linked Data initiatives, results can be returned in N-Triples or RDF formats for interoperability, as well as in the legacy GenBank, EMBL, and tabular formats. This service is freely available at http://rest.g-language.org/annotations/.
Restauro-G v.1 was a simple but rapid re-annotation tool for GenBank files, aimed at data preparation for comparative genomic studies. v.2 is an almost complete overhaul of this software: it is now a REST web service that retrieves information from gene/protein IDs or sequences.
- Tamaki S, Arakawa K*, Kono N, Tomita M, "Restauro-G: A Rapid Genome Re-Annotation System for Comparative Genomics", Genomics Proteomics Bioinformatics, 2007, 5(1): 53-58. (http://www.ncbi.nlm.nih.gov/pubmed/17572364)
During BioHackathon 2011, we added support for querying by nucleotide or amino acid sequence, which runs a BLAT search to retrieve the closest UniProt ID. RDF output was also added during this Hackathon.
Example queries:
- GeneID:947170 in tabular format
- P0A7G6 (UniProt) in N-Triples format
- hsa:126 (KEGG) in RDF format
- POST a sequence directly
One of the central advantages of Linked Data for an end-user biologist is the ease of comprehensive retrieval of related information. This allows a researcher to observe a data set or a gene set from a multi-omics point of view, including gene functions and ontology, orthologous groups or pathways, amino acid domains and protein families, and so on. On the other hand, the multitude of Linked Data can easily become overwhelming, resulting in the familiar "hairballs" frequently seen in protein-interaction networks: the links among data become too complex to comprehend.
Therefore, we believe the retrieved information needs some kind of sophisticated filtering that ranks the results by relevance to one's interest, or by enrichment for a phenomenon of interest. Such filtering, or data arrangement and presentation, should ideally be available at query time with intuitive visualization. To this end, during this Hackathon, we developed a proposal for such filtering and visualization.
Starting from the complete genome (gene set) of Escherichia coli, we first retrieved all Linked Data using the above-mentioned Restauro-G v.2 service, as well as several numerical values calculated through the G-language REST web service (a product of BioHackathon 2009), including the gene expression prediction measures CAI (Codon Adaptation Index), PHX (Predicted Highly eXpressed genes), and FOP (Frequency of OPtimal codons), together with gene lengths and distance from the origin of replication.
- Arakawa K*, Kido N, Oshita K, Tomita M, "G-language Genome Analysis Environment with REST and SOAP Web Service Interfaces", Nucleic Acids Research, 2010, 38:W700-W705. (http://www.ncbi.nlm.nih.gov/pubmed/20439313)
These related data can be categorized as "nominal" or "numerical". All database links retrieved using Restauro-G v.2 are "nominal", and all values calculated using the G-language REST service are "numerical". Therefore, three types of statistics are needed to calculate the relevance among these data ("nominal" vs. "nominal", "numerical" vs. "numerical", and "nominal" vs. "numerical"). The two same-type comparisons can use simple statistics of association (or correlation), as follows:
- Cramer's V for nominal data
- Spearman's rank correlation for continuous data
Cramer's V contingency coefficient (Cramer, 1946) is an extension of the φ coefficient to r × c contingency tables:

$$V = \sqrt{\frac{\chi^2}{n\,(w - 1)}}$$

where χ² is the value of the χ² test statistic, n is the total frequency in the contingency table, and w is the smaller of r and c.
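As a minimal sketch of the definition above, Cramer's V for two nominal annotations of the same genes can be computed in plain Python. The function name `cramers_v` and the example labels are hypothetical and only illustrate the formula; this is not the code behind the demo.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramer's V between two equally long sequences of nominal labels.

    xs[i] and ys[i] are the labels that two databases assign to gene i.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    row_totals = Counter(xs)          # marginal counts over the r row labels
    col_totals = Counter(ys)          # marginal counts over the c column labels
    observed = Counter(zip(xs, ys))   # cell counts of the r x c table
    w = min(len(row_totals), len(col_totals))
    if n == 0 or w < 2:
        # V is undefined with fewer than two categories on either side;
        # this sketch reports "no association" in that case (an assumption).
        return 0.0
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (w - 1)))


# Hypothetical labels for four genes from two annotation sources.
pathway = ["glycolysis", "glycolysis", "ribosome", "ribosome"]
domain = ["kinase", "kinase", "RNA-binding", "RNA-binding"]
v = cramers_v(pathway, domain)  # the two labelings agree perfectly here
```

A value near 1 suggests that the two sources partition the genes in nearly the same way, while a value near 0 suggests they carry largely independent information.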
These values (Cramer's V and Spearman's ρ) indicate how closely related two pieces of information are. Biological databases often overlap, such as KEGG, Reactome, and BioCyc for pathways; Pfam, InterPro, and TIGRFAMs for protein domains; and COG, eggNOG, and KEGG OC for orthology. It is therefore useful to show the relatedness of the databases when looking for information of interest. If similar databases are clustered according to their relatedness or similarity, a researcher can easily choose one database from each cluster and spend more time on different clusters, or examine the detailed differences within a cluster.
Moreover, when both "nominal" and "numerical" data are available, they can be used for enrichment analysis, to find the database, or the entries of a database, most relevant to a given numerical feature. For example, one can quickly find the most enriched Gene Ontology terms, KEGG pathways, or protein domains among the most highly expressed genes. Fisher's exact test is the test most frequently used for enrichment analysis in biology, so here we use:
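For the numerical side, Spearman's ρ between two per-gene measures (for example CAI and gene length) can be sketched as the Pearson correlation of their ranks. The helper names `average_ranks` and `spearman_rho` are hypothetical, tied values receive their average rank, and the input values are made up for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation between two equally long numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # A constant input has no defined correlation; this sketch returns 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Made-up values for four genes: both measures increase together.
cai = [0.31, 0.45, 0.62, 0.80]
length = [300, 450, 900, 2400]
rho = spearman_rho(cai, length)
```

Because only ranks are used, any monotonic relationship yields |ρ| = 1, which suits measures such as CAI, PHX, and FOP that live on different scales.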
- Fisher's exact test for enrichment of the top 25% of continuous data (in comparison to all genes) against nominal data (categories)
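The enrichment step can be sketched as follows, under stated assumptions: the test is taken to be one-sided (over-representation in the top quartile), the top 25% is chosen by simple ranking without special handling of ties at the cutoff, and the function names and example data are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, K, n, N):
    """One-sided Fisher's exact p-value for over-representation.

    N: all genes, K: genes carrying the label, n: genes in the top group,
    k: genes in the top group that carry the label.
    Returns P(X >= k) for X hypergeometric with parameters (N, K, n).
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def top_quartile_enrichment(scores, annotations):
    """Test each label for enrichment among the top 25% of genes by score.

    scores: dict mapping gene -> numeric value (e.g. CAI).
    annotations: dict mapping gene -> iterable of nominal labels.
    Returns a dict mapping label -> one-sided p-value.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    N = len(ranked)
    n = max(1, N // 4)                 # size of the top-25% group
    top = set(ranked[:n])
    members = {}
    for gene in ranked:
        for label in annotations.get(gene, ()):
            members.setdefault(label, set()).add(gene)
    return {
        label: fisher_enrichment_p(len(genes & top), len(genes), n, N)
        for label, genes in members.items()
    }


# Made-up example: eight genes, two of which carry the label "ribosome".
scores = {f"g{i}": 9 - i for i in range(1, 9)}   # g1 scores highest
annotations = {"g1": ["ribosome"], "g2": ["ribosome"], "g5": ["transport"]}
p_values = top_quartile_enrichment(scores, annotations)
```

When many labels are tested at once, the resulting p-values would normally need a multiple-testing correction before ranking; that step is omitted from this sketch.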
Demo visualizations:
- Graph showing only associations between data sets: http://ws.g-language.org/toys/bh11/
- Graph showing associations with enrichment: http://ws.g-language.org/toys/bh11/index2.html
This is currently a demo with pre-calculated data for E. coli. However, this kind of visualization based on a statistical scoring scheme should provide an intuitive framework for novice users of Linked Data, showing what is relevant to the data of interest rather than blinding them with a mass of information, and helping them make sense of huge volumes of data.