G-language Project

Members

Kazuharu Arakawa
Kazuki Oshita
Hidetoshi Itaya

Aim (with some background)

Over the last couple of years, two artificial intelligence projects have achieved major milestones in knowledge discovery and complex data integration, and have attracted massive media attention: Watson by IBM, and Siri by Apple. Watson (DeepQA) defeated human world champions on the quiz show 'Jeopardy!', and Siri has shipped with millions of new iPhones as a personal concierge that communicates in natural language. While the focus of these projects is rather different (one is a quiz contestant, the other an assistant), both face similar technical challenges. For example,

Voice recognition and speech synthesis
Natural language processing
Extensive data integration from an extremely wide range of knowledge domains
Need to provide answers promptly with low latency

The solutions to these problems also have several aspects in common. For example, both projects employ large-scale computing (an IBM Power cluster for Watson, cloud compute servers for Siri) and voice recognition from Nuance Communications, Inc. For NLP and large-scale data integration, both projects narrow down the search space by first assigning a task "domain". For example, Watson first guesses the "lexical answer type" of the quiz clue using machine learning, and Siri determines the type of task (such as scheduling or route finding) and limits the task to that context. Moreover, to achieve low latency and to restrict tasks to specific domains, both projects use a mix of raw text, relational databases, API calls, and semantic web technologies.

Similarly, many of the challenges of data integration and knowledge discovery in bioinformatics would presumably benefit from these approaches. Natural language queries are arguably the most human-friendly user interface and would help to bridge the gap in informatics knowledge between wet and dry scientists. Moreover, real-world problems in bioinformatics are already solved using a mix of tables, relational databases, and semantic web data.

To this end, we, the G-language Project members, decided to create a virtual research assistant for bioinformatics, just like Siri but for the biology domain, and to prototype this application during BioHackathon 2012. For now, we have limited the domain of data querying primarily to "gene-centric" and "genome-centric" approaches, because they are easy to implement and such queries are frequent.

Genie

Genie is a virtual research assistant for biology. Users communicate with Genie in English by voice (using HTML5 speech recognition features or the dictation function of OS X Mountain Lion), and Genie replies in a synthesized voice (via the Google Translate API or the OS X speech service).

Genie can find information in three main categories:

Anything about a gene of interest, such as its sequence, function, cellular localization, pathways, related diseases, related SNPs and polymorphisms, interactions, regulation, and expression levels.
Anything about a set of genes, based on multiple criteria. For example, you can ask Genie to extract all SNPs in genes that are related to cancer, that work as transferases, that are expressed in the cytoplasm, and that have orthologs in mouse.
Anything about a genome, such as producing different types of visual maps, calculating GC skew, predicting the origin and terminus of replication, calculating codon usage bias, and so on.

To perform these tasks, Genie must first recognize the name of the organism of interest and then the genes of interest (recognizing gene names requires knowing the organism). This is an NLP problem: a species name can be scientific or common, singular or plural, abbreviated, punctuated, or spaced in different ways. Since Genie can only answer questions about organisms for which we have data, we used a dictionary-based approach. When a common name is shared by several species, or a species name covers several subspecies and strains, candidates are weighted by the amount of annotation available for their gene sets. Once the organism name is resolved, annotations are fetched for that species, and a dictionary of its gene names is then created dynamically.
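The dictionary lookup with annotation-based weighting described above could look roughly like the following Python sketch. It is not Genie's actual implementation: the candidate list, the synonym lists, and the annotation counts are invented for illustration, and `normalize`, `build_dictionary`, and `resolve` are hypothetical helper names. The sketch only resolves a name that has already been isolated from the query and does not extract names from a full sentence.

```python
import re

# Hypothetical candidates: (canonical name, synonyms, annotation count).
# The annotation counts are invented for illustration only.
CANDIDATES = [
    ("Escherichia coli K12",
     ["Escherichia coli", "E. coli", "E. coli K12", "Escherichia coli K-12"],
     9000),
    ("Escherichia coli (another strain)",
     ["Escherichia coli", "E. coli"],
     4000),
    ("Mus musculus", ["mouse", "mice"], 30000),
]


def normalize(name):
    """Lowercase and drop punctuation and spacing so spelling variants collide."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_dictionary(candidates):
    """Map each normalized synonym to every candidate that uses it."""
    dictionary = {}
    for canonical, synonyms, weight in candidates:
        for synonym in [canonical] + synonyms:
            dictionary.setdefault(normalize(synonym), set()).add((canonical, weight))
    return dictionary


def resolve(query, dictionary):
    """Return the best-annotated candidate for a name, or None if unknown."""
    matches = dictionary.get(normalize(query))
    if not matches:
        return None
    # Overlapping names are disambiguated by the amount of annotation.
    return max(matches, key=lambda match: match[1])[0]


if __name__ == "__main__":
    dictionary = build_dictionary(CANDIDATES)
    for query in ["E.coli", "e coli", "Escherichia coli K-12", "mice"]:
        print(query, "->", resolve(query, dictionary))
```

Plural and common names are handled here simply by listing them as synonyms. A real dictionary would be generated from the annotation data rather than written by hand.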

For information retrieval, we make full use of the different software systems of the G-language Project.

G-language Genome Analysis Environment and its REST service for extremely rapid genome-centric analysis and information retrieval
G-language Maps (Genome Projector and Pathway Projector, as well as Chaos Game Representation REST Service) for visualization of genomic information
Keio Bioinformatics Web Services EMBASSY package (which adds the functionality of BLAST, MAFFT, PHYLIP, and 50 other major bioinformatics tools to EMBOSS) as well as EMBOSS itself, totaling more than 400 tools
G-Links: an extremely rapid gene-centric data aggregator, which integrates numerous databases

A prototype version of Genie can be accessed at: http://ws.g-language.org/genie/

A demo movie is available on YouTube: http://www.youtube.com/watch?v=V4jsuIOAwyM

G-Links

Project Website: http://www.g-language.org/wiki/glinks
Base URL: http://link.g-language.org/ (we now have a new server and a shorter URL!)

Updates:

Cover flow view, e.g. http://link.g-language.org/BRCA1_HUMAN
filter and extract options
support for taxonomy (you can now retrieve the annotations for the entire gene set of an organism, using the RefSeq ID for microbes and the NCBI taxonomy ID for eukaryotes)
additional supported databases (for SNPs, diseases, etc.)

Examples:

RECA_ECOLI and GeneID:11922666 in TSV format
- http://link.g-language.org/RECA_ECOLI,GeneID:11922666/format=tsv
All human (NCBI taxid: 9606) genes related to cancer, in Notation3 format
- http://link.g-language.org/9606/filter=:cancer/format=n3
GOslim_process annotations of E. coli (RefSeq: NC_000913) genes whose GOslim_process is related to "metabolic", in RDF format
- http://link.g-language.org/NC_000913/filter=GOslim_process:metabolic/extract=GOslim_process/format=rdf
All annotations for KEGG Orthology K03553, in RDF format
- http://link.g-language.org/KO:K03553/format=rdf
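The example URLs above follow a common pattern: the base URL, a comma-separated list of IDs, and optional `filter=`, `extract=`, and `format=` path segments. The Python sketch below assembles such URLs. The function name `glinks_url` and its parameter names are hypothetical, the segment order simply mirrors the examples, and no escaping is applied because the examples use `:` and `,` literally. Whether the service accepts other orderings or requires escaping for other characters is not covered here.

```python
BASE_URL = "http://link.g-language.org/"


def glinks_url(ids, filter_=None, extract=None, fmt=None):
    """Assemble a G-Links URL from IDs and optional path options.

    ids     -- list of identifiers (gene IDs, taxonomy ID, RefSeq ID, ...)
    filter_ -- value for the filter= segment, e.g. ":cancer"
    extract -- value for the extract= segment, e.g. "GOslim_process"
    fmt     -- value for the format= segment, e.g. "tsv", "n3", "rdf"
    """
    segments = [",".join(ids)]
    # Options are appended in the order used by the examples above.
    if filter_ is not None:
        segments.append("filter=" + filter_)
    if extract is not None:
        segments.append("extract=" + extract)
    if fmt is not None:
        segments.append("format=" + fmt)
    return BASE_URL + "/".join(segments)


if __name__ == "__main__":
    print(glinks_url(["RECA_ECOLI", "GeneID:11922666"], fmt="tsv"))
    print(glinks_url(["9606"], filter_=":cancer", fmt="n3"))
    print(glinks_url(["NC_000913"], filter_="GOslim_process:metabolic",
                     extract="GOslim_process", fmt="rdf"))
```

Run as a script, this prints the first three example URLs listed above.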

G-language GAE

We have added natural language support for specifying the input genome. For example, G-language GAE could already load the E. coli K12 genome in any of the following ways:

load("NC_000913") #RefSeq ID, automatically fetched from NCBI
load("embl:U00096") #EMBL USA, automatically fetched from EMBL
load("ecoli.gbk") #local genbank file

Now you can also do the following:

load("511145") #NCBI taxonomy ID, automatically fetched from NCBI
load("E.coli") #abbreviation of genus
load("e coli") #more ambiguity
load("E.coli K12") #with strain name
load("Escherichia coli K-12") #more variation
load("Escherichia") #only the genus

We score species according to annotation relevance: model organisms have more annotations than related species, so even though dozens of E. coli genomes are available nowadays, G-language GAE correctly picks the most relevant strain, K12, as the candidate. This can also be used in the web service:

http://rest.g-language.org/Escherichia%20coli%20K12/
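In the URL above, the spaces in the organism name are percent-encoded as `%20`. A minimal Python sketch for building such a URL from a free-text organism name follows. The helper name `gae_rest_url` is hypothetical, and the sketch assumes the name is passed as a single path segment followed by a trailing slash, as in the example.

```python
from urllib.parse import quote

REST_BASE = "http://rest.g-language.org/"


def gae_rest_url(organism):
    """Build a REST URL for an organism name given as free text."""
    # quote() percent-encodes spaces as %20; safe="" also encodes "/"
    # so that the name always stays within one path segment.
    return REST_BASE + quote(organism, safe="") + "/"


if __name__ == "__main__":
    print(gae_rest_url("Escherichia coli K12"))
```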
