Working group motivation:
"having good fish (raw material) is very important to get a good sushi (product)", Yamamoto-san, 2012
"having good rice (raw material) is very important to get good sake (product)", Erick-san, 2012 ;-)
"having good ontologies (raw material) are very important to get a sound Semantic Web (product)", BH12, 2012
- Metagenome Environmental Ontology (MEO) update ontology with input from PLB. [HM][SK][SO][WI]
- Form guidelines for developing MEO for sustainability and usefulness (SOP) [PLB]
- Explore the role of application ontologies as moderators between casual users and ontology developers. Outline draft SOP.
- Design an ontology describing incomplete enzyme reaction equations. [MK][PLB]
- Integrate/deploy ONTO-perl modules for OBO-formatted ontologies. [EA]
- guidelines for ontology developers
  - common pitfalls
  - which ontologies should I (re)use?
- guidelines for predicates design
- try to develop an automatic ontology mapping workflow with NER
- Perl modules to automatically check OBO-formatted ontologies
What is an ontology?
[PLB] The use of the term ontology varies across and within research domains. Introductory lectures from the perspective of Barry Smith are available here, with a view on the role of ontology in scientific domains (compared to engineering) here. A guide to ontology development in Protégé by Noy and McGuinness is available [here](http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html).
When considering MEO, it became clear that more careful consideration of the classes this project seeks to represent, and a revision of the idea of 'ontology coordination', were needed. It seems that MEO, in spirit, aims to serve as an application ontology for the (meta)genomics community (please correct this view if it's mistaken). While this may help in the short term for simple queries, 'merging' large sections of orthogonal ontologies under one framework will become problematic when more involved questions are posed.
[HM]: I think the difficulty our MEO has run into is OBO vs OWL. Are the usages of classes and predicates well defined and described in the ontology file, and are they understandable to computers? If so, we can check the consistency of class and predicate usage by computer and revise MEO easily. Related discussion: http://themindwobbles.wordpress.com/2008/02/20/from-obo-to-owl-and-back-again-obo-capabilities-of-the-owl-api/
[EA]: Do you really need to capture your terms in OWL? Is OBO not enough?
[PLB]: Also, do you think the OBO vs OWL issue is as important as addressing the design/concept of MEO?
Goal 1: MEO update
background: MicrobeDB.jp (under construction) integrates several kinds of microbial data (including omics, taxonomy/cultures, and habitats) using Semantic Web technology. To describe microbial habitat information in existing database entries appropriately and make it easy to search, we developed the Metagenome/Microbes Environmental Ontology (MEO), which coordinates ENVO, GAZ, NCBI Taxonomy, FMA, and BodyParts3D.
use case description:
- To know the distribution of specific microbes (e.g., Escherichia coli SE11) on earth.
- To obtain the shared ortholog-set of microbes which inhabit specific habitats.
- Multivariate analysis between environmental habitats and taxonomic compositions of microbes.
- Functional enrichment analysis based on functional annotations.
- Phylogenetic analysis based on 16S rRNA sequences.
Issues and suggested updates
[PLB] MEO's attempt to combine very different ontologies/taxonomies/vocabularies is likely to lead to semantic confusion. For example, the terms in the NCBI Taxonomy are not semantically equivalent to environments. However, there is a need to quickly create terms for the microbial (meta)genomics community based on annotation efforts in MicrobeDB.jp. Discussed options:
- Treat each 'coordinated' resource as a separate attribute of a (meta)genome record. Sensible RDF predicates may then be created between these along the goals of MicrobeDB. Thus, a record will have an EnvO annotation, a GAZ annotation, etc. and the MicrobeDB team may experiment with their relations based on their data.
- Suggest new terms to EnvO based on the annotation efforts and user requests. Immediately create MEO terms to allow annotation to proceed while new terms are being integrated. If the EnvO term creation is successful, replace the MEO ID with the ENVO ID.
- Alternatively, add new terms (and well-formed definitions) directly to EnvO, where they will be curated if needed, but have a stable ID. A subset of EnvO terms specific to MicrobeDB's users can be created and its composition maintained by the MicrobeDB team.
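The first two options above can be sketched together: each record keeps one annotation list per coordinated resource, and a provisional MEO term is swapped for its ENVO counterpart once EnvO accepts it. This is only an illustration under assumptions. The `Record` class, the `promote_terms` function, and every identifier in the example are hypothetical placeholders, not real MEO or ENVO IDs and not MicrobeDB.jp's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A (meta)genome record with one annotation list per coordinated resource."""
    record_id: str
    # Keys are resource names (e.g. "ENVO", "GAZ", "MEO"); values are term IDs.
    annotations: dict = field(default_factory=dict)


def promote_terms(record, accepted):
    """Replace provisional MEO IDs with the ENVO IDs that superseded them.

    `accepted` maps a provisional MEO ID to the ENVO ID created for it.
    MEO terms without an accepted ENVO counterpart are left untouched.
    """
    remaining = []
    for term in record.annotations.get("MEO", []):
        if term in accepted:
            record.annotations.setdefault("ENVO", []).append(accepted[term])
        else:
            remaining.append(term)
    record.annotations["MEO"] = remaining


# Placeholder identifiers, made up for this sketch.
rec = Record("sample-1", {"ENVO": ["ENVO:EX1"], "MEO": ["MEO:EX10", "MEO:EX11"]})
promote_terms(rec, {"MEO:EX10": "ENVO:EX2"})
print(rec.annotations)
```

Keeping the resources in separate fields means a replaced ID only touches the "MEO" and "ENVO" lists; annotations from GAZ or other resources are unaffected.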
Goal 2 / Goal 3
[PLB] Following discussions on the aims and proposed functioning of MEO, the following approach was suggested:
- Coordination: MEO can benefit from truly coordinating annotations from EnvO (one field each for biome, feature, material), GAZ, NCBI Taxonomy, etc. rather than attempting to integrate them in one ontology. A service with a coordinated set of descriptors (vs a 'super-ontology') allows the addition or removal of coordinated resources with fewer complications.
- Targeting users: Further, MEO will create and maintain subsets (when needed) of the resources it coordinates which are relevant to its users. These subsets can be presented to users to facilitate their use of these (sometimes very large) resources, bringing added value to MicrobeDB. If users need more terms that are not in the subset, they should be able to easily search and use existing terms in the full resources.
- Enriching community standards: When the need for new terms is clear, MicrobeDB can request/create such terms in the resources it coordinates. Thus, MicrobeDB, through MEO, acts as a valuable representative of the (meta)genomics community in the development of its community resources and an asset to the community projects.
- Development of enhanced search functionality: Users can enter a text query and terms found in a coordinated resource will be highlighted (disambiguation requested when needed). The term(s) found in each resource will then be used to progressively filter (meta)genome records and deliver a result set. Interactive browsing of resource subsets (defined by MEO) can also be supported.
Hence, MEO transitions into a resource coordination, delivery, and development application for the (meta)genomics community. This will accelerate the use and engagement with ontological and other semantically controlled resources by this community, while being easily maintainable and extensible.
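The enhanced search idea above (find terms per resource in a text query, then filter records progressively) might look roughly like the following. It is a minimal sketch, assuming naive substring matching on term labels and no disambiguation step; the function names, labels, and identifiers are all hypothetical.

```python
def match_terms(query, resources):
    """Find, per resource, the term IDs whose label occurs in the query text.

    `resources` maps a resource name to a {label: term_id} dictionary.
    """
    text = query.lower()
    hits = {}
    for name, labels in resources.items():
        found = [term_id for label, term_id in labels.items() if label in text]
        if found:
            hits[name] = found
    return hits


def filter_records(records, hits):
    """Keep records annotated with a matched term in every matched resource."""
    result = records
    for name, term_ids in hits.items():
        # Each resource narrows the result set further (progressive filtering).
        result = [r for r in result if set(r.get(name, [])) & set(term_ids)]
    return result


# Placeholder labels and identifiers, made up for this sketch.
resources = {
    "ENVO": {"hot spring": "ENVO:EX1", "soil": "ENVO:EX2"},
    "GAZ": {"japan": "GAZ:EX1"},
}
records = [
    {"id": "r1", "ENVO": ["ENVO:EX1"], "GAZ": ["GAZ:EX1"]},
    {"id": "r2", "ENVO": ["ENVO:EX1"], "GAZ": ["GAZ:EX2"]},
    {"id": "r3", "ENVO": ["ENVO:EX2"], "GAZ": ["GAZ:EX1"]},
]
hits = match_terms("hot spring metagenomes from Japan", resources)
selected = filter_records(records, hits)
print([r["id"] for r in selected])
```

Because each resource is filtered independently, adding or removing a coordinated resource only changes the `resources` dictionary, not the filtering logic.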
Standard Operating Procedure (Draft: Do not quote or cite) for MEO-like services
Date: 2012-09-07 Author(s): [PLB] Status: Draft; non-authoritative
A. Scope and Applicability
This procedure is designed to bridge the development and coordinated application of varied semantic resources ("resources") in the structured description, archiving, and retrieval of (meta)genomic data records stored by a data provider ("provider") for a specified user base ("users"). This procedure is applicable in scenarios where existing and active resources address the semantic domains needed by a data provider to address the needs of its users. This is an initial draft, is subject to unannounced change, and is in no way authoritative.
This procedure defines a working model wherein a provider coordinates a collection of resources for the purpose of annotation and semantic querying of its data collection. The provider will focus on a) the annotation and querying of data using multiple resources and b) engaging existing resource managers and requesting developments relevant to the provider's users. To promote efficiency, wide interoperability, and maintainability, the provider will only create new, custom resources when existing resources are unable to meet its users' needs.
C. Actors and Roles
Data providers: Providers host the data their users wish to access and develop the tools required to deliver custom subsets of this data to users on demand. Subset membership shall be informed by annotations of the data records performed by the provider or a trusted third-party. Annotations reference terms existing in the semantic resources coordinated by the provider and preserve their original identifiers and semantics. Providers shall collect and vet annotator and user requests prior to forwarding these to resource managers. Providers may develop and maintain local semantic relations between coordinated resources for their own purposes. These relations may be forwarded to resource managers at the provider's discretion.
Resource managers: Resource managers develop and make available semantic resources such as ontologies, taxonomies, and controlled vocabularies. If these are successfully engaged by a data provider, they will support a) the creation, definition, and integration of new terms, b) the creation and maintenance of community-specific subsets of their resource, and c) the development of effective two-way feedback mechanisms for use with the data provider.
Users: Users access data records from the provider through the provider's query tools. Users may submit requests for new terms or clarification requests to the provider or directly to the resource manager. Users with special interest in developing semantic annotation/querying may provide detailed feedback which should be supported by the data provider and, if appropriate, referred to the resource manager.
D. Procedural Steps (provider)
- Review data objects of interest and determine what information would best support anticipated user queries.
- Locate active semantic resources that correspond to the types of information identified in the previous step.
- Engage resource managers and inform them of user community's needs, requesting terms relevant to each resource's domain derived from the provider's annotation of data.
- Integrate responsive resources and create subsets/views with special relevance to the user community. This will allow a) annotators to efficiently use each distinct resource to annotate data objects and b) users to query annotated data with one or more coordinated resources. Further, the subsets created can be submitted to the parent resource, allowing them to better serve the provider's user community at large.
- Create persistent term request mechanisms for community-specific annotators and users to engage resource managers where possible. If feasible, the provider can act as a mediator, vetting/prioritising term requests based on their knowledge of community need.
- Construct and experiment with local inter-relations of resource content to maximise community-specific usability and semantic power.
- Periodically review available resources and add/remove resources based on their (continuing) relevance and use.
E. Further contextual descriptors
Searches involving quantitative values (e.g. "Give me all metagenomes from environments with a pH between 1 and 4") are likely to be better supported by referencing numerical contextual data associated with (meta)genome records. Contextual data specification checklists such as the Minimum Information about any (x) Sequence (MIxS), which are supported by major sequence archives, may be resources that providers can integrate to this end. Synonym lists corresponding to the fields of such checklists can be created based on observed user queries.
In situations where qualitative descriptors are more appropriate (e.g. "Give me all metagenomes from anoxic environments"), resources such as PATO are targets for integration.
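The quantitative case (the pH query quoted above) can be sketched as a range filter over numeric contextual data, with a synonym list mapping user wording to a checklist field. This is an illustration only: the field names, the synonym sets, and the record layout are assumptions and are not taken from the MIxS specification.

```python
# Hypothetical synonym lists: checklist field -> words observed in user queries.
FIELD_SYNONYMS = {
    "ph": {"ph", "acidity"},
    "temp": {"temp", "temperature"},
}


def canonical_field(user_word):
    """Map a word from a user query to a checklist field name, or None."""
    word = user_word.strip().lower()
    for field_name, synonyms in FIELD_SYNONYMS.items():
        if word in synonyms:
            return field_name
    return None


def in_range(records, user_field, low, high):
    """Return records whose value for the requested field lies in [low, high].

    Records that lack the field are skipped rather than treated as matches.
    """
    field_name = canonical_field(user_field)
    if field_name is None:
        raise KeyError(f"no checklist field known for {user_field!r}")
    return [r for r in records if field_name in r and low <= r[field_name] <= high]


# Made-up records for this sketch.
records = [
    {"id": "m1", "ph": 2.5},
    {"id": "m2", "ph": 7.0},
    {"id": "m3"},  # no pH reported
]
acidic = in_range(records, "pH", 1, 4)
print([r["id"] for r in acidic])
```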
F. Evaluation and Reporting
All reports and documentation shall be systematically archived and available to all actors. This information may, in itself, be an interesting object of study in the near future.
- Internal evaluation and reporting: providers will a) evaluate the coverage of annotations on hosted data using the resources they have integrated b) evaluate the usage patterns (anonymously) and gauge community-need c) monitor and develop metrics for the usage and logical soundness of any inter-resource semantic linkages created by the provider d) document annotator performance as well as mis-annotations which will form a best practice guide for future annotation.
- User-directed evaluation and reporting: providers will a) document (in approachable terms) which resources the user is accessing and how they represent the entities in their domain (the provider is encouraged to engage the resource manager for assistance) b) report typical misuses of query tools as they arise and demonstrate proper usage and c) evaluate and report which user-suggested terms are most likely to be integrated into a given resource.
- Resource-directed evaluation and reporting: providers will a) send periodic evaluations to resource managers summarising how well a given resource meets the needs of the provider's target community b) inform resource managers which resources are typically co-queried on their platform c) evaluate and report how responsive a given resource has been to the provider's community-specific requests, logging actions taken and their rationale.
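As one possible reading of "coverage of annotations" in the internal evaluation step, a provider could report, per coordinated resource, the fraction of hosted records carrying at least one term from that resource. The sketch below assumes records are dictionaries keyed by resource name; this layout and the function name are hypothetical.

```python
def annotation_coverage(records, resources):
    """Fraction of records with at least one annotation, per resource."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in resources}
    return {
        name: sum(1 for r in records if r.get(name)) / total
        for name in resources
    }


# Made-up records; an empty or missing list counts as "not annotated".
records = [
    {"ENVO": ["ENVO:EX1"], "GAZ": []},
    {"ENVO": ["ENVO:EX2"], "GAZ": ["GAZ:EX1"]},
    {},
    {"ENVO": ["ENVO:EX1"]},
]
coverage = annotation_coverage(records, ["ENVO", "GAZ"])
print(coverage)
```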
background: Interfaces between users and ontology consortia through applied ontology/service providers: How do we normalise communication and expectations? How do we translate this into a viable working model? Is it needed?
Select key users that are driven to give critical feedback and connect them with the ontology. Application Ontology providers can be a hub/moderator of this feedback. Tools may be developed to deliver ontology user/provider satisfaction with the quality and application of a given ontology to all parties. This can help identify issues earlier.
Goal 4: Semantic representation of incomplete enzyme reaction equations based on ontological principles
Incomplete enzyme reactions are not of interest to IUBMB (who manage EC numbers), but are common in metabolomics. It would be helpful to establish a structured representation to describe the available knowledge out of the reaction of interest even if the equation is not complete.
use case description: classify compounds, functional groups, and relations between enzymes.
Discussion: Rough design [Click Here]
- Functional Group = def. "A group of atoms bonded in a conserved chemical structure which has predictable chemical properties"
- Incomplete reaction equation can be represented like: Known compound A (Str1+FG1) + Unknown Compound C (Unknown StrX+FG3) => Known compound B (Str1+FG3) + Unknown…
- catalyst, substrate and product can be put in "role" (reaction role): catalyst is a role. substrate is a role. product is a role.
- Reversible reactions should be regarded as independent entries, connected to each other using "transformation_of" relation of the identity of the preserved structures: Compound B is transformation of Compound A in reaction 1. Compound A is transformation of Compound B in reaction 2.
- The "has_agent" relation might be helpful to describe the role of enzyme, but if we use it, we might need to create a child of reaction (or process?) "enzymatic catalysis" (which may need to consider the mechanism).
- RDF Container Elements might help describe reactions: A reaction could be represented as a container, in which enzyme E has role "catalyst" in the reaction; compound A has role "substrate" in the reaction; compound B has role "product" in the reaction; and compound B is transformation of compound A in the reaction.
- Instead of using "transformation_of" as a direct link of two compounds, it might be better to include this relation in the reaction definition: Enzyme catalysis (or reaction) Rxn X = def. is a reaction in which COMPOUNDID:compA has_role ROLEID:substrate, and COMPOUNDID:compB has_role ROLEID:product, and ENZYMEID:enzE has_role ROLEID:catalyst, and COMPOUNDID:compI has_role ROLEID:intermediate, etc.
- Better check GO Biological Process, and see how it models processes and references their participants.
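To make the rough design above more concrete, here is a minimal data-structure sketch in which a reaction acts as a container of role assignments and a participant may have an unknown structure. It is not an ontology or an RDF serialisation, just an illustration; the class names, method names, and the choice to store "transformation_of" as pairs inside the reaction are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional

ROLES = {"substrate", "product", "catalyst", "intermediate"}


@dataclass(frozen=True)
class Participant:
    name: str
    structure: Optional[str] = None          # None means "not yet known"
    functional_group: Optional[str] = None


@dataclass
class Reaction:
    """A reaction as a container of (participant, role) assignments."""
    reaction_id: str
    roles: list = field(default_factory=list)
    # (product, substrate) pairs: product is a transformation of substrate.
    transformations: list = field(default_factory=list)

    def add(self, participant, role):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles.append((participant, role))

    def with_role(self, role):
        return [p.name for p, r in self.roles if r == role]

    def unknown_compounds(self):
        """Names of non-catalyst participants whose structure is unknown."""
        return [p.name for p, r in self.roles
                if r != "catalyst" and p.structure is None]


# Known compound A (Str1+FG1) + unknown compound C (unknown structure + FG3)
# => known compound B (Str1+FG3), catalysed by enzyme E.
comp_a = Participant("compound A", "Str1", "FG1")
comp_c = Participant("compound C", None, "FG3")
comp_b = Participant("compound B", "Str1", "FG3")
enzyme = Participant("enzyme E")

rxn = Reaction("reaction 1")
rxn.add(comp_a, "substrate")
rxn.add(comp_c, "substrate")
rxn.add(comp_b, "product")
rxn.add(enzyme, "catalyst")
rxn.transformations.append((comp_b, comp_a))

print(rxn.with_role("substrate"), rxn.unknown_compounds())
```

Under this layout the reverse reaction would be a second `Reaction` instance with the roles of compound A and compound B exchanged, in line with treating reversible reactions as independent entries.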
Choice of Ontology file format: OBO or OWL?
There are two mainstream file formats for describing ontologies: OBO and OWL.
Brief description extracted and edited from above link.
The OBO flat file format is an ontology representation language. The concepts it models represent a subset of the concepts in the OWL description logic language, with several extensions for meta-data modeling and the modeling of concepts that are not supported in DL languages. The format itself attempts to achieve the following goals:
- Human readability
- Ease of parsing
- Minimal redundancy (limit overlap of ontologies in related fields)
Most ontologies that use the OBO format are in the biosciences.
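As a rough illustration of the "ease of parsing" goal, the sketch below reads `[Term]` stanzas made of `tag: value` lines and runs one simple check (an `is_a` parent that is not defined in the file). It is a deliberately naive reader written for this example, not ONTO-perl and not a complete OBO parser: it ignores escaping, trailing modifiers, and non-Term stanzas, and the term IDs in the sample are hypothetical.

```python
def parse_obo_terms(text):
    """Collect [Term] stanzas as dictionaries mapping tag -> list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        # Naive comment handling: drop everything after the first "!".
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("["):
            # Only [Term] stanzas are kept; [Typedef] etc. are skipped.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if current is not None and ":" in line:
            tag, value = line.split(":", 1)
            current.setdefault(tag.strip(), []).append(value.strip())
    return terms


def dangling_parents(terms):
    """Return is_a targets that have no [Term] stanza of their own."""
    known = {t["id"][0] for t in terms if "id" in t}
    parents = {p for t in terms for p in t.get("is_a", [])}
    return sorted(parents - known)


SAMPLE = """\
format-version: 1.2

[Term]
id: MEO:EX1
name: example habitat
is_a: MEO:EX0 ! example parent
is_a: MEO:EX9 ! parent missing from this file

[Term]
id: MEO:EX0
name: example parent

[Typedef]
id: part_of
"""

terms = parse_obo_terms(SAMPLE)
print(len(terms), dangling_parents(terms))
```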
Brief description extracted and edited from above link.
The OWL 2 Web Ontology Language, informally OWL 2, is an ontology language for the Semantic Web with formally defined meaning. OWL 2 ontologies provide classes, properties, individuals, and data values and are stored as Semantic Web documents. OWL 2 ontologies can be used along with information written in RDF, and OWL 2 ontologies themselves are primarily exchanged as RDF documents.
What are the differences between OBO and OWL formats?
- The semantic equivalent of a construct in one format is sometimes missing from the other format.
- The lack of globally unique identifiers (GUIDs)
- Lacks adequate query and rule language support
- Hard to read for humans
- High storage costs
Set up PURL for MEO
- Ontology Development 101: A Guide to Creating Your First Ontology
- Protégé OWL Tutorial
Interested parties (Alphabetical Order)
- Erick Antezana [EA]
- Hiroshi Mori [HM]
- Masaaki Kotera [MK]
- Shuichi Kawashima [SK]
- Shinobu Okamoto [SO]
- Pier Luigi Buttigieg [PLB]
- Wataru Iwasaki [WI]