How to tell if a prefix is for an ontology, database, etc.? #926

Open
webermn opened this issue Aug 17, 2023 · 3 comments


webermn commented Aug 17, 2023

Noob question with probably a simple answer staring me in the face, but how can I determine which registry entries are formal ontologies vs. controlled vocabularies vs. databases/repositories vs. [other]?

I have been looking for a 'type' attribute (or equivalent) for this, but my primary workaround has been to use the search button to narrow the list by the terms above. Being able to filter on 'type' would be helpful for me — if not in the GUI, then as an attribute that could be parsed from the JSON/YAML/TSV export.

Member

cthoyt commented Aug 18, 2023

Hi @webermn, thanks for the comment.

The primary goal of the Bioregistry is to index semantic spaces. Often, this corresponds 1-to-1 with ontologies or databases, but this isn't always the case. A good counterexample is the Uber Anatomy Ontology (UBERON), which has both the eponymous semantic space for anatomical entities (uberon) and also an additional one for properties (ubprop). Similarly, you can see that the KEGG database results in many different semantic spaces (https://bioregistry.io/kegg.compound, https://bioregistry.io/kegg.disease, etc.).

In many ways, the difference between what's an "ontology" and what's a "database" is merely the curation philosophy of the maintainers and the kind of export format they use. For example, people typically consider HGNC as a database, but it can also be easily dumped to an ontology. You can find many examples of "databases" that are dumped as "ontologies" here: https://github.com/biopragmatics/obo-db-ingest. See further discussion about databases to ontologies at https://docs.google.com/presentation/d/1aySEHTgkags7UPJYHyvQ9frYvAIqr1G5A3u7dGF26Y4 and OBOFoundry/OBOFoundry.github.io#1981.

Where available, the Bioregistry keeps track of links to ontology artifacts, so checking for these links could be a simple solution. In https://bioregistry.io/api/registry/uberon?format=json, you can find three links to ontology artifacts:

[Screenshot: JSON response for uberon showing the ontology artifact links]

Similarly, the Python API allows for getting these:

>>> import bioregistry as br
>>> br.get_obo_download("uberon")
'http://purl.obolibrary.org/obo/uberon.obo'
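As a rough sketch of how this could be used to separate prefixes that have ontology artifacts from those that do not, the helper below takes a set of lookup functions and reports which ones return a link. The helper names (`ontology_artifacts`, `split_by_artifacts`) and the stub lookup are hypothetical and only for illustration. With the real package, the lookups could be `br.get_obo_download` and its siblings, assuming they return `None` when no link is recorded.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A lookup takes a prefix and returns a download link, or None if there is none.
Lookup = Callable[[str], Optional[str]]


def ontology_artifacts(prefix: str, lookups: Dict[str, Lookup]) -> Dict[str, str]:
    """Collect the artifact links that the given lookups report for a prefix."""
    found = {}
    for label, lookup in lookups.items():
        url = lookup(prefix)
        if url:
            found[label] = url
    return found


def split_by_artifacts(
    prefixes: Iterable[str], lookups: Dict[str, Lookup]
) -> Tuple[List[str], List[str]]:
    """Split prefixes into (has at least one artifact link, has none)."""
    with_links, without_links = [], []
    for prefix in prefixes:
        target = with_links if ontology_artifacts(prefix, lookups) else without_links
        target.append(prefix)
    return with_links, without_links


# Stub lookup standing in for br.get_obo_download; it only knows the one
# link quoted above. "example.db" is a made-up prefix, not a registry entry.
def stub_obo_lookup(prefix: str) -> Optional[str]:
    known = {"uberon": "http://purl.obolibrary.org/obo/uberon.obo"}
    return known.get(prefix)


with_links, without_links = split_by_artifacts(
    ["uberon", "example.db"], {"obo": stub_obo_lookup}
)
```

Note that a missing artifact link does not prove a prefix is "only a database"; it just means no ontology export is recorded for it.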

From what you wrote, I think you just want a way of filtering the registry page from the GUI. If you can better describe why you want to do this, we might be able to work towards a solution.

cthoyt changed the title from "Type attribute for Registry records?" to "How to tell if a prefix is for an ontology, database, etc.?" on Aug 18, 2023
Author

webermn commented Aug 20, 2023

Thank you, @cthoyt, for the helpful response. The examples you provided on "dumping" a database to an ontology, as well as Chris's slides and some of your blog posts I came across, were great context.

I think what I am ultimately trying to do/find might not be the purpose of the Bioregistry resource. It may be the purpose of other resources, and I'm just not aware. I'll try to explain:

  • To help support NIH's data management and sharing (DMS) policy, NIH's DMS site provides a list of NIH-supported repositories with links for each entry for how/where to share data as well as access data for reuse.

  • In addition to these NIH-supported repositories, there are references to using other community repositories, such as those listed in re3data, FAIRsharing, and DataCite Commons. However, the responsibility for selection is ultimately that of the PI — and my sense is that a significant number of PIs may not have a full appreciation of things like CURIEs or even how to plan for using a specific ontology or ensure URIs in their required data management and sharing plans (DMSP).

  • In my mind, resources like Bioregistry can be quite helpful to PIs, their teams, and their support staff (e.g., academic librarians) in creating DMSPs, since it's written in plain language, has good examples, has a publicly accessible API, enables download in different formats, includes contact information for points of contact, appears to have active support (like this open issue!), and links out to a lot of additional useful content.

  • However, from my perspective, navigation of the (current) 1701 registry entries from the GUI was somewhat difficult, since it's paginated at 10 per page, only displays 3 fields, and the download options are only JSON and YAML on the Registry page, which are less accessible to the average person (though I did find a TSV download on the Downloads page).

  • I think that with some additional navigation tools, such as the examples below, the Bioregistry might be even more helpful.

    • Whether the registry entry is for a repository/database, an ontology, or both (as a displayed field to filter on, hence the original reference to a 'type' attribute?)
    • Exposure of keywords that are on the individual pages
    • Additional pagination options (for 50, 100, etc.)
    • Binary selection options for things like 'hasExample', 'isDeprecated', 'hasExternalPrefix', etc.
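Until something like this exists in the GUI, the yes/no filters and larger pages listed above could be approximated on an exported copy of the registry. This is a minimal sketch, assuming the export has been loaded into a dictionary keyed by prefix and that records carry `example` and `deprecated` fields. Those field names, the function names, and the sample records are assumptions for illustration, not confirmed details of the export.

```python
from typing import Dict, Iterator, List, Optional


def filter_records(
    records: Dict[str, dict],
    *,
    has_example: Optional[bool] = None,
    is_deprecated: Optional[bool] = None,
) -> List[str]:
    """Return prefixes matching the yes/no filters; a filter left as None is ignored."""
    matches = []
    for prefix, record in records.items():
        if has_example is not None and bool(record.get("example")) != has_example:
            continue
        if is_deprecated is not None and bool(record.get("deprecated")) != is_deprecated:
            continue
        matches.append(prefix)
    return matches


def paginate(items: List[str], page_size: int) -> Iterator[List[str]]:
    """Yield consecutive pages of at most page_size items."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]


# Made-up records for illustration only, not real registry data.
sample = {
    "aaa": {"example": "0001"},
    "bbb": {"deprecated": True},
    "ccc": {"example": "X1", "deprecated": True},
}

active_with_example = filter_records(sample, has_example=True, is_deprecated=False)
pages = list(paginate(sorted(sample), page_size=2))
```

The same pattern would extend to a 'hasExternalPrefix' style filter by adding another optional keyword argument.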

Again, this may not really be the purpose of the Bioregistry and could instead be something other resources are attempting to achieve. That said, my logic is that the simpler it is for the average person to navigate resources like the Bioregistry and see how this might help inform things like DMSPs, the more people will become comfortable with these important curation concepts and engage with this community — which I ultimately think will better enable biomedical research.

(Please feel free to close this issue as a "Won't Do"!)

Member

cthoyt commented Aug 21, 2023

@webermn wow, that's really helpful, thanks for writing such a detailed response.

As you mentioned, there are lots of repositories that cover various resources with different metadata standards related to their different scopes. Bioregistry actually already imports and aligns with re3data, FAIRsharing, and several others (listed here). These two specific examples both have different focuses than Bioregistry, but are nonetheless valuable for aligning. I will look into whether this can be done with DataCite as well (again, even though the goal of that resource is different from Bioregistry's).

Overall, I think using Bioregistry as a resource to help write DMPs is a great idea. I would love to see people taking a principled approach to what kinds of PIDs they use and especially making sure they're written/stored/communicated in a standard way.

Another registry that has a similar focus to Bioregistry (and is also itself standardized then incorporated in the Bioregistry) is https://registry.bio2kg.org/. They have a really nice navigation/faceted search that I would like to replicate that would address some of your concerns.

Additional context: we are likely going to get some dedicated funding for the Bioregistry soon, and addressing your use cases would be a great use of this opportunity. Do you think you would be able to meet next week or in the near future to discuss further? Feel free to send an email to cthoyt@gmail.com (I'm on European time)
