Resolve ncbitaxon on primary provider website #1044

Open
bgyori opened this issue Feb 10, 2024 · 2 comments

@bgyori
Contributor

bgyori commented Feb 10, 2024

In most applications, it would be useful to resolve ncbitaxon IDs to the NCBI website, e.g., https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606, since NCBI is the primary provider of these IDs. Currently, https://bioregistry.io/ncbitaxon:9606 first resolves to http://purl.obolibrary.org/obo/NCBITaxon_9606 and then to https://ontobee.org/ontology/NCBITaxon?iri=http://purl.obolibrary.org/obo/NCBITaxon_9606, a third-party provider. I suspect that the choice of the PURL here is motivated by URI-based identification rather than web-based resolution concerns. Still, could we make the NCBI website the default resolver?
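To make the difference between the two targets concrete, here is a minimal sketch that expands the same CURIE with each URL pattern. It assumes the `$1` placeholder convention used by the format strings quoted elsewhere in this thread; the constant and function names are hypothetical, and this is not how Bioregistry itself is implemented.

```python
# Hypothetical templates; "$1" stands for the local identifier.
OBO_PURL_TEMPLATE = "http://purl.obolibrary.org/obo/NCBITaxon_$1"
NCBI_TEMPLATE = (
    "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=$1"
)


def expand(curie: str, template: str, prefix: str = "ncbitaxon") -> str:
    """Expand a CURIE such as ``ncbitaxon:9606`` with a ``$1``-style template."""
    curie_prefix, _, identifier = curie.partition(":")
    if curie_prefix.lower() != prefix or not identifier:
        raise ValueError(f"not a {prefix} CURIE: {curie!r}")
    return template.replace("$1", identifier)


# What the resolver redirects to first today (the OBO PURL) ...
current = expand("ncbitaxon:9606", OBO_PURL_TEMPLATE)
# ... versus the primary provider page requested in this issue.
requested = expand("ncbitaxon:9606", NCBI_TEMPLATE)
print(current)
print(requested)
```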

cthoyt added a commit that referenced this issue Apr 16, 2024
@cthoyt
Member

cthoyt commented Apr 16, 2024

This has come up many times and I have just spent some more time thinking about this. There are a few possible solutions:

  1. Simply change the uri_format in the NCBI Taxonomy record. This will get the job done, but has the drawback that the default exported Bioregistry prefix map will then have a non-OBO PURL in it. In the past, OBO PURLs not showing up everywhere has been a point of friction for adoption by the OBO community, and changing this would probably erode trust.
  2. Update the configuration of the OBO PURL system. That's external to Bioregistry, and I'm not sure what the consequences would be.
  3. Hack a field for URL resolution, similar to uri_format, into the Bioregistry data model that is only considered during resolution. This might also motivate having a dichotomy between functions for getting IRIs and functions for getting URLs, each baking in some assumptions about what qualities the results have. This will make it harder for both curators and maintainers to understand the data model and to decide where this value should be considered.
  4. More carefully extend the provider data model to incorporate annotations on whether a URI format string is meant for RDF, for resolution, or both. This will be quite a bit of effort, as it appears 803/1,768 (45.4%) records in the Bioregistry have explicit URI format string annotations.
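As a rough illustration of option 4, the sketch below annotates each URI format string with whether it is meant for RDF, for resolution, or both, and selects one by purpose. Everything here (`AnnotatedFormat`, `for_rdf`, `for_resolution`, `select_format`) is hypothetical and not part of the current Bioregistry data model; it only shows the shape such an extension might take.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AnnotatedFormat:
    """Hypothetical: a URI format string plus what it is meant for."""

    uri_format: str
    for_rdf: bool = True
    for_resolution: bool = True


# Assumed annotations for NCBI Taxonomy, for illustration only.
NCBITAXON_FORMATS = [
    AnnotatedFormat(
        "http://purl.obolibrary.org/obo/NCBITaxon_$1",
        for_rdf=True,
        for_resolution=False,
    ),
    AnnotatedFormat(
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=$1",
        for_rdf=False,
        for_resolution=True,
    ),
]


def select_format(
    formats: Sequence[AnnotatedFormat], *, purpose: str
) -> Optional[str]:
    """Return the first format string annotated for the given purpose."""
    if purpose not in {"rdf", "resolution"}:
        raise ValueError(f"unknown purpose: {purpose!r}")
    for fmt in formats:
        if purpose == "rdf" and fmt.for_rdf:
            return fmt.uri_format
        if purpose == "resolution" and fmt.for_resolution:
            return fmt.uri_format
    return None


rdf_format = select_format(NCBITAXON_FORMATS, purpose="rdf")
web_format = select_format(NCBITAXON_FORMATS, purpose="resolution")
```

With this shape, the exported prefix map would keep using the OBO PURL while the resolver could redirect to NCBI, at the cost of curating the annotations on every record that has an explicit format string.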

Code that counts the number of URI format string annotations:

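import bioregistry

# Fetch the records once and reuse them for both numbers
resources = bioregistry.resources()
total = len(resources)
# Count records whose uri_format is explicitly set
count = sum(r.uri_format is not None for r in resources)
print(f"There are {count}/{total} ({count/total:.1%}) records with explicit URI format strings")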

@matentzn
Collaborator

If NCBI could weigh in, we could probably change the resolver of the OBO PURL to the NCBI resource. It's a bit awkward, as some people might expect information about the ontology when looking this up, but it's probably OK.

What is the concern with doing the same as was done for NCIT?

"uri_format": "https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=$1",
"uri_format_rdf": "http://purl.obolibrary.org/obo/NCIT_$1"

Is it that tooling (curies) does not respect the uri_format_rdf slot?
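For illustration, a consumer that does respect both slots might look like the sketch below: it prefers `uri_format_rdf` when building an RDF prefix map and falls back to `uri_format` otherwise. The record is the NCIT fragment quoted above; the `uri_prefix` helper is hypothetical, and whether curies or any other tool behaves this way is exactly the open question here.

```python
from typing import Mapping, Optional

PLACEHOLDER = "$1"

# The NCIT fragment quoted above, as a plain dictionary.
NCIT_RECORD = {
    "uri_format": "https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=$1",
    "uri_format_rdf": "http://purl.obolibrary.org/obo/NCIT_$1",
}


def uri_prefix(record: Mapping[str, str], *, rdf: bool) -> Optional[str]:
    """Return the URI prefix for a record (hypothetical helper).

    With ``rdf=True`` the ``uri_format_rdf`` slot wins when present;
    otherwise, or when it is missing, ``uri_format`` is used.
    """
    fmt = record.get("uri_format_rdf") if rdf else None
    if fmt is None:
        fmt = record.get("uri_format")
    # Only formats that end in the placeholder can become a simple prefix.
    if fmt is None or not fmt.endswith(PLACEHOLDER):
        return None
    return fmt[: -len(PLACEHOLDER)]


rdf_prefix = uri_prefix(NCIT_RECORD, rdf=True)
web_prefix = uri_prefix(NCIT_RECORD, rdf=False)
```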
