Resolve ncbitaxon on primary provider website #1044

bgyori · 2024-02-10T15:04:00Z

In most applications it would be useful to resolve ncbitaxon IDs to the NCBI website, e.g., https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606, since NCBI is the primary provider of these IDs. Currently, https://bioregistry.io/ncbitaxon:9606 first resolves to http://purl.obolibrary.org/obo/NCBITaxon_9606 and then to https://ontobee.org/ontology/NCBITaxon?iri=http://purl.obolibrary.org/obo/NCBITaxon_9606, a third-party provider. I suspect that the choice of the PURL here is motivated by URI-based identification rather than by web-based resolution. Still, could we make the NCBI website the default resolver?
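To make the two targets concrete, here is a minimal sketch of how the same CURIE expands under each format string. The `expand` helper and the constant names are hypothetical and are not part of the Bioregistry API; only the two URL patterns come from the links above.

```python
# Format strings taken from the URLs in this issue; "$1" marks where the
# local identifier goes. The constant names are made up for illustration.
NCBI_WEB_FORMAT = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=$1"
OBO_PURL_FORMAT = "http://purl.obolibrary.org/obo/NCBITaxon_$1"


def expand(curie: str, uri_format: str) -> str:
    """Fill the local identifier of an ncbitaxon CURIE into a ``$1`` format string.

    Hypothetical helper, not a Bioregistry function.
    """
    prefix, identifier = curie.split(":", 1)
    if prefix != "ncbitaxon":
        raise ValueError(f"unexpected prefix: {prefix}")
    return uri_format.replace("$1", identifier)


# What this issue asks for as the default resolution target
print(expand("ncbitaxon:9606", NCBI_WEB_FORMAT))
# What the record currently yields (the OBO PURL)
print(expand("ncbitaxon:9606", OBO_PURL_FORMAT))
```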

Closes #1044

cthoyt · 2024-04-16T11:39:35Z

This has come up many times, and I have just spent some more time thinking about it. There are a few possible solutions:

1. Simply change the uri_format in the NCBI Taxonomy record. This will get the job done, but has the drawback that the default exported Bioregistry prefix map will then have a non-OBO PURL in it. In the past, not having OBO PURLs show up in all places has been a point of friction for adoption by the OBO community, and changing this would probably erode trust.
2. Update the configuration of the OBO PURL system. That's external to Bioregistry, and I'm not sure what the consequences would be.
3. Hack a field for URL resolution, similar to uri_format, into the Bioregistry data model that is only considered during resolution. This might also motivate having a dichotomy between functions for getting IRIs and functions for getting URLs, each baking in some assumptions about what qualities the results have. This will make the data model harder for both curators and maintainers to understand, and they will have to decide where this value should be considered.
4. More carefully extend the provider data model to incorporate annotations on whether a URI format string is meant for RDF, for resolution, or both. This will be quite a bit of effort, as it appears 803/1,768 (45.4%) records in the Bioregistry have explicit URI format string annotations.

Code that counts the number of URI format string annotations:

import bioregistry

total = len(bioregistry.resources())
count = sum(r.uri_format is not None for r in bioregistry.resources())
print(f"There are {count}/{total} ({count/total:.1%}) records with explicit URI format strings")
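For discussion, the last option could look roughly like the sketch below. Everything in it (`Purpose`, `AnnotatedFormat`, `pick_format`) is hypothetical and does not exist in the Bioregistry data model today; it only illustrates annotating each format string with the purpose it serves.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional


class Purpose(Flag):
    """What a URI format string is meant for (hypothetical annotation)."""

    RDF = auto()
    RESOLUTION = auto()


@dataclass(frozen=True)
class AnnotatedFormat:
    """A URI format string plus the purposes it is valid for (hypothetical)."""

    uri_format: str
    purpose: Purpose


def pick_format(formats: list[AnnotatedFormat], purpose: Purpose) -> Optional[str]:
    """Return the first format string annotated for the requested purpose."""
    for fmt in formats:
        if purpose in fmt.purpose:
            return fmt.uri_format
    return None


# Illustrative annotations for ncbitaxon, using the URLs from this issue
ncbitaxon_formats = [
    AnnotatedFormat("http://purl.obolibrary.org/obo/NCBITaxon_$1", Purpose.RDF),
    AnnotatedFormat(
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=$1",
        Purpose.RESOLUTION,
    ),
]

print(pick_format(ncbitaxon_formats, Purpose.RDF))
print(pick_format(ncbitaxon_formats, Purpose.RESOLUTION))
```

A format string that serves both purposes would be annotated with `Purpose.RDF | Purpose.RESOLUTION`, so records that only need one URI would not have to list it twice.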

matentzn · 2024-04-16T12:29:24Z

If NCBI could weigh in, we could probably change the resolver of the OBO PURL to the NCBI resource. It's a bit awkward, as some people might expect information about the ontology when looking this up, but it's probably OK.

What is the concern with doing the same as is done for NCIT?

"uri_format": "https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=$1",
"uri_format_rdf": "http://purl.obolibrary.org/obo/NCIT_$1"

Is it that tooling (curies) does not respect the uri_format_rdf slot?
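To spell out what I mean, here is a minimal sketch of tooling that respects both slots. The two keys are copied from the NCIT record above; the `get_format` helper, its fallback behavior, and the identifier `C12345` are assumptions for illustration, not how Bioregistry or curies necessarily behave.

```python
# The two keys are copied from the NCIT record quoted above.
ncit_record = {
    "uri_format": "https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=$1",
    "uri_format_rdf": "http://purl.obolibrary.org/obo/NCIT_$1",
}


def get_format(record: dict, *, rdf: bool) -> str:
    """Prefer ``uri_format_rdf`` for RDF use, otherwise use ``uri_format``.

    Hypothetical helper; the fallback to ``uri_format`` is an assumption.
    """
    if rdf and "uri_format_rdf" in record:
        return record["uri_format_rdf"]
    return record["uri_format"]


# "C12345" is a made-up local identifier used only to show the expansion.
print(get_format(ncit_record, rdf=False).replace("$1", "C12345"))  # web page
print(get_format(ncit_record, rdf=True).replace("$1", "C12345"))  # OBO PURL
```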

cthoyt added a commit that referenced this issue Apr 16, 2024

Update default providers

b55a363

Closes #1044

This was referenced Apr 29, 2024

Consider a non-isomorphic mappings to NCBI Taxon database obophenotype/ncbitaxon#55

Open

A general solution to the databases as ontologies problem in bioregistry #1104

Open
