Use preferred bioregistry prefixes for normalized entity identifiers #3

Closed
dhimmel opened this issue Jan 14, 2022 · 7 comments

@dhimmel

dhimmel commented Jan 14, 2022

Great to see that BERN2 normalizes entities to compact identifiers in resource:identifier format. I noticed that there is an opportunity to standardize the prefixes used with Bioregistry:

FYI I didn't check all the entity types BERN2 is capable of tagging for whether they use the preferred prefix.

@cthoyt might also be helpful here.

@cthoyt
Contributor

cthoyt commented Jan 15, 2022

I’d be happy to help. I’d like to try running BERN2 locally myself; I’m sure that would make it easier to evaluate whether the results are useful.

@mjeensung
Contributor

mjeensung commented Jan 18, 2022

Hi @dhimmel

Thank you for your suggestions for improving BERN2.

Do you mean that it is more standardized to use NCBIGene:10533 (for gene/protein) and NCBITaxon:10095 (for species) instead of EntrezGene:10533 and NCBI:txid10095?

@dhimmel
Author

dhimmel commented Jan 19, 2022

> Do you mean that it is more standardized to use NCBIGene:10533 (for gene/protein) and NCBITaxon:10095 (for species) instead of EntrezGene:10533 and NCBI:txid10095?

Exactly. I see a benefit in representing all tagged entities as Bioregistry-supported CURIEs, so that integration with other datasets is as straightforward as possible.

Additional notes:

  • I was surprised that Bioregistry doesn't list EntrezGene as an alternative prefix for NCBIGene.
  • Looking at the NCBI page here, I see "for references in articles please use NCBI:txid10095". @cthoyt how do you reconcile the Bioregistry format being incompatible with the source-recommended format?

@cthoyt
Contributor

cthoyt commented Jan 19, 2022

I just added in EntrezGene to https://bioregistry.io/registry/ncbigene, but I don't really see a standardized way to reconcile things that look like NCBI:txid10095 since it doesn't follow the spirit of CURIEs. You could always do some text-based preprocessing if you get stuff like this.
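
The text-based preprocessing mentioned above could look roughly like the following. This is a minimal sketch, not BERN2 or Bioregistry code: the function name `normalize_curie` and the mapping table are hypothetical, and the table only covers the two cases named in this thread (EntrezGene to NCBIGene, and the NCBI:txid form to NCBITaxon).

```python
import re

# Hypothetical mapping from prefixes a tagger might emit (lowercased) to the
# Bioregistry preferred prefixes named in this thread. Not an exhaustive list.
PREFERRED_PREFIXES = {
    "entrezgene": "NCBIGene",
    "ncbigene": "NCBIGene",
    "ncbitaxon": "NCBITaxon",
}

# NCBI's article-style taxonomy reference, e.g. "NCBI:txid10095".
TXID_PATTERN = re.compile(r"^NCBI:txid(\d+)$", re.IGNORECASE)


def normalize_curie(raw: str) -> str:
    """Rewrite a known identifier form to its preferred CURIE.

    Identifiers with unknown prefixes, or without a colon, are returned
    unchanged rather than guessed at.
    """
    raw = raw.strip()

    # Special case: "NCBI:txid10095" is not a plain prefix:identifier pair,
    # so it needs a dedicated pattern.
    match = TXID_PATTERN.match(raw)
    if match:
        return f"NCBITaxon:{match.group(1)}"

    prefix, sep, identifier = raw.partition(":")
    if not sep or not identifier:
        return raw

    preferred = PREFERRED_PREFIXES.get(prefix.lower())
    if preferred is None:
        return raw
    return f"{preferred}:{identifier}"
```

For example, `normalize_curie("EntrezGene:10533")` gives `NCBIGene:10533`, and `normalize_curie("NCBI:txid10095")` gives `NCBITaxon:10095`. Anything not in the table passes through untouched, which keeps the step safe to run over mixed output.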

@mjeensung
Contributor

@dhimmel @cthoyt

Thank you for your suggestions!
We will consider replacing the current prefixes with the prefixes in Bioregistry.

@mjeensung
Contributor

Hi @cthoyt,

I was checking Bioregistry and noticed something that I'd like to clarify. While Entrez Gene IDs have the preferred prefix NCBIGene, MESH IDs have none and instead use the prefix mesh in their CURIEs. My first question is why some CURIEs use a preferred prefix while others use the plain prefix. My second question is whether it is common for CURIEs to use lowercase prefixes such as mesh:C063233 rather than uppercase ones such as MESH:C063233, which is what BERN2 uses.

cthoyt added a commit to cthoyt/BERN2 that referenced this issue Jan 24, 2022
Related to the discussion in dmis-lab#3, the Bioregistry has the logic for generating URLs given CURIEs
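
To illustrate what generating a URL from a CURIE can mean here, the sketch below builds a link through the public Bioregistry resolver. It is not the code from the referenced commit: the helper name `curie_to_url` is hypothetical, and it assumes the resolver accepts paths of the form `https://bioregistry.io/<prefix>:<identifier>` with a lowercased prefix. No network request is made.

```python
from urllib.parse import quote

# Assumed base URL of the public Bioregistry resolver.
BIOREGISTRY_RESOLVER = "https://bioregistry.io/"


def curie_to_url(curie: str) -> str:
    """Build a resolver URL for a CURIE of the form prefix:identifier.

    Raises ValueError if the input has no prefix or no identifier.
    """
    prefix, sep, identifier = curie.strip().partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    # Lowercase the prefix and percent-encode the identifier, keeping ":"
    # so identifiers that themselves contain a colon survive intact.
    return f"{BIOREGISTRY_RESOLVER}{prefix.lower()}:{quote(identifier, safe=':')}"
```

With this sketch, `curie_to_url("NCBIGene:10533")` yields `https://bioregistry.io/ncbigene:10533`.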
mjeensung reopened this Jan 25, 2022
@mjeensung
Contributor

bbad178
