Records for languages like SMILES, HGVS, and SPDI #460

Open
sierra-moxon opened this issue Jul 19, 2022 · 4 comments

@sierra-moxon
Contributor

sierra-moxon commented Jul 19, 2022

Background:

SPDI (Sequence Position Deletion Insertion) and HGVS (Human Genome Variation Society) nomenclature are two standards that, when used correctly, can uniquely identify sequence variants. Both provide a shorthand notation for capturing the genome, assembly, position, and sequence change of a sequence variant. In this way, they are a kind of identifier.

Motivation for Prefixes:

We have a group of users that would like to identify a prefix for either (or both of):

Use Cases:

Continuing from the biolink-model issue: biolink/biolink-model#1042

  1. Biolink Model cares about identifier prefixes and in particular, uses them to document data sources that provide each class via the "id_prefixes" construct: https://linkml.io/linkml-model/docs/id_prefixes/
  2. Software built to find identifier equivalences between data sources for the NCATS Data Translator project (the Node Normalizer) uses Biolink Model prefix lists to:
    • return the preferred CURIE for this entity
    • return all other known equivalent identifiers for the entity
    • return semantic types for the entity as defined by the Biolink Model

It would be helpful to reuse the existing architecture above to place HGVS and SPDI "identifiers" in their appropriate Biolink Model classes and normalize them alongside other sequence variant identifiers from disparate data sets across NCATS Data Translator.
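To make the normalization use case concrete, here is a minimal in-memory sketch of the three outputs listed above (preferred CURIE, equivalent identifiers, semantic types). It is not the Node Normalizer's actual code or API: the `ID_PREFIXES` ordering, the `NormalizedNode` type, and the `normalize` function are all hypothetical, and it assumes `SPDI` and `HGVS` were registered as prefixes.

```python
from dataclasses import dataclass

# Hypothetical prefix priority for one Biolink class. The real ordering comes
# from the class's "id_prefixes" in the Biolink Model and may differ.
ID_PREFIXES = {"biolink:SequenceVariant": ["SPDI", "HGVS", "RGD"]}


@dataclass
class NormalizedNode:
    preferred: str          # preferred CURIE for the entity
    equivalents: list[str]  # all other known equivalent identifiers
    types: list[str]        # semantic types as Biolink classes


def prefix_of(curie: str) -> str:
    # SPDI and HGVS strings contain colons themselves, so split only once.
    return curie.split(":", 1)[0]


def normalize(clique: list[str], biolink_class: str) -> NormalizedNode:
    """Pick the preferred CURIE from a set of identifiers assumed equivalent."""
    order = ID_PREFIXES[biolink_class]
    rank = {prefix: i for i, prefix in enumerate(order)}
    # Unknown prefixes sort last; sorted() is stable, so ties keep input order.
    ranked = sorted(clique, key=lambda c: rank.get(prefix_of(c), len(order)))
    return NormalizedNode(ranked[0], ranked[1:], [biolink_class])


# Illustrative only: the same substitution written with both nomenclatures.
node = normalize(
    ["HGVS:NG_012345.1:g.5G>T", "SPDI:NG_012345.1:4:G:T"],
    "biolink:SequenceVariant",
)
print(node.preferred)
```

The single-split in `prefix_of` matters here: a naive split on every colon would break any prefix whose local identifiers are themselves colon-delimited, which is the case for SPDI.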

Challenges:

  1. There is no service or site that resolves these identifiers to an informational page about the sequence variant. However, SPDI does have an API that returns a JSON data structure reflecting the content of the SPDI nomenclature for a particular variant.
  2. Other groups also mint identifiers for the variant described by this syntax. For example, Rat Genome Database has identifiers for Rat variants that resolve to detail pages about the variant: https://rgd.mcw.edu/rgdweb/report/rgdvariant/main.html?id=RGD:14349032. That same rat variant can also have HGVS nomenclature: https://www.alliancegenome.org/allele/RGD:1600311#genomic-variant-information
  3. For HGVS nomenclature, there are three sites (perhaps more) that describe the standard:

Questions:

Does this group have any opinions on a registered prefix for an identifier that isn't resolvable (and isn't a parked prefix, meaning, there is no service that plans to support the expansion of the registered prefix)?

@cthoyt
Member

cthoyt commented Jul 19, 2022

TLDR: this is like a 2/5 on in-scope, but we might be able to give some support anyway

It's possible to register prefixes even if there's no website that resolves them. However, I'm familiar with HGVS and it's not clear if HGVS strings count as identifiers the same way other "controlled" vocabularies do (the same way that we don't think of InChI and SMILES as prefixes where those strings are identifiers).

However, both of those managed to get prefixes anyway, so it might not be the worst thing to skip passing judgment. Please go ahead and send some new prefix requests for these and we will do our best to gather as much info about them as we can before accepting.

@cthoyt
Member

cthoyt commented Jul 19, 2022

Let's say that for these to be useful in the Bioregistry, we need a very good regex for enforcement.

cthoyt changed the title from "prefixes for unresolveable identifiers" to "Records for languages like SMILES, HGVS, and SPDI" on Jul 19, 2022
@sierra-moxon
Contributor Author

sierra-moxon commented Jul 19, 2022

SPDI is supported by an API: https://api.ncbi.nlm.nih.gov/variation/v0/spdi/NC_000001.10%3A12345%3A%3AC/, which basically breaks the nomenclature apart into its component pieces. In this case, perhaps the prefix 'spdi' would be useful.
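As a small sketch of how the URL above is formed: the SPDI string is percent-encoded (each `:` becomes `%3A`) and appended to the endpoint path. The helper name `spdi_api_url` is hypothetical, the base path is copied from the example URL, and the code only builds the URL without calling the service.

```python
from urllib.parse import quote

# Base path taken from the example URL in this comment.
SPDI_API_BASE = "https://api.ncbi.nlm.nih.gov/variation/v0/spdi/"


def spdi_api_url(spdi: str) -> str:
    """Build the lookup URL for one SPDI string (no network request is made)."""
    # safe="" forces ':' to be percent-encoded as %3A.
    return SPDI_API_BASE + quote(spdi, safe="") + "/"


print(spdi_api_url("NC_000001.10:12345::C"))
```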

J. Bradley Holmes (orcid:0000-0001-8354-5062) might be a good owner of this prefix because he authored "SPDI: data model for variants and applications at NCBI".

SPDI nomenclature syntax examples:

str:str:[str|""]:str[:|""]
NG_012345.1:4:1:T
NG_012345.1:4:G:T
NG_012345.1:4:1:
NG_012345.1:4:G:
NG_012345.1:4:0:T
NG_012345.1:4::T
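A first-pass regex covering the examples above might look like the sketch below. It is inferred only from these examples, not from an official grammar: it assumes four colon-separated fields, a sequence accession made of letters, digits, `_` and `.`, a numeric position, a deletion given as a length, a sequence, or nothing, and an insertion given as a sequence or nothing. The names `SPDI_PATTERN` and `parse_spdi` are hypothetical.

```python
import re

# Assumed shape: sequence:position:deletion:insertion
SPDI_PATTERN = re.compile(
    r"(?P<sequence>[A-Za-z0-9_.]+)"   # accession, e.g. NG_012345.1
    r":(?P<position>\d+)"             # numeric position
    r":(?P<deletion>\d+|[A-Za-z]*)"   # deleted length, deleted sequence, or empty
    r":(?P<insertion>[A-Za-z]*)"      # inserted sequence, or empty
)


def parse_spdi(text: str):
    """Return the four SPDI fields as a dict, or None if the string does not fit."""
    match = SPDI_PATTERN.fullmatch(text)
    return match.groupdict() if match else None


for example in ["NG_012345.1:4:1:T", "NG_012345.1:4:G:", "NG_012345.1:4::T"]:
    print(example, parse_spdi(example))
```

A stricter pattern for enforcement would probably restrict the accession format and the allowed sequence alphabet; both are left loose here because the thread does not pin them down.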

@cmungall
Contributor

cmungall commented Aug 15, 2022

Similar use cases:

  • UCUM codes (which are composable), acting as identifiers for units
  • arbitrary OWL expressions for post-composition, acting as identifiers for anonymous class expressions
  • GFF tuples to act as identifiers for regions on a genome
  • syntax for protein modifications, as an alternative to precomposed terms in PRO
  • LINUCS strings e.g. http://www.glycosciences.de/database/start.php?action=explore_linucsid&linucsid=1

https://units-of-measurement.org/
