Records for languages like SMILES, HGVS, and SPDI #460

sierra-moxon · 2022-07-19T14:55:27Z

Background:

SPDI (Sequence Position Deletion Insertion) nomenclature and HGVS (Human Genome Variation Society) nomenclature are two standards that, when used correctly, can uniquely identify sequence variants. Both provide a shorthand notation for capturing the genome, assembly, position, and sequence change of a sequence variant. In this way, they are a kind of identifier.

Motivation for Prefixes:

We have a group of users that would like to identify a prefix for either (or both of):

Use Cases:

Continuing from the biolink-model issue biolink/biolink-model#1042.

Biolink Model cares about identifier prefixes and, in particular, uses them to document the data sources that provide each class via the "id_prefixes" construct: https://linkml.io/linkml-model/docs/id_prefixes/
Software built to find identifier equivalences between data sources for the NCATS Data Translator project (the Node Normalizer) uses Biolink Model prefix lists to:
- return the preferred CURIE for this entity
- return all other known equivalent identifiers for the entity
- return semantic types for the entity as defined by the Biolink Model

It would be helpful to reuse the existing architecture above to place HGVS and SPDI "identifiers" in their appropriate Biolink Model classes and normalize them alongside other sequence variant identifiers from disparate data sets across NCATS Data Translator.
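To make the prefix-list idea concrete, here is a minimal sketch of how an ordered prefix list could pick a preferred CURIE from a set of equivalent identifiers. This is not the Node Normalizer's actual code: the `ID_PREFIXES` order, the `EXAMPLEDB` prefix, and the `preferred_curie` helper are all hypothetical, and `SPDI` is used as a prefix only for illustration since no such prefix is registered yet.

```python
from typing import Iterable, Optional, Sequence

# Hypothetical priority order for a sequence variant class.
# This is NOT the real Biolink Model id_prefixes list.
ID_PREFIXES = ["EXAMPLEDB", "HGVS", "SPDI"]


def preferred_curie(
    equivalents: Iterable[str],
    id_prefixes: Sequence[str] = ID_PREFIXES,
) -> Optional[str]:
    """Return the equivalent CURIE whose prefix ranks first in id_prefixes."""
    by_prefix = {}
    for curie in equivalents:
        # Split on the first colon only: SPDI local identifiers
        # themselves contain colons.
        prefix, sep, local_id = curie.partition(":")
        if sep and local_id:
            by_prefix.setdefault(prefix.upper(), curie)
    for prefix in id_prefixes:
        if prefix in by_prefix:
            return by_prefix[prefix]
    return None


if __name__ == "__main__":
    print(preferred_curie(["SPDI:NG_012345.1:4:G:T", "EXAMPLEDB:123"]))
```

One thing the sketch surfaces: because SPDI strings contain colons, any CURIE handling for them has to split on the first colon only.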

Challenges:

There is no service or site that resolves these identifiers to an informational page about the sequence variant. However, SPDI does have an API that returns a JSON data structure reflecting the content of the SPDI nomenclature for a particular variant.
Other groups also mint identifiers for the variant described by this syntax. For example, Rat Genome Database has identifiers for Rat variants that resolve to detail pages about the variant: https://rgd.mcw.edu/rgdweb/report/rgdvariant/main.html?id=RGD:14349032. That same rat variant can also have HGVS nomenclature: https://www.alliancegenome.org/allele/RGD:1600311#genomic-variant-information
For HGVS nomenclature, there are three sites (perhaps more) that describe the standard:

Questions:

Does this group have any opinions on a registered prefix for an identifier that isn't resolvable (and isn't a parked prefix, meaning, there is no service that plans to support the expansion of the registered prefix)?

cthoyt · 2022-07-19T15:03:10Z

TLDR: this is like a 2/5 on in-scope, but we might be able to give some support anyway

It's possible to register prefixes even if there's no website that resolves them. However, I'm familiar with HGVS and it's not clear if HGVS strings count as identifiers the same way other "controlled" vocabularies do (the same way that we don't think of InChI and SMILES as prefixes where those strings are identifiers).

However, both of those managed to get prefixes anyway, so it might not be the worst thing to skip passing judgment. Please go ahead and send some new prefix requests for these, and we will do our best to get as much info about them as possible before accepting.

cthoyt · 2022-07-19T16:08:11Z

Let's say that for these to be useful in the Bioregistry, we need a very good regex for enforcement

sierra-moxon · 2022-07-19T18:08:54Z

SPDI is supported by an API: https://api.ncbi.nlm.nih.gov/variation/v0/spdi/NC_000001.10%3A12345%3A%3AC/ which basically breaks the nomenclature apart into its component pieces. In this case, perhaps the prefix 'spdi' would be useful.
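For reference, a small sketch of how the URL above can be built from a raw SPDI string. The base URL and the example come from the link in this comment; the `spdi_api_url` helper name is made up here, and the sketch only constructs the URL, it does not call the service or assume anything about the JSON it returns.

```python
from urllib.parse import quote

# Base taken from the example URL in this comment.
SPDI_API_BASE = "https://api.ncbi.nlm.nih.gov/variation/v0/spdi/"


def spdi_api_url(spdi: str) -> str:
    """Percent-encode an SPDI string (colons become %3A) and append it to the base."""
    return f"{SPDI_API_BASE}{quote(spdi, safe='')}/"


if __name__ == "__main__":
    print(spdi_api_url("NC_000001.10:12345::C"))
```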

J. Bradley Holmes (orcid:0000-0001-8354-5062) might be a good owner of this prefix because he authored "SPDI: data model for variants and applications at NCBI".

SPDI nomenclature syntax examples:

str:str:[str|""]:str[:|""]
NG_012345.1:4:1:T
NG_012345.1:4:G:T
NG_012345.1:4:1:
NG_012345.1:4:G:
NG_012345.1:4:0:T
NG_012345.1:4::T
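As a starting point for a regex, here is a rough sketch that accepts the six examples above. It is an assumption drawn from those examples only, not from the SPDI specification: the accession character set, the choice to allow either a length or a sequence for the deletion, and the `parse_spdi` helper are all illustrative and would need tightening before being used for enforcement.

```python
import re
from typing import Dict, Optional

# Assumed shape: sequence:position:deletion:insertion
# - deletion may be a length (digits), a sequence (letters), or empty
# - insertion may be a sequence (letters) or empty
SPDI_PATTERN = re.compile(
    r"(?P<sequence>[A-Za-z0-9_.]+)"
    r":(?P<position>\d+)"
    r":(?P<deletion>\d+|[A-Za-z]*)"
    r":(?P<insertion>[A-Za-z]*)"
)

EXAMPLES = [
    "NG_012345.1:4:1:T",
    "NG_012345.1:4:G:T",
    "NG_012345.1:4:1:",
    "NG_012345.1:4:G:",
    "NG_012345.1:4:0:T",
    "NG_012345.1:4::T",
]


def parse_spdi(text: str) -> Optional[Dict[str, str]]:
    """Return the four SPDI components, or None if the string does not match."""
    match = SPDI_PATTERN.fullmatch(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for example in EXAMPLES:
        print(example, parse_spdi(example))
```

Note that this pattern also accepts a string with both the deletion and the insertion empty; whether that should be valid is a curation decision.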

cmungall · 2022-08-15T21:01:35Z

Similar use cases:

UCUM codes (which are composable), acting as identifiers for units
arbitrary OWL expressions for post-composition, acting as identifiers for anonymous class expressions
GFF tuples to act as identifiers for regions on a genome
syntax for protein modifications, as an alternative to precomposed terms in PRO
LINUCS strings e.g. http://www.glycosciences.de/database/start.php?action=explore_linucsid&linucsid=1

https://units-of-measurement.org/

cthoyt changed the title ~~prefixes for unresolveable identifiers~~ Records for languages like SMILES, HGVS, and SPDI Jul 19, 2022

cmungall mentioned this issue Aug 30, 2022

Add prefix glycosciences #537

Closed

cmungall mentioned this issue Nov 8, 2022

Discussion about how to improve UCUM #648

Open

cthoyt added the Curation label Mar 18, 2023

cthoyt mentioned this issue Jan 24, 2024

Add prefix [hgvs] #1032

Closed

cmungall mentioned this issue Feb 15, 2024

Handling of CURIEs that include square brackets don't conform to W3C specs and are incompatible with semantic web tools biopragmatics/curies#103

Open

cmungall mentioned this issue May 31, 2024

Add prefix uniprot.mnemonic #1110

Closed
