# VMC Digest and Computed Identifiers Example

This notebook walks through the VMC digest and its application to computed identifiers.

The VMC digest is *not* a new algorithm. Instead, it is merely a set of conventions prescribing how to apply existing (and well-respected) methods to arbitrary data. When data participants follow these conventions, they are guaranteed to generate exactly the same identifiers for the same underlying data, thereby obviating central naming authorities and also enabling names for local entities.

Identifiers provide unique names for data. An identifier may imply the naming authority and/or where to obtain the data, but these inferences are not generally guaranteed. For example, NCBI:NC_000019.10 refers to chromosome 19 from the GRCh38 primary assembly, but that sequence may be obtained from many sources. (Conceptually, this is akin to the ontology of URIs: a URI is an identifier for a resource; most URIs are URLs because they define where to get data, but some are URNs because they name data without describing where to get it.)

An Identifier is a *structured* object with *namespace* and *accession* components. Identifier is modeled after the W3C CURIE format (https://www.w3.org/TR/curie/) and is intended to become consistent with FHIR specifications. (As of Spring 2018, FHIR precludes a colon in an identifier, which conflicts with the CURIE standard.) The namespace and accession components of an Identifier correspond to the "namespace" and "local part" attributes of a CURIE.
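
To make the two-part structure concrete, here is a minimal sketch of an Identifier as a named tuple with a colon-joined, CURIE-style string form. The class name `Identifier` and the helpers `to_curie` and `from_curie` are hypothetical names chosen for illustration; they are not taken from the VMC specification or any VMC library.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    """Hypothetical stand-in for a VMC Identifier: a namespace plus an accession."""

    namespace: str
    accession: str

    def to_curie(self) -> str:
        # Join the two components with a colon, CURIE-style.
        return f"{self.namespace}:{self.accession}"

    @classmethod
    def from_curie(cls, curie: str) -> "Identifier":
        # Split on the first colon only, so an accession may itself contain colons.
        namespace, accession = curie.split(":", 1)
        return cls(namespace, accession)


ident = Identifier.from_curie("NCBI:NC_000019.10")
print(ident)             # Identifier(namespace='NCBI', accession='NC_000019.10')
print(ident.to_curie())  # NCBI:NC_000019.10
```

Splitting on the first colon only is a design assumption of this sketch: it keeps the namespace unambiguous even if an accession happens to contain a colon.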

In [1]:
import base64
import binascii
import hashlib

digest_size = 24

## ① vmc_digest


Specifically, the conventions are:
* The payload to be digested should be ASCII encoded
* The digest is computed with SHA-512 and truncated to its first 24 bytes
* The truncated digest is encoded using base64 with the URL-safe character set

In [2]:
# First, a complete example. Here we generate the vmc_digest for the string "ACGT" 

data = "ACGT"  # a str (Unicode text)
blob = data.encode("ascii")
digest = hashlib.sha512(blob).digest()
truncated_digest = digest[:digest_size]
vmc_digest = base64.urlsafe_b64encode(truncated_digest)
vmc_digest.decode("ascii")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
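
The choice of 24 bytes and the URL-safe alphabet has two practical consequences that can be checked directly. The sketch below uses only standard-library calls; the variable names `raw`, `standard`, and `urlsafe` are introduced here for illustration.

```python
import base64
import hashlib

digest_size = 24  # bytes, as in the notebook

raw = hashlib.sha512(b"ACGT").digest()[:digest_size]
standard = base64.b64encode(raw).decode("ascii")
urlsafe = base64.urlsafe_b64encode(raw).decode("ascii")

# 24 bytes = 192 bits = 32 six-bit base64 characters, so no "=" padding is needed.
print(len(urlsafe), "=" in urlsafe)  # 32 False

# The URL-safe alphabet differs from standard base64 only in "+" -> "-" and "/" -> "_".
print(urlsafe == standard.replace("+", "-").replace("/", "_"))  # True

# Decoding recovers the truncated digest exactly.
print(base64.urlsafe_b64decode(urlsafe) == raw)  # True
```

Because 24 is a multiple of 3, the encoded digest is always exactly 32 characters, and it can be embedded in URLs and file names without escaping.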

In [3]:
# For code clarity and maintainability, it is preferable to have a single function to generate digests.

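def vmc_digest(data):
    """Return the VMC digest of an ASCII-encodable string."""
    blob = data.encode("ascii")                  # str -> bytes; raises on non-ASCII input
    digest = hashlib.sha512(blob).digest()       # full 64-byte SHA-512 digest
    tdigest = digest[:digest_size]               # keep the first 24 bytes
    vdigest = base64.urlsafe_b64encode(tdigest).decode("ascii")
    return vdigest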

In [4]:
vmc_digest("ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
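
The following sketch checks three behaviors that follow from the conventions: the digest is deterministic, it is sensitive to small changes in the input, and non-ASCII input is rejected instead of being silently re-encoded. The function is repeated so the block runs on its own; the variables `same`, `different`, and `rejected` are illustrative names.

```python
import base64
import hashlib

digest_size = 24


def vmc_digest(data):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    blob = data.encode("ascii")
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:digest_size]).decode("ascii")


# Identical input always yields an identical digest.
same = vmc_digest("ACGT") == vmc_digest("ACGT")

# A one-character change yields a different digest.
different = vmc_digest("ACGT") != vmc_digest("ACGA")

# Non-ASCII input fails in str.encode("ascii") before any hashing happens.
try:
    vmc_digest("ACGT\u00e9")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(same, different, rejected)  # True True True
```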

## ② computed identifiers and ids

In [5]:
# For structured data, we must first *serialize* the object. VMC defines serialization formats
# for all objects. For example, a Location object serialization might be:
data = "<Location|VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl|<Interval|44908683|44908684>>"
vdigest = vmc_digest(data)
vdigest

'L1IS6jOwSUsOpKihGRcqxHul1IwbV-1s'
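
The serialization string above was written by hand. As a rough sketch, a helper could assemble it from the object's fields. The function `serialize_location` and its parameters are hypothetical, and the pattern is inferred only from the example string above; the VMC specification remains the authority on the actual serialization rules.

```python
def serialize_location(sequence_id, start, end):
    """Hypothetical helper: build a Location serialization shaped like the example above."""
    interval = "<Interval|{start}|{end}>".format(start=start, end=end)
    return "<Location|{seq}|{interval}>".format(seq=sequence_id, interval=interval)


serialized = serialize_location("VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", 44908683, 44908684)
print(serialized)
# <Location|VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl|<Interval|44908683|44908684>>
```

Generating the string from fields, instead of typing it, helps ensure that two parties serializing the same object produce byte-identical input to the digest.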

In [6]:
# An Identifier is a structured object with a namespace and an accession. The namespace
# indicates the origin of the identifier, and the accession is the unique key within that namespace.
# For VMC objects, the namespace is "VMC" and the accession is constructed by prefixing the digest with
# a type indicator. For Location objects, the type indicator is "GL" (Global Location).
ir = {"namespace": "VMC", "accession": "GL" + "_" + vdigest}
ir

{'namespace': 'VMC', 'accession': 'GL_L1IS6jOwSUsOpKihGRcqxHul1IwbV-1s'}

In [7]:
# An Id is a string that may be used as a key for data. One way to generate an Id is to serialize an
# Identifier, like so:
vmc_id = "{ir[namespace]}:{ir[accession]}".format(ir=ir)
vmc_id

'VMC:GL_L1IS6jOwSUsOpKihGRcqxHul1IwbV-1s'
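
The steps in cells 5 through 7 can be combined into one function. This is a minimal sketch, not the reference implementation: `computed_id` and its parameters `type_prefix` and `namespace` are hypothetical names, and the input is assumed to be an already-serialized object.

```python
import base64
import hashlib


def vmc_digest(data, digest_size=24):
    # SHA-512 of the ASCII payload, truncated, then URL-safe base64 encoded.
    digest = hashlib.sha512(data.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:digest_size]).decode("ascii")


def computed_id(type_prefix, serialization, namespace="VMC"):
    """Hypothetical helper: serialization -> digest -> accession -> Id string."""
    accession = "{}_{}".format(type_prefix, vmc_digest(serialization))
    return "{}:{}".format(namespace, accession)


location = "<Location|VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl|<Interval|44908683|44908684>>"
loc_id = computed_id("GL", location)
print(loc_id)
```

Given a byte-identical serialization, any party running this function should arrive at the same Id, which is what makes a central naming authority unnecessary.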

In [8]:
# VMC makes extensive use of ids when linking objects