VRS provides an algorithmic solution to deterministically generate a globally unique identifier from a VRS object itself. All valid implementations of the VRS Computed Identifier will generate the same identifier when the objects are identical, and will generate different identifiers when they are not. The VRS Computed Digest algorithm obviates centralized registration services, allows computational pipelines to generate "private" ids efficiently, and makes it easier for distributed groups to share data.
A VRS Computed Identifier for a VRS concept is computed as follows:
- The object SHOULD be
normalized <normalization>
. Normalization formally applies to all VRS classes. - Generate binary data to digest. If the object is a
sequence
string, encode it using UTF-8. Otherwise, serialize the object usingDigest Serialization <digest-serialization>
. Generate a truncated digest <truncated-digest>
from the binary data.Construct an identifier <identify>
based on the digest and object type.
Important
Normalizing objects is STRONGLY RECOMMENDED for interoperability. While normalization is not strictly required, automated validation mechanisms are anticipated that will likely disqualify Variation that is not normalized. See should-normalize
for a rationale.
The following diagram depicts the operations necessary to generate a computed identifier. These operations are described in detail in the subsequent sections.
Serialization, Digest, and Computed Identifier Operations
Entities are shown in gray boxes. Functions are denoted by bold italics. The yellow, green, and blue boxes, corresponding to the
sha512t24u
,ga4gh_digest
, andga4gh_identify
functions respectively, depict the dependencies among functions.SHA512
is SHA-512 truncated to 24 bytes (192 bits), using the SHA-512 initialization vector. base64url is the official name of the variant of Base64 encoding that uses a URL-safe character set. [figure source]
Note
Most implementation users will need only the ga4gh_identify
function. We describe the ga4gh_serialize
, ga4gh_digest
, and sha512t24u
functions here primarily for implementers.
Implementations MUST adhere to the following requirements:
- Implementations MUST use the normalization, serialization, and digest mechanisms described in this section when generating GA4GH Computed Identifiers. Implementations MUST NOT use any other normalization, serialization, or digest mechanism to generate a GA4GH Computed Identifier.
- Implementations MUST ensure that all nested objects are identified with GA4GH Computed Identifiers. Implementations MAY NOT reference nested objects using identifiers in any namespace other than
ga4gh
.
Note
The GA4GH schema MAY be used with identifiers from any namespace. For example, a SequenceLocation may be defined using a sequence_id = refseq:NC_000019.10
. However, an implementation of the Computed Identifier algorithm MUST first translate sequence accessions to GA4GH SQ
accessions to be compliant with this specification.
Digest serialization converts a VRS object into a binary representation in preparation for computing a digest of the object. The Digest Serialization specification ensures that all implementations serialize variation objects identically, and therefore that the digests will also be identical. provides validation tests to ensure compliance.
Important
Do not confuse Digest Serialization with JSON serialization or other serialization forms. Although Digest Serialization and JSON serialization appear similar, they are NOT interchangeable and will generate different GA4GH Digests.
Although several proposals exist for serializing arbitrary data in a consistent manner ([Gibson], [OLPC], [JCS]), none have been ratified. As a result, defines a custom serialization format that is consistent with these proposals but does not rely on them for definition; it is hoped that a future ratified standard will be forward compatible with the process described here.
The first step in serialization is to generate message content. If the object is a string representing a sequence
, the serialization is the UTF-8 encoding of the string. Because this is a common operation, implementations are strongly encouraged to precompute GA4GH sequence identifiers as described in required-data
.
If the object is an instance of a VRS class, implementations MUST:
- ensure that objects are referenced with identifiers in the
ga4gh
namespace- replace each nested
identifiable object
with their corresponding digests.- order arrays of digests and ids by Unicode Character Set values
- filter out fields that start with underscore (e.g., _id)
- filter out fields with null values
The second step is to JSON serialize the message content with the following REQUIRED constraints:
The criteria for the digest serialization method was that it must be relatively easy and reliable to implement in any common computer language.
Example
allele = models.Allele(location=models.SequenceLocation(
sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
interval=simple_interval),
state=models.SequenceState(sequence="T"))
ga4gh_serialize(allele)
Gives the following binary (UTF-8 encoded) data:
{"location":"u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx","state":{"sequence":"T","type":"SequenceState"},"type":"Allele"}
For comparison, here is one of many possible JSON serializations of the same object:
allele.for_json()
- {
- "location": {
- "interval": {
"end": 44908822, "start": 44908821, "type": "SimpleInterval"
}, "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", "type": "SequenceLocation"
}, "state": { "sequence": "T", "type": "SequenceState" }, "type": "Allele"
}
The sha512t24u truncated digest algorithm [Hart2020] computes an ASCII digest from binary data. The method uses two well-established standard algorithms, the SHA-512 hash function, which generates a binary digest from binary data, and a URL-safe variant of Base64 encoding, which encodes binary data using printable characters.
Computing the sha512t24u truncated digest for binary data consists of three steps:
- Compute the SHA-512 digest of a binary data.
- Truncate the digest to the left-most 24 bytes (192 bits). See
truncated-digest-collision-analysis
for the rationale for 24 bytes. - Encode the truncated digest as a base64url ASCII string.
>>> import base64, hashlib
>>> def sha512t24u(blob):
digest = hashlib.sha512(blob).digest()
tdigest = digest[:24]
tdigest_b64u = base64.urlsafe_b64encode(tdigest).decode("ASCII")
return tdigest_b64u
>>> sha512t24u(b"ACGT")
'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
The final step of generating a computed identifier for a VRS object is to generate a W3C CURIE formatted identifier, which has the form:
prefix ":" reference
The GA4GH VRS constructs computed identifiers as follows:
"ga4gh" ":" type_prefix "." <digest>
Warning
Do not confuse the W3C CURIE prefix
("ga4gh") with the type prefix.
Type prefixes used by VRS are:
SQ, Sequence VA, Allele VH, Haplotype VAB, Abundance VS, VariationSet VSL, SequenceLocation VCL, ChromosomeLocation VT, Text
For example, the identifer for the allele example under digest-serialization
gives:
ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_
- Gibson
- Hart2020
Hart RK, Prlić A. SeqRepo: A system for managing local collections of biological sequences. PLoS One. 2020;15: e0239883. doi:10.1371/journal.pone.0239883
- JCS
- OLPC