# Operations on VR Objects

The VR Specification describes operations on variants that should be supported by implementations. This notebook demonstrates the following functions:

* `normalize`: Implements sequence normalization for insertion and deletion variation
* `sha512t24u`: Implements a convention for constructing and formatting digests for an object
* `ga4gh_digest`: Generates a digest for a GA4GH object
* `ga4gh_serialize`: Serializes a GA4GH object using a canonical binary form
* `ga4gh_identify`: Generates a CURIE identifier for a GA4GH object

<img src="images/id-dig-ser.png" width="75%" alt="Operations Overview"/>

**Note:** Most implementation users will need only the `ga4gh_identify` function.
We describe the `ga4gh_serialize`, `ga4gh_digest`, and `sha512t24u` functions here for completeness.

<div class="alert alert-warning">
    These operations require access to external data to translate sequence identifiers.
    See the vr-python README for installation options.
</div>

## Load data saved by Schema notebook
Load the allele JSON saved by the Schema notebook and rehydrate an `Allele` object.

In [1]:
import json
from ga4gh.vrs import models

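# Use a context manager so the file handle is closed after reading
with open("objects.json") as f:
    data = json.load(f)
# Rehydrate the first saved allele as an Allele object
allele = models.Allele(**data["alleles"][0])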
print(allele)
print(allele.as_dict())

Removing allOf attribute from CopyNumber to avoid python-jsonschema-objects error.
Removing allOf attribute from SequenceInterval to avoid python-jsonschema-objects error.
Removing allOf attribute from RepeatedSequenceExpression to avoid python-jsonschema-objects error.


<Allele attributes: _id, location, state, type>
{'type': 'Allele', 'location': {'type': 'SequenceLocation', 'sequence_id': 'refseq:NC_000019.10', 'interval': {'type': 'SimpleInterval', 'start': 44908821, 'end': 44908822}}, 'state': {'type': 'SequenceState', 'sequence': 'A'}}


## External Sequence Data

To support the full functionality of VR, implementations require access to all sequences and sequence identifiers that are used as variation reference sequences. For the purposes of this notebook, data are mocked as static responses.

The VR specification leaves the choice of those data sources to the implementations.  In vr-python, `ga4gh.vrs.dataproxy` provides an abstract base class as a basis for data source adapters.  One source is [SeqRepo](https://github.com/biocommons/biocommons.seqrepo/), which is used below.  (An adapter based on the GA4GH refget specification exists, but is pending necessary changes to the refget interface to provide accession-based lookups.)

SeqRepo: [github](https://github.com/biocommons/biocommons.seqrepo/) | [data snapshots](http://dl.biocommons.org/seqrepo/) | [seqrepo-rest-service @ github](https://github.com/biocommons/seqrepo-rest-service) | [seqrepo-rest-service docker images](https://cloud.docker.com/u/biocommons/repository/docker/biocommons/seqrepo-rest-service)

RefGet: [spec](https://samtools.github.io/hts-specs/refget.html) | [perl server](https://github.com/andrewyatz/refget-server-perl)

In [2]:
from ga4gh.core import sha512t24u
from ga4gh.core import ga4gh_digest, ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import __version__, models
from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy

# Requires that a seqrepo REST service is running at this URL (e.g., using the docker image)
seqrepo_rest_service_url = "https://services.genomicmedlab.org/seqrepo"
dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url)

In [3]:
dp.translate_sequence_identifier("refseq:NC_000019.10", "ga4gh")

['ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl']

In [4]:
dp.get_sequence("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", start=44908821-25, end=44908822+25)

'CCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGTACCAGGCCGGGGC'
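
For offline experimentation, the two calls above can be imitated with an in-memory stand-in. The following is a minimal sketch and is not part of vr-python: the class name `StaticDataProxy`, the identifier `ga4gh:SQ.hypothetical`, the alias `toy:seq1`, and the toy sequence are all invented for illustration. Only the two method names mirror the calls shown above.

```python
class StaticDataProxy:
    """Hypothetical in-memory stand-in for a sequence data source (illustration only)."""

    def __init__(self, sequences, aliases):
        self._sequences = sequences  # {primary identifier: sequence string}
        self._aliases = aliases      # {alias: primary identifier}

    def _resolve(self, identifier):
        # Map an alias to its primary identifier; primary identifiers map to themselves
        return self._aliases.get(identifier, identifier)

    def translate_sequence_identifier(self, identifier, namespace):
        # Collect every known name for the sequence, then keep those in `namespace`
        target = self._resolve(identifier)
        names = {target} | {a for a, t in self._aliases.items() if t == target}
        return sorted(n for n in names if n.startswith(namespace + ":"))

    def get_sequence(self, identifier, start=None, end=None):
        # Interbase (0-based, half-open) coordinates map directly onto Python slicing
        return self._sequences[self._resolve(identifier)][start:end]


# Made-up data: the identifier and sequence below are placeholders, not real records
mock = StaticDataProxy(
    sequences={"ga4gh:SQ.hypothetical": "ACGTACGTAC"},
    aliases={"toy:seq1": "ga4gh:SQ.hypothetical"},
)
translated = mock.translate_sequence_identifier("toy:seq1", "ga4gh")
subsequence = mock.get_sequence("ga4gh:SQ.hypothetical", start=2, end=6)
```

A real adapter would fetch these values from a service such as SeqRepo instead of from dictionaries.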

## normalize() 

The VR Spec REQUIRES that variation be reported as "expanded" alleles. Expanded alleles capture the entire region of insertion/deletion ambiguity, thereby facilitating comparisons that would otherwise require on-the-fly computations. Note: this example uses the bioutils `normalize` function rather than the one in `ga4gh.vrs`, because the latter does not support shuffling.

In [5]:
# Define a dinucleotide insertion on the following sequence at interbase (13, 13)
sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
#    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
#     C C C C C C C C A C A C A C A C A C T A G C A G C A G C A
#                              ^ insert CA here
interval = (13, 13)
alleles = (None, "CA")
args = dict(sequence=sequence, interval=interval, alleles=alleles, bounds=(0,len(sequence)))

In [6]:
# Import the submodule explicitly; `import bioutils` alone is not guaranteed to load it
import bioutils.normalize
# The expanded allele sequences. This is a concept that is valid in HGVS space.
bioutils.normalize.normalize(**args, mode="EXPAND")

((7, 18), ('CACACACACAC', 'CACACACACACAC'))

In [7]:
# For comparison, the left and right shuffled alleles
bioutils.normalize.normalize(**args, mode="LEFTSHUFFLE")

((7, 7), ('', 'CA'))

In [8]:
bioutils.normalize.normalize(**args, mode="RIGHTSHUFFLE")

((18, 18), ('', 'AC'))
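
The three results above are related: for a pure insertion, the expanded interval runs from the left-shuffled position to the right-shuffled position. The sketch below illustrates that relationship in plain Python. It is a simplified illustration that assumes a pure insertion with a non-empty inserted sequence, and it is not the bioutils implementation. The function name `shuffle_insertion` is hypothetical.

```python
def shuffle_insertion(sequence, pos, inserted):
    """Roll a pure insertion left and right, and derive the expanded form.

    `pos` is an interbase position; `inserted` must be a non-empty string.
    """
    # Roll left: while the base before the insertion point equals the last
    # inserted base, rotate the inserted sequence and move one position left.
    left, left_ins = pos, inserted
    while left > 0 and sequence[left - 1] == left_ins[-1]:
        left_ins = left_ins[-1] + left_ins[:-1]
        left -= 1

    # Roll right: while the base after the insertion point equals the first
    # inserted base, rotate the other way and move one position right.
    right, right_ins = pos, inserted
    while right < len(sequence) and sequence[right] == right_ins[0]:
        right_ins = right_ins[1:] + right_ins[0]
        right += 1

    # The expanded allele covers the whole region of ambiguity
    ref = sequence[left:right]
    alt = sequence[left:pos] + inserted + sequence[pos:right]
    return {
        "LEFTSHUFFLE": ((left, left), ("", left_ins)),
        "RIGHTSHUFFLE": ((right, right), ("", right_ins)),
        "EXPAND": ((left, right), (ref, alt)),
    }


result = shuffle_insertion("CCCCCCCCACACACACACTAGCAGCAGCA", 13, "CA")
```

For the example sequence, `result` reproduces the intervals and alleles shown in the three outputs above.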

In [9]:
# In contrast, the VR spec uses fully justified representations:
from ga4gh.vrs import normalize

## sha512t24u(): Truncated SHA-512 digest
`sha512t24u` is a convention for constructing unique identifiers from binary objects (such as serialized data). It computes a SHA-512 digest, truncates it to the first 24 bytes, and encodes the result with URL-safe Base64 (base64url).

In [10]:
sha512t24u(b"")

'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'

In [11]:
sha512t24u(b"ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
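
The convention can be reproduced with the Python standard library alone. The following is a minimal sketch, assuming the three steps named in `sha512t24u` (SHA-512, truncation to 24 bytes, base64url encoding). The name `sha512t24u_sketch` is hypothetical and is used only to avoid shadowing the imported function.

```python
import base64
import hashlib


def sha512t24u_sketch(blob: bytes) -> str:
    """Illustrative re-implementation of the truncated digest convention."""
    digest = hashlib.sha512(blob).digest()   # full 64-byte SHA-512 digest
    truncated = digest[:24]                  # keep only the first 24 bytes
    # 24 bytes encode to exactly 32 base64url characters, so no "=" padding appears
    return base64.urlsafe_b64encode(truncated).decode("ascii")


empty_digest = sha512t24u_sketch(b"")
acgt_digest = sha512t24u_sketch(b"ACGT")
```

Both values should match the outputs of `sha512t24u` shown above.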

## Computing Identifiers for objects

### ga4gh_serialize()
Serialization is the process of converting an object to a *binary* representation for transmission or communication. In the context of generating GA4GH identifiers, serialization is a process to generate a *canonical* JSON form in order to generate a digest. The VR serialization is based on a JSON canonicalization scheme consistent with several existing proposals. See the spec for details.

Because the serialization and digest methods are well-defined, groups with the same data will generate the same digests and computed identifiers.

GA4GH serialization replaces inline identifiable objects with their digests in order to create a well-defined ordering. See the `location` property in the `Allele` example below.

<br>
<div>
    <div style="border-radius: 10px; width: 80%; margin: 0 auto; padding: 5px; border: 2pt solid #660000; color: #660000; background: #f4cccc;">
        <span style="font-size: 200%">⚠</span> Although JSON serialization and GA4GH canonical JSON serialization appear similar, they are NOT interchangeable and will generate different digests. GA4GH identifiers are defined <i>only</i> when used with the GA4GH serialization process.
    </div>
</div>

In [12]:
# This is the "simple" allele defined above, repeated here for readability
# Note that the location data is inlined
allele.as_dict()

{'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'refseq:NC_000019.10',
  'interval': {'type': 'SimpleInterval', 'start': 44908821, 'end': 44908822}},
 'state': {'type': 'SequenceState', 'sequence': 'A'}}

In [13]:
# This is the serialized form. Notice that the inline `Location` instance was replaced with
# its digest and that the Allele id is not included.
ga4gh_serialize(allele)

b'{"location":"EhF8FehHeWNA9-R2CmWul4UU2D1eoqbZ","state":{"sequence":"A","type":"SequenceState"},"type":"Allele"}'
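
The final formatting step of this serialization can be approximated with the standard `json` module. The sketch below is an illustration under stated assumptions, not the vr-python implementation: it assumes that keys beginning with an underscore (such as `_id`) are dropped, that keys are sorted, and that no insignificant whitespace is emitted. It also assumes the nested location has already been replaced by its digest, and the `_id` value is a placeholder. The name `canonical_json_sketch` is hypothetical.

```python
import json


def canonical_json_sketch(obj) -> bytes:
    """Illustrative canonical JSON form: sorted keys, compact separators, UTF-8."""

    def strip_private(value):
        # Recursively drop keys that start with "_" (for example, "_id")
        if isinstance(value, dict):
            return {k: strip_private(v) for k, v in value.items() if not k.startswith("_")}
        if isinstance(value, list):
            return [strip_private(v) for v in value]
        return value

    text = json.dumps(strip_private(obj), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


# The location is given here as the digest shown in the output above
allele_with_digested_location = {
    "_id": "placeholder-id",  # placeholder value; removed during serialization
    "type": "Allele",
    "location": "EhF8FehHeWNA9-R2CmWul4UU2D1eoqbZ",
    "state": {"type": "SequenceState", "sequence": "A"},
}
serialized = canonical_json_sketch(allele_with_digested_location)
```

Because key order and whitespace are fixed, two dictionaries with the same content but different insertion order produce identical bytes.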

### ga4gh_digest()
`ga4gh_digest()` returns the `sha512t24u` digest of an object serialized with `ga4gh_serialize()`. The digest is cached within the object itself to minimize recomputation.

In [14]:
ga4gh_digest(allele)

'BMtuBCtBgBsT5hEpVcy7dxjCDT1kuXwu'

In [15]:
sha512t24u(ga4gh_serialize(allele))

'BMtuBCtBgBsT5hEpVcy7dxjCDT1kuXwu'

### ga4gh_identify()
VR computed identifiers are constructed from digests of serialized objects by prefixing the digest with the `ga4gh` namespace and a type-specific code (for example, `VA` for an Allele).

In [16]:
# ga4gh_identify() uses the digest computed above to construct a CURIE-formatted identifier.
# The VA prefix identifies this object as a Variation Allele.
ga4gh_identify(allele)

'ga4gh:VA.BMtuBCtBgBsT5hEpVcy7dxjCDT1kuXwu'
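
The structure of the identifier can be shown with simple string formatting. This is a hedged sketch rather than the library's code: the function names `identify_sketch` and `split_identifier` and the `TYPE_PREFIXES` table are hypothetical, and the table lists only two type codes (`VA` as shown above, and `VSL` for SequenceLocation as given in the VR specification).

```python
# Hypothetical, partial lookup table of type-specific codes
TYPE_PREFIXES = {
    "Allele": "VA",
    "SequenceLocation": "VSL",
}


def identify_sketch(object_type: str, digest: str) -> str:
    """Build a CURIE of the form ga4gh:<type code>.<digest>."""
    return f"ga4gh:{TYPE_PREFIXES[object_type]}.{digest}"


def split_identifier(identifier: str):
    """Split a computed identifier into (namespace, type code, digest)."""
    namespace, reference = identifier.split(":", 1)
    type_code, digest = reference.split(".", 1)
    return namespace, type_code, digest


allele_id = identify_sketch("Allele", "BMtuBCtBgBsT5hEpVcy7dxjCDT1kuXwu")
parts = split_identifier(allele_id)
```

Splitting on the first `.` is safe here because base64url digests contain only letters, digits, `-`, and `_`.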