# GA4GH Variation Representation Overview

The GA4GH Variation Representation Specification consists of two components: a JSON schema that describes the structure of data, and algorithmic conventions for using VR data structures to improve the consistency of sequence variation shared in the community.  This overview of the VR Schema is taken from the Python reference implementation at https://github.com/ga4gh/vr-python/blob/master/notebooks/Overview.ipynb. Users may also wish to explore https://github.com/ga4gh/vr-python/blob/master/notebooks/Extras.ipynb, which provides additional functionality for constructing VR objects from HGVS, SPDI, and VCF.

GA4GH VR was formerly known as the Variation Modelling Collaboration (VMC).

## Using the Reference Implementation
All publicly available functionality is accessed by importing from `ga4gh.core` and `ga4gh.vrs`, as shown below.

In [1]:
from ga4gh.core import sha512t24u, ga4gh_digest, ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import __version__, models, normalize

<br>
<div>
    <div style="border-radius: 10px; width: 80%; margin: 0 auto; padding: 5px; border: 2pt solid #660000; color: #660000; background: #f4cccc;">
    <span style="font-size: 200%">⚠</span> Import from <code>ga4gh.core</code> and <code>ga4gh.vrs</code> as shown above.  Submodules contain implementation details that are likely to change without notice.
    </div>
</div>

In [2]:
# You can see the version of ga4gh.vrs like so:
__version__

'0.2.2.dev14+gdafd779.d20190804'

## Top-Down View of VR Schema Classes

The top-level VR classes are Location and Variation.  A Location describes *where* an event occurs.  Variation describes an event at a Location. Location and Variation are *abstract* objects: their purpose is to provide a framework for the way we think about variation, but they do not represent any particular instance themselves.

Currently, there is only one Location class: SequenceLocation, which defines a precise span on a named sequence. Future Location classes will include Cytoband Locations, Gene Locations, as well as SequenceLocations using fuzzy and/or intronic coordinates.

There are two Variation subclasses:
* Text -- a blob of text, used when a textual representation is not (yet) parseable
* Allele -- contiguous state of a sequence or conceptual region

Future kinds of Variation will support haplotypes, genotypes, and translocations/fusions.

The top-level classes in VR are *identifiable*, meaning that VR prescribes how implementations can compute globally-consistent identifiers from the data.

See <a href="ga4gh-vr-schema.svg">this figure</a> for a schematic representation.
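
The class relationships described above can be sketched with plain Python dataclasses. This is only an illustration of the shape of the data, not the reference implementation: the class definitions and the `as_dict` helper below are hypothetical stand-ins written for this overview, and the real objects come from `ga4gh.vrs.models`.

```python
from dataclasses import dataclass, fields, is_dataclass


class Location:
    """Abstract: describes where an event occurs (never instantiated directly)."""


class Variation:
    """Abstract: describes an event at a Location (never instantiated directly)."""


@dataclass
class SimpleInterval:
    start: int
    end: int


@dataclass
class SequenceLocation(Location):
    sequence_id: str
    interval: SimpleInterval


@dataclass
class SequenceState:
    sequence: str


@dataclass
class Allele(Variation):
    location: SequenceLocation
    state: SequenceState


@dataclass
class Text(Variation):
    definition: str


def as_dict(obj):
    """Hypothetical helper: nested dict view with a 'type' key on every object."""
    out = {"type": type(obj).__name__}
    for f in fields(obj):
        value = getattr(obj, f.name)
        out[f.name] = as_dict(value) if is_dataclass(value) else value
    return out


example_allele = Allele(
    location=SequenceLocation(
        sequence_id="ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_",
        interval=SimpleInterval(start=42, end=43),
    ),
    state=SequenceState(sequence="A"),
)
```

The sketch omits the `_digest` field that the reference implementation adds once an identifier has been computed.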


### Locations
A Location is an *abstract* object that refers to contiguous regions of biological sequences. Concrete types of Locations are shown below.

The most common Location is a SequenceLocation, i.e., a Location based on a named sequence and an Interval on that sequence. Future Location types may also describe conceptual or symbolic locations, such as a cytoband region or a gene.
Any Location type may be used as the Location for Variation.

#### SimpleInterval

In [3]:
simple_interval = models.SimpleInterval(start=42, end=43)
simple_interval.as_dict()

{'end': 43, 'start': 42, 'type': 'SimpleInterval'}

#### SequenceLocation

In [4]:
# Implementations are responsible for providing a mechanism to convert
# conventional sequence accessions to ga4gh sequence references.
# See Extras notebook for an example.
sequence_id = "ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_"

In [5]:
# A SequenceLocation based on a SimpleInterval
sequence_location_si = models.SequenceLocation(
    sequence_id=sequence_id,
    interval=simple_interval)
ga4gh_identify(sequence_location_si)
sequence_location_si.as_dict()

{'_digest': 'v__fHi86NVjkAHVlswpvQfcY0W5nG0Dk',
 'interval': {'end': 43, 'start': 42, 'type': 'SimpleInterval'},
 'sequence_id': 'ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
 'type': 'SequenceLocation'}

### Text Variation

To support variation descriptions that cannot be parsed, or cannot be parsed yet, VR provides a Text schema object. The intention is to provide identifiers for *any* variation, particularly human descriptions of variation.

In [6]:
text_variation = models.Text(definition="PTEN loss")
text_variation.as_dict()

{'definition': 'PTEN loss', 'type': 'Text'}

In [7]:
ga4gh_identify(text_variation)

'ga4gh:VT.VX60NSGLem4X3Q8gnOSx48pZDCmJVSUk'

In [8]:
text_variation.as_dict()

{'_digest': 'VX60NSGLem4X3Q8gnOSx48pZDCmJVSUk',
 'definition': 'PTEN loss',
 'type': 'Text'}

### Alleles

An Allele is an assertion of a SequenceState at a Location. The many possible Location and SequenceState classes enable the representation of many kinds of Variation.

#### "Simple" sequence replacements
This case covers any "ref-alt" style variation, which includes SNVs, MNVs, del, ins, and delins.

In [9]:
sequence_state = models.SequenceState(sequence="A")
allele = models.Allele(location=sequence_location_si, state=sequence_state)
ga4gh_identify(allele)
allele.as_dict()

{'_digest': 'weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe',
 'location': {'_digest': 'v__fHi86NVjkAHVlswpvQfcY0W5nG0Dk',
  'interval': {'end': 43, 'start': 42, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'A', 'type': 'SequenceState'},
 'type': 'Allele'}

---
## Functions
Conventions in the VR specification are implemented through several algorithmic functions. They are:

* `normalize`: Implements sequence normalization for ins and del variation
* `sha512t24u`: Implements a convention for constructing and formatting digests of binary data
* `ga4gh_digest`: Generates a digest for a GA4GH object
* `ga4gh_serialize`: Serializes a GA4GH object using a canonical binary form
* `ga4gh_identify`: Generates a CURIE identifier for a GA4GH object


### normalize()
The VR specification REQUIRES that variation be reported as "expanded" alleles. Expanded alleles capture the entire region of insertion/deletion ambiguity, thereby facilitating comparisons that would otherwise require on-the-fly computations.

In [10]:
# Define a dinucleotide insertion on the following sequence at interbase (13, 13)
sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
#    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
#     C C C C C C C C A C A C A C A C A C T A G C A G C A G C A
#                              ^ insert CA here
interval = (13, 13)
alleles = (None, "CA")
args = dict(sequence=sequence, interval=interval, alleles=alleles, bounds=(0,len(sequence)))

In [11]:
# The expanded allele sequences
normalize(**args, mode="EXPAND")

((7, 18), ('CACACACACAC', 'CACACACACACAC'))

In [12]:
# For comparison, the left and right shuffled alleles
normalize(**args, mode="LEFTSHUFFLE")

((7, 7), ('', 'CA'))

In [13]:
normalize(**args, mode="RIGHTSHUFFLE")

((18, 18), ('', 'AC'))
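
The three modes above can be illustrated with a small standalone sketch. It is an assumption-laden simplification, not the code behind `normalize`: it handles only pure insertions of a non-empty sequence at a single interbase position, and the function names `shuffle_insertion` and `expand_insertion` are hypothetical.

```python
def shuffle_insertion(sequence, pos, inserted, direction):
    """Roll a pure insertion as far left or right as the flanking sequence allows.

    `pos` is an interbase position; `inserted` must be non-empty.
    Returns the shifted position and the rotated inserted sequence.
    """
    if direction == "left":
        # Moving left is possible while the base before the insertion
        # equals the last inserted base; rotate the insertion accordingly.
        while pos > 0 and sequence[pos - 1] == inserted[-1]:
            inserted = inserted[-1] + inserted[:-1]
            pos -= 1
    else:
        # Moving right is possible while the base after the insertion
        # equals the first inserted base.
        while pos < len(sequence) and sequence[pos] == inserted[0]:
            inserted = inserted[1:] + inserted[0]
            pos += 1
    return pos, inserted


def expand_insertion(sequence, pos, inserted):
    """Report the insertion over the whole region in which it is ambiguous."""
    left, _ = shuffle_insertion(sequence, pos, inserted, "left")
    right, _ = shuffle_insertion(sequence, pos, inserted, "right")
    ref = sequence[left:right]
    alt = sequence[left:pos] + inserted + sequence[pos:right]
    return (left, right), (ref, alt)


demo_sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
left_shuffled = shuffle_insertion(demo_sequence, 13, "CA", "left")
right_shuffled = shuffle_insertion(demo_sequence, 13, "CA", "right")
expanded = expand_insertion(demo_sequence, 13, "CA")
```

For the dinucleotide insertion defined above, the sketch yields the same positions and allele sequences as the three `normalize` calls. The expanded interval spans from the left-shuffled position to the right-shuffled position, which is why two implementations that started from different shuffles arrive at the same expanded allele.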

### sha512t24u() — Truncated SHA-512 digest
`sha512t24u` is a convention for constructing unique identifiers from binary objects (such as the output of serialization): the object is hashed with SHA-512, the digest is truncated to its first 24 bytes, and the result is encoded with base64url, the URL-safe variant of Base64.

In [14]:
sha512t24u(b"")

'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'

In [15]:
sha512t24u(b"ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
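
The convention is small enough to sketch with the Python standard library alone. The function name `sha512t24u_sketch` is hypothetical and is used here only to avoid confusion with the function imported from `ga4gh.core`.

```python
import base64
import hashlib


def sha512t24u_sketch(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


empty_digest = sha512t24u_sketch(b"")
acgt_digest = sha512t24u_sketch(b"ACGT")
```

Both values should match the outputs of the two cells above.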

### ga4gh_serialize()
Serialization is the process of converting an object to a *binary* representation for transmission or communication. In the context of generating GA4GH identifiers, serialization produces a *canonical* JSON form from which a digest is computed. The VR serialization is based on a JSON canonicalization scheme consistent with several existing proposals. See the spec for details.

Because the serialization and digest methods are well-defined, groups with the same data will generate the same digests and computed identifiers.

GA4GH serialization replaces inline identifiable objects with their digests in order to create a well-defined ordering. See the `location` property in the `Allele` example below.

<br>
<div>
    <div style="border-radius: 10px; width: 80%; margin: 0 auto; padding: 5px; border: 2pt solid #660000; color: #660000; background: #f4cccc;">
        <span style="font-size: 200%">⚠</span> Although JSON serialization and GA4GH canonical JSON serialization appear similar, they are NOT interchangeable and will generate different digests. GA4GH identifiers are defined <i>only</i> when used with the GA4GH serialization process.
    </div>
</div>

In [16]:
# This is the "simple" allele defined above, repeated here for readability
# Note that the location data is inlined
allele.as_dict()

{'_digest': 'weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe',
 'location': {'_digest': 'v__fHi86NVjkAHVlswpvQfcY0W5nG0Dk',
  'interval': {'end': 43, 'start': 42, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'A', 'type': 'SequenceState'},
 'type': 'Allele'}

In [17]:
# This is the serialized form. Notice that the inline `Location` instance was replaced with
# its digest and that the Allele's own `_digest` is not included.
ga4gh_serialize(allele)

b'{"location":"v__fHi86NVjkAHVlswpvQfcY0W5nG0Dk","state":{"sequence":"A","type":"SequenceState"},"type":"Allele"}'
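
The transformation from the dictionary form to the serialized form can be approximated in a few lines. This is a rough sketch under stated assumptions rather than the reference algorithm: it assumes that every nested identifiable object already carries a `_digest` entry, that keys beginning with an underscore are dropped, and that canonical JSON means sorted keys with no insignificant whitespace. The name `canonical_serialize_sketch` is hypothetical; see the spec for the authoritative rules.

```python
import json


def canonical_serialize_sketch(obj: dict) -> bytes:
    """Approximate GA4GH-style serialization of a dict (illustrative only)."""

    def prepare(value, top_level=False):
        if isinstance(value, dict):
            # Assumption: a nested object with a cached digest is identifiable
            # and is replaced by that digest.
            if not top_level and "_digest" in value:
                return value["_digest"]
            # Assumption: underscore-prefixed keys are internal and are dropped.
            return {k: prepare(v) for k, v in value.items() if not k.startswith("_")}
        if isinstance(value, list):
            return [prepare(v) for v in value]
        return value

    text = json.dumps(prepare(obj, top_level=True), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


allele_dict = {
    "_digest": "weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe",
    "location": {
        "_digest": "v__fHi86NVjkAHVlswpvQfcY0W5nG0Dk",
        "interval": {"end": 43, "start": 42, "type": "SimpleInterval"},
        "sequence_id": "ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_",
        "type": "SequenceLocation",
    },
    "state": {"sequence": "A", "type": "SequenceState"},
    "type": "Allele",
}
serialized = canonical_serialize_sketch(allele_dict)
```

Applied to the allele dictionary shown above, the sketch reproduces the byte string returned by `ga4gh_serialize`.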

### ga4gh_digest()
ga4gh_digest() returns the sha512t24u digest of an object's ga4gh_serialize() output.  The digest is cached within the object itself to minimize recomputation.

In [18]:
ga4gh_digest(allele)

'weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe'

In [19]:
sha512t24u(ga4gh_serialize(allele))

'weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe'

### ga4gh_identify()
VR computed identifiers are constructed from digests on serialized objects by prefixing a VR digest with a type-specific code.

In [20]:
# ga4gh_identify() uses the digest shown above to construct a CURIE-formatted identifier.
# The VA prefix identifies this object as a Variation Allele.
ga4gh_identify(allele)

'ga4gh:VA.weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe'
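
The identifier format itself is simple string composition. The sketch below assumes a lookup table containing only the two type codes that appear in this overview (`VA` for Allele and `VT` for Text); the names `TYPE_PREFIXES` and `make_curie` are hypothetical and are not part of `ga4gh.core`.

```python
# Type codes taken from the identifiers shown in this overview.
TYPE_PREFIXES = {"Allele": "VA", "Text": "VT"}


def make_curie(type_name: str, digest: str, namespace: str = "ga4gh") -> str:
    """Compose a CURIE of the form namespace:PREFIX.digest."""
    return f"{namespace}:{TYPE_PREFIXES[type_name]}.{digest}"


allele_id = make_curie("Allele", "weKX1iFVAnAa-jXQ4T8RijCcIlDnAaIe")
text_id = make_curie("Text", "VX60NSGLem4X3Q8gnOSx48pZDCmJVSUk")
```

These reproduce the Allele and Text identifiers computed earlier in this overview.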