# GA4GH Variation Representation Overview

This notebook provides an overview of all objects in the GA4GH VR schema. (GA4GH VR was formerly known as VMC. Renaming is in progress.)

## Top-down view of the VR Schema

The highest level objects of VR are Variation and VariationSet. Both are *abstract* objects. 

Conceptually, Variation can be any of the following:
* Text -- a blob of text, used when a textual representation is not (yet) parseable
* Allele -- contiguous state of a sequence or conceptual region
* Haplotype -- a set of Alleles known to be in phase
* Genotype -- a set of Haplotypes

A VariationSet is an arbitrary set of anything.

👉 The above hierarchy is being reevaluated.

An Allele consists of a Location and State, which are both abstract concepts.

Kinds of Locations:
* SequenceLocation
* CytobandLocation
* GeneLocation

Kinds of State:
* SequenceState
* CNVState


<div id='svgWrapper'>
    <img src='ga4gh-vr-schema.svg'/>
</div>

---
# Reference Implementation

In [2]:
from ga4gh.vr import models, serialize, computed_id, digest

---
# Functions
VR defines functions that operate on some objects. They are demonstrated below.

The VR digest is a convention for constructing unique identifiers from binary objects (such as the output of serialization) using the well-known SHA-512 hash and URL-safe Base64 (base64url) encoding.

Serialization is the process of converting a VR object to a *binary* representation. 

VR computed identifiers are constructed from digests on serialized objects by prefixing a VR digest with a type-specific code.

In [3]:
digest("")

'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'

In [4]:
digest("ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
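The digest convention described above can be sketched with the standard library alone. This is a minimal sketch, not the `ga4gh.vr` implementation: the function name `truncated_digest` is hypothetical, and the truncation to 24 bytes is an assumption inferred from the 32-character outputs above (32 base64url characters encode exactly 24 bytes).

```python
import base64
import hashlib


def truncated_digest(data: bytes, n_bytes: int = 24) -> str:
    """Return the base64url encoding of the first n_bytes of SHA-512(data).

    Assumption: the 24-byte truncation is inferred from the length of the
    digests shown in this notebook, not taken from the library source.
    """
    raw = hashlib.sha512(data).digest()[:n_bytes]
    # 24 bytes is a multiple of 3, so the encoded string carries no "=" padding.
    return base64.urlsafe_b64encode(raw).decode("ascii")


print(truncated_digest(b""))
```

Under that assumption, the sketch reproduces the `digest("")` value shown above.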

---
# Text Variation

To support variation descriptions that cannot be parsed, or cannot be parsed yet, VR provides a Text schema object. The intention is to provide ids for *any* variation, particularly human-written descriptions of variation.

In this example, the `serialize`, `digest`, and `computed_id` functions are demonstrated.

In [5]:
v = models.Text(definition="PTEN loss")
v.as_dict()

{'definition': 'PTEN loss', 'type': 'Text'}

In [6]:
# serialize returns a deterministic JSON representation as bytes. See the spec for details.
serialize(v)

b'{"definition":"PTEN loss","type":"Text"}'

In [7]:
# digest returns the digest for binary data
digest(serialize(v))

'VX60NSGLem4X3Q8gnOSx48pZDCmJVSUk'

In [8]:
# The computed_id is constructed from the digest
computed_id(v)

'VMC:GT_VX60NSGLem4X3Q8gnOSx48pZDCmJVSUk'
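The three steps above (serialize, digest, prefix) can be approximated without the library. The sketch below is illustrative only: the `sketch_*` names are hypothetical, the JSON settings (sorted keys, no whitespace) are chosen to match the `serialize` output shown above, the 24-byte digest truncation is an assumption, and the `GT` prefix is read off the `computed_id` output rather than taken from the spec.

```python
import base64
import hashlib
import json

# Assumed mapping from object type to identifier prefix, read off the
# "VMC:GT_..." output in this notebook; other types are not covered here.
TYPE_PREFIXES = {"Text": "GT"}


def sketch_serialize(obj: dict) -> bytes:
    """Deterministic JSON: keys sorted, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sketch_digest(blob: bytes) -> str:
    """Base64url of SHA-512 truncated to 24 bytes (truncation is an assumption)."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def sketch_computed_id(obj: dict, namespace: str = "VMC") -> str:
    """Prefix the digest of the serialized object with a type-specific code."""
    prefix = TYPE_PREFIXES[obj["type"]]
    return f"{namespace}:{prefix}_{sketch_digest(sketch_serialize(obj))}"


text = {"type": "Text", "definition": "PTEN loss"}
print(sketch_serialize(text))
print(sketch_computed_id(text))
```

Because keys are sorted before hashing, two dictionaries with the same content but different insertion order yield the same identifier.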

---
# Locations
A Location is an *abstract* object that refers to a contiguous region of a biological sequence. Concrete types of Locations are shown below.

The most common Location is a SequenceLocation, i.e., a Region on a named sequence.
Locations may also be more conceptual, such as a cytoband region or a gene.
Any of these may be used as the Location for Variation.

## Regions

Regions refer to contiguous spans of an implied sequence.
Regions are not identifiable objects, so they have no computed identifier.

### SimpleInterval

In [11]:
si = models.SimpleInterval(start=42, end=43)
serialize(si)

b'{"end":43,"start":42,"type":"SimpleInterval"}'
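To make `start` and `end` concrete, the sketch below shows how such an interval could select residues from a sequence. It assumes interbase (0-based, half-open) coordinates, which this notebook does not state explicitly; the helper `region_of` and the toy sequence are hypothetical.

```python
def region_of(sequence: str, start: int, end: int) -> str:
    """Return the residues covered by an interval.

    Assumption: interbase coordinates, so the interval covers
    sequence[start:end], exactly like a Python slice.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(f"interval [{start}, {end}) is outside the sequence")
    return sequence[start:end]


toy_sequence = "ACGTACGT"
one_residue = region_of(toy_sequence, 2, 3)   # a span of length 1
empty_span = region_of(toy_sequence, 2, 2)    # a zero-length span between residues
print(one_residue, repr(empty_span))
```

Under this reading, `SimpleInterval(start=42, end=43)` would span a single residue, and an interval with `start == end` would denote a position between two residues.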

### NestedInterval
TODO: Document conversion with ranged format.

In [12]:
ni = models.NestedInterval(
    inner=models.SimpleInterval(start=29,end=30),
    outer=models.SimpleInterval(start=30,end=39))
serialize(ni)

b'{"inner":{"end":30,"start":29,"type":"SimpleInterval"},"outer":{"end":39,"start":30,"type":"SimpleInterval"},"type":"NestedInterval"}'

### SequenceLocation

In [13]:
slsi = models.SequenceLocation(
    sequence_id="NM_0001234.5",
    interval=si)
slsi.id = computed_id(slsi)
serialize(slsi)

b'{"interval":{"end":43,"start":42,"type":"SimpleInterval"},"sequence_id":"NM_0001234.5","type":"SequenceLocation"}'

In [14]:
slni = models.SequenceLocation(sequence_id="NM_0001234.5", interval=ni)
slni.id = computed_id(slni)
serialize(slni)

b'{"interval":{"inner":{"end":30,"start":29,"type":"SimpleInterval"},"outer":{"end":39,"start":30,"type":"SimpleInterval"},"type":"NestedInterval"},"sequence_id":"NM_0001234.5","type":"SequenceLocation"}'

### CytobandLocation

In [15]:
cbl = models.CytobandLocation(chr="11", start="q22.3", end="q23.1")
cbl.id = computed_id(cbl)
serialize(cbl)

b'{"chr":"11","end":"q23.1","start":"q22.3","type":"CytobandLocation"}'

### GeneLocation

In [16]:
gl = models.GeneLocation(gene="HGNC:MSH2")
gl.id = computed_id(gl)
serialize(gl)

b'{"gene":"HGNC:MSH2","type":"GeneLocation"}'

# Alleles

An Allele is essentially just a pair of Location and State. The many possible Location and State types permit representing many flavors of variation.

### "Simple" sequence replacements
This case covers any "ref-alt" style variation, which includes SNVs, MNVs, deletions, insertions, and delins.

In [17]:
ss = models.SequenceState(sequence="A")
a = models.Allele(location=slsi, state=ss)
a.id = computed_id(a)
serialize(a)

b'{"location":{"id":"VMC:GL_8KJJStVL_dJigtK_AHyVp5AAipy1pMh8","interval":{"end":43,"start":42,"type":"SimpleInterval"},"sequence_id":"NM_0001234.5","type":"SequenceLocation"},"state":{"sequence":"A","type":"SequenceState"},"type":"Allele"}'
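The way one Location/State pair covers all of the "ref-alt" cases can be illustrated on a plain string. This is a sketch under the assumption of interbase (0-based, half-open) coordinates; `apply_replacement` and the toy reference are hypothetical and are not part of `ga4gh.vr`.

```python
def apply_replacement(reference: str, start: int, end: int, state: str) -> str:
    """Replace reference[start:end] with the state sequence."""
    return reference[:start] + state + reference[end:]


reference = "ACGTACGT"

snv = apply_replacement(reference, 2, 3, "T")         # one residue replaced by one
mnv = apply_replacement(reference, 2, 4, "CA")        # two residues replaced by two
deletion = apply_replacement(reference, 2, 4, "")     # empty state removes the span
insertion = apply_replacement(reference, 2, 2, "TT")  # zero-length span, non-empty state
delins = apply_replacement(reference, 2, 4, "A")      # span and state differ in length

for name, seq in [("snv", snv), ("mnv", mnv), ("deletion", deletion),
                  ("insertion", insertion), ("delins", delins)]:
    print(f"{name:10s} {seq}")
```

The variation type is never stored explicitly here; it follows from the span length and the state length.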

## EXPERIMENTAL: Copy Number Examples

> **_NOTE_**:  The copy number model is under development. The intention below is to demonstrate how CNV structures might be incorporated into the existing schema, not the merits of the CNV support.


### CNV of a simple SequenceLocation, copy location unknown/unspecified
The only difference between this example and the above SequenceState example is that we use a new State subclass, CNVState.

In [None]:
si = models.SimpleInterval(start=20,end=30)
sl = models.SequenceLocation(sequence_id="NM_0001234.5", interval=si)

cnvstate = models.CNVState(min_copies=3, max_copies=5, copy_measure="TOTAL")

a = models.Allele(location=sl, state=cnvstate)
a.id = computed_id(a)
a.as_dict()

### Same CNV, now with known location

In [None]:
si = models.SimpleInterval(start=20,end=30)
sl = models.SequenceLocation(sequence_id="NM_0001234.5", interval=si)

# 👉 Note addition of location in CNVState
# When CNV.location == Allele.location, CN is total copy number and copies are tandem
cnvstate = models.CNVState(min_copies=3, max_copies=5, copy_measure="TOTAL", location=sl)

a = models.Allele(location=sl, state=cnvstate)
a.id = computed_id(a)
a.as_dict()

### CNV at a Gene Location

Any Location subclass may be used to define an Allele. Below, a GeneLocation is used to define a gene-level CNV.

The `ncbigene` CURIE prefix is taken from identifiers.org.

In [None]:
gl = models.GeneLocation(gene="ncbigene:1473")

cnvstate = models.CNVState(min_copies=3, max_copies=5, copy_measure="RELATIVE")

a = models.Allele(location=gl, state=cnvstate)
a.id = computed_id(a)
a.as_dict()

### Gene Deletion

STILL THINKING: Expressing gene-level deletion could use a SequenceState(""), a CNVState(min_copies=0, max_copies=0), or create a new Deletion state. All have tradeoffs. Need to evaluate which minimizes the possibility of confusion.

In [None]:
a = models.Allele(location=gl, state=models.SequenceState(sequence=""))
a.id = computed_id(a)
a.as_dict()

# Haplotypes
A Haplotype is a collection of allele_ids, with optional specification of the covered location and completeness.

In [None]:
h = models.Haplotype(
    location_id=slsi.id,
    allele_ids=[
    'BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
    'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
    'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
    'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j'],
    completeness="PARTIAL")
h.id = computed_id(h)
h.as_dict()

# Genotypes
A Genotype is a collection of haplotype_ids, with optional specification of completeness.

In [None]:
g = models.Genotype(
    haplotype_ids=[
    'BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
    'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
    'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
    'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j'],
    completeness="PARTIAL")
g.id = computed_id(g)
g.as_dict()

# VariationSet
A VariationSet is just a bucket of ids; the objects they refer to need not exist.

In [None]:
vs = models.VariationSet(member_ids=[
    'BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
    'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
    'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
    'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j',
    'BOGUS:XX_pel3HzoNSMCEvPoQQD-AOBE8I8s0eCn9',
    'BOGUS:XX_X2x6a4Xvil365Ea-Po8WcuuQPWx973U8',
    'BOGUS:XX_QHDx_0DbssgtGljy-K1q7WAcNkqD5TY-',
    'BOGUS:XX_3x2p-8eCIc0pU-if_6CFBKGLziZRSWdz',
    'BOGUS:XX_RXF8gSNDDyPQ0opTA8ordEE6hGGm2GYJ',
    'BOGUS:XX_7tDaRPzXL4rfOLoYtRUUGCTy65ptDs8J'])
vs.id = computed_id(vs)
vs.as_dict()

---
# ga4gh.vr.extras

## Format translator
`ga4gh.vr.extras.translator` translates various formats into VR representations.

In [None]:
from ga4gh.vr.extras.translator import Translator
tlr = Translator()

In [None]:
a = tlr.from_beacon("13 : 32936732 G > C")
a.as_dict()

In [None]:
a = tlr.from_hgvs("NM_012345.6:c.22A>T")
a.as_dict()

In [None]:
a = tlr.from_spdi("NM_012345.6:21:1:T")
a.as_dict()

In [None]:
a = tlr.from_vcf("1-55516888-G-GA")   # gnomAD-style expression
a.as_dict()
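To give a feel for what such a translation involves, the sketch below parses the SPDI and gnomAD-style expressions used above into a sequence name, an interbase span, and a replacement sequence. It is not the `Translator` implementation: the names `Replacement`, `parse_spdi`, and `parse_gnomad` are hypothetical, no normalization is performed, and the chromosome name is left as-is rather than mapped to a sequence identifier.

```python
from typing import NamedTuple


class Replacement(NamedTuple):
    """Hypothetical container: sequence name, interbase span, new sequence."""
    sequence_id: str
    start: int
    end: int
    state: str


def parse_spdi(expr: str) -> Replacement:
    """Parse 'sequence:position:deletion:insertion' (position is 0-based)."""
    sequence_id, position, deletion, insertion = expr.split(":")
    start = int(position)
    # SPDI permits the deletion as either a length or the deleted sequence.
    del_len = int(deletion) if deletion.isdigit() else len(deletion)
    return Replacement(sequence_id, start, start + del_len, insertion)


def parse_gnomad(expr: str) -> Replacement:
    """Parse 'chrom-pos-ref-alt', where pos is 1-based as in VCF."""
    chrom, position, ref, alt = expr.split("-")
    start = int(position) - 1  # convert the 1-based position to interbase
    return Replacement(chrom, start, start + len(ref), alt)


print(parse_spdi("NM_012345.6:21:1:T"))
print(parse_gnomad("1-55516888-G-GA"))
```

A real translator would additionally resolve sequence names and may normalize the result, for example by trimming the shared `G` from the gnomAD-style insertion.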

## Feature "Localizer"
Converts feature-based locations to SequenceLocations. Cytoband features are taken from UCSC.

In [None]:
cbl = models.CytobandLocation(chr="11", start="q22.3", end="q23.1")
a = models.Allele(location=cbl, state=ss)

from ga4gh.vr.extras.localizer import Localizer
locr = Localizer(default_assembly_name="GRCh38")

aloc = locr.localize(a)
aloc.as_dict()

## Translating sequence identifiers to VMC sequence identifiers
Sequence lookup services are required to implement VMC operations, but the exact implementation is up to the implementer. The most important need is to translate sequence identifiers from RefSeq or other sources into VMC sequence identifiers.

In [None]:
from ga4gh.vr.extras.seqrepo import get_vmc_sequence_identifier
get_vmc_sequence_identifier("RefSeq:NC_000019.10")