# GA4GH Variation Representation Schema

This notebook demonstrates the use of the VR schema to represent variation in APOE.  Objects created in this notebook are saved at the end and used by other notebooks to demonstrate other features of the VR specification.


## APOE Variation

                                 rs7412 
                                 NC_000019.10:g.44908822
                                 NM_000041.3:c.526
                                 C          T
    rs429358                 C   APOE-ε4    APOE-ε1
    NC_000019.10:g.44908684  T   APOE-ε3    APOE-ε2
    NM_000041.3:c.388

Note: The example currently uses only rs7412:T. Future versions of the schema will support haplotypes and genotypes, and these examples will be extended appropriately.

## Using the VR Reference Implementation

See https://github.com/ga4gh/vr-python for information about installing the reference implementation.

In [1]:
from ga4gh.vrs import __version__, models
__version__

Removing allOf attribute from CopyNumber to avoid python-jsonschema-objects error.
Removing allOf attribute from SequenceInterval to avoid python-jsonschema-objects error.
Removing allOf attribute from RepeatedSequenceExpression to avoid python-jsonschema-objects error.


'0.7.7.dev1+g92313b5.d20230223'

## Schema Overview

<img src="images/schema-current.png" width="75%" alt="Current Schema"/>

## Sequences

The VR Specification expects the existence of a repository of biological sequences. At a minimum, these sequences must be indexed using whatever accessions are available. Implementations that wish to use the computed identifier mechanism should also have precomputed ga4gh sequence accessions. Either way, sequences must be referred to using [W3C Compact URIs (CURIEs)](https://w3.org/TR/curie/). In the examples below, we'll use "refseq:NC_000019.10" to refer to chromosome 19 from GRCh38.
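As a rough illustration of the CURIE form used here, the sketch below splits an identifier into its prefix and reference parts. The helper name `split_curie` is hypothetical and not part of the reference implementation, and the check is deliberately simpler than the full W3C CURIE syntax.

```python
def split_curie(curie):
    """Split 'prefix:reference' at the first colon (simplified CURIE handling)."""
    prefix, sep, reference = curie.partition(":")
    if not sep or not prefix or not reference:
        raise ValueError(f"not a prefix:reference identifier: {curie!r}")
    return prefix, reference


# The sequence identifier used throughout this notebook.
prefix, reference = split_curie("refseq:NC_000019.10")
print(prefix, reference)
```

The prefix names the namespace (here `refseq`) and the reference is the accession within that namespace.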

## Locations
A Location is an *abstract* object that refers to contiguous regions of biological sequences.

In the initial release of VR, the only Location is a SequenceLocation, which represents a precise interval (`SimpleInterval`) on a sequence.  GA4GH VR uses interbase coordinates exclusively; therefore the 1-based residue position 44908822 is referred to using the 0-based interbase interval <44908821, 44908822>.
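The coordinate conversion described above can be sketched in plain Python. The helper `residue_to_interbase` is a hypothetical name introduced for illustration; it is not part of the reference implementation.

```python
def residue_to_interbase(first, last=None):
    """Convert a 1-based, inclusive residue range to a 0-based interbase interval.

    Interbase positions sit between residues, so residue N spans (N - 1, N).
    """
    if last is None:
        last = first  # a single residue
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return (first - 1, last)


# rs7412 at NC_000019.10:g.44908822
print(residue_to_interbase(44908822))  # (44908821, 44908822)
```

With this convention the length of the region is always `end - start`, which also lets a zero-width interval (`start == end`) denote a position between two residues.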

Future Location subclasses will provide for approximate coordinates, gene symbols, and cytogenetic bands.

### SequenceLocation

In [2]:
location = models.SequenceLocation(
    sequence_id="refseq:NC_000019.10",
    interval=models.SimpleInterval(start=44908821, end=44908822, type="SimpleInterval"),
    type="SequenceLocation")

In [3]:
location.as_dict()

{'type': 'SequenceLocation',
 'sequence_id': 'refseq:NC_000019.10',
 'interval': {'type': 'SimpleInterval', 'start': 44908821, 'end': 44908822}}

## Variation

### Text Variation

The `Text` class represents variation descriptions that cannot be parsed, or cannot be parsed yet. The primary use for this class is to allow unparsed variation to be represented within the VR framework and be associated with annotations.

In [4]:
variation = models.Text(definition="APO loss", type="Text")
variation.as_dict()

{'type': 'Text', 'definition': 'APO loss'}

### Alleles

An Allele is an assertion of a state of biological sequence at a Location.  In the first version of the VR Schema, the only State subclass is SequenceState, which represents the replacement of sequence.  Future versions of State will enable representations of copy number variation.

### "Simple" sequence replacements
This case covers any "ref-alt" style variation, which includes SNVs, MNVs, del, ins, and delins.

In [5]:
allele = models.Allele(location=location,
                       state=models.SequenceState(sequence="A", type="SequenceState"),
                       type="Allele")
allele.as_dict()

{'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'refseq:NC_000019.10',
  'interval': {'type': 'SimpleInterval', 'start': 44908821, 'end': 44908822}},
 'state': {'type': 'SequenceState', 'sequence': 'A'}}
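To illustrate how the other "ref-alt" cases fit the same shape, here is a minimal sketch that builds Allele-shaped dictionaries without the reference implementation. The helper `allele_dict` is hypothetical, and apart from the rs7412 position the intervals and sequences are made up for illustration; they are not real APOE variants. It assumes a deletion is written as an empty replacement sequence and an insertion as a zero-width interval.

```python
def allele_dict(sequence_id, start, end, sequence):
    """Build an Allele-shaped dict; start/end are interbase coordinates."""
    if not 0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")
    return {
        "type": "Allele",
        "location": {
            "type": "SequenceLocation",
            "sequence_id": sequence_id,
            "interval": {"type": "SimpleInterval", "start": start, "end": end},
        },
        "state": {"type": "SequenceState", "sequence": sequence},
    }


seq_id = "refseq:NC_000019.10"
examples = {
    "snv": allele_dict(seq_id, 44908821, 44908822, "T"),     # one residue replaced
    "del": allele_dict(seq_id, 44908821, 44908822, ""),      # residue replaced by nothing
    "ins": allele_dict(seq_id, 44908822, 44908822, "GG"),    # zero-width interval
    "delins": allele_dict(seq_id, 44908821, 44908823, "A"),  # two residues replaced by one
}

for name, a in examples.items():
    iv = a["location"]["interval"]
    # Residues removed vs. residues written in their place.
    print(name, iv["end"] - iv["start"], len(a["state"]["sequence"]))
```

In every case the Allele says the same thing: the sequence between `start` and `end` is replaced by the state's sequence.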

----

## Saving the objects

Objects created in this notebook will be saved as a JSON file and loaded by subsequent notebooks.

In [6]:
import json
filename = "objects.json"

In [7]:
data = {
    "alleles": [allele.as_dict()],
    "locations": [location.as_dict()]
}

In [8]:
with open(filename, "w") as f:  # the context manager closes the file after writing
    json.dump(data, f, indent=4)
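A later notebook would read the file back with `json.load`. The sketch below round-trips the same structure through a temporary directory so that it runs on its own; the dictionaries are copied from the outputs above, and the temporary path is only an assumption for the example.

```python
import json
import os
import tempfile

location = {
    "type": "SequenceLocation",
    "sequence_id": "refseq:NC_000019.10",
    "interval": {"type": "SimpleInterval", "start": 44908821, "end": 44908822},
}
allele = {
    "type": "Allele",
    "location": location,
    "state": {"type": "SequenceState", "sequence": "A"},
}
data = {"alleles": [allele], "locations": [location]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "objects.json")
    with open(path, "w") as f:
        json.dump(data, f, indent=4)
    with open(path) as f:
        loaded = json.load(f)

# Plain dicts, lists, strings, and ints survive the round trip unchanged.
print(loaded == data)
```

The loaded values are plain dictionaries; turning them back into model objects is left to the consuming notebook.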