# Variation Set Examples

In [1]:
from ga4gh.vrs import models, vrs_enref, vrs_deref
from ga4gh.core import ga4gh_identify, ga4gh_serialize, ga4gh_digest, ga4gh_enref, ga4gh_deref

import json
def ppo(o):
    """pretty print object as json"""
    print(json.dumps(o.as_dict(), indent=2))

Removing allOf attribute from CopyNumber to avoid python-jsonschema-objects error.
Removing allOf attribute from SequenceInterval to avoid python-jsonschema-objects error.
Removing allOf attribute from RepeatedSequenceExpression to avoid python-jsonschema-objects error.


## Setup Sample Alleles
VariationSet objects may contain any kind of Variation and need not be homogeneous. For example, a VariationSet might contain two genomic Alleles (perhaps on different primary assemblies), a transcript Allele, and a protein Allele. The examples below use three Alleles on a made-up `sequence_id`.

In [2]:
a1 = models.Allele(
    location=models.SequenceLocation(
        sequence_id="ga4gh:SQ.01234abcde",
        interval=models.SimpleInterval(start=10, end=11, type="SimpleInterval"),
        type="SequenceLocation"
    ),
    state=models.SequenceState(sequence="C", type="SequenceState"),
    type="Allele"
)
a2 = models.Allele(
    location=models.SequenceLocation(
        sequence_id="ga4gh:SQ.01234abcde",
        interval=models.SimpleInterval(start=20, end=21, type="SimpleInterval"),
        type="SequenceLocation"
    ),
    state=models.SequenceState(sequence="C", type="SequenceState"),
    type="Allele"
)
a3 = models.Allele(
    location=models.SequenceLocation(
        sequence_id="ga4gh:SQ.01234abcde",
        interval=models.SimpleInterval(start=30, end=31, type="SimpleInterval"),
        type="SequenceLocation"
    ),
    state=models.SequenceState(sequence="C", type="SequenceState"),
    type="Allele"
)

In [3]:
alleles = [a1, a2, a3]
alleles

[<Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 11> start=<Literal<int> 10> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.01234abcde> type=<Literal<str> SequenceLocation>> state=<LiteralSequenceExpression sequence=<Literal<str> C> type=<Literal<str> SequenceState>> type=<Literal<str> Allele>>,
 <Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 21> start=<Literal<int> 20> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.01234abcde> type=<Literal<str> SequenceLocation>> state=<LiteralSequenceExpression sequence=<Literal<str> C> type=<Literal<str> SequenceState>> type=<Literal<str> Allele>>,
 <Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 31> start=<Literal<int> 30> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.01234abcde> type=<Literal<str> SequenceLocation>> state=<LiteralSe

## VariationSet

### Inlined
"Inlined" VRS objects are those that have identifiable objects nested within them. Because the objects are nested, an inlined object is self-contained.

In [4]:
vs_inlined = models.VariationSet(members=[a1,a2,a3], type="VariationSet")

In [5]:
ppo(vs_inlined)

{
  "type": "VariationSet",
  "members": [
    {
      "type": "Allele",
      "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.01234abcde",
        "interval": {
          "type": "SimpleInterval",
          "start": 10,
          "end": 11
        }
      },
      "state": {
        "type": "SequenceState",
        "sequence": "C"
      }
    },
    {
      "type": "Allele",
      "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.01234abcde",
        "interval": {
          "type": "SimpleInterval",
          "start": 20,
          "end": 21
        }
      },
      "state": {
        "type": "SequenceState",
        "sequence": "C"
      }
    },
    {
      "type": "Allele",
      "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.01234abcde",
        "interval": {
          "type": "SimpleInterval",
          "start": 30,
          "end": 31
        }
      },
      "state": {
      

In [6]:
ga4gh_identify(vs_inlined)

'ga4gh:VS.WVC_R7OJ688EQX3NrgpJfsf_ctQUsVP3'

In [7]:
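# The computed id is intended to be independent of the order of members.
# Note: the id recorded below differs from the one shown for In [6], so this
# run did not demonstrate order independence for inlined members.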
vs_inlined2 = models.VariationSet(members=[a3,a2,a1], type="VariationSet")
ga4gh_identify(vs_inlined2)

'ga4gh:VS.AajzQroZB9ZUK4tNzVuDOiG2WhMczHUe'

### Referenced Objects
"Referenced" VRS objects are those that refer to other objects by identifier rather than nesting them. The examples in this section show the referenced form of the previous inlined examples.

In [8]:
allele_ids = [ga4gh_identify(a) for a in alleles]
allele_ids

['ga4gh:VA.6xjH0Ikz88s7MhcyN5GJTa1p712-M10W',
 'ga4gh:VA.7k2lyIsIsoBgRFPlfnIOeCeEgj_2BO7F',
 'ga4gh:VA.ikcK330gH3bYO2sw9QcTsoptTFnk_Xjh']

In [9]:
# computed id is the same when members are defined by id
vs_referenced = models.VariationSet(members=allele_ids, type="VariationSet")
vs_referenced.as_dict()

{'type': 'VariationSet',
 'members': ['ga4gh:VA.6xjH0Ikz88s7MhcyN5GJTa1p712-M10W',
  'ga4gh:VA.7k2lyIsIsoBgRFPlfnIOeCeEgj_2BO7F',
  'ga4gh:VA.ikcK330gH3bYO2sw9QcTsoptTFnk_Xjh']}

In [10]:
ga4gh_identify(vs_referenced)

'ga4gh:VS.WVC_R7OJ688EQX3NrgpJfsf_ctQUsVP3'

**IMPORTANT** Notice that the computed identifiers for inlined and referenced VariationSets are identical. 

In [11]:
assert ga4gh_identify(vs_inlined) == ga4gh_identify(vs_referenced)
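
One way to see why the inlined and referenced forms can share an identifier: if identification first reduces each member to its digest and sorts those digests before hashing, the result depends only on which members are present, not on how they were supplied. The sketch below illustrates that idea with the standard library only. `sketch_identify_set` is a hypothetical helper, and its serialization is an assumption that may not reproduce the identifiers the library computes. `sha512t24u` follows the GA4GH truncated-digest scheme (SHA-512, first 24 bytes, base64url-encoded).

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(blob).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def sketch_identify_set(member_ids):
    """Hypothetical helper: derive a set identifier from member identifiers.

    Assumption: members are reduced to their bare digests and sorted, so the
    result does not depend on the order in which members were supplied.
    """
    member_digests = sorted(m.split(".", 1)[1] for m in member_ids)
    payload = {"members": member_digests, "type": "VariationSet"}
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "ga4gh:VS." + sha512t24u(blob)


ids = [
    "ga4gh:VA.6xjH0Ikz88s7MhcyN5GJTa1p712-M10W",
    "ga4gh:VA.7k2lyIsIsoBgRFPlfnIOeCeEgj_2BO7F",
    "ga4gh:VA.ikcK330gH3bYO2sw9QcTsoptTFnk_Xjh",
]
forward = sketch_identify_set(ids)
backward = sketch_identify_set(list(reversed(ids)))
print(forward == backward)  # True: members are sorted before hashing
```

Sorting before hashing is the design choice that makes the identifier behave like a property of the set rather than of a particular list.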

## Enref / Deref

In [12]:
vs = models.VariationSet(members=[a1,a2,a3], type="VariationSet") 
vs.as_dict()

{'type': 'VariationSet',
 'members': [{'type': 'Allele',
   'location': {'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.01234abcde',
    'interval': {'type': 'SimpleInterval', 'start': 10, 'end': 11}},
   'state': {'type': 'SequenceState', 'sequence': 'C'}},
  {'type': 'Allele',
   'location': {'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.01234abcde',
    'interval': {'type': 'SimpleInterval', 'start': 20, 'end': 21}},
   'state': {'type': 'SequenceState', 'sequence': 'C'}},
  {'type': 'Allele',
   'location': {'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.01234abcde',
    'interval': {'type': 'SimpleInterval', 'start': 30, 'end': 31}},
   'state': {'type': 'SequenceState', 'sequence': 'C'}}]}

In [13]:
# "enref" recursively identifies and stores the embedded objects in the object store
object_store = {}
vs2 = vrs_enref(vs, object_store=object_store)
vs2.as_dict()

{'type': 'VariationSet',
 'members': ['ga4gh:VA.6xjH0Ikz88s7MhcyN5GJTa1p712-M10W',
  'ga4gh:VA.7k2lyIsIsoBgRFPlfnIOeCeEgj_2BO7F',
  'ga4gh:VA.ikcK330gH3bYO2sw9QcTsoptTFnk_Xjh']}

In [14]:
# object_store now contains the fully-referenced forms of all objects, recursively
list(object_store.keys())

['ga4gh:VSL.EIy4ssWCI2YW3XDTSaf26A75Zjxqu0qD',
 'ga4gh:VA.6xjH0Ikz88s7MhcyN5GJTa1p712-M10W',
 'ga4gh:VSL.SHAyou8BM660a9u9OXzn7h-DYOX9OSMD',
 'ga4gh:VA.7k2lyIsIsoBgRFPlfnIOeCeEgj_2BO7F',
 'ga4gh:VSL.FEJTkuL6G4U2WUJ2LgejLm--ZUDnCiV7',
 'ga4gh:VA.ikcK330gH3bYO2sw9QcTsoptTFnk_Xjh',
 'ga4gh:VS.WVC_R7OJ688EQX3NrgpJfsf_ctQUsVP3']

In [15]:
# "deref" reconstitutes the fully inlined objects
vs3 = vrs_deref(vs2, object_store=object_store)
vs3.as_dict()

{'type': 'VariationSet',
 'members': [{'type': 'Allele',
   'location': {'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.01234abcde',
    'interval': {'type': 'SimpleInterval', 'start': 10, 'end': 11}},
   'state': {'type': 'SequenceState', 'sequence': 'C'}},
  {'type': 'Allele',
   'location': {'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.01234abcde',
    'interval': {'type': 'SimpleInterval', 'start': 20, 'end': 21}},
   'state': {'type': 'SequenceState', 'sequence': 'C'}},
  {'type': 'Allele',
   'location': {'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.01234abcde',
    'interval': {'type': 'SimpleInterval', 'start': 30, 'end': 31}},
   'state': {'type': 'SequenceState', 'sequence': 'C'}}]}

In [16]:
vs == vs3

True
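
The round trip above can be pictured with plain dictionaries. The following is a minimal sketch, not the library's implementation: `toy_id`, `toy_enref`, and `toy_deref` are hypothetical names, the `toy:` identifiers are not GA4GH identifiers, and only one level of nesting is handled, whereas `vrs_enref` and `vrs_deref` work recursively.

```python
import base64
import hashlib
import json


def toy_id(obj, prefix):
    """Hypothetical content-based id: hash of the canonical JSON form."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24])
    return f"toy:{prefix}.{digest.decode('ascii')}"


def toy_enref(vs, object_store):
    """Replace each inlined member with its id and store the member."""
    member_ids = []
    for member in vs["members"]:
        member_id = toy_id(member, "VA")
        object_store[member_id] = member
        member_ids.append(member_id)
    return {**vs, "members": member_ids}


def toy_deref(vs, object_store):
    """Replace each member id with the stored object."""
    return {**vs, "members": [object_store[m] for m in vs["members"]]}


inlined = {
    "type": "VariationSet",
    "members": [
        {"type": "Allele", "start": 10, "end": 11, "sequence": "C"},
        {"type": "Allele", "start": 20, "end": 21, "sequence": "C"},
    ],
}
store = {}
referenced = toy_enref(inlined, store)
restored = toy_deref(referenced, store)
print(restored == inlined)  # True: deref undoes enref
```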

## Error cases
The following examples show intentional errors.

### Members must be unique (a set)

In [17]:
import python_jsonschema_objects as pjs
try:
    vs = models.VariationSet(members=[a1,a2,a3,a3], type="VariationSet")
except pjs.ValidationError as e:
    print(e)

[<Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 11> start=<Literal<int> 10> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.01234abcde> type=<Literal<str> SequenceLocation>> state=<LiteralSequenceExpression sequence=<Literal<str> C> type=<Literal<str> SequenceState>> type=<Literal<str> Allele>>, <Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 21> start=<Literal<int> 20> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.01234abcde> type=<Literal<str> SequenceLocation>> state=<LiteralSequenceExpression sequence=<Literal<str> C> type=<Literal<str> SequenceState>> type=<Literal<str> Allele>>, <Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 31> start=<Literal<int> 30> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.01234abcde> type=<Literal<str> SequenceLocation>> state=<LiteralSequ
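
A uniqueness rule of this kind can be checked before constructing the set. The sketch below works on plain dictionaries and is only an illustration: `find_duplicate_members` is a hypothetical helper, and comparing members by their sorted-key JSON form is an assumption about what "equal" means, not a description of how the schema validator works.

```python
import json


def find_duplicate_members(members):
    """Return members that repeat an earlier member (compared by content)."""
    seen = set()
    duplicates = []
    for member in members:
        key = json.dumps(member, sort_keys=True)
        if key in seen:
            duplicates.append(member)
        else:
            seen.add(key)
    return duplicates


m1 = {"type": "Allele", "start": 10, "end": 11, "sequence": "C"}
m2 = {"type": "Allele", "start": 20, "end": 21, "sequence": "C"}
m3 = {"type": "Allele", "start": 30, "end": 31, "sequence": "C"}

duplicates = find_duplicate_members([m1, m2, m3, dict(m3)])
print(len(duplicates))  # 1: the second copy of m3
```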