# VMC Demo
## Model Background

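* Four types: Locations, Alleles, Haplotypes, Genotypes
* Id vs. Identifier: an Id is the `id` an object carries within a bundle; an Identifier is a namespace and accession (e.g., NCBI NC_000019.10) associated with an Id
* Objects reference each other by Id
* Each type has a serialization format based on Ids (for now)
* A computed Identifier is based on a digest of the serialized instance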

Types have a linear dependency:

    Sequence < Location < Allele < Haplotype < Genotype
               └Interval
    
## ApoE

The ApoE gene is known primarily for risks associated with Alzheimer's disease (AD) and hypercholesterolemia. Risk of AD is attributed to two positions, rs429358 and rs7412. Both positions are C/T transitions.

```
                             rs7412 
                             NC_000019.10:g.44908822
                             C          T
rs429358                 C   APOE-ε4    APOE-ε1
NC_000019.10:g.44908684  T   APOE-ε3    APOE-ε2
http://snpedia.com/index.php/APOE
```

## Setup

In [1]:
import collections
import datetime
import json

import jsonschema

from vmcdemo import models, computed_id, serialize, get_vmc_sequence_id, schema_path

# pretty-print an object's JSON serialization
def ppj(o):
    print(json.dumps(json.loads(o.serialize()), indent=4, sort_keys=True))

## Identifiers

In [2]:
identifiers = collections.defaultdict(list)

## Sequences
A description of sequence variation, with VMC or otherwise, requires the availability of sequences in order to define coordinate systems.  Typically sequences are referred to with an accession like NC_000019.10.  There are two issues with using sequence accessions:

* Identical sequences have different names (e.g., "NC_000019.10" == "CM000681.2" == (GRCh38) "19" == (GRCh38 UCSC) "chr19").  Naive comparison of the same allele defined using different sequence names will fail.
* With graph genomes, it will become infeasible to assign sequence identifiers.

For these reasons, VMC encourages (but doesn't require) the use of computed identifiers based on a SHA512 digest, truncated to 24 bytes, and URL-safe base64 encoded.
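
As a rough illustration of the digest scheme just described, the sketch below hashes some bytes with SHA-512, keeps the first 24 bytes, and encodes them as URL-safe base64. The function name `truncated_digest` is hypothetical, and exactly what vmcdemo feeds into the hash (for example, the raw sequence text and its encoding) is not shown in this notebook, so treat the input here as an assumption.

```python
import base64
import hashlib


def truncated_digest(data: bytes, n_bytes: int = 24) -> str:
    """SHA-512 digest of `data`, truncated to `n_bytes`, URL-safe base64 encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:n_bytes]).decode("ascii")


# Assumption: a toy input; a real sequence would be far longer.
example = truncated_digest(b"ACGT")
print(example, len(example))
```

Because 24 bytes is a multiple of 3, the base64 output is always 32 characters with no `=` padding, which matches the length of the digest portion of identifiers such as `VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl`.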

get_vmc_sequence_id returns the computed sequence identifier for a given accession.

In [3]:
ir = models.Identifier(namespace="NCBI", accession="NC_000019.10")
sequence_id = get_vmc_sequence_id(ir)

In [4]:
identifiers[sequence_id].append(ir)

## Intervals and Locations
An Interval is a <start, end> tuple in interbase coordinates.
A Location refers to a contiguous span within a sequence that is identified by reference.
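
Interbase coordinates count the boundaries between bases rather than the bases themselves, so a single base at 1-based position p corresponds to the interval <p-1, p>. The minimal sketch below shows that conversion; the helper name `snv_interval` is hypothetical and not part of vmcdemo.

```python
from typing import Tuple


def snv_interval(position_1based: int) -> Tuple[int, int]:
    """Return the interbase (start, end) interval covering one base."""
    if position_1based < 1:
        raise ValueError("1-based positions start at 1")
    return (position_1based - 1, position_1based)


# rs429358 is at NC_000019.10:g.44908684 (1-based)
start, end = snv_interval(44908684)
print(start, end)   # 44908683 44908684
print(end - start)  # 1 (interval length is simply end - start)
```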

In [5]:
locations_by_name = {
    "rs429358": models.Location(
        sequence_id = sequence_id,
        interval = models.Interval(start=44908683, end=44908684),
    ),
    "rs7412": models.Location(
        sequence_id = sequence_id,
        interval=models.Interval(start=44908821, end=44908822),
    )
}
for n, l in locations_by_name.items():
    l.id = computed_id(l)
    identifiers[l.id].append(models.Identifier(accession=n))

In [6]:
# This is the string that is hashed to generate a computed identifier
serialize(locations_by_name["rs429358"])

'<Location:VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl:<Interval:44908683:44908684>>'
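
The serialized string above appears to nest the Interval inside the Location and to join fields with colons. A minimal sketch that rebuilds that string by hand, and then hashes it, is shown below. The helper names are hypothetical, the `VMC:GL_` prefix is inferred from the ids printed in this notebook, and the encoding fed to the hash (ASCII here) is an assumption, so the resulting id is not guaranteed to match what `computed_id` returns.

```python
import base64
import hashlib


def serialize_interval(start: int, end: int) -> str:
    return f"<Interval:{start}:{end}>"


def serialize_location(sequence_id: str, start: int, end: int) -> str:
    # The Interval serialization is embedded as the last field of the Location.
    return f"<Location:{sequence_id}:{serialize_interval(start, end)}>"


def sketch_location_id(serialized: str) -> str:
    # Assumption: SHA-512, first 24 bytes, URL-safe base64, "VMC:GL_" prefix.
    digest = hashlib.sha512(serialized.encode("ascii")).digest()[:24]
    return "VMC:GL_" + base64.urlsafe_b64encode(digest).decode("ascii")


s = serialize_location("VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", 44908683, 44908684)
print(s)
print(sketch_location_id(s))
```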

In [7]:
ppj(locations_by_name["rs429358"])

{
    "id": "VMC:GL_1vQegOig0Fpx8eny8biLzexkhIAWeOZr",
    "interval": {
        "end": 44908684,
        "start": 44908683
    },
    "sequence_id": "VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"
}


## Alleles

In [8]:
alleles_by_name = {
    "rs429358T": models.Allele(location_id=locations_by_name["rs429358"].id, state="T"),
    "rs429358C": models.Allele(location_id=locations_by_name["rs429358"].id, state="C"),
    "rs7412T":   models.Allele(location_id=locations_by_name["rs7412"].id,   state="T"),
    "rs7412C":   models.Allele(location_id=locations_by_name["rs7412"].id,   state="C"),
}
for n, a in alleles_by_name.items():
    a.id = computed_id(a)
    identifiers[a.id].append(models.Identifier(accession=n))

In [9]:
serialize(alleles_by_name["rs429358C"])

'<Allele:VMC:GL_1vQegOig0Fpx8eny8biLzexkhIAWeOZr:C>'

In [10]:
ppj(alleles_by_name["rs429358C"])

{
    "id": "VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja",
    "location_id": "VMC:GL_1vQegOig0Fpx8eny8biLzexkhIAWeOZr",
    "state": "C"
}


## Haplotypes

In [11]:
haplotypes_by_name = {
    "ε1": models.Haplotype(
        allele_ids = [alleles_by_name["rs429358C"].id, alleles_by_name["rs7412T"].id],
        completeness = "COMPLETE"
    ),
    "ε2": models.Haplotype(
        allele_ids = [alleles_by_name["rs429358T"].id, alleles_by_name["rs7412T"].id],
        completeness = "COMPLETE"
    ),
    "ε3": models.Haplotype(
        allele_ids = [alleles_by_name["rs429358T"].id, alleles_by_name["rs7412C"].id],
        completeness = "COMPLETE"
    ),
    "ε4": models.Haplotype(
        allele_ids = [alleles_by_name["rs429358C"].id, alleles_by_name["rs7412C"].id],
        completeness = "COMPLETE"
    ),
}

for n, h in haplotypes_by_name.items():
    h.id = computed_id(h)
    identifiers[h.id].append(models.Identifier(accession=n))

In [12]:
ppj(haplotypes_by_name["ε4"])

{
    "allele_ids": [
        "VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja",
        "VMC:GA_nx6G6W7tgdd4TfZ9ZGiBxhLO31oEmq8c"
    ],
    "completeness": "COMPLETE",
    "id": "VMC:GH_SQqTwi0l0VhEMI2mMGFsin6sYpMnbez9"
}


In [13]:
# Reversing allele ids results in the same digest (that's good!)
h_ε4r = models.Haplotype(
        allele_ids = [alleles_by_name["rs7412C"].id, alleles_by_name["rs429358C"].id],
        completeness = "COMPLETE"
)
h_ε4r.id = computed_id(h_ε4r)
ppj(h_ε4r)

{
    "allele_ids": [
        "VMC:GA_nx6G6W7tgdd4TfZ9ZGiBxhLO31oEmq8c",
        "VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja"
    ],
    "completeness": "COMPLETE",
    "id": "VMC:GH_SQqTwi0l0VhEMI2mMGFsin6sYpMnbez9"
}
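
One way to get an order-independent digest like the one above is to sort the member ids before serializing. The sketch below illustrates that idea only; the serialization layout and the helper name `serialize_haplotype_sketch` are hypothetical, and this notebook does not show how vmcdemo actually normalizes the list.

```python
from typing import Sequence


def serialize_haplotype_sketch(allele_ids: Sequence[str], completeness: str) -> str:
    # Sorting makes the string independent of the caller's list order.
    members = ";".join(sorted(allele_ids))
    return f"<Haplotype:{completeness}:[{members}]>"


ids = [
    "VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja",
    "VMC:GA_nx6G6W7tgdd4TfZ9ZGiBxhLO31oEmq8c",
]
forward = serialize_haplotype_sketch(ids, "COMPLETE")
backward = serialize_haplotype_sketch(list(reversed(ids)), "COMPLETE")
print(forward == backward)  # True
```

Note that the `allele_ids` list itself keeps its original order in the JSON output above; only the digest is order-independent.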


## Genotypes

In [14]:
genotypes_by_name = {
    "ε2/ε3": models.Genotype(
        haplotype_ids = [haplotypes_by_name["ε2"].id, haplotypes_by_name["ε3"].id],
        completeness = "COMPLETE"
    ),
    "ε3/ε2": models.Genotype(
        haplotype_ids = [haplotypes_by_name["ε3"].id, haplotypes_by_name["ε2"].id],
        completeness = "COMPLETE"
    ),
    "ε4/ε4": models.Genotype(
        haplotype_ids = [haplotypes_by_name["ε4"].id, haplotypes_by_name["ε4"].id],
        completeness = "COMPLETE"
    ),
}

for n, g in genotypes_by_name.items():
    g.id = computed_id(g)
    identifiers[g.id].append(models.Identifier(accession=n))

## Bundle Serialization, Validation, and Roundtripping

In [15]:
bundle = models.Vmcbundle(
    meta=models.Meta(
        generated_at=datetime.datetime.now().isoformat(),
        vmc_version=0,
    ),
    locations = {o.id: o.as_dict() for o in locations_by_name.values()},
    alleles = {o.id: o.as_dict() for o in alleles_by_name.values()},
    haplotypes = {o.id: o.as_dict() for o in haplotypes_by_name.values()},
    genotypes = {o.id: o.as_dict() for o in genotypes_by_name.values()},
    identifiers = {n: [ir.as_dict() for ir in irs] for n, irs in identifiers.items()}
)

In [16]:
ppj(bundle)

{
    "alleles": {
        "VMC:GA_5Zd4WePIpdwMddQ8j5_KkfAKbVPMh96i": {
            "id": "VMC:GA_5Zd4WePIpdwMddQ8j5_KkfAKbVPMh96i",
            "location_id": "VMC:GL_1vQegOig0Fpx8eny8biLzexkhIAWeOZr",
            "state": "T"
        },
        "VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja": {
            "id": "VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja",
            "location_id": "VMC:GL_1vQegOig0Fpx8eny8biLzexkhIAWeOZr",
            "state": "C"
        },
        "VMC:GA_nx6G6W7tgdd4TfZ9ZGiBxhLO31oEmq8c": {
            "id": "VMC:GA_nx6G6W7tgdd4TfZ9ZGiBxhLO31oEmq8c",
            "location_id": "VMC:GL_Nt3BHblGnII4w04gaNLtRfBv1AJh9yvL",
            "state": "C"
        },
        "VMC:GA_s_N4tR8QoWlw0zWO-C5Ksnnd7iEwbiDW": {
            "id": "VMC:GA_s_N4tR8QoWlw0zWO-C5Ksnnd7iEwbiDW",
            "location_id": "VMC:GL_Nt3BHblGnII4w04gaNLtRfBv1AJh9yvL",
            "state": "T"
        }
    },
    "genotypes": {
        "VMC:GG_IFk8MgNL6B4IA_O7QOM7JHGBuc6sLfuE": {
            "completene

### Validate against schema

In [17]:
s = bundle.serialize()  # same as above ppj(bundle), but not pretty printed

In [18]:
# use a context manager so the schema file is closed after loading
with open(schema_path) as f:
    schema = json.load(f)
jsonschema.validate(bundle.as_dict(), schema)

### Verify that bundle roundtrips to same structure

In [19]:
bundle_round_trip = models.Vmcbundle(**json.loads(s))

In [20]:
bundle == bundle_round_trip

True
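
The round-trip check above relies on two properties: the object can be rebuilt from its parsed JSON, and equality compares field values rather than object identity. A minimal stand-in using a dataclass is sketched below; `ToyAllele` is a hypothetical class for illustration and is not the vmcdemo model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyAllele:
    id: str
    location_id: str
    state: str


original = ToyAllele(
    id="VMC:GA_NfZPeapbh-xGxqxvGM8X2Jer4BoQJOja",
    location_id="VMC:GL_1vQegOig0Fpx8eny8biLzexkhIAWeOZr",
    state="C",
)

# Serialize to JSON text, parse it back, and rebuild the object from the dict.
text = json.dumps(asdict(original), sort_keys=True)
restored = ToyAllele(**json.loads(text))

# Dataclasses compare field by field, so equal contents means equal objects.
print(original == restored)  # True
```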