# Haplotypes

This notebook demonstrates VRS Haplotypes using ApoE alleles.

The ApoE gene is associated with risks of Alzheimer's disease and hypercholesterolemia. Risk of AD is attributed to haplotypes comprised of two locations, [rs429358](https://www.ncbi.nlm.nih.gov/snp/rs429358) and [rs7412](https://www.ncbi.nlm.nih.gov/snp/rs7412), both of which are C/T transitions.  The four ApoE haplotypes are defined by the two states (C and T) at the two locations shown below. (Each location is shown with GRCh37 , GRCh38, and RefSeq transcript coordinates.) 

```
                             rs7412 
                             NC_000019.9:g.45411941
                             NC_000019.10:g.44908822
                             NM_000041.3:c.526
rs429358                        C          T
NC_000019.9:g.45412079   C   APOE-ε4    APOE-ε1
NC_000019.10:g.44908684  T   APOE-ε3    APOE-ε2
NM_000041.3:c.388
```

Links:
* [NCBI APOE Gene record](https://ghr.nlm.nih.gov/gene/APOE)
* [ClinVar APO E4 record](https://www.ncbi.nlm.nih.gov/clinvar/variation/441269/)
* [Snpedia APOE page](http://snpedia.com/index.php/APOE)

# Setup

In [1]:
from ga4gh.vrs import models, vrs_deref, vrs_enref
from ga4gh.core import ga4gh_identify, ga4gh_serialize, ga4gh_digest, ga4gh_deref

import json
def ppo(o, indent=2):
    """pretty print object as json"""
    print(json.dumps(o.as_dict(), sort_keys=True, indent=indent))
    


Removing allOf attribute from CopyNumber to avoid python-jsonschema-objects error.
Removing allOf attribute from SequenceInterval to avoid python-jsonschema-objects error.
Removing allOf attribute from RepeatedSequenceExpression to avoid python-jsonschema-objects error.


## APOE Alleles
Construct the four Alleles above on GRCh38.

In [2]:
# NC_000019.10 (GRCh38 chr 19 primary assembly) sequence id
# The sequence id would typically be provided by a sequence repository
sequence_id = "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"

In [3]:
locations = {
    "rs429358_38": models.SequenceLocation(
        sequence_id = sequence_id,
        interval = models.SequenceInterval(start=models.Number(value=44908683, type="Number"), 
                                           end=models.Number(value=44908684, type="Number"), 
                                           type="SequenceInterval"),
        type="SequenceLocation"),
    "rs7412_38": models.SequenceLocation(
        sequence_id = sequence_id,
        interval=models.SequenceInterval(start=models.Number(value=44908821, type="Number"), 
                                         end=models.Number(value=44908822, type="Number"),
                                         type="SequenceInterval"),
        type="SequenceLocation")
}

In [4]:
alleles = {
    "rs429358_38_C": models.Allele(location=locations["rs429358_38"], state=models.SequenceState(sequence="C", type="SequenceState"), type="Allele"),
    "rs429358_38_T": models.Allele(location=locations["rs429358_38"], state=models.SequenceState(sequence="T", type="SequenceState"), type="Allele"),
    "rs7412_38_C":   models.Allele(location=locations["rs7412_38"],   state=models.SequenceState(sequence="C", type="SequenceState"), type="Allele"),
    "rs7412_38_T":   models.Allele(location=locations["rs7412_38"],   state=models.SequenceState(sequence="T", type="SequenceState"), type="Allele"),
}

In [5]:
haplotypes = {
    "APOE-ε1": models.Haplotype(members=[alleles["rs429358_38_C"], alleles["rs7412_38_T"]]),
    "APOE-ε2": models.Haplotype(members=[alleles["rs429358_38_T"], alleles["rs7412_38_T"]]),
    "APOE-ε3": models.Haplotype(members=[alleles["rs429358_38_T"], alleles["rs7412_38_C"]]),
    "APOE-ε4": models.Haplotype(members=[alleles["rs429358_38_C"], alleles["rs7412_38_C"]]),
}

In [6]:
ppo(haplotypes["APOE-ε1"])

{
  "members": [
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908684
          },
          "start": {
            "type": "Number",
            "value": 44908683
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908822
          },
          "start": {
            "type": "Number",
            "value": 44908821
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "T",
        "type": "SequenceState"
      },

In [7]:
# Generated a computed identifier for the Haplotype
ga4gh_identify(haplotypes["APOE-ε1"])

'ga4gh:VH.gwpj5LNuNqwI9TZ-MNI6h7AZhjJztQ4O'

In [8]:
# The order of haplotype members does not change GA4GH Computed Identifier
apoe1_alleles = (alleles["rs7412_38_T"], alleles["rs429358_38_C"])

#note: this should be the same, but currently is not?
print(ga4gh_identify(models.Haplotype(members=apoe1_alleles, type="Haplotype")))
print(ga4gh_identify(models.Haplotype(members=tuple(reversed(apoe1_alleles)), type="Haplotype")))
# assert (ga4gh_identify(models.Haplotype(members=apoe1_alleles)) ==
#        ga4gh_identify(models.Haplotype(members=tuple(reversed(apoe1_alleles)))))

ga4gh:VH.XcRh22GN0SXmi0J7RyvQFw5cXb35Pesy
ga4gh:VH.gwpj5LNuNqwI9TZ-MNI6h7AZhjJztQ4O


In [9]:
# Haplotype members may be referenced (rather than inline) for more concise representations
apoe1_haplotype_ref = vrs_enref(haplotypes["APOE-ε1"])
ppo(apoe1_haplotype_ref)

{
  "members": [
    "ga4gh:VA.Nat5xaRs9TtSkR5Vf33VgA9OLC72qRQW",
    "ga4gh:VA.KnG6BLTexv7o-j9LnYsgPxZkRUu1IRnp"
  ],
  "type": "Haplotype"
}
