# Haplotypes and Genotypes

This notebook demonstrates VRS Haplotypes and Genotypes using ApoE alleles.

The ApoE gene is associated with risks of Alzheimer's disease (AD) and hypercholesterolemia. AD risk is attributed to haplotypes defined at two locations, [rs429358](https://www.ncbi.nlm.nih.gov/snp/rs429358) and [rs7412](https://www.ncbi.nlm.nih.gov/snp/rs7412), both of which are C/T transitions. The four ApoE haplotypes are defined by the two states (C and T) at the two locations shown below. (Each location is shown with GRCh37, GRCh38, and RefSeq transcript coordinates.)

```
                             rs7412 
                             NC_000019.9:g.45411941
                             NC_000019.10:g.44908822
                             NM_000041.3:c.526
rs429358                        C          T
NC_000019.9:g.45412079   C   APOE-ε4    APOE-ε1
NC_000019.10:g.44908684  T   APOE-ε3    APOE-ε2
NM_000041.3:c.388
```
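The table can also be read as a lookup from the pair of observed bases to a haplotype name. The following is a minimal sketch in plain Python; the names `APOE_HAPLOTYPES` and `apoe_haplotype` are hypothetical, introduced only for illustration, and are not part of `ga4gh.vr`.

```python
# Keys are (base at rs429358, base at rs7412), copied from the table above.
APOE_HAPLOTYPES = {
    ("C", "C"): "APOE-ε4",
    ("C", "T"): "APOE-ε1",
    ("T", "C"): "APOE-ε3",
    ("T", "T"): "APOE-ε2",
}

def apoe_haplotype(rs429358, rs7412):
    """Return the APOE haplotype name for the bases at the two locations."""
    return APOE_HAPLOTYPES[(rs429358.upper(), rs7412.upper())]

print(apoe_haplotype("T", "C"))
```

Any base other than C or T raises a `KeyError`, which seems a reasonable default for a table that only defines four combinations.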

Links:
* [Genetics Home Reference APOE gene record](https://ghr.nlm.nih.gov/gene/APOE)
* [ClinVar APO E4 record](https://www.ncbi.nlm.nih.gov/clinvar/variation/441269/)
* [SNPedia APOE page](http://snpedia.com/index.php/APOE)

# Setup

In [1]:
from ga4gh.vr import models, class_refatt_map
from ga4gh.core import ga4gh_identify, ga4gh_serialize, ga4gh_digest, ga4gh_enref, ga4gh_deref

import json
def ppo(o, indent=2):
    """pretty print object as json"""
    print(json.dumps(o.as_dict(), sort_keys=True, indent=indent))
    
object_store = {}
def vr_enref(o): return ga4gh_enref(o, cra_map=class_refatt_map, object_store=object_store)
def vr_deref(o): return ga4gh_deref(o, cra_map=class_refatt_map, object_store=object_store)

## APOE Alleles
Construct the four Alleles above on GRCh38.

In [2]:
# NC_000019.10 (GRCh38 chr 19 primary assembly) sequence id
# Typically, this would be provided by a sequence repository that
# translates accessions into ga4gh sequence identifiers
sequence_id = "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"

In [3]:
locations = {
    "rs429358_38": models.Location(
        sequence_id = sequence_id,
        interval = models.Interval(start=44908683, end=44908684)),
    "rs7412_38": models.Location(
        sequence_id = sequence_id,
        interval=models.Interval(start=44908821, end=44908822))
}
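The interval bounds above are interbase (0-based, end-exclusive) coordinates, whereas the `g.` positions in the table are 1-based. For a single base at 1-based position `p`, the interval is `start = p - 1`, `end = p`. A small sketch of that conversion follows; `hgvs_pos_to_interbase` is a hypothetical helper written for this notebook, not a function of `ga4gh.vr`.

```python
def hgvs_pos_to_interbase(pos):
    """Convert a 1-based single-base position to an interbase (start, end) pair."""
    if pos < 1:
        raise ValueError("1-based positions must be >= 1")
    return (pos - 1, pos)

# GRCh38 (NC_000019.10) positions from the table above
print(hgvs_pos_to_interbase(44908684))  # rs429358
print(hgvs_pos_to_interbase(44908822))  # rs7412
```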

In [4]:
alleles = {
    "rs429358_38_C": models.Allele(location=locations["rs429358_38"], state=models.SequenceState(sequence="C")),
    "rs429358_38_T": models.Allele(location=locations["rs429358_38"], state=models.SequenceState(sequence="T")),
    "rs7412_38_C": models.Allele(location=locations["rs7412_38"], state=models.SequenceState(sequence="C")),
    "rs7412_38_T": models.Allele(location=locations["rs7412_38"], state=models.SequenceState(sequence="T")),
}

# Quick Overview

In [5]:
# Create a haplotype
hap1 = models.Haplotype(members=[alleles["rs429358_38_C"], alleles["rs7412_38_T"]])
ppo(hap1)

{
  "completeness": "UNKNOWN",
  "members": [
    {
      "location": {
        "interval": {
          "end": 44908684,
          "start": 44908683,
          "type": "SimpleInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": 44908822,
          "start": 44908821,
          "type": "SimpleInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "T",
        "type": "SequenceState"
      },
      "type": "Allele"
    }
  ],
  "type": "Haplotype"
}


In [6]:
# Generate a computed identifier for the Haplotype
ga4gh_identify(hap1)

'ga4gh:VH.ZJLc3-U--2-R3v_lt8fKqsgk7c2tOeP_'

In [7]:
# The order of haplotype members does not change the GA4GH Computed Identifier
hap2 = models.Haplotype(members=[alleles["rs7412_38_T"], alleles["rs429358_38_C"]])
ga4gh_identify(hap2)

'ga4gh:VH.ZJLc3-U--2-R3v_lt8fKqsgk7c2tOeP_'
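One way to obtain an order-independent identifier is to sort the member identifiers before hashing. The sketch below illustrates that idea with a truncated SHA-512 digest (first 24 bytes, base64url-encoded). It is a simplified illustration under stated assumptions, not the `ga4gh.core` implementation: the real serialization covers more fields, so `toy_haplotype_digest` (a hypothetical name) will not reproduce the identifier shown above.

```python
import base64
import hashlib

def sha512t24u(blob):
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")

def toy_haplotype_digest(member_ids):
    """Digest a collection of member identifiers independent of their order.

    Assumption for illustration: the canonical form is just the sorted,
    comma-joined identifiers.
    """
    canonical = ",".join(sorted(member_ids)).encode("utf-8")
    return sha512t24u(canonical)

a = "ga4gh:VA.iXjilHZiyCEoD3wVMPMXG3B8BtYfL88H"
b = "ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_"

# Sorting makes both orderings serialize identically, so the digests match.
print(toy_haplotype_digest([a, b]) == toy_haplotype_digest([b, a]))  # True
```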

In [8]:
# Create a Genotype and generate a computed identifier
gen1 = models.Genotype(members=[hap1, hap1])
ga4gh_identify(gen1)

'ga4gh:VG.TLzW4FHpcMUiUi6qARHvUS4VM0r6JOxk'

In [10]:
# Inline members of Haplotypes and Genotypes may be replaced by
# references (identifiers) for more concise representations
hap1r = vr_enref(hap1)
ppo(hap1r)

{
  "completeness": "UNKNOWN",
  "members": [
    "ga4gh:VA.iXjilHZiyCEoD3wVMPMXG3B8BtYfL88H",
    "ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_"
  ],
  "type": "Haplotype"
}


In [11]:
gen1r = vr_enref(gen1)
ppo(gen1r)

{
  "members": [
    "ga4gh:VH.ZJLc3-U--2-R3v_lt8fKqsgk7c2tOeP_",
    "ga4gh:VH.ZJLc3-U--2-R3v_lt8fKqsgk7c2tOeP_"
  ],
  "type": "Genotype"
}
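The referenced forms above keep only member identifiers, while the members themselves live in `object_store`. The toy sketch below shows the general pattern with plain dictionaries. All names (`toy_store`, `toy_identify`, `toy_enref`, `toy_deref`) are hypothetical, and the code makes no claim about how `ga4gh_enref` and `ga4gh_deref` work internally.

```python
toy_store = {}

def toy_identify(member):
    """Stand-in identifier; a real one would be a computed digest."""
    return "toy:" + member["state"]

def toy_enref(obj):
    """Return a copy of obj whose inline members are replaced by identifiers."""
    refs = []
    for member in obj["members"]:
        ident = toy_identify(member)
        toy_store[ident] = member  # keep the full object for later lookup
        refs.append(ident)
    return {**obj, "members": refs}

def toy_deref(obj):
    """Return a copy of obj with identifiers replaced by the stored members."""
    return {**obj, "members": [toy_store[ref] for ref in obj["members"]]}

hap = {
    "type": "Haplotype",
    "members": [
        {"type": "Allele", "state": "C"},
        {"type": "Allele", "state": "T"},
    ],
}
hap_ref = toy_enref(hap)
print(hap_ref["members"])
print(toy_deref(hap_ref) == hap)  # True
```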


# SCRAPS BELOW

## APOE Haplotypes

In [None]:
haplotypes = {
    "APOE-ε1": models.Haplotype(members=[alleles["rs429358_38_C"], alleles["rs7412_38_T"]]),
    "APOE-ε2": models.Haplotype(members=[alleles["rs429358_38_T"], alleles["rs7412_38_T"]]),
    "APOE-ε3": models.Haplotype(members=[alleles["rs429358_38_T"], alleles["rs7412_38_C"]]),
    "APOE-ε4": models.Haplotype(members=[alleles["rs429358_38_C"], alleles["rs7412_38_C"]]),
}

In [None]:
ppo(haplotypes["APOE-ε1"])

In [None]:
ga4gh_identify(haplotypes["APOE-ε1"])

## Genotypes