# Modeling Genotypes and PGx Star Alleles with VRS

## Imports and Configuration

**NOTE:** A dynamic, web-based version of VRS-Python can be loaded through [mybinder.org](https://mybinder.org/v2/gh/ga4gh/vrs-python/pgx), where this notebook can be run without any local installation. Steps:
1. Navigate to the [VRS-Python mybinder.org build](https://mybinder.org/v2/gh/ga4gh/vrs-python/pgx)
2. Once the Binder build is completed, navigate to notebooks/PGx.ipynb
3. Under kernel, select `VRS_kernel`

In [1]:
import csv
import pathlib
import re
from copy import deepcopy

import yaml
from jsonschema import validate

from ga4gh.core import ga4gh_identify
from ga4gh.vrs import models, vrs_deref, vrs_enref
from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy
from ga4gh.vrs.extras.translator import Translator


def ppo(o, indent=3):
    """Pretty-print a VRS object as YAML."""
    print(yaml.dump(o.as_dict(), sort_keys=True, indent=indent))

SCHEMA_DIR = pathlib.Path.cwd() / 'schemas'
SCHEMA_URI_ROOT = "file://" + str(SCHEMA_DIR)



In [2]:
seqrepo_rest_service_url = "https://services.genomicmedlab.org/seqrepo"
dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url)

tlr = Translator(data_proxy=dp)

## Representing the PharmGKB CYP2C19 Star Alleles using VRS

### CYP2C19 Allele Definitions
The CYP2C19*1 PGx Allele is defined here: https://www.pharmgkb.org/haplotype/PA165980634.
This page also contains an [Allele Definition Table](https://api.pharmgkb.org/v1/download/file/attachment/CYP2C19_allele_definition_table.xlsx) for all major Star Alleles in the CYP2C19 gene. For convenience, this file has been converted to CSV and added to this repository as `/notebooks/data/CYP2C19_allele_definition_table.csv` for use in the following exercises.

In [3]:
# Load the allele definition table
with open('data/CYP2C19_allele_definition_table.csv', 'r') as f:
    reader = csv.reader(f)
    records = list(reader)

In [4]:
haplotype_alt_genomic_hgvs = records[3][1:-1]
haplotype_alt_definitions = records[7:]

compound_code_star_allele = set()
compound_code_star_allele_index = list()

# Find all records with compound codes
for i, definition_record in enumerate(haplotype_alt_definitions):
    compound_code_count = 0
    for code in definition_record[1:-1]:
        if code not in ['A', 'C', 'T', 'G', '']:
            compound_code_count += 1
    if compound_code_count == 0:
        continue
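    elif compound_code_count == 1:
        compound_code_star_allele.add("CYP2C19" + definition_record[0])
        # Prepend so the indices are stored in descending order; popping them
        # later then never shifts an index that is still pending
        compound_code_star_allele_index.insert(0, i)
    else:
        # Definitions with more than one compound code are not handled
        raise ValueError("Multiple compound codes in definition " + definition_record[0])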

# Defining lookups for compound codes as described here:
# https://www.bioinformatics.org/sms/iupac.html
IUPAC_compound_codes = {
    'R':['A','G'],
    'Y':['C','T'],
    'M':['A','C'],
}

# Replace each compound-code record with two records, one per possible base
for index in compound_code_star_allele_index:
    old_record = haplotype_alt_definitions.pop(index)
    for i, code in enumerate(old_record):
        if i == 0:
            continue
        if code in IUPAC_compound_codes:
            new_record_1 = deepcopy(old_record)
            new_record_2 = deepcopy(old_record)
            new_record_1[i], new_record_2[i] = IUPAC_compound_codes[code]
            break
    else:
        # The flagged cell is not one of the codes defined above
        raise ValueError("Unsupported code in definition " + old_record[0])
    haplotype_alt_definitions.extend([new_record_1, new_record_2])
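
The cell above handles exactly one compound code per definition and raises otherwise. As a minimal sketch of a more general expansion, assume each record is a list whose first cell is the allele name and whose remaining cells are bases, empty strings, or two-base IUPAC codes (the trailing notes column of the real table is left out). The hypothetical helper `expand_compound_codes` then enumerates every unambiguous combination. The K, S, and W entries are standard IUPAC codes added for illustration; they are not in the notebook's lookup.

```python
from itertools import product

# Two-base IUPAC ambiguity codes. R, Y, and M match the notebook's lookup;
# K, S, and W are added here for illustration only.
IUPAC_TWO_BASE_CODES = {
    "R": "AG", "Y": "CT", "M": "AC",
    "K": "GT", "S": "CG", "W": "AT",
}


def expand_compound_codes(record):
    """Return every unambiguous version of a definition record.

    record[0] is the allele name; each remaining cell is a base, an empty
    string (meaning "reference"), or a two-base IUPAC code.
    """
    name, cells = record[0], record[1:]
    # One choice per ordinary cell, two choices per compound code.
    choices = [list(IUPAC_TWO_BASE_CODES.get(cell, [cell])) for cell in cells]
    return [[name, *combo] for combo in product(*choices)]


one_code = expand_compound_codes(["*X", "", "R", "T"])
two_codes = expand_compound_codes(["*Y", "R", "Y"])
print(one_code)
print(len(two_codes))
```

A record with n compound codes expands to 2**n records, so this approach is only practical while n stays small.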

In [5]:
haplotype_alt_genomic_vrs = [tlr.translate_from(':'.join(['NC_000010.11', x]), 'hgvs') for x in haplotype_alt_genomic_hgvs]

In [6]:
vrs_object_store = dict()
star_allele = dict()
compatible_star_alleles = dict()

def get_ref_allele(location):
    return dp.get_sequence(location.sequence_id, start=location.interval.start.value, end=location.interval.end.value)

def create_vrs_haplotype_from_genomic_definitions(genomic_vrs_defs, haplotype_alt_defs):
    for haplotype_alt_def in haplotype_alt_defs:
        allele_name = "CYP2C19" + haplotype_alt_def[0]
        special_definition = haplotype_alt_def[-1]
        allele_alts = haplotype_alt_def[1:-1]
        members = list()
        if special_definition:
            continue
        for base_allele, alt in zip(genomic_vrs_defs, allele_alts):
            if alt == '':
                alt = get_ref_allele(base_allele.location)
            if alt not in ['A','C','T','G']:
                # If this error occurs, you need to address demuxing compound codes
                raise ValueError
            new_allele = models.Allele(location=base_allele.location, 
                                       state=models.LiteralSequenceExpression(sequence=alt))
            refd_allele = vrs_enref(new_allele, object_store=vrs_object_store)
            allele_id = ga4gh_identify(refd_allele)
            vrs_object_store[allele_id] = refd_allele
            members.append(refd_allele)
            compatible_set = compatible_star_alleles.get(allele_id, set())
            compatible_set.add(allele_name)
            compatible_star_alleles[allele_id] = compatible_set
        haplotype = models.Haplotype(members=members)
        refd_haplotype = vrs_enref(haplotype, object_store=vrs_object_store)
        haplotype_id = ga4gh_identify(refd_haplotype)
        vrs_object_store[haplotype_id] = refd_haplotype
        if allele_name in compound_code_star_allele:
            haplotype_set = star_allele.get(allele_name, set())
            haplotype_set.add(haplotype_id)
            star_allele[allele_name] = haplotype_set
        else:
            star_allele[allele_name] = haplotype_id

create_vrs_haplotype_from_genomic_definitions(haplotype_alt_genomic_vrs, haplotype_alt_definitions)
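
The `compatible_star_alleles` dictionary built above is an inverted index: for each Allele identifier, it holds the set of star alleles whose definition includes that Allele. A minimal sketch of the same pattern with `collections.defaultdict` follows. The identifiers are made up rather than real VRS digests, and `build_compatibility_index` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical toy definitions: star allele name -> Allele identifiers it contains.
toy_haplotypes = {
    "*A": ["allele-1", "allele-2"],
    "*B": ["allele-1", "allele-3"],
}


def build_compatibility_index(haplotypes):
    """Map each Allele identifier to the star alleles whose definition contains it."""
    index = defaultdict(set)
    for star_name, allele_ids in haplotypes.items():
        for allele_id in allele_ids:
            index[allele_id].add(star_name)
    # Return a plain dict so that later lookups do not silently create keys.
    return dict(index)


toy_index = build_compatibility_index(toy_haplotypes)
print(sorted(toy_index["allele-1"]))
```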

In [7]:
star_allele

{'CYP2C19*38': 'ga4gh:VH.FkhAIRV9DT8KUM6ZRpzECQ2o6i6pqHsX',
 'CYP2C19*1': 'ga4gh:VH.Py9ikChcoN8lU4bGxcwilbMcafqMrR4y',
 'CYP2C19*3': 'ga4gh:VH.eaxIbw6hCuiI7A8ElpbQZk2XbPEQ5pAz',
 'CYP2C19*5': 'ga4gh:VH.pmYj9lbzm_QX5CFP3kBXxFsAMqBQPFrF',
 'CYP2C19*6': 'ga4gh:VH.9gYWdH3ubsssAzjOJ7h3DwLZEP0-eZY3',
 'CYP2C19*7': 'ga4gh:VH.ZGg35ABWAaQlU6-4KXcmYyt-pV4tmr2k',
 'CYP2C19*8': 'ga4gh:VH.vCab1KWGn52Zo1xiPGQRMNLqt1i71iyN',
 'CYP2C19*9': 'ga4gh:VH.pGSS8ikrJOi3qfmV1HPJI2PNlygBWnsh',
 'CYP2C19*10': 'ga4gh:VH.NPE64q_K739eslbVtny0Zveiz1o21UCf',
 'CYP2C19*11': 'ga4gh:VH.ozsW9ivqBRZLxcfwTthre4MK4WWPos5O',
 'CYP2C19*12': 'ga4gh:VH.fV3C_E5bLTXnVfWTMDq7T7vjxTuVeK52',
 'CYP2C19*13': 'ga4gh:VH.WVv7Se8DTvbyKSBlvc-k5oMn2to5C68z',
 'CYP2C19*14': 'ga4gh:VH.EpPPyos_2lkzbpuK0LyytaFJDjVZ7ff8',
 'CYP2C19*15': 'ga4gh:VH.dZAsiqTAoYsIk7TLySp5y1CX1DR_3lwF',
 'CYP2C19*16': 'ga4gh:VH.OQN9cew2VGwhziABQJZeFdVE2pX-XaFd',
 'CYP2C19*17': 'ga4gh:VH.82M04IDxj6s2nwH9Nxd15uSVcL5Lcoq5',
 'CYP2C19*18': 'ga4gh:VH.lqeDsM6iXU5v2cHxxE8LLL

## Retrieving Star Alleles as Haplotypes

In [8]:
## CYP2C19*1

vrs_cyp2c19_1_id = star_allele["CYP2C19*1"]
vrs_cyp2c19_1_id

'ga4gh:VH.Py9ikChcoN8lU4bGxcwilbMcafqMrR4y'

In [9]:
vrs_cyp2c19_1_obj = vrs_object_store[vrs_cyp2c19_1_id]
ppo(vrs_cyp2c19_1_obj)

members:
- ga4gh:VA.MyADCvK6DOi3cnxksgB14WljIrg-ZI2z
- ga4gh:VA.1uIry_sfKDasqUQFDWlN4i8y3x5NCcoX
- ga4gh:VA.MegJUns5wJOG7BXEA70Ohq7V561OcoB7
- ga4gh:VA.KQ8c6bvw3XUFE0edAVwgNeCl1PvNmZr9
- ga4gh:VA.xjFHV26pLxfGzSPyCtIIvwTNeil0UqGr
- ga4gh:VA.dKgTBytJeBXy1k34yb0zpAQHAkFHfvRD
- ga4gh:VA.ne0L6WXxQfjSP-x114EJLUdNEGjW9zTh
- ga4gh:VA.dijsDtiZBZWwSrnwgJ20c0h74o5TTxsX
- ga4gh:VA.VtfREnnA9R-d8lbEAlPEo99j8BbsRaBl
- ga4gh:VA.PgsuPCl2O0s8UcWK1qHcD_u4YoB96xO1
- ga4gh:VA.zoLzyDHkRmBjOplHBS-Nd1SDOFPYFlXR
- ga4gh:VA.WR3WlywNms4jdQbPDrYkwx5lwfxMrUa9
- ga4gh:VA.EaeUSsUs8F0rxWCCjhKyUtVDYvN9_HKS
- ga4gh:VA.7ijuQlYS0BzAFXE-Q-enNTIIQT9cEt0k
- ga4gh:VA.w4kx8a9U3NZB9ewOU4KliYgnqO1hrLvQ
- ga4gh:VA.i2VXJYVOGplFtwp0wESgMHjRVoRytIAP
- ga4gh:VA.V316joLwuiSzz5gaYS7BpodgvqD0guxC
- ga4gh:VA.0GDQZCHWdndUYcHGxcBmG7TMDv5PmbKb
- ga4gh:VA.Z2LLoarnxwdQ00DqZQp2WnFxOhzTeHuR
- ga4gh:VA.D6e9KM2faacJPWgbdR1pEhXVrm0lzrxe
- ga4gh:VA.V3b7Qcfj-KQC1Yb4dhDyr-TUCuALSTbZ
- ga4gh:VA.b34egYFmn-7mHEqgdrzZoU0RqXhayDD-
- ga4gh:VA.AyFzahm7qk1w

In [10]:
# ppo(vrs_deref(vrs_cyp2c19_1_obj, vrs_object_store))

## Constructing a PGx Diplotype

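A diplotype can be modeled with the VRS [Genotype](https://vrs.ga4gh.org/en/latest/terms_and_model.html#genotype) class: each member pairs a variation (here, a star allele Haplotype) with a copy count.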

In [11]:
## One *1 allele
vrs_star_1 = vrs_deref(vrs_object_store[star_allele["CYP2C19*1"]], vrs_object_store)
genotype_member_1 = models.GenotypeMember(variation=vrs_star_1, count=models.Number(value=1))

## One *17 allele
vrs_star_17 = vrs_deref(vrs_object_store[star_allele["CYP2C19*17"]], vrs_object_store)
genotype_member_2 = models.GenotypeMember(variation=vrs_star_17, count=models.Number(value=1))

## Genotype
genotype = models.Genotype(members=[genotype_member_1, genotype_member_2], count=models.Number(value=2))
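
In the cell above, the two member counts (1 and 1) add up to the Genotype count of 2. The sketch below checks that relationship on plain Python objects. It assumes the rule that the total count must be at least the sum of the member counts, with any surplus meaning copies that are not described explicitly; consult the Genotype documentation linked above for the authoritative wording. `Member` and `check_genotype_counts` are hypothetical stand-ins, not VRS-Python classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for a GenotypeMember: a label and a copy count."""
    variation: str
    count: int


def check_genotype_counts(members, total_count):
    """Return how many copies are not explicitly described by the members.

    Raises ValueError when the members describe more copies than the total.
    """
    described = sum(member.count for member in members)
    if described > total_count:
        raise ValueError(
            f"members describe {described} copies but the total is {total_count}"
        )
    return total_count - described


diplotype = [Member("CYP2C19*1", 1), Member("CYP2C19*17", 1)]
unaccounted = check_genotype_counts(diplotype, total_count=2)
print(unaccounted)
```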

In [12]:
# A representation of this Genotype in YAML
# ppo(genotype)

## Searching for compatible PGx Star Alleles

In [13]:
observed_allele_1 = tlr.translate_from(':'.join(['NC_000010.11', 'g.94762760A>C']), 'hgvs')
observed_allele_2 = tlr.translate_from(':'.join(['NC_000010.11', 'g.94775423A>C']), 'hgvs')

In [14]:
observed_1_compatible = compatible_star_alleles[ga4gh_identify(observed_allele_1)]
observed_1_compatible

{'CYP2C19*15', 'CYP2C19*28', 'CYP2C19*35', 'CYP2C19*39'}

In [15]:
observed_2_compatible = compatible_star_alleles[ga4gh_identify(observed_allele_2)]
observed_1_compatible & observed_2_compatible

{'CYP2C19*39'}
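
The dictionary lookups above raise `KeyError` when an observed Allele appears in no definition. A sketch of a more forgiving query, assuming an index shaped like `compatible_star_alleles`, intersects the candidate sets for any number of observations and returns an empty set when nothing is compatible. The identifiers are toy values and `compatible_with_all` is a hypothetical helper.

```python
def compatible_with_all(index, observed_ids):
    """Return the star alleles whose definitions contain every observed Allele."""
    # An identifier missing from the index is compatible with nothing.
    candidate_sets = [index.get(allele_id, set()) for allele_id in observed_ids]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


# Toy index: Allele identifier -> compatible star alleles.
toy_index = {
    "allele-1": {"*15", "*28", "*39"},
    "allele-2": {"*39", "*2"},
}

both = compatible_with_all(toy_index, ["allele-1", "allele-2"])
unknown = compatible_with_all(toy_index, ["allele-1", "allele-9"])
print(both, unknown)
```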

## An example schema for extending Star Allele Haplotypes with definitive Sequence Locations

First, we define a JSON Schema document that extends Haplotypes with definitive regions to describe Star Alleles.

In [16]:
star_allele_schema = {
    'type': 'object',
    'description': 'A representation of the Molecular Variation and '
                   'definitive regions that constitute a PGx Star Allele',
    'properties': {
        'type': {
            'const': 'StarAllele',
            'type': 'string'
        },
        'variation': {
            '$ref': SCHEMA_URI_ROOT + '/vrs.json#/definitions/MolecularVariation'
        },
        'definitive_regions': {
            'type': 'array',
            'items': { '$ref': SCHEMA_URI_ROOT + '/vrs.json#/definitions/SequenceLocation' }
        }
    }
    
}

A list of definitive sites was extracted from the CYP2C19 Allele Definition Table:
https://files.cpicpgx.org/data/report/current/allele_definition/CYP2C19_allele_definition_table.xlsx

In [17]:
variant_sites = "g.94761900C>T	g.94762706A>G	g.94762712C>T	g.94762715T>C	g.94762755T>C	g.94762760A>C	g.94762788A>T	g.94762856A>G	g.94775106C>T	g.94775121C>T	g.94775160G>C	g.94775185A>G	g.94775367A>G	g.94775416T>C	g.94775423A>C	g.94775453G>A	g.94775489G>A	g.94775507G>A	g.94780574G>C	g.94780579G>A	g.94780653G>A	g.94781858C>T	g.94781859G>A	g.94781944G>A	g.94781999T>A	g.9486561G>A	g.94866A>G	g.94842879G>A	g.94842995G>A	g.94849995C>T	g.94852738C>T	g.94852765C>T	g.94852785C>G	g.94852914A>C".split()

In [18]:
site_re = re.compile(r'g\.(\d+)\w>\w')

In [19]:
site_positions = [int(site_re.match(x).groups()[0]) for x in variant_sites] 

These sites were used to build VRS [Sequence Location](https://vrs.ga4gh.org/en/stable/terms_and_model.html#sequencelocation) objects.

In [20]:
site_seqlocs = list()
for position in site_positions:
    interval = models.SequenceInterval(
        start=models.Number(value=position-1), end=models.Number(value=position))
    seqloc = models.SequenceLocation(
        sequence_id='ga4gh:SQ.ss8r_wB0-b9r44TQTMmVTI92884QvBiB', interval=interval)
    site_seqlocs.append(seqloc.as_dict())
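
HGVS `g.` positions are 1-based, while VRS sequence locations use interbase (0-based, half-open) coordinates, which is why the cell above uses `position-1` as the interval start. A minimal sketch of that conversion on plain dictionaries follows. It uses a stricter pattern than the notebook's, and `interbase_interval` is a hypothetical helper, not part of VRS-Python.

```python
import re

# Matches a genomic single-base substitution such as "g.94762760A>C".
SUBSTITUTION_RE = re.compile(r"g\.(\d+)[ACGT]>[ACGT]")


def interbase_interval(hgvs_g):
    """Convert a 1-based HGVS substitution position to a 0-based, half-open interval."""
    match = SUBSTITUTION_RE.fullmatch(hgvs_g)
    if match is None:
        raise ValueError(f"not a simple g. substitution: {hgvs_g!r}")
    position = int(match.group(1))
    # The single base at 1-based position p spans interbase coordinates [p - 1, p).
    return {"start": position - 1, "end": position}


interval = interbase_interval("g.94762760A>C")
print(interval)
```

Using `fullmatch` rejects expressions that only begin like a substitution, so deletions or insertions are not silently treated as single-base sites.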

In [21]:
hgvs_expression = "NC_000010.11:g.94842866A>G"
h1_allele1 = tlr.translate_from(hgvs_expression,'hgvs')

This was used to build a domain-specific message structure for CYP2C19 *1:

In [22]:
msg = {
    'type': 'StarAllele',
    'variation': h1_allele1.as_dict(),
    'definitive_regions': site_seqlocs  # key matches the schema property name
}

# this object can be inspected with the following:
# print(yaml.dump(msg, sort_keys=True, indent=3))

Finally, we demonstrate that the defined Star Allele validates against the schema.

In [23]:
assert validate(msg, star_allele_schema) is None
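
A passing validation is a weaker guarantee than it may appear. JSON Schema ignores properties that a schema does not list unless `additionalProperties` is set to false, and it does not demand listed properties unless they are named under `required`, so a message with a misspelled key would still validate against a schema like the one above. The sketch below is a plain-Python cross-check of key names that does not depend on `jsonschema`; `unexpected_keys` and `missing_keys` are hypothetical helpers, and the schema and message are toy values.

```python
def unexpected_keys(message, schema):
    """Keys present in the message that the schema's 'properties' does not declare."""
    return sorted(set(message) - set(schema.get("properties", {})))


def missing_keys(message, schema):
    """Keys declared in the schema's 'properties' that the message lacks."""
    return sorted(set(schema.get("properties", {})) - set(message))


# Toy schema and message with a deliberately mismatched key name.
toy_schema = {
    "type": "object",
    "properties": {"type": {}, "variation": {}, "definitive_regions": {}},
}
toy_message = {"type": "StarAllele", "variation": {}, "definitive_sites": []}

extra = unexpected_keys(toy_message, toy_schema)
absent = missing_keys(toy_message, toy_schema)
print(extra, absent)
```

Adding `'required'` and `'additionalProperties': False` to the schema itself would let the validator enforce the same constraints.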