# ga4gh.vrs.extras

This notebook demonstrates functionality in the vr-python package that builds on VRS but is not formally part of the specification. 

## Data Proxy
VRS implementations need access to sequences and sequence identifiers. Sequences are used during normalization and during conversion to and from other formats. Sequence identifiers are needed to translate identifiers from common forms to digest-based identifiers.

VRS leaves the choice of those data sources to the implementations.  In vr-python, `ga4gh.vrs.dataproxy` provides an abstract base class as a basis for data source adapters.  One source is [SeqRepo](https://github.com/biocommons/biocommons.seqrepo/), which is used below.  (An adapter based on the GA4GH refget specification exists, but is pending necessary changes to the refget interface to provide accession-based lookups.)

SeqRepo: [github](https://github.com/biocommons/biocommons.seqrepo/) | [data snapshots](http://dl.biocommons.org/seqrepo/) | [seqrepo-rest-service @ github](https://github.com/biocommons/seqrepo-rest-service) | [seqrepo-rest-service docker images](https://cloud.docker.com/u/biocommons/repository/docker/biocommons/seqrepo-rest-service)

RefGet: [spec](https://samtools.github.io/hts-specs/refget.html) | [perl server](https://github.com/andrewyatz/refget-server-perl)

In [1]:
from ga4gh.core import sha512t24u, ga4gh_digest, ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import __version__, models, normalize
from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy
from biocommons.seqrepo import SeqRepo

seqrepo_rest_service_url = "https://services.genomicmedlab.org/seqrepo"
dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url)



In [2]:
dp.get_metadata("refseq:NM_000551.3")

{'added': '2016-08-24T05:03:11Z',
 'aliases': ['MD5:215137b1973c1a5afcf86be7d999574a',
  'NCBI:NM_000551.3',
  'refseq:NM_000551.3',
  'SEGUID:T12L0p2X5E8DbnL0+SwI4Wc1S6g',
  'SHA1:4f5d8bd29d97e44f036e72f4f92c08e167354ba8',
  'VMC:GS_v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
  'sha512t24u:v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
  'ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_'],
 'alphabet': 'ACGT',
 'length': 4560}

In [3]:
dp.get_sequence("ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_", start=0, end=51) + "..."

'CCTCGCCTCCGTTACAACGGCCTACGGTGCTGGAGGATCCTTCTGCGCACG...'
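
The metadata above lists `sha512t24u:` and `ga4gh:SQ.` aliases that share a single digest. As a rough sketch of what a data source adapter has to provide, the stdlib-only example below computes a truncated digest of that kind and serves sequences from memory. It assumes the digest is taken over the ASCII sequence. `InMemoryDataProxy`, its `add` method, and the `refseq:EXAMPLE.1` alias are hypothetical names used for illustration only; they are not part of vr-python, and the real adapters do considerably more.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


class InMemoryDataProxy:
    """Hypothetical toy adapter for illustration; NOT the vr-python base class."""

    def __init__(self):
        self._sequences = {}  # ga4gh:SQ identifier -> sequence
        self._aliases = {}    # any known alias -> ga4gh:SQ identifier

    def add(self, sequence, aliases):
        """Store a sequence and return its digest-based identifier."""
        sq_id = "ga4gh:SQ." + sha512t24u(sequence.encode("ascii"))
        self._sequences[sq_id] = sequence
        for alias in (sq_id, *aliases):
            self._aliases[alias] = sq_id
        return sq_id

    def translate_sequence_identifier(self, identifier):
        """Map a common identifier to the digest-based one."""
        return self._aliases[identifier]

    def get_sequence(self, identifier, start=None, end=None):
        """Return a subsequence using interbase (0-based, end-exclusive) coordinates."""
        return self._sequences[self._aliases[identifier]][start:end]


proxy = InMemoryDataProxy()
sq_id = proxy.add("CCTCGCCTCC", aliases=["refseq:EXAMPLE.1"])
print(sq_id)
print(proxy.translate_sequence_identifier("refseq:EXAMPLE.1") == sq_id)
print(proxy.get_sequence("refseq:EXAMPLE.1", start=0, end=5))
```

Keeping every alias, including the digest-based identifier itself, in one lookup table means callers can pass whichever form they have.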

## Format translator
`ga4gh.vrs.extras.translator.Translator` translates various formats into VRS representations.

<div class="alert alert-success">
    <span style="font-size: larger">🚀</span> The examples below use the same variant in four formats: HGVS, Beacon, SPDI, and VCF/gnomAD. Notice that the resulting Allele objects and computed identifiers are identical.
    
By default, `Translator` 1) translates sequence identifiers to ga4gh digest-based identifiers, 2) normalizes alleles, 3) adds a ga4gh identifier. These may be disabled as desired. (However, `ga4gh_identify` requires that all objects use identifiers, including sequence identifiers, in the `ga4gh` namespace.)
</div>

In [4]:
from ga4gh.vrs.extras.translator import Translator
tlr = Translator(data_proxy=dp,
                 translate_sequence_identifiers=True,  # default
                 normalize=True,                       # default
                 identify=True)                        # default

### From/To HGVS

<div class="alert alert-info">
    <span style="font-size: larger">☛</span> The HGVS variant below shows C>T.
    </div>

In [5]:
a = tlr.translate_from("NC_000019.10:g.44908822C>T","hgvs")
a.as_dict()

{'_id': 'ga4gh:VA.CxiA_hvYbkD8Vqwjhx5AYuyul4mtlkpD',
 'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl',
  'interval': {'type': 'SequenceInterval',
   'start': {'type': 'Number', 'value': 44908821},
   'end': {'type': 'Number', 'value': 44908822}}},
 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'T'}}

In [6]:
#tlr.translate_to(a, "hgvs")

"The default PostgreSQL port (5432) is blocked for outbound connections by Binder and potentially by other institutions. "\
"So that users do not have to install UTA themselves, we created a REST data proxy for the to_hgvs endpoint of the Variation Normalizer."

'The default PostgreSQL port (5432) is blocked for outbound connections by Binder and potentially by other institutions. So that users do not have to install UTA themselves, we created a REST data proxy for the to_hgvs endpoint of the Variation Normalizer.'

In [7]:
from ga4gh.vrs.extras.variation_normalizer_rest_dp import VariationNormalizerRESTDataProxy
vnorm = VariationNormalizerRESTDataProxy()
vnorm.to_hgvs(a)

['NC_000019.10:g.44908822C>T']

### From/To SPDI

In [8]:
# SPDI uses 0-based coordinates
a = tlr.translate_from("NC_000019.10:44908821:1:T","spdi")
a.as_dict()

{'_id': 'ga4gh:VA.CxiA_hvYbkD8Vqwjhx5AYuyul4mtlkpD',
 'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl',
  'interval': {'type': 'SequenceInterval',
   'start': {'type': 'Number', 'value': 44908821},
   'end': {'type': 'Number', 'value': 44908822}}},
 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'T'}}

In [9]:
tlr.translate_to(a, "spdi")

['NC_000019.10:44908821:1:T']

In [10]:
a.location.interval.end.value += 1
tlr.translate_to(a, "spdi")

['NC_000019.10:44908821:2:T']

In [11]:
a.state.sequence = ""
tlr.translate_to(a, "spdi")

['NC_000019.10:44908821:2:']
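
The three results above follow from how a SPDI string is laid out: sequence, 0-based position, deletion, insertion. Here is a minimal sketch, assuming the deletion is written as a length (as in the outputs above). The helper name `to_spdi` is hypothetical and is not the vr-python API.

```python
def to_spdi(accession: str, start: int, end: int, alt: str) -> str:
    """Format an interbase interval and its replacement sequence as SPDI."""
    deleted_length = end - start  # interbase coordinates: end is exclusive
    return f"{accession}:{start}:{deleted_length}:{alt}"


ac = "NC_000019.10"
print(to_spdi(ac, 44908821, 44908822, "T"))  # NC_000019.10:44908821:1:T
print(to_spdi(ac, 44908821, 44908823, "T"))  # NC_000019.10:44908821:2:T
print(to_spdi(ac, 44908821, 44908823, ""))   # NC_000019.10:44908821:2:
```

Because VRS intervals and SPDI positions are both 0-based, the start is copied unchanged, and only the deletion length has to be derived.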

### From Beacon (VCF-like)

In [12]:
# from_beacon: Translate from beacon's form
a = tlr.translate_from("19 : 44908822 C > T", "beacon")
a.as_dict()

{'_id': 'ga4gh:VA.CxiA_hvYbkD8Vqwjhx5AYuyul4mtlkpD',
 'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl',
  'interval': {'type': 'SequenceInterval',
   'start': {'type': 'Number', 'value': 44908821},
   'end': {'type': 'Number', 'value': 44908822}}},
 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'T'}}

### From gnomAD-style VCF

In [13]:
a = tlr.translate_from("19-44908822-C-T", "gnomad")   # gnomAD-style expression
a.as_dict()

{'_id': 'ga4gh:VA.CxiA_hvYbkD8Vqwjhx5AYuyul4mtlkpD',
 'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl',
  'interval': {'type': 'SequenceInterval',
   'start': {'type': 'Number', 'value': 44908821},
   'end': {'type': 'Number', 'value': 44908822}}},
 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'T'}}
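
All four inputs above end up with the same interval (start 44908821, end 44908822) and the same replacement base. The sketch below shows the coordinate arithmetic behind that, using deliberately simplified parsers that handle only expressions shaped like the ones in this notebook. The function names are hypothetical, and the sketch does not resolve a chromosome name such as `19` to a sequence accession, which `Translator` does.

```python
import re


def from_hgvs_g_substitution(expr):
    """Parse 'AC:g.POSREF>ALT' (1-based position, single-base substitution)."""
    m = re.fullmatch(r"([^:]+):g\.(\d+)([ACGT])>([ACGT])", expr)
    pos = int(m.group(2))
    return m.group(1), pos - 1, pos, m.group(4)


def from_spdi(expr):
    """Parse 'AC:POS:DEL:INS' (0-based position; DEL is a length or a sequence)."""
    ac, pos, deleted, inserted = expr.split(":")
    start = int(pos)
    del_len = int(deleted) if deleted.isdigit() else len(deleted)
    return ac, start, start + del_len, inserted


def from_gnomad(expr):
    """Parse 'CHROM-POS-REF-ALT' (1-based position)."""
    chrom, pos, ref, alt = expr.split("-")
    start = int(pos) - 1
    return chrom, start, start + len(ref), alt


def from_beacon(expr):
    """Parse 'CHROM : POS REF > ALT' (1-based position)."""
    m = re.fullmatch(r"\s*(\w+)\s*:\s*(\d+)\s*([ACGT]+)\s*>\s*([ACGT]+)\s*", expr)
    start = int(m.group(2)) - 1
    return m.group(1), start, start + len(m.group(3)), m.group(4)


results = [
    from_hgvs_g_substitution("NC_000019.10:g.44908822C>T"),
    from_spdi("NC_000019.10:44908821:1:T"),
    from_beacon("19 : 44908822 C > T"),
    from_gnomad("19-44908822-C-T"),
]
for name, start, end, alt in results:
    print(name, start, end, alt)
```

Only SPDI is already 0-based; the other three forms give a 1-based position, so their start is the position minus one.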

## Advanced Examples

NM_000551.3 starts with `CCTCGCCTCC`. So, `NM_000551.3:n.5_6insC` inserts a C at the start of an existing run of two C residues.

In [14]:
from IPython.display import HTML, display
import tabulate

hgvs_expr = "NM_000551.3:n.5_6insC"

# Translator with default behaviors disabled
tlr2 = Translator(data_proxy=dp,
                  translate_sequence_identifiers=False,
                  normalize=False,
                  identify=False)

### translate_sequence_identifiers

In [15]:
header = "translate_sequence_identifiers= sequence_id".split()
table = [header]
for tsi in (False, True):
    tlr2.translate_sequence_identifiers = tsi
    a = tlr2.translate_from(hgvs_expr, "hgvs")
    row = [tlr2.translate_sequence_identifiers,
           a.location.sequence_id._value]
    table += [row]
display(HTML(tabulate.tabulate(table, tablefmt='html')))

| translate_sequence_identifiers= | sequence_id |
|---|---|
| False | refseq:NM_000551.3 |
| True | ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_ |


### normalize
VRS uses [fully-justified normalization](https://vr-spec.readthedocs.io/en/1.0/impl-guide/normalization.html). In this case, the left-aligned insertion (`n.5_6insC`) is renormalized as a replacement of the two C residues with three C residues at interbase coordinates [5,7].
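
To make the idea concrete, here is a simplified sketch of fully-justified normalization on plain strings: trim shared flanking bases, then, for a pure insertion or deletion, extend the interval over every position the change could slide to. The function name `fully_justify` is hypothetical, and this is not the vr-python implementation. It works on an in-memory string and ignores ambiguity codes and other edge cases.

```python
def fully_justify(sequence, start, end, alt):
    """Return (start, end, alt) after a simplified fully-justified normalization.

    Coordinates are interbase; `alt` replaces sequence[start:end].
    """
    ref = sequence[start:end]
    if ref == alt:
        return start, end, alt  # reference agreement: nothing to normalize

    # 1. Trim bases shared by ref and alt, first from the left, then the right.
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1

    # 2. If both alleles are still non-empty, this is a substitution; stop here.
    if ref and alt:
        return start, end, alt

    # 3. Pure insertion or deletion: count how far it can slide left and right,
    #    comparing flanking bases against the inserted/deleted unit cyclically.
    unit = ref or alt
    n = len(unit)
    left = 0
    while start - left > 0 and sequence[start - left - 1] == unit[(n - 1 - left) % n]:
        left += 1
    right = 0
    while end + right < len(sequence) and sequence[end + right] == unit[right % n]:
        right += 1

    # 4. Expand the interval and the alternate allele over the whole region.
    new_start, new_end = start - left, end + right
    new_alt = sequence[new_start:start] + alt + sequence[end:new_end]
    return new_start, new_end, new_alt


# First ten bases of NM_000551.3; n.5_6insC is an insertion at interbase 5.
print(fully_justify("CCTCGCCTCC", 5, 5, "C"))
```

For the insertion discussed above, this yields the interval [5,7] with `CCC` as the alternate sequence. The same insertion written at interbase position 7 (the right-aligned form) normalizes to the same result, which is the point of full justification.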

In [16]:
tlr2.translate_sequence_identifiers = True

header = "normalize= sequence_id interval alt hgvs".split()
table = [header]
for normalize in (False, True):
    tlr2.normalize = normalize
    a = tlr2.translate_from(hgvs_expr, "hgvs")
    row = [tlr2.normalize,
           a.location.sequence_id,
           f"{a.location.interval.start.value},{a.location.interval.end.value}",
           a.state.sequence,
           #tlr2.translate_to(a, 'hgvs')[0]
           vnorm.to_hgvs(a)[0]
          ]
    table += [row]
display(HTML(tabulate.tabulate(table, tablefmt='html')))

| normalize= | sequence_id | interval | alt | hgvs |
|---|---|---|---|---|
| False | ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_ | 5,5 | C | NM_000551.3:n.7dup |
| True | ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_ | 5,7 | CCC | NM_000551.3:n.7dup |
