# Using hgvs
This notebook demonstrates major features of the hgvs package.

In [27]:
import hgvs
hgvs.__version__

'0.5.0a6.dev3+nf998c16a46b3.d20161012'

## Variant I/O

### Initialize the parser

In [26]:
# You only need to do this once per process
import hgvs.parser
hp = hgvsparser = hgvs.parser.Parser()

### Parse a simple variant

In [8]:
v = hp.parse_hgvs_variant("NC_000007.13:g.21726874G>A")

In [13]:
v.ac, v.type

('NC_000007.13', 'g')

In [15]:
v.posedit

PosEdit(pos=21726874, edit=G>A, uncertain=False)

In [16]:
v.posedit.pos

Interval(start=21726874, end=21726874, uncertain=False)

In [18]:
v.posedit.pos.start

SimplePosition(base=21726874, uncertain=False)
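
To make the structure of the parsed object concrete, the following is a minimal standard-library sketch that splits a simple substitution into the same pieces (accession, type, position, edit). It is not how the hgvs parser works; the pattern `SIMPLE_SUB` and the function `split_simple_substitution` are hypothetical names that handle only single-base substitutions.

```python
import re

# Hypothetical pattern: covers only simple substitutions such as
# "NC_000007.13:g.21726874G>A", nothing else in the HGVS grammar.
SIMPLE_SUB = re.compile(
    r"^(?P<ac>[^:]+):(?P<type>[cgmnr])\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def split_simple_substitution(s):
    """Return the accession, type, position, ref, and alt of a simple substitution."""
    m = SIMPLE_SUB.match(s)
    if m is None:
        raise ValueError("not a simple substitution: " + s)
    parts = m.groupdict()
    parts["pos"] = int(parts["pos"])
    return parts


parts = split_simple_substitution("NC_000007.13:g.21726874G>A")
print(parts["ac"], parts["type"], parts["pos"], parts["ref"], parts["alt"])
```

The real parser returns structured objects (`PosEdit`, `Interval`, `SimplePosition`) rather than a flat dictionary, which is what allows it to represent intervals, offsets, and uncertainty.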

### Parsing complex variants

In [20]:
v = hp.parse_hgvs_variant("NM_003777.3:c.13552_*36del57")

In [22]:
v.posedit.pos.start, v.posedit.pos.end

(BaseOffsetPosition(base=13552, offset=0, datum=1, uncertain=False),
 BaseOffsetPosition(base=36, offset=0, datum=2, uncertain=False))

In [23]:
v.posedit.edit

NARefAlt(ref=57, alt=None, uncertain=False)
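
The `datum` values in the output above appear to distinguish positions counted from the CDS start (`datum=1`) from positions counted from the CDS end (`datum=2`, written with `*`). The sketch below shows how such positions could be converted to plain transcript coordinates. The function `c_to_n` and the CDS bounds are assumptions for illustration: the CDS start offset of 100 is invented, and the CDS length of 13572 is inferred from the 57-base deletion spanning c.13552 to c.*36.

```python
CDS_START, CDS_END = 1, 2  # mirrors the datum values shown in the output above


def c_to_n(base, datum, cds_start_i, cds_end_i):
    """Convert a c. position to a 1-based transcript (n.) position.

    cds_start_i and cds_end_i are 0-based, end-exclusive CDS bounds
    on the transcript (hypothetical inputs for this sketch).
    """
    if datum == CDS_START:
        # There is no c.0: c.1 is the first CDS base, c.-1 the base before it.
        return base + cds_start_i + (0 if base > 0 else 1)
    if datum == CDS_END:
        # c.*1 is the first base after the stop codon.
        return base + cds_end_i
    raise ValueError("unknown datum: {}".format(datum))


# Toy bounds: invented CDS start, CDS length inferred from del57.
cds_start_i, cds_end_i = 100, 100 + 13572
start_n = c_to_n(13552, CDS_START, cds_start_i, cds_end_i)
end_n = c_to_n(36, CDS_END, cds_start_i, cds_end_i)
print(end_n - start_n + 1)  # number of bases spanned by c.13552_*36
```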

### Formatting variants
All variant objects can be rendered as HGVS strings by passing them to `str()`, `print()`, or `"".format()`.

In [29]:
str(v)

'NM_003777.3:c.13552_*36del57'

In [30]:
print(v)

NM_003777.3:c.13552_*36del57


In [31]:
"{v} spans the CDS end".format(v=v)

'NM_003777.3:c.13552_*36del57 spans the CDS end'

## Projecting variants between sequences

### Set up a dataprovider

Mapping variants requires exon structures, alignments, CDS bounds, and raw sequence. These are supplied by a data provider from the `hgvs.dataproviders` package. The only data provider shipped with hgvs uses UTA. You may write your own by subclassing the interface defined in `hgvs.dataproviders.interface`.

In [4]:
import hgvs.dataproviders.uta
hdp = hgvs.dataproviders.uta.connect()

### Initialize mapper classes
The VariantMapper class projects variants between two sequence accessions using alignments from a specified source. In order to use it, you must know that two sequences are aligned. VariantMapper isn't demonstrated here.

EasyVariantMapper builds on VariantMapper and handles identifying appropriate sequences. It is configured for a particular genome assembly.

In [None]:
import hgvs.variantmapper
#vm = variantmapper = hgvs.variantmapper.VariantMapper(hdp)
evm37 = hgvs.variantmapper.EasyVariantMapper(hdp, assembly_name='GRCh37')
evm38 = hgvs.variantmapper.EasyVariantMapper(hdp, assembly_name='GRCh38')

### c_to_g
This is the easiest case because there is typically only one alignment between a transcript and the genome. (Exceptions exist for pseudoautosomal regions.)

In [41]:
var_c = hp.parse_hgvs_variant("NM_015120.4:c.35G>C")
var_g = evm37.c_to_g(var_c)
var_g

SequenceVariant(ac=NC_000002.11, type=g, posedit=73613031T>C)

In [42]:
evm38.c_to_g(var_c)

SequenceVariant(ac=NC_000002.12, type=g, posedit=73385903T>C)
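
Conceptually, projecting a transcript position onto the genome means locating the exon that contains it and applying that exon's genomic offset, reversing direction for minus-strand transcripts. The sketch below illustrates this with invented exon coordinates. The names `n_to_g` and `complement` and the exon table are hypothetical; the actual mapper relies on alignments from the data provider and also handles indels, introns, and CDS offsets.

```python
# Toy exon table: (tx_start, tx_end, g_start, g_end), all 0-based, end-exclusive.
# These coordinates are invented for illustration.
EXONS = [
    (0, 50, 1000, 1050),
    (50, 120, 800, 870),
]
STRAND = -1  # transcript lies on the minus strand in this toy example

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def n_to_g(n_pos, exons, strand):
    """Map a 1-based transcript position to a 1-based genomic position."""
    i = n_pos - 1
    for tx_start, tx_end, g_start, g_end in exons:
        if tx_start <= i < tx_end:
            offset = i - tx_start
            if strand == 1:
                return g_start + offset + 1
            # Minus strand: the exon's first transcript base is its last genomic base.
            return g_end - offset
    raise ValueError("position {} is not in any exon".format(n_pos))


def complement(bases):
    """Complement single bases, as needed when reporting minus-strand alleles on the genome."""
    return bases.translate(COMPLEMENT)


print(n_to_g(1, EXONS, STRAND), n_to_g(51, EXONS, STRAND), complement("G"))
```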

### g_to_c
In order to project a genomic variant onto a transcript, you must tell the EasyVariantMapper which transcript to use.

In [43]:
evm37.g_to_c(var_g, "NM_015120.4")

SequenceVariant(ac=NM_015120.4, type=c, posedit=35T>C)

### c_to_p

In [46]:
var_p = evm37.c_to_p(var_c)
str(var_p)

'NP_055935.4:p.(Leu12Pro)'

In [47]:
var_p.posedit.uncertain = False
str(var_p)

'NP_055935.4:p.Leu12Pro'
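
The protein position follows from the coding position by simple arithmetic: c.35 is the second base of codon 12. The sketch below reproduces that calculation and translates a codon before and after the change using the standard genetic code. The reference codon `CTG` is an assumption chosen only because it encodes leucine; the notebook does not show the actual codon, and `codon_index` and `substitute` are hypothetical helper names.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_index(c_pos):
    """Return (1-based codon number, 0-based offset within the codon) for a positive c. position."""
    return (c_pos - 1) // 3 + 1, (c_pos - 1) % 3


def substitute(codon, offset, alt):
    """Replace one base of a codon."""
    return codon[:offset] + alt + codon[offset + 1:]


codon_number, offset = codon_index(35)
ref_codon = "CTG"  # assumed leucine codon, not taken from the real transcript
alt_codon = substitute(ref_codon, offset, "C")
print(codon_number, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon])
```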

## Normalizing variants
In hgvs, normalization means shifting variants 3' (as required by the HGVS nomenclature) and rewriting them in their preferred form. The variant "NM_001166478.1:c.30_31insT" lies in a poly-T run (on the transcript). It should be shifted 3' and is better written as a dup, as shown below:
```
                                         *                       NC_000006.11:g.49917127dupA
   NC_000006.11 g   49917117 > AGAAAGAAAAATAAAACAAAG  > 49917137 
   NC_000006.11 g   49917117 < TCTTTCTTTTTATTTTGTTTC  < 49917137 
                               |||||||||||||||||||||  21= 
 NM_001166478.1 n         41 < TCTTTCTTTTTATTTTGTTTC  <       21 NM_001166478.1:n.35dupT
 NM_001166478.1 c         41 <                        <       21 NM_001166478.1:c.30_31insT
```

In [54]:
import hgvs.normalizer
hn = hgvs.normalizer.Normalizer(hdp)

In [55]:
v = hp.parse_hgvs_variant("NM_001166478.1:c.30_31insT")
str(hn.normalize(v))

'NM_001166478.1:c.35dupT'
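
The two steps of normalization, shifting 3' and rewriting as a dup, can be sketched on a plain string. This is a simplified illustration under stated assumptions, not the algorithm used by `hgvs.normalizer`: the functions `shift_insertion_3prime` and `describe` are hypothetical, they handle insertions only, and they ignore strand, exon boundaries, and CDS numbering.

```python
def shift_insertion_3prime(seq, pos, ins):
    """Roll an insertion rightward (3') through a repeat.

    pos is a 0-based interbase coordinate: ins goes between
    seq[pos - 1] and seq[pos]. Returns the shifted (pos, ins).
    """
    while pos < len(seq) and seq[pos] == ins[0]:
        # Rotating the inserted bases and moving one base right
        # produces exactly the same final sequence.
        ins = ins[1:] + seq[pos]
        pos += 1
    return pos, ins


def describe(seq, pos, ins):
    """Describe the shifted insertion, preferring dup when it copies the preceding bases."""
    pos, ins = shift_insertion_3prime(seq, pos, ins)
    if pos >= len(ins) and seq[pos - len(ins):pos] == ins:
        start, end = pos - len(ins) + 1, pos  # 1-based, inclusive
        span = str(start) if start == end else "{}_{}".format(start, end)
        return "{}dup{}".format(span, ins)
    return "{}_{}ins{}".format(pos, pos + 1, ins)


# Inserting T inside a run of T's: the insertion slides to the end of the run.
print(describe("GATTTTC", 3, "T"))
```

In this toy sequence the insertion written as 3_4insT slides to the last T of the run and is reported as a duplication of position 6.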

## Validating variants
`hgvs.validator.Validator` is a composite of two classes, `hgvs.validator.IntrinsicValidator` and `hgvs.validator.ExtrinsicValidator`. Intrinsic validation evaluates a given variant for *internal* consistency, such as requiring that insertions specify adjacent positions.  Extrinsic validation evaluates a variant using external data, such as ensuring that the reference nucleotide in the variant matches that implied by the reference sequence and position. Validation returns `True` if successful, and raises an exception otherwise. 

In [56]:
import hgvs.validator
hv = hgvs.validator.Validator(hdp)

In [57]:
hv.validate(hp.parse_hgvs_variant("NM_001166478.1:c.30_31insT"))

True
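
The difference between intrinsic and extrinsic validation can be illustrated without a data provider. In the sketch below, the intrinsic check looks only at the variant's own coordinates, while the extrinsic check compares the stated reference base against a sequence. Both functions are hypothetical simplifications that raise `ValueError`, not the checks or exceptions implemented in `hgvs.validator`.

```python
def intrinsic_check_insertion(start, end):
    """Intrinsic: an insertion must name two adjacent positions."""
    if end - start != 1:
        raise ValueError("insertion length must be 1")
    return True


def extrinsic_check_ref(seq, pos, ref):
    """Extrinsic: the stated reference bases must match the sequence at 1-based pos."""
    actual = seq[pos - 1:pos - 1 + len(ref)]
    if actual != ref:
        raise ValueError(
            "reference {} does not match sequence {} at position {}".format(ref, actual, pos)
        )
    return True


print(intrinsic_check_insertion(30, 31))
print(extrinsic_check_ref("GATTACA", 4, "T"))
```

A variant can pass the intrinsic check and still fail the extrinsic one, which is why the two are combined in a single validator.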

In [59]:
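import hgvs.exceptions
try:
    hv.validate(hp.parse_hgvs_variant("NM_001166478.1:c.30_32insT"))
except hgvs.exceptions.HGVSInvalidVariantError as e:
    # Positions 30 and 32 are not adjacent, so intrinsic validation fails
    print("HGVSInvalidVariantError:", e)

HGVSInvalidVariantError: insertion length must be 1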