# Development Notebook

This notebook demonstrates new features in the develop branch of vr-spec. The primary new features are Repeats and Abundance, along with secondary classes that support them in representing common variation that involves duplicated sequence.

The corresponding schema currently exists only in the vr-spec develop branch. Running this notebook also requires an up-to-date vrs-python package.

In [1]:
from ga4gh.core import ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import models
from nbsupport import ppo, translate_sequence_identifier

---
# Specific Examples

The following examples demonstrate the above classes.

Notes:
* Except for GeneLocation, gene names are ignored
* For HGVS ISCN notation, only the cytoband location is used. Locations using precise sequence coordinates are demonstrated in other examples.

## VCV000528890.1 NC_000001.11:g.(?\_218346682)\_(218441382\_?)del

In [4]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NC_000001.11"),
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=218346681, end=218441382),
            outer=models.SimpleInterval(start=None, end=None))),
    state=models.SequenceState(sequence=''))
ppo(a)

{
  "location": {
    "interval": {
      "inner": {
        "end": 218441382,
        "start": 218346681,
        "type": "SimpleInterval"
      },
      "outer": {
        "type": "SimpleInterval"
      },
      "type": "NestedInterval"
    },
    "sequence_id": "ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "",
    "type": "SequenceState"
  },
  "type": "Allele"
}
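The inner interval in the cell above follows from the HGVS expression in the heading. The values are consistent with interbase coordinates: a 1-based HGVS start position p becomes `start = p - 1`, the end position is unchanged, and `?` bounds become `None`. The following is a minimal sketch of that conversion. `parse_uncertain_range` is a hypothetical helper, not part of vrs-python, and its regular expression is an assumption that covers only the `(?_start)_(end_?)` pattern used in this notebook.

```python
import re

# Hypothetical helper (not part of vrs-python). The pattern is an assumption
# that handles only expressions shaped like ACC:g.(?_START)_(END_?)del|dup.
_UNCERTAIN_RANGE = re.compile(
    r"^(?P<accession>[^:]+):g\."
    r"\((?P<outer_start>\?|\d+)_(?P<inner_start>\d+)\)_"
    r"\((?P<inner_end>\d+)_(?P<outer_end>\?|\d+)\)"
    r"(?P<kind>del|dup)$"
)


def parse_uncertain_range(hgvs):
    """Split an uncertain-range HGVS expression into interbase inner/outer bounds."""
    m = _UNCERTAIN_RANGE.match(hgvs)
    if m is None:
        raise ValueError(f"unsupported expression: {hgvs!r}")

    def to_int(text):
        # "?" marks an unknown bound; represent it as None.
        return None if text == "?" else int(text)

    outer_start = to_int(m["outer_start"])
    return {
        "accession": m["accession"],
        "kind": m["kind"],
        # 1-based inclusive start -> interbase start is one less; end is unchanged.
        "inner": {"start": int(m["inner_start"]) - 1, "end": int(m["inner_end"])},
        "outer": {
            "start": None if outer_start is None else outer_start - 1,
            "end": to_int(m["outer_end"]),
        },
    }


parsed = parse_uncertain_range("NC_000001.11:g.(?_218346682)_(218441382_?)del")
print(parsed["inner"])  # {'start': 218346681, 'end': 218441382}
print(parsed["outer"])  # {'start': None, 'end': None}
```

The resulting `inner` and `outer` dictionaries line up with the `SimpleInterval` arguments passed to `NestedInterval` in the cell above.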


## VCV000665644.1 NC_000001.10:g.(?\_15764951)\_(15765010\_?)dup

In [5]:
a = models.Allele(
    location=models.SequenceLocation(
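        # NOTE: the heading gives this variant on NC_000001.10, whereas this cell
        # resolves NC_000001.11; the output below reflects NC_000001.11.
        sequence_id=translate_sequence_identifier("NC_000001.11"),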
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=15764950, end=15765010),
            outer=models.SimpleInterval(start=None, end=None))),
    state=models.RepeatState(copies={"min":2, "max": 2}))
a.as_dict()

{'location': {'interval': {'inner': {'end': 15765010,
    'start': 15764950,
    'type': 'SimpleInterval'},
   'outer': {'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
  'type': 'SequenceLocation'},
 'state': {'copies': {'max': 2, 'min': 2}, 'type': 'RepeatState'},
 'type': 'Allele'}

## VCV000662440.1 NM_003000.2(SDHB):c.656_707dup (p.Pro237_Phe238insAspTer)

## VCV000007621.2 NM_000214.3(JAG1):c.2091_2095GAAAG[1] (p.Gly699fs)
This variant is written incorrectly. The repeat unit GAAAG should be left-shuffled to GGAAA, which makes the left-shifted version c.2090_2094GGAAA.

```
 c.2081     2091     2101     2111
 n.2551     2561     2571     2581
      |        |        |        |
      AAATGGGTGGAAAGGAAAGACCTGCCAC
```

Converting a repeat to a SequenceState requires examining the downstream sequence for copies of the repeat unit, and inserting or deleting copies as necessary. For example, if this sequence contained five GAAAG repeats and the variant specified `[1]`, then four repeats would be deleted, which is evident only from the sequence context.
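The left-shift and downstream-scan steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not vrs-python code: the three function names are hypothetical, offsets are 0-based positions within the excerpt shown in the diagram, and only exact tandem copies of the unit are counted.

```python
def left_shift_repeat(seq, start, unit_len):
    """Roll a repeat unit leftward while the rotated unit still matches.

    Moving the window one base to the left is valid when the base that enters
    on the left equals the base that leaves on the right.
    """
    while start > 0 and seq[start - 1] == seq[start + unit_len - 1]:
        start -= 1
    return start, seq[start:start + unit_len]


def count_tandem_copies(seq, start, unit):
    """Count consecutive exact copies of `unit` beginning at `start`."""
    copies = 0
    pos = start
    while seq.startswith(unit, pos):
        copies += 1
        pos += len(unit)
    return copies


def repeat_to_replacement(seq, start, unit, copies):
    """Return (start, end, replacement) for a repeat with `copies` units.

    The interval spans every reference copy found downstream of `start`, and
    the replacement is the unit repeated the requested number of times.
    """
    ref_copies = count_tandem_copies(seq, start, unit)
    end = start + ref_copies * len(unit)
    return start, end, unit * copies


# Sequence excerpt from the diagram above.
context = "AAATGGGTGGAAAGGAAAGACCTGCCAC"
as_written = context.find("GAAAG")  # offset of the unit as written in the name

start, unit = left_shift_repeat(context, as_written, 5)
print(start, unit)                                        # 8 GGAAA
print(count_tandem_copies(context, start, unit))          # 2
print(repeat_to_replacement(context, start, unit, 1))     # (8, 18, 'GGAAA')
```

In this excerpt the reference holds two copies of the unit, so `[1]` amounts to deleting one copy. With five reference copies the same call would span all five and delete four.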

In [None]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NM_000214.3"),
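        # c.2091_2095 corresponds to n.2561_2565 on the transcript (offset of 470,
        # per the diagram above), so the interbase interval is 2560..2565.
        interval=models.SimpleInterval(start=2560, end=2565)),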
    state=models.RepeatState(copies={
        "min": 1,
        "max": 1}))
a.as_dict()

## VCV000144802.1 GRCh38/hg38 1p36.32(chr1:2651321-2701929)x1
⇒ Make it possible to represent either the cytoband form or the coordinate form, but not both in one object.
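A name like the one in the heading carries an assembly, a cytoband range, a coordinate range, and an optional copy count. The sketch below splits those parts so that either the cytoband form or the coordinate form can be chosen. `parse_copy_number_name` is a hypothetical helper, and its regular expression is an assumption fitted only to the names that appear in this notebook. Reading a range such as `Xq21.33-28` as q21.33 to q28 is also an assumption.

```python
import re

# Hypothetical pattern covering names such as
#   "GRCh38/hg38 1p36.32(chr1:2651321-2701929)x1"
_COPY_NUMBER_NAME = re.compile(
    r"^(?P<assembly>\S+)\s+"
    r"(?P<chr>[0-9XY]+)(?P<band_start>[pq][\d.]+)"
    r"(?:-(?P<band_end>[pq]?[\d.]+))?"
    r"\(chr(?P<coord_chr>[0-9XY]+):(?P<start>\d+)-(?P<end>\d+)\)"
    r"(?:x(?P<copies>\d+))?"
)


def parse_copy_number_name(name):
    """Split a copy-number name into assembly, cytoband, coordinates, copies."""
    m = _COPY_NUMBER_NAME.match(name)
    if m is None:
        raise ValueError(f"unsupported name: {name!r}")

    band_start = m["band_start"]
    band_end = m["band_end"] or band_start
    if band_end[0] not in "pq":
        # Assumption: an end band without an arm inherits the start band's arm.
        band_end = band_start[0] + band_end

    return {
        "assembly": m["assembly"],
        "chr": m["chr"],
        "cytoband": {"start": band_start, "end": band_end},
        # Coordinates are returned exactly as written; no conversion is applied.
        "coordinates": {"start": int(m["start"]), "end": int(m["end"])},
        "copies": None if m["copies"] is None else int(m["copies"]),
    }


parts = parse_copy_number_name("GRCh38/hg38 1p36.32(chr1:2651321-2701929)x1")
print(parts["chr"], parts["cytoband"], parts["copies"])
# 1 {'start': 'p36.32', 'end': 'p36.32'} 1
```

The `chr` and `cytoband` values correspond to the arguments given to `ChromosomeLocation` and `CytobandInterval` in the next cell, and `copies` corresponds to the `min` and `max` of the Abundance amount.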


In [None]:
cl = models.ChromosomeLocation(chr="1",
    interval=models.CytobandInterval(start="p36.32", end="p36.32"))
cl.as_dict()

In [None]:
ab = models.Abundance(
    location=cl,
    amount={
        "min": 1, "max": 1})
ab.as_dict()

In [None]:
ga4gh_identify(ab)
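`ga4gh_identify` returns a computed identifier for the object. As a rough illustration of the digest step only, the sketch below implements the truncated digest that the VRS specification calls sha512t24u: SHA-512, truncated to the first 24 bytes, then base64url-encoded. It uses only the standard library and is not the vrs-python implementation. It does not reproduce the serialization or the type prefix that `ga4gh_identify` applies, and whether Abundance objects in the develop branch are identified exactly this way is an assumption.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u(b""))  # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
```

In practice the input would be the canonical serialization of the object, such as the bytes produced by `ga4gh_serialize`, rather than an empty byte string.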

## VCV000149842.1 GRCh38/hg38 15q11.2(chr15:25334870-25351819)x3
⇒ specified with HGVS `NC_000015.8:g.(?_23131110)_(23148059_?)dup`, ...

## VCV000395246.1 GRCh37/hg19 Xq21.33-28(chrX:94043221-155246585)x1

## VCV000394192.1 GRCh37/hg19 Xp11.4(chrX:40456453-40487150)x2

## EID473 ... “increased copy number or amplification of EGFR”... (>8x copies)

## EID5925 “increased EGFR gene copy number”

## VCV000254074.1 NM_007294.3(BRCA1):c.5075-?_5277+?dup203

## VCV000145395.1 GRCh38/hg38 1p36.33(chr1:844353-911241)x3 (FAM41C, LINC01128)

## VCV000395687.1 GRCh37/hg19 Xp22.33-q28(chrX:70297-155255792)
NCBI calls this a "copy number loss" with corresponding HGVS `NC_000023.10:g.(?_70297)_(155255792_?)del`