# Abundance and Repeats Development Notebook

This notebook demonstrates two new and related VRS concepts, Repeats and Abundance, and the secondary classes that enable their use in representing common variation expressions. The corresponding schema currently exists only in the vr-spec develop branch, and running the notebook requires an up-to-date vrs-python package.

## Repeats
Repeats represent tandem copies of a fixed sequence of any length (length >= 1). The number of copies may be known precisely or specified as a range (min, max), which is essential for certain applications (e.g., Huntington's disease CAG repeat sizes). When the copy count is precise, a Repeat is equivalent to an Allele with a single SequenceState. Repeats are implemented using the Allele class with a RepeatState applied to a Location. Repeats should be used only when the copies are known to be tandem; if the location of the copies is unknown, Abundance should be used.
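
The equivalence between a precise repeat and a literal sequence can be sketched in plain Python. This is only an illustration: `expand_repeat` is a hypothetical helper, not part of vrs-python, and it takes the repeat unit as a string rather than reading it from a Location.

```python
def expand_repeat(unit, min_copies, max_copies):
    """Return the literal sequence for a repeat, or None if the count is a range.

    Hypothetical helper for illustration; not part of vrs-python.
    """
    if not unit:
        raise ValueError("repeat unit must have length >= 1")
    if min_copies != max_copies:
        # A range of copy counts has no single literal sequence, so it
        # cannot be collapsed into one SequenceState.
        return None
    return unit * min_copies


print(expand_repeat("CAG", 3, 3))    # precise count: a single literal sequence
print(expand_repeat("CAG", 36, 39))  # range: no single literal sequence
```

Only the precise case collapses to a sequence, which is why a range of copies needs a RepeatState.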

## Abundance
Abundance is a systemic state that captures the amount of a "subject" location throughout the genome.

Valid Abundance subjects are:
* SequenceLocations
* TranscriptLocations
* ChromosomeLocations
* GeneLocations (a quasi-location of a gene)

Quantifiers may be:
* Absolute copy count range
* Relative copy count range, with reference
* Qualitative copy count (e.g., "greater than"), with reference

The intention is to allow all combinations of the above subjects and quantifiers.

Issues/questions:
* What are the possible references? For example, "diploid" (including X in females) and "haploid" (X and Y in males).

In [1]:
from ga4gh.core import ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import models
from nbsupport import ppo

In [2]:
from biocommons.seqrepo import SeqRepo
from ga4gh.vrs.dataproxy import SeqRepoDataProxy
dp = SeqRepoDataProxy(SeqRepo(root_dir="/usr/local/share/seqrepo/latest"))

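def translate_sequence_identifier(ir):
    """Return the first ga4gh identifier for `ir`; raise KeyError if none is found."""
    try:
        return dp.translate_sequence_identifier(ir, "ga4gh")[0]
    except IndexError:
        # An empty result means the identifier has no ga4gh alias
        raise KeyError(f"Unable to translate {ir} to ga4gh sequence identifier") from None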

2020-11-15 14:38:26 snafu biocommons.seqrepo[846557] INFO biocommons.seqrepo 0.6.3


# New Location Classes

## NestedInterval
Converting HGVS expressions to NestedIntervals is tricky. Imagine this sequence with an exon (or any feature) at base positions 4-6 inclusive and sequence DEF.

```
base:     1 2 3 4 5 6 7 8 9
          a b c D E F g h i
i-base:  0 1 2 3 4 5 6 7 8 9

```

HGVS for the regions to the left and right (abc, ghi) would be `(1_3)_(7_9)`, corresponding to interbase ranges (0,3) and (6,9).
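
The coordinate arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch; `to_interbase` is a hypothetical helper name, and the nine-letter string is the toy sequence from the diagram.

```python
def to_interbase(start, end):
    """Convert a 1-based, inclusive range to a 0-based interbase (start, end) pair.

    Hypothetical helper for illustration only.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start moves: interbase position k sits just before base k+1.
    return (start - 1, end)


seq = "abcDEFghi"                # toy sequence from the diagram above
left = to_interbase(1, 3)        # region abc
feature = to_interbase(4, 6)     # region DEF
right = to_interbase(7, 9)       # region ghi

# Interbase coordinates slice Python strings directly.
print(left, feature, right)
print(seq[feature[0]:feature[1]])
```

Because interbase coordinates behave like Python slice indexes, the feature at base positions 4-6 is simply `seq[3:6]`.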

In [3]:
ni = models.NestedInterval(
    inner = models.SimpleInterval(start=20, end=30),
    outer = models.SimpleInterval(start=10, end=40)
)

In [4]:
ppo(ni)

{
  "inner": {
    "end": 30,
    "start": 20,
    "type": "SimpleInterval"
  },
  "outer": {
    "end": 40,
    "start": 10,
    "type": "SimpleInterval"
  },
  "type": "NestedInterval"
}


In [5]:
ni.outer.start = ni.outer.end = None
ppo(ni)

{
  "inner": {
    "end": 30,
    "start": 20,
    "type": "SimpleInterval"
  },
  "outer": {
    "end": null,
    "start": null,
    "type": "SimpleInterval"
  },
  "type": "NestedInterval"
}


In [6]:
sl = models.SequenceLocation(
    sequence_id = "ga4gh:SQ.0123abcd",
    interval = ni)

In [7]:
ppo(sl)

{
  "interval": {
    "inner": {
      "end": 30,
      "start": 20,
      "type": "SimpleInterval"
    },
    "outer": {
      "end": null,
      "start": null,
      "type": "SimpleInterval"
    },
    "type": "NestedInterval"
  },
  "sequence_id": "ga4gh:SQ.0123abcd",
  "type": "SequenceLocation"
}


In [8]:
ga4gh_identify(sl)

'ga4gh:VSL.QcNBGSvsJwz-J7LoT6BaH9ZyI_l0gKnd'
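
The digest portion of the identifier above is a truncated SHA-512 digest (sha512t24u in GA4GH terminology). The sketch below shows only that digest step, using the standard library. It does not reproduce the canonical serialization performed by `ga4gh_serialize`, so it will not regenerate the identifier above from a Python dict. `make_identifier` is a hypothetical helper for illustration.

```python
import base64
import hashlib


def sha512t24u(blob):
    """SHA-512 digest of `blob`, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def make_identifier(type_prefix, serialized):
    """Assemble a ga4gh CURIE from a type prefix (e.g. "VSL") and serialized bytes.

    Hypothetical helper: `serialized` stands in for the output of the
    canonical serialization, which is not implemented here.
    """
    return f"ga4gh:{type_prefix}.{sha512t24u(serialized)}"


print(sha512t24u(b""))
print(make_identifier("VSL", b'{"placeholder":"not a real serialization"}'))
```

Twenty-four bytes encode to exactly 32 base64url characters, which matches the length of the digest in the identifier above.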

## GeneLocation

GeneLocation is the conceptual location of an entire Gene.


## TranscriptLocation

# New Variation Classes

## Repeats

## Abundance

In [8]:
ab = models.Abundance(location=sl, amount={"min": 5, "max": 10})
ab.as_dict()

{'amount': {'min': 5, 'max': 10},
 'location': {'interval': {'inner': {'end': 30,
    'start': 20,
    'type': 'SimpleInterval'},
   'outer': {'end': None, 'start': None, 'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.0123abcd',
  'type': 'SequenceLocation'},
 'type': 'Abundance'}

In [9]:
ga4gh_identify(ab)

'ga4gh:VAB.NPxwy8GeFWFktfuscbnsC2eTint0x3Bc'

---
# Specific Examples

The following examples demonstrate the above classes.

Notes:
* Except for GeneLocation, gene names are ignored
* For HGVS ISCN notation, only the cytoband location is used. Locations using precise sequence coordinates are demonstrated in other examples.

## VCV000528890.1 NC_000001.11:g.(?\_218346682)\_(218441382\_?)del

In [24]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NC_000001.11"),
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=218346681, end=218441382),
            outer=models.SimpleInterval(start=None, end=None))),
    state=models.SequenceState(sequence=''))
a.as_dict()

{'location': {'interval': {'inner': {'end': 218441382,
    'start': 218346681,
    'type': 'SimpleInterval'},
   'outer': {'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
  'type': 'SequenceLocation'},
 'state': {'sequence': '', 'type': 'SequenceState'},
 'type': 'Allele'}
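
The mapping from an HGVS uncertain range to the inner and outer intervals used in the cell above can be sketched without vrs-python. `nested_interval_from_hgvs` is a hypothetical helper that returns plain dicts shaped like the notebook output; `None` stands for an HGVS `?`.

```python
def nested_interval_from_hgvs(outer_start, inner_start, inner_end, outer_end):
    """Map HGVS (outer_start_inner_start)_(inner_end_outer_end) to interbase dicts.

    Hypothetical helper for illustration. Positions are 1-based HGVS
    coordinates; pass None where the HGVS expression has '?'.
    """
    def to_interbase_start(pos):
        # A 1-based start position becomes interbase by subtracting one;
        # an unknown position stays unknown.
        return None if pos is None else pos - 1

    return {
        "type": "NestedInterval",
        "inner": {
            "type": "SimpleInterval",
            "start": to_interbase_start(inner_start),
            "end": inner_end,
        },
        "outer": {
            "type": "SimpleInterval",
            "start": to_interbase_start(outer_start),
            "end": outer_end,
        },
    }


# g.(?_218346682)_(218441382_?)del from the example above
ni_dict = nested_interval_from_hgvs(None, 218346682, 218441382, None)
print(ni_dict["inner"], ni_dict["outer"])
```

The inner interval covers the bases known to be affected, while the unknown outer bounds remain `None`.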

## VCV000665644.1 NC_000001.10:g.(?\_15764951)\_(15765010\_?)dup

In [26]:
a = models.Allele(
    location=models.SequenceLocation(
        # NOTE: the heading above cites NC_000001.10; this cell translates NC_000001.11.
        sequence_id=translate_sequence_identifier("NC_000001.11"),
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=15764950, end=15765010),
            outer=models.SimpleInterval(start=None, end=None))),
    state=models.RepeatState(copies={"min":2, "max": 2}))
a.as_dict()

{'location': {'interval': {'inner': {'end': 15765010,
    'start': 15764950,
    'type': 'SimpleInterval'},
   'outer': {'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
  'type': 'SequenceLocation'},
 'state': {'copies': {'max': 2, 'min': 2}, 'type': 'RepeatState'},
 'type': 'Allele'}

## VCV000662440.1 NM_003000.2(SDHB):c.656_707dup (p.Pro237_Phe238insAspTer)

## VCV000007621.2 NM_000214.3(JAG1):c.2091_2095GAAAG[1] (p.Gly699fs)
This variant is incorrectly written. The repeat unit GAAAG should be left-shuffled to GGAAA, making the left-shifted version c.2090_2094GGAAA.

```
 c.2081     2091     2101     2111
 n.2551     2561     2571     2581  
      |        |        |        |
      AAATGGGTGGAAAGGAAAGACCTGCCAC
```

Converting a repeat to a SequenceState requires examining the downstream sequence for the repeat unit and inserting or deleting copies as necessary. Imagine that this sequence had five GAAAG repeats and we were given `[1]`; then four repeats would be deleted, which is evident only from the sequence context.
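
The need for sequence context can be illustrated with a small sketch. The helper names are hypothetical, the sequence string is the fragment shown above, and the coordinates are offsets into that fragment rather than transcript coordinates.

```python
def count_tandem_copies(seq, start, unit):
    """Count consecutive copies of `unit` in `seq` beginning at offset `start`."""
    copies = 0
    pos = start
    while seq.startswith(unit, pos):
        copies += 1
        pos += len(unit)
    return copies


def repeat_to_replacement(seq, start, unit, copies):
    """Return (start, end, replacement) expressing `unit[copies]` as a literal edit.

    The interbase span start..end covers every reference copy found by
    scanning downstream; the replacement is the requested number of copies.
    """
    ref_copies = count_tandem_copies(seq, start, unit)
    end = start + ref_copies * len(unit)
    return (start, end, unit * copies)


fragment = "AAATGGGTGGAAAGGAAAGACCTGCCAC"   # sequence from the diagram above
unit_start = fragment.find("GAAAG")
print(count_tandem_copies(fragment, unit_start, "GAAAG"))
print(repeat_to_replacement(fragment, unit_start, "GAAAG", 1))

# Hypothetical five-copy reference: [1] now implies deleting four copies.
five = "TT" + "GAAAG" * 5 + "CC"
print(repeat_to_replacement(five, 2, "GAAAG", 1))
```

The same `[1]` expression yields a different deletion length in each case, because the number of reference copies is known only after scanning the sequence.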

In [4]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NM_000214.3"),
        interval=models.SimpleInterval(start=2090, end=2095)),
    state=models.RepeatState(copies={
        "min": 1,
        "max": 1}))
a.as_dict()

{'location': {'interval': {'end': 2095,
   'start': 2090,
   'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ.9MpDPgYoiNK-w76vPWkEYB-kTeZIK2NQ',
  'type': 'SequenceLocation'},
 'state': {'copies': {'max': 1, 'min': 1}, 'type': 'RepeatState'},
 'type': 'Allele'}

## VCV000144802.1 GRCh38/hg38 1p36.32(chr1:2651321-2701929)x1
⇒ It should be possible to represent either the cytoband form or the coordinate form, but not both in one object.
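
To make the two representations concrete, the sketch below splits an expression of this style into its cytoband part, its coordinate part, and its copy count. It is a rough illustration only: `parse_copy_number_expression` is a hypothetical helper, the pattern covers just the forms that appear in this notebook, and treating an end band without an arm (as in `Xq21.33-28`) as inheriting the arm of the start band is an assumption.

```python
import re

# Hypothetical pattern covering only the expressions used in this notebook.
_CNV_RE = re.compile(
    r"(?P<assembly>\S+)\s+"
    r"(?P<chr>\d{1,2}|X|Y)"
    r"(?P<band_start>[pq]\d+(?:\.\d+)?)"
    r"(?:-(?P<band_end>[pq]?\d+(?:\.\d+)?))?"
    r"\(chr(?P<coord_chr>\d{1,2}|X|Y):(?P<start>\d+)-(?P<end>\d+)\)"
    r"(?:x(?P<copies>\d+))?"
)


def parse_copy_number_expression(expr):
    """Split an expression into cytoband, coordinate, and copy-count parts."""
    m = _CNV_RE.match(expr)
    if m is None:
        raise ValueError(f"unrecognized expression: {expr!r}")
    band_start = m["band_start"]
    band_end = m["band_end"] or band_start
    if band_end[0] not in "pq":
        # Assumption: an end band without an arm inherits the start band's arm.
        band_end = band_start[0] + band_end
    return {
        "assembly": m["assembly"],
        "cytoband": {"chr": m["chr"], "start": band_start, "end": band_end},
        "coordinates": {
            "chr": m["coord_chr"],
            "start": int(m["start"]),
            "end": int(m["end"]),
        },
        "copies": int(m["copies"]) if m["copies"] else None,
    }


parsed = parse_copy_number_expression("GRCh38/hg38 1p36.32(chr1:2651321-2701929)x1")
print(parsed["cytoband"], parsed["copies"])
```

The `cytoband` part corresponds to the ChromosomeLocation built in the next cell, and `copies` corresponds to the Abundance amount; the `coordinates` part would belong in a separate SequenceLocation-based object.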


In [12]:
cl = models.ChromosomeLocation(chr="1",
    interval=models.CytobandInterval(start="p36.32", end="p36.32"))
cl.as_dict()

{'chr': '1',
 'interval': {'end': 'p36.32', 'start': 'p36.32', 'type': 'CytobandInterval'},
 'species_id': 'taxonomy:9606',
 'type': 'ChromosomeLocation'}

In [14]:
ab = models.Abundance(
    location=cl,
    amount={
        "min": 1, "max": 1})
ab.as_dict()

{'amount': {'min': 1, 'max': 1},
 'location': {'chr': '1',
  'interval': {'end': 'p36.32', 'start': 'p36.32', 'type': 'CytobandInterval'},
  'species_id': 'taxonomy:9606',
  'type': 'ChromosomeLocation'},
 'type': 'Abundance'}

In [15]:
ga4gh_identify(ab)

'ga4gh:VAB.ncinV7xih4BbFfyhGvp0-gx9__QmrdSP'

## VCV000149842.1 GRCh38/hg38 15q11.2(chr15:25334870-25351819)x3
⇒ specified with HGVS NC_000015.8:g.(?_23131110)_(23148059_?)dup, ...

## VCV000395246.1 GRCh37/hg19 Xq21.33-28(chrX:94043221-155246585)x1

## VCV000394192.1 GRCh37/hg19 Xp11.4(chrX:40456453-40487150)x2

## EID473 ... “increased copy number or amplification of EGFR”... (>8x copies)

## EID5925 “increased EGFR gene copy number”

## VCV000254074.1 NM_007294.3(BRCA1):c.5075-?_5277+?dup203

## VCV000145395.1 GRCh38/hg38 1p36.33(chr1:844353-911241)x3 (FAM41C , LINC01128)

## VCV000395687.1 GRCh37/hg19 Xp22.33-q28(chrX:70297-155255792)
NCBI calls this a "copy number loss" with corresponding HGVS `NC_000023.10:g.(?_70297)_(155255792_?)del`