# Abundance and Repeats Development Notebook

This notebook demonstrates two new, related VRS concepts: Repeats and Abundance. The corresponding schema
currently exists only in the vr-spec develop branch, and running the notebook requires an up-to-date vrs-python package.

## Repeats
Repeats represent tandem copies of a sequence of any length (VRS does not impose arbitrary thresholds). Because they are tandem by definition, they are effectively an alternative representation of SequenceState Alleles. They are therefore implemented as an alternative state, RepeatState, applied to a Location. Because they are Alleles, they apply to any valid Allele location.
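
To illustrate why a tandem repeat is an alternative representation of a literal sequence state, here is a minimal sketch in plain Python. It does not use vrs-python: `expand_tandem_repeat` and the `unit`/`count` keys are hypothetical names chosen for this illustration, not the RepeatState schema.

```python
def expand_tandem_repeat(unit: str, count: int) -> str:
    """Return the literal sequence for `count` tandem copies of `unit`.

    Hypothetical helper for illustration only; not part of vrs-python.
    """
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    if count < 0:
        raise ValueError("copy count must be non-negative")
    return unit * count


# Assumed example values: five tandem copies of a CAG unit.
repeat_form = {"unit": "CAG", "count": 5}

# The same state written out as a literal sequence.
literal_form = expand_tandem_repeat(**repeat_form)
print(literal_form)
```

The repeat form stays compact however many copies are present, while the literal form grows with the copy count.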


## Abundance
Abundance is a systemic state that captures the amount of a "subject" location throughout the genome/transcriptome/proteome.

Intended subjects are:
* SequenceLocations
* TranscriptLocations
* ChromosomeLocations
* GeneLocations (a quasi-location of a gene)

Amounts may be:
* Absolute copy count range
* Relative copy count range, with reference
* Qualitative copy count (e.g., "greater than"), with reference

The intention is to allow all combinations of entities and amount specifications.
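
The three kinds of amount listed above could be modeled roughly as follows. This is a sketch under stated assumptions, not the vr-spec schema: the class names, the `reference` and `comparator` fields, and the allowed comparator strings are all hypothetical. Only the `min`/`max` keys mirror the amount used later in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed vocabulary for qualitative comparisons (hypothetical).
COMPARATORS = {"greater than", "less than", "equal to"}


@dataclass(frozen=True)
class AbsoluteAmount:
    """Absolute copy count range; None means the bound is unspecified."""
    min: Optional[int]
    max: Optional[int]


@dataclass(frozen=True)
class RelativeAmount:
    """Copy count range relative to a named reference (e.g. "diploid")."""
    min: Optional[float]
    max: Optional[float]
    reference: str


@dataclass(frozen=True)
class QualitativeAmount:
    """Qualitative comparison against a named reference."""
    comparator: str
    reference: str

    def __post_init__(self):
        if self.comparator not in COMPARATORS:
            raise ValueError(f"unknown comparator: {self.comparator!r}")


absolute = AbsoluteAmount(min=5, max=10)
relative = RelativeAmount(min=1.5, max=2.0, reference="diploid")
qualitative = QualitativeAmount(comparator="greater than", reference="diploid")
```

Keeping the amount separate from the subject location is what lets any amount type combine with any location type.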

Issues/questions:
* What are the possible references? For example, "diploid" (including X in females) or "haploid" (X and Y in males).

In [1]:
from ga4gh.core import ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import models
from nbsupport import ppo

# New Location Classes

## NestedInterval and SequenceLocations

In [2]:
ni = models.NestedInterval(
    inner = models.SimpleInterval(start=20, end=30),
    outer = models.SimpleInterval(start=10, end=40)
)

In [3]:
ppo(ni)

{
  "inner": {
    "end": 30,
    "start": 20,
    "type": "SimpleInterval"
  },
  "outer": {
    "end": 40,
    "start": 10,
    "type": "SimpleInterval"
  },
  "type": "NestedInterval"
}


In [4]:
ni.outer.start = ni.outer.end = None
ppo(ni)

{
  "inner": {
    "end": 30,
    "start": 20,
    "type": "SimpleInterval"
  },
  "outer": {
    "end": null,
    "start": null,
    "type": "SimpleInterval"
  },
  "type": "NestedInterval"
}
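
The two cells above suggest a simple consistency rule for a NestedInterval. The sketch below checks it on plain dictionaries shaped like the printed output. It rests on two assumptions that the notebook does not state explicitly: the outer interval is meant to enclose the inner one, and a `None` outer bound means "unknown" and is therefore always acceptable. `is_consistent_nested_interval` is a hypothetical helper, not a vrs-python function.

```python
def is_consistent_nested_interval(ni: dict) -> bool:
    """Check that the outer bounds, where known, enclose the inner interval.

    Hypothetical helper; assumes None outer bounds mean "unknown".
    """
    inner, outer = ni["inner"], ni["outer"]
    if inner["start"] > inner["end"]:
        return False
    if outer["start"] is not None and outer["start"] > inner["start"]:
        return False
    if outer["end"] is not None and outer["end"] < inner["end"]:
        return False
    return True


# Fully specified interval, as first constructed above.
known = {
    "inner": {"start": 20, "end": 30},
    "outer": {"start": 10, "end": 40},
}

# Same inner interval with unknown outer bounds.
unknown_outer = {
    "inner": {"start": 20, "end": 30},
    "outer": {"start": None, "end": None},
}

# Outer start falls inside the inner interval, which violates the assumed rule.
bad = {
    "inner": {"start": 20, "end": 30},
    "outer": {"start": 25, "end": 40},
}

print(is_consistent_nested_interval(known))
print(is_consistent_nested_interval(unknown_outer))
print(is_consistent_nested_interval(bad))
```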


In [5]:
sl = models.SequenceLocation(
    sequence_id = "ga4gh:SQ.0123abcd",
    interval = ni)

In [6]:
ppo(sl)

{
  "interval": {
    "inner": {
      "end": 30,
      "start": 20,
      "type": "SimpleInterval"
    },
    "outer": {
      "end": null,
      "start": null,
      "type": "SimpleInterval"
    },
    "type": "NestedInterval"
  },
  "sequence_id": "ga4gh:SQ.0123abcd",
  "type": "SequenceLocation"
}


In [7]:
ga4gh_identify(sl)

'ga4gh:VSL.QcNBGSvsJwz-J7LoT6BaH9ZyI_l0gKnd'
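
The 32-character digest after the `ga4gh:VSL.` prefix is consistent with the GA4GH truncated digest (SHA-512, truncated to 24 bytes, base64url-encoded). The sketch below shows only that digest step, assuming this development branch follows the same scheme. It does not reproduce the object serialization that `ga4gh_identify` applies first, so it will not regenerate the identifier shown above. The `example_blob` value is a placeholder, not the real serialized form.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Placeholder input; the real input would be the canonical serialization
# of the object (as returned by ga4gh_serialize).
example_blob = b'{"type":"Example"}'
example_id = "ga4gh:VSL." + sha512t24u(example_blob)

print(len(sha512t24u(example_blob)))  # 24 bytes always encode to 32 characters
```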

## GeneLocation

GeneLocation is the conceptual location of an entire Gene.


## TranscriptLocation

# New Variation Classes

## Repeats

## Abundance

In [8]:
ab = models.Abundance(location=sl, amount={"min": 5, "max": 10})
ab.as_dict()

{'amount': {'min': 5, 'max': 10},
 'location': {'interval': {'inner': {'end': 30,
    'start': 20,
    'type': 'SimpleInterval'},
   'outer': {'end': None, 'start': None, 'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.0123abcd',
  'type': 'SequenceLocation'},
 'type': 'Abundance'}

In [9]:
ga4gh_identify(ab)

'ga4gh:VAB.NPxwy8GeFWFktfuscbnsC2eTint0x3Bc'