This notebook demonstrates new features in the develop branch of vr-spec. The primary new features are Repeats and Abundance, along with secondary classes that enable their use in representing common variation involving the duplication of sequence.

The corresponding schema (currently) exists only in the vr-spec develop branch. Running this notebook also requires an up-to-date vrs-python package.

**Models in this notebook are under development. They are likely to change before release and may not ever be released.  Don't use them yet.** 

In [1]:
from ga4gh.core import ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import models, vrs_enref, vrs_deref
from nbsupport import ppo, translate_sequence_identifier

# from ga4gh.vrs._internal.models import _load_vrs_models # for testing

# New Features

## NestedInterval (Not Identifiable)

The NestedInterval class is intended to capture regions where the start and end coordinates are uncertain. For example, a deletion of an entire exon may start anywhere 5' of the left intron-exon boundary and end anywhere 3' of the right exon-intron boundary. 

NestedIntervals are modeled as a pair of `inner` and `outer` ranges, each of which is a SimpleInterval.  A `start` and `end` are consistent with a NestedInterval if they meet these criteria: `outer.start <= start <= inner.start <= inner.end <= end <= outer.end`.
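
The consistency criterion above can be checked mechanically. The following is a minimal sketch that uses plain `(start, end)` tuples rather than the `models` classes; the helper name `is_consistent` is hypothetical and not part of vrs-python.

```python
def is_consistent(start, end, inner, outer):
    """Return True if (start, end) is consistent with a nested interval.

    inner and outer are (start, end) tuples in interbase coordinates.
    """
    outer_start, outer_end = outer
    inner_start, inner_end = inner
    return outer_start <= start <= inner_start <= inner_end <= end <= outer_end


inner, outer = (20, 30), (10, 40)
print(is_consistent(15, 35, inner, outer))  # True: both ends lie in the uncertain flanks
print(is_consistent(25, 35, inner, outer))  # False: start lies inside the inner range
```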

NestedIntervals are conceptually related to, and interchangeable with, intervals of intervals (i.e., start and end, each of which is defined by an interval).

Converting HGVS expressions to NestedIntervals requires care. Imagine this sequence with an exon (or any feature) at base positions 4-6 inclusive and sequence DEF.

```
base:     1   2   3   4   5   6   7   8   9
          a   b   c [ D   E   F ] g   h   i
i-base:  0  1   2   3   4   5   6   7   8   9

```

HGVS for the region to the left and right (abc, ghi) would be `(1_3)_(7_9)`, corresponding to interbase ranges (0,3) and (6,9). As a nested interval, the range becomes inner=(3,6) and outer=(0,9).
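
The conversion just described can be sketched as a small function. This is an illustration only: the function name and the tuple-based representation are assumptions, and it handles only the fully specified `(a_b)_(c_d)` form (no `?` endpoints).

```python
def hgvs_uncertain_to_nested(left, right):
    """Convert HGVS-style (a_b)_(c_d) base ranges to interbase inner/outer.

    left and right are 1-based inclusive (first, last) base ranges within
    which the start and the end, respectively, are known to lie.
    """
    left_ib = (left[0] - 1, left[1])      # 1-based inclusive -> interbase
    right_ib = (right[0] - 1, right[1])
    outer = (left_ib[0], right_ib[1])     # widest possible extent
    inner = (left_ib[1], right_ib[0])     # region certainly included
    return {"inner": inner, "outer": outer}


print(hgvs_uncertain_to_nested((1, 3), (7, 9)))
# {'inner': (3, 6), 'outer': (0, 9)}
```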

In [4]:
ni = models.NestedInterval(
    inner = models.SimpleInterval(start=20, end=30),
    outer = models.SimpleInterval(start=10, end=40))
ni.as_dict()

{'inner': {'end': 30, 'start': 20, 'type': 'SimpleInterval'},
 'outer': {'end': 40, 'start': 10, 'type': 'SimpleInterval'},
 'type': 'NestedInterval'}

In [5]:
sl = models.SequenceLocation(
    sequence_id = "ga4gh:SQ.0123abcd",
    interval = ni)
sl.as_dict()

{'interval': {'inner': {'end': 30, 'start': 20, 'type': 'SimpleInterval'},
  'outer': {'end': 40, 'start': 10, 'type': 'SimpleInterval'},
  'type': 'NestedInterval'},
 'sequence_id': 'ga4gh:SQ.0123abcd',
 'type': 'SequenceLocation'}

## Gene (Not Identifiable)

Use case: Provide a mechanism to refer to an entire gene to use as the subject of a repeat or abundance statement. For example, "PTEN del".

Issue: Genes themselves are not locations. "GeneLocation" might do it, but it doesn't follow the existing pattern of XLocation, where X is something that can be interpreted as a coordinate system. For example, SequenceLocation is a location on a Sequence; ChromosomeLocation is a location on a Chromosome (but not on a particular chromosome sequence). Without a GeneLocation concept, it is hard to see how VRS could represent knowledge like "EGFR, 3 copies".

Genes are intended to be used to represent systemic variation, not molecular variation.

In [4]:
g = models.Gene(gene_id="ncbigene:1234")
g.as_dict()

{'gene_id': 'ncbigene:1234', 'type': 'Gene'}

## ★ Sequence States (Not Identifiable)

A Sequence State is the root of a class of ways to describe a sequence, including sequences that are inferred from locations or transformed by reverse and/or complement operations. Future extensions may provide methods to describe approximate sequence matching (other than by IUPAC ambiguity codes) or sized sequences.

The sequence that is inserted and repeated (and deleted?) might be expressed in a variety of ways: 

* Literal sequences
* Exact sequence from a sequence location

Future possibilities:

* Inexact sequence from a sequence location
* Inverted
* Sequence ambiguity (either by regexp or IUPAC ambiguity codes)
* Matched by sequence size

Expressions fall into a few categories:
* Countable, Unary -- expressions that generate only one sequence. Examples: Literal, Inferred, Inverted (w/no ambiguity codes)
* Countable, non-unary -- expressions that generate more than one sequence. Examples: Literal, regexp (w/ambiguity codes)
* Uncountable -- expressions that generate an infinite number of sequences (e.g., "size > 5 nt").

This is relevant to VRS because we *may* want to describe or constrain when certain types should be used.

Because unary sequences refer to a specific sequence, they may be used to impute a derived sequence. In contrast, the non-unary and uncountable expressions generate a family of sequences.
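
To make the unary versus non-unary distinction concrete, here is a minimal sketch that enumerates the literal sequences generated by an expression written with IUPAC nucleotide codes. The helper names `expand` and `is_unary` are hypothetical; nothing here is part of the VRS schema.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(expr):
    """Enumerate every literal sequence generated by an IUPAC expression."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in expr))]


def is_unary(expr):
    """An expression is unary if it generates exactly one sequence."""
    return len(expand(expr)) == 1


print(expand("ACGT"))  # ['ACGT']
print(expand("ARY"))   # ['AAC', 'AAT', 'AGC', 'AGT']
print(is_unary("ACGT"), is_unary("ARY"))  # True False
```

Only a unary expression identifies a single sequence from which a derived sequence could be imputed; `"ARY"` describes a family of four.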

### LiteralSequence

In [5]:
ls = models.LiteralSequence(sequence="ACGT")
ls.as_dict()

{'sequence': 'ACGT', 'type': 'LiteralSequence'}

### DerivedSequence
A DerivedSequence is a sequence derived from a location. In this example, the location is an uncertain range (a NestedInterval), which feels nonsensical. The spec probably wants to require a precise sequence (which needs definition).

In [6]:
ds = models.DerivedSequence(location=sl)
ds.as_dict()

{'location': {'interval': {'inner': {'end': 30,
    'start': 20,
    'type': 'SimpleInterval'},
   'outer': {'end': 40, 'start': 10, 'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.0123abcd',
  'type': 'SequenceLocation'},
 'type': 'DerivedSequence'}

In [7]:
ds = models.DerivedSequence(location=sl, transformation="reversecomplement")
ds.as_dict()

{'location': {'interval': {'inner': {'end': 30,
    'start': 20,
    'type': 'SimpleInterval'},
   'outer': {'end': 40, 'start': 10, 'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.0123abcd',
  'type': 'SequenceLocation'},
 'transformation': 'reversecomplement',
 'type': 'DerivedSequence'}
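
For a precise interval, resolving a derived sequence amounts to slicing the reference and optionally transforming the result. The sketch below works on a plain string with interbase coordinates; `derive` is a hypothetical helper, and only the `"reversecomplement"` transformation shown above is handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def derive(sequence, start, end, transformation=None):
    """Extract sequence[start:end] (interbase) and optionally transform it."""
    sub = sequence[start:end]
    if transformation is None:
        return sub
    if transformation == "reversecomplement":
        return sub.translate(COMPLEMENT)[::-1]
    raise ValueError(f"unsupported transformation: {transformation}")


ref = "AAATGGGTGGAAAGG"
print(derive(ref, 2, 7))                       # ATGGG
print(derive(ref, 2, 7, "reversecomplement"))  # CCCAT
```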

### ★ Repeats (RepeatState)

In VRS, Repeats are defined to be tandem copies of a sequence of any length (length >= 1). In VRS, they are an assertion of state and not necessarily a change with respect to the reference sequence.

Repeats are implemented using the Allele class with a RepeatState applied to a Location. The RepeatState conveys both the repeat count and the repeat sequence.

The number of copies may be known precisely, or may be specified as a range (min, max), which is essential for certain applications (e.g., Huntington's disease CAG repeat sizes).

The repeat sequence may be verbatim or a derived/approximated sequence.

For a precise repeat length with a fixed sequence, Repeats are equivalent to Alleles with a single SequenceState.

Repeats should be used only when copies are known to be tandem; if the location of repeats is unknown or genome-wide, Abundance should be used. 

Discussion
* RepeatState will use LiteralSequence, not inlined sequence
* Allele currently uses inline sequence. Will add LiteralSequence and deprecate inline.
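
The relationship between a repeat with a precise count and a literal sequence, and the meaning of a `(min, max)` copy range, can be sketched with plain strings. The helper names are hypothetical and the code does not use the `models` classes.

```python
def expand_repeat(unit, copies):
    """Literal sequence for exactly `copies` tandem copies of `unit`."""
    return unit * copies


def matches_repeat(sequence, unit, min_copies, max_copies):
    """True if sequence is a whole number of tandem copies of unit within range."""
    if not unit or len(sequence) % len(unit) != 0:
        return False
    copies = len(sequence) // len(unit)
    return sequence == unit * copies and min_copies <= copies <= max_copies


print(expand_repeat("CAG", 5))                   # CAGCAGCAGCAGCAG
print(matches_repeat("CAG" * 7, "CAG", 5, 10))   # True
print(matches_repeat("CAG" * 12, "CAG", 5, 10))  # False
```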


In [8]:
models.RepeatedSequence(
    sequence=models.LiteralSequence(sequence="CAG"),
    copies={"min": 5, "max": 10}).as_dict()

{'copies': {'max': 10, 'min': 5},
 'sequence': {'sequence': 'CAG', 'type': 'LiteralSequence'},
 'type': 'RepeatedSequence'}

In [9]:
sl = models.SequenceLocation(
    sequence_id="ga4gh:SQ.abc123",
    interval=models.SimpleInterval(start=20, end=30))
rs = models.RepeatedSequence(
    sequence=models.DerivedSequence(location=sl),
    copies={"min": 5, "max": 10})
rs.as_dict()

{'copies': {'max': 10, 'min': 5},
 'sequence': {'location': {'interval': {'end': 30,
    'start': 20,
    'type': 'SimpleInterval'},
   'sequence_id': 'ga4gh:SQ.abc123',
   'type': 'SequenceLocation'},
  'type': 'DerivedSequence'},
 'type': 'RepeatedSequence'}

## ★ Abundance
Abundance is a systemic state that captures the amount of a "subject" location throughout the genome.

Valid Abundance subjects are:
* SequenceLocations
* TranscriptLocations
* ChromosomeLocations
* GeneLocations (a quasi-location of a gene)
* Allele

Quantifiers may be:
* Absolute copy count range
* Relative copy count range, with reference
* Qualitative copy count (e.g., "greater than"), with reference

The intention is to allow all combinations of the above subjects and quantifiers.

Issues/questions:
* What are possible references? "diploid" (incl female X), "haploid" (X, Y in males)

In [31]:
ab = models.Abundance(
    amount={"min": 5, "max": 10, "measure": "AbsoluteCounts"}
    ).as_dict()

In [33]:
models.Abundance(
    amount={"min": 3, "max": 13, "measure": "RelativeCounts"}
    ).as_dict()

{'amount': {'min': 3, 'max': 13, 'measure': 'RelativeCounts'},
 'type': 'Abundance'}

In [35]:
models.Abundance(
    amount={"measure": "Qualitative", "rel": "gt"}
    ).as_dict()

{'amount': {'measure': 'Qualitative', 'rel': 'gt'}, 'type': 'Abundance'}

## ★ Transcripts and Transcript Locations

This section implements the unified model of transcripts, which defines a Transcript as a collection of exons and an optional CDS on any sequence.  For example, a RefSeq transcript on the defining RefSeq sequence would be one Transcript, and the projection of that transcript onto a genomic sequence (through splign, typically) would be a *different* Transcript with the *same* data structure.  Note that the colloquial use of "transcript" is ambiguous about the underlying sequence, which means that it refers non-specifically to a family of Transcripts (as defined above).

Considering HGVS syntax `NC_000007.13(NM_005228.4)` is useful. One interpretation of that syntax is that it is specifying the `NM_005228.4` exon structure on the `NC_000007.13` sequence. With that view, the AnnotatedSequence model uses distinct data structures for variants with reference sequences like `NM_005228.4` and `NC_000007.13(NM_005228.4)`. In contrast, the Unified Model effectively treats `NM_005228.4` as `NM_005228.4(NM_005228.4)`, and uses the same information model as for aligned sequences.

### Defining Transcripts

In [9]:
# Example transcript: NM_000314.6 (PTEN) on chr 10

# NM_000314.6(PTEN):c.78C>A  ~  NC_000010.11:g.87864547C>T
# NM_000314.6(PTEN):c.79+3A>G  ~  NC_000010.11:g.87864551A>G

# NM_000314.6
# These exons and CDS on the NM_000314.6 sequences constitute the
# definition of this transcript
t_exons = [(0,1110), (1110,1195), (1195,1240),
            (1240,1284), (1284,1523), (1523,1665),
            (1665,1832), (1832,2057), (2057,8701)]
t_cds = (1031,2243)

# NM_000314.6 aligned (by NCBI) to NC_000010.11 sequence (GRCh38 chr 10), + strand
g_exons = [(87863437, 87864548), (87894024, 87894109), (87925512, 87925557),
           (87931045, 87931089), (87933012, 87933251), (87952117, 87952259),
           (87957852, 87958019), (87960893, 87961118), (87965286, 87971930)]

# Cigars of alignment (relative to transcript)
tg_cigars = "666=1I39=1X404= 85= 45= 44= 239= 142= 167= 225= 6644="

# g_cds is computed from t_cds, accounting for alignment 
# 1032 = 1031 + 1I in cigar
g_cds = (87863437 + 1032, 87965286 + 2243 - 2057)


t_sequence_id = translate_sequence_identifier("refseq:NM_000314.6")
g_sequence_id = translate_sequence_identifier("refseq:NC_000010.11")
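
The hand computation of `g_cds` above can be generalized by walking the per-exon CIGAR strings. This is a sketch under stated assumptions: `map_t_to_g` is a hypothetical helper, the alignment is on the + strand, and `I` is taken to mean bases present on the genomic sequence only (inferred from exon 1 being 1110 nt on the transcript and 1111 nt on the genome). The treatment of `D` is assumed by symmetry and is not exercised by this data.

```python
import re

# Exon coordinates and per-exon CIGARs copied from the cell above.
T_EXONS = [(0, 1110), (1110, 1195), (1195, 1240),
           (1240, 1284), (1284, 1523), (1523, 1665),
           (1665, 1832), (1832, 2057), (2057, 8701)]
G_EXONS = [(87863437, 87864548), (87894024, 87894109), (87925512, 87925557),
           (87931045, 87931089), (87933012, 87933251), (87952117, 87952259),
           (87957852, 87958019), (87960893, 87961118), (87965286, 87971930)]
CIGARS = "666=1I39=1X404= 85= 45= 44= 239= 142= 167= 225= 6644=".split()


def map_t_to_g(t_pos):
    """Map an interbase transcript position to the genomic sequence (+ strand)."""
    for (t_start, t_end), (g_start, _), cigar in zip(T_EXONS, G_EXONS, CIGARS):
        if not t_start <= t_pos <= t_end:
            continue
        t_cur, g_cur = t_start, g_start
        for length, op in re.findall(r"(\d+)([=XID])", cigar):
            length = int(length)
            if op == "I":      # bases on the genomic sequence only
                g_cur += length
            elif op == "D":    # bases on the transcript only (assumed)
                if t_pos <= t_cur + length:
                    return g_cur
                t_cur += length
            else:              # "=" or "X": advance both sequences
                if t_pos <= t_cur + length:
                    return g_cur + (t_pos - t_cur)
                t_cur += length
                g_cur += length
    raise ValueError(f"position {t_pos} is not within any exon")


print(map_t_to_g(1031), map_t_to_g(2243))  # 87864469 87965472
```

The two printed values agree with the `g_cds` tuple computed by hand. A position that sits exactly at an insertion boundary maps to the left side of the insertion in this sketch.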

In [10]:
t_transcript = models.Transcript(
  sequence_id = t_sequence_id,
  exons = [models.SimpleInterval(start=ex[0], end=ex[1]) for ex in t_exons],
  cds = models.SimpleInterval(start=t_cds[0], end=t_cds[1])
)

g_transcript = models.Transcript(
  sequence_id = g_sequence_id,
  exons = [models.SimpleInterval(start=ex[0], end=ex[1]) for ex in g_exons],
  cds = models.SimpleInterval(start=g_cds[0], end=g_cds[1])
)

In [11]:
t_transcript.as_dict()

{'cds': {'end': 2243, 'start': 1031, 'type': 'SimpleInterval'},
 'exons': [{'end': 1110, 'start': 0, 'type': 'SimpleInterval'},
  {'end': 1195, 'start': 1110, 'type': 'SimpleInterval'},
  {'end': 1240, 'start': 1195, 'type': 'SimpleInterval'},
  {'end': 1284, 'start': 1240, 'type': 'SimpleInterval'},
  {'end': 1523, 'start': 1284, 'type': 'SimpleInterval'},
  {'end': 1665, 'start': 1523, 'type': 'SimpleInterval'},
  {'end': 1832, 'start': 1665, 'type': 'SimpleInterval'},
  {'end': 2057, 'start': 1832, 'type': 'SimpleInterval'},
  {'end': 8701, 'start': 2057, 'type': 'SimpleInterval'}],
 'sequence_id': 'ga4gh:SQ.7YNhHjHLiBJwNd43xjLJA7jjnuJwPhxn',
 'type': 'Transcript'}

In [12]:
g_transcript.as_dict()

{'cds': {'end': 87965472, 'start': 87864469, 'type': 'SimpleInterval'},
 'exons': [{'end': 87864548, 'start': 87863437, 'type': 'SimpleInterval'},
  {'end': 87894109, 'start': 87894024, 'type': 'SimpleInterval'},
  {'end': 87925557, 'start': 87925512, 'type': 'SimpleInterval'},
  {'end': 87931089, 'start': 87931045, 'type': 'SimpleInterval'},
  {'end': 87933251, 'start': 87933012, 'type': 'SimpleInterval'},
  {'end': 87952259, 'start': 87952117, 'type': 'SimpleInterval'},
  {'end': 87958019, 'start': 87957852, 'type': 'SimpleInterval'},
  {'end': 87961118, 'start': 87960893, 'type': 'SimpleInterval'},
  {'end': 87971930, 'start': 87965286, 'type': 'SimpleInterval'}],
 'sequence_id': 'ga4gh:SQ.ss8r_wB0-b9r44TQTMmVTI92884QvBiB',
 'type': 'Transcript'}

In [13]:
t_transcript_id = ga4gh_identify(t_transcript)
g_transcript_id = ga4gh_identify(g_transcript)
t_transcript_id, g_transcript_id

('ga4gh:VTX.nTRjcOgzR6_owupcO39owUADNZeFY0d3',
 'ga4gh:VTX.pTvbXNiGxYhmHFvCn7FeBnAFb14DzAZr')

### Transcript Feature Locations

In [14]:
tf = models.TranscriptFeature(feature_type="exon", index=0)
tf.as_dict()

{'feature_type': 'exon', 'index': 0}

In [15]:
tfi = models.TranscriptFeatureInterval(
    start=models.TranscriptFeature(feature_type="exon", index=0),
    end=models.TranscriptFeature(feature_type="exon", index=5),
)
tfi.as_dict()

{'end': {'feature_type': 'exon', 'index': 5},
 'start': {'feature_type': 'exon', 'index': 0},
 'type': 'TranscriptFeatureInterval'}

In [16]:
tfl = models.TranscriptLocation(
    transcript = g_transcript,
    interval = tfi)
tfl.as_dict()

{'interval': {'end': {'feature_type': 'exon', 'index': 5},
  'start': {'feature_type': 'exon', 'index': 0},
  'type': 'TranscriptFeatureInterval'},
 'transcript': {'cds': {'end': 87965472,
   'start': 87864469,
   'type': 'SimpleInterval'},
  'exons': [{'end': 87864548, 'start': 87863437, 'type': 'SimpleInterval'},
   {'end': 87894109, 'start': 87894024, 'type': 'SimpleInterval'},
   {'end': 87925557, 'start': 87925512, 'type': 'SimpleInterval'},
   {'end': 87931089, 'start': 87931045, 'type': 'SimpleInterval'},
   {'end': 87933251, 'start': 87933012, 'type': 'SimpleInterval'},
   {'end': 87952259, 'start': 87952117, 'type': 'SimpleInterval'},
   {'end': 87958019, 'start': 87957852, 'type': 'SimpleInterval'},
   {'end': 87961118, 'start': 87960893, 'type': 'SimpleInterval'},
   {'end': 87971930, 'start': 87965286, 'type': 'SimpleInterval'}],
  'sequence_id': 'ga4gh:SQ.ss8r_wB0-b9r44TQTMmVTI92884QvBiB',
  'type': 'Transcript'},
 'type': 'TranscriptLocation'}
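
One plausible reading of a TranscriptFeatureInterval is the span from the start of the start feature to the end of the end feature. The sketch below resolves exon indexes to sequence coordinates under that assumption; `resolve_feature_interval` is a hypothetical helper, and 0-based indexes with an inclusive end feature are assumed rather than taken from the schema.

```python
# Genomic exon coordinates of the example transcript (interbase).
G_EXONS = [(87863437, 87864548), (87894024, 87894109), (87925512, 87925557),
           (87931045, 87931089), (87933012, 87933251), (87952117, 87952259),
           (87957852, 87958019), (87960893, 87961118), (87965286, 87971930)]


def resolve_feature_interval(exons, start_index, end_index):
    """Sequence span from the start of one exon to the end of another.

    Assumes 0-based exon indexes and an inclusive end feature.
    """
    if not 0 <= start_index <= end_index < len(exons):
        raise IndexError("exon index out of range")
    return exons[start_index][0], exons[end_index][1]


print(resolve_feature_interval(G_EXONS, 0, 5))  # (87863437, 87952259)
```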

In [17]:
ga4gh_identify(tfl)

'ga4gh:VTL.h2wLFU5g0cp5AxX59QsVKcllRljBhF-b'

In [18]:
vrs_enref(tfl).as_dict()

{'interval': {'end': {'feature_type': 'exon', 'index': 5},
  'start': {'feature_type': 'exon', 'index': 0},
  'type': 'TranscriptFeatureInterval'},
 'transcript': 'ga4gh:VTX.pTvbXNiGxYhmHFvCn7FeBnAFb14DzAZr',
 'type': 'TranscriptLocation'}

### Defining transcript locations for CDS variation
NM_000314.6(PTEN):c.78C>A  ~  NC_000010.11:g.87864547C>T

CDS start c.1 corresponds to n.1032.
c.78 corresponds to n.1110, or interbase position 1109,1110 on sequence NM_000314.6.
Due to the 1 nt insertion on NC_000010.11, that position aligns to interbase position 1110,1111.

In [19]:
t_tloc = models.TranscriptLocation(
    transcript = t_transcript,
    interval = models.SimpleInterval(start=1109,end=1110))
vrs_enref(t_tloc).as_dict()

{'interval': {'end': 1110, 'start': 1109, 'type': 'SimpleInterval'},
 'transcript': 'ga4gh:VTX.nTRjcOgzR6_owupcO39owUADNZeFY0d3',
 'type': 'TranscriptLocation'}

In [20]:
g_tloc = models.TranscriptLocation(
    transcript = g_transcript,
    interval = models.SimpleInterval(start=1110,end=1111))
vrs_enref(g_tloc).as_dict()

{'interval': {'end': 1111, 'start': 1110, 'type': 'SimpleInterval'},
 'transcript': 'ga4gh:VTX.pTvbXNiGxYhmHFvCn7FeBnAFb14DzAZr',
 'type': 'TranscriptLocation'}

---
# Specific Examples

The examples below use variation from ClinVar and VICC to demonstrate the above classes.

## VCV000528890.1 NC_000001.11:g.(?\_218346682)\_(218441382\_?)del


Three possible interpretations:
* Molecular variation w/del
* MV w/Repeat state 0
* Abundance abs copy = 1

In [7]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NC_000001.11"),
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=218346681, end=218441382),
            outer=models.SimpleInterval(start=None, end=None))),
    state=models.SequenceState(sequence=''))
ppo(a)

{
  "location": {
    "interval": {
      "inner": {
        "end": 218441382,
        "start": 218346681,
        "type": "SimpleInterval"
      },
      "outer": {
        "type": "SimpleInterval"
      },
      "type": "NestedInterval"
    },
    "sequence_id": "ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "",
    "type": "SequenceState"
  },
  "type": "Allele"
}


## VCV000665644.1 NC_000001.10:g.(?\_15764951)\_(15765010\_?)dup

In [37]:
# if the copies are known to be tandem
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NC_000001.11"),
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=15764950, end=15765010),
            outer=models.SimpleInterval(start=None, end=None))),
    # Open question: should RepeatState also take a `source` (formerly
    # `sequence`) that is either a derived sequence or an Allele?
    state=models.RepeatState(
        copies={"min": 2, "max": 2}))
a.as_dict()

{'location': {'interval': {'inner': {'end': 15765010,
    'start': 15764950,
    'type': 'SimpleInterval'},
   'outer': {'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
  'type': 'SequenceLocation'},
 'state': {'copies': {'max': 2, 'min': 2}, 'type': 'RepeatState'},
 'type': 'Allele'}

In [39]:
# if the copies are not known to be tandem, this should be an Abundance statement
a = models.Abundance(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NC_000001.11"),
        interval=models.NestedInterval(
            inner=models.SimpleInterval(start=15764950, end=15765010),
            outer=models.SimpleInterval(start=None, end=None))),
    amount={"min":2, "max": 2, "measure": "AbsoluteCount"})
a.as_dict()

{'amount': {'min': 2, 'max': 2, 'measure': 'AbsoluteCount'},
 'location': {'interval': {'inner': {'end': 15765010,
    'start': 15764950,
    'type': 'SimpleInterval'},
   'outer': {'type': 'SimpleInterval'},
   'type': 'NestedInterval'},
  'sequence_id': 'ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
  'type': 'SequenceLocation'},
 'type': 'Abundance'}

## VCV000662440.1 NM_003000.2(SDHB):c.656_707dup (p.Pro237_Phe238insAspTer)

In [41]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NM_003000.2"),
        interval=models.SimpleInterval(start=655, end=707)),
    state=models.RepeatState(copies={"min":2, "max": 2}))
a.as_dict()

{'location': {'interval': {'end': 707, 'start': 655, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ.Aqu6SGQJj0RGxWrihJZHQfzxCongdQ-W',
  'type': 'SequenceLocation'},
 'state': {'copies': {'max': 2, 'min': 2}, 'type': 'RepeatState'},
 'type': 'Allele'}

## VCV000007621.2 NM_000214.3(JAG1):c.2091_2095GAAAG[1] (p.Gly699fs)
This variant is incorrectly written.  The repeat unit GAAAG should be left-shuffled to GGAAA, making the left-shifted version c.2090_2094GGAAA.

```
 c.2081     2091     2101     2111
 n.2551     2561     2571     2581  
      |        |        |        |
      AAATGGGTGGAAAGGAAAGACCTGCCAC
```

Converting a repeat to a SequenceState requires examining downstream sequence for the repeat unit, and inserting or deleting as necessary.  Imagine that this sequence had five GAAAG repeats and we got `[1]`; then, 4 repeats would be deleted, which is evident only from sequence context.

See note below about sequence locations for repeats.

### Discuss: Location of repeats

```
 n.2551              2561              2571              2581  
 c.2081              2091              2101              2111
      |                 |                 |                 |
      A A A T G G G T G G A A A G G A A A G A C C T G C C A C
                      [-------] [-------] - 
```

* What is the assertion of reference? Two GGAAA at 2090?
* The variant above is an assertion of only 1 repeat. Can we assert zero repeats?
* Must be left aligned? How does that comport with fully justified?

In [43]:
a = models.Allele(
    location=models.SequenceLocation(
        sequence_id=translate_sequence_identifier("NM_000214.3"),
        interval=models.SimpleInterval(start=2089, end=2089)),
    state=models.RepeatState(sequence="GGAAA", copies={"min": 1,"max": 1}))
a.as_dict()

{'location': {'interval': {'end': 2089,
   'start': 2089,
   'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ.9MpDPgYoiNK-w76vPWkEYB-kTeZIK2NQ',
  'type': 'SequenceLocation'},
 'state': {'copies': {'max': 1, 'min': 1},
  'sequence': 'GGAAA',
  'type': 'RepeatState'},
 'type': 'Allele'}

## VCV000144802.1 GRCh38/hg38 1p36.32(chr1:2651321-2701929)x1
⇒ make it possible to represent cytoband and coordinate representations, but not both in one object


In [None]:
cl = models.ChromosomeLocation(chr="1",
    interval=models.CytobandInterval(start="p36.32", end="p36.32"))
cl.as_dict()

In [None]:
ab = models.Abundance(
    location=cl,
    amount={
        "min": 1, "max": 1})
ab.as_dict()

In [None]:
ga4gh_identify(ab)

## VCV000149842.1 GRCh38/hg38 15q11.2(chr15:25334870-25351819)x3
⇒ specified with HGVS NC_000015.8:g.(?_23131110)_(23148059_?)dup, ...

## VCV000395246.1 GRCh37/hg19 Xq21.33-28(chrX:94043221-155246585)x1

## VCV000394192.1 GRCh37/hg19 Xp11.4(chrX:40456453-40487150)x2

## 3 copies EGFR

Larry:
ab = models.Abundance(
    molstate=Gene('ncbigene:1234'),
    amount={"min": 3, "max": 3}
    )

Alex:
ab = models.Abundance(
    molvar=Allele(
        location=Gene('ncbigene:1234'),
        state=AmbiguousState()),
    amount={"min": 3, "max": 3}
    )

ab = models.Abundance(
    molvar=Allele(
        location="NM_1234.4:22_33",
        state=SequenceState(sequence='')),
    amount={"min": 3, "max": 3}
    )

ab = models.Abundance(
    molvar=Allele(
        location="NM_01234.5",
        state=AmbiguousState()),
    amount={"min": 3, "max": 3}
    )

ab = models.Abundance(
    molvar=Transcript("NM_01234.5"),
    amount={"min": 3, "max": 3}
    )

## EID473 ... “increased copy number or amplification of EGFR”... (>8x copies)

## EID5925 “increased EGFR gene copy number”

## VCV000254074.1 NM_007294.3(BRCA1):c.5075-?_5277+?dup203

## VCV000145395.1 GRCh38/hg38 1p36.33(chr1:844353-911241)x3 (FAM41C , LINC01128)

## VCV000395687.1 GRCh37/hg19 Xp22.33-q28(chrX:70297-155255792)
NCBI calls this a "copy number loss" with corresponding HGVS `NC_000023.10:g.(?_70297)_(155255792_?)del`

# Discussion/Questions/Decisions

## What is the location of a repeat?

A repeat should be located over the entire region of the repeated sequence. Rationale: RepeatState is essentially a delins, where the ins is a repeated sequence.
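
Treating a repeat as a delins over the whole repeated region can be sketched on the JAG1 excerpt from the example above. Coordinates here are interbase offsets within the excerpt string, not transcript coordinates, and `apply_repeat` is a hypothetical helper.

```python
def apply_repeat(reference, start, end, unit, copies):
    """Replace reference[start:end] (interbase) with tandem copies of unit."""
    return reference[:start] + unit * copies + reference[end:]


ref = "AAATGGGTGGAAAGGAAAGACCTGCCAC"
# In this excerpt, two GGAAA copies occupy interbase positions 8..18.
assert ref[8:18] == "GGAAA" * 2

print(apply_repeat(ref, 8, 18, "GGAAA", 1))  # AAATGGGTGGAAAGACCTGCCAC
print(apply_repeat(ref, 8, 18, "GGAAA", 0))  # AAATGGGTGACCTGCCAC
```

Because the location spans every existing copy, asserting one copy deletes the other, and asserting zero copies is an ordinary deletion of the region.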

## Should we permit copy number zero?

RepeatState and Abundance may specify zero repeats/counts. 

Rationale: Users may wish to express count ranges that include 0, such as 0 <= count <= 2. Using a distinct deletion state in lieu of 0 would create significant discontinuities in the data model.

## How to describe repeat and abundance source sequences?

Obvious needs: 1) inline? sequence, 2) sequence from SequenceLocation, 3) arbitrary sequence constrained by size (e.g., modulo 3). 

Re: "Approximate" repeats: We have yet to come up with a precise definition of approximate, that is, a characterization of the ways in which a sequence might be approximate.

Reece thinks that it's important that an expression can be matched against a query sequence. It is unclear whether others think that this is important too. A related but distinct issue is whether an expression can generate a sequence (as with a grammar) and whether that's finite.


## Are RepeatState instances normalized?

RepeatState instances are not normalized. Rationale: Data producers may choose a repeat unit that may have meaning, such as starting on a codon boundary. However, in the absence of a reason to choose a particular repeat unit, users should use the left-most representation of the repeat accounting for circular permutations of the repeat unit (as with normalization).
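
Choosing the left-most representation of a repeat can be sketched as rolling the region left while the base entering on the left equals the base leaving on the right. `left_shift_repeat` is a hypothetical helper operating on interbase offsets within a plain string; it assumes the region is already a whole number of tandem copies.

```python
def left_shift_repeat(reference, start, end, unit_len):
    """Shift a tandem-repeat region as far left as the sequence allows.

    Returns (start, end, unit), where unit is the repeat unit read at the
    shifted start.
    """
    while start > 0 and reference[start - 1] == reference[end - 1]:
        start -= 1
        end -= 1
    return start, end, reference[start:start + unit_len]


ref = "AAATGGGTGGAAAGGAAAGACCTGCCAC"
# Two GAAAG copies at interbase 9..19 roll left by one base to GGAAA.
print(left_shift_repeat(ref, 9, 19, 5))  # (8, 18, 'GGAAA')
```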


## What about digest?

In VRS, the digest is always a digest of the data structure.  If the implied sequence is well-defined, the repeat must be converted to an equivalent SequenceState Allele. 
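
As an illustration of digesting a data structure rather than a sequence, the sketch below applies the GA4GH truncated digest (SHA-512, first 24 bytes, base64url) to a JSON blob. The `toy_identify` helper and its sorted-key JSON serialization are assumptions for illustration only; they are not `ga4gh_serialize` or `ga4gh_identify` and will not reproduce the identifiers shown in this notebook.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest truncated to 24 bytes and base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def toy_identify(obj: dict, prefix: str) -> str:
    """Illustrative only: this canonical JSON is NOT ga4gh_serialize."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"ga4gh:{prefix}.{sha512t24u(blob)}"


print(sha512t24u(b""))  # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc

# Key order does not affect the digest because keys are sorted first.
a = toy_identify({"type": "RepeatState", "copies": {"min": 2, "max": 2}}, "XX")
b = toy_identify({"copies": {"max": 2, "min": 2}, "type": "RepeatState"}, "XX")
print(a == b)  # True
```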


##  Can a SequenceLocation be used as a Sequence?


## Should VRS require all properties explicitly?