# Repeats
Repeats represent tandem copies of a sequence of any length (length >= 1). In VRS, they are an assertion of state and not necessarily a change with respect to the reference sequence.

Repeats are implemented using the Allele class with a RepeatState applied to a Location. The RepeatState conveys both the repeat count and the repeat sequence.

The number of copies may be known precisely, or may be specified as a range (min, max), which is essential for certain applications (e.g., Huntington's disease CAG repeat sizes).
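As a rough illustration of the exact-versus-range distinction, the sketch below models a copy count with plain Python. `CopyRange` and its methods are hypothetical names introduced here, not part of `ga4gh.vrs`; the sketch also assumes bounds are inclusive and that a missing `max` means "no known upper bound".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CopyRange:
    """Hypothetical helper: inclusive bounds on the number of tandem copies."""
    min: int
    max: Optional[int] = None  # None is taken to mean "no known upper bound"

    def __post_init__(self):
        if self.min < 0:
            raise ValueError("min must be non-negative")
        if self.max is not None and self.max < self.min:
            raise ValueError("max must be >= min")

    @classmethod
    def exact(cls, n: int) -> "CopyRange":
        # A precisely known count is a range whose bounds coincide.
        return cls(min=n, max=n)

    def is_exact(self) -> bool:
        return self.max == self.min

    def contains(self, n: int) -> bool:
        return n >= self.min and (self.max is None or n <= self.max)


measured = CopyRange(min=5, max=10)
precise = CopyRange.exact(7)
print(measured.contains(7), measured.is_exact(), precise.is_exact())
```

Treating an exact count as a degenerate range keeps a single code path for both cases.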

The repeat sequence may be verbatim or a derived/approximated sequence.

For a precise repeat length with a fixed sequence, Repeats are equivalent to Alleles with a single SequenceState.
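The equivalence can be sketched with plain dictionaries: when the count is exact and the repeat unit is a literal string, the implied sequence is the unit concatenated `copies` times. The function `to_sequence_state` is a hypothetical helper for illustration, and the dictionary shapes simply mirror the `as_dict()` output shown later in this notebook.

```python
def to_sequence_state(repeat_state: dict) -> dict:
    """Hypothetical helper: expand an exact, literal repeat into a SequenceState dict."""
    copies = repeat_state["copies"]
    unit = repeat_state["sequence"]
    if copies["min"] != copies["max"]:
        raise ValueError("copy count is a range; no single implied sequence")
    if not isinstance(unit, str):
        raise ValueError("repeat unit is not a literal sequence")
    # Tandem copies of a literal unit are plain string repetition.
    return {"type": "SequenceState", "sequence": unit * copies["min"]}


repeat = {"type": "RepeatState", "sequence": "CAG", "copies": {"min": 3, "max": 3}}
expanded = to_sequence_state(repeat)
print(expanded)
# {'type': 'SequenceState', 'sequence': 'CAGCAGCAG'}
```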

Repeats should be used only when copies are known to be tandem; if the location of repeats is unknown or genome-wide, Abundance should be used. 

## Notes

## Q & A

### What is the location of a repeat?

* Option 1: Location of first repeat unit.
* Option 2: Zero-width position immediately before first repeat unit.


### How to differentiate exact v. approximate repeats v. sized repeats?
### How to differentiate repeat sources?

* Literal Sequence (string, as with SequenceState now)
* Sequence from SequenceLocation
* Approximate Sequence
* Sized Sequence: length/length range, w/ modulus
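One way to see how these sources differ is to ask what each would accept as a repeat unit. The sketch below is an assumption-laden illustration, not the VRS model: the `kind` tag and `unit_matches` are hypothetical, approximate sequences are stood in for by a regular expression, and a sized sequence is assumed to mean "length within [min, max] and a multiple of `modulus`". A SequenceLocation source is omitted because resolving it requires access to sequence data.

```python
import re


def unit_matches(source: dict, unit: str) -> bool:
    """Hypothetical check: does `unit` satisfy the given repeat-unit source?"""
    kind = source["kind"]
    if kind == "literal":
        return unit == source["sequence"]
    if kind == "regexp":
        # Stand-in for an approximate sequence.
        return re.fullmatch(source["regexp"], unit) is not None
    if kind == "sized":
        n = len(unit)
        upper = source.get("max")  # None is taken to mean "unbounded"
        in_range = n >= source["min"] and (upper is None or n <= upper)
        # Assumed meaning of modulus: the length must be a multiple of it.
        return in_range and n % source.get("modulus", 1) == 0
    raise ValueError(f"unknown source kind: {kind}")


print(unit_matches({"kind": "literal", "sequence": "CAG"}, "CAG"))       # True
print(unit_matches({"kind": "regexp", "regexp": "C[AC]G"}, "CCG"))       # True
print(unit_matches({"kind": "regexp", "regexp": "C[AC]G"}, "CTG"))       # False
print(unit_matches({"kind": "sized", "min": 3, "modulus": 3}, "CAGT"))   # False
```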

```
RepeatState(
  copies={...},
  sequence=
  
    # door 1:
    * literal sequence ("CAG")
    * location
    * sequence_id
    * ApproximateSequence(sequence=literal or location or sequence_id)

    # door 2:
    * DerivedSequence(source=(literal or sequence_id or SequenceLocation),
      qualifier=(exact, approximate, ...))
    
    # door 3:
    * DerivedSequence(source=sequencelocation)
    * ApproximateSequence(sequence=(...))
    
    AS(DS(...))
 
    door 3 pro: clearer semantics of intermediate objects
    
    # door 4: do it with jsonschema composition
    * GeneralizedSequence := (literal | location | sequence_id)
 
  )
```

### What about digest?

In VRS, the digest is always a digest of the data structure. If the implied sequence is well-defined, the repeat must be converted to an equivalent SequenceState Allele.
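The reason for converting first can be sketched as follows: a structural digest differs for two representations of the same state unless one is normalized to the other. The truncated digest below follows the GA4GH `sha512t24u` definition (SHA-512, first 24 bytes, base64url). The serialization is a deliberately simplified stand-in (sorted-key compact JSON), not the actual `ga4gh_serialize` rules, and `normalize_state` is a hypothetical helper.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest truncated to 24 bytes and base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def structural_digest(obj: dict) -> str:
    # Simplified canonical form; an assumption, not the real ga4gh_serialize.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


def normalize_state(state: dict) -> dict:
    """Hypothetical: rewrite an exact, literal RepeatState as a SequenceState."""
    if state["type"] == "RepeatState":
        copies = state["copies"]
        if isinstance(state["sequence"], str) and copies["min"] == copies["max"]:
            return {"type": "SequenceState", "sequence": state["sequence"] * copies["min"]}
    return state


as_repeat = {"type": "RepeatState", "sequence": "CAG", "copies": {"min": 3, "max": 3}}
as_literal = {"type": "SequenceState", "sequence": "CAGCAGCAG"}

# Same biological state, different structures, so the raw digests differ.
print(structural_digest(as_repeat) == structural_digest(as_literal))
# After normalization the structures, and hence the digests, agree.
print(structural_digest(normalize_state(as_repeat)) == structural_digest(as_literal))
```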

### Is 0 a valid repeat count?

If Yes, isn't this the same as a del? If No, what are the implications for searching? (e.g., "sequences with 3 or fewer repeats" should probably include zero repeats)
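The search concern can be illustrated with a small range-overlap query. The records and the helper `ranges_overlap` are hypothetical, and bounds are assumed to be inclusive.

```python
def ranges_overlap(a_min: int, a_max: int, b_min: int, b_max: int) -> bool:
    """True if the inclusive ranges [a_min, a_max] and [b_min, b_max] intersect."""
    return a_min <= b_max and b_min <= a_max


# Hypothetical stored copy ranges, keyed by sample name.
records = {
    "sample_a": (0, 0),   # zero copies of the repeat unit
    "sample_b": (2, 4),
    "sample_c": (5, 10),
}

# Query: "3 or fewer repeats", i.e. the range [0, 3].
hits = [name for name, (lo, hi) in records.items() if ranges_overlap(lo, hi, 0, 3)]
print(hits)  # ['sample_a', 'sample_b']

# Zero copies of a literal unit imply an empty sequence, as a deletion would.
print("CAG" * 0 == "")  # True
```

If zero were disallowed, `sample_a` would have to be stored as a deletion, and a query expressed purely over repeat counts would miss it.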

### Can a SequenceLocation be used as a Sequence?


```python
from ga4gh.core import ga4gh_identify, ga4gh_serialize
from ga4gh.vrs import models
from nbsupport import ppo, translate_sequence_identifier
```

## RepeatState

```python
models.RepeatState(
    sequence="CAG",
    copies={"min": 5, "max": 10}).as_dict()
```

```
{'copies': {'max': 10, 'min': 5}, 'sequence': 'CAG', 'type': 'RepeatState'}
```

```python
sl = models.SequenceLocation(
    sequence_id="ga4gh:SQ.abc123",
    interval=models.SimpleInterval(start=20, end=30))
rs = models.RepeatState(sequence=sl, copies={"min": 5, "max": 10})
rs.as_dict()
```

```
{'copies': {'max': 10, 'min': 5},
 'sequence': {'interval': {'end': 30, 'start': 20, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ.abc123',
  'type': 'SequenceLocation'},
 'type': 'RepeatState'}
```

```python
rs = models.RepeatState(
    sequence=models.RegularExpression(regexp="C[AC]G"),
    copies={"min": 5, "max": 10})
rs.as_dict()
```

```
{'copies': {'max': 10, 'min': 5},
 'sequence': {'regexp': 'C[AC]G'},
 'type': 'RepeatState'}
```

```python
rs = models.RepeatState(
    sequence=models.SizedSequence(min=3),
    copies={"min": 5, "max": 10})
rs.as_dict()
```

```
{'copies': {'max': 10, 'min': 5},
 'sequence': {'min': 3, 'modulus': 1},
 'type': 'RepeatState'}
```