explain ref agree normalization rules #381

Merged 9 commits on Apr 13, 2022
96 changes: 53 additions & 43 deletions docs/source/appendices/design_decisions.rst
@@ -32,11 +32,11 @@ Allele Rather than Variant
The most primitive sequence assertion in VRS is the :ref:`Allele`
entity. Colloquially, the words "allele" and "variant" have similar
meanings and they are often used interchangeably. However, the VR
contributors believe that it is essential to distinguish the state of
the sequence from the change between states of a sequence. It is
contributors assert that it is essential to distinguish the *state of*
a reference sequence from the *change from* a reference sequence. It is
imperative that precise terms are used when modelling data. Therefore,
within VRS, Allele refers to a state and "variant" refers to the change
from one Allele to another.
within VRS, "allele" refers to a state of a reference sequence and "variant" refers to a change
from a reference sequence.

The word "variant", which implies change, makes it awkward to refer to
the (unchanged) reference allele. Some systems will use an HGVS-like
@@ -45,45 +45,6 @@ when referring to an unchanged residue. In some cases, such "variants"
are even associated with allele frequencies. Similarly, a predicted
consequence is better associated with an allele than with a variant.

.. _should-normalize:

Implementations should normalize
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized
<normalization>` when generating :ref:`computed identifiers
<computed-identifiers>`. The rationale for recommending, rather than
requiring, normalization is grounded in dual views of Allele objects
with distinct interpretations:

* Allele as minimal representation of a change in sequence. In this
  view, normalization is a process that makes the representation
  minimal and unambiguous.

* Allele as an assertion of state. In this view, it is reasonable to
  want to assert state that may include (or be composed entirely of)
  reference bases, for which the normalization process would alter the
  intent.

Although this rationale applies only to Alleles, it may have
parallels with other VRS types. In addition, it is desirable for all
VRS types to be treated similarly.

Furthermore, if normalization were required in order to generate
:ref:`computed-identifiers`, but did not apply to certain instances of
VRS Variation, implementations would likely require secondary
identifier mechanisms, which would undermine the intent of a global
computed identifier.

The primary downside of not requiring normalization is that Variation
objects might be written in non-canonical forms, thereby creating
unintended degeneracy.

Therefore, normalization of all VRS Variation classes is optional in
order to support the view of Allele as an assertion of state on a
sequence.



.. _fully-justified:

@@ -113,6 +74,55 @@ occurs in a low-complexity region, but rather describes the final and
unambiguous state of the resultant sequence.


.. _should-normalize:

Implementations should normalize Alleles
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized
<normalization>` when generating :ref:`computed identifiers
<computed-identifiers>` unless there is compelling reason to do
otherwise. Those reasons are the subject of this section.

:ref:`Allele Normalization <normalization>` is the process of
comparing a span of reference sequence to a sequence state (often the
alternative sequence) and resolving that span to an unambiguous form.
The fully-justified Allele normalization in VRS consists of two steps:
trimming and shuffling. In the trimming step, common flanking prefix
and suffix sequences are removed. For example, a CAG-to-CTG Allele
would be trimmed to merely A-to-T, with the position adjusted
accordingly. After trimming, the resulting sequences fall into one of
four cases:

1. Both trimmed sequences are empty: The Allele refers to reference
   state.
2. Both trimmed sequences are non-empty: The Allele is a substitution
   (perhaps multi-residue).
3. Only the reference sequence is empty: The Allele is a net insertion.
4. Only the state sequence is empty: The Allele is a net deletion.
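
The trimming step and the four cases can be illustrated with a minimal
sketch. The function names ``trim`` and ``classify`` are hypothetical
and are not part of any VRS implementation; coordinates are assumed to
be inter-residue, and trimming the prefix before the suffix is an
arbitrary choice made for this example.

```python
def trim(ref: str, alt: str, start: int) -> tuple[str, str, int]:
    """Strip the flanking prefix and suffix shared by ref and alt.

    `start` is the (assumed inter-residue) start coordinate of `ref`.
    Returns (trimmed_ref, trimmed_alt, adjusted_start).
    """
    # Count the shared prefix and move the start coordinate past it.
    p = 0
    while p < len(ref) and p < len(alt) and ref[p] == alt[p]:
        p += 1
    ref, alt, start = ref[p:], alt[p:], start + p

    # Count the shared suffix of what remains.
    s = 0
    while s < len(ref) and s < len(alt) and ref[-1 - s] == alt[-1 - s]:
        s += 1
    if s:
        ref, alt = ref[:-s], alt[:-s]
    return ref, alt, start


def classify(ref: str, alt: str) -> str:
    """Map trimmed sequences onto the four cases listed above."""
    if not ref and not alt:
        return "reference agreement"  # case 1
    if ref and alt:
        return "substitution"         # case 2
    if not ref:
        return "insertion"            # case 3
    return "deletion"                 # case 4


# The CAG-to-CTG example from the text trims to A-to-T, one residue in.
print(trim("CAG", "CTG", 10))
```

Note that trimming a reference-agreement Allele such as CAG-to-CAG
leaves two empty sequences, so the original span can no longer be
recovered from the trimmed result.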

When the Allele refers to a reference state (case 1), trimming would
reduce the variant to a null change. However, reduction to a null
state would make it impossible to refer to a specific span of
reference sequence. In order to permit users to refer to spans of
reference sequence, VRS does not require normalizing reference
agreement Alleles.

The shuffling step applies only when either the reference or the state
sequence is empty (cases 3 and 4). When such an Allele occurs within
repeating reference sequence that matches the inserted or deleted
sequence, the Allele may be shuffled left and right to identify the
fully-justified location of the variation. (See :ref:`normalization`
for details.)
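
As an illustration of shuffling, the following sketch expands a trimmed
insertion or deletion to its fully-justified span. It is a simplified
example under stated assumptions, not the normative algorithm:
``fully_justify`` is a hypothetical name, the reference is held as an
in-memory Python string, and coordinates are inter-residue.

```python
def fully_justify(seq: str, start: int, end: int, alt: str):
    """Expand a trimmed insertion/deletion over matching repeats.

    `seq` is the reference sequence; seq[start:end] is the trimmed
    reference span and `alt` is the trimmed state sequence.
    Returns (new_start, new_end, new_ref, new_alt).
    """
    ref = seq[start:end]
    if bool(ref) == bool(alt):
        # Substitution or reference agreement: nothing to shuffle.
        return start, end, ref, alt

    unit = ref or alt  # the inserted or deleted sequence
    n = len(unit)

    # Shuffle left while the preceding reference base matches the
    # (circularly rotated) end of the unit.
    left = 0
    while start - 1 - left >= 0 and seq[start - 1 - left] == unit[(-1 - left) % n]:
        left += 1

    # Shuffle right while the following reference base matches the
    # (circularly rotated) start of the unit.
    right = 0
    while end + right < len(seq) and seq[end + right] == unit[right % n]:
        right += 1

    new_start, new_end = start - left, end + right
    new_ref = seq[new_start:new_end]
    new_alt = seq[new_start:start] + alt + seq[end:new_end]
    return new_start, new_end, new_ref, new_alt


# Deleting the middle CAG from a run of three CAG repeats.
reference = "GTCAGCAGCAGTA"
print(fully_justify(reference, 5, 8, ""))
```

In this example, deleting any one of the three CAG units yields the same
fully-justified Allele, which covers the whole repeat region
(CAGCAGCAG-to-CAGCAG).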

In rare cases, data originators might have reason to associate an
annotation with a specific repeating unit in the context of repeated
sequence. In order to support this case, normalization is not
strictly required.

Most users will normalize most Alleles. Normalization should be
skipped only when normalizing would reduce the intended precision of an
Allele.


.. _inter-residue-coordinates-design:

Inter-residue Coordinates