From 8fc1a1fc780dc8d8df9e3ffa451c67a006262a8a Mon Sep 17 00:00:00 2001 From: Reece Hart Date: Sat, 5 Mar 2022 22:40:45 -0800 Subject: [PATCH 1/8] explain ref agree normalization rules --- docs/source/appendices/design_decisions.rst | 46 +++++++-------------- 1 file changed, 16 insertions(+), 30 deletions(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index c781c046..841873c3 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -32,10 +32,10 @@ Allele Rather than Variant The most primitive sequence assertion in VRS is the :ref:`Allele` entity. Colloquially, the words "allele" and "variant" have similar meanings and they are often used interchangeably. However, the VR -contributors believe that it is essential to distinguish the state of -the sequence from the change between states of a sequence. It is +contributors believe that it is essential to distinguish the *state* of +the sequence from the *change between states* of a sequence. It is imperative that precise terms are used when modelling data. Therefore, -within VRS, Allele refers to a state and "variant" refers to the change +within VRS, "Allele" refers to a state and "variant" refers to the change from one Allele to another. The word "variant", which implies change, makes it awkward to refer to @@ -47,42 +47,28 @@ consequence is better associated with an allele than with a variant. .. _should-normalize: -Implementations should normalize -@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ +Implementations should normalize Alleles +@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized ` when generating :ref:`computed identifiers -`. The rationale for recommending, rather than -requiring, normalization is grounded in dual views of Allele objects -with distinct interpretations: +` unless there is compelling reason to do otherwise. 
+Those reasons are the subject of this section. -* Allele as minimal representation of a change in sequence. In this - view, normalization is a process that makes the representation - minimal and unambiguous. +:ref:`Fully-justified Normalization ` is the process of comparing a span of reference sequence to a sequence state (often the alternative sequence). Normalization consists of two steps: trimming and shuffling. In the trimming step, common flanking prefix and suffix sequences are removed. For example, a CAG-to-CTG Allele would be trimmed to merely A-to-T, with the position adjusted accordingly. There are four cases of the resulting sequences: -* Allele as an assertion of state. In this view, it is reasonable to - want to assert state that may include (or be composed entirely of) - reference bases, for which the normalization process would alter the - intent. + 1. The trimmed sequences are empty: The Allele refers to reference state. + 2. The trimmed sequences are non-empty: The Allele is a substitution (perhaps multi-residue). + 3. The reference sequence is empty: The Allele is a net insertion. + 4. The state sequence is empty: The Allele is a net deletion. -Although this rationale applies only to Alleles, it may have have -parallels with other VRS types. In addition, it is desirable for all -VRS types to be treated similarly. +When the Allele refers to a reference state (case 1), trimming would reduce the variant to a null change. However, reduction to a null state would make it impossible to refer to a specific span of reference sequence. In order to permit users to refer to spans of reference sequence, VRS does not require normalizing reference agreement Alleles. -Furthermore, if normalization were required in order to generate -:ref:`computed-identifiers`, but did not apply to certain instances of -VRS Variation, implementations would likely require secondary -identifier mechanisms, which would undermine the intent of a global -computed identifier. 
+The trimming step applies only when the reference or the state sequences are empty (cases 3 and 4). When these occur in the context of repeating reference sequence that matches the inserted or deleted sequence, the Allele may be shuffled left and right to identify the fully-justified location of the variation. (See :ref:`normalization` for details.) -The primary downside of not requiring normalization is that Variation -objects might be written in non-canonical forms, thereby creating -unintended degeneracy. - -Therefore, normalization of all VRS Variation classes is optional in -order to support the view of Allele as an assertion of state on a -sequence. +In rare cases, data originators might have reason to associate an annotation with a specific repeating unit in the context of repeated sequence. In order to support this case, normalization is not strictly required. +Most users will normalize most Alleles. Normalization should be skipped only when doing so would decrease the intended precision of an Allele. .. _fully-justified: From f832a3475b549ac4ce0717149782bbda18fa31b8 Mon Sep 17 00:00:00 2001 From: Reece Hart Date: Sat, 5 Mar 2022 22:43:11 -0800 Subject: [PATCH 2/8] fill paragraphs for consistency --- docs/source/appendices/design_decisions.rst | 51 +++++++++++++++------ 1 file changed, 37 insertions(+), 14 deletions(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index 841873c3..5513b368 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -52,23 +52,46 @@ Implementations should normalize Alleles VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized ` when generating :ref:`computed identifiers -` unless there is compelling reason to do otherwise. -Those reasons are the subject of this section. 
- -:ref:`Fully-justified Normalization ` is the process of comparing a span of reference sequence to a sequence state (often the alternative sequence). Normalization consists of two steps: trimming and shuffling. In the trimming step, common flanking prefix and suffix sequences are removed. For example, a CAG-to-CTG Allele would be trimmed to merely A-to-T, with the position adjusted accordingly. There are four cases of the resulting sequences: - - 1. The trimmed sequences are empty: The Allele refers to reference state. - 2. The trimmed sequences are non-empty: The Allele is a substitution (perhaps multi-residue). +` unless there is compelling reason to do +otherwise. Those reasons are the subject of this section. + +:ref:`Fully-justified Normalization ` is the process of +comparing a span of reference sequence to a sequence state (often the +alternative sequence). Normalization consists of two steps: trimming +and shuffling. In the trimming step, common flanking prefix and +suffix sequences are removed. For example, a CAG-to-CTG Allele would +be trimmed to merely A-to-T, with the position adjusted accordingly. +There are four cases of the resulting sequences: + + 1. The trimmed sequences are empty: The Allele refers to reference + state. + 2. The trimmed sequences are non-empty: The Allele is a substitution + (perhaps multi-residue). 3. The reference sequence is empty: The Allele is a net insertion. 4. The state sequence is empty: The Allele is a net deletion. -When the Allele refers to a reference state (case 1), trimming would reduce the variant to a null change. However, reduction to a null state would make it impossible to refer to a specific span of reference sequence. In order to permit users to refer to spans of reference sequence, VRS does not require normalizing reference agreement Alleles. - -The trimming step applies only when the reference or the state sequences are empty (cases 3 and 4). 
When these occur in the context of repeating reference sequence that matches the inserted or deleted sequence, the Allele may be shuffled left and right to identify the fully-justified location of the variation. (See :ref:`normalization` for details.) - -In rare cases, data originators might have reason to associate an annotation with a specific repeating unit in the context of repeated sequence. In order to support this case, normalization is not strictly required. - -Most users will normalize most Alleles. Normalization should be skipped only when doing so would decrease the intended precision of an Allele. +When the Allele refers to a reference state (case 1), trimming would +reduce the variant to a null change. However, reduction to a null +state would make it impossible to refer to a specific span of +reference sequence. In order to permit users to refer to spans of +reference sequence, VRS does not require normalizing reference +agreement Alleles. + +The shuffling step applies only when the reference or the state +sequences are empty (cases 3 and 4). When these occur in the context +of repeating reference sequence that matches the inserted or deleted +sequence, the Allele may be shuffled left and right to identify the +fully-justified location of the variation. (See :ref:`normalization` +for details.) + +In rare cases, data originators might have reason to associate an +annotation with a specific repeating unit in the context of repeated +sequence. In order to support this case, normalization is not +strictly required. + +Most users will normalize most Alleles. Normalization should be +skipped only when doing so would decrease the intended precision of an +Allele. .. 
_fully-justified: From 2a7521fcb1df438b839967d1393f15c8e8f174e1 Mon Sep 17 00:00:00 2001 From: Reece Hart Date: Sat, 5 Mar 2022 22:44:14 -0800 Subject: [PATCH 3/8] swapped order of FJ and normalization rational design decisions --- docs/source/appendices/design_decisions.rst | 57 +++++++++++---------- 1 file changed, 29 insertions(+), 28 deletions(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index 5513b368..a2c76306 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -45,6 +45,35 @@ when referring to an unchanged residue. In some cases, such "variants" are even associated with allele frequencies. Similarly, a predicted consequence is better associated with an allele than with a variant. + +.. _fully-justified: + +Alleles are Fully Justified +@@@@@@@@@@@@@@@@@@@@@@@@@@@ + +In order to standardize the representation of sequence variation, +Alleles SHOULD be fully justified from the description of the NCBI +`Variant Overprecision Correction Algorithm (VOCA)`_. Furthermore, +normalization rules are identical for all sequence types (DNA, RNA, +and protein). + +The choice of algorithm was relatively straightforward: VOCA is +published, easily understood, easily implemented, and +covers a wide range of cases. + +The choice to fully justify is a departure from other common variation +formats. The HGVS nomenclature recommendations, originally published in +1998, require that alleles be right normalized `(3' rule)`_ on all sequence +types. The Variant Call Format (VCF), released as a PDF specification +in 2009, made the conflicting choice to write variants `left (5') +normalized`_ and anchored to the previous nucleotide. + +Fully-justified alleles represent an alternate approach. 
A fully-justified +representation does not make an arbitrary choice of where a variant truly +occurs in a low-complexity region, but rather describes the final and +unambiguous state of the resultant sequence. + + .. _should-normalize: Implementations should normalize Alleles @@ -94,34 +123,6 @@ skipped only when doing so would decrease the intended precision of an Allele. -.. _fully-justified: - -Alleles are Fully Justified -@@@@@@@@@@@@@@@@@@@@@@@@@@@ - -In order to standardize the representation of sequence variation, -Alleles SHOULD be fully justified from the description of the NCBI -`Variant Overprecision Correction Algorithm (VOCA)`_. Furthermore, -normalization rules are identical for all sequence types (DNA, RNA, -and protein). - -The choice of algorithm was relatively straightforward: VOCA is -published, easily understood, easily implemented, and -covers a wide range of cases. - -The choice to fully justify is a departure from other common variation -formats. The HGVS nomenclature recommendations, originally published in -1998, require that alleles be right normalized `(3' rule)`_ on all sequence -types. The Variant Call Format (VCF), released as a PDF specification -in 2009, made the conflicting choice to write variants `left (5') -normalized`_ and anchored to the previous nucleotide. - -Fully-justified alleles represent an alternate approach. A fully-justified -representation does not make an arbitrary choice of where a variant truly -occurs in a low-complexity region, but rather describes the final and -unambiguous state of the resultant sequence. - - .. _inter-residue-coordinates-design: Inter-residue Coordinates From b12347cabf0584e7a43b1ca7a2bf6807e20ce1aa Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Fri, 8 Apr 2022 20:45:31 -0400 Subject: [PATCH 4/8] Update docs/source/appendices/design_decisions.rst --- docs/source/appendices/design_decisions.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index a2c76306..14d6f151 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -32,8 +32,8 @@ Allele Rather than Variant The most primitive sequence assertion in VRS is the :ref:`Allele` entity. Colloquially, the words "allele" and "variant" have similar meanings and they are often used interchangeably. However, the VR -contributors believe that it is essential to distinguish the *state* of -the sequence from the *change between states* of a sequence. It is +contributors assert that it is essential to distinguish between the *state of* +a reference sequence from the *change from* a reference sequence. It is imperative that precise terms are used when modelling data. Therefore, within VRS, "Allele" refers to a state and "variant" refers to the change from one Allele to another. From 4bc9ab7c9da2e37cc5b27b6b0730f8bb63033a00 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Fri, 8 Apr 2022 20:45:44 -0400 Subject: [PATCH 5/8] Update docs/source/appendices/design_decisions.rst --- docs/source/appendices/design_decisions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index 14d6f151..4edef074 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -35,7 +35,7 @@ meanings and they are often used interchangeably. However, the VR contributors assert that it is essential to distinguish between the *state of* a reference sequence from the *change from* a reference sequence. 
It is imperative that precise terms are used when modelling data. Therefore, -within VRS, "Allele" refers to a state and "variant" refers to the change +within VRS, "allele" refers to a state of a reference sequence and "variant" refers to a change from one Allele to another. The word "variant", which implies change, makes it awkward to refer to From 99e001af6784a098c21e4067a6c8ec7367c64bf5 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Fri, 8 Apr 2022 20:46:01 -0400 Subject: [PATCH 6/8] Update docs/source/appendices/design_decisions.rst --- docs/source/appendices/design_decisions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index 4edef074..d81c6588 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -36,7 +36,7 @@ contributors assert that it is essential to distinguish between the *state of* a reference sequence from the *change from* a reference sequence. It is imperative that precise terms are used when modelling data. Therefore, within VRS, "allele" refers to a state of a reference sequence and "variant" refers to a change -from one Allele to another. +from a reference sequence. The word "variant", which implies change, makes it awkward to refer to the (unchanged) reference allele. Some systems will use an HGVS-like From 8dd547de08e40d367aca93fdbfbce73187ec45c8 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Fri, 8 Apr 2022 20:46:12 -0400 Subject: [PATCH 7/8] Update docs/source/appendices/design_decisions.rst --- docs/source/appendices/design_decisions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index d81c6588..b437cd24 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -84,7 +84,7 @@ VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized ` unless there is compelling reason to do otherwise. Those reasons are the subject of this section. -:ref:`Fully-justified Normalization ` is the process of +:ref:`Allele Normalization ` is the process of comparing a span of reference sequence to a sequence state (often the alternative sequence). Normalization consists of two steps: trimming and shuffling. In the trimming step, common flanking prefix and From 4ce25041cf8f2b75cd60668cc4651507439e9215 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Fri, 8 Apr 2022 20:46:25 -0400 Subject: [PATCH 8/8] Update docs/source/appendices/design_decisions.rst --- docs/source/appendices/design_decisions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst index b437cd24..b42a04d5 100644 --- a/docs/source/appendices/design_decisions.rst +++ b/docs/source/appendices/design_decisions.rst @@ -86,7 +86,7 @@ otherwise. Those reasons are the subject of this section. :ref:`Allele Normalization ` is the process of comparing a span of reference sequence to a sequence state (often the -alternative sequence). Normalization consists of two steps: trimming +alternative sequence) and resolving that span to an unambiguous form. The fully-justified Allele normalization in VRS consists of two steps: trimming and shuffling. In the trimming step, common flanking prefix and suffix sequences are removed. 
For example, a CAG-to-CTG Allele would be trimmed to merely A-to-T, with the position adjusted accordingly.
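The trimming step and the four resulting cases described in this patch series can be sketched in Python. This is only an illustration, not the VRS reference implementation: the names ``TrimmedAllele`` and ``trim_allele`` are hypothetical, positions are assumed to be inter-residue coordinates, and trimming the prefix before the suffix is an arbitrary choice made for this sketch.

```python
from typing import NamedTuple


class TrimmedAllele(NamedTuple):
    """Result of trimming (hypothetical type, for illustration only)."""
    start: int  # assumed inter-residue start of the trimmed reference span
    end: int    # assumed inter-residue end of the trimmed reference span
    ref: str    # trimmed reference sequence
    alt: str    # trimmed state (alternative) sequence
    kind: str   # which of the four cases applies


def trim_allele(start: int, ref: str, alt: str) -> TrimmedAllele:
    """Remove common flanking prefix and suffix, then classify the result."""
    # Count and drop the shared prefix, shifting the start position right.
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    ref, alt, start = ref[n:], alt[n:], start + n

    # Count and drop the shared suffix of what remains.
    m = 0
    while m < len(ref) and m < len(alt) and ref[-1 - m] == alt[-1 - m]:
        m += 1
    ref, alt = ref[:len(ref) - m], alt[:len(alt) - m]
    end = start + len(ref)

    if not ref and not alt:
        kind = "reference agreement"  # case 1: both sequences empty
    elif ref and alt:
        kind = "substitution"         # case 2: both sequences non-empty
    elif not ref:
        kind = "insertion"            # case 3: reference sequence empty
    else:
        kind = "deletion"             # case 4: state sequence empty
    return TrimmedAllele(start, end, ref, alt, kind)


if __name__ == "__main__":
    # The CAG-to-CTG example from the text, placed at an arbitrary position.
    print(trim_allele(10, "CAG", "CTG"))
```

In this sketch a reference-agreement input such as CAG-to-CAG collapses to an empty span, which is the loss of information the text gives as the reason VRS does not require normalizing reference agreement Alleles. Shuffling is not shown; it would only be attempted for the insertion and deletion cases.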