Merge pull request #418 from ga4gh/main

v1.3 release candidate
ga4gh · Apr 16, 2023 · ca83d84
2 parents 8a02415 + c4595ae
commit ca83d84
Showing 23 changed files with 1,341 additions and 896 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -20,7 +20,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip setuptools
-          pip install -r .requirements.txt
+          pip install --pre -r .requirements.txt
 
       - name: Test with pytest
         run: |

diff --git a/.requirements.txt b/.requirements.txt
@@ -1,7 +1,8 @@
 pytest
-python-jsonschema-objects>=0.3,<=0.3.10
+python-jsonschema-objects>=0.4.0
 jsonschema==3.2.0
 ipython
 pyyaml
-ga4gh.gks.metaschema>=0.1.1
-sphinx ~= 3.5
+ga4gh.gks.metaschema==0.2.0rc4
+sphinx ~= 4.5
+sphinx-rtd-theme ~= 1.2
diff --git a/docs/source/appendices/design_decisions.rst b/docs/source/appendices/design_decisions.rst
@@ -32,11 +32,11 @@ Allele Rather than Variant
 The most primitive sequence assertion in VRS is the :ref:`Allele`
 entity. Colloquially, the words "allele" and "variant" have similar
 meanings and they are often used interchangeably. However, the VR
-contributors believe that it is essential to distinguish the state of
-the sequence from the change between states of a sequence. It is
+contributors assert that it is essential to distinguish between the *state of*
+a reference sequence from the *change from* a reference sequence. It is
 imperative that precise terms are used when modelling data. Therefore,
-within VRS, Allele refers to a state and "variant" refers to the change
-from one Allele to another.
+within VRS, "allele" refers to a state of a reference sequence and "variant" refers to a change
+from a reference sequence.
 
 The word "variant", which implies change, makes it awkward to refer to
 the (unchanged) reference allele. Some systems will use an HGVS-like
@@ -45,45 +45,6 @@ when referring to an unchanged residue. In some cases, such "variants"
 are even associated with allele frequencies. Similarly, a predicted
 consequence is better associated with an allele than with a variant.
 
-.. _should-normalize:
-
-Implementations should normalize
-@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-
-VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized
-<normalization>` when generating :ref:`computed identifiers
-<computed-identifiers>`. The rationale for recommending, rather than
-requiring, normalization is grounded in dual views of Allele objects
-with distinct interpretations:
-
-* Allele as minimal representation of a change in sequence. In this
-  view, normalization is a process that makes the representation
-  minimal and unambiguous.
-
-* Allele as an assertion of state. In this view, it is reasonable to
-  want to assert state that may include (or be composed entirely of)
-  reference bases, for which the normalization process would alter the
-  intent.
-
-Although this rationale applies only to Alleles, it may have have
-parallels with other VRS types. In addition, it is desirable for all
-VRS types to be treated similarly.
-
-Furthermore, if normalization were required in order to generate
-:ref:`computed-identifiers`, but did not apply to certain instances of
-VRS Variation, implementations would likely require secondary
-identifier mechanisms, which would undermine the intent of a global
-computed identifier.
-
-The primary downside of not requiring normalization is that Variation
-objects might be written in non-canonical forms, thereby creating
-unintended degeneracy.
-
-Therefore, normalization of all VRS Variation classes is optional in
-order to support the view of Allele as an assertion of state on a
-sequence.
-
-
 
 .. _fully-justified:
 
@@ -113,6 +74,55 @@ occurs in a low-complexity region, but rather describes the final and
 unambiguous state of the resultant sequence.
 
 
+.. _should-normalize:
+
+Implementations should normalize Alleles
+@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+
+VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized
+<normalization>` when generating :ref:`computed identifiers
+<computed-identifiers>` unless there is compelling reason to do
+otherwise.  Those reasons are the subject of this section.
+
+:ref:`Allele Normalization <normalization>` is the process of
+comparing a span of reference sequence to a sequence state (often the
+alternative sequence) and resolving that span to an unambiguous form.  The fully-justified Allele normalization in VRS consists of two steps: trimming
+and shuffling.  In the trimming step, common flanking prefix and
+suffix sequences are removed.  For example, a CAG-to-CTG Allele would
+be trimmed to merely A-to-T, with the position adjusted accordingly.
+There are four cases of the resulting sequences:
+
+  1. The trimmed sequences are empty: The Allele refers to reference
+     state.
+  2. The trimmed sequences are non-empty: The Allele is a substitution
+     (perhaps multi-residue).
+  3. The reference sequence is empty: The Allele is a net insertion.
+  4. The state sequence is empty: The Allele is a net deletion.
+
+When the Allele refers to a reference state (case 1), trimming would
+reduce the variant to a null change.  However, reduction to a null
+state would make it impossible to refer to a specific span of
+reference sequence. In order to permit users to refer to spans of
+reference sequence, VRS does not require normalizing reference
+agreement Alleles.
+
+The trimming step applies only when the reference or the state
+sequences are empty (cases 3 and 4).  When these occur in the context
+of repeating reference sequence that matches the inserted or deleted
+sequence, the Allele may be shuffled left and right to identify the
+fully-justified location of the variation. (See :ref:`normalization`
+for details.)
+
+In rare cases, data originators might have reason to associate an
+annotation with a specific repeating unit in the context of repeated
+sequence.  In order to support this case, normalization is not
+strictly required.
+
+Most users will normalize most Alleles.  Normalization should be
+skipped only when doing so would decrease the intended precision of an
+Allele.
+
+
 .. _inter-residue-coordinates-design:
 
 Inter-residue Coordinates

diff --git a/docs/source/appendices/future_plans.rst b/docs/source/appendices/future_plans.rst
@@ -96,129 +96,6 @@ Under consideration. See https://github.com/ga4gh/vrs/issues/28.
 t(9;22)(q34;q11) in BCR-ABL
 
 
-.. _genotype:
-
-Genotype
-########
-
-The genetic state of an organism, whether complete (defined over the
-whole genome) or incomplete (defined over a subset of the genome).
-
-**Computational definition**
-
-A list of Haplotypes.
-
-**Information model**
-
-.. list-table::
-   :class: reece-wrap
-   :header-rows: 1
-   :align: left
-   :widths: auto
-
-   * - Field
-     - Type
-     - Limits
-     - Description
-   * - _id
-     - :ref:`CURIE`
-     - 0..1
-     - Variation Id; MUST be unique within document
-   * - type
-     - string
-     - 1..1
-     - Variation type; MUST be set to '**Genotype**'
-   * - completeness
-     - enum
-     - 1..1
-     - Declaration of completeness of the Haplotype definition.
-       Values are:
-
-       * UNKNOWN: Other Haplotypes may exist.
-       * PARTIAL: Other Haplotypes exist but are unspecified.
-       * COMPLETE: The Genotype declares a complete set of Haplotypes.
-
-   * - members
-     - :ref:`Haplotype`\[] or :ref:`CURIE`\[]
-     - 0..*
-     - List of Haplotypes or Haplotype identifiers; length MUST agree
-       with ploidy of genomic region
-
-
-**Implementation guidance**
-
-* Haplotypes in a Genotype MAY occur at different locations or on
-  different reference sequences. For example, an individual may have
-  haplotypes on two population-specific references.
-* Haplotypes in a Genotype MAY contain differing numbers of Alleles or
-  Alleles at different Locations.
-
-**Notes**
-
-* The term "genotype" has two, related definitions in common use. The
-  narrower definition is a set of alleles observed at a single
-  location and with a ploidy of two, such as a pair of single residue
-  variants on an autosome. The broader, generalized definition is a
-  set of alleles at multiple locations and/or with ploidy other than
-  two.The VRS Genotype entity is based on this broader definition.
-* The term "diplotype" is often used to refer to two haplotypes. The
-  VRS Genotype entity subsumes the conventional definition of
-  diplotype. Therefore, the VRS model does not include an explicit
-  entity for diplotypes. See :ref:`this note
-  <genotypes-represent-haplotypes-with-arbitrary-ploidy>` for a
-  discussion.
-* The VRS model makes no assumptions about ploidy of an organism or
-  individual. The number of Haplotypes in a Genotype is the observed
-  ploidy of the individual.
-* In diploid organisms, there are typically two instances of each
-  autosomal chromosome, and therefore two instances of sequence at a
-  particular location. Thus, Genotypes will often list two
-  Haplotypes. In the case of haploid chromosomes or
-  haploinsufficiency, the Genotype consists of a single Haplotype.
-* A consequence of the computational definition is that Haplotypes at
-  overlapping or adjacent intervals MUST NOT be included in the same
-  Genotype. However, two or more Alleles MAY always be rewritten as an
-  equivalent Allele with a common sequence and interval context.
-* The rationale for permitting Genotypes with Haplotypes defined on
-  different reference sequences is to enable the accurate
-  representation of segments of DNA with the most appropriate
-  population-specific reference sequence.
-
-**Sources**
-
-SO: `Genotype (SO:0001027)
-<http://www.sequenceontology.org/browser/current_svn/term/SO:0001027>`__
-— A genotype is a variant genome, complete or incomplete.
-
-.. _genotypes-represent-haplotypes-with-arbitrary-ploidy:
-
-.. note:: Genotypes represent Haplotypes with arbitrary ploidy
-     The VRS defines Haplotypes as a list of Alleles, and Genotypes as
-     a list of Haplotypes. In essence, Haplotypes and Genotypes represent
-     two distinct dimensions of containment: Haplotypes represent the "in
-     phase" relationship of Alleles while Genotypes represents sets of
-     Haplotypes of arbitrary ploidy.
-
-     There are two important consequences of these definitions: There is no
-     single-location Genotype. Users of SNP data will be familiar with
-     representations like rs7412 C/C, which indicates the diploid state at
-     a position. In the VRS, this is merely a special case of a
-     Genotype with two Haplotypes, each of which is defined with only one
-     Allele (the same Allele in this case).  The VRS does not define a
-     diplotype type. A diplotype is a special case of a VRS Genotype
-     with exactly two Haplotypes. In practice, software data types that
-     assume a ploidy of 2 make it very difficult to represent haploid
-     states, copy number loss, and copy number gain, all of which occur
-     when representing human data. In addition, assuming ploidy=2 makes
-     software incompatible with organisms with other ploidy. The VRS
-     makes no assumptions about "normal" ploidy.
-
-     In other words, the VRS does not represent single-position
-     Genotypes or diplotypes because both concepts are subsumed by the
-     Allele, Haplotype, and Genotypes entities.
-
-
-
 .. _GitHub issue: https://github.com/ga4gh/vrs/issues
 .. _genetic variation: https://en.wikipedia.org/wiki/Genetic_variation
 

diff --git a/docs/source/impl-guide/computed_identifiers.rst b/docs/source/impl-guide/computed_identifiers.rst
@@ -119,9 +119,7 @@ If the object is an instance of a VRS class, implementations MUST:
     * ensure that objects are referenced with identifiers in the
       ``ga4gh`` namespace
     * replace each nested :term:`identifiable object` with their
-      corresponding *digests*. (Note: Attributes of some objects, such
-      as :ref:`CopyNumber`, permit a mix of identifiable and
-      non-identifiable values.)
+      corresponding *digests*.
     * order arrays of digests and ids by Unicode Character Set values
     * filter out fields that start with underscore (e.g., `_id`)
     * filter out fields with null values
@@ -193,7 +191,7 @@ Truncated Digest (sha512t24u)
 The sha512t24u truncated digest algorithm [Hart2020]_ computes an ASCII digest
 from binary data.  The method uses two well-established standard
 algorithms, the `SHA-512`_ hash function, which generates a binary
-digest from binary data, and `Base64`_ URL encoding, which encodes
+digest from binary data, and a URL-safe variant of `Base64`_ encoding, which encodes
 binary data using printable characters.
 
 Computing the sha512t24u truncated digest for binary data consists of

diff --git a/docs/source/releases/1.3.rst b/docs/source/releases/1.3.rst
@@ -15,15 +15,15 @@ Major Changes
 #############
 
   * :ref:`CopyNumberChange` introduced for relative copy number calls
-  * :ref:`CopyNumberCount` replaces `CopyNumber`
-  * :ref:`Genotype` introduced for describing genotypes
-  * :ref:`ComposedSequenceExpression` introduced for composing expressions
-      from multiple other sequence expressions
+  * :ref:`CopyNumberCount` replaces `CopyNumber (v1.2) <https://vrs.ga4gh.org/en/1.2.1/terms_and_model.html#copynumber>`_
+  * :ref:`Genotype` introduced as a new systemic variation concept
+  * :ref:`ComposedSequenceExpression` introduced for composing expressions from multiple other sequence expressions
 
 Minor Changes
 #############
 
-  * Clarifying updates for :ref:`Allele normalization guidance <>`
+  * Clarifying updates for :ref:`Allele normalization guidance
+    <should-normalize>`
   * :ref:`Haplotype` allele member minimum was revised from 1 to 2
   * Updated metaschema processor version
   * Introduced ordered / unordered attribute in array declarations

diff --git a/docs/source/releases/index.rst b/docs/source/releases/index.rst
@@ -23,6 +23,7 @@ Releases
    :maxdepth: 2
    :includehidden:
 
+   1.3.rst
    1.2.rst
    1.1.rst
    1.0.rst