explain ref agree normalization rules #381

Merged
merged 9 commits into from Apr 13, 2022

reece (Member) commented Mar 6, 2022

Finally got around to Larry's requests for explaining ref agree normalization rules.

Instead of writing a new section, I overhauled an existing related section (should-normalize).

reece (Member, Author) commented Mar 28, 2022

Did y'all (i.e., @ahwagner, @andreasprlic) see that I addressed #377 in this PR a few weeks ago?
@larrybabb approved

ahwagner (Member) left a comment

Thanks for the ping on this thread @reece, I missed the original.

I think these changes accurately represent prior discussion on these points. I do think that the distinction between Allele and variant is artificial, particularly as we do not normalize reference-agree and reference-disagree Alleles the same way. In addition, we (for well-justified reasons) represent Alleles as a state on a reference sequence, inherently comparing the state to some underlying reference.

To be very clear about what I mean, for two (short) reference sequences:

Reference 1: AACTA
Reference 2: AACGA

A resultant sequence of AACCA may be expressed as two different Alleles, conceptually:

  • A C at interresidue coordinates <3,4> on Reference 1
  • A C at interresidue coordinates <3,4> on Reference 2

Each of these results in a different digest, because the reference sequence is inherent to the representation of each Allele. The primary distinction between Allele and similar entities in other major variant formats is that we:

  1. Do not represent the reference state, as it is retrievable from the reference sequence
  2. Do not provide a special-character representation (e.g., =) of the reference-match state
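
As an illustration of the two-reference example above, here is a minimal Python sketch. It is not the VRS digest algorithm or any library's API: the `REFERENCES` table, the dictionary field names, `apply_allele`, and `toy_digest` are all hypothetical, and the serialization is a simplified stand-in. It only shows that two Alleles with the same coordinates and state, but on different references, produce the same resultant sequence while hashing to different values, because the reference identifier is part of what gets hashed.

```python
import base64
import hashlib
import json

# Hypothetical in-memory reference store; a real system would look up
# sequences by accession or digest.
REFERENCES = {"ref1": "AACTA", "ref2": "AACGA"}


def apply_allele(allele):
    """Return the resultant sequence after applying the Allele state to its
    reference, using interresidue (0-based, half-open) coordinates."""
    ref = REFERENCES[allele["sequence_id"]]
    return ref[: allele["start"]] + allele["state"] + ref[allele["end"] :]


def toy_digest(allele):
    """Toy digest: SHA-512 over a canonical JSON dump, truncated to 24 bytes
    and base64url-encoded. The serialization here is an assumption made for
    illustration only."""
    blob = json.dumps(allele, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


# "A C at interresidue coordinates <3,4>" on each reference.
allele_1 = {"sequence_id": "ref1", "start": 3, "end": 4, "state": "C"}
allele_2 = {"sequence_id": "ref2", "start": 3, "end": 4, "state": "C"}

# Same resultant sequence on both references.
print(apply_allele(allele_1), apply_allele(allele_2))

# Different digests, because the reference is part of the representation.
print(toy_digest(allele_1) != toy_digest(allele_2))
```

Note that neither Allele stores the reference base it replaces (T or G); that base is recovered from the reference sequence when needed, which matches point 1 above.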

I am not advocating that we compute the resultant sequence to create digests; the process is far too cumbersome to be practical in implementation, and the gains would be nearly non-existent (i.e., analogous reference sequences are too dissimilar for the above scenario to realistically occur). But I think we should revise the text here to better capture the notion of an Allele as a reference-match-agnostic state rather than a reference-sequence-agnostic state. The edits I have suggested here reflect this subtlety.

docs/source/appendices/design_decisions.rst (5 review threads, outdated and resolved)

reece (Member, Author) commented Apr 8, 2022

Looks good to me (but I'm not a reviewer and am totally powerless! Yay!)

ahwagner (Member) left a comment

Changes applied per 👍 from @reece

reece (Member, Author) commented Apr 13, 2022

Anyone want to hit the merge button?

3 participants