Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31

szhan · 2024-03-21T16:48:10Z

If ancestral haplotypes are used as part of a reference panel for LS matching, then non-copying states in them need to be treated differently than other states. When a non-copying state is encountered, the emission probability should be 0, because it is not allowed to copy from it. We arbitrarily set the numeric value of non-copying states to be -2, which is distinct from missing state that is set to -1. Here, we modify how emission probabilities are computed in the haploid version of the Viterbi algorithm. We also add examples of reference panels, queries, and expected paths for testing.

szhan · 2024-03-21T21:47:57Z

The tests cover the following:

Four types of ancestral haplotypes that span:
- From the first position to the middle, with NONCOPY states on the right side.
- From the middle to the last position, with NONCOPY states on the left side.
- A middle section, excluding the first and last few positions, which are occupied by NONCOPY states.
- Completely from end to end, with no NONCOPY states.
Switching from:
- Sample to sample (1 switch).
- Sample to ancestor (1 switch).
- Ancestor to sample (1 switch).
- Ancestor to ancestor (1 switch).
- Sample to ancestor to sample (2 switches).
- Ancestor to ancestor to sample (2 switches).
Missing data in query at a switch position (which delays switching) and at a position to the left of it (which results in no delay in switching).
Multiallelic sites.
Two equally likely paths.

szhan · 2024-03-21T22:01:12Z

Presently, rho / n, which is part of the calculation of switch probabilities, is computed by dividing rho along the sites by a constant n, which is the number of reference haplotypes. It is part of the standard formulation of Li & Stephens HMM which does not define any non-copying state in the reference panel. Here, because of the presence of NONCOPY, we should allow the denominator to vary along the sites according to the number of reference haplotypes without the NONCOPY state. The consequences of this for the Viterbi, forward, backward, and forward-backward algorithms should be discussed in a separate issue.

jeromekelleher

Nice, looks great @szhan.

@astheeggeggs, what do you think? It would be really helpful to have a direct matrix-based implementation of the full "all haplotypes" model we can compare with, and this seems like the most natural way. Would be great to get your thoughts here.

astheeggeggs · 2024-03-22T10:03:07Z

So the intuition here is that the number of templates in the reference panel jumps from n -> 2n -1, and you fill the ancestral reference templates with missing values (on anything really) where there is no ancestral material in those haplotypes, give them a 'NON_COPY' label at those sites, and set the corresponding emission prob to 0. You can then set recombination rate as rho_j/(2*n-1-|sum(NON_COPY_j)|) for each site j. That logic all makes sense to me!

…

On Fri, Mar 22, 2024 at 9:10 AM Jerome Kelleher ***@***.***> wrote: ***@***.**** commented on this pull request. Nice, looks great @szhan <https://github.com/szhan>. @astheeggeggs <https://github.com/astheeggeggs>, what do you think? It would be really helpful to have a direct matrix-based implementation of the full "all haplotypes" model we can compare with, and this seems like the most natural way. Would be great to get your thoughts here. — Reply to this email directly, view it on GitHub <#31 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABVQA77ORBRY3VYMCOP7QA3YZPYSBAVCNFSM6AAAAABFB2M6BGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSNJUGI4TOMRQGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

szhan · 2024-03-24T18:31:14Z

Is there a point to add tests for cases where we expect at least two equally likely paths? It seems quite easy to encounter cases with more than one best path when the number of reference haplotypes and number of sites grow.

szhan · 2024-03-26T09:48:46Z

I have reworked the tests, which now cover the following queries copying from:

An ancestor with only NONCOPY states.
An ancestor on the left and a sample on the right.
A sample on the left and an ancestor on the right.
An ancestor in the middle and a sample in both the flanking regions.
Two ancestors and one sample (two switches to a different haplotype).
An ancestor and a sample but with a MISSING state at the switch position.
An ancestor and a sample but with a MISSING state left of the switch position.

In the test ref. panels, all the ancestors carry at least one NONCOPY state, and all the samples carry no NONCOPY state.

Instead of asserting that the expected path and the actual path are identical, the new tests assert that the log-likelihood of the expected path and that of the actual path are approximately equal (i.e. passing np.allclose). This avoids needing to deal with equally likely paths that may not be so obvious (it gets too tedious to check anyway).

szhan · 2024-03-26T10:27:51Z

Just to add a note here that in tsinfer n (i.e. number of match nodes) is constant along the genome, given a ref. panel. So, the above tests should test the behavior of LS HMM Viterbi algorithm as in tsinfer, I think?

…n reference panel

szhan · 2024-04-04T14:40:53Z

I have two sets of tests passing now. One set (test_API_noncopy_manual.py) contains manually crafted test examples, and the other set (test_API_noncopy.py) has more or less the same set of tests as test_API_multiallelic.py.

In both sets of tests, NONCOPY characters in the reference panel are not counted when computing the number of alleles per site (i.e. n_alleles), which is used for mutation rate scaling.

The set of tests to keep is in test_API_noncopy.py. There are several major differences between this test file and test_API_multiallelic.py, on which it is heavily based.

Instead of getting a genotype matrix from a simulated tree sequence as the reference panel, it calls another function (get_ancestral_haplotypes) to extract ancestral haplotypes from it. When there is recombination in the past, partial haplotypes with NONCOPY denoting non-ancestral material are incorporated into the reference panel.
Tree sequences are simulated using SMC instead of Hudson's, in order to avoid generating discontiguous partial haplotypes caused by trapped non-ancestral material.
NONCOPY and MISSING characters are not counted as distinct alleles (see related MISSING as a distinct allele? #33).

Probably I should delete test_API_noncopy_manual.py, unless some of the hand-crafted examples may be useful for testing. @jeromekelleher ?

Otherwise, I think the tests are ready for review @astheeggeggs, besides some more minor tweaks I'd like to make.

szhan · 2024-04-04T14:57:47Z

Ah, by modifying lshmm/api.py, both the haploid and diploid versions are affected, but I haven't added tests for the diploid version.

Also, I think we should probably include assertions that none of the most likely paths go through a base in the reference panel that is equal to NONCOPY.

astheeggeggs · 2024-04-08T14:34:11Z

Looking at this again, I think it needs a rethink. We have emission probabilities that don't sum to 1, which doesn't make sense in the usual HMM framework.

szhan · 2024-04-17T10:43:51Z

I'm going to scrap most of the code here, since it makes sense to do refactoring before adding new features. I'll probably just keep some of the handcrafted test cases and the routine that gets a ref. panel with NONCOPY states.

szhan · 2024-04-19T08:13:04Z

Aside from rethinking about this, we should probably refactor the code a bit first. See #37

szhan · 2024-05-14T20:47:20Z

I've deleted this branch of code, but it is probably useful to keep the discussion here.

astheeggeggs

Looks great, I've not checked the tests, but the alterations in the functions absolutely make sense.

szhan · 2024-05-19T19:38:04Z

Thanks @astheeggeggs ! Please do not merge the code in this branch! Please look at the code in #41 instead.

szhan · 2024-05-22T08:45:05Z

Closing thie PR because it's replaced by PR #41.

szhan force-pushed the vit_noncopy branch from b44049c to 444491e Compare March 21, 2024 21:29

jeromekelleher reviewed Mar 22, 2024

View reviewed changes

szhan force-pushed the vit_noncopy branch from 0501f1c to e45ff53 Compare March 24, 2024 18:20

szhan force-pushed the vit_noncopy branch from aa9aa58 to 3f7a936 Compare March 26, 2024 09:36

szhan force-pushed the vit_noncopy branch 2 times, most recently from db075cf to ac20d0a Compare March 28, 2024 19:40

szhan changed the title ~~Modify haploid Viterbi to handle NONCOPY state in reference panel~~ Modify haploid Viterbi/FB to handle NONCOPY state in reference panel Mar 29, 2024

Modify haploid Viterbi and forward-backward to handle NONCOPY state i…

5ba8486

…n reference panel

szhan force-pushed the vit_noncopy branch from a70ed54 to 5ba8486 Compare April 4, 2024 14:20

szhan mentioned this pull request Apr 4, 2024

MISSING as a distinct allele? #33

Closed

This was referenced Apr 5, 2024

Exclude MISSING when counting alleles per site #34

Merged

Actions: Preparing metadata (pyproject.toml) did not run successfully #35

Closed

This was referenced May 14, 2024

Handle NONCOPY in reference panel #40

Closed

Implement NONCOPY for both the haploid and diploid cases #41

Merged

astheeggeggs approved these changes May 19, 2024

View reviewed changes

szhan closed this May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31

Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31

szhan commented Mar 21, 2024

szhan commented Mar 21, 2024

szhan commented Mar 21, 2024

jeromekelleher left a comment

astheeggeggs commented Mar 22, 2024 via email

szhan commented Mar 24, 2024

szhan commented Mar 26, 2024

szhan commented Mar 26, 2024

szhan commented Apr 4, 2024

szhan commented Apr 4, 2024

astheeggeggs commented Apr 8, 2024

szhan commented Apr 17, 2024

szhan commented Apr 19, 2024

szhan commented May 14, 2024

astheeggeggs left a comment

szhan commented May 19, 2024

szhan commented May 22, 2024

Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31

Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31

Conversation

szhan commented Mar 21, 2024

szhan commented Mar 21, 2024

szhan commented Mar 21, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

astheeggeggs commented Mar 22, 2024 via email

szhan commented Mar 24, 2024

szhan commented Mar 26, 2024

szhan commented Mar 26, 2024

szhan commented Apr 4, 2024

szhan commented Apr 4, 2024

astheeggeggs commented Apr 8, 2024

szhan commented Apr 17, 2024

szhan commented Apr 19, 2024

szhan commented May 14, 2024

astheeggeggs left a comment

Choose a reason for hiding this comment

szhan commented May 19, 2024

szhan commented May 22, 2024