Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31

Closed
wants to merge 1 commit into from

Conversation

szhan
Copy link
Collaborator

@szhan szhan commented Mar 21, 2024

If ancestral haplotypes are used as part of a reference panel for LS matching, then non-copying states in them need to be treated differently than other states. When a non-copying state is encountered, the emission probability should be 0, because it is not allowed to copy from it. We arbitrarily set the numeric value of non-copying states to be -2, which is distinct from missing state that is set to -1. Here, we modify how emission probabilities are computed in the haploid version of the Viterbi algorithm. We also add examples of reference panels, queries, and expected paths for testing.

@szhan
Copy link
Collaborator Author

szhan commented Mar 21, 2024

The tests cover the following:

  • Four types of ancestral haplotypes that span:
    • From the first position to the middle, with NONCOPY states on the right side.
    • From the middle to the last position, with NONCOPY states on the left side.
    • A middle section, excluding the first and last few positions, which are occupied by NONCOPY states.
    • Completely from end to end, with no NONCOPY states.
  • Switching from:
    • Sample to sample (1 switch).
    • Sample to ancestor (1 switch).
    • Ancestor to sample (1 switch).
    • Ancestor to ancestor (1 switch).
    • Sample to ancestor to sample (2 switches).
    • Ancestor to ancestor to sample (2 switches).
  • Missing data in query at a switch position (which delays switching) and at a position to the left of it (which results in no delay in switching).
  • Multiallelic sites.
  • Two equally likely paths.

@szhan
Copy link
Collaborator Author

szhan commented Mar 21, 2024

Presently, rho / n, which is part of the calculation of switch probabilities, is computed by dividing rho along the sites by a constant n, which is the number of reference haplotypes. It is part of the standard formulation of Li & Stephens HMM which does not define any non-copying state in the reference panel. Here, because of the presence of NONCOPY, we should allow the denominator to vary along the sites according to the number of reference haplotypes without the NONCOPY state. The consequences of this for the Viterbi, forward, backward, and forward-backward algorithms should be discussed in a separate issue.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks great @szhan.

@astheeggeggs, what do you think? It would be really helpful to have a direct matrix-based implementation of the full "all haplotypes" model we can compare with, and this seems like the most natural way. Would be great to get your thoughts here.

@astheeggeggs
Copy link
Owner

astheeggeggs commented Mar 22, 2024 via email

@szhan
Copy link
Collaborator Author

szhan commented Mar 24, 2024

Is there a point to add tests for cases where we expect at least two equally likely paths? It seems quite easy to encounter cases with more than one best path when the number of reference haplotypes and number of sites grow.

@szhan
Copy link
Collaborator Author

szhan commented Mar 26, 2024

I have reworked the tests, which now cover the following queries copying from:

  • An ancestor with only NONCOPY states.
  • An ancestor on the left and a sample on the right.
  • A sample on the left and an ancestor on the right.
  • An ancestor in the middle and a sample in both the flanking regions.
  • Two ancestors and one sample (two switches to a different haplotype).
  • An ancestor and a sample but with a MISSING state at the switch position.
  • An ancestor and a sample but with a MISSING state left of the switch position.

In the test ref. panels, all the ancestors carry at least one NONCOPY state, and all the samples carry no NONCOPY state.

Instead of asserting that the expected path and the actual path are identical, the new tests assert that the log-likelihood of the expected path and that of the actual path are approximately equal (i.e. passing np.allclose). This avoids needing to deal with equally likely paths that may not be so obvious (it gets too tedious to check anyway).

@szhan
Copy link
Collaborator Author

szhan commented Mar 26, 2024

Just to add a note here that in tsinfer n (i.e. number of match nodes) is constant along the genome, given a ref. panel. So, the above tests should test the behavior of LS HMM Viterbi algorithm as in tsinfer, I think?

@szhan szhan force-pushed the vit_noncopy branch 2 times, most recently from db075cf to ac20d0a Compare March 28, 2024 19:40
@szhan szhan changed the title Modify haploid Viterbi to handle NONCOPY state in reference panel Modify haploid Viterbi/FB to handle NONCOPY state in reference panel Mar 29, 2024
@szhan
Copy link
Collaborator Author

szhan commented Apr 4, 2024

I have two sets of tests passing now. One set (test_API_noncopy_manual.py) contains manually crafted test examples, and the other set (test_API_noncopy.py) has more or less the same set of tests as test_API_multiallelic.py.

In both sets of tests, NONCOPY characters in the reference panel are not counted when computing the number of alleles per site (i.e. n_alleles), which is used for mutation rate scaling.

The set of tests to keep is in test_API_noncopy.py. There are several major differences between this test file and test_API_multiallelic.py, on which it is heavily based.

  • Instead of getting a genotype matrix from a simulated tree sequence as the reference panel, it calls another function (get_ancestral_haplotypes) to extract ancestral haplotypes from it. When there is recombination in the past, partial haplotypes with NONCOPY denoting non-ancestral material are incorporated into the reference panel.
  • Tree sequences are simulated using SMC instead of Hudson's, in order to avoid generating discontiguous partial haplotypes caused by trapped non-ancestral material.
  • NONCOPY and MISSING characters are not counted as distinct alleles (see related MISSING as a distinct allele? #33).

Probably I should delete test_API_noncopy_manual.py, unless some of the hand-crafted examples may be useful for testing. @jeromekelleher ?

Otherwise, I think the tests are ready for review @astheeggeggs, besides some more minor tweaks I'd like to make.

@szhan
Copy link
Collaborator Author

szhan commented Apr 4, 2024

Ah, by modifying lshmm/api.py, both the haploid and diploid versions are affected, but I haven't added tests for the diploid version.

Also, I think we should probably include assertions that none of the most likely paths go through a base in the reference panel that is equal to NONCOPY.

@astheeggeggs
Copy link
Owner

Looking at this again, I think it needs a rethink. We have emission probabilities that don't sum to 1, which doesn't make sense in the usual HMM framework.

@szhan
Copy link
Collaborator Author

szhan commented Apr 17, 2024

I'm going to scrap most of the code here, since it makes sense to do refactoring before adding new features. I'll probably just keep some of the handcrafted test cases and the routine that gets a ref. panel with NONCOPY states.

@szhan
Copy link
Collaborator Author

szhan commented Apr 19, 2024

Aside from rethinking about this, we should probably refactor the code a bit first. See #37

@szhan
Copy link
Collaborator Author

szhan commented May 14, 2024

I've deleted this branch of code, but it is probably useful to keep the discussion here.

Copy link
Owner

@astheeggeggs astheeggeggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I've not checked the tests, but the alterations in the functions absolutely make sense.

@szhan
Copy link
Collaborator Author

szhan commented May 19, 2024

Thanks @astheeggeggs ! Please do not merge the code in this branch! Please look at the code in #41 instead.

@szhan szhan closed this May 22, 2024
@szhan
Copy link
Collaborator Author

szhan commented May 22, 2024

Closing thie PR because it's replaced by PR #41.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants