Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small alignment leads to error in consensus aligner #899

Closed
CBeelen opened this issue Dec 5, 2022 · 3 comments · Fixed by #962
Closed

Small alignment leads to error in consensus aligner #899

CBeelen opened this issue Dec 5, 2022 · 3 comments · Fixed by #962
Assignees
Milestone

Comments

@CBeelen
Copy link
Contributor

CBeelen commented Dec 5, 2022

In sample MSVRNA3-HIV_S7 from run 210115_M04401, there is an alignment starting just 1 nucleotide before the end of vpr. Because there is nothing to align in amino acid space, there are no values in coord2conseq (the dictionary helping us translate conseq coordinates to reference coordinates), so the code fails in count_match when trying to find the maximum coordinate.
There are two options to solve this:

  • in the consensus aligner, in find_amino_alignments, check for the size of an alignment before adding it (we already check if it's larger than 0 for a match, we could check whether it's at least 3 nucleotides long), or
  • in the consensus aligner, in count_match, check whether the alignment is large enough to do anything.

I'd prefer to catch alignments that are too small as soon as possible (option 1), but option 2 might help catch other weird edge cases.

@CBeelen CBeelen added this to the 7.16 milestone Dec 5, 2022
@donkirkby
Copy link
Member

Option 1 sounds good to me, particularly since it's just a change to an existing check.

@CBeelen CBeelen self-assigned this Mar 21, 2023
CBeelen added a commit that referenced this issue Mar 21, 2023
@CBeelen
Copy link
Contributor Author

CBeelen commented May 9, 2023

This also happened for sample 58836A-HIV_S14 from run 150325_M01841.

@CBeelen
Copy link
Contributor Author

CBeelen commented May 12, 2023

Generally speaking, we can find the reading frame of an alignment, even if it is smaller than 3 nucleotides - we usually just round up to the nearest-larger integer number of amino acids and align. In this particular case, the error happened only because we were right at a region boundary and there was not enough sequence to align to.
I'm a little worried about very fragmented alignments if throw alignments of 1 or 2 nucleotides away - so I'm working on option 2 now, instead. I'm also double checking some cases with very bad alignments to see if we ever need these small alignments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants