Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the ability for Indels to be recovered from dangling heads #6113

Merged
merged 7 commits into from Jan 26, 2021

Conversation

jamesemery
Copy link
Collaborator

In helping @bhanugandham figure out why a particular site was failing it became apparent that merging dangling head code was failing to recover deletions in the dangling head. Furthermore there is some code in the dangling end recovery code that asserts a certain high standard of matching (usually 1 but sometimes dangling branch length/kmersize) getMaxMismatches(final int lengthOfDanglingBranch). Both of these facts seem likely to cause dangling heads to be dropped despite their being still potentially informative, particularly the indel code.

I have added the ability for the index recovery code to account for the cigar string when merging dangling ends. Addtionally rather than counting mismatches to reject the branch it simply requires a minimum matching end (which can be changed, I suspect this is where the lionshare of the differences come from). Unfortunately changing the tests is non-trivial (as this happened to change the integration test results for HaplotypeCaller at a few sites) so I wanted to get this branch up to solicit advice a to whether it is worth pursuing this fix. @davidbenjamin @ldgauthier @droazen

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly little stuff, with one comment that might simplify the prefix code.

}

// /**
// * Determine the maximum number of mismatches permitted on the branch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete?

}
// if we got here then we hit the max index
return lastGoodIndex;
if (matchesSinceLastMismatch >= getMinMatchingBases) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of tracking matchesSinceLastMismatch as you go, why not just find the two offsets and then backtrack from there in both sequences, looking for the required number of exact matches?

@jamesemery jamesemery force-pushed the je_allowMoreMismatchesInDanglingHeadRecovery branch from 890383d to 39a0d04 Compare June 8, 2020 20:30
@jamesemery
Copy link
Collaborator Author

@davidbenjamin I have refactored this branch to account for changes to the codebase adjacent to this code. In the interest of not possibly harming any of the old results I have made this a toggle and I have also made the setting apply symmetrically to tails and heads and added a few simple tests in the existing framework.

@gatk-bot
Copy link

gatk-bot commented Jun 8, 2020

Travis reported job failures from build 30561
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 30561.12 logs
unit openjdk8 30561.3 logs

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose a simplification to the code. Do you think it works?

/**
* Finds the index of the best extent of the prefix match between the provided paths, for dangling head merging.
* Requires that at a minimum there are at least #getMinMatchingBases() matches between the reference and the read
* at the end in order to emit an alignmet offset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

* @return an integer pair object where the key is the offset into path1 and the value is offset into path2 (both -1 if no path is found)
*/
private Pair<Integer, Integer> bestPrefixMatch(final List<CigarElement> cigarElements, final byte[] path1, final byte[] path2) {
final int getMinMatchingBases = getMinMatchingBases();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline this?

// we want to paste the base before the last read base, thus we need to handle it in the next iteration when we find a mismatch
boolean wasLastIndexAMismatch = false;

for (final CigarElement ce : cigarElements) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this can be made much simpler by traversing from the end of the paths. I think the pseudocode in this case becomes:

  • go backwards from the right until you hit a mismatch or indel.
  • if this covered fewer than getMinMatchingBases() return (-1,-1).
  • otherwise return the read and ref position you got to.

In code this looks something like

refIdx = path1.length - 1;
readIdx = path2.length - 1;
for (final CigarElement ce : Lists.reverse(cigarElements)) {
   if (!(ce.consumesReadBases() && ce.consumesRefBases()) {
      break;
   }
   for (int j = 0; j < elementLength; j++, refIdx—; readIndx—) {
      if (path1[refIdx] != path2[readIdx]) {
         break;
      }
   }  
}

final int matches = path2.length - readIdx;
return matches < minimum ? (-1,-1) : (readIdx, refIdx);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, your previous comment got lost in the rebase. I just checked and there is a problem with this, namely that there is no guarantee that the ref path and read path match in length. Thus in order to match the bases at the end of the dangling path I have at the very least parse through the cigar once forwards and then start walking it backwards... Perhaps that is still cleaner than what I wrote though...

Copy link
Collaborator Author

@jamesemery jamesemery Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we excise the trailing deletions that might be useable for quickly calculating the end base this at on this line: return new DanglingChainMergeHelper(altPath, refPath, altBases, refBases, AlignmentUtils.removeTrailingDeletions(alignment.getCigar()));

@davidbenjamin
Copy link
Contributor

Back to @jamesemery

@davidbenjamin davidbenjamin removed their assignment Jun 9, 2020
@gatk-bot
Copy link

gatk-bot commented Jul 2, 2020

Travis reported job failures from build 30869
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 30869.13 logs
integration openjdk11 30869.12 logs
variantcalling openjdk8 30869.4 logs
unit openjdk8 30869.3 logs
integration openjdk8 30869.2 logs

@jamesemery jamesemery force-pushed the je_allowMoreMismatchesInDanglingHeadRecovery branch from 1ee14e0 to e8c4972 Compare September 30, 2020 20:02
@gatk-bot
Copy link

gatk-bot commented Sep 30, 2020

Travis reported job failures from build 31570
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 31570.13 logs
variantcalling openjdk8 31570.4 logs
unit openjdk8 31570.3 logs

@jamesemery jamesemery force-pushed the je_allowMoreMismatchesInDanglingHeadRecovery branch from e8c4972 to fa3240a Compare November 16, 2020 20:33
@gatk-bot
Copy link

gatk-bot commented Nov 16, 2020

Travis reported job failures from build 32140
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32140.1 logs
unit openjdk8 32140.3 logs
variantcalling openjdk8 32140.4 logs

@@ -93,18 +95,25 @@ public ReadThreadingAssembler(final int maxAllowedPathsForReadThreadingAssembler
chainPruner = useAdaptivePruning ? new AdaptiveChainPruner<>(initialErrorRateForPruning, pruningLogOddsThreshold, pruningSeedingLogOddsThreshold, maxUnprunedVariants) :
new LowWeightChainPruner<>(pruneFactor);
numBestHaplotypesPerGraph = maxAllowedPathsForReadThreadingAssembler;
this.minMachingBasesToDanglngEndRecovery = minMachingBasesToDanglngEndRecovery;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix spelling. change to minMatchingBasesToDanglingEndRecovery

@jamesemery
Copy link
Collaborator Author

@davidbenjamin Do you think you can take another look at this? It would be nice to finally merge this code into the next release next week.

@gatk-bot
Copy link

gatk-bot commented Jan 22, 2021

Travis reported job failures from build 32609
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 32609.13 logs
unit openjdk8 32609.3 logs
variantcalling openjdk8 32609.4 logs

@gatk-bot
Copy link

gatk-bot commented Jan 22, 2021

Travis reported job failures from build 32612
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 32612.13 logs
unit openjdk8 32612.3 logs

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery Looks good.

@jamesemery jamesemery merged commit c305078 into master Jan 26, 2021
@jamesemery jamesemery deleted the je_allowMoreMismatchesInDanglingHeadRecovery branch January 26, 2021 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants