Tws sv local assembler #6989

tedsharpe · 2020-12-04T15:53:32Z

No description provided.

gatk-bot · 2020-12-04T16:30:01Z

Travis reported job failures from build 32329
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
cloud	openjdk8	32329.1	logs
cloud	openjdk11	32329.14	logs
integration	openjdk11	32329.12	logs
integration	openjdk8	32329.2	logs
cloud	openjdk8	32329.1	logs
cloud	openjdk11	32329.14	logs
integration	openjdk11	32329.12	logs
integration	openjdk8	32329.2	logs

gatk-bot · 2021-01-25T16:05:48Z

Travis reported job failures from build 32618
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
cloud	openjdk8	32618.1	logs
cloud	openjdk11	32618.14	logs
integration	openjdk11	32618.12	logs
integration	openjdk8	32618.2	logs

droazen · 2021-02-11T18:50:44Z

src/main/java/org/broadinstitute/hellbender/engine/ReadWalker.java

@@ -57,7 +57,7 @@ protected final void onStartup() {
     */
    void setReadTraversalBounds() {
        if ( hasUserSuppliedIntervals() ) {
-            reads.setTraversalBounds(intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()));
+            reads.setTraversalBounds(userIntervals);


@tedsharpe Not sure if this was a temporary workaround for some issue you encountered, but the previous implementation passing in intervalArgumentCollection.getTraversalParameters() is actually necessary to support -L unmapped. Hopefully this is covered by existing tests somewhere...

Oh, bummer. I'll have to rework this, then. I'm a little vague on the details right now, but it had something to do with an additional tier of padding that I'm doing to make sure that I pull in the distant mates (via the tool whose design you suggested). I had to expand the intervals over which I do read retrieval, while still having access to the underlying singly-padded intervals over which I do assembly.
I asked Louis about why we pre-calculated userIntervals if we were just going to recalculate them here. He thought it seemed an obvious bug at the time. It's so hard to change engine code. :-(

Don't despair @tedsharpe, I think we can probably fix this pretty easily! You just need to construct a new TraversalParameters here, and make sure that the user's traverseUnmapped setting gets propagated. Something like this could do for now:

void setReadTraversalBounds() { if ( hasUserSuppliedIntervals() ) { reads.setTraversalBounds(new TraversalParameters(userIntervals, intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()).traverseUnmappedReads()); } }

I've adopted your suggestion. Thanks very much for catching this.

cwhelan

Hi @tedsharpe thanks for this massive amount of work. Looking forward to experimenting with this tool.

This is a lot to review, and I've tried my best to understand what's going on, but I can't claim to have given it a close enough reading to catch every possible bug.

A lot of my comments are either requests for clarification where I might not have understood something, or minor refactoring suggestions that you could take or leave as you see fit. If I'm totally misunderstanding something just let me know, too.

cwhelan · 2021-01-22T17:27:46Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+
+@DocumentedFeature
+@CommandLineProgramProperties(
+        summary = "experiment",


We should fill these in with something slightly more descriptive (while keeping in mind that it is still an experiment)

cwhelan · 2021-01-22T17:32:11Z

src/main/java/org/broadinstitute/hellbender/tools/PrintDistantMates.java

+import java.util.Map;
+
+@CommandLineProgramProperties(
+        summary = "Prints reads that have distant mates using the mate's alignment information.  Yes, this is weird, but it helps in processing pairs together (see PairWalker).",


Can you clarify in this comment whether it's a requirement to run this before using PairWalker?

cwhelan · 2021-02-05T17:47:13Z

src/main/java/org/broadinstitute/hellbender/engine/ReadWalker.java

@@ -57,7 +57,7 @@ protected final void onStartup() {
     */
    void setReadTraversalBounds() {
        if ( hasUserSuppliedIntervals() ) {
-            reads.setTraversalBounds(intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()));
+            reads.setTraversalBounds(userIntervals);


Are you sure you need to change this? It seems wrong to have to change ReadWalker. This will cause any read walkers to not do any of the interval merging logic in IntervalArgumentCollection which I assume must be desired sometimes. EDIT: sounds like you and @droazen might have found a way to work this out.

cwhelan · 2021-02-12T17:38:22Z

src/main/java/org/broadinstitute/hellbender/tools/PrintDistantMates.java

+
+    private SAMFileGATKReadWriter outputWriter;
+    private Map<String, GATKRead> pendingReads;
+    private static final String MATE_CIGAR_TAG = "MC";


I think these are already defined via htsjdk at eg SAMTag.MC.name().

cwhelan · 2021-02-12T17:47:37Z

src/main/java/org/broadinstitute/hellbender/tools/PrintDistantMates.java

+        return copy;
+    }
+
+    private static String bogusCigar( final int alignLength, final int readLength ) {


I think a quick comment for this ("Provides a CIGAR that forces the alignment length to match the requested value given the actual read length. The CIGAR consists of M operations plus the necessary number of D or I operations to make alignLength equal to readLength.")

Maybe also leave a warning about not depending on the CIGAR of these altered reads after running this tool in the main documentation for the class?

All this silly jazz is gone.

cwhelan · 2021-03-19T20:00:02Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+                    final KmerAdjacency kmer = KmerAdjacencyImpl.find(kVal, kmerAdjacencySet);
+                    // if we fail to look up the kmer
+                    if ( kmer == null ) {
+                        if ( currentPathPart == null ) {


The first and third clauses of this if statement look identical. Maybe combine them by changing this to

currentPathPart == null || ! currentPathPart.isGap()

cwhelan · 2021-03-19T20:01:38Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+                                currentPathPart = new PathPartContig(contig, 0);
+                                parts.add(currentPathPart);
+                            } else {
+                                // weird:  kmer is non-contiguous.  start a new path part after a zero-length gap


I am confused about how this can happen. Should it be an error condition?

Not an error, and I've changed the comment to make it less alarming. Reads with sequencing errors that produce kmers that aren't in the graph can make the read paths appear to jump arbitrarily.

cwhelan · 2021-03-19T20:09:05Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+                if ( ++count >= Kmer.KSIZE ) {
+                    final KmerAdjacency kmer = KmerAdjacencyImpl.find(kVal, kmerAdjacencySet);
+                    // if we fail to look up the kmer
+                    if ( kmer == null ) {


The nested ifs get pretty hairy in this method. Any chance you could pull some of the clauses into little methods for readability. For example this could be:

if (kmer == null) { currentPathPart = addGapKmer(...) } else { ... }

I just find it hard to match up so many levels of indentation in this method.

cwhelan · 2021-03-19T20:12:45Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+                            }
+                        } else {
+                            final int kmerContigOffset = kmer.getContigOffset();
+                            if ( currentPathPart.isGap() ) {


This is the transition out of a gap back to a contig, right? Otherwise we wouldn't have found our kmer in the set?

Yes. Exactly. I'll sprinkle in some more comments.

cwhelan · 2021-03-19T20:25:07Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/PairWalkerUnitTest.java

+        Assert.assertTrue(regionChecker.isInInterval(createRead(2, 1501)));
+    }
+
+    private final static class PairWalkerTester extends PairWalker {


You could consider using Mockito for this sort of thing.

tedsharpe

Back to you for another look, Chris. Wherever I haven't replied to your comment, it means I adopted your suggestion. This is an attempt to avoid cluttering Slack with a zillion messages saying "Done." "Done." "Done." Let me know how this works.

tedsharpe · 2021-03-25T19:04:59Z

src/main/java/org/broadinstitute/hellbender/engine/ReadWalker.java

@@ -57,7 +57,7 @@ protected final void onStartup() {
     */
    void setReadTraversalBounds() {
        if ( hasUserSuppliedIntervals() ) {
-            reads.setTraversalBounds(intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()));
+            reads.setTraversalBounds(userIntervals);


I've adopted your suggestion. Thanks very much for catching this.

tedsharpe · 2021-03-25T19:10:02Z

src/main/java/org/broadinstitute/hellbender/tools/PrintDistantMates.java

+        return copy;
+    }
+
+    private static String bogusCigar( final int alignLength, final int readLength ) {


All this silly jazz is gone.

tedsharpe · 2021-03-25T19:14:09Z

src/main/java/org/broadinstitute/hellbender/tools/PrintDistantMates.java

+                       final ReferenceContext referenceContext,
+                       final FeatureContext featureContext ) {
+        final String mateCigarString = read.getAttributeAsString(MATE_CIGAR_TAG);
+        if ( mateCigarString != null ) {


I've rewritten the tool to take a much simpler approach. It now just unmaps (but places) distant mates. The original alignment is saved in an OA tag, so we can redo the mapping in the PairWalker. We don't need mate cigars. And, even better, the resulting output BAM contains no lies (by which I mean fictitious alignments).

tedsharpe · 2021-03-25T19:20:55Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+)
+@BetaFeature
+public class LocalAssembler extends PairWalker {
+    public static final byte QMIN = 25;


I did fiddle a bit. The quality score bins are now so sparse in order to improve compression and save space that we'd have to lower QMIN to 20 to make any difference, and that's just too low. We're walking a narrow path between building a furry graph that we fail to clean up adequately, and building a poorly connected graph that we fail to gap-fill adequately.
I have reasonable confidence that this is the right number.

tedsharpe · 2021-04-12T19:41:06Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+        if ( !read.isUnmapped() && !mate.isUnmapped() &&
+                read.isReverseStrand() != mate.isReverseStrand() ) {
+            // and both start within 1 base on the ref
+            if ( Math.abs(read.getStart() - read.getMateStart()) <= 1 ) {


Occasionally (like maybe 1/4 of the time) the first base of adapter will happen to agree with the reference. So the aligned region can be extended into what's actually adapter. Unlikely to be more than a base or two, though.

tedsharpe · 2021-04-16T14:03:03Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+        return traversalSet;
+    }
+
+    private static int findMaxOverlap( final List<Contig> prefixes, final List<Contig> suffixes ) {


I'm thinking about this method (findMaxOverlap): it may be providing some short-circuits that might explain my overcalling deletions. But I'll document the assumptions. It's actually pretty straightforward: the last Contig in the prefix list and the first Contig in the suffix list are always the same because we walked forward and backward from some given Contig as a starting point. If the first Contig in the suffix list is not part of a cycle, that means it can appear in the suffix list only once (to appear twice would mean it is part of a cycle). Ditto with the last Contig in the prefix list. So the maximum overlap is just that single contig at the end of the prefix list and at the start of the suffix list.

tedsharpe · 2021-04-16T14:05:35Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+                } else {
+                    final StringBuilder sb = new StringBuilder();
+                    char prefix = '\t';
+                    for ( final Contig predecessor : predecessors ) {


This code went away when I switched to writing GFA. I've switched to functional style in that new code.

tedsharpe · 2021-04-16T14:27:36Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+
+    /** fixed-size, immutable kmer.  usual 2-bit encoding: ACGT->0123.  low order bits are final call. **/
+    public static class Kmer {
+        public static final int KSIZE = 31; // must be odd number less than 32


KSIZE=31 is fine for well-behaved regions of the genome, and woefully inadequate for ill-behaved, low-complexity regions. I decided to see what could be done just with graph operations, rather than do the thing that assemblers usually do (use a small K to build a graph, error correct the reads, use a larger K to build a less tangled graph). Error correction has its own hazards, and it's computationally expensive. So I guess I'm reasonably satisfied: This is fast, but it sometimes produces graphs so tangled as to be not particularly informative. But to tell you the truth, most assemblers don't produce great results on the highly repetitive regions.
It would be easy enough to revise this one Kmer class to allow larger K. What's hard is to provide an option for a variety of sizes (as we did for our first attempt at an SV tool in Spark). But with uncorrected short reads you can't start with a K much higher than 31, anyway.

tedsharpe · 2021-04-16T14:45:00Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+        startingContigSet.remove(firstContig);
+    }
+
+    private static void removeTriviallyDifferentTraversals(


What you say is true (that we won't detect as trivially different Traversals that don't start and end at the same place). But it isn't anything to worry about: We're just trying to reduce the number of sequences that need to be aligned and searched for SVs. We don't have to be exhaustively thorough. I couldn't dream up a way to be more thorough that wasn't computationally much more complex. The other thing is that the usual pattern is: large contig, pair of bubbles, large contig, pair of bubbles, large contig, etc. So as a practical matter, I think this usually works pretty well.
Damn fine catch, though.

tedsharpe · 2021-04-23T16:37:49Z

src/main/java/org/broadinstitute/hellbender/tools/LocalAssembler.java

+                prevTraversal = curTraversal;
+            }
+        }
+        sortedTraversals.removeIf(Traversal::isRC);


We added both strands at the beginning, and I thought I had the sort order arranged so that, for any given pair that was trivially different, at least one of the +strand guys survived removal. Alas, that wasn't true. It's rewritten to avoid that assumption.

cwhelan

Your changes look good to me after playing with the tool a little bit. It's OK with me to go ahead and merge.

tedsharpe requested a review from cwhelan December 4, 2020 15:55

droazen reviewed Feb 11, 2021

View reviewed changes

cwhelan requested changes Mar 19, 2021

View reviewed changes

tedsharpe commented Apr 23, 2021

View reviewed changes

cwhelan approved these changes Jun 3, 2021

View reviewed changes

Local Assembler for SVs

0198191

tedsharpe force-pushed the tws_sv_local_assembler branch from 8725c3a to 0198191 Compare June 3, 2021 16:31

tedsharpe merged commit ef36d6c into master Jun 3, 2021

tedsharpe deleted the tws_sv_local_assembler branch June 3, 2021 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tws sv local assembler #6989

Tws sv local assembler #6989

tedsharpe commented Dec 4, 2020

gatk-bot commented Dec 4, 2020 •

edited

gatk-bot commented Jan 25, 2021 •

edited

droazen Feb 11, 2021

tedsharpe Feb 11, 2021

droazen Feb 11, 2021

tedsharpe Mar 25, 2021

cwhelan left a comment

cwhelan Jan 22, 2021

cwhelan Jan 22, 2021

cwhelan Feb 5, 2021

cwhelan Feb 12, 2021

cwhelan Feb 12, 2021

cwhelan Feb 12, 2021

tedsharpe Mar 25, 2021

cwhelan Mar 19, 2021

cwhelan Mar 19, 2021

tedsharpe Apr 15, 2021

cwhelan Mar 19, 2021

cwhelan Mar 19, 2021

tedsharpe Apr 13, 2021

cwhelan Mar 19, 2021

tedsharpe left a comment

tedsharpe Mar 25, 2021

tedsharpe Mar 25, 2021

tedsharpe Mar 25, 2021

tedsharpe Mar 25, 2021

tedsharpe Apr 12, 2021

tedsharpe Apr 16, 2021

tedsharpe Apr 16, 2021

tedsharpe Apr 16, 2021

tedsharpe Apr 16, 2021

tedsharpe Apr 23, 2021

cwhelan left a comment

Tws sv local assembler #6989

Tws sv local assembler #6989

Conversation

tedsharpe commented Dec 4, 2020

gatk-bot commented Dec 4, 2020 • edited

gatk-bot commented Jan 25, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cwhelan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tedsharpe left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cwhelan left a comment

Choose a reason for hiding this comment

gatk-bot commented Dec 4, 2020 •

edited

gatk-bot commented Jan 25, 2021 •

edited