scaffold local assemblies #4589

tedsharpe · 2018-03-26T22:05:56Z

No description provided.

SHuang-Broad

@tedsharpe Done with my review with mostly minor comments. As I've never done an assembly myself, I'll let others judge if the details are correct.

SHuang-Broad · 2018-03-27T15:14:33Z

...broadinstitute/hellbender/tools/spark/sv/StructuralVariationDiscoveryArgumentCollection.java

@@ -109,6 +109,15 @@
        @Argument(doc = "Write GFA representation of assemblies in fastq-dir.", fullName = "write-gfas")
        public boolean writeGFAs = false;

+        @Argument(doc = "Aggressively simplify local assemblies, ignoring small variants.", fullName = "pop-variant-bubbles")


For these parameters, do you intend to mark them as @Advanced, as most users probably don't want to mess with them?

SHuang-Broad · 2018-03-27T15:50:17Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

-import java.util.Comparator;
-import java.util.List;
+import java.io.*;
+import java.util.*;

 /** This LocalAssemblyHandler aligns assembly contigs with BWA, along with some optional writing of intermediate results. */


Do you want to update this doc?

SHuang-Broad · 2018-03-28T15:18:37Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

        this.alignerIndexFile = alignerIndexFile;
        this.maxFastqSize = maxFastqSize;
        this.fastqDir = fastqDir;
        this.writeGFAs = writeGFAs;
+        this.popVariantBubbles = popVariantBubbles;
+        this.removeShadowedContigs = removeShadowedContigs;
+        this.expandAssemblyGraph = expandAssemblyGraph;
    }

    @Override


add simple doc for this function?

SHuang-Broad · 2018-03-28T15:26:13Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+            final int MAG_F_NO_SIMPL = 0x80; // skip bubble simplification (default)
+            // add aggressive-popping flag, and remove no-simplification flag
+            assembler.setCleaningFlag(MAG_F_AGGRESSIVE | MAG_F_POPOPEN);
+        }
        final long timeStart = System.currentTimeMillis();


Not that I'm suggesting you do it in this PR, but since there's the idea of doing assembly diagnostics in the future, it might be worth thinking about extracting an "AssemblyDiagnosis" object, or something similar.

That's what the AlignedAssemblyOrExcuse class is for.
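For readers following the flag arithmetic in the hunk above, here is a minimal sketch of "add the aggressive-popping flag, and remove the no-simplification flag". The three constants and their values are the ones quoted in this PR's diff; the class name, the `ASSUMED_DEFAULT` constant, and the `cleaningFlags` helper are hypothetical, and the default is only assumed to be the two flags the diff comments mark "(default)".

```java
final class CleaningFlagSketch {
    // Values as quoted in the diff under review.
    static final int MAG_F_AGGRESSIVE = 0x20; // pop variant bubbles (not default)
    static final int MAG_F_POPOPEN    = 0x40; // aggressive tip trimming (default)
    static final int MAG_F_NO_SIMPL   = 0x80; // skip bubble simplification (default)

    // Assumption: the default is exactly the two flags marked "(default)" above.
    static final int ASSUMED_DEFAULT = MAG_F_POPOPEN | MAG_F_NO_SIMPL;

    // Hypothetical helper: set the aggressive bit, clear the no-simplification bit.
    static int cleaningFlags(final boolean popVariantBubbles) {
        if (!popVariantBubbles) {
            return ASSUMED_DEFAULT;
        }
        return (ASSUMED_DEFAULT | MAG_F_AGGRESSIVE) & ~MAG_F_NO_SIMPL;
    }

    public static void main(final String[] args) {
        System.out.printf("0x%02x%n", cleaningFlags(true));
    }
}
```

Under that assumption the result equals `MAG_F_AGGRESSIVE | MAG_F_POPOPEN`, which is the literal value the diff passes to `setCleaningFlag`.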

SHuang-Broad · 2018-03-28T15:27:29Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+        final FermiLiteAssembler assembler = new FermiLiteAssembler();
+        if ( popVariantBubbles ) {
+            final int MAG_F_AGGRESSIVE = 0x20; // pop variant bubbles (not default)
+            final int MAG_F_POPOPEN = 0x40; // aggresive tip trimming (default)


"aggressive"

SHuang-Broad · 2018-03-29T16:18:45Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+    }
+
+    // join the sequences of a chain of contigs to produce a single, new contig
+    private static Contig joinContigs( final Contig firstContig, final List<Connection> path ) {


It seems that this assumes the input path is not empty. It makes sense to check for that and return early.
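A minimal sketch of the suggested early return, combined with the overlap-aware length computation that appears later in this diff. It is not the PR's code: `Step` is a hypothetical stand-in for `Connection`, strand handling is ignored, and each overlap is assumed to be no longer than its target sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class JoinContigsSketch {
    /** Hypothetical stand-in for a Connection: the next contig's bases and its overlap with the previous one. */
    static final class Step {
        final byte[] targetSequence;
        final int overlapLen;

        Step(final byte[] targetSequence, final int overlapLen) {
            this.targetSequence = targetSequence;
            this.overlapLen = overlapLen;
        }
    }

    static byte[] joinSequences(final byte[] first, final List<Step> path) {
        // Nothing to join: return the first sequence unchanged.
        if (path.isEmpty()) {
            return first;
        }
        // Each step contributes its bases minus the overlap already present.
        final int newLen = path.stream()
                .mapToInt(step -> step.targetSequence.length - step.overlapLen)
                .reduce(first.length, Integer::sum);
        final byte[] sequence = new byte[newLen];
        System.arraycopy(first, 0, sequence, 0, first.length);
        int destinationIndex = first.length;
        for (final Step step : path) {
            final int len = step.targetSequence.length - step.overlapLen;
            System.arraycopy(step.targetSequence, step.overlapLen, sequence, destinationIndex, len);
            destinationIndex += len;
        }
        return sequence;
    }

    public static void main(final String[] args) {
        final byte[] first = "ACGT".getBytes(StandardCharsets.US_ASCII);
        final List<Step> path = Arrays.asList(new Step("GTTA".getBytes(StandardCharsets.US_ASCII), 2));
        System.out.println(new String(joinSequences(first, path), StandardCharsets.US_ASCII));
    }
}
```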

SHuang-Broad · 2018-03-29T16:24:02Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+    }
+
+    // combine two contigs into one, preserving all their connections, except their connections to each other
+    private static Contig joinContigsWithConnections( final Contig firstContig,


I would recommend adding a unit test for this method.

I will, if you insist, but I think it's covered by the unit test of removeUnbranchedConnections.

SHuang-Broad · 2018-03-29T16:24:17Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+        contig.setConnections(newConnections);
+    }
+
+    // join the sequences of a chain of contigs to produce a single, new contig


I would recommend adding a unit test for this method as well.

If you insist, but I think the unit tests for removeUnbranchedConnections and expandAssemblyGraph exercise this method adequately.

SHuang-Broad · 2018-03-29T18:46:31Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+    }
+
+    // contig + strand info.
+    private static final class ContigStrand {


Would it make sense to name this StrandedContig?

Yuck. :-)
I'd never leave a contig stranded. The poor thing.

SHuang-Broad · 2018-03-29T19:10:35Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+                contigList.add(tig);
+                examined.add(tig);
+            } else {
+                final int nPredecessors = countPredecessors(tig);


Is my understanding correct that, after the call to removeUnbranchedConnections(), there could only be

* isolated islands,
* cycles, and
* parallel paths (i.e. bubbles, if that's not an abuse of terms);

and that this block is meant to extend the parallel paths as much as possible?

The graph structure can be arbitrarily complex: isolated islands, plus multiple components consisting of cycles and branching structures that are potentially a lot more complex than "bubble" implies. But, yeah. I think you've got the idea.

cwhelan

This looks good to me. Very nice code. Mostly a few minor comments and a question about the mismatch parameter you use for determining if contigs are "shadowed".

cwhelan · 2018-03-29T15:31:08Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+    @VisibleForTesting
+    static FermiLiteAssembly removeShadowedContigs( final FermiLiteAssembly assembly ) {
+        final int kmerSize = 31;
+        final double maxMismatchRate = .05;


5% divergence seems like quite a lot. If we choose the wrong contig out of a pair of contigs that differ by 5%, it might have a deleterious effect on contig alignment (or later genotyping). One thing that worries me is this scenario:

The evidence interval lies in a segmental duplication. Our kmer search brings in reads that actually belong to the other end of the seg dup pair. The assembler builds two contigs that actually represent the diverged sequences from either paralog. We then choose one of the two assembled contigs based on this code and throw the other away. What if we choose the wrong one (i.e. the one that represents the paralog of the sequence in the original interval)?

Yes, on reflection 5% does seem pretty generous. I'll test with 1%.
I need some time to ponder the deeper question of what bubble popping could do to seg dups. (My first thought is that we're never going to resolve large, nearly identical seg dups anyway, so let's not sweat it. But my second thought is that maybe I should think about it a little more.)

Yeah, I wasn't thinking that we'd be able to assemble them perfectly, but was worried that doing this might introduce misassemblies in some cases. Just speculative, though.

I tested with a 1% max mismatch rate, and the results don't differ materially, and are arguably a tiny bit better. So I'll make that change. Let's have a brief chalk talk about seg dups when we're all in the office.

cwhelan · 2018-03-29T15:40:04Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+                    final boolean isRC = canonical != contigLocation.isCanonical();
+                    final int tig2Offset =
+                            isRC ? tig2Bases.length - contigLocation.getOffset() - kmerSize : contigLocation.getOffset();
+                    if ( tigOffset > tig2Offset ||


I understand this line after noodling it over a little bit -- you only want to remove contigs that would be wholly contained inside another if the two were aligned -- but an explanatory comment might be helpful.

// if the number of bases upstream of the matching kmer is greater for tig than for tig2, then
// tig2 doesn't completely cover tig and so can't shadow it
if ( tigOffset > tig2Offset ) continue;
// similarly for the bases downstream of the matching kmer. if tig has more of them than tig2,
// then tig isn't shadowed by tig2.
if ( tigBases.length - tigOffset > tig2Bases.length - tig2Offset ) continue;
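The two bounds checks in the reply above can be exercised in isolation. This is a sketch, not the PR's code: the class and method names are hypothetical, and it works on lengths and offsets only. The reverse-complement offset formula is the one shown in the hunk.

```java
final class ShadowBoundsSketch {
    // Offset of the shared kmer on tig2, expressed in tig's orientation.
    // For a reverse-complement match the offset is measured from the other end.
    static int orientedOffset(final int tig2Length, final int kmerOffset,
                              final int kmerSize, final boolean isRC) {
        return isRC ? tig2Length - kmerOffset - kmerSize : kmerOffset;
    }

    // True when tig2 extends at least as far as tig on both sides of the
    // shared kmer, i.e. tig could be wholly contained in tig2.
    static boolean couldBeShadowed(final int tigLength, final int tigOffset,
                                   final int tig2Length, final int tig2Offset) {
        if (tigOffset > tig2Offset) {
            return false; // tig has more upstream bases than tig2
        }
        if (tigLength - tigOffset > tig2Length - tig2Offset) {
            return false; // tig has more downstream bases than tig2
        }
        return true;
    }

    public static void main(final String[] args) {
        final int tig2Offset = orientedOffset(100, 39, 31, true);
        System.out.println(couldBeShadowed(50, 10, 100, tig2Offset));
    }
}
```

Passing these checks only means containment is geometrically possible; the base-by-base mismatch comparison still decides whether the contig is actually shadowed.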

cwhelan · 2018-03-29T15:43:05Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+                        if ( !isRC ) {
+                            for ( int idx = 0; idx != tigBases.length; ++idx ) {
+                                if ( tigBases[idx] != tig2Bases[tig2Start+idx] ) {
+                                    if ( (nMismatches += 1) > maxMismatches ) break;


My Java style pedant side doesn't really like assignments in if test clauses, could you split this into two lines?

Done.
But my C++ side doesn't like your Java pedantry. :-)

cwhelan · 2018-03-29T15:43:27Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+                            final int tig2RCOffset = tig2Bases.length - tig2Start - 1;
+                            for ( int idx = 0; idx != tigBases.length; ++idx ) {
+                                if ( tigBases[idx] != BaseUtils.simpleComplement(tig2Bases[tig2RCOffset-idx]) ) {
+                                    if ( (nMismatches += 1) > maxMismatches ) break;


split into two lines?
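As an illustration of the requested split, here is a sketch of the mismatch count with the increment and the limit test on separate lines, covering both the forward and the reverse-complement branch. The index arithmetic follows the two hunks above; the class name, the `withinMismatchLimit` method, and the local `complement` helper (standing in for `BaseUtils.simpleComplement`) are hypothetical.

```java
final class MismatchCountSketch {
    // Hypothetical stand-in for BaseUtils.simpleComplement; handles upper-case ACGT only.
    static byte complement(final byte base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    // Compares tig against tig2 starting at tig2Start (in tig's orientation) and
    // gives up as soon as the mismatch budget is exceeded.
    static boolean withinMismatchLimit(final byte[] tigBases, final byte[] tig2Bases,
                                       final int tig2Start, final boolean isRC,
                                       final int maxMismatches) {
        final int tig2RCOffset = tig2Bases.length - tig2Start - 1;
        int nMismatches = 0;
        for (int idx = 0; idx != tigBases.length; ++idx) {
            final byte other = isRC
                    ? complement(tig2Bases[tig2RCOffset - idx])
                    : tig2Bases[tig2Start + idx];
            if (tigBases[idx] != other) {
                nMismatches += 1;
                if (nMismatches > maxMismatches) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(final String[] args) {
        final byte[] tig2 = "TTACGGA".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        final byte[] tig = "CCGT".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println(withinMismatchLimit(tig, tig2, 1, true, 0));
    }
}
```

The sketch assumes the caller has already verified that tig fits inside tig2 at the given start, so no array bounds are checked here.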

cwhelan · 2018-03-29T18:30:34Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+                        .mapToInt(conn -> conn.getTarget().getSequence().length - conn.getOverlapLen())
+                        .reduce(firstContig.getSequence().length, Integer::sum);
+        final byte[] sequence = new byte[newContigLen];
+        int dstIndex = firstContig.getSequence().length;


Is this short for destinationIndex? Can you spell it out please?

cwhelan · 2018-03-29T18:41:44Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+            }
+            dstIndex += len;
+        }
+        return new Contig(sequence, null, nSupportingReads);


It might be good to document somewhere that the coverage info is lost when the contigs are joined.

cwhelan · 2018-03-29T18:46:25Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+        return joinedContig;
+    }
+
+    private static Connection rcConnection( final Contig contig, final Connection connection ) {


This might be nicer as a method on Connection?

I agree, but that code lives in another project. I'll move the methods getSolePredecessor, getSoleSuccessor, getSingletonConnection, and rcConnection to gatk-fermilite-jni, and remove them from here once that code gets updated.

Ah, sorry, I missed that. It's not a huge deal so don't worry about it if it's a hassle to rev gatk-fermilite-jni.

cwhelan · 2018-03-29T20:45:07Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+            final boolean needsPhasing = nPredecessors > 1 && nSuccessors > 1;
+            if ( needsPhasing ) {
+                // the first time we find a contig that needs phasing info to avoid creating false joins:
+                // we end the current path at that contig, but initiate a new path from that contig.


So if the original graph has A -> B -> C and D -> B -> E, we will end up with contigs AB, DB, BC, and BE? If true maybe add that as an illustrative example to the comment, and maybe reword this line as "we end the current path at that contig, and treat this contig as a new source in the graph" or something similar?

I added more explanation, and copied the example from the test case here as well.

cwhelan · 2018-03-29T20:49:47Z

src/test/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/ReviseAssemblyUnitTest.java

+
+    @Test(groups = "sv")
+    void testNoCrossingUnphasedContigs() {
+        // test assembly has the structure A->C, B->C, C->D, C->E.  expanded contigs should be AC, BC, CD, and CE.


Ah, you answered my question from above.
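To make the rule concrete, here is a small sketch that reproduces the example in the test comment above on a graph of string labels. It is not the PR's expandAssemblyGraph: the class and method names are hypothetical, the graph is assumed to be acyclic, and strands, overlaps, and sequences are ignored. A path ends at any contig with more than one predecessor and more than one successor, and that contig then starts paths of its own.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class UnphasedExpansionSketch {
    static List<String> expand(final Map<String, List<String>> successors) {
        // Count predecessors of every node; the TreeMap keeps the output order stable.
        final Map<String, Integer> nPredecessors = new TreeMap<>();
        for (final Map.Entry<String, List<String>> entry : successors.entrySet()) {
            nPredecessors.putIfAbsent(entry.getKey(), 0);
            for (final String target : entry.getValue()) {
                nPredecessors.merge(target, 1, Integer::sum);
            }
        }
        final List<String> expanded = new ArrayList<>();
        final Deque<String> starts = new ArrayDeque<>();
        final Set<String> queued = new HashSet<>();
        for (final Map.Entry<String, Integer> entry : nPredecessors.entrySet()) {
            if (entry.getValue() == 0 && queued.add(entry.getKey())) {
                starts.add(entry.getKey()); // graph sources
            }
        }
        while (!starts.isEmpty()) {
            final List<String> path = new ArrayList<>();
            path.add(starts.remove());
            extend(path, successors, nPredecessors, expanded, starts, queued);
        }
        return expanded;
    }

    private static void extend(final List<String> path, final Map<String, List<String>> successors,
                               final Map<String, Integer> nPredecessors, final List<String> expanded,
                               final Deque<String> starts, final Set<String> queued) {
        final String last = path.get(path.size() - 1);
        final List<String> next = successors.getOrDefault(last, Collections.emptyList());
        final boolean needsPhasing = nPredecessors.get(last) > 1 && next.size() > 1;
        if (path.size() > 1 && needsPhasing) {
            // End the current path here, and treat this contig as a new source.
            expanded.add(String.join("", path));
            if (queued.add(last)) {
                starts.add(last);
            }
            return;
        }
        if (next.isEmpty()) {
            expanded.add(String.join("", path));
            return;
        }
        for (final String target : next) {
            path.add(target);
            extend(path, successors, nPredecessors, expanded, starts, queued);
            path.remove(path.size() - 1);
        }
    }
}
```

With the structure A->C, B->C, C->D, C->E this yields AC, BC, CD, and CE, and the earlier A->B->C, D->B->E example yields AB, DB, BC, and BE.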

mwalker174

Looks good to me. I made one minor suggestion.

Since determinism has been brought up at our meeting I gave it some thought here. When I hear "deterministic" I think that a given input should produce the exact same output. But what if two "equivalent" inputs (e.g. different order or direction of contigs) generate outputs that are not identical? We can allow this in the pipeline as long as every stage has this property. My concern is that if this does not hold anywhere along the way, we may lose determinism.

If the output from here is directly handed off to the aligner I suppose there is no reason to believe that order will matter in this case. However, it may be helpful for our sanity to enforce that "equivalent" inputs should generate identical outputs at each pipeline stage, that way it will be easier to track down determinism issues.

I'll leave it up to you whether you think it's too painful/expensive to add this or if you're confident it won't affect anything downstream.

mwalker174 · 2018-03-30T15:40:57Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/evidence/FermiLiteAssemblyHandler.java

+    }
+
+    @VisibleForTesting
+    static FermiLiteAssembly removeShadowedContigs( final FermiLiteAssembly assembly ) {


Can you add a brief comment defining "shadowed" and explaining what the method does?

If we see any evidence of stochastic output from run to run, I'll provide a canonicalization of the assembly. So far, I don't think we've seen any evidence of this.
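One possible shape for such a canonicalization, sketched on bare sequences: orient each contig as the lexicographically smaller of itself and its reverse complement, then sort. This is purely illustrative and not something the PR implements; the class and method names are hypothetical, and a real version would also have to rewrite the connections between contigs, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class CanonicalAssemblySketch {
    static String reverseComplement(final String sequence) {
        final StringBuilder sb = new StringBuilder(sequence.length());
        for (int idx = sequence.length() - 1; idx >= 0; --idx) {
            switch (sequence.charAt(idx)) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default:  sb.append('N'); break;
            }
        }
        return sb.toString();
    }

    // Pick one orientation per contig so that strand no longer matters.
    static String canonical(final String sequence) {
        final String rc = reverseComplement(sequence);
        return sequence.compareTo(rc) <= 0 ? sequence : rc;
    }

    // Equivalent inputs (any order, either strand) produce the same list.
    static List<String> canonicalize(final Collection<String> contigs) {
        final List<String> result = new ArrayList<>();
        for (final String contig : contigs) {
            result.add(canonical(contig));
        }
        Collections.sort(result);
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(canonicalize(java.util.Arrays.asList("TTGA", "ACCG")));
    }
}
```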

tedsharpe · 2018-03-30T17:42:18Z

Checked in my changes that respond to your helpful comments. Thanks, reviewers. Have another look if you'd like.

codecov-io · 2018-03-30T19:46:46Z

Codecov Report

Merging #4589 into master will decrease coverage by 0.003%.
The diff coverage is 87.413%.

@@               Coverage Diff               @@
##              master     #4589       +/-   ##
===============================================
- Coverage     79.857%   79.854%   -0.003%     
+ Complexity     17054     17040       -14     
===============================================
  Files           1067      1062        -5     
  Lines          62031     61948       -83     
  Branches       10039     10052       +13     
===============================================
- Hits           49536     49468       -68     
+ Misses          8582      8560       -22     
- Partials        3913      3920        +7

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| .../sv/StructuralVariationDiscoveryPipelineSpark.java | `88.806% <ø> (+1.306%)` | `10 <0> (-2)` | ⬇️ |
| ...llbender/tools/spark/sv/evidence/SVReadFilter.java | `70.588% <ø> (ø)` | `26 <0> (ø)` | ⬇️ |
| ...tructuralVariationDiscoveryArgumentCollection.java | `97.222% <100%> (+0.253%)` | `0 <0> (ø)` | ⬇️ |
| ...ols/spark/sv/evidence/AlignedAssemblyOrExcuse.java | `83.06% <57.143%> (-10.795%)` | `36 <0> (-2)` | |
| ...spark/sv/evidence/FindBreakpointEvidenceSpark.java | `69.583% <66.667%> (-0.145%)` | `60 <0> (ø)` | |
| ...ls/spark/sv/evidence/FermiLiteAssemblyHandler.java | `85.953% <88.278%> (+19.287%)` | `87 <85> (+84)` | ⬆️ |
| ...der/tools/spark/sv/discovery/SvDiscoveryUtils.java | `8.333% <0%> (-20.897%)` | `2% <0%> (-4%)` | |
| ...lignment/AssemblyContigAlignmentsConfigPicker.java | `79.621% <0%> (-12.903%)` | `57% <0%> (-15%)` | |
| ...lbender/utils/io/HardThresholdingOutputStream.java | `70% <0%> (-10%)` | `3% <0%> (-1%)` | |
| ...g/broadinstitute/hellbender/utils/io/Resource.java | `42.857% <0%> (-9.524%)` | `4% <0%> (-1%)` | |
| ... and 42 more | | | |

tedsharpe · 2018-04-26T18:59:34Z

Progress report:
FP 14% --> 16%
FN 81% --> 76%
Variants 7085 --> 8867

* improve local assembly contiguity
* remove test programs, clean up some unused code, add unit tests

SHuang-Broad added the SV label Mar 26, 2018

SHuang-Broad self-requested a review March 27, 2018 23:56

SHuang-Broad reviewed Mar 29, 2018

cwhelan approved these changes Mar 29, 2018

mwalker174 approved these changes Mar 30, 2018

tedsharpe added 2 commits March 30, 2018 15:49

improve local assembly contiguity

87f3179

remove test programs, clean up some unused code, add unit tests

31fb5f8

tedsharpe force-pushed the tws_scaffolds branch from 6663093 to 31fb5f8 Compare March 30, 2018 19:50

tedsharpe merged commit 0a89ef9 into master Apr 2, 2018

tedsharpe deleted the tws_scaffolds branch April 2, 2018 17:22

cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018

scaffold local assemblies (broadinstitute#4589)

2bf7e79

* improve local assembly contiguity
* remove test programs, clean up some unused code, add unit tests
