Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scaffold local assemblies #4589

Merged
merged 2 commits into from
Apr 2, 2018
Merged

scaffold local assemblies #4589

merged 2 commits into from
Apr 2, 2018

Conversation

tedsharpe
Copy link
Contributor

No description provided.

@SHuang-Broad SHuang-Broad self-requested a review March 27, 2018 23:56
Copy link
Contributor

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tedsharpe Done with my review with mostly minor comments. As I've never done an assembly myself, I'll let others judge if the details are correct.

@@ -109,6 +109,15 @@
@Argument(doc = "Write GFA representation of assemblies in fastq-dir.", fullName = "write-gfas")
public boolean writeGFAs = false;

@Argument(doc = "Aggressively simplify local assemblies, ignoring small variants.", fullName = "pop-variant-bubbles")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these parameters, do you intend to mark them as @Advanced, as most users probably don't want to mess with them?

import java.util.Comparator;
import java.util.List;
import java.io.*;
import java.util.*;

/** This LocalAssemblyHandler aligns assembly contigs with BWA, along with some optional writing of intermediate results. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to update this doc?

this.alignerIndexFile = alignerIndexFile;
this.maxFastqSize = maxFastqSize;
this.fastqDir = fastqDir;
this.writeGFAs = writeGFAs;
this.popVariantBubbles = popVariantBubbles;
this.removeShadowedContigs = removeShadowedContigs;
this.expandAssemblyGraph = expandAssemblyGraph;
}

@Override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add simple doc for this function?

final int MAG_F_NO_SIMPL = 0x80; // skip bubble simplification (default)
// add aggressive-popping flag, and remove no-simplification flag
assembler.setCleaningFlag(MAG_F_AGGRESSIVE | MAG_F_POPOPEN);
}
final long timeStart = System.currentTimeMillis();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I'm suggesting you do it in this PR, but since there's the idea of doing assembly diagnostics in the future, it might be worth thinking about extracting a "AssemblyDiagnosis" object, or something similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the AlignedAssemblyOrExcuse class is for.

final FermiLiteAssembler assembler = new FermiLiteAssembler();
if ( popVariantBubbles ) {
final int MAG_F_AGGRESSIVE = 0x20; // pop variant bubbles (not default)
final int MAG_F_POPOPEN = 0x40; // aggresive tip trimming (default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"aggressive"

}

// join the sequences of a chain of contigs to produce a single, new contig
private static Contig joinContigs( final Contig firstContig, final List<Connection> path ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that this is assuming the input path is not empty. makes sense to check and return early.

}

// combine two contigs into one, preserving all their connections, except their connections to each other
private static Contig joinContigsWithConnections( final Contig firstContig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend adding unit test for this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will, if you insist, but I think it's covered by the unit test of removeUnbranchedConnections.

contig.setConnections(newConnections);
}

// join the sequences of a chain of contigs to produce a single, new contig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend adding unit test for this method as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you insist, but I think the unit tests for removeUnbranchedConnections and expandAssemblyGraph exercise this method adequately.

}

// contig + strand info.
private static final class ContigStrand {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to name this StrandedContig?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yuck. :-)
I'd never leave a contig stranded. The poor thing.

contigList.add(tig);
examined.add(tig);
} else {
final int nPredecessors = countPredecessors(tig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct, that after the call to removeUnbranchedConnections(), there could only be

  • isolated islands,
  • cycles, and
  • parallel paths (i.e. bubble, if not an abuse of terms);

and this block is to extend the parallel paths as much as possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The graph structure will be arbitrarily complex with isolated islands, and multiple components consisting of cycles and branching structures potentially a lot more complex than implied by "bubble". But, yeah. I think you've got the idea.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Very nice code. Mostly a few minor comments and a question about the mismatch parameter you use for determining if contigs are "shadowed".

@VisibleForTesting
static FermiLiteAssembly removeShadowedContigs( final FermiLiteAssembly assembly ) {
final int kmerSize = 31;
final double maxMismatchRate = .05;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5% divergence seems like quite a lot. If we choose the wrong contig out of a pair of contigs that differs by 5%, it might have a deleterious effect on contig alignment (or later genotyping). One thing that worries me is this scenario:

The evidence interval lies in a segmental duplication. Our kmer search brings in reads that actually belong to the other end of the seg dup pair. The assembler builds two contigs that actually represent the diverged sequences from either paralog. We then choose one of the two assembled contigs based on this code and throw the other away. What if we choose the wrong one (ie the one that represents the paralog of the sequence in the original interval)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, on reflection 5% does seem pretty generous. I'll test with 1%.
I need some time to ponder the deeper question of what bubble popping could do to seg dups. (My first thought is that we're never going to resolve large, nearly identical seg dups anyway, so let's not sweat it. But my second thought is that maybe I should think about it a little more.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't thinking that we'd be able to assemble them perfectly, but was worried that doing this might introduce misassemblies in some cases. Just speculative, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with a 1% max mismatch rate, and the results don't differ materially, and are arguably a tiny bit better. So I'll make that change. Let's have a brief chalk talk about seg dups when we're all in the office.

final boolean isRC = canonical != contigLocation.isCanonical();
final int tig2Offset =
isRC ? tig2Bases.length - contigLocation.getOffset() - kmerSize : contigLocation.getOffset();
if ( tigOffset > tig2Offset ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this line after noodling it over a little bit -- you only want to remove contigs that would be wholly contained inside another if the two were aligned -- but an explanatory comment might be helpful.

Copy link
Contributor Author

@tedsharpe tedsharpe Mar 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        // if the number of bases upstream of the matching kmer is greater for tig than for tig2, then
        // tig2 doesn't completely cover tig and so can't shadow it
        if ( tigOffset > tig2Offset ) continue;
        // similarly for the bases downstream of the matching kmer.  if tig has more of them than tig2,
        // then tig isn't shadowed by tig2.
        if ( tigBases.length - tigOffset > tig2Bases.length - tig2Offset ) continue;

if ( !isRC ) {
for ( int idx = 0; idx != tigBases.length; ++idx ) {
if ( tigBases[idx] != tig2Bases[tig2Start+idx] ) {
if ( (nMismatches += 1) > maxMismatches ) break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My Java style pedant side doesn't really like assignments in if test clauses, could you split this into two lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
But my C++ side doesn't like your Java pedantry. :-)

final int tig2RCOffset = tig2Bases.length - tig2Start - 1;
for ( int idx = 0; idx != tigBases.length; ++idx ) {
if ( tigBases[idx] != BaseUtils.simpleComplement(tig2Bases[tig2RCOffset-idx]) ) {
if ( (nMismatches += 1) > maxMismatches ) break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split into two lines?

.mapToInt(conn -> conn.getTarget().getSequence().length - conn.getOverlapLen())
.reduce(firstContig.getSequence().length, Integer::sum);
final byte[] sequence = new byte[newContigLen];
int dstIndex = firstContig.getSequence().length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this short for destinationIndex? Can you spell it out please?

}
dstIndex += len;
}
return new Contig(sequence, null, nSupportingReads);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to document somewhere that the coverage info is lost when the contigs are joined.

return joinedContig;
}

private static Connection rcConnection( final Contig contig, final Connection connection ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be nicer as a method on Connection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but that code lives in another project. I'll move the methods getSolePredecessor, getSoleSuccessor, getSingletonConnection, and rcConnection to gatk-fermilite-jni, and remove them from here once that code gets updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I missed that. It's not a huge deal so don't worry about it if it's a hassle to rev gatk-fermilite-jni.

final boolean needsPhasing = nPredecessors > 1 && nSuccessors > 1;
if ( needsPhasing ) {
// the first time we find a contig that needs phasing info to avoid creating false joins:
// we end the current path at that contig, but initiate a new path from that contig.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the original graph has A -> B -> C and D -> B -> E, we will end up with contigs AB, DB, BC, and BE? If true maybe add that as an illustrative example to the comment, and maybe reword this line as "we end the current path at that contig, and treat this contig as a new source in the graph" or something similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more explanation, and copied the example from the test case here as well.


@Test(groups = "sv")
void testNoCrossingUnphasedContigs() {
// test assembly has the structure A->C, B->C, C->D, C->E. expanded contigs should be AC, BC, CD, and CE.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you answered my question from above.

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I made one minor suggestion.

Since determinism has been brought up at our meeting I gave it some thought here. When I hear "deterministic" I think that a given input should produce the exact same output. But what about if two "equivalent" inputs (e.g. different order or direction of contigs) generate outputs that are not identical? We can allow this in the pipeline as long as every stage has this property. My concern is that if this does not hold anywhere along the way, we may lose determinism.

If the output from here is directly handed off to the aligner I suppose there is no reason to believe that order will matter in this case. However, it may be helpful for our sanity to enforce that "equivalent" inputs should generate identical outputs at each pipeline stage, that way it will be easier to track down determinism issues.

I'll leave it up to you whether you think it's too painful/expensive to add this or if you're confident it won't affect anything downstream.

}

@VisibleForTesting
static FermiLiteAssembly removeShadowedContigs( final FermiLiteAssembly assembly ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief comment defining "shadowed" and explaining what the method does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we see any evidence of stochastic output from run to run, I'll provide a canonicalization of the assembly. So far, I don't think we've seen any evidence of this.

@tedsharpe
Copy link
Contributor Author

Checked in my changes that respond to your helpful comments. Thanks, reviewers. Have another look if you'd like.

@codecov-io
Copy link

codecov-io commented Mar 30, 2018

Codecov Report

Merging #4589 into master will decrease coverage by 0.003%.
The diff coverage is 87.413%.

@@               Coverage Diff               @@
##              master     #4589       +/-   ##
===============================================
- Coverage     79.857%   79.854%   -0.003%     
+ Complexity     17054     17040       -14     
===============================================
  Files           1067      1062        -5     
  Lines          62031     61948       -83     
  Branches       10039     10052       +13     
===============================================
- Hits           49536     49468       -68     
+ Misses          8582      8560       -22     
- Partials        3913      3920        +7
Impacted Files Coverage Δ Complexity Δ
.../sv/StructuralVariationDiscoveryPipelineSpark.java 88.806% <ø> (+1.306%) 10 <0> (-2) ⬇️
...llbender/tools/spark/sv/evidence/SVReadFilter.java 70.588% <ø> (ø) 26 <0> (ø) ⬇️
...tructuralVariationDiscoveryArgumentCollection.java 97.222% <100%> (+0.253%) 0 <0> (ø) ⬇️
...ols/spark/sv/evidence/AlignedAssemblyOrExcuse.java 83.06% <57.143%> (-10.795%) 36 <0> (-2)
...spark/sv/evidence/FindBreakpointEvidenceSpark.java 69.583% <66.667%> (-0.145%) 60 <0> (ø)
...ls/spark/sv/evidence/FermiLiteAssemblyHandler.java 85.953% <88.278%> (+19.287%) 87 <85> (+84) ⬆️
...der/tools/spark/sv/discovery/SvDiscoveryUtils.java 8.333% <0%> (-20.897%) 2% <0%> (-4%)
...lignment/AssemblyContigAlignmentsConfigPicker.java 79.621% <0%> (-12.903%) 57% <0%> (-15%)
...lbender/utils/io/HardThresholdingOutputStream.java 70% <0%> (-10%) 3% <0%> (-1%)
...g/broadinstitute/hellbender/utils/io/Resource.java 42.857% <0%> (-9.524%) 4% <0%> (-1%)
... and 42 more

@tedsharpe tedsharpe merged commit 0a89ef9 into master Apr 2, 2018
@tedsharpe tedsharpe deleted the tws_scaffolds branch April 2, 2018 17:22
@tedsharpe
Copy link
Contributor Author

tedsharpe commented Apr 26, 2018

Progress report:
FP 14% --> 16%
FN 81% --> 76%
Variants 7085 --> 8867

cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
* improve local assembly contiguity

* remove test programs, clean up some unused code, add unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants