-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scaffold local assemblies #4589
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tedsharpe Done with my review with mostly minor comments. As I've never done an assembly myself, I'll let others judge if the details are correct.
@@ -109,6 +109,15 @@ | |||
@Argument(doc = "Write GFA representation of assemblies in fastq-dir.", fullName = "write-gfas") | |||
public boolean writeGFAs = false; | |||
|
|||
@Argument(doc = "Aggressively simplify local assemblies, ignoring small variants.", fullName = "pop-variant-bubbles") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these parameters, do you intend to mark them as @Advanced
, as most users probably don't want to mess with them?
import java.util.Comparator; | ||
import java.util.List; | ||
import java.io.*; | ||
import java.util.*; | ||
|
||
/** This LocalAssemblyHandler aligns assembly contigs with BWA, along with some optional writing of intermediate results. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to update this doc?
this.alignerIndexFile = alignerIndexFile; | ||
this.maxFastqSize = maxFastqSize; | ||
this.fastqDir = fastqDir; | ||
this.writeGFAs = writeGFAs; | ||
this.popVariantBubbles = popVariantBubbles; | ||
this.removeShadowedContigs = removeShadowedContigs; | ||
this.expandAssemblyGraph = expandAssemblyGraph; | ||
} | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add simple doc for this function?
final int MAG_F_NO_SIMPL = 0x80; // skip bubble simplification (default) | ||
// add aggressive-popping flag, and remove no-simplification flag | ||
assembler.setCleaningFlag(MAG_F_AGGRESSIVE | MAG_F_POPOPEN); | ||
} | ||
final long timeStart = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I'm suggesting you do it in this PR, but since there's the idea of doing assembly diagnostics in the future, it might be worth thinking about extracting a "AssemblyDiagnosis" object, or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what the AlignedAssemblyOrExcuse class is for.
final FermiLiteAssembler assembler = new FermiLiteAssembler(); | ||
if ( popVariantBubbles ) { | ||
final int MAG_F_AGGRESSIVE = 0x20; // pop variant bubbles (not default) | ||
final int MAG_F_POPOPEN = 0x40; // aggresive tip trimming (default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"aggressive"
} | ||
|
||
// join the sequences of a chain of contigs to produce a single, new contig | ||
private static Contig joinContigs( final Contig firstContig, final List<Connection> path ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems that this is assuming the input path
is not empty. makes sense to check and return early.
} | ||
|
||
// combine two contigs into one, preserving all their connections, except their connections to each other | ||
private static Contig joinContigsWithConnections( final Contig firstContig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend adding unit test for this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will, if you insist, but I think it's covered by the unit test of removeUnbranchedConnections.
contig.setConnections(newConnections); | ||
} | ||
|
||
// join the sequences of a chain of contigs to produce a single, new contig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend adding unit test for this method as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you insist, but I think the unit tests for removeUnbranchedConnections and expandAssemblyGraph exercise this method adequately.
} | ||
|
||
// contig + strand info. | ||
private static final class ContigStrand { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to name this StrandedContig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yuck. :-)
I'd never leave a contig stranded. The poor thing.
contigList.add(tig); | ||
examined.add(tig); | ||
} else { | ||
final int nPredecessors = countPredecessors(tig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct, that after the call to removeUnbranchedConnections()
, there could only be
- isolated islands,
- cycles, and
- parallel paths (i.e. bubble, if not an abuse of terms);
and this block is to extend the parallel paths as much as possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The graph structure will be arbitrarily complex with isolated islands, and multiple components consisting of cycles and branching structures potentially a lot more complex than implied by "bubble". But, yeah. I think you've got the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Very nice code. Mostly a few minor comments and a question about the mismatch parameter you use for determining if contigs are "shadowed".
@VisibleForTesting | ||
static FermiLiteAssembly removeShadowedContigs( final FermiLiteAssembly assembly ) { | ||
final int kmerSize = 31; | ||
final double maxMismatchRate = .05; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5% divergence seems like quite a lot. If we choose the wrong contig out of a pair of contigs that differs by 5%, it might have a deleterious effect on contig alignment (or later genotyping). One thing that worries me is this scenario:
The evidence interval lies in a segmental duplication. Our kmer search brings in reads that actually belong to the other end of the seg dup pair. The assembler builds two contigs that actually represent the diverged sequences from either paralog. We then choose one of the two assembled contigs based on this code and throw the other away. What if we choose the wrong one (ie the one that represents the paralog of the sequence in the original interval)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on reflection 5% does seem pretty generous. I'll test with 1%.
I need some time to ponder the deeper question of what bubble popping could do to seg dups. (My first thought is that we're never going to resolve large, nearly identical seg dups anyway, so let's not sweat it. But my second thought is that maybe I should think about it a little more.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't thinking that we'd be able to assemble them perfectly, but was worried that doing this might introduce misassemblies in some cases. Just speculative, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with a 1% max mismatch rate, and the results don't differ materially, and are arguably a tiny bit better. So I'll make that change. Let's have a brief chalk talk about seg dups when we're all in the office.
final boolean isRC = canonical != contigLocation.isCanonical(); | ||
final int tig2Offset = | ||
isRC ? tig2Bases.length - contigLocation.getOffset() - kmerSize : contigLocation.getOffset(); | ||
if ( tigOffset > tig2Offset || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this line after noodling it over a little bit -- you only want to remove contigs that would be wholly contained inside another if the two were aligned -- but an explanatory comment might be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if the number of bases upstream of the matching kmer is greater for tig than for tig2, then
// tig2 doesn't completely cover tig and so can't shadow it
if ( tigOffset > tig2Offset ) continue;
// similarly for the bases downstream of the matching kmer. if tig has more of them than tig2,
// then tig isn't shadowed by tig2.
if ( tigBases.length - tigOffset > tig2Bases.length - tig2Offset ) continue;
if ( !isRC ) { | ||
for ( int idx = 0; idx != tigBases.length; ++idx ) { | ||
if ( tigBases[idx] != tig2Bases[tig2Start+idx] ) { | ||
if ( (nMismatches += 1) > maxMismatches ) break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My Java style pedant side doesn't really like assignments in if test clauses, could you split this into two lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
But my C++ side doesn't like your Java pedantry. :-)
final int tig2RCOffset = tig2Bases.length - tig2Start - 1; | ||
for ( int idx = 0; idx != tigBases.length; ++idx ) { | ||
if ( tigBases[idx] != BaseUtils.simpleComplement(tig2Bases[tig2RCOffset-idx]) ) { | ||
if ( (nMismatches += 1) > maxMismatches ) break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split into two lines?
.mapToInt(conn -> conn.getTarget().getSequence().length - conn.getOverlapLen()) | ||
.reduce(firstContig.getSequence().length, Integer::sum); | ||
final byte[] sequence = new byte[newContigLen]; | ||
int dstIndex = firstContig.getSequence().length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this short for destinationIndex
? Can you spell it out please?
} | ||
dstIndex += len; | ||
} | ||
return new Contig(sequence, null, nSupportingReads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to document somewhere that the coverage info is lost when the contigs are joined.
return joinedContig; | ||
} | ||
|
||
private static Connection rcConnection( final Contig contig, final Connection connection ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be nicer as a method on Connection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but that code lives in another project. I'll move the methods getSolePredecessor, getSoleSuccessor, getSingletonConnection, and rcConnection to gatk-fermilite-jni, and remove them from here once that code gets updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, I missed that. It's not a huge deal so don't worry about it if it's a hassle to rev gatk-fermilite-jni.
final boolean needsPhasing = nPredecessors > 1 && nSuccessors > 1; | ||
if ( needsPhasing ) { | ||
// the first time we find a contig that needs phasing info to avoid creating false joins: | ||
// we end the current path at that contig, but initiate a new path from that contig. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if the original graph has A -> B -> C
and D -> B -> E
, we will end up with contigs AB
, DB
, BC
, and BE
? If true maybe add that as an illustrative example to the comment, and maybe reword this line as "we end the current path at that contig, and treat this contig as a new source in the graph" or something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more explanation, and copied the example from the test case here as well.
|
||
@Test(groups = "sv") | ||
void testNoCrossingUnphasedContigs() { | ||
// test assembly has the structure A->C, B->C, C->D, C->E. expanded contigs should be AC, BC, CD, and CE. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you answered my question from above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I made one minor suggestion.
Since determinism has been brought up at our meeting I gave it some thought here. When I hear "deterministic" I think that a given input should produce the exact same output. But what about if two "equivalent" inputs (e.g. different order or direction of contigs) generate outputs that are not identical? We can allow this in the pipeline as long as every stage has this property. My concern is that if this does not hold anywhere along the way, we may lose determinism.
If the output from here is directly handed off to the aligner I suppose there is no reason to believe that order will matter in this case. However, it may be helpful for our sanity to enforce that "equivalent" inputs should generate identical outputs at each pipeline stage, that way it will be easier to track down determinism issues.
I'll leave it up to you whether you think it's too painful/expensive to add this or if you're confident it won't affect anything downstream.
} | ||
|
||
@VisibleForTesting | ||
static FermiLiteAssembly removeShadowedContigs( final FermiLiteAssembly assembly ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a brief comment defining "shadowed" and explaining what the method does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we see any evidence of stochastic output from run to run, I'll provide a canonicalization of the assembly. So far, I don't think we've seen any evidence of this.
Checked in my changes that respond to your helpful comments. Thanks, reviewers. Have another look if you'd like. |
Codecov Report
@@ Coverage Diff @@
## master #4589 +/- ##
===============================================
- Coverage 79.857% 79.854% -0.003%
+ Complexity 17054 17040 -14
===============================================
Files 1067 1062 -5
Lines 62031 61948 -83
Branches 10039 10052 +13
===============================================
- Hits 49536 49468 -68
+ Misses 8582 8560 -22
- Partials 3913 3920 +7
|
6663093
to
31fb5f8
Compare
Progress report: |
* improve local assembly contiguity * remove test programs, clean up some unused code, add unit tests
No description provided.