Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tws sv local assembler #6989

Merged
merged 1 commit into from Jun 3, 2021
Merged

Tws sv local assembler #6989

merged 1 commit into from Jun 3, 2021

Conversation

tedsharpe
Copy link
Contributor

No description provided.

@gatk-bot
Copy link

gatk-bot commented Dec 4, 2020

Travis reported job failures from build 32329
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32329.1 logs
cloud openjdk11 32329.14 logs
integration openjdk11 32329.12 logs
integration openjdk8 32329.2 logs
cloud openjdk8 32329.1 logs
cloud openjdk11 32329.14 logs
integration openjdk11 32329.12 logs
integration openjdk8 32329.2 logs

@gatk-bot
Copy link

gatk-bot commented Jan 25, 2021

Travis reported job failures from build 32618
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32618.1 logs
cloud openjdk11 32618.14 logs
integration openjdk11 32618.12 logs
integration openjdk8 32618.2 logs

@@ -57,7 +57,7 @@ protected final void onStartup() {
*/
void setReadTraversalBounds() {
if ( hasUserSuppliedIntervals() ) {
reads.setTraversalBounds(intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()));
reads.setTraversalBounds(userIntervals);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tedsharpe Not sure if this was a temporary workaround for some issue you encountered, but the previous implementation passing in intervalArgumentCollection.getTraversalParameters() is actually necessary to support -L unmapped. Hopefully this is covered by existing tests somewhere...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, bummer. I'll have to rework this, then. I'm a little vague on the details right now, but it had something to do with an additional tier of padding that I'm doing to make sure that I pull in the distant mates (via the tool whose design you suggested). I had to expand the intervals over which I do read retrieval, while still having access to the underlying singly-padded intervals over which I do assembly.
I asked Louis about why we pre-calculated userIntervals if we were just going to recalculate them here. He thought it seemed an obvious bug at the time. It's so hard to change engine code. :-(

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't despair @tedsharpe, I think we can probably fix this pretty easily! You just need to construct a new TraversalParameters here, and make sure that the user's traverseUnmapped setting gets propagated. Something like this could do for now:

    void setReadTraversalBounds() {
        if ( hasUserSuppliedIntervals() ) {
            reads.setTraversalBounds(new TraversalParameters(userIntervals, intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()).traverseUnmappedReads());
        }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adopted your suggestion. Thanks very much for catching this.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tedsharpe thanks for this massive amount of work. Looking forward to experimenting with this tool.

This is a lot to review, and I've tried my best to understand what's going on, but I can't claim to have given it a close enough reading to catch every possible bug.

A lot of my comments are either requests for clarification where I might not have understood something, or minor refactoring suggestions that you could take or leave as you see fit. If I'm totally misunderstanding something just let me know, too.


@DocumentedFeature
@CommandLineProgramProperties(
summary = "experiment",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fill these in with something slightly more descriptive (while keeping in mind that it is still an experiment)

import java.util.Map;

@CommandLineProgramProperties(
summary = "Prints reads that have distant mates using the mate's alignment information. Yes, this is weird, but it helps in processing pairs together (see PairWalker).",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify in this comment whether it's a requirement to run this before using PairWalker?

@@ -57,7 +57,7 @@ protected final void onStartup() {
*/
void setReadTraversalBounds() {
if ( hasUserSuppliedIntervals() ) {
reads.setTraversalBounds(intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()));
reads.setTraversalBounds(userIntervals);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you need to change this? It seems wrong to have to change ReadWalker. This will cause any read walkers to not do any of the interval merging logic in IntervalArgumentCollection which I assume must be desired sometimes. EDIT: sounds like you and @droazen might have found a way to work this out.


private SAMFileGATKReadWriter outputWriter;
private Map<String, GATKRead> pendingReads;
private static final String MATE_CIGAR_TAG = "MC";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are already defined via htsjdk at eg SAMTag.MC.name().

return copy;
}

private static String bogusCigar( final int alignLength, final int readLength ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a quick comment for this ("Provides a CIGAR that forces the alignment length to match the requested value given the actual read length. The CIGAR consists of M operations plus the necessary number of D or I operations to make alignLength equal to readLength.")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also leave a warning about not depending on the CIGAR of these altered reads after running this tool in the main documentation for the class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this silly jazz is gone.

final KmerAdjacency kmer = KmerAdjacencyImpl.find(kVal, kmerAdjacencySet);
// if we fail to look up the kmer
if ( kmer == null ) {
if ( currentPathPart == null ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first and third clauses of this if statement look identical. Maybe combine them by changing this to

currentPathPart == null || ! currentPathPart.isGap()

currentPathPart = new PathPartContig(contig, 0);
parts.add(currentPathPart);
} else {
// weird: kmer is non-contiguous. start a new path part after a zero-length gap
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about how this can happen. Should it be an error condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an error, and I've changed the comment to make it less alarming. Reads with sequencing errors that produce kmers that aren't in the graph can make the read paths appear to jump arbitrarily.

if ( ++count >= Kmer.KSIZE ) {
final KmerAdjacency kmer = KmerAdjacencyImpl.find(kVal, kmerAdjacencySet);
// if we fail to look up the kmer
if ( kmer == null ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested ifs get pretty hairy in this method. Any chance you could pull some of the clauses into little methods for readability. For example this could be:

if (kmer == null) { currentPathPart = addGapKmer(...) } else { ... }

I just find it hard to match up so many levels of indentation in this method.

}
} else {
final int kmerContigOffset = kmer.getContigOffset();
if ( currentPathPart.isGap() ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the transition out of a gap back to a contig, right? Otherwise we wouldn't have found our kmer in the set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Exactly. I'll sprinkle in some more comments.

Assert.assertTrue(regionChecker.isInInterval(createRead(2, 1501)));
}

private final static class PairWalkerTester extends PairWalker {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could consider using Mockito for this sort of thing.

Copy link
Contributor Author

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to you for another look, Chris. Wherever I haven't replied to your comment, it means I adopted your suggestion. This is an attempt to avoid cluttering Slack with a zillion messages saying "Done." "Done." "Done." Let me know how this works.

@@ -57,7 +57,7 @@ protected final void onStartup() {
*/
void setReadTraversalBounds() {
if ( hasUserSuppliedIntervals() ) {
reads.setTraversalBounds(intervalArgumentCollection.getTraversalParameters(getHeaderForReads().getSequenceDictionary()));
reads.setTraversalBounds(userIntervals);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adopted your suggestion. Thanks very much for catching this.

return copy;
}

private static String bogusCigar( final int alignLength, final int readLength ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this silly jazz is gone.

final ReferenceContext referenceContext,
final FeatureContext featureContext ) {
final String mateCigarString = read.getAttributeAsString(MATE_CIGAR_TAG);
if ( mateCigarString != null ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten the tool to take a much simpler approach. It now just unmaps (but places) distant mates. The original alignment is saved in an OA tag, so we can redo the mapping in the PairWalker. We don't need mate cigars. And, even better, the resulting output BAM contains no lies (by which I mean fictitious alignments).

)
@BetaFeature
public class LocalAssembler extends PairWalker {
public static final byte QMIN = 25;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did fiddle a bit. The quality score bins are now so sparse in order to improve compression and save space that we'd have to lower QMIN to 20 to make any difference, and that's just too low. We're walking a narrow path between building a furry graph that we fail to clean up adequately, and building a poorly connected graph that we fail to gap-fill adequately.
I have reasonable confidence that this is the right number.

if ( !read.isUnmapped() && !mate.isUnmapped() &&
read.isReverseStrand() != mate.isReverseStrand() ) {
// and both start within 1 base on the ref
if ( Math.abs(read.getStart() - read.getMateStart()) <= 1 ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Occasionally (like maybe 1/4 of the time) the first base of adapter will happen to agree with the reference. So the aligned region can be extended into what's actually adapter. Unlikely to be more than a base or two, though.

return traversalSet;
}

private static int findMaxOverlap( final List<Contig> prefixes, final List<Contig> suffixes ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about this method (findMaxOverlap): it may be providing some short-circuits that might explain my overcalling deletions. But I'll document the assumptions. It's actually pretty straightforward: the last Contig in the prefix list and the first Contig in the suffix list are always the same because we walked forward and backward from some given Contig as a starting point. If the first Contig in the suffix list is not part of a cycle, that means it can appear in the suffix list only once (to appear twice would mean it is part of a cycle). Ditto with the last Contig in the prefix list. So the maximum overlap is just that single contig at the end of the prefix list and at the start of the suffix list.

} else {
final StringBuilder sb = new StringBuilder();
char prefix = '\t';
for ( final Contig predecessor : predecessors ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code went away when I switched to writing GFA. I've switched to functional style in that new code.


/** fixed-size, immutable kmer. usual 2-bit encoding: ACGT->0123. low order bits are final call. **/
public static class Kmer {
public static final int KSIZE = 31; // must be odd number less than 32
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KSIZE=31 is fine for well-behaved regions of the genome, and woefully inadequate for ill-behaved, low-complexity regions. I decided to see what could be done just with graph operations, rather than do the thing that assemblers usually do (use a small K to build a graph, error correct the reads, use a larger K to build a less tangled graph). Error correction has its own hazards, and it's computationally expensive. So I guess I'm reasonably satisfied: This is fast, but it sometimes produces graphs so tangled as to be not particularly informative. But to tell you the truth, most assemblers don't produce great results on the highly repetitive regions.
It would be easy enough to revise this one Kmer class to allow larger K. What's hard is to provide an option for a variety of sizes (as we did for our first attempt at an SV tool in Spark). But with uncorrected short reads you can't start with a K much higher than 31, anyway.

startingContigSet.remove(firstContig);
}

private static void removeTriviallyDifferentTraversals(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you say is true (that we won't detect as trivially different Traversals that don't start and end at the same place). But it isn't anything to worry about: We're just trying to reduce the number of sequences that need to be aligned and searched for SVs. We don't have to be exhaustively thorough. I couldn't dream up a way to be more thorough that wasn't computationally much more complex. The other thing is that the usual pattern is: large contig, pair of bubbles, large contig, pair of bubbles, large contig, etc. So as a practical matter, I think this usually works pretty well.
Damn fine catch, though.

prevTraversal = curTraversal;
}
}
sortedTraversals.removeIf(Traversal::isRC);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added both strands at the beginning, and I thought I had the sort order arranged so that, for any given pair that was trivially different, at least one of the +strand guys survived removal. Alas, that wasn't true. It's rewritten to avoid that assumption.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your changes look good to me after playing with the tool a little bit. It's OK with me to go ahead and merge.

@tedsharpe tedsharpe merged commit ef36d6c into master Jun 3, 2021
@tedsharpe tedsharpe deleted the tws_sv_local_assembler branch June 3, 2021 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants