Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR series for complex SV, part 3 #3457

Merged
merged 1 commit into from Sep 6, 2017
Merged

PR series for complex SV, part 3 #3457

merged 1 commit into from Sep 6, 2017

Conversation

SHuang-Broad
Copy link
Contributor

This PR deals with long reads with exactly two alignments (no other equally good alignment configuration), mapped to the same chromosome with strand switch, but NOT significantly overlapping each other.

We used to call inversions from such alignments, but it is more appropriate to emit BND records because a lot of times such signal is actually generated from inverted segmental duplications, or simply inverted mobile element insertions. To confidently interpret and distinguish between such events, we need other types of evidence, and is better to be dealt with downstream logic units.

Inverted duplications are NOT dealt with in this PR and is going to be in the next.

NEEDS TO WAIT UNTIL PART 1 & 2 ARE IN.

@codecov-io
Copy link

codecov-io commented Aug 17, 2017

Codecov Report

Merging #3457 into master will decrease coverage by 0.174%.
The diff coverage is 41.199%.

@@               Coverage Diff               @@
##              master     #3457       +/-   ##
===============================================
- Coverage     80.073%   79.899%   -0.174%     
- Complexity     17798     17814       +16     
===============================================
  Files           1194      1198        +4     
  Lines          64540     64743      +203     
  Branches       10021     10056       +35     
===============================================
+ Hits           51679     51729       +50     
- Misses          8859      8998      +139     
- Partials        4002      4016       +14
Impacted Files Coverage Δ Complexity Δ
.../sv/discovery/prototype/InsDelVariantDetector.java 0% <ø> (ø) 0 <0> (ø) ⬇️
...ender/tools/spark/sv/utils/GATKSVVCFConstants.java 0% <ø> (ø) 0 <0> (ø) ⬇️
...tute/hellbender/tools/spark/sv/utils/RDDUtils.java 0% <0%> (ø) 0 <0> (?)
...iscoverFromLocalAssemblyContigAlignmentsSpark.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...v/evidence/experimental/FindSmallIndelRegions.java 0% <0%> (ø) 0 <0> (ø) ⬇️
.../tools/spark/sv/discovery/BreakEndVariantType.java 0% <0%> (ø) 0 <0> (?)
...er/tools/spark/sv/discovery/AlignmentInterval.java 91.057% <100%> (-0.469%) 26 <2> (+1)
...nder/tools/spark/sv/discovery/SvTypeInference.java 73.529% <100%> (ø) 9 <0> (ø) ⬇️
...te/hellbender/tools/spark/sv/discovery/SvType.java 100% <100%> (ø) 6 <1> (ø) ⬇️
...bender/tools/spark/sv/discovery/AlignedContig.java 91.489% <100%> (ø) 13 <0> (ø) ⬇️
... and 12 more

@SHuang-Broad
Copy link
Contributor Author

Step 4 towards #2703

* its name
* its sequence as produced by the assembler (no reverse complement like in the SAM record if it maps to '-' strand), and
* its stripped-down alignment information.
* its name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When formatted as javadoc this will all appear as a single paragraph of text. To write a list in javadoc you need to use some basic HTML:

/**
  * Locally assembled contig:
  * <ul><li>its name
  * <li>its sequence ...
  * <li>its stripped-down ...
  * </ul>
  */

Note that you don't need need </li> as it's automatically closed by <li>.


public final class FileUtils {

public static void writeLinesToSingleFile(final Iterator<String> linesToWrite, final String fileName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this can be replaced with Files.write()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, @pshapiro4broad. This would also us to write to local FS, hadoop, and google buckets, i.e. more general than Files.write().

public static boolean createDirInBucketToWriteTo(final String pathString) {
try {
Utils.nonNull(pathString);
if ( java.nio.file.Files.exists(java.nio.file.Paths.get(pathString)) )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, you should modify your imports so a fully qualified name isn't required for java.io.file. symbols.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to deal with name clashes between nio Files and hadoop Files.

@@ -36,9 +37,11 @@
public final int mismatches;
public final int alnScore;

// if this is true, fields "mapQual", "mismatches", "alnScore" should be viewed with care as they were simply copied from the
// original alignment (not for "mismatches"), which after the split are wrong (we didn't recompute them because that would require expensive SW re-alignment)
// if any of the following boolean fields are true, fields "mapQual", "mismatches", "alnScore" should be viewed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like at least part of this should be put in a javadoc for this field.

novelAdjacencyReferenceLocations.complication, it.next(), contigAlignments, broadcastReference);
result.add(mateRecord);
}
return result.iterator();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return List<> here? A caller can easily convert it to an iterator if needed. Returning List provides more flexibility to the caller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's constraint by API in Spark.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is only being invoked in a lambda as opposed to via method reference right now. I do think that @pshapiro4broad is right in that returning lists is more flexible if we have them. Why not just return the list and then call iterator on it in the lambda in dealWithSimpleStrandSwitchBkpts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@SHuang-Broad
Copy link
Contributor Author

@cwhelan this is now ready. Please review. Thanks!

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good but I think it could use a little more commenting and/or descriptions for some of the meaty logic, and I'm a little unsure that I understood all of it.

* Computes overlap between reference span of the two input alignment intervals.
*/
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you rename the sortAlignments() method to alignmentIntervalComparator() or something similar? That method doesn't actually sort the alignments so the name doesn't seem right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*/
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,
"assumption that first input AI reside a place earlier than second input is violated: \n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message might read more clearly: "assumption that the first input AI is upstream of the second is violated". But, why have this requirement? Why not swap the order of comparison depending on which AI actually comes first? The signature of the method doesn't specify that the intervals need to be ordered so this could be frustrating to others trying to use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the assertion because it is really unnecessary when computing overlaps (order doesn't matter).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also added test

if ( !one.referenceSpan.getContig().equals(two.referenceSpan.getContig()) ) return 0;

// dummy number for chr to be used in constructing SVInterval, since input CA has 2 AI & both map to the same chr
final int dummyChr = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you made this -1 instead of 1? That way it can't be mistaken for a real contig ID by someone who doesn't know what's going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. though this is a temp var that calling functions couldn't reach

private static final long serialVersionUID = 1L;

public static Iterator<VariantContext> produceMultipleAnnotatedVcFromNovelAdjacency(final NovelAdjacencyReferenceLocations novelAdjacencyReferenceLocations,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to this method describing what the parameters should be? It's not clear what should be in the Iterables from the signature. And why pass in Iterables instead of eg lists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added and changed assertion to size 2 exactly

final List<VariantContext> result = new ArrayList<>();
result.add(record);
// hack for now because up to this point inferredType would have max of 2 only
while (it.hasNext()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why you're iterating over this if there are only going to be two things in the list. If there were more inferredTypes, you'd create a bunch of duplicate records that differed only in their type field. Why not: make inferredTypes a list, validate that there are only two of them, and then just grab the second one and remove the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to asserting only 2 (and checked)

final List<CigarElement> resultCEs = threeSections._1();
final int a = readBasesConsumed + ce.getLength() - clipLength;
final CigarOperator op = ce.getOperator().isAlignment() ? CigarOperator.M : CigarOperator.S;
if (clipFrom3PrimeEnd) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code might be a little more concise if you put these things in the sublist in the same order all the time, and then reverse the sublist if clipFrom3PrimeEnd with Collections.reverse(), and then append the sublist to resultCEs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
resultCEs.addAll(cigarElements.subList(idx+1, cigarElements.size()));
}
if (!threeSections._3().isEmpty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't really need this if check, you could still add it in if it's empty.

}

/**
* Taking advantage of the fact that for input read, we know it has only two alignments that map to the same reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add an assertion to this method too if that's the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


final JavaPairRDD<ChimericAlignment, byte[]> simpleStrandSwitchBkpts =
longReads
.mapToPair(SimpleStrandSwitchVariantDetector::convertAlignmentIntervalToChimericAlignment)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me a more straightforward flow would be to filter them on splitPairStrongEnoughEvidenceForCA and then convert the remaining ones to ChimericAlignments, rather than filtering on nonNull.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


import java.io.IOException;

public class VariantDetectorFromLongReadAlignmentsForSimpleStrandSwitchUnitTest extends BaseTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a test for extractCigarElements as well. Perhaps some of the other methods in the class, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test for extractCigarElements for now. I am planning to get the prototype code in first, then test the code extensively, just in case I need to leave for several days...

Copy link
Contributor Author

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented requested changes (mostly, but not all) in 3 commits. @cwhelan please take another look. Thanks!

* Computes overlap between reference span of the two input alignment intervals.
*/
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*/
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,
"assumption that first input AI reside a place earlier than second input is violated: \n" +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the assertion because it is really unnecessary when computing overlaps (order doesn't matter).

*/
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,
"assumption that first input AI reside a place earlier than second input is violated: \n" +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also added test

if ( !one.referenceSpan.getContig().equals(two.referenceSpan.getContig()) ) return 0;

// dummy number for chr to be used in constructing SVInterval, since input CA has 2 AI & both map to the same chr
final int dummyChr = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. though this is a temp var that calling functions couldn't reach

private static final long serialVersionUID = 1L;

public static Iterator<VariantContext> produceMultipleAnnotatedVcFromNovelAdjacency(final NovelAdjacencyReferenceLocations novelAdjacencyReferenceLocations,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added and changed assertion to size 2 exactly

.noneMatch(op -> op.equals(CigarOperator.N) || op.isPadding()),
"Input alignment contains padding or skip operations, which is currently unsupported: " + input.toPackedString());

final Tuple3<List<CigarElement>, List<CigarElement>, List<CigarElement>> threeSections = extractCigarElements(input);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

else { // enough read bases would be clipped

if ( !ce.getOperator().isAlignment() && !ce.getOperator().equals(CigarOperator.I))
throw new GATKException.ShouldNeverReachHereException("Logic error, should not reach here");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if ( !ce.getOperator().isAlignment() && !ce.getOperator().equals(CigarOperator.I))
throw new GATKException.ShouldNeverReachHereException("Logic error, should not reach here");

// dead with cigar first
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should have been "deal with"....


import java.io.IOException;

public class VariantDetectorFromLongReadAlignmentsForSimpleStrandSwitchUnitTest extends BaseTest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test for extractCigarElements for now. I am planning to get the prototype code in first, then test the code extensively, just in case I need to leave for several days...

final List<CigarElement> resultCEs = threeSections._1();
final int a = readBasesConsumed + ce.getLength() - clipLength;
final CigarOperator op = ce.getOperator().isAlignment() ? CigarOperator.M : CigarOperator.S;
if (clipFrom3PrimeEnd) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@SHuang-Broad
Copy link
Contributor Author

ping @cwhelan .

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good now, just a few minor comments.

@@ -67,7 +72,8 @@ public boolean test(final AlignedContig contig) {

/**
* Removes overlap from a designated alignment interval, so that the inverted duplicated reference span is minimal.
* If the two alignment intervals are NOT overlapping, return the original read.
* If the two alignment intervals are NOT overlapping, return the original aligned contig.
* For algorithm {@see <a href="https://github.com/broadinstitute/dsde-methods-sv/pull/8"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is our private repo so I'm not sure we should reference it in comments in the public code -- it would just be frustrating for someone external.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and removed.
But we are more and more in need of a central place for documenting the algorithms used in the various stages of the pipeline.

@@ -77,6 +83,8 @@ private static AlignedContig removeOverlap(final AlignedContig contig) {
} else {
final AlignmentInterval one = contig.alignmentIntervals.get(0),
two = contig.alignmentIntervals.get(1);
// js is for "the shoot-off reference location of a jump that linked to alignment intervals", and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be "linked two alignment intervals"?

I'm not sure I understand what "shoot-off" and "landing" are -- or, I guess I can imagine, but maybe something like "jumpStart" and "jumpEnd" might be more intuitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!OoO!, yes, it should be "two" instead of "to".

I have in mind a cartoonish depiction of a tiny anime figure sliding along the reference guided by the alignments of the read, sometimes jumping back and forth. Sorry for this too-pictorial naming scheme.


// deal with cigar first
newMiddleSection.add( new CigarElement(clipLengthOnRead, CigarOperator.S) );
final int a = readBasesConsumed + ce.getLength() - clipLengthOnRead;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a more descriptive name for a be basesRemainingOnCigarElementAfterClipping? Just trying to make sure I understand the algorithm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's right and updated with documentation

@@ -90,7 +90,7 @@ static VariantContext produceAnnotatedVcFromInferredTypeAndRefLocations(final Si
.id(inferredType.getInternalVariantId())
.attribute(GATKSVVCFConstants.SVTYPE, inferredType.toString());

if (inferredType instanceof SimpleSVType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's anything wrong with using instanceof.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strangely it failed... Plus it might be better if an alt route exists that doesn't use reflection and has equal line of code works

* added BreakEndVariantType for outputing BND formatted VCF records, and a corresponding method in AnnotatedVariantProducer (no SVLEN for such records);
* added code path SimpleStrandSwitchVariantDetector for dealing with simple strand switch BND variants which we now emit; also added utility classes utility classes RDDUtils
@SHuang-Broad SHuang-Broad merged commit a8d9fd4 into master Sep 6, 2017
@SHuang-Broad SHuang-Broad deleted the sh_cpx_sv_pr_3 branch September 6, 2017 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants