Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up ReferenceRegion.scala and add thresholded overlap and covers #1484

Merged

Conversation

devin-petersohn
Copy link
Member

Partially resolves #1473 and fully resolves #1474.

@coveralls
Copy link

coveralls commented Apr 10, 2017

Coverage Status

Coverage increased (+0.1%) to 81.779% when pulling 6855c05 on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1939/
Test PASSed.

@devin-petersohn
Copy link
Member Author

Once this is merged in, I will update #1324 to reflect the ability to use thresholds to join.

@coveralls
Copy link

coveralls commented Apr 10, 2017

Coverage Status

Coverage increased (+0.2%) to 81.8% when pulling af70a8d on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1940/
Test PASSed.

@devin-petersohn devin-petersohn changed the title Issue#1473 threshold overlaps Clean up ReferenceRegion.scala and add thresholded overlap and covers Apr 10, 2017
result = 41 * result + (if (strand != null) strand.ordinal() else 0)
result
// add 37 here because this is the start of the hashCode
val nameHashCode = 37 + {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://docs.oracle.com/javase/8/docs/api/java/util/Objects.html#hash-java.lang.Object...- handles all of this reasonably in a one-liner.

If performance is a concern, and if ReferenceRegion is immutable, the hash code could be computed in the ctr and assigned to a final Int field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we are implementing our own hashCode here, I'm just getting rid of the var. @fnothaft any ideas why we have a custom hashCode here?

Copy link
Member

@fnothaft fnothaft Apr 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case classes have an auto-genned hashCode. I don't think we should be implementing one here. That said, what's the git blame for the hashCode function say? This may be perf sensitive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c4b23e0

It might have been a performance issue. @heuermh asked the same question in 2015. Interesting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue was #817

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the commit message, that method was necessary to fix issue #817, which was a regression caused by 9a318ba from PR #624. I would say it definitely was not a performance issue... If we do delete the hash code method, we need to validate that it doesn't cause #817 to reappear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fnothaft and I discussed today that a change to this should probably be in a different commit so it is easier to roll back in the case that things are broken.

@fnothaft
Copy link
Member

Hi @devin-petersohn! Is this ready for another round of review?

@devin-petersohn
Copy link
Member Author

I have actually encountered another couple of distance based operations that belong here. Everything is ready, but I want to make sure there aren't any more that I need before I ask for another review round. I will let you know as soon as this is done.

@coveralls
Copy link

coveralls commented Apr 17, 2017

Coverage Status

Coverage decreased (-0.03%) to 81.603% when pulling cdcb926 on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1953/
Test PASSed.

ReferenceRegion(referenceName, max(start, region.start), min(end, region.end))
def intersection(other: ReferenceRegion): ReferenceRegion = {
require(overlaps(other), "Cannot calculate the intersection of non-overlapping regions.")
ReferenceRegion(referenceName, max(start, other.start), min(end, other.end))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set the strand here, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, thoughts on this method just calling def intersection(other: ReferenceRegion, minOverlap: Long) with minOverlap = 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the default, and yes good catch. We weren't setting the strand before, but we definitely should be.

*/
def intersection(other: ReferenceRegion, minOverlap: Long): ReferenceRegion = {
require(overlapsBy(other).exists(_ >= minOverlap), "Other region does not meet minimum overlap provided")
ReferenceRegion(referenceName, max(start, other.start), min(end, other.end))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here WRT setting strand.

def hull(other: ReferenceRegion): ReferenceRegion = {
require(sameStrand(other), "Cannot compute convex hull of differently oriented regions.")
require(sameReferenceName(other), "Cannot compute convex hull of regions on different references.")
ReferenceRegion(referenceName, min(start, other.start), max(end, other.end))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here WRT setting strand.

/**
* Returns whether two regions are nearby.
*
* Nearby regions may overlap, and have a thresholded distance between the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording is a bit funny, suggest:

Two regions are near each other if the distance between the two is less than the user provided distanceThreshold.

* region in the reference space.
*
* @note Distance here is defined as the minimum distance between any point
* within this region, and any point within the other region we are measuring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent.

* @return True if any section of the two regions overlap.
*/
def covers(other: ReferenceRegion, threshold: Long): Boolean = {
covers(other) || disorient.isNearby(other.disorient, threshold)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to disorient the regions, then reuse those and replace the covers(other) call, e.g.:

val thisDis = disorient
val otherDis = other.disorient

thisDis.overlaps(otherDis) || thisDis.isNearby(otherDis, threshold)

Also, doesn't isNearby return true if overlaps would return true?

* @param other The other region.
* @return True if the two are on the same strand, false otherwise
*/
def sameStrand(other: ReferenceRegion): Boolean = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this really warrants being pulled out as a method. First, we're just replacing a field equality comparison. Second, we call this method many times when invariant checking. It seems silly to care about, but function dispatch overhead is real.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was changed to this for readability. Won't the compiler just replace the method call with the equality statement inline anyway?

Copy link
Member

@fnothaft fnothaft Apr 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JIT could, but you would have to make the method final (or private). Even then, I am not sure that JIT could inline the function since other is a ReferenceRegion which is not a final class. I'm not sure about that one though; the JVM doesn't have a clear contract around inlining.

* @param other The other region.
* @return True if the two are on the same reference name, false otherwise.
*/
def sameReferenceName(other: ReferenceRegion): Boolean = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here.

result
// add 37 here because this is the start of the hashCode
val nameHashCode = 37 + {
if (referenceName != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both referenceName and strand, we should require them to be non-null on initialization and omit the checks here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about unmapped reads? Should we default the referenceName?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We throw an IllegalArgumentException on unmapped reads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this test code then? I am getting failures associated with creation where referenceName = null, but the test is not expecting failure.

Copy link
Member

@fnothaft fnothaft Apr 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you paste the stack trace you're getting? The duplicate marker shouldn't be setting referenceName = null; we do something else to deskew unmapped reads.

Copy link
Member Author

@devin-petersohn devin-petersohn Apr 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- unmapped reads *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 5, localhost): java.lang.IllegalArgumentException: requirement failed: Failed when trying to create region null 0 1 on INDEPENDENT strand.
at scala.Predef$.require(Predef.scala:233)
at org.bdgenomics.adam.models.ReferenceRegion.(ReferenceRegion.scala:324)
at org.bdgenomics.adam.models.ReferencePosition.(ReferencePosition.scala:133)
at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:111)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$rdd$read$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:53)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:43)
at scala.Option.fold(Option.scala:157)
at org.apache.spark.rdd.Timer.time(Timer.scala:48)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$.apply(ReferencePositionPair.scala:43)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.bdgenomics.adam.rdd.read.MarkDuplicatesSuite.org$bdgenomics$adam$rdd$read$MarkDuplicatesSuite$$markDuplicates(MarkDuplicatesSuite.scala:102)
at org.bdgenomics.adam.rdd.read.MarkDuplicatesSuite$$anonfun$6.apply$mcV$sp(MarkDuplicatesSuite.scala:158)
at org.bdgenomics.utils.misc.SparkFunSuite$$anonfun$sparkTest$1.apply$mcV$sp(SparkFunSuite.scala:102)
at org.bdgenomics.utils.misc.SparkFunSuite$$anonfun$sparkTest$1.apply(SparkFunSuite.scala:98)
at org.bdgenomics.utils.misc.SparkFunSuite$$anonfun$sparkTest$1.apply(SparkFunSuite.scala:98)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.bdgenomics.adam.util.ADAMFunSuite.org$scalatest$BeforeAndAfter$$super$runTest(ADAMFunSuite.scala:24)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at org.bdgenomics.adam.util.ADAMFunSuite.runTest(ADAMFunSuite.scala:24)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at org.bdgenomics.adam.util.ADAMFunSuite.org$scalatest$BeforeAndAfter$$super$run(ADAMFunSuite.scala:24)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at org.bdgenomics.adam.util.ADAMFunSuite.run(ADAMFunSuite.scala:24)
at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1526)
at org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:29)
at org.scalatest.Suite$class.run(Suite.scala:1421)
at org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:29)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
at org.scalatest.tools.Runner$.main(Runner.scala:860)
at org.scalatest.tools.Runner.main(Runner.scala)
Cause: java.lang.IllegalArgumentException: requirement failed: Failed when trying to create region null 0 1 on INDEPENDENT strand.
at scala.Predef$.require(Predef.scala:233)
at org.bdgenomics.adam.models.ReferenceRegion.(ReferenceRegion.scala:324)
at org.bdgenomics.adam.models.ReferencePosition.(ReferencePosition.scala:133)
at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:111)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$rdd$read$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:53)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:43)
at scala.Option.fold(Option.scala:157)
at org.apache.spark.rdd.Timer.time(Timer.scala:48)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$.apply(ReferencePositionPair.scala:43)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what's going on. The createUnmappedRead function in the test code is creating reads without sequences. Can you update that function to set the sequence on the read (doesn't matter what the sequence is, just has to be present)? We should probably add a require(read.getSequence != null right before here.

0
}

val listOfHashCodes = List(nameHashCode, start.hashCode, end.hashCode, strandHashCode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashCode is in general a performance critical operation. As such, allocating and then folding over a list is to be avoided. Why not just write out the multiply/sum chain?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my functional programming side was working on this.

@coveralls
Copy link

coveralls commented Apr 24, 2017

Coverage Status

Coverage increased (+0.3%) to 81.928% when pulling 5a0e19e on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1958/
Test PASSed.

@coveralls
Copy link

coveralls commented Apr 24, 2017

Coverage Status

Coverage increased (+0.3%) to 81.959% when pulling 8a2e011 on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1959/
Test PASSed.

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, suggested a few minor doc fixes

*
* @param region Region to intersect with.
* @return A smaller reference region.
* @throws IllegalArgumentException Thrown if regions are no within the distance threshold.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no → not

* overlap.
*
* @param other Region to intersect with.
* @param minOverlap The mimimum amount of positions that the two regions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum amount of positions... → Minimum overlap between the two reference regions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mimimum -> minimum

assert(referenceName == region.referenceName, "Cannot compute convex hull of regions on different references.")
ReferenceRegion(referenceName, min(start, region.start), max(end, region.end))
def hull(other: ReferenceRegion): ReferenceRegion = {
require(sameStrand(other), "Cannot compute convex hull of differently oriented regions.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For require messages, use phrases or sentences or exclamations, but not all three :)

def subtract(other: ReferenceRegion, requireStranded: Boolean = false): List[ReferenceRegion] = {
val newRegionStrand =
if (requireStranded) {
require(overlaps(other), "Region $other is not overlapped by $this!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I only mention this because of my six yr old's homework last night: Choose [., ?, !] for each of these sentences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would generally go with .

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small docs suggestions! Otherwise LGTM!

* overlap.
*
* @param other Region to intersect with.
* @param minOverlap The mimimum amount of positions that the two regions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mimimum -> minimum

*
* @param other Region to compare against.
* @return Returns an option containing the number of positions of coverage
* between two points. If the two regions do not cover, we return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cover, -> cover each other,

@@ -503,19 +600,80 @@ case class ReferenceRegion(
}

/**
* Determines if two regions are colocated on the same strand.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop the colocated -> Determines if two regions are on the same strand.

}

/**
* Determines if two regions are colocated on the same reference name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

colocated on the same reference name. -> on the same contig.

def subtract(other: ReferenceRegion, requireStranded: Boolean = false): List[ReferenceRegion] = {
val newRegionStrand =
if (requireStranded) {
require(overlaps(other), "Region $other is not overlapped by $this!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would generally go with .

* @param requireStranded Whether or not to require other be on same strand.
* @return A list containing the regions resulting from the subtraction.
*/
def subtract(other: ReferenceRegion, requireStranded: Boolean = false): List[ReferenceRegion] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer Seq (or, even better, Iterable) to List in public APIs.

Strand.INDEPENDENT
}
val first = if (other.start > start) {
List(ReferenceRegion(referenceName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, actually, just List -> Iterable these few ones. And probably List() -> Iterable.empty.

More cleanup and adding isNearby

Adding threshold logic to overlaps and covers

Replacing var with fold call

Adding overlapsBy and coversBy and related tests

Adding subtract, addressing reviewer comments

Addressing reviewer comments

Addressing reviewer comments
@devin-petersohn
Copy link
Member Author

Squashed and rebased. Thanks for the feedback @fnothaft and @heuermh!

@fnothaft
Copy link
Member

Thanks @devin-petersohn! I will merge this once tests pass.

@coveralls
Copy link

coveralls commented Apr 24, 2017

Coverage Status

Coverage decreased (-0.06%) to 81.726% when pulling ebfc510 on devin-petersohn:issue#1473ThresholdOverlaps into 1fb57a3 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1960/
Test PASSed.

@fnothaft fnothaft merged commit 5b6a109 into bigdatagenomics:master Apr 24, 2017
@fnothaft
Copy link
Member

Merged! Thanks @devin-petersohn!

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants