Clean up ReferenceRegion.scala and add thresholded overlap and covers #1484

devin-petersohn · 2017-04-10T17:03:40Z

Partially resolves #1473 and fully resolves #1474.

coveralls · 2017-04-10T17:16:05Z

Coverage increased (+0.1%) to 81.779% when pulling 6855c05 on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

AmplabJenkins · 2017-04-10T17:20:55Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1939/
Test PASSed.

devin-petersohn · 2017-04-10T18:13:42Z

Once this is merged in, I will update #1324 to reflect the ability to use thresholds to join.

coveralls · 2017-04-10T18:20:48Z

Coverage increased (+0.2%) to 81.8% when pulling af70a8d on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

AmplabJenkins · 2017-04-10T18:23:52Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1940/
Test PASSed.

heuermh · 2017-04-10T20:20:05Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

-    result = 41 * result + (if (strand != null) strand.ordinal() else 0)
-    result
+    // add 37 here because this is the start of the hashCode
+    val nameHashCode = 37 + {


http://docs.oracle.com/javase/8/docs/api/java/util/Objects.html#hash-java.lang.Object...- handles all of this reasonably in a one-liner.

If performance is a concern, and if ReferenceRegion is immutable, the hash code could be computed in the ctr and assigned to a final Int field.

I don't know why we are implementing our own hashCode here, I'm just getting rid of the var. @fnothaft any ideas why we have a custom hashCode here?

Case classes have an auto-genned hashCode. I don't think we should be implementing one here. That said, what's the git blame for the hashCode function say? This may be perf sensitive.

c4b23e0

It might have been a performance issue. @heuermh asked the same question in 2015. Interesting.

The issue was #817

Reading the commit message, that method was necessary to fix issue #817, which was a regression caused by 9a318ba from PR #624. I would say it definitely was not a performance issue... If we do delete the hash code method, we need to validate that it doesn't cause #817 to reappear.

@fnothaft and I discussed today that a change to this should probably be in a different commit so it is easier to roll back in the case that things are broken.

fnothaft · 2017-04-15T21:26:05Z

Hi @devin-petersohn! Is this ready for another round of review?

devin-petersohn · 2017-04-16T05:02:35Z

I have actually encountered another couple of distance based operations that belong here. Everything is ready, but I want to make sure there aren't any more that I need before I ask for another review round. I will let you know as soon as this is done.

coveralls · 2017-04-17T22:40:38Z

Coverage decreased (-0.03%) to 81.603% when pulling cdcb926 on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

AmplabJenkins · 2017-04-17T22:44:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1953/
Test PASSed.

fnothaft · 2017-04-19T03:29:06Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

-    ReferenceRegion(referenceName, max(start, region.start), min(end, region.end))
+  def intersection(other: ReferenceRegion): ReferenceRegion = {
+    require(overlaps(other), "Cannot calculate the intersection of non-overlapping regions.")
+    ReferenceRegion(referenceName, max(start, other.start), min(end, other.end))


We should set the strand here, no?

Also, thoughts on this method just calling def intersection(other: ReferenceRegion, minOverlap: Long) with minOverlap = 0?

+1 on the default, and yes good catch. We weren't setting the strand before, but we definitely should be.

fnothaft · 2017-04-19T03:30:20Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   */
+  def intersection(other: ReferenceRegion, minOverlap: Long): ReferenceRegion = {
+    require(overlapsBy(other).exists(_ >= minOverlap), "Other region does not meet minimum overlap provided")
+    ReferenceRegion(referenceName, max(start, other.start), min(end, other.end))


Ditto here WRT setting strand.

fnothaft · 2017-04-19T03:31:03Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+  def hull(other: ReferenceRegion): ReferenceRegion = {
+    require(sameStrand(other), "Cannot compute convex hull of differently oriented regions.")
+    require(sameReferenceName(other), "Cannot compute convex hull of regions on different references.")
+    ReferenceRegion(referenceName, min(start, other.start), max(end, other.end))


Ditto here WRT setting strand.

fnothaft · 2017-04-19T03:32:18Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+  /**
+   * Returns whether two regions are nearby.
+   *
+   * Nearby regions may overlap, and have a thresholded distance between the


Wording is a bit funny, suggest:

Two regions are near each other if the distance between the two is less than the user provided distanceThreshold.

fnothaft · 2017-04-19T03:34:00Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * region in the reference space.
+   *
+   * @note Distance here is defined as the minimum distance between any point
+   * within this region, and any point within the other region we are measuring


fnothaft · 2017-04-19T04:49:02Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * @return True if any section of the two regions overlap.
+   */
+  def covers(other: ReferenceRegion, threshold: Long): Boolean = {
+    covers(other) || disorient.isNearby(other.disorient, threshold)


If you're going to disorient the regions, then reuse those and replace the covers(other) call, e.g.:

val thisDis = disorient val otherDis = other.disorient thisDis.overlaps(otherDis) || thisDis.isNearby(otherDis, threshold)

Also, doesn't isNearby return true if overlaps would return true?

fnothaft · 2017-04-19T04:54:56Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * @param other The other region.
+   * @return True if the two are on the same strand, false otherwise
+   */
+  def sameStrand(other: ReferenceRegion): Boolean = {


I don't think this really warrants being pulled out as a method. First, we're just replacing a field equality comparison. Second, we call this method many times when invariant checking. It seems silly to care about, but function dispatch overhead is real.

It was changed to this for readability. Won't the compiler just replace the method call with the equality statement inline anyway?

JIT could, but you would have to make the method final (or private). Even then, I am not sure that JIT could inline the function since other is a ReferenceRegion which is not a final class. I'm not sure about that one though; the JVM doesn't have a clear contract around inlining.

fnothaft · 2017-04-19T04:55:02Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * @param other The other region.
+   * @return True if the two are on the same reference name, false otherwise.
+   */
+  def sameReferenceName(other: ReferenceRegion): Boolean = {


Ditto here.

fnothaft · 2017-04-19T05:03:07Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

-    result
+    // add 37 here because this is the start of the hashCode
+    val nameHashCode = 37 + {
+      if (referenceName != null) {


For both referenceName and strand, we should require them to be non-null on initialization and omit the checks here.

What about unmapped reads? Should we default the referenceName?

We throw an IllegalArgumentException on unmapped reads.

Should we change this test code then? I am getting failures associated with creation where referenceName = null, but the test is not expecting failure.

Can you paste the stack trace you're getting? The duplicate marker shouldn't be setting referenceName = null; we do something else to deskew unmapped reads.

- unmapped reads *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 5, localhost): java.lang.IllegalArgumentException: requirement failed: Failed when trying to create region null 0 1 on INDEPENDENT strand.
at scala.Predef$.require(Predef.scala:233)
at org.bdgenomics.adam.models.ReferenceRegion.(ReferenceRegion.scala:324)
at org.bdgenomics.adam.models.ReferencePosition.(ReferencePosition.scala:133)
at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:111)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$rdd$read$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:53)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:43)
at scala.Option.fold(Option.scala:157)
at org.apache.spark.rdd.Timer.time(Timer.scala:48)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$.apply(ReferencePositionPair.scala:43)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.bdgenomics.adam.rdd.read.MarkDuplicatesSuite.org$bdgenomics$adam$rdd$read$MarkDuplicatesSuite$$markDuplicates(MarkDuplicatesSuite.scala:102)
at org.bdgenomics.adam.rdd.read.MarkDuplicatesSuite$$anonfun$6.apply$mcV$sp(MarkDuplicatesSuite.scala:158)
at org.bdgenomics.utils.misc.SparkFunSuite$$anonfun$sparkTest$1.apply$mcV$sp(SparkFunSuite.scala:102)
at org.bdgenomics.utils.misc.SparkFunSuite$$anonfun$sparkTest$1.apply(SparkFunSuite.scala:98)
at org.bdgenomics.utils.misc.SparkFunSuite$$anonfun$sparkTest$1.apply(SparkFunSuite.scala:98)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.bdgenomics.adam.util.ADAMFunSuite.org$scalatest$BeforeAndAfter$$super$runTest(ADAMFunSuite.scala:24)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at org.bdgenomics.adam.util.ADAMFunSuite.runTest(ADAMFunSuite.scala:24)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at org.bdgenomics.adam.util.ADAMFunSuite.org$scalatest$BeforeAndAfter$$super$run(ADAMFunSuite.scala:24)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at org.bdgenomics.adam.util.ADAMFunSuite.run(ADAMFunSuite.scala:24)
at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1526)
at org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:29)
at org.scalatest.Suite$class.run(Suite.scala:1421)
at org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:29)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
at org.scalatest.tools.Runner$.main(Runner.scala:860)
at org.scalatest.tools.Runner.main(Runner.scala)
Cause: java.lang.IllegalArgumentException: requirement failed: Failed when trying to create region null 0 1 on INDEPENDENT strand.
at scala.Predef$.require(Predef.scala:233)
at org.bdgenomics.adam.models.ReferenceRegion.(ReferenceRegion.scala:324)
at org.bdgenomics.adam.models.ReferencePosition.(ReferencePosition.scala:133)
at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:111)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$rdd$read$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:53)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:59)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:59)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:43)
at scala.Option.fold(Option.scala:157)
at org.apache.spark.rdd.Timer.time(Timer.scala:48)
at org.bdgenomics.adam.rdd.read.ReferencePositionPair$.apply(ReferencePositionPair.scala:43)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$markBuckets$1.apply(MarkDuplicates.scala:114)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at org.apache.spark.rdd.RDD$$anonfun$keyBy$1$$anonfun$apply$56.apply(RDD.scala:1492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Ah, I see what's going on. The createUnmappedRead function in the test code is creating reads without sequences. Can you update that function to set the sequence on the read (doesn't matter what the sequence is, just has to be present)? We should probably add a require(read.getSequence != null right before here.

fnothaft · 2017-04-19T05:12:27Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+        0
+      }
+
+    val listOfHashCodes = List(nameHashCode, start.hashCode, end.hashCode, strandHashCode)


hashCode is in general a performance critical operation. As such, allocating and then folding over a list is to be avoided. Why not just write out the multiply/sum chain?

I guess my functional programming side was working on this.

coveralls · 2017-04-24T18:36:00Z

Coverage increased (+0.3%) to 81.928% when pulling 5a0e19e on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

AmplabJenkins · 2017-04-24T18:39:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1958/
Test PASSed.

coveralls · 2017-04-24T19:15:59Z

Coverage increased (+0.3%) to 81.959% when pulling 8a2e011 on devin-petersohn:issue#1473ThresholdOverlaps into 93b32c6 on bigdatagenomics:master.

AmplabJenkins · 2017-04-24T19:20:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1959/
Test PASSed.

heuermh

LGTM, suggested a few minor doc fixes

heuermh · 2017-04-24T20:38:48Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

   *
-   * @param region Region to intersect with.
-   * @return A smaller reference region.
+   * @throws IllegalArgumentException Thrown if regions are no within the distance threshold.


heuermh · 2017-04-24T20:50:58Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * overlap.
+   *
+   * @param other Region to intersect with.
+   * @param minOverlap The mimimum amount of positions that the two regions


The minimum amount of positions... → Minimum overlap between the two reference regions.

mimimum -> minimum

heuermh · 2017-04-24T20:54:38Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

-    assert(referenceName == region.referenceName, "Cannot compute convex hull of regions on different references.")
-    ReferenceRegion(referenceName, min(start, region.start), max(end, region.end))
+  def hull(other: ReferenceRegion): ReferenceRegion = {
+    require(sameStrand(other), "Cannot compute convex hull of differently oriented regions.")


For require messages, use phrases or sentences or exclamations, but not all three :)

heuermh · 2017-04-24T20:56:07Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+  def subtract(other: ReferenceRegion, requireStranded: Boolean = false): List[ReferenceRegion] = {
+    val newRegionStrand =
+      if (requireStranded) {
+        require(overlaps(other), "Region $other is not overlapped by $this!")


... I only mention this because of my six yr old's homework last night: Choose [., ?, !] for each of these sentences.

I would generally go with .

fnothaft

Few small docs suggestions! Otherwise LGTM!

fnothaft · 2017-04-24T21:28:30Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * overlap.
+   *
+   * @param other Region to intersect with.
+   * @param minOverlap The mimimum amount of positions that the two regions


mimimum -> minimum

fnothaft · 2017-04-24T21:30:41Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   *
+   * @param other Region to compare against.
+   * @return Returns an option containing the number of positions of coverage
+   *         between two points. If the two regions do not cover, we return


cover, -> cover each other,

fnothaft · 2017-04-24T21:31:29Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

@@ -503,19 +600,80 @@ case class ReferenceRegion(
  }

  /**
+   * Determines if two regions are colocated on the same strand.


I would drop the colocated -> Determines if two regions are on the same strand.

fnothaft · 2017-04-24T21:32:26Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+  }
+
+  /**
+   * Determines if two regions are colocated on the same reference name.


colocated on the same reference name. -> on the same contig.

fnothaft · 2017-04-24T21:33:09Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+  def subtract(other: ReferenceRegion, requireStranded: Boolean = false): List[ReferenceRegion] = {
+    val newRegionStrand =
+      if (requireStranded) {
+        require(overlaps(other), "Region $other is not overlapped by $this!")


I would generally go with .

fnothaft · 2017-04-24T21:33:44Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+   * @param requireStranded Whether or not to require other be on same strand.
+   * @return A list containing the regions resulting from the subtraction.
+   */
+  def subtract(other: ReferenceRegion, requireStranded: Boolean = false): List[ReferenceRegion] = {


Prefer Seq (or, even better, Iterable) to List in public APIs.

fnothaft · 2017-04-24T21:34:51Z

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala

+        Strand.INDEPENDENT
+      }
+    val first = if (other.start > start) {
+      List(ReferenceRegion(referenceName,


Yeah, actually, just List -> Iterable these few ones. And probably List() -> Iterable.empty.

More cleanup and adding isNearby Adding threshold logic to overlaps and covers Replacing var with fold call Adding overlapsBy and coversBy and related tests Adding subtract, addressing reviewer comments Addressing reviewer comments Addressing reviewer comments

devin-petersohn · 2017-04-24T22:16:22Z

Squashed and rebased. Thanks for the feedback @fnothaft and @heuermh!

fnothaft · 2017-04-24T22:17:11Z

Thanks @devin-petersohn! I will merge this once tests pass.

coveralls · 2017-04-24T22:27:03Z

Coverage decreased (-0.06%) to 81.726% when pulling ebfc510 on devin-petersohn:issue#1473ThresholdOverlaps into 1fb57a3 on bigdatagenomics:master.

AmplabJenkins · 2017-04-24T22:29:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1960/
Test PASSed.

fnothaft · 2017-04-24T22:30:14Z

Merged! Thanks @devin-petersohn!

devin-petersohn changed the title ~~Issue#1473 threshold overlaps~~ Clean up ReferenceRegion.scala and add thresholded overlap and covers Apr 10, 2017

heuermh requested changes Apr 10, 2017

View reviewed changes

devin-petersohn mentioned this pull request Apr 16, 2017

ReferenceRegion overlaps and covers returns false if overlap is 1 #1492

Closed

devin-petersohn requested review from heuermh and fnothaft April 18, 2017 18:40

fnothaft requested changes Apr 19, 2017

View reviewed changes

heuermh approved these changes Apr 24, 2017

View reviewed changes

fnothaft requested changes Apr 24, 2017

View reviewed changes

devin-petersohn force-pushed the issue#1473ThresholdOverlaps branch from 8a2e011 to ebfc510 Compare April 24, 2017 22:15

fnothaft approved these changes Apr 24, 2017

View reviewed changes

fnothaft merged commit 5b6a109 into bigdatagenomics:master Apr 24, 2017

heuermh added this to the 0.23.0 milestone Dec 7, 2017

Clean up ReferenceRegion.scala and add thresholded overlap and covers #1484

Clean up ReferenceRegion.scala and add thresholded overlap and covers #1484

Conversation

devin-petersohn commented Apr 10, 2017

coveralls commented Apr 10, 2017 • edited Loading

AmplabJenkins commented Apr 10, 2017

devin-petersohn commented Apr 10, 2017

coveralls commented Apr 10, 2017 • edited Loading

AmplabJenkins commented Apr 10, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fnothaft Apr 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fnothaft commented Apr 15, 2017

devin-petersohn commented Apr 16, 2017

coveralls commented Apr 17, 2017 • edited Loading

AmplabJenkins commented Apr 17, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fnothaft Apr 19, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fnothaft Apr 24, 2017 • edited Loading

Choose a reason for hiding this comment

devin-petersohn Apr 24, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Apr 24, 2017 • edited Loading

AmplabJenkins commented Apr 24, 2017

coveralls commented Apr 24, 2017 • edited Loading

AmplabJenkins commented Apr 24, 2017

heuermh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fnothaft left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

devin-petersohn commented Apr 24, 2017

fnothaft commented Apr 24, 2017

coveralls commented Apr 24, 2017 • edited Loading

AmplabJenkins commented Apr 24, 2017

fnothaft commented Apr 24, 2017

coveralls commented Apr 10, 2017 •

edited

Loading

coveralls commented Apr 10, 2017 •

edited

Loading

fnothaft Apr 11, 2017 •

edited

Loading

coveralls commented Apr 17, 2017 •

edited

Loading

fnothaft Apr 19, 2017 •

edited

Loading

fnothaft Apr 24, 2017 •

edited

Loading

devin-petersohn Apr 24, 2017 •

edited

Loading

coveralls commented Apr 24, 2017 •

edited

Loading

coveralls commented Apr 24, 2017 •

edited

Loading

coveralls commented Apr 24, 2017 •

edited

Loading