Test demonstrating region join failure #1206

Closed
jpdna commented Oct 12, 2016

In trying to apply shuffleRegionJoin to a gVCF use case, I found that I was getting an OOM error when running an InnerShuffleRegionJoin on a tiny dataset of 3 variants and 2 variants.

In this PR I attempt to recreate the same set of ReferenceRegion intervals in two RDDs to join, in order to demonstrate the problem. I use AlignmentRecords here because I modified an existing test.

It's possible that something else is now wrong with this test, because I see an ArrayIndexOutOfBoundsException rather than an OOM, but in any case the test below is failing for a reason I don't understand, and not gracefully. If we can make this new test work with these intervals, it will at least be a step towards figuring out why my gVCF join with the same intervals is failing.

- Test join that was failing 10/11/2016 *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArrayIndexOutOfBoundsException: 5385867
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scal
AmplabJenkins Oct 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1532/

Build result: FAILURE

GitHub pull request #1206 of commit f3cdb50 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1206/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains c7b6acb # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1206/merge^{commit} # timeout=10
Checking out Revision c7b6acb (origin/pr/1206/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f c7b6acbc103a8ae8f7e8c28638cd312b6a22190a
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


fnothaft commented Oct 12, 2016

OK, cool! By any chance, is the data on the cluster? If so, perhaps let's start a thread and we can work offline to debug it there as well.

jpdna commented Oct 12, 2016

Not on the cluster, but I've provided a script so you can try to repro the gVCF join I was attempting, in this gist I just made:
https://gist.github.com/jpdna/a352ab9304a1885d01d3ac1c65dc77a8

which has links to the tiny input files here:
https://drive.google.com/drive/folders/0B6jh69UgixwpdDlGUkhRaW42QzA?usp=sharing

You want to start an email thread to discuss offline?

fnothaft commented Oct 12, 2016

> You want to start an email thread to discuss offline?

If the files are public then we don't need an offline thread; I'd assumed they weren't public. I'll take a look tomorrow.

jpdna commented Oct 12, 2016

I think I found the solution.
It seems that partitionSize needs to be more like 5000000, since it is in fact the bin size in nucleotides.

val result1 = InnerShuffleRegionJoin[VariantContext, VariantContext](x.sequences, 5000000, sc).partitionAndJoin(x_with_key, y_with_key)

now works for me and seems to give the join result I expected.

I started with a much lower number because the partition size used at:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/InnerShuffleRegionJoinSuite.scala#L26
is "3".

But for working with a whole chromosome or genome, a value of a million or more seems to make sense.

I'll close this PR shortly.
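For intuition about why the tiny value blows up (a sketch; the exact binning inside InnerShuffleRegionJoin is an assumption on my part, but the arithmetic shows the scale): if partitionSize is the bin width in nucleotides, a genomic coordinate maps to bin coordinate / partitionSize, so a width of 3 over chromosome-scale coordinates yields millions of bins. Notably, the failing array index above (5385867) times 3 is about 16.2 Mbp, which looks like a coordinate divided by the bin width.

```scala
// Hypothetical illustration: bin index = coordinate / binWidth.
// The coordinate below is assumed, chosen only to show the arithmetic.
val coordinate = 16157601L

def binIndex(binWidth: Long): Long = coordinate / binWidth

println(binIndex(3L))       // 5385867 -- matches the failing array index above
println(binIndex(5000000L)) // 3 -- a sane number of bins
```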

fnothaft commented Oct 12, 2016

Oh, nice! Good catch. Perhaps we can beef up the documentation?

jpdna commented Oct 12, 2016

> Perhaps we can beef up the documentation?

Yup, a tiny PR for that doc just went in, thanks!

@jpdna jpdna closed this Oct 12, 2016
