New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-916] New strategy for writing header. #917

Merged
merged 1 commit into from Jan 14, 2016

Conversation

Projects
None yet
4 participants
@fnothaft
Copy link
Member

fnothaft commented Jan 12, 2016

Resolves #916. Not fully done yet; seems to work for BAM but not SAM. The SAM file header doesn't write correctly... Still debugging that. Needs more testing.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1047/

Build result: FAILURE

GitHub pull request #917 of commit 2c38bbd automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 8ff91b477e29e28d277a812a87673b43fbbff9a1 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 8ff91b477e29e28d277a812a87673b43fbbff9a1 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8ff91b477e29e28d277a812a87673b43fbbff9a1First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch from 2c38bbd to 4d1c406 Jan 12, 2016

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 12, 2016

This is good to go now. Ping for review and merge.

@fnothaft fnothaft added this to the 0.19.0 milestone Jan 12, 2016

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1048/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.11,1.2.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.2.1,centosTriggering ADAM-prb ? 1.0.4,2.11,1.2.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.10,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.2.1,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 12, 2016

OK, it seems like on Spark 1.2 we got a merging order that is inconsistent with the ordering we're expecting. I was worried about that... Let me look into this more and see if there is a way we can ensure a consistent merging order.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch from 4d1c406 to b8ec48e Jan 12, 2016

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 12, 2016

Rebased and somewhat fixed.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1050/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jan 13, 2016

Single unit test failure

Couldn't find any files matching /tmp/bqsr1442238865989322732/bqsr1.bam
@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 13, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1051/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 13, 2016

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1052/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 13, 2016

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1053/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch from b8ec48e to 1c9927f Jan 13, 2016

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 13, 2016

Pushed a fix; tests should pass on this run. From my side, this PR is good to go (assuming tests pass on @AmplabJenkins).

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1054/

Build result: FAILURE

[...truncated 32 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.0,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.10,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.2.1,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.11,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch from 1c9927f to 47255c4 Jan 13, 2016

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 13, 2016

Good news: the concat method doesn't exist in org.apache.hadoop.fs.FileSystem in the Hadoop 1.x stream. I've added code that calls concat via reflection. I'm testing this on the cluster now.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1055/

Build result: FAILURE

[...truncated 32 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.0,centos completed with result FAILUREADAM-prb ? 1.0.4,2.10,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.2.1,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.11,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch 3 times, most recently from eb165a7 to e9cedc1 Jan 14, 2016

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1057/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1058/

Build result: FAILURE

GitHub pull request #917 of commit e9cedc1 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 5dade159e8469006b153c4b151caacb23d6f6952 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 5dade159e8469006b153c4b151caacb23d6f6952 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 5dade159e8469006b153c4b151caacb23d6f6952First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

This all works, except for the concat operation, which fails on the cluster because the header doesn't take up an exact, full HDFS block. Anyways, we fall back on the single threaded writing code, which is not so hot from a performance perspective, but works correctly.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch from e9cedc1 to 386f759 Jan 14, 2016

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1059/

Build result: FAILURE

GitHub pull request #917 of commit 386f759 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0c8180369152d9447e08c7c89edf1068950c3a00 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0c8180369152d9447e08c7c89edf1068950c3a00 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0c8180369152d9447e08c7c89edf1068950c3a00 > git rev-list 5dade159e8469006b153c4b151caacb23d6f6952 # timeout=10Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

Jenkins, retest this please.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

Also, even when we fall back on the single threaded copy code, it takes ~1sec per file to merge the files. E.g., if we have ~2000 partitions, then it takes about 2000 seconds to merge the whole file together. This is unfortunate, but is much better than the ~30hrs latency @beaunorgeot had been seeing when doing something like samtools cat on a single machine.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1060/
Test PASSed.

@ryan-williams

This comment has been minimized.

Copy link
Member

ryan-williams commented Jan 14, 2016

Thanks for tackling this @fnothaft; I'm missing state on this, can you answer a few high-level questions:

  • do you know how the missing header issue is currently happening? Last I looked at this (many months ago), we had what we thought was a high-confidence fix, and yet I still saw it happen once (but not reproducibly).
  • in this fix, you're writing out the header to a path that is specified in the hadoop job config, and having each partition read the header in from that path?
  • it seems like all the existing {add,clear}Header code is still present, but has been moved around; how does that interact with the new writing/reading-header-file stuff?
@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

OK, this succeeded on the cluster. Save time is slower than I'd like (~40 min to cat NA12878 together) but it worked. I'm running another test now that sorts then saves, then I'll pull the file out of HDFS and run samtools index locally. If that succeeds, then we are good to go. I'll update when that's done; I'm guessing it'll probably finish sometime late tonight.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

samtools index worked on the cluster, so we are both writing a concatenated file and keeping the sort order. Yippee!

@ryan-williams

do you know how the missing header issue is currently happening? Last I looked at this (many months ago), we had what we thought was a high-confidence fix, and yet I still saw it happen once (but not reproducibly).

God, I still have no idea what's causing it. I agree—I thought our fix was high confidence, but the log files are pretty unambiguous that we're getting a partition that tries to write to disk but the header is not set.

in this fix, you're writing out the header to a path that is specified in the hadoop job config, and having each partition read the header in from that path?

Exactly. It turns out that this is how Hadoop-BAM's CLI uses the Hadoop-BAM output formats. This isn't obvious though... I got the idea for this just by reading the Hadoop-BAM CLI source code.

it seems like all the existing {add,clear}Header code is still present, but has been moved around; how does that interact with the new writing/reading-header-file stuff?

I only touched the headerless output formats, so they're still vulnerable to the transient header disappearance issue. I'll need to propagate the fix to all output formats later. We'll probably have to do the same thing for VCF as well.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

Anyways, this is good to merge now.

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jan 14, 2016

@ryan-williams can you review?

I'll take a look and run a few extra tests this morning. We would like to get this merged as soon as possible.


// concat files together
// we need to do this through reflection because the concat method is
// NOT in hadoop 1.x

This comment has been minimized.

@heuermh

heuermh Jan 14, 2016

Member

after this is merged, we should create an issue to revisit it once we drop Hadoop 1.x support

os.close()

// delete temp files
//fs.delete(headPath, true)

This comment has been minimized.

@heuermh

heuermh Jan 14, 2016

Member

I see temp files in tailPath are deleted above but don't see any call to delete headPath

This comment has been minimized.

@fnothaft

fnothaft Jan 14, 2016

Member

Both are removed here... Although, I commented those lines out when debugging something; let me fix that real quick.

Are you referring to the code several lines above at https://github.com/bigdatagenomics/adam/pull/917/files#diff-a0df6432a6d340425e3510af17cb8bd2R516 ? The headPath file there doesn't need to be removed because the concat merges all of the tail files into the head file. We then move the head file to the desired output path.

@ryan-williams

This comment has been minimized.

Copy link
Member

ryan-williams commented Jan 14, 2016

Unfortunately I don't have much time to get in to the weeds on this one, but my main thought is:

if we don't know why the previous fix didn't fix the issue, even though we all still believe it should have from everything we know about the code, then I don't like complicating matters in this way.

Ultimately, I'll defer to y'all since it sounds like the goal here is more to unblock specific user(s) than to have a fix that we understand, but obviously that doesn't seem like a great state of affairs longer-term, especially if this introduces a slow-down to boot.

Can we deterministically repro the crash? I agree that something is still wrong, @danvk (on #721) and I (undocumented?) have both seen the bug since the last "fix", IIRC.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

The perf is only bad when saving BAM as a single file, and "bad" is relative to the FileSystem 'concat' method. This is a correctness fix (block ordering + header attach) relative to the old 'FileUtil.copyMerge' approach. I agree that we need to harmonize the single file save/many file save approaches, but that's for later.

[ADAM-916] New strategy for writing header.
Resolves #916. Makes several modifications that should eliminate the header
attach issue when writing back to SAM/BAM:

* Writes the SAM/BAM header as a single file.
* Instead of trying to attach the SAM/BAM header to the output format via a
  singleton object, we pass the path to the SAM/BAM header file via the Hadoop
  configuration.
* The output format reads the header from HDFS when creating the record writer.
* At the end, once we've written the full RDD and the header file, we merge all
  via Hadoop's FsUtil.

@fnothaft fnothaft force-pushed the fnothaft:write-separate-header branch from 386f759 to 7d4b409 Jan 14, 2016

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1061/

Build result: FAILURE

GitHub pull request #917 of commit 7d4b409 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains b12af513d5cb95c54fe156912eef38bc2a37e23c # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision b12af513d5cb95c54fe156912eef38bc2a37e23c (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f b12af513d5cb95c54fe156912eef38bc2a37e23c > git rev-list 0c8180369152d9447e08c7c89edf1068950c3a00 # timeout=10Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

This is good to go from my side. @heuermh were you able to complete your testing?

Just to follow up more on @ryan-williams' comments from earlier (sorry, I was on the run so I kept my comments brief), this fix only impacts the single file save code, and does not lead to a performance decrease over the current single file save code. This fix does three things:

  1. Writes the file header to a file in HDFS (or whichever underlying file system you are using; I'll use HDFS as a placeholder here for convenience).
  2. Adds code to the headerless output formats that ensures that they read the header from HDFS. This code will never fail unless writing the header to HDFS failed in the first place, in which case you've got bigger problems and I can't help you ;)
  3. Modifies the file merging code. This introduces a fast path to merging using the concat method, and a second path to merging that manually merges the files together. The second path (manual merge) is necessary anyways, because the prior method using FileUtil.copyMerge did not guarantee the ordering of files that were being merged, which meant that we'd occasionally get files where the header was at the end, etc.

The manual merge code path is much slower than the concat method, but the concat method has harsh, undocumented preconditions (specifically, all files being merged—except for the last one—must take up a full block in HDFS) and seems to throw a NotImplementedError on file systems that aren't HDFS, which is dandy. The manual merge code path is not slower than the FileUtil.copyMerge approach we were using previously.

As for the million dollar question: why does the header detach issue keep on happening, despite our best efforts to fix it? Frankly, I'm not entirely sure why it's failing, but the root cause is something along the lines of "when we set the header in the singleton object that accompanies the output format, it doesn't get consistently set across all of the JVMs where an instance of the actual output format will be created and executed." The new approach uses the Hadoop job configuration to pass the path to the header file. This conf is passed to the OutputFormat by Spark itself, so we can guarantee that the configuration seen by the OutputFormat will be consistent across the cluster. Since the header is written to the underlying file system (which again, if your underlying file system is not accessible across the whole cluster, then you're running some horrible frankencluster and I can't help you), we can guarantee that reading the header file from the path passed in the config will succeed.

I think a TL;DR for why this failed is that the obvious way of attaching the header (I call it obvious because the method is named setHeader) to Hadoop-BAM's OutputFormats is not guaranteed to work consistently in a distributed setting. I think that the Hadoop-BAM team knows this, because the Hadoop-BAM CLI wraps their own output formats to do this exact "write header to HDFS, set path in config, read header from config path when creating record reader" pattern.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1062/
Test PASSed.

heuermh added a commit that referenced this pull request Jan 14, 2016

Merge pull request #917 from fnothaft/write-separate-header
[ADAM-916] New strategy for writing header.

@heuermh heuermh merged commit 49b614d into bigdatagenomics:master Jan 14, 2016

1 check passed

default Merged build finished.
Details
@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jan 14, 2016

Tested ok for me, merged from the ski lift.

Thank you, @fnothaft!

@fnothaft fnothaft deleted the fnothaft:write-separate-header branch Jan 15, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment