[ADAM-1164] Add parallel file merger. #1441

Merged
merged 1 commit into from Mar 20, 2017

Member

fnothaft commented Mar 19, 2017

Resolves #1164. I'd actually love to get this into 0.22.0 as well. Thoughts?

@fnothaft fnothaft added this to the 0.22.0 milestone Mar 19, 2017

coveralls commented Mar 19, 2017

Coverage Status

Coverage increased (+0.03%) to 76.509% when pulling 1f5e03b on fnothaft:issues/1164-parallel-merge into cf39e6c on bigdatagenomics:master.

AmplabJenkins Mar 19, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1884/
Test PASSed.

Member

fnothaft commented Mar 20, 2017

Just following up with runtime numbers. With this, saving the NA12878 234GB BAM back to a single BAM from Parquet runs in 4.4 minutes on 833 cores (2.4 minutes to go ADAM->BAM, 2.0 minutes to do the merge). Without this, it takes 44 minutes (2.3 minutes to go ADAM->BAM, the remainder to merge).
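
For reference, the breakdown in the numbers above can be worked through in a few lines. This is only arithmetic on the figures quoted in the comment; the object and value names are made up for illustration, and the serial merge time is inferred as the 44 minute total minus the 2.3 minute conversion.

```scala
object MergeSpeedup {
  // Figures quoted in the comment above, in minutes.
  val totalWith = 4.4      // end to end, with the parallel merger
  val mergeWith = 2.0      // merge step, with the parallel merger
  val totalWithout = 44.0  // end to end, without the parallel merger
  val convertWithout = 2.3 // ADAM->BAM step, without the parallel merger

  // "The remainder" of the run without the parallel merger.
  val mergeWithout: Double = totalWithout - convertWithout

  // Ratio of the old runtime to the new runtime.
  def speedup(before: Double, after: Double): Double = before / after

  def main(args: Array[String]): Unit = {
    println(f"serial merge time: $mergeWithout%.1f min")
    println(f"merge speedup: ${speedup(mergeWithout, mergeWith)}%.1fx")
    println(f"end-to-end speedup: ${speedup(totalWithout, totalWith)}%.1fx")
  }
}
```

That works out to about 41.7 minutes of serial merging replaced by 2.0 minutes, so roughly a 21x faster merge and a 10x faster save end to end.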

//
// ideally, this would be a directory, however, fs.concat has the
// undocumented contract that the paths being merged must live in
// the same directory as the path they are being merged to

@heuermh

heuermh Mar 20, 2017

Member

what? hope you didn't have to find that out the hard way
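
A minimal sketch of how the same-directory constraint described in the code comment could be checked before calling concat. It is not taken from this PR: `ConcatPreflight`, `parentOf`, and `canConcat` are hypothetical names, the `hdfs://nn:8020/...` paths are invented, and it uses only `java.net.URI` so it runs without Hadoop on the classpath.

```scala
import java.net.URI

object ConcatPreflight {
  // Hypothetical helper: (scheme, authority, parent directory) of a path,
  // where the parent is everything before the last '/'.
  def parentOf(uri: URI): (String, String, String) = {
    val path = Option(uri.getPath).getOrElse("")
    val parent = path.substring(0, math.max(path.lastIndexOf('/'), 0))
    (uri.getScheme, uri.getAuthority, parent)
  }

  // True when every source lives in the same directory as the target,
  // which is the contract the code comment says fs.concat expects.
  def canConcat(target: String, sources: Seq[String]): Boolean = {
    val targetParent = parentOf(new URI(target))
    sources.forall(s => parentOf(new URI(s)) == targetParent)
  }

  def main(args: Array[String]): Unit = {
    val target = "hdfs://nn:8020/out/sample.bam" // made-up example path
    val siblings = Seq(
      "hdfs://nn:8020/out/sample.bam_part0",
      "hdfs://nn:8020/out/sample.bam_part1")
    val nested = Seq("hdfs://nn:8020/out/sample.bam_parts/part0")

    println(canConcat(target, siblings)) // parts sit next to the target
    println(canConcat(target, nested))   // parts sit in a subdirectory
  }
}
```

Under this reading of the constraint, part files written next to the target pass the check, while part files collected in their own subdirectory do not, which is why a directory of parts is ruled out.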

// UNDOCUMENTED in hadoop fs API:
// all paths passed to the concat method must be qualified with
// full scheme and name node URI
val outputPaths = (0 until numBlocksToWrite).map(idx => {

@heuermh

heuermh Mar 20, 2017

Member

...and sigh
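
To illustrate what "qualified with full scheme and name node URI" means, here is a small sketch that is not the code from this PR. In Hadoop itself, `FileSystem.makeQualified` does this job; the `qualify` helper below is a hypothetical stand-in built on `java.net.URI`, it assumes absolute paths, and the name node address and part-file names are invented.

```scala
import java.net.URI

object QualifyPaths {
  // A path counts as qualified here when it carries both a scheme
  // and an authority (for HDFS, the name node host and port).
  def isQualified(uri: URI): Boolean =
    uri.getScheme != null && uri.getAuthority != null

  // Hypothetical stand-in for qualification: attach the default file
  // system's scheme and authority to an absolute path that lacks them.
  def qualify(defaultFs: URI, path: String): URI = {
    val uri = new URI(path)
    if (isQualified(uri)) uri
    else new URI(defaultFs.getScheme, defaultFs.getAuthority, uri.getPath, null, null)
  }

  def main(args: Array[String]): Unit = {
    val defaultFs = new URI("hdfs://nn:8020") // made-up name node
    val numBlocks = 3                         // made-up block count

    // Build one qualified path per block; the naming scheme is invented.
    val outputPaths = (0 until numBlocks).map { idx =>
      qualify(defaultFs, s"/out/sample.bam_part$idx")
    }
    outputPaths.foreach(println)
  }
}
```

A bare path such as `/out/sample.bam_part0` is what would trip the undocumented requirement; after qualification it reads `hdfs://nn:8020/out/sample.bam_part0`.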

@heuermh heuermh merged commit 98b263f into bigdatagenomics:master Mar 20, 2017

2 of 3 checks passed

codacy/pr: Not so good... This pull request quality could be better.
coverage/coveralls: Coverage increased (+0.03%) to 76.509%
default: Merged build finished.
Member

heuermh commented Mar 20, 2017

Thank you, @fnothaft!
