Merging files should be multithreaded #1164

Closed
fnothaft opened this Issue Sep 11, 2016 · 1 comment

Comments

@fnothaft
Member

fnothaft commented Sep 11, 2016

Related to #1161. Right now, we merge files as a big ol' single-threaded hunk of code. However, this can be parallelized.

@fnothaft fnothaft self-assigned this Sep 11, 2016

@fnothaft

Member

fnothaft commented Sep 12, 2016

I'm not going to tag this with a specific milestone; this will be a best-effort feature. Unfortunately, half of the org.apache.hadoop.fs.FileSystem interface is unofficially optional, which makes it a real PITA to write a general implementation. That said, I'm thinking the implementation would look something like the following (a rough sketch of the scheme dispatch is below the list):

  • A block/"balancing" approach: do a mapPartitions call with a given number of partitions that resizes everything into fixed-size chunks, then merge per scheme:
    • For scheme = HDFS, write the fixed-size chunks and then call concat, and life is good, all is well, etc.
    • For scheme = S3, do something like what conductor does and run a big ol' multipart upload.
  • For scheme = file, we should be able to do random writes at fixed offsets, and that should be OK.
  • For anything else, fall back to the current single-threaded functionality.
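
Since concat is only available on some schemes, a minimal sketch of that dispatch might look like the following. This assumes a hypothetical `mergeChunks` helper that runs on the driver after the mapPartitions pass has already written the fixed-size chunks; it is not the implementation that eventually landed, just an illustration of the HDFS concat path plus the single-threaded fallback.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object MergeSketch {

  /**
   * Hypothetical helper: stitches the per-partition chunk files under
   * `chunkDir` into a single file at `outputPath`, dispatching on the
   * filesystem scheme. HDFS gets the cheap concat path; everything else
   * falls back to a sequential copy (the current single-threaded behavior).
   */
  def mergeChunks(conf: Configuration, chunkDir: Path, outputPath: Path): Unit = {
    val fs = chunkDir.getFileSystem(conf)

    // Gather the chunks in lexicographic order (part-00000, part-00001, ...).
    val chunks = fs.listStatus(chunkDir)
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .sortBy(_.getName)

    fs.getUri.getScheme match {
      case "hdfs" =>
        // Promote the first chunk to the output path, then concat the
        // remaining chunks onto it. HDFS concat has block-size constraints
        // on the source files, which is why the mapPartitions step needs to
        // produce fixed-size chunks in the first place.
        fs.rename(chunks.head, outputPath)
        if (chunks.tail.nonEmpty) {
          fs.concat(outputPath, chunks.tail.toArray)
        }
      case _ =>
        // Fallback: stream every chunk into the output file one at a time,
        // i.e. the existing single-threaded merge.
        val out = fs.create(outputPath)
        try {
          chunks.foreach { chunk =>
            val in = fs.open(chunk)
            try {
              IOUtils.copyBytes(in, out, conf, false)
            } finally {
              in.close()
            }
          }
        } finally {
          out.close()
        }
    }
  }
}
```

The S3 path would instead use the multipart upload API, with each task uploading its chunk as one part, but that lives outside the FileSystem interface, which is the PITA alluded to above.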

@fnothaft fnothaft modified the milestones: 0.22.0, 0.23.0 Mar 3, 2017

@heuermh heuermh added this to Triage in Release 0.23.0 Mar 8, 2017

fnothaft added a commit to fnothaft/adam that referenced this issue Mar 18, 2017

fnothaft added a commit to fnothaft/adam that referenced this issue Mar 19, 2017

@fnothaft fnothaft modified the milestones: 0.22.0, 0.23.0 Mar 19, 2017

@heuermh heuermh closed this in #1441 Mar 20, 2017

heuermh added a commit that referenced this issue Mar 20, 2017

@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Mar 21, 2017
