Provide low-impact alternative to `transform -repartition` for reducing partition size #594

Closed
hannes-ucsc opened this Issue Mar 1, 2015 · 4 comments

@hannes-ucsc
Contributor

hannes-ucsc commented Mar 1, 2015

Disclaimer: ADAM noob here, so this may be bogus.

Steps to reproduce:

  1. Run adam-submit transform -mark_duplicate_reads ...
  2. Open the driver webapp
  3. Note the number of partitions, e.g. 2306
  4. Hit Ctrl-C
  5. Run adam-submit transform -repartition 4612 -mark_duplicate_reads ... (i.e. twice the # of partitions)

Notice that the repartition incurs a shuffle that spills to disk. With a 288 GB BAM, we saw 1.5 TB of shuffle data spilled to disk before the second, productive stage began. The reason is that RDD.repartition() effectively scatters consecutive elements across the partitions of the result RDD: it creates a pair RDD keyed by a randomized target-partition index and applies a HashPartitioner. I think the intent is to produce partitions of approximately equal size even when the input RDD is unevenly partitioned. In the transform case, however, we can assume that the input RDD is already very evenly partitioned.
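
For illustration, here is a rough sketch of what repartition(n) amounts to, simplified from Spark's coalesce(n, shuffle = true) path; the function name and the per-partition seeding are illustrative, not the literal Spark source:

```scala
// Illustrative sketch (not the literal Spark source): repartition(n) keys every
// element with a rolling pseudo-random partition index and hash-partitions the
// resulting pair RDD, so the whole dataset moves through a shuffle.
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def repartitionSketch[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
  rdd
    .mapPartitionsWithIndex { (index, items) =>
      // start each input partition at a random target partition...
      var position = new Random(index).nextInt(n)
      items.map { item =>
        position += 1
        (position, item)  // ...and key consecutive elements with consecutive indices
      }
    }
    // the HashPartitioner then scatters those keys across all n output partitions
    .partitionBy(new HashPartitioner(n))
    .values
}
```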

If the purpose of -repartition is to allow users to work around resource limitations by reducing the partition size for their workload, then it shouldn't put additional strain on resources. I found that I can achieve the same effect of reducing the partition size by specifying the spark.hadoop.mapred.max.split.size property when running adam-submit. Should transform get another parameter, e.g. -max-partition-size N, that explicitly sets the aforementioned property to N?
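
For concreteness, a minimal sketch of setting that property programmatically, relying on Spark's documented behaviour of copying spark.hadoop.* properties into the Hadoop Configuration; the application name and the 64 MB value are only illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Any "spark.hadoop.*" property is copied into the Hadoop Configuration that
// FileInputFormat consults when computing input splits, and hence partitions.
val conf = new SparkConf()
  .setAppName("transform-with-smaller-splits")                              // illustrative name
  .set("spark.hadoop.mapred.max.split.size", (64L * 1024 * 1024).toString)  // cap splits at 64 MB

val sc = new SparkContext(conf)

// The same effect on the command line, using plain spark-submit syntax:
//   spark-submit --conf spark.hadoop.mapred.max.split.size=67108864 ...
```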

I also wonder why there isn't an RDD.split(N) that simply splits each parent partition into the given number of child partitions.
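
A hypothetical sketch of that idea using only the public RDD API (splitPartitions is a made-up name, not an existing method): stride over each parent partition n times and union the slices, so each parent partition fans out into n children without a shuffle, at the cost of re-reading or recomputing each parent partition n times unless it is cached.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical "split each partition into n" helper; no shuffle is involved,
// but every parent partition is evaluated n times (once per slice).
def splitPartitions[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
  val slices = (0 until n).map { slice =>
    // keep every n-th element of the parent partition, offset by this slice
    rdd.mapPartitions(_.zipWithIndex.collect { case (t, i) if i % n == slice => t })
  }
  rdd.context.union(slices)  // parent.partitions.length * n partitions in total
}
```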

@fnothaft fnothaft added the discussion label Mar 1, 2015

@fnothaft
Member

fnothaft commented Mar 1, 2015
That's fair. We've got both -repartition N and -coalesce N for transform. I've generally conceived the use case for -repartition N to be when you have a small file on a large cluster, where you want to force increased parallelism, but the cost of doing a shuffle isn't too big (since the dataset is small).

I'd generally prefer to avoid modifying Spark config via command line parameters. We currently support modifying configuration via ADAM_OPTS, and I don't think there is a clean way to set a configuration parameter inside of an ADAMCommand.
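
For readers less familiar with the two existing knobs, a minimal self-contained sketch of the difference (the numbers are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("partitioning-demo").setMaster("local[4]"))
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

val widened  = rdd.repartition(32)               // full shuffle; forces more parallelism
val narrowed = rdd.coalesce(2)                   // narrow dependency; merges partitions, no shuffle
val forced   = rdd.coalesce(32, shuffle = true)  // same machinery as repartition(32)
```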

@hannes-ucsc
Contributor

hannes-ucsc commented Mar 1, 2015
`sc.hadoopConfiguration.set("mapred.max.split.size", "12345")` should do it.

I guess this is merely a usability concern. I'm under the impression that the partition size is an important variable in the heap space vs. computation time trade-off. It may deserve more prominence, in either the UI or documentation. Maybe just add a note about spark.hadoop.mapred.max.split.size in the usage screen?

I agree wrt the small file / large cluster scenario. I was not aware of that use case.
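
As a back-of-the-envelope check (assuming the default ~128 MB HDFS block/split size), the numbers in this issue line up with that knob:

```scala
// Rough arithmetic only; 128 MB is the assumed default split size, not measured.
val inputBytes   = 288L * 1024 * 1024 * 1024   // the 288 GB BAM from the issue
val defaultSplit = 128L * 1024 * 1024          // ~128 MB default split
val halvedSplit  =  64L * 1024 * 1024          // mapred.max.split.size = 64 MB

println(inputBytes / defaultSplit)  // 2304 partitions, close to the ~2306 observed
println(inputBytes / halvedSplit)   // 4608 partitions, roughly the 4612 requested, with no shuffle
```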

@hannes-ucsc hannes-ucsc changed the title from Utility of transform -repartition is questionable to Provide low-impact alternative to `transform -repartition` for reducing partition size Mar 1, 2015

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016

@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016

@fnothaft
Member

fnothaft commented Mar 2, 2017
I think we should drop this ticket as unsupported. Thoughts?

@fnothaft fnothaft added the wontfix label Mar 3, 2017

@fnothaft
Member

fnothaft commented Mar 3, 2017
Consensus was to close this as unsupported.

@fnothaft fnothaft closed this Mar 3, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017
