
Provide low-impact alternative to `transform -repartition` for reducing partition size #594

hannes-ucsc opened this issue Mar 1, 2015 · 4 comments



@hannes-ucsc hannes-ucsc commented Mar 1, 2015

Disclaimer: ADAM noob here, so this may be bogus.

Steps to reproduce:

  1. Run adam-submit transform -mark_duplicate_reads ...
  2. Open the driver webapp
  3. Note the number of partitions, e.g. 2306
  4. Hit Ctrl-C
  5. Run adam-submit transform -repartition 4612 -mark_duplicate_reads ... (i.e. twice the # of partitions)

Notice how the repartition incurs a shuffle, spilling to disk. With a 288GB BAM, we saw 1.5 TB of shuffle data being spilled to disk before the second, productive stage began. The reason is that RDD.repartition() effectively scatters consecutive elements all over the partitions in the result RDD by creating a pair RDD where each key is a randomized index of the target partition, and applying a HashPartitioner. I think the intent is to produce partitions of approximately equal size even when the input RDD is unevenly partitioned. However, in the transform case, we can assume that the input RDD is already very evenly partitioned.
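For illustration, here is a minimal plain-Scala sketch (no Spark dependency; the object and method names are mine, not Spark API) of the keying scheme behind RDD.repartition(n): each element of a parent partition gets a round-robin key starting at a random offset, and a hash partitioner then maps that key to a target partition, which is why consecutive elements scatter across every output partition and force a full shuffle:

```scala
// Plain-Scala simulation of the keying behind RDD.repartition(n).
// Not Spark code; models one parent partition as a Seq.
object RepartitionSketch {

  // Assign each element of one parent partition to a target partition,
  // mimicking Spark's randomized round-robin keying + HashPartitioner.
  def scatter[T](parentPartition: Seq[T],
                 numPartitions: Int,
                 seed: Long = 42L): Map[Int, Seq[T]] = {
    // Random starting offset, so different parent partitions do not
    // all begin filling the same target partition.
    val start = new scala.util.Random(seed).nextInt(numPartitions)
    parentPartition.zipWithIndex
      .map { case (elem, i) =>
        val key = (start + i) % numPartitions               // round-robin key
        val target = math.abs(key.hashCode) % numPartitions // HashPartitioner
        (target, elem)
      }
      .groupBy(_._1)
      .map { case (p, pairs) => p -> pairs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    // 12 consecutive elements end up spread evenly over 4 partitions.
    val result = scatter((1 to 12).toList, 4)
    result.toSeq.sortBy(_._1).foreach { case (p, elems) =>
      println(s"partition $p <- ${elems.mkString(", ")}")
    }
  }
}
```

In the real RDD, every one of those (key, element) pairs has to travel to whichever executor owns its target partition, which is where the shuffle volume comes from.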

If the purpose of -repartition is to let users work around resource limitations by reducing the partition size of their workload, then it shouldn't put additional strain on those very resources. I found that I can achieve the same effect (reducing the partition size) by specifying the spark.hadoop.mapred.max.split.size property when running adam-submit. Should transform get another parameter, e.g. -max-partition-size N, that explicitly sets the aforementioned property to N?
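For illustration, one way the workaround might look on the command line, passed through via ADAM_OPTS (the exact pass-through syntax depends on the ADAM and Spark versions in use, and the 64 MB value is arbitrary):

```shell
# Hypothetical invocation: cap each input split (and hence each
# partition) at 64 MB. The spark.hadoop.* prefix forwards the
# property into the Hadoop Configuration used to read the BAM.
ADAM_OPTS="--conf spark.hadoop.mapred.max.split.size=67108864" \
  adam-submit transform -mark_duplicate_reads input.bam output.adam
```

Because this caps the split size at read time, the input is simply cut into more partitions as it is loaded, with no shuffle at all.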

I also wonder why there isn't an RDD.split( N ) that simply splits each parent partition into the given number of child partitions.
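Such a hypothetical split could in principle be shuffle-free, since each parent partition would be carved locally into child partitions without any data crossing partition boundaries. A plain-Scala sketch of the idea, modeling partitions as Seqs (splitPartitions is my name for it, not an existing Spark API):

```scala
// Hypothetical, shuffle-free "split": carve each parent partition
// locally into up to n child partitions, preserving element order.
def splitPartitions[T](parents: Seq[Seq[T]], n: Int): Seq[Seq[T]] =
  parents.flatMap { parent =>
    if (parent.isEmpty) Seq(parent)
    else {
      // Ceiling division so no more than n children per parent.
      val chunkSize = math.max(1, math.ceil(parent.size.toDouble / n).toInt)
      parent.grouped(chunkSize).toSeq
    }
  }
```

For example, splitting two 4-element partitions with n = 2 yields four 2-element partitions, each child drawn entirely from one parent.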

@fnothaft fnothaft added the discussion label Mar 1, 2015

@fnothaft fnothaft commented Mar 1, 2015

That's fair. We've got both -repartition N and -coalesce N for transform. I've generally conceived the use case for -repartition N to be when you have a small file on a large cluster, where you want to force increased parallelism, but the cost of doing a shuffle isn't too big (since the dataset is small).

I'd generally prefer to avoid modifying Spark config via command line parameters. We currently support modifying configuration via ADAM_OPTS, and I don't think there is a clean way to set a configuration parameter inside of an ADAMCommand.


@hannes-ucsc hannes-ucsc commented Mar 1, 2015

sc.hadoopConfiguration.set("mapred.max.split.size","12345") should do it.

I guess this is merely a usability concern. I'm under the impression that the partition size is an important variable in the heap space vs. computation time trade-off. It may deserve more prominence, in either the UI or documentation. Maybe just add a note about spark.hadoop.mapred.max.split.size in the usage screen?

I agree wrt the small file / large cluster scenario. I was not aware of that use case.

@hannes-ucsc hannes-ucsc changed the title Utility of transform -repartition is questionable Provide low-impact alternative to `transform -repartition` for reducing partition size Mar 1, 2015
@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016
@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016

@fnothaft fnothaft commented Mar 2, 2017

I think we should drop this ticket as unsupported. Thoughts?

@fnothaft fnothaft added the wontfix label Mar 3, 2017

@fnothaft fnothaft commented Mar 3, 2017

Consensus was to close this as unsupported.

@fnothaft fnothaft closed this Mar 3, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017