Provide low-impact alternative to `transform -repartition` for reducing partition size #594
Disclaimer: ADAM noob here, so this may be bogus.
Steps to reproduce:
Notice how the repartition incurs a shuffle, spilling to disk. With a 288 GB BAM, we saw 1.5 TB of shuffle data spilled to disk before the second, productive stage began. The reason is that `RDD.repartition()` effectively scatters consecutive elements across all partitions of the result RDD: it creates a pair RDD in which each key is a randomized target-partition index, then applies a `HashPartitioner`. I think the intent is to produce partitions of approximately equal size even when the input RDD is unevenly partitioned. However, in the case at hand the goal is only to get smaller partitions, not to rebalance skewed ones, so the full shuffle buys nothing.
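For illustration, here is a rough sketch of that scatter expressed with public API (simplified from what I understand `repartition` to do internally; not the actual Spark source):

```scala
import scala.util.Random
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("repartition-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    val n   = 8 // target partition count

    val scattered = rdd
      .mapPartitionsWithIndex { (index, items) =>
        // Key every element with a round-robin partition index,
        // starting at a pseudo-random offset per input partition...
        var position = new Random(index).nextInt(n)
        items.map { t => position += 1; (position, t) }
      }
      .partitionBy(new HashPartitioner(n)) // ...then hash-shuffle ALL elements
      .values                              // and drop the synthetic keys

    println(scattered.partitions.length)   // 8
    sc.stop()
  }
}
```

Every element gets keyed and hashed, so the entire dataset crosses the shuffle machinery even when the input partitions are already perfectly uniform.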
If the purpose of `-repartition` is merely to reduce the size of each partition, a narrow (shuffle-free) transformation should suffice.
I also wonder why there isn't an `RDD.split(N)` that simply splits each parent partition into the given number of child partitions.
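As far as I can tell, `coalesce` only merges partitions without a shuffle; it can't split them. A hypothetical `SplitRDD` with a narrow dependency might look roughly like the sketch below (the names are made up), and its `compute` method hints at one reason such an API may not exist:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical: child partition (p * n + i) holds slice i of parent partition p.
case class SplitPartition(index: Int, parentIndex: Int, slice: Int) extends Partition

class SplitRDD[T: ClassTag](prev: RDD[T], n: Int)
  extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
    // Narrow dependency: child partition i depends only on parent i / n,
    // so no shuffle is scheduled.
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / n)
  })) {

  override def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      (0 until n).map(i => SplitPartition(p.index * n + i, p.index, i): Partition)
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SplitPartition]
    // The catch: every child task re-iterates the whole parent partition
    // and keeps only its 1/n slice, so each parent is scanned n times.
    firstParent[T].iterator(firstParent[T].partitions(sp.parentIndex), context)
      .zipWithIndex
      .collect { case (t, i) if i % n == sp.slice => t }
  }
}
```

`new SplitRDD(rdd, 4)` would quadruple the partition count without any shuffle, but at the cost of scanning each parent partition `n` times, which only pays off when the parent is cheap to re-read (e.g., straight off HDFS) or cached. Still, re-reading a 288 GB input a few times sounds cheaper than spilling 1.5 TB of shuffle data.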
That's fair. We've got both
I'd generally prefer to avoid modifying Spark config via command line parameters. We currently support modifying configuration via
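For context, one configuration-only, shuffle-free way to get smaller partitions is to shrink the input splits at load time. A sketch, assuming the standard Hadoop split-size properties and that the BAM input format honors them like other `FileInputFormat`s:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallSplits {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-splits"))
    val maxSplit = 32L * 1024 * 1024 // cap input splits at 32 MB
    sc.hadoopConfiguration.setLong("mapred.max.split.size", maxSplit)                         // old API name
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplit) // new API name
    // ...then load the BAM as usual; the input now arrives in ~32 MB
    // partitions with no shuffle involved.
    sc.stop()
  }
}
```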
I guess this is merely a usability concern. I'm under the impression that partition size is an important variable in the heap-space vs. computation-time trade-off, so it may deserve more prominence in either the UI or the documentation. Maybe just add a note about
I agree wrt the small file / large cluster scenario. I was not aware of that use case.