Characterize impact of partition size on pileup creation #163
The performance of pileup creation seems to be strongly impacted by partition size/count. This isn't entirely surprising, as we need to do a large groupBy that has fairly good data locality. This performance needs to be characterized, so that we can potentially come up with guidelines for storage parameters.
After some more debugging and reasoning about the pileup conversion algorithm, here are some thoughts:
This should be pretty performant. However, we don't see this performance in practice, because the Spark shuffle engine apparently shuffles the entire dataset when doing a repartition. Depending on the dataset size, we should only need to shuffle approximately 0.01% of pileups; as a result, we are incurring roughly 10,000x more shuffle traffic than is necessary...
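A rough back-of-the-envelope sketch of where a figure like 0.01% could come from (the specific numbers below are illustrative assumptions, not from this issue): pileups only need to move between partitions when a read straddles a partition boundary, so only the positions within one read length of each boundary are affected.

```python
# Hypothetical sizing assumptions (not stated in the issue):
partition_span = 1_000_000   # reference positions covered per partition
read_length = 100            # typical short-read length in bp

# Only the last ~read_length positions of each partition can hold pileups
# that belong in the next partition.
boundary_fraction = read_length / partition_span
print(f"fraction needing shuffle: {boundary_fraction:.4%}")

# A full repartition shuffles everything, i.e. this many times more data
# than the boundary pileups alone:
overhead = 1 / boundary_fraction
print(f"excess shuffle factor: {overhead:,.0f}x")
```

Under these assumptions the boundary fraction works out to 0.01%, matching the 10,000x excess-shuffle estimate above.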
To fix this, I plan to revise the pileup conversion engine to first aggregate the pileups that have been identified as needing to be shuffled (via a mapPartitionsWithIndex), followed by a second mapPartitionsWithIndex that performs the "move".
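A minimal, Spark-free sketch of this two-pass idea, using plain Python lists as stand-ins for RDD partitions (all names and data shapes here are illustrative assumptions, not ADAM's actual API): the first pass identifies and extracts only the misplaced pileups, and the second pass merges them into their target partitions, so only the small boundary set ever "moves".

```python
def target_partition(pos, bounds):
    """Index of the partition whose [start, end) range contains pos."""
    for i, (start, end) in enumerate(bounds):
        if start <= pos < end:
            return i
    raise ValueError(f"position {pos} not covered by any partition")

def repartition_pileups(partitions, bounds):
    # Pass 1 (mapPartitionsWithIndex analogue): each partition keeps the
    # pileups it owns and emits only the misplaced ones, keyed by target.
    kept, moved = [], {}
    for idx, part in enumerate(partitions):
        stay = []
        for pos, count in part:
            tgt = target_partition(pos, bounds)
            if tgt == idx:
                stay.append((pos, count))
            else:
                moved.setdefault(tgt, []).append((pos, count))
        kept.append(stay)
    # Pass 2 (second mapPartitionsWithIndex analogue): each partition merges
    # in the small set of pileups shuffled to it.
    return [sorted(kept[i] + moved.get(i, [])) for i in range(len(partitions))]

# Toy data: pileups as (position, count); partition 0 owns [0, 10), 1 owns [10, 20).
parts = [[(0, 3), (5, 1), (12, 2)],   # (12, 2) is misplaced: it belongs in partition 1
         [(10, 4), (19, 1)]]
bounds = [(0, 10), (10, 20)]
print(repartition_pileups(parts, bounds))
# Only one pileup crossed a partition boundary; everything else stayed local.
```

The key design point is that `moved` carries only the boundary pileups, whereas a plain `repartition` would serialize and shuffle every record.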
Pileup creation also appears to experience memory back-pressure; I am looking into ways to fix this. A simple approach may be to allow users to specify "projections" for conversion.
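To illustrate the "projection" idea (field names below are hypothetical, not ADAM's schema): if a downstream consumer only needs a few pileup fields, dropping the rest before conversion shrinks the records held in memory and moved across the network.

```python
def project(record, fields):
    """Keep only the requested fields of a record (dict-based stand-in
    for a schema projection)."""
    return {k: record[k] for k in fields}

# Hypothetical full pileup record:
full = {"position": 42, "readBase": "A", "sangerQuality": 30,
        "mapQuality": 60, "recordGroup": "rg1"}

# A user-specified projection keeps only what the analysis needs:
slim = project(full, ["position", "readBase"])
print(slim)  # {'position': 42, 'readBase': 'A'}
```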