Use of dataset api in ADAM #1018
Here is a prototype using the Spark Dataset API for the ADAM markduplicates operation:
The Dataset API version of ADAM markduplicates above is now at least 10% faster than the current RDD implementation.
Prior to the modification described below, using the Dataset API was 30-50% slower than the current RDD version.
Below I describe the modification to our use of the Dataset API that produced this improvement:
A seemingly minor modification, returning from a mapGroups function a tuple containing a single Seq rather than the same data split into three separate Seqs (either individually or nested within a case class), makes a large (2x) difference in total processing time.
This magnitude of performance gain from passing the same amount of data as one Seq instead of three seems surprising, given that the total amount of data being returned, and then shuffled in the subsequent groupBy, should be the same. The only difference is the packaging: a single Seq versus a split into three Seqs.
While I'd expect some additional overhead from multiple Seqs versus one, I am surprised by the size of the difference, and curious whether there is a best practice that suggests avoiding Seqs of case classes in a Dataset because of this behavior.
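To make the comparison concrete, here is a minimal sketch of the two return shapes from mapGroups. The record type, field names, and tag scheme are hypothetical stand-ins for illustration; the actual ADAM code differs:

```scala
// Hypothetical record type for illustration only.
case class Fragment(name: String, score: Int)

// Shape A (slower in my measurements): three separate Seqs,
// nested in a case class returned from mapGroups.
case class SplitResult(primary: Seq[Fragment],
                       secondary: Seq[Fragment],
                       supplementary: Seq[Fragment])

// Shape B (faster): the same records in a single Seq, with a tag
// field distinguishing the three categories.
case class TaggedFragment(tag: Int, fragment: Fragment)

// Repackage Shape A as Shape B. In the real pipeline this single Seq
// would be the value returned from ds.groupByKey(...).mapGroups { ... }
// before the next groupBy/shuffle.
def toSingleSeq(split: SplitResult): Seq[TaggedFragment] =
  split.primary.map(TaggedFragment(0, _)) ++
  split.secondary.map(TaggedFragment(1, _)) ++
  split.supplementary.map(TaggedFragment(2, _))
```

In both shapes the same records cross the shuffle boundary; only the encoder layout of the returned value changes.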
Thanks for looking at this @rxin
Other than the duration metric itself, I don't see any metric that captures the duration difference. It doesn't appear to be stragglers; rather, the slowdown is evenly distributed: from the min to the median to the max, tasks all take around twice as long to complete in the slow version.
@hubertp - thanks for your interest in contributing! The cause of the performance oddity between the two different ways of arranging the data prior to running groupBy, as described above (with links to Google Docs and GitHub), is still not understood. It would be interesting if you want to try to replicate it in a simple, general example. Otherwise, on the general Dataset API front, I plan to return to that shortly, though you are certainly welcome to look at what next steps make sense to you and make a PR. Is there a specific area you are interested in being involved with?
One idea for where you could help: similar to https://github.com/jpdna/adam/blob/dataset_api_markdups_v4/adam-core/src/main/scala/org/bdgenomics/adam/dataset/AlignmentRecordLimitProjDS.scala, we need to make a Dataset API case class for the VCF-based variant data like https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation, and then explore using it for variant-related operations. If you want to take a look, a PR on that subject might help push these efforts along.

We haven't yet merged my own Dataset API work into the main ADAM repo, due to the need to bump to Scala 2.11 to allow larger case classes. I plan today to rebase the work I was doing at https://github.com/jpdna/adam/tree/dataset_api_markdups_v4 onto the current ADAM repo, so that it will be a better place to start from if you want to work on Dataset API stuff in ADAM. I'll update this issue with a link to that when ready.
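As a rough starting point, such a variant case class might look something like the sketch below. The name and fields here are illustrative guesses based on common VCF columns, not the actual schema ADAM uses:

```scala
// Illustrative sketch of a flat, encoder-friendly case class for
// VCF-based variant data. All names are assumptions, not ADAM's schema.
case class VariantLimitProj(contigName: String,
                            start: Long,
                            end: Long,
                            referenceAllele: String,
                            alternateAllele: String)

// Keeping the fields flat and primitive lets Spark derive an efficient
// product encoder for the Dataset, analogous to the alignment record
// projection case class linked above.
```

The design intent, as with AlignmentRecordLimitProjDS, is a limited projection of the full record so the Dataset encoder stays cheap.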