
[SPARK-22098][CORE] Add new method aggregateByKeyLocally in RDD #19317

Closed

ConeyLiu wants to merge 4 commits into apache:master from ConeyLiu:aggregatebykeylocally

Conversation

@ConeyLiu
Contributor

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-22096

NaiveBayes currently runs aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new method, aggregateByKeyLocally, in RDD that merges the values locally on each mapper before returning the results to the driver, saving one stage.
We tested this on NaiveBayes and saw a ~20% performance gain with these changes.

This is a subtask of our improvement.
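
For reference, a minimal sketch of the idea, modeled on the existing reduceByKeyLocally and written against the public RDD API. This is an illustration only, not the PR's actual implementation; the helper name and signature are assumptions:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Fold each partition into a local map with seqOp (the map-side combine),
    // then merge the per-partition maps with combOp once they reach the driver.
    // Assumes zeroValue is serializable and that seqOp does not mutate it.
    def aggregateByKeyLocallySketch[K, V, U](rdd: RDD[(K, V)], zeroValue: U)(
        seqOp: (U, V) => U, combOp: (U, U) => U): mutable.Map[K, U] = {
      rdd.mapPartitions { iter =>
        val partial = mutable.HashMap.empty[K, U]
        iter.foreach { case (k, v) =>
          partial.update(k, seqOp(partial.getOrElse(k, zeroValue), v))
        }
        Iterator(partial)
      }.reduce { (m1, m2) =>
        m2.foreach { case (k, u) =>
          m1.update(k, m1.get(k).map(combOp(_, u)).getOrElse(u))
        }
        m1
      }
    }

Only a single job with no shuffle stage runs; the per-partition maps are combined into one Map on the driver, which is why the separate collect step in NaiveBayes can be dropped.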

How was this patch tested?

New UT.

@ConeyLiu
Contributor Author

cc @VinceShieh

@AmplabJenkins

Can one of the admins verify this patch?

@jiangxb1987
Contributor

cc @WeichenXu123 mind taking a look?

@WeichenXu123
Contributor

The current implementation of treeAggregate already aggregates locally on each partition and then sends the result to the driver, so I don't understand how your implementation differs from the current one or why it is faster. Can you verify the performance on more datasets and post the results?

@WeichenXu123
Contributor

WeichenXu123 commented Sep 23, 2017

And I have to point out that your implementation has a high risk of causing OOM. The current implementation automatically spills when the local hashmap grows too large and can take advantage of Spark's unified memory management mechanism.
Another thing: the JHashMap will be slow; it is better to use org.apache.spark.util.collection.OpenHashSet, since the hashmap is append-only.

@VinceShieh

Nice catch, thanks. The performance gain is indeed narrow.
I believe this implementation just tried to align with the implementation of 'reduceByKeyLocally'.
@ConeyLiu maybe we should revisit the code, along with the 'reduceByKeyLocally' implementation.

@WeichenXu123
Contributor

Yes. I guess the performance gain comes from this PR using a local hashmap that can consume unlimited memory, whereas the current Spark aggregation implementation automatically spills the local hashmap when it exceeds a threshold.
Memory management is a difficult thing we should not try to do in user code, I think, because it is hard to estimate how large the hashmap will grow.

@ConeyLiu
Contributor Author

ConeyLiu commented Sep 23, 2017

@jiangxb1987, @WeichenXu123, thanks for your review. This change is inspired by the TODO list; see the following code snippet:

    // TODO: Calling aggregateByKey and collect creates two stages, we can implement something
    // TODO: similar to reduceByKeyLocally to save one stage.
    val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
        seqOp = {
          case ((weightSum: Double, featureSum: DenseVector), (weight, features)) =>
            requireValues(features)
            BLAS.axpy(weight, features, featureSum)
            (weightSum + weight, featureSum)
        },
        combOp = {
          case ((weightSum1, featureSum1), (weightSum2, featureSum2)) =>
            BLAS.axpy(1.0, featureSum2, featureSum1)
            (weightSum1 + weightSum2, featureSum1)
        }).collect().sortBy(_._1)
  • The aggregateByKeyLocally code we implemented is similar to reduceByKeyLocally.
  • I agree with your suggestion to use OpenHashSet instead of JHashMap. I can change it, and reduceByKeyLocally may need the same change.
  • Here we collect all the aggregated data to the driver and sort it; I think the data should be small, and the amount of data collected by the two implementations should be almost equal.

@WeichenXu123
Contributor

Oh, I get your point. This is different from RDD.aggregate: it directly returns a Map and avoids shuffling, which seems useful when numKeys is small.
But I checked the final reduce step; it seems it could be optimized using treeAggregate, and we could add a depth parameter.
And using OpenHashSet instead of JHashMap looks better, but we need to test it first.
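
A rough sketch of what that treeAggregate-based final merge could look like (hypothetical, not part of this PR; it assumes partialMaps holds one partial hashmap per partition, as produced by the map-side step, and that combOp merges two aggregated values):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    def treeMerge[K, U](partialMaps: RDD[mutable.HashMap[K, U]],
        combOp: (U, U) => U, depth: Int = 2): mutable.HashMap[K, U] = {
      // Merge m2 into m1 key by key, applying combOp when a key appears in both maps.
      def merge(m1: mutable.HashMap[K, U], m2: mutable.HashMap[K, U]): mutable.HashMap[K, U] = {
        m2.foreach { case (k, u) => m1.update(k, m1.get(k).map(combOp(_, u)).getOrElse(u)) }
        m1
      }
      // depth > 1 adds intermediate reduce rounds (the extra shuffle discussed below),
      // so the driver merges a few pre-combined maps instead of one map per partition.
      partialMaps.treeAggregate(mutable.HashMap.empty[K, U])(merge, merge, depth)
    }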

@ConeyLiu
Contributor Author

Doesn't treeAggregate introduce another shuffle?

@WeichenXu123
Contributor

@ConeyLiu Yes, treeAggregate introduces an extra shuffle, but it can improve performance when the total amount of data the driver collects from the executors is large and there are many partitions.
But I think we can stay consistent with reduceByKeyLocally for now; this is a possible optimization that can be done in the future.

@ConeyLiu
Contributor Author

OK, I'll keep it as is. Does this need more tests or further improvements?

@WeichenXu123
Contributor

It would be better to add more performance tests for the OpenHashSet replacement to avoid a performance regression. I also found that reduceByKeyLocally uses a JHashMap; I am not sure whether there is a special reason for that. ping @cloud-fan can you help confirm this? I cannot find the original author for that code.

@ConeyLiu
Contributor Author

ConeyLiu commented Sep 25, 2017

Test case:

test("performance of aggregateByKeyLocally ") {
    val random = new Random(1)

    val pairs = sc.parallelize(0 until 10000000, 20)
      .map(p => (random.nextInt(100), p))
      .persist(StorageLevel.MEMORY_ONLY)

    pairs.count()

    val start = System.currentTimeMillis()
//    val jHashMap = pairs.aggregateByKeyLocallyWithJHashMap(new HashSet[Int]())(_ += _, _ ++= _).toArray
    val openHashMap = pairs.aggregateByKeyLocally(new HashSet[Int]())(_ += _, _ ++= _).toArray
    println(System.currentTimeMillis() - start)
  }

Test result (elapsed time in ms):

| map | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JHashMap | 2921 | 2920 | 2843 | 2950 | 2898 | 3316 | 2770 | 2994 | 3016 | 3005 | 2963.3 |
| OpenHashMap | 3029 | 2884 | 3064 | 3023 | 3108 | 3194 | 3003 | 2961 | 3115 | 3023 | 3040.4 |

The performance looks almost the same.

@ConeyLiu
Contributor Author

Hi @WeichenXu123, any comments on this?


  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,

Contributor

doesn't aggregateByKey perform map side combine?

Contributor Author

Yeah, it will. Here the 'difference' means it directly returns a map to the driver rather than an RDD.

Contributor

what's the difference between aggregateByKeyLocally and aggregateByKey(...).toLocalIterator?

Contributor Author

aggregateByKey(...).toLocalIterator needs a shuffle for aggregateByKey and then collects the RDD to the driver as an iterator. aggregateByKeyLocally is like aggregateByKey, but without a shuffle: it computes the partial combines in each task, then collects all the maps directly to the driver and performs the final combine there.
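
A hedged illustration of the two paths being compared, assuming an existing SparkContext sc and the aggregateByKeyLocally method proposed in this PR (it only exists on this branch):

    // Sample keyed data: RDD[(Int, Long)].
    val pairs = sc.parallelize(0L until 1000000L).map(v => ((v % 100).toInt, v))

    // Shuffle-based path: aggregateByKey runs the final combine on reducers,
    // then toLocalIterator brings the result to the driver one partition at a time.
    val viaShuffle = pairs.aggregateByKey(0L)(_ + _, _ + _).toLocalIterator

    // Proposed path: no shuffle stage; each task builds a local map and the
    // per-task maps are merged on the driver into a single Map.
    val viaLocal = pairs.aggregateByKeyLocally(0L)(_ + _, _ + _)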

Contributor

collect all the maps directly to the driver

Technically it's a shuffle too, and generally aggregateByKey is better for the following reasons:

  1. The final combine is executed on multiple reducers, which has better parallelism than doing it on the driver.
  2. We should always prefer doing computation on executors instead of the driver, because the driver is responsible for scheduling and has a high cost of failure recovery.
  3. Using the Spark shuffle is better for fault tolerance: if one reducer fails, you don't need to rerun all the mappers.

So I'm -1 on this API. If there are special cases where we want to aggregate locally, just call RDD.map.collect and do the local aggregation manually.

Contributor

BTW, collecting all the maps directly to the driver may easily OOM the driver, while shuffling to multiple reducers reduces the memory pressure. Even shuffling to only one reducer is still better, as executors usually have more memory than the driver.

Contributor Author

Thanks for the advice. I'll close it and try mapPartitions(...).collect in NaiveBayes.
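
For completeness, a hypothetical sketch of that mapPartitions(...).collect fallback, assuming pairs: RDD[(Int, Long)] (not the eventual NaiveBayes change; the per-partition map and the merge logic are just illustrations):

    import scala.collection.mutable

    // Each task folds its records into a local map (the map-side combine)...
    val perPartition = pairs.mapPartitions { iter =>
      val partial = mutable.HashMap.empty[Int, Long]
      iter.foreach { case (k, v) => partial.update(k, partial.getOrElse(k, 0L) + v) }
      Iterator(partial.toSeq)
    }.collect()

    // ...and the driver merges the per-partition results manually, with no shuffle stage.
    val merged = mutable.HashMap.empty[Int, Long]
    perPartition.foreach(_.foreach { case (k, s) =>
      merged.update(k, merged.getOrElse(k, 0L) + s)
    })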

@ConeyLiu closed this Oct 18, 2017