
[SPARK-22098][CORE] Add new method aggregateByKeyLocally in RDD #19317

Closed

ConeyLiu wants to merge 4 commits into apache:master from ConeyLiu:aggregatebykeylocally

Conversation

@ConeyLiu
Contributor

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-22096

NaiveBayes currently runs aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new method, aggregateByKeyLocally, in RDD that merges the values locally on each mapper before returning the results to the driver, saving one stage.
We tested this on NaiveBayes and saw a ~20% performance gain with these changes.

This is a subtask of our improvement.
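
For reference, a minimal sketch of the idea, modeled on the existing reduceByKeyLocally and written against the public RDD API. This is an illustration only, not the PR's actual implementation; the helper name and signature are assumptions:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Fold each partition into a local map with seqOp (the map-side combine),
    // then merge the per-partition maps with combOp once they reach the driver.
    // Assumes zeroValue is serializable and that seqOp does not mutate it.
    def aggregateByKeyLocallySketch[K, V, U](rdd: RDD[(K, V)], zeroValue: U)(
        seqOp: (U, V) => U, combOp: (U, U) => U): mutable.Map[K, U] = {
      rdd.mapPartitions { iter =>
        val partial = mutable.HashMap.empty[K, U]
        iter.foreach { case (k, v) =>
          partial.update(k, seqOp(partial.getOrElse(k, zeroValue), v))
        }
        Iterator(partial)
      }.reduce { (m1, m2) =>
        m2.foreach { case (k, u) =>
          m1.update(k, m1.get(k).map(combOp(_, u)).getOrElse(u))
        }
        m1
      }
    }

Only a single job with no shuffle stage runs; the per-partition maps are combined into one Map on the driver, which is why the separate collect step in NaiveBayes can be dropped.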

How was this patch tested?

New UT.

@ConeyLiu
Contributor Author

cc @VinceShieh

@AmplabJenkins

Can one of the admins verify this patch?

@jiangxb1987
Contributor

cc @WeichenXu123 mind taking a look?

@WeichenXu123
Contributor

The current implementation of treeAggregate already aggregates locally on each partition and then sends the result to the driver, so I don't understand how your implementation differs from the current one or why it is faster. Can you verify the performance on more datasets and post the results?

@WeichenXu123
Contributor

WeichenXu123 commented Sep 23, 2017

And I have to point out that your implementation has a high risk of causing OOM. The current implementation automatically spills when the local hashmap grows too large and can take advantage of Spark's unified memory management mechanism.
Another thing: the JHashMap will be slow; it is better to use org.apache.spark.util.collection.OpenHashSet, since the hashmap is append-only.

@VinceShieh

Nice catch, thanks. The performance gain is indeed narrow.
I believe this implementation just tried to align with the implementation of 'reduceByKeyLocally'.
@ConeyLiu maybe we should revisit the code, along with the 'reduceByKeyLocally' implementation.

@WeichenXu123
Contributor

Yes. I guess the performance gain comes from this PR using a local hashmap that can consume unlimited memory, whereas the current Spark aggregation implementation automatically spills the local hashmap when it exceeds a threshold.
Memory management is a difficult thing we should not try to do in user code, I think, because it is hard to estimate how large the hashmap will grow.

@ConeyLiu
Contributor Author

ConeyLiu commented Sep 23, 2017

@jiangxb1987, @WeichenXu123, thanks for your review. This change is inspired by the TODO list; see the following code snippet:

    // TODO: Calling aggregateByKey and collect creates two stages, we can implement something
    // TODO: similar to reduceByKeyLocally to save one stage.
    val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
        seqOp = {
          case ((weightSum: Double, featureSum: DenseVector), (weight, features)) =>
            requireValues(features)
            BLAS.axpy(weight, features, featureSum)
            (weightSum + weight, featureSum)
        },
        combOp = {
          case ((weightSum1, featureSum1), (weightSum2, featureSum2)) =>
            BLAS.axpy(1.0, featureSum2, featureSum1)
            (weightSum1 + weightSum2, featureSum1)
        }).collect().sortBy(_._1)
  • The aggregateByKeyLocally code we implemented is similar to reduceByKeyLocally.
  • I agree with your suggestion to use OpenHashSet instead of JHashMap. I can change it, and reduceByKeyLocally may need the same change.
  • Here we collect all the aggregated data to the driver and sort it; I think the data should be small, and the amount of data collected by the two implementations should be almost equal.

@WeichenXu123
Contributor

Oh, I get your point. This is different from RDD.aggregate: it directly returns a Map and avoids shuffling, which seems useful when numKeys is small.
But I checked the final reduce step; it seems it could be optimized using treeAggregate, and we could add a depth parameter.
And using OpenHashSet instead of JHashMap looks better, but we need to test it first.
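
A rough sketch of what that treeAggregate-based final merge could look like (hypothetical, not part of this PR; it assumes partialMaps holds one partial hashmap per partition, as produced by the map-side step, and that combOp merges two aggregated values):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    def treeMerge[K, U](partialMaps: RDD[mutable.HashMap[K, U]],
        combOp: (U, U) => U, depth: Int = 2): mutable.HashMap[K, U] = {
      // Merge m2 into m1 key by key, applying combOp when a key appears in both maps.
      def merge(m1: mutable.HashMap[K, U], m2: mutable.HashMap[K, U]): mutable.HashMap[K, U] = {
        m2.foreach { case (k, u) => m1.update(k, m1.get(k).map(combOp(_, u)).getOrElse(u)) }
        m1
      }
      // depth > 1 adds intermediate reduce rounds (the extra shuffle discussed below),
      // so the driver merges a few pre-combined maps instead of one map per partition.
      partialMaps.treeAggregate(mutable.HashMap.empty[K, U])(merge, merge, depth)
    }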

@ConeyLiu
Contributor Author

Doesn't treeAggregate introduce another shuffle?

@WeichenXu123
Contributor

@ConeyLiu Yes, treeAggregate introduces an extra shuffle, but it can improve performance when the total amount of data the driver collects from the executors is large and there are many partitions.
But I think we can stay consistent with reduceByKeyLocally for now; this is a possible optimization that can be done in the future.

@ConeyLiu
Contributor Author

OK, I'll keep it as is. Does this need more tests or further improvements?

@WeichenXu123
Contributor

It would be better to add more performance tests for the OpenHashSet replacement to avoid a performance regression. I also found that reduceByKeyLocally uses a JHashMap; I am not sure whether there is a special reason for that. ping @cloud-fan can you help confirm this? I cannot find the original author for that code.

@ConeyLiu
Contributor Author

ConeyLiu commented Sep 25, 2017

Test case:

test("performance of aggregateByKeyLocally ") {
    val random = new Random(1)

    val pairs = sc.parallelize(0 until 10000000, 20)
      .map(p => (random.nextInt(100), p))
      .persist(StorageLevel.MEMORY_ONLY)

    pairs.count()

    val start = System.currentTimeMillis()
//    val jHashMap = pairs.aggregateByKeyLocallyWithJHashMap(new HashSet[Int]())(_ += _, _ ++= _).toArray
    val openHashMap = pairs.aggregateByKeyLocally(new HashSet[Int]())(_ += _, _ ++= _).toArray
    println(System.currentTimeMillis() - start)
  }

Test result (elapsed time in ms):

| map | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JHashMap | 2921 | 2920 | 2843 | 2950 | 2898 | 3316 | 2770 | 2994 | 3016 | 3005 | 2963.3 |
| OpenHashMap | 3029 | 2884 | 3064 | 3023 | 3108 | 3194 | 3003 | 2961 | 3115 | 3023 | 3040.4 |

The performance looks almost the same.

@ConeyLiu
Contributor Author

Hi @WeichenXu123, any comments on this?


  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,

Contributor

doesn't aggregateByKey perform map side combine?

Contributor Author

Yeah, it will. Here the 'difference' means it directly returns a map to the driver rather than an RDD.

Contributor

what's the difference between aggregateByKeyLocally and aggregateByKey(...).toLocalIterator?

Contributor Author

aggregateByKey(...).toLocalIterator needs a shuffle for aggregateByKey and then collects the RDD to the driver as an iterator. aggregateByKeyLocally is like aggregateByKey, but without a shuffle: it computes the partial combines in each task, then collects all the maps directly to the driver and performs the final combine there.
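
A hedged illustration of the two paths being compared, assuming an existing SparkContext sc and the aggregateByKeyLocally method proposed in this PR (it only exists on this branch):

    // Sample keyed data: RDD[(Int, Long)].
    val pairs = sc.parallelize(0L until 1000000L).map(v => ((v % 100).toInt, v))

    // Shuffle-based path: aggregateByKey runs the final combine on reducers,
    // then toLocalIterator brings the result to the driver one partition at a time.
    val viaShuffle = pairs.aggregateByKey(0L)(_ + _, _ + _).toLocalIterator

    // Proposed path: no shuffle stage; each task builds a local map and the
    // per-task maps are merged on the driver into a single Map.
    val viaLocal = pairs.aggregateByKeyLocally(0L)(_ + _, _ + _)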

Contributor

collect all the maps directly to the driver

Technically it's a shuffle too, and generally aggregateByKey is better for the following reasons:

  1. The final combine is executed on multiple reducers, which has better parallelism than doing it on the driver.
  2. We should always prefer doing computation on executors instead of the driver, because the driver is responsible for scheduling and has a high cost of failure recovery.
  3. Using the Spark shuffle is better for fault tolerance: if one reducer fails, you don't need to rerun all the mappers.

So I'm -1 on this API. If there are special cases where we want to aggregate locally, just call RDD.map.collect and do the local aggregation manually.

Contributor

BTW, collecting all the maps directly to the driver may easily OOM the driver, while shuffling to multiple reducers reduces the memory pressure. Even shuffling to only one reducer is still better, as executors usually have more memory than the driver.

Contributor Author

Thanks for the advice. I'll close it and try mapPartitions(...).collect in NaiveBayes.
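
For completeness, a hypothetical sketch of that mapPartitions(...).collect fallback, assuming pairs: RDD[(Int, Long)] (not the eventual NaiveBayes change; the per-partition map and the merge logic are just illustrations):

    import scala.collection.mutable

    // Each task folds its records into a local map (the map-side combine)...
    val perPartition = pairs.mapPartitions { iter =>
      val partial = mutable.HashMap.empty[Int, Long]
      iter.foreach { case (k, v) => partial.update(k, partial.getOrElse(k, 0L) + v) }
      Iterator(partial.toSeq)
    }.collect()

    // ...and the driver merges the per-partition results manually, with no shuffle stage.
    val merged = mutable.HashMap.empty[Int, Long]
    perPartition.foreach(_.foreach { case (k, s) =>
      merged.update(k, merged.getOrElse(k, 0L) + s)
    })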

@ConeyLiu closed this Oct 18, 2017