
[SPARK-31182][CORE][ML] PairRDD support aggregateByKeyWithinPartitions #27947

Closed

zhengruifeng wants to merge 2 commits into apache:master from zhengruifeng:aggByKeyWithinPartitions


Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Mar 18, 2020

What changes were proposed in this pull request?

1. Implement aggregateByKeyWithinPartitions and reduceByKeyWithinPartitions (a rough signature sketch follows below).
2. Use aggregateByKeyWithinPartitions in RobustScaler.
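
For readability, here is a sketch of the shape such methods might take, mirroring the existing aggregateByKey/reduceByKey signatures on PairRDDFunctions (hypothetical; the actual signatures in this PR may differ):

// Hypothetical signature sketch: combine values per key *within* each partition,
// without shuffling, so combiners are guaranteed to be produced on the map side.
def aggregateByKeyWithinPartitions[U: ClassTag](zeroValue: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]

def reduceByKeyWithinPartitions(func: (V, V) => V): RDD[(K, V)]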

Why are the changes needed?

When implementing RobustScaler, I was looking for a way to guarantee that the QuantileSummaries in aggregateByKey are compressed on the map side.

(A QuantileSummaries must be compressed before it is merged or queried.)

I only found a tricky workaround (not applied yet); there was no existing method for this.
Previous discussions were here.
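
For illustration, a minimal sketch of the pattern in question (not this PR's code; summarize, data, compressThreshold and relativeError are placeholder names): with plain aggregateByKey there is no guarantee that the map-side combiners were compressed, so the merge function has to compress both sides defensively.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.util.QuantileSummaries

// data: (column index, value) pairs; the zero value is an empty summary per column.
def summarize(
    data: RDD[(Int, Double)],
    compressThreshold: Int,
    relativeError: Double): RDD[(Int, QuantileSummaries)] = {
  data.aggregateByKey(new QuantileSummaries(compressThreshold, relativeError))(
    (s, v) => s.insert(v),
    // merge/query require compressed summaries, hence the defensive compress on both sides
    (s1, s2) => s1.compress.merge(s2.compress))
}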

Does this PR introduce any user-facing change?

Yes, it adds new methods to PairRDD.

How was this patch tested?

Added new test suites; existing ones also cover this.

* @note V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]).
*/
def combineByKeyWithClassTagWithinPartitions[C](
Contributor Author


This impl follows combineByKeyWithClassTag, treating it as if self.partitioner == Some(partitioner); a sketch of the resulting variant follows the snippet below.

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
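
Under that reading, a minimal sketch of what the within-partitions variant could look like (not necessarily this PR's exact code): it always takes the no-shuffle branch above, combining values per key inside each partition only.

// Hypothetical sketch, meant to sit in PairRDDFunctions next to the method above:
// always behave like the self.partitioner == Some(partitioner) branch, i.e. no shuffle.
def combineByKeyWithClassTagWithinPartitions[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined")
  if (keyClass.isArray) {
    // this variant always combines on the map side, so array keys cannot be supported
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  self.mapPartitions(iter => {
    val context = TaskContext.get()
    new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
  }, preservesPartitioning = true)
}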

}.reduceByKey { case (s1, s2) => s1.compress.merge(s2.compress) }
).mapPartitionsWithIndex { case (pid, iter) =>
  val p = pid % scale
  iter.map { case (col, s) => ((p, col), s.compress) }
Contributor Author

@zhengruifeng zhengruifeng Mar 18, 2020


here we can trigger compression at the map side

).mapPartitionsWithIndex { case (pid, iter) =>
  val p = pid % scale
  iter.map { case (col, s) => ((p, col), s.compress) }
}.reduceByKey { case (s1, s2) => s1.merge(s2)
Contributor Author


Then there is no longer a need to force compression in the merge, as in s1.compress.merge(s2.compress).
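
Put together, the reworked flow reads roughly like this (assembled from the two fragments above for readability; summariesByCol stands in for the upstream RDD of (column, QuantileSummaries) pairs):

summariesByCol
  .mapPartitionsWithIndex { case (pid, iter) =>
    val p = pid % scale
    // compress each summary once, on the map side, while re-keying by (pid % scale, col)
    iter.map { case (col, s) => ((p, col), s.compress) }
  }
  // everything arriving here is already compressed, so a plain merge suffices
  .reduceByKey { case (s1, s2) => s1.merge(s2) }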

@SparkQA

SparkQA commented Mar 18, 2020

Test build #119982 has finished for PR 27947 at commit 964b9f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

friendly ping @srowen

@wetneb
Contributor

wetneb commented Apr 22, 2020

Naive question from a newbie: you are introducing new methods for PairRDDs, so would it make sense to also expose them in the Java API (as methods of JavaPairRDD)? Or is it generally expected that the Java API is updated separately?

@srowen
Member

srowen commented Apr 22, 2020

Well, I think that's part of the issue here - if it's public you kind of need to support it everywhere and for a long time. I don't know if it's worth it but I've lost the thread on this PR and would have to recall the motivation.

@zhengruifeng
Contributor Author

I tend to close it, since I can always work around it. Maybe it is not necessary.

@zhengruifeng zhengruifeng deleted the aggByKeyWithinPartitions branch April 29, 2020 03:39