add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key ... #2963

koertkuipers · 2014-10-27T20:41:30Z

...need to process values in a particular order

see:
https://issues.apache.org/jira/browse/SPARK-3655

this is the second of 2 competing pullreqs that try to address this issue. this one does so without making changes to core spark sorting routines. it is based on this suggestion by patrick wendell:

Map your RDD[(K, V)] to an RDD[((K, V), null)]
Write a custom partitioner that partitions based only on the K component of the key.
Call repartitionAndSortWithinPartitions with your custom partitioner
Map the RDD back into RDD[(K, V)]
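
The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this pull request: `KeyOnlyPartitioner`, `SortedValuesSketch` and `sortValuesWithinKey` are hypothetical names, a `HashPartitioner` is assumed as the underlying partitioner, `()` is used as the placeholder value where the suggestion uses `null`, and a Spark version is assumed in which the implicit conversion to `OrderedRDDFunctions` needs no extra import.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Hypothetical helper: partitions a composite (K, V) key by its K component only,
// so all values of one key land in the same partition.
class KeyOnlyPartitioner(underlying: Partitioner) extends Partitioner {
  override def numPartitions: Int = underlying.numPartitions
  override def getPartition(key: Any): Int = key match {
    case (k, _) => underlying.getPartition(k)
  }
}

object SortedValuesSketch {
  // Returns the same pairs, laid out so that within each partition the rows are
  // sorted by (K, V): equal keys are adjacent and their values are in order.
  def sortValuesWithinKey[K: Ordering: ClassTag, V: Ordering: ClassTag](
      rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, V)] = {
    val partitioner = new KeyOnlyPartitioner(new HashPartitioner(numPartitions))
    rdd
      .map(kv => (kv, ()))                              // step 1: (K, V) becomes the key
      .repartitionAndSortWithinPartitions(partitioner)  // steps 2 and 3
      .map(_._1)                                        // step 4: back to (K, V)
  }
}
```

The tuple ordering used by the sort is derived from the implicit `Ordering[K]` and `Ordering[V]`, which is also why the value sort is tied to whatever ordering is in scope.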

the downsides of this approach are that

a little more data goes through the shuffle (one extra object per row), i am not sure if this matters at all
the sorting by value is not generalized

the upside is that it's a much simpler and more self-contained change than the other pullreq.

AmplabJenkins · 2014-10-27T20:42:10Z

Can one of the admins verify this patch?

zsxwing · 2014-12-04T03:26:59Z

core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

+   * Note: this operation may be expensive, since there is no map-side combine, so all values are
+   * sent through the shuffle.
+   */
+  def foldLeftByKey[U: ClassTag](valueOrdering: Ordering[V], zeroValue: U,


Why not an implicit Ordering?

def foldLeftByKey[U](zeroValue: U, partitioner: Partitioner)(func: (U, V) => U)(implicit vt: ClassTag[U], valueOrdering: Ordering[V]): RDD[(K, U)]

Does this make it harder for the user to provide an ordering other than the natural ordering?
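
As a rough illustration of the question, here is a minimal sketch over a plain `Seq` rather than an RDD. The object name and this local `foldLeftByKey` are hypothetical stand-ins, not the method in the diff. It suggests that with an implicit `Ordering[V]` the natural ordering is picked up automatically, while a different ordering can still be supplied by passing the implicit parameter list explicitly.

```scala
object ImplicitOrderingSketch {
  // Hypothetical local stand-in: group by key, sort each key's values with the
  // implicit ordering, then fold them from left to right.
  def foldLeftByKey[K, V, U](data: Seq[(K, V)], zeroValue: U)(func: (U, V) => U)(
      implicit valueOrdering: Ordering[V]): Map[K, U] =
    data.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).sorted(valueOrdering).foldLeft(zeroValue)(func)
    }

  val data = Seq(("a", 2), ("a", 1), ("a", 3), ("b", 5))

  // The natural Int ordering is resolved implicitly.
  val ascending: Map[String, String] =
    foldLeftByKey(data, "")((acc, v) => acc + v)

  // A non-natural ordering is passed explicitly in the implicit parameter list.
  val descending: Map[String, String] =
    foldLeftByKey(data, "")((acc, v) => acc + v)(Ordering[Int].reverse)
}
```

The cost for the caller is one extra argument list at the call site, or an implicit value declared in scope.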

zsxwing · 2014-12-04T03:30:24Z

I suggest a new name for foldLeftByKey. The semantics of foldLeft are to aggregate the values from left to right, which does not imply sorting the values.

koertkuipers · 2014-12-05T18:01:46Z

Hey @zsxwing,
In Scala Seq the order in which the values get processed in foldLeft is well defined.
But can we make any assumptions at all about the ordering of the values if you do not sort them in Spark? And if not, is foldLeft without sorting still useful?

If so, i guess we can make the sorting optional. Or rename this function to make it clear it sorts.
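
A small plain-Scala sketch of the point under discussion (the names here are hypothetical and no Spark is involved): `foldLeft` on a `Seq` processes elements in the sequence's own order and never sorts, so a non-commutative function gives a different result for each arrival order unless the values are sorted first.

```scala
object FoldOrderSketch {
  // Non-commutative fold: appends each value as the next decimal digit.
  def digits(values: Seq[Int]): Int =
    values.foldLeft(0)((acc, v) => acc * 10 + v)

  // Two arrival orders of the same values, standing in for what a shuffle
  // might deliver when no ordering is imposed.
  val arrivalA = Seq(3, 1, 2)
  val arrivalB = Seq(2, 3, 1)

  val foldedA = digits(arrivalA)         // 312
  val foldedB = digits(arrivalB)         // 231

  // Sorting first makes the result independent of arrival order.
  val sortedA = digits(arrivalA.sorted)  // 123
  val sortedB = digits(arrivalB.sorted)  // 123
}
```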

koertkuipers · 2014-12-20T23:51:49Z

i am going to close this pullreq. i hope to pick up foldLeft later again (together with a proper java version), but for SPARK-3655 the focus for now is on:
#3632

add foldLeftByKey to PairRDDFunctions for reduce algorithms that by k…

4aa5acf

…ey need to process values in a particular order

groupByKeyAndSortValues operation

133ceff

koertkuipers closed this Dec 20, 2014
