add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key ... #2963

Closed

koertkuipers
Contributor

...need to process values in a particular order

see:
https://issues.apache.org/jira/browse/SPARK-3655

this is the second of 2 competing pullreqs that try to address this issue. this one does so without making changes to core spark sorting routines. it is based on this suggestion by patrick wendell:

  1. Map your RDD[(K, V)] to an RDD[((K, V), null)]
  2. Write a custom partitioner that partitions based only on the K component of the key.
  3. Call repartitionAndSortWithinPartitions with your custom partitioner
  4. Map the RDD back into RDD[(K, V)]
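for illustration, the four steps above could be sketched roughly as follows. this is not the code in this pullreq: `KeyOnlyPartitioner`, `SecondarySortSketch` and `sortValuesWithinKey` are hypothetical names, and the sketch assumes a Spark version where the pair/ordered RDD implicits are in scope without extra imports (on older versions add `import org.apache.spark.SparkContext._`).

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical partitioner: routes a composite (K, V) key by its K component only,
// so all values for one K land in the same partition.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (k, _) =>
      val mod = k.hashCode % partitions
      if (mod < 0) mod + partitions else mod // keep the index non-negative
  }
}

object SecondarySortSketch {
  // Hypothetical helper: returns the same pairs, sorted by (K, V) within each partition.
  def sortValuesWithinKey[K: Ordering: ClassTag, V: Ordering: ClassTag](
      rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, V)] = {
    rdd
      .map(kv => (kv, null))                       // step 1: RDD[((K, V), null)]
      .repartitionAndSortWithinPartitions(
        new KeyOnlyPartitioner(numPartitions))     // steps 2 and 3
      .map(_._1)                                   // step 4: back to RDD[(K, V)]
  }
}
```

after this, the values for each key are contiguous and ordered inside a partition, so a per-key fold can walk the partition iterator once. the composite key relies on the implicit tuple `Ordering`, which is where downside 2 below comes from.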

the downsides of this approach are that

  1. a little more data goes through the shuffle (one extra object per row), i am not sure if this matters at all
  2. the sorting by value is not generalized

the upside is that it's a much simpler and more self-contained change than the other pullreq.

@AmplabJenkins

Can one of the admins verify this patch?

* Note: this operation may be expensive, since there is no map-side combine, so all values are
* sent through the shuffle.
*/
def foldLeftByKey[U: ClassTag](valueOrdering: Ordering[V], zeroValue: U,
Member

Why not an implicit Ordering?

def foldLeftByKey(zeroValue: U, partitioner: Partitioner)(func: (U, V) => U)(implicit vt: ClassTag[U], valueOrdering: Ordering[V]): RDD[(K, U)]

Contributor Author

Does this make it harder for the user to provide an ordering other than the natural ordering?
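
As a point of reference for this question, here is a minimal plain-Scala sketch (no Spark) of how an implicit `Ordering` parameter behaves. `foldLeftSorted` is a hypothetical stand-in that works on a local `Seq`, not the proposed RDD method; it only illustrates that an implicit parameter list can still be supplied explicitly at the call site.

```scala
object ImplicitOrderingSketch {
  // Hypothetical stand-in for the suggested signature shape:
  // the ordering is the last, implicit parameter list.
  def foldLeftSorted[V, U](values: Seq[V], zeroValue: U)(func: (U, V) => U)(
      implicit valueOrdering: Ordering[V]): U =
    values.sorted(valueOrdering).foldLeft(zeroValue)(func)

  def main(args: Array[String]): Unit = {
    val xs = Seq(3, 1, 2)
    val append = (acc: String, v: Int) => acc + v.toString

    // Natural ordering is resolved implicitly.
    println(foldLeftSorted(xs, "")(append))                        // prints 123
    // A different ordering is passed explicitly in the last parameter list.
    println(foldLeftSorted(xs, "")(append)(Ordering[Int].reverse)) // prints 321
  }
}
```

So a non-natural ordering stays possible, at the cost of an extra trailing argument list that is easy to overlook. Value types without an implicit `Ordering` in scope would have to pass one explicitly.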

@zsxwing
Member

zsxwing commented Dec 4, 2014

I suggest a new name for foldLeftByKey. The semantics of foldLeft are to aggregate the values from left to right, which does not imply sorting them.

@koertkuipers
Contributor Author

Hey @zsxwing,
In Scala Seq the order in which the values get processed in foldLeft is well defined.
But can we make any assumptions at all about the ordering of the values if you do not sort them in Spark? And if not, is foldLeft without sorting still useful?

If so, i guess we can make the sorting optional. Or rename this function to make it clear it sorts.

@koertkuipers
Contributor Author

i am going to close this pullreq. i hope to pick up foldLeft later again (together with a proper java version), but for SPARK-3655 the focus for now is on:
#3632
