SPARK-3655 GroupByKeyAndSortValues #3632
Conversation
…instead of RDD[(K, Iterable[V])]. I don't think the Iterable version can be implemented efficiently
…the values (the iterables) are in-memory arrays
On a first pass, this doesn't look right. If you are providing additional methods that should be available for … Please read this comment: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala#L26 ...and look at how the …
Hey @markhamstra, I assume you are referring to the one groupByKeyAndSortValues method that has an implicit Ordering[V] parameter, since the other groupByKeyAndSortValues methods do not require an implicit Ordering to be available for the values (it needs to be explicitly provided and is typically use-case specific, not the natural ordering of the values). What is the benefit of an implicit conversion to a class that has this one method without the implicit parameter, versus a single method with an implicit parameter? It seems to me this makes the API harder to understand. Is this for Java, perhaps?
The reason for separate classes is to cleanly segregate the available/supportable functionality. Not every … Things aren't as cleanly separated in the Java API because of the lack of support for implicits there, but that doesn't mean that we should abandon the separation between … I really think that we want to repeat the pattern of …
Mhhh, I don't really agree with you. I find OrderedRDD confusing because: …
Anyhow, you are right that what I did does not conform to the OrderedRDD pattern, so if others agree with you I will rewrite it like that, no problem! Good point on the corner case of K and V having the same type... let me think about that.
Whoops, sorry, I hit the wrong button there. Didn't mean to close this pull request. @markhamstra
@koertkuipers Don't get me wrong, I'm not arguing that the way that … Changing to a different pattern is something we could consider for Spark 2.0, when we can break the established public API.
@markhamstra take a look now.
```scala
ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
) extends Ordering[Product2[A, B]] {
  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
```
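For readers without the surrounding diff: the snippet above composes two optional orderings into a single ordering on key/value pairs, defaulting to hash-code ordering for keys and an "everything is equal" ordering for values. A hypothetical plain-Java analog (the class and names below are mine, not Spark's) might look like:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Illustrative analog of the Scala KeyValueOrdering: absent orderings fall
// back to hash-code order for keys (like HashOrdering) and a no-op order
// for values (like NoOrdering).
class PairComparator<A, B> implements Comparator<Map.Entry<A, B>> {
    private final Comparator<A> keyCmp;
    private final Comparator<B> valCmp;

    PairComparator(Optional<Comparator<A>> keyOrdering,
                   Optional<Comparator<B>> valueOrdering) {
        // HashOrdering analog: order keys by their hash codes.
        this.keyCmp = keyOrdering.orElse(Comparator.comparingInt(Object::hashCode));
        // NoOrdering analog: treat all values as equal.
        this.valCmp = valueOrdering.orElse((x, y) -> 0);
    }

    @Override
    public int compare(Map.Entry<A, B> p1, Map.Entry<A, B> p2) {
        int byKey = keyCmp.compare(p1.getKey(), p2.getKey());
        return byKey != 0 ? byKey : valCmp.compare(p1.getValue(), p2.getValue());
    }
}
```

The Option-based defaults are what the review comment below questions: they make the class more general than its one current call site needs.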
What is the expected scenario in which a KeyValueOrdering is called for with B unordered? You're setting up KeyValueOrdering to be more general than your needs for its only current usage in OrderedValueRDDFunctions, but I'm not quite grasping how and where else you are expecting KeyValueOrdering to be used.

It's seeming to me that KeyValueOrdering should have two ctors:

```scala
KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B])
...
this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering)
```
Yeah, that's right, I copied it from another pull request of mine that needed the more general version. I can simplify it.
…ith ordering by hashCode not being a total ordering by comparing elements
Can one of the admins verify this patch?
See https://issues.apache.org/jira/browse/SPARK-3655
This pull request is based on the approach that uses repartitionAndSortWithinPartitions, but only implements groupByKeyAndSortValues and not foldLeft.
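For context, the semantics under discussion can be sketched without Spark at all: group a collection of (key, value) pairs by key, then sort each key's values with a supplied ordering. A single-machine Java sketch (all names here are mine, not the PR's API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupSortDemo {
    // Group (key, value) pairs by key, then sort each key's values with
    // the supplied comparator: the observable result of the proposed
    // groupByKeyAndSortValues, computed naively in memory.
    static <K, V> Map<K, List<V>> groupByKeyAndSortValues(
            List<Map.Entry<K, V>> pairs, Comparator<V> valueOrder) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }
        grouped.values().forEach(vs -> vs.sort(valueOrder));
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("a", 3), Map.entry("b", 1),
            Map.entry("a", 1), Map.entry("a", 2));
        System.out.println(groupByKeyAndSortValues(pairs, Comparator.naturalOrder()));
        // {a=[1, 2, 3], b=[1]}
    }
}
```

The whole point of the JIRA, though, is to avoid this naive in-memory sort: in the PR, the sorting happens during the shuffle via repartitionAndSortWithinPartitions, so a key's values never need to be materialized as an in-memory array before sorting.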