-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MLlib] SPARK-5954: Top by key #5075
Conversation
Test build #28735 has started for PR 5075 at commit
|
test("topByKey") { | ||
val pairs = sc.parallelize(Array((1, 1), (1, 2), (3, 2), (3, 7), (3, 5), (5, 1), (5, 3)), 2) | ||
|
||
val sets = pairs.topByKey(2).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use colelctAsMap
.
Test build #28735 has finished for PR 5075 at commit
|
Test FAILed. |
Test build #28746 has started for PR 5075 at commit
|
Test build #28746 has finished for PR 5075 at commit
|
Test PASSed. |
queue1 ++= queue2 | ||
queue1 | ||
} | ||
).mapValues(_.toArray.sorted(ord.reverse)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the toArray
not already give you the top k in order? this seems to be the behavior already as it returns the array formed from the iterator in the underlying PriorityQueue
. Worth testing I think. (That said, I suppose sorting an already-sorted array is pretty fast.)
Nit: a few lines above, an extra space in front of =>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Removing sorted
will fail the test as the toArray
does not generate a sorted sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm OK that surprises me but if you verified it is required, leave it of course
Looking good, but what about the Java / Python APIs too? |
Test build #28815 has started for PR 5075 at commit
|
Thanks for the contribution. I think we should first put this in MLlib, and then consider moving it to RDD in the future. The RDD API is fairly crowded, and adding more functions to it will make it harder to learn/read/understand. |
Test build #28815 has finished for PR 5075 at commit
|
Test PASSed. |
Test build #28891 has started for PR 5075 at commit
|
Test build #28897 has started for PR 5075 at commit
|
Test build #28898 has started for PR 5075 at commit
|
Test build #28891 has finished for PR 5075 at commit
|
Test PASSed. |
Test build #28897 has finished for PR 5075 at commit
|
Test PASSed. |
Test build #28898 has finished for PR 5075 at commit
|
Test PASSed. |
val topMap = pairs.topByKey(2).collectAsMap() | ||
|
||
assert(topMap.size === 3) | ||
val valuesFor1 = topMap(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this variable. Just use topMap(1)
in assert.
Test build #28906 has started for PR 5075 at commit
|
Test build #28906 has finished for PR 5075 at commit
|
Test PASSed. |
LGTM. Merged into master. Thanks! |
@mengxr Thanks! |
This PR implements two functions
topByKey(num: Int): RDD[(K, Array[V])]
finds the top-k values for each key in a pair RDD. This can be used, e.g., in computing top recommendations.takeOrderedByKey(num: Int): RDD[(K, Array[V])]
does the opposite oftopByKey
The
sorted
is used here as thetoArray
method of the PriorityQueue does not return a necessarily sorted array.