[MLlib] SPARK-5954: Top by key #5075

Closed
coderxiang
Contributor

This PR implements two functions

  • topByKey(num: Int): RDD[(K, Array[V])] finds the top-k values for each key in a pair RDD. This can be used, e.g., in computing top recommendations.
  • takeOrderedByKey(num: Int): RDD[(K, Array[V])] does the opposite of topByKey: it finds the num smallest values for each key.

The result is sorted explicitly because the toArray method of the PriorityQueue does not necessarily return a sorted array.
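As an illustration of the idea described above, here is a minimal sketch in plain Scala collections rather than RDDs. The object name `TopByKeySketch` and both method bodies are assumptions made for this example and are not the code in this PR; the sketch also returns `List` instead of `Array` to avoid needing a `ClassTag`. It keeps a bounded heap per key and sorts at the end, since heap iteration order is unspecified.

```scala
import scala.collection.mutable

// Hypothetical, non-distributed sketch; not the PR's implementation.
object TopByKeySketch {
  // Keep at most `num` largest values per key using a bounded heap.
  def topByKey[K, V](pairs: Seq[(K, V)], num: Int)(implicit ord: Ordering[V]): Map[K, List[V]] = {
    val queues = mutable.Map.empty[K, mutable.PriorityQueue[V]]
    for ((k, v) <- pairs) {
      // With the reversed ordering, dequeue() removes the smallest retained value.
      val q = queues.getOrElseUpdate(k, new mutable.PriorityQueue[V]()(ord.reverse))
      q.enqueue(v)
      if (q.size > num) q.dequeue()
    }
    // Heap iteration order is unspecified, so sort explicitly (largest first).
    queues.map { case (k, q) => k -> q.toList.sorted(ord.reverse) }.toMap
  }

  // The "opposite" operation: the `num` smallest values per key, ascending.
  def takeOrderedByKey[K, V](pairs: Seq[(K, V)], num: Int)(implicit ord: Ordering[V]): Map[K, List[V]] =
    topByKey(pairs, num)(ord.reverse)
}
```

Reusing `topByKey` with a reversed ordering keeps the two functions symmetric, which is one plausible way to read "does the opposite".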

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28735 has started for PR 5075 at commit 0895c17.

  • This patch merges cleanly.

test("topByKey") {
val pairs = sc.parallelize(Array((1, 1), (1, 2), (3, 2), (3, 7), (3, 5), (5, 1), (5, 3)), 2)

val sets = pairs.topByKey(2).collect()
Contributor

Use collectAsMap.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28735 has finished for PR 5075 at commit 0895c17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28735/
Test FAILed.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28746 has started for PR 5075 at commit 70c6e35.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28746 has finished for PR 5075 at commit 70c6e35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28746/
Test PASSed.

queue1 ++= queue2
queue1
}
).mapValues(_.toArray.sorted(ord.reverse))
Member

Does toArray not already give you the top k in order? That seems to be the existing behavior, as it returns the array formed from the iterator of the underlying PriorityQueue. Worth testing, I think. (That said, I suppose sorting an already-sorted array is pretty fast.)

Nit: a few lines above, an extra space in front of =>

Contributor Author

Good catch. Removing sorted makes the test fail, because toArray does not produce a sorted sequence.

Member

Hm, OK, that surprises me, but if you verified it is required, leave it of course.
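The point of this thread can be checked with the standard library's `scala.collection.mutable.PriorityQueue`, whose iteration order is documented as unspecified. The sketch below is a hedged illustration only: the object and method names are hypothetical, and the queue used in the PR may be a different class with its own behavior. It contrasts draining with `dequeue()`, which yields priority order, with taking a snapshot through `toList`, which only guarantees the same elements.

```scala
import scala.collection.mutable

// Hypothetical helper names, for illustration only.
object HeapOrderSketch {
  // Repeated dequeue() returns elements in priority order (largest first here).
  def drainInOrder(values: Seq[Int]): List[Int] = {
    val pq = mutable.PriorityQueue.empty[Int]
    pq ++= values
    val out = mutable.ListBuffer.empty[Int]
    while (pq.nonEmpty) out += pq.dequeue()
    out.toList
  }

  // A snapshot follows the heap's internal layout: same elements, no order guarantee.
  def snapshot(values: Seq[Int]): List[Int] = {
    val pq = mutable.PriorityQueue.empty[Int]
    pq ++= values
    pq.toList
  }
}
```

Because only the drained form is guaranteed to be ordered, an explicit sort after a snapshot is the safer choice when callers expect sorted output.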

@srowen
Member

srowen commented Mar 18, 2015

Looking good, but what about the Java / Python APIs too?

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28815 has started for PR 5075 at commit 901b0af.

  • This patch merges cleanly.

@rxin
Contributor

rxin commented Mar 18, 2015

Thanks for the contribution.

I think we should first put this in MLlib, and then consider moving it to RDD in the future. The RDD API is fairly crowded, and adding more functions to it will make it harder to learn/read/understand.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28815 has finished for PR 5075 at commit 901b0af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28815/
Test PASSed.

@SparkQA

SparkQA commented Mar 19, 2015

Test build #28891 has started for PR 5075 at commit 82dded9.

  • This patch merges cleanly.

@coderxiang
Contributor Author

@rxin @mengxr per the comments, I created MLPairRDDFunctions.scala and moved the function there in the update.
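Moving the function into a separate wrapper class is an instance of the "enrich my library" pattern: the extra methods live outside the core type and become available only after an explicit import. The following is a minimal sketch of that pattern over a plain `Seq`, not over an RDD; `SeqPairFunctionsSketch` and `PairSeqOps` are hypothetical names, and the group-and-sort body is a simplification chosen for brevity rather than the approach taken in this PR.

```scala
// Hypothetical stand-in for a wrapper such as MLPairRDDFunctions,
// written against Seq so that it runs without Spark.
object SeqPairFunctionsSketch {
  implicit class PairSeqOps[K, V](val self: Seq[(K, V)]) {
    // Simplified top-k per key: group, sort descending, keep the first `num`.
    def topByKey(num: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
      self.groupBy(_._1).map { case (k, kvs) =>
        k -> kvs.map(_._2).sorted(ord.reverse).take(num)
      }
  }
}
```

With `import SeqPairFunctionsSketch._` in scope, `Seq(("a", 1), ("a", 3)).topByKey(1)` resolves through the implicit class, while code that does not import it sees no change to the `Seq` API.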

coderxiang changed the title from "[Core] SPARK-5954: Top by key" to "[MLlib] SPARK-5954: Top by key" on Mar 19, 2015

@SparkQA

SparkQA commented Mar 19, 2015

Test build #28897 has started for PR 5075 at commit a80e0ec.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 19, 2015

Test build #28898 has started for PR 5075 at commit 6f565c0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 19, 2015

Test build #28891 has finished for PR 5075 at commit 82dded9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    • class CheckpointWriteHandler(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28891/
Test PASSed.

@SparkQA

SparkQA commented Mar 20, 2015

Test build #28897 has finished for PR 5075 at commit a80e0ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    • class CheckpointWriteHandler(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28897/
Test PASSed.

@SparkQA

SparkQA commented Mar 20, 2015

Test build #28898 has finished for PR 5075 at commit 6f565c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28898/
Test PASSed.

val topMap = pairs.topByKey(2).collectAsMap()

assert(topMap.size === 3)
val valuesFor1 = topMap(1)
Contributor

We don't need this variable. Just use topMap(1) in assert.

@SparkQA

SparkQA commented Mar 20, 2015

Test build #28906 has started for PR 5075 at commit 1611c37.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 20, 2015

Test build #28906 has finished for PR 5075 at commit 1611c37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28906/
Test PASSed.

@mengxr
Contributor

mengxr commented Mar 20, 2015

LGTM. Merged into master. Thanks!

asfgit closed this in 5e6ad24 on Mar 20, 2015

@coderxiang
Contributor Author

@mengxr Thanks!

coderxiang deleted the topByKey branch on March 20, 2015 18:49
6 participants