[Spark-4060] [MLlib] exposing special rdd functions to the public #2907
Can one of the admins verify this patch? |
At best, this would become an "Experimental" API, marked as such, and it may need another unit test or two. What's the use case that makes it worth committing to support these externally? |
@srowen Unit tests are in I think we can mark it |
Cool, you would know best if it's ready for external use. Looks good on unit tests. What if it returned |
Sounds good. @numbnut Could you update the PR and change the following?
Thanks! |
RE: use case. We are considering using the treeAggregate function in the implementation of SpectralClustering. In addition, we noticed that EigenvalueDecomposition.symmetricEigs is private; we would likely want to use that too. |
Yes, treeAggregate is very useful. In fact, I was going to suggest moving it to the core RDD API. Any reason not to do that? |
I updated the pull request as proposed. Please review it carefully because I'm new to Spark and Scala. |
@shivaram For common RDD operations in core/sql, each task is small (including the result) and there are more partitions than executors. @pwendell Are you comfortable with adding those RDD functions to core? |
Agree that the choice between aggregate and treeAggregate depends on the computation and the reduction function, but I don't think it's specific to MLlib per se. Any Spark application that has CPU-intensive code can benefit from treeAggregate. My view is that we shouldn't replace |
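To make the aggregate vs. treeAggregate distinction above concrete, here is a minimal plain-Scala sketch of the idea. It is not Spark's implementation and does not use Spark at all: partitions are modeled as local sequences, and the names `TreeAggregateSketch`, `aggregate`, `treeAggregate`, and `fanIn` are hypothetical, chosen only for illustration. The flat version merges every per-partition result in a single step (in Spark, that step happens on the driver), while the tree version merges partial results in small groups, level by level.

```scala
object TreeAggregateSketch {
  // Flat aggregation: fold each partition with seqOp, then merge all
  // partial results in a single step with combOp.
  def aggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // Tree aggregation: fold each partition with seqOp, then repeatedly merge
  // the partial results in groups of `fanIn` until one value remains.
  // `fanIn` is an assumption of this sketch, not a Spark parameter.
  def treeAggregate[T, U](partitions: Seq[Seq[T]], zero: U, fanIn: Int = 2)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    require(fanIn >= 2, "fanIn must be at least 2")
    var partials: Vector[U] = partitions.map(_.foldLeft(zero)(seqOp)).toVector
    while (partials.size > 1) {
      // One tree level: each group of up to fanIn partials becomes one value.
      partials = partials.grouped(fanIn).map(_.reduce(combOp)).toVector
    }
    partials.headOption.getOrElse(zero)
  }

  def main(args: Array[String]): Unit = {
    // Ten simulated partitions holding the numbers 1 to 100.
    val partitions: Seq[Seq[Int]] = (1 to 100).grouped(10).map(_.toSeq).toSeq
    val flat = aggregate(partitions, 0L)(_ + _, _ + _)
    val tree = treeAggregate(partitions, 0L, fanIn = 3)(_ + _, _ + _)
    println(s"flat=$flat tree=$tree") // prints "flat=5050 tree=5050"
  }
}
```

Both functions return the same value whenever combOp is associative and zero is neutral for it. The difference is only in how many partial results any single merge step has to handle, which is why the benefit depends on the size of the partial results and the cost of the reduction function.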
@@ -29,7 +30,7 @@ import org.apache.spark.util.Utils
 * Machine learning specific RDD functions.
 */
private[mllib]
Does it work if we still leave the class as private[mllib]?
In the tests it works fine. I just wanted to expose as little as possible.
Shall I replace "private[mllib]" with "@DeveloperApi"?
I think it is necessary to replace private[mllib] with @DeveloperApi. Otherwise, the methods under RDDFunctions won't show up in the generated doc.
ok to test |
test this please |
Test build #22784 has started for PR 2907 at commit
|
Test build #22784 has finished for PR 2907 at commit
|
Test FAILed. |
Test build #22875 has started for PR 2907 at commit
|
I'm sorry for breaking the tests. I thought they had run cleanly on my machine, and I can't explain why they did, but I found the mistake and corrected it. I also changed "private[mllib]" to "@DeveloperApi" to make it visible in the docs. Should I rebase onto branch-1.2, or is there something else I can do to simplify the merge process? |
Test build #22875 has finished for PR 2907 at commit
|
Test PASSed. |
Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>
Closes #2907 from numbnut/master and squashes the following commits:
7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907
(cherry picked from commit f90ad5d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
LGTM. There is a minor issue with @DeveloperAPI annotation, where we also need |