Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark-4060] [MLlib] exposing special rdd functions to the public #2907

Closed
wants to merge 1 commit into from
Closed

Conversation

numbnut
Copy link

@numbnut numbnut commented Oct 23, 2014

No description provided.

@numbnut numbnut changed the title MLlib, exposing special rdd functions to the public [Spark-4060] [MLlib] exposing special rdd functions to the public Oct 23, 2014
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Oct 24, 2014

At best, this would become an "Experimental" API and marked as such, and need a unit test or two more maybe. What's the use case that makes it worth committing to support these externally?

@mengxr
Copy link
Contributor

mengxr commented Oct 24, 2014

@srowen Unit tests are in

https://github.com/numbnut/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/rdd/RDDFunctionsSuite.scala

I think we can mark it @DeveloperApi. I'm a little concerned about the return type of sliding, which is not Java-friendly. Any suggestions?

@srowen
Copy link
Member

srowen commented Oct 24, 2014

Cool, you would know best if it's ready for external use. Looks good on unit tests.

What if it returned RDD[Array[T]]? I experimented briefly with making that change and it looked like it would work out fairly simple. That's Java friendlier?

@mengxr
Copy link
Contributor

mengxr commented Oct 24, 2014

Sounds good. @numbnut Could you update the PR and change the following?

  1. add @DeveloperAPI to RDDFunctions
  2. change the return type of sliding to RDD[Array[T]] and update the code in other places

Thanks!

@javadba
Copy link
Contributor

javadba commented Oct 29, 2014

RE: use case. We are considering to use the treeAggregate function within the implementation of SpectralClustering. In addition it was noted that the EigenvalueDecomposition.symmetricEigs is private: it is likely we would like to use that too

@shivaram
Copy link
Contributor

Yes - treeAggregate is very useful -- In fact I was going to suggest moving it to the core RDD API. Any reasons to not do that ?

@numbnut
Copy link
Author

numbnut commented Oct 30, 2014

I updated the pull request like proposed. Please review it carefully because I'm new to Spark and Scala.

@mengxr
Copy link
Contributor

mengxr commented Oct 30, 2014

@shivaram For common RDD operations in core/sql, each task is small (including the result) and there are more partitions than executors. treeAggregate creates a shuffle stage and holds data there, while aggregate can start working when partial results are available.

@pwendell Are you comfortable with adding those RDD functions to core?

@shivaram
Copy link
Contributor

Agree that using aggregate vs. treeAggregate depends on the computation, reduction function -- but I don't think its specific to MLLib per se. Any Spark application that has CPU intensive code can benefit from treeAggregate. My view is that we shouldn't replace aggregate with this -- we should just allow users to choose the right aggregation strategy based on what they need

@@ -29,7 +30,7 @@ import org.apache.spark.util.Utils
* Machine learning specific RDD functions.
*/
private[mllib]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work if we still leave the class as private[mllib]?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tests it works fine. I just wanted to expose as less as possible.
Shall I replace "private[mllib]" with "@DeveloperAPI"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is necessary to replace private[mllib] with @DeveloperApi. Otherwise, the methods under RDDFunctions won't show up in the generated doc.

@mengxr
Copy link
Contributor

mengxr commented Nov 3, 2014

ok to test

@mengxr
Copy link
Contributor

mengxr commented Nov 3, 2014

test this please

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22784 has started for PR 2907 at commit 0840e6e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22784 has finished for PR 2907 at commit 0840e6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22784/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22875 has started for PR 2907 at commit 7f7c767.

  • This patch merges cleanly.

@numbnut
Copy link
Author

numbnut commented Nov 4, 2014

I'm sorry for breaking the tests. I thought they had run cleanly on my machine but I found the mistake and corrected it. Can't explain that.

I also changed "private[mllib]" to "@DeveloperAPI" to make it visible in the docs.

Am I supposed to rebase to the branch-1.2 or what can I do to simplify the merge process?

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22875 has finished for PR 2907 at commit 7f7c767.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22875/
Test PASSed.

asfgit pushed a commit that referenced this pull request Nov 4, 2014
Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>

Closes #2907 from numbnut/master and squashes the following commits:

7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907

(cherry picked from commit f90ad5d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in f90ad5d Nov 4, 2014
@mengxr
Copy link
Contributor

mengxr commented Nov 4, 2014

LGTM. There is a minor issue with @DeveloperAPI annotation, where we also need :: DeveloperApi :: at the beginning of the doc. I will fix that later. I've merged this into master and branch-1.2. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants