[Spark-4060] [MLlib] exposing special rdd functions to the public #2907

numbnut · 2014-10-23T10:08:54Z

No description provided.

AmplabJenkins · 2014-10-23T10:12:11Z

Can one of the admins verify this patch?

srowen · 2014-10-24T14:25:33Z

At best, this would become an "Experimental" API, marked as such, and it may need another unit test or two. What's the use case that makes it worth committing to support these externally?

mengxr · 2014-10-24T20:39:34Z

@srowen Unit tests are in

https://github.com/numbnut/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/rdd/RDDFunctionsSuite.scala

I think we can mark it @DeveloperApi. I'm a little concerned about the return type of sliding, which is not Java-friendly. Any suggestions?

srowen · 2014-10-24T21:49:07Z

Cool, you would know best if it's ready for external use. Looks good on unit tests.

What if it returned RDD[Array[T]]? I experimented briefly with making that change, and it looked like it would work out fairly simply. Is that Java-friendlier?

mengxr · 2014-10-24T22:45:26Z

Sounds good. @numbnut Could you update the PR and change the following?

- add @DeveloperApi to RDDFunctions
- change the return type of sliding to RDD[Array[T]] and update the code in other places

Thanks!
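To make the proposed return type concrete, here is a minimal local sketch that models the windowing semantics with plain Scala collections standing in for an RDD. The object `SlidingSketch` and the helper `slidingArrays` are hypothetical names introduced for illustration; this is not the MLlib implementation, only an approximation of what a `sliding` that yields arrays would hand back to callers. Each window is an `Array[T]`, which Java code sees as an ordinary array rather than a Scala collection.

```scala
import scala.reflect.ClassTag

object SlidingSketch {
  // Hypothetical helper (not part of Spark): passes a window of fixed size
  // over a local sequence and returns each full window as an Array[T].
  // Assumption: windows shorter than windowSize are dropped, so an input
  // with fewer items than windowSize yields no windows at all.
  def slidingArrays[T: ClassTag](items: Seq[T], windowSize: Int): Vector[Array[T]] = {
    require(windowSize > 0, s"windowSize must be positive, but got $windowSize")
    items
      .sliding(windowSize)
      .filter(_.length == windowSize)
      .map(_.toArray)
      .toVector
  }

  def main(args: Array[String]): Unit = {
    // Prints [1, 2, 3], then [2, 3, 4], then [3, 4, 5], one per line.
    slidingArrays(1 to 5, 3).foreach(w => println(w.mkString("[", ", ", "]")))
  }
}
```

The distributed version additionally has to stitch together windows that span partition boundaries, which this local model leaves out.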

javadba · 2014-10-29T05:20:34Z

RE: use case. We are considering using the treeAggregate function within the implementation of SpectralClustering. In addition, we noticed that EigenvalueDecomposition.symmetricEigs is private; we would likely want to use that too.

shivaram · 2014-10-29T05:59:08Z

Yes - treeAggregate is very useful -- in fact, I was going to suggest moving it to the core RDD API. Any reasons not to do that?

numbnut · 2014-10-30T15:21:16Z

I updated the pull request as proposed. Please review it carefully because I'm new to Spark and Scala.

mengxr · 2014-10-30T17:49:48Z

@shivaram For common RDD operations in core/sql, each task is small (including the result) and there are more partitions than executors. treeAggregate creates a shuffle stage and holds data there, while aggregate can start working when partial results are available.

@pwendell Are you comfortable with adding those RDD functions to core?

shivaram · 2014-10-30T17:56:11Z

Agree that choosing between aggregate and treeAggregate depends on the computation and the reduction function -- but I don't think it's specific to MLlib per se. Any Spark application that has CPU-intensive code can benefit from treeAggregate. My view is that we shouldn't replace aggregate with this -- we should just allow users to choose the right aggregation strategy based on what they need.
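The difference between the two strategies can be sketched locally. The following is a simplified model under stated assumptions, not Spark's implementation: partitions are plain in-memory sequences, and the hypothetical `fanIn` parameter stands in for however the real method decides how many partial results to merge per level. The names `TreeAggregateSketch` and `flatAggregate` are introduced for illustration only.

```scala
object TreeAggregateSketch {
  // Flat strategy: fold every partition, then merge all partial results
  // in a single place (in Spark, that place would be the driver).
  def flatAggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // Tree strategy: fold every partition, then merge partial results in
  // groups of `fanIn` per level until a single value remains.
  // `fanIn` is an assumption of this sketch, not a Spark parameter.
  def treeAggregate[T, U](partitions: Seq[Seq[T]], zero: U, fanIn: Int = 2)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    require(fanIn >= 2, s"fanIn must be at least 2, but got $fanIn")
    var partials: Seq[U] = partitions.map(_.foldLeft(zero)(seqOp))
    while (partials.size > 1) {
      // Each group models one intermediate task combining a few partials.
      partials = partials.grouped(fanIn).map(_.reduce(combOp)).toVector
    }
    partials.headOption.getOrElse(zero)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2), Seq(3, 4), Seq(5), Seq(6, 7, 8))
    println(flatAggregate(parts, 0)(_ + _, _ + _)) // prints 36
    println(treeAggregate(parts, 0)(_ + _, _ + _)) // prints 36
  }
}
```

Both strategies give the same answer for an associative `combOp`; what changes is where the merging work happens. The tree version spreads it over intermediate levels at the cost of extra stages, which is the trade-off discussed above.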

mengxr · 2014-10-30T22:50:43Z

mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala

@@ -29,7 +30,7 @@ import org.apache.spark.util.Utils
 * Machine learning specific RDD functions.
 */
 private[mllib]


Does it work if we still leave the class as private[mllib]?

numbnut · 2014-10-31

In the tests it works fine. I just wanted to expose as little as possible.
Shall I replace "private[mllib]" with "@DeveloperApi"?

mengxr · 2014-11-03

I think it is necessary to replace private[mllib] with @DeveloperApi. Otherwise, the methods under RDDFunctions won't show up in the generated doc.

mengxr · 2014-11-03T00:54:45Z

ok to test

mengxr · 2014-11-03T00:54:50Z

test this please

SparkQA · 2014-11-03T01:00:22Z

Test build #22784 has started for PR 2907 at commit 0840e6e.

This patch merges cleanly.

SparkQA · 2014-11-03T02:01:22Z

Test build #22784 has finished for PR 2907 at commit 0840e6e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

AmplabJenkins · 2014-11-03T02:01:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22784/

SparkQA · 2014-11-04T10:19:56Z

Test build #22875 has started for PR 2907 at commit 7f7c767.

This patch merges cleanly.

numbnut · 2014-11-04T10:25:38Z

I'm sorry for breaking the tests. They appeared to run cleanly on my machine, which I can't explain, but I have found the mistake and corrected it.

I also changed "private[mllib]" to "@DeveloperApi" to make it visible in the docs.

Am I supposed to rebase onto branch-1.2, or is there something else I can do to simplify the merge process?

SparkQA · 2014-11-04T11:44:46Z

Test build #22875 has finished for PR 2907 at commit 7f7c767.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

AmplabJenkins · 2014-11-04T11:44:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22875/

Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>

Closes #2907 from numbnut/master and squashes the following commits:

7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907

(cherry picked from commit f90ad5d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

mengxr · 2014-11-04T17:59:04Z

LGTM. There is a minor issue with the @DeveloperApi annotation: we also need :: DeveloperApi :: at the beginning of the doc comment. I will fix that later. I've merged this into master and branch-1.2. Thanks!
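For reference, the convention mentioned here pairs the annotation with a matching marker on the first line of the doc comment. The sketch below shows the shape under two assumptions: a stand-in annotation is declared locally so the snippet compiles without Spark on the classpath (in Spark the real one is org.apache.spark.annotation.DeveloperApi), and `ExampleFunctions` is a hypothetical class, not the actual RDDFunctions.

```scala
import scala.annotation.StaticAnnotation

// Stand-in for org.apache.spark.annotation.DeveloperApi, declared here only
// so that this sketch is self-contained.
class DeveloperApi extends StaticAnnotation

/**
 * :: DeveloperApi ::
 * Hypothetical wrapper used to illustrate the doc-comment marker; the
 * marker line above is what makes the API level visible in generated docs.
 */
@DeveloperApi
class ExampleFunctions[T](self: Seq[T]) extends Serializable {

  /** Returns the number of wrapped items. */
  def count: Int = self.size
}

object DeveloperApiSketch {
  def main(args: Array[String]): Unit = {
    println(new ExampleFunctions(Seq(1, 2, 3)).count) // prints 3
  }
}
```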

numbnut changed the title ~~MLlib, exposing special rdd functions to the public~~ [Spark-4060] [MLlib] exposing special rdd functions to the public Oct 23, 2014

mengxr reviewed Oct 30, 2014

[Spark-4060] [MLlib] exposing special rdd functions to the public, #2907

7f7c767

asfgit closed this in f90ad5d Nov 4, 2014
