
Conversation

@mridulm
Contributor

@mridulm mridulm commented Mar 18, 2015

Currently these are private[spark]

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28831 has started for PR 5084 at commit 174826c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28831 has finished for PR 5084 at commit 174826c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BoundedPriorityQueue[A](maxSize: Int)(implicit ord: Ordering[A])
    • class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28831/

@mridulm
Contributor Author

mridulm commented Mar 18, 2015

@pwendell Can we merge this into 1.3 as well? Otherwise we will have to wait for 1.4 ...

@sryza
Contributor

sryza commented Mar 20, 2015

Would we ever expect to expose these classes in a stable fashion? Exposing generic Scala collection functionality seems somewhat orthogonal to Spark's goals. Unless this kind of functionality is uniquely useful in the context of Spark?

@mridulm
Contributor Author

mridulm commented Mar 21, 2015

These are not generic Scala collections - they are specific to using Spark at scale.
Since we already have them in Spark core, it is better to expose them as experimental (so no guarantees of stability); the alternative is to replicate these in each project, which is suboptimal at best.

@srowen
Member

srowen commented Mar 21, 2015

I think Utils was not meant to be exposed at all, and I don't know that Spark should just open up all of that even as experimental. Likewise, I'd like the flexibility to replace some of these custom collections with standard ones, rather than open them up. I believe these should remain private.

@pwendell
Contributor

I agree with Sean and Sandy here. I don't think we should just expose internal utilities like this. The collections classes are simple enough that someone can just copy our code if they want to use it.

The reason we have @Experimental and @DeveloperApi is to provide public APIs we expect to eventually stabilize or that we want feedback on from the community. They still come with a reasonable expectation of stability and an implicit endorsement that we expect people to use them.

[EDIT - an earlier version mistakenly thought this exposed our entire Utils object, not the collections utils]

@srowen
Member

srowen commented Mar 21, 2015

(Oh I also didn't see that this is collection.Utils)

@andrewor14
Contributor

(Edit: I also thought it was the util.Utils object. Please disregard my latest comment)

@JoshRosen
Contributor

I'm also in favor of not exposing this and having users copy these classes themselves, which should be pretty easy since these files are more-or-less self-contained.
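For illustration, a minimal self-contained sketch of the kind of copy being suggested (the class name and shape are my own stand-in, not Spark's actual BoundedPriorityQueue):

```scala
import java.util.{PriorityQueue => JPriorityQueue}
import scala.collection.JavaConverters._

// Minimal bounded priority queue: keeps at most `maxSize` elements,
// evicting the smallest element (per `ord`) when it overflows.
class SimpleBoundedPriorityQueue[A](maxSize: Int)(implicit ord: Ordering[A])
    extends Iterable[A] {

  private val underlying = new JPriorityQueue[A](maxSize, ord)

  override def iterator: Iterator[A] = underlying.iterator.asScala

  def +=(elem: A): this.type = {
    if (underlying.size < maxSize) {
      underlying.offer(elem)
    } else if (ord.gt(elem, underlying.peek())) {
      // The new element beats the current minimum: replace it.
      underlying.poll()
      underlying.offer(elem)
    }
    this
  }
}

// Example: keep the 3 largest of a stream of ints.
// val q = new SimpleBoundedPriorityQueue[Int](3)
// (1 to 10).foreach(q += _)
// q.toSeq.sorted.reverse   // Seq(10, 9, 8)
```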

@srowen
Member

srowen commented Mar 25, 2015

I think this is a WontFix then and this PR can be closed.

@mridulm
Contributor Author

mridulm commented Mar 25, 2015

(I have reopened the JIRA - and would strongly vote for pushing this in)
Looks like I was not getting notifications for this PR, so I could not participate in the discussion; sorry for the delay!

There are a few issues to be considered here:

a) I am perfectly fine with the classes going away or being modified in incompatible ways across future releases - which is why I was OK with them being Experimental and not DeveloperApi etc.
If these are not the right annotations for exposing them, then we should add one that does allow it. Writing high-performance code on top of Spark sometimes requires using its internal collection classes (see below), and advanced users must have the ability to do so with the corresponding (documented) expectation of instability.
Duplicating everything is highly suboptimal and creates forks whenever enhancements are made (either to the private copy or to the Spark version).

b) Relying on the generic collection API is not always an option - which is why Spark has CompactBuffer etc. to begin with. These classes are applicable to our use cases here as well. Given our scale of processing, even very small enhancements have a marked impact on GC, etc.

c) There is also the issue of shading - which is why Utils was exposed.
The bundled Guava version is shaded away with no way to access it, and it is suboptimal to include another copy of Guava just to get to functionality that had to be hand-coded because of a limitation in how Spark does some of its operations (top K was the specific example here; see the sketch at the end of this comment).

While I am all for exposing a reasonable API and minimizing the surface exposed to users to ensure API stability across versions, it should not come at the cost of power users needlessly having to fork or duplicate the code. That group already has the right expectations of the codebase.
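To make the top-K point concrete, here is a rough sketch of the idiom being referred to - a fixed-size heap per partition, with only the small per-partition results shipped back to the driver. The function name and details are mine for illustration, not code from this PR:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.SparkContext

// Hypothetical helper: exact top-k (k >= 1) over a dataset, keeping a k-sized
// heap per partition so memory stays bounded regardless of partition size,
// and returning only numPartitions * k values to the driver.
def topK(sc: SparkContext, data: Seq[Int], k: Int): Seq[Int] = {
  sc.parallelize(data)
    .mapPartitions { iter =>
      val heap = new java.util.PriorityQueue[Int](math.max(k, 1))
      iter.foreach { x =>
        if (heap.size < k) {
          heap.offer(x)
        } else if (x > heap.peek()) {
          heap.poll()          // drop the smallest of the k kept so far
          heap.offer(x)
        }
      }
      Iterator.single(heap.asScala.toArray)
    }
    .collect()
    .flatten
    .sorted(Ordering[Int].reverse)
    .take(k)
    .toSeq
}
```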

@srowen
Member

srowen commented Mar 25, 2015

I think a lot of that is already clear, and I read this as several vetoes, which is why it was closed. Sure, another round of discussion, but unless it changes everyone's views, I think that has to be reflected as a resolution rather than keeping this open as a wish, even if one person really wants it. There are lots of things lots of people really want Spark to do for them.

Why would you need something from Spark, but not mind if it were taken away?
The right place for this is an external library, which all consumers could depend on. There are already such libraries, and using them isn't at all constrained by what Spark does or doesn't do.
I don't understand the point about shading, since Guava was shaded exactly so you could bring your own version freely independent of what Spark has.
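As a concrete illustration of the "bring your own version" point, an application can depend on Guava directly and use its Ordering for top-K, independent of the shaded copy inside Spark. A sketch, assuming a Guava dependency is on the application classpath (not code from this PR):

```scala
import scala.collection.JavaConverters._
import com.google.common.collect.Ordering

// Take the 3 largest values using the application's own Guava,
// not anything reached through Spark's shaded copy.
val values: java.lang.Iterable[Integer] = Seq[Integer](5, 1, 9, 3, 7).asJava
val top3 = Ordering.natural[Integer]().greatestOf(values, 3)
println(top3)  // [9, 7, 5]
```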

@mridulm
Contributor Author

mridulm commented Mar 25, 2015

@srowen
This is not a "wish" - having led (and still leading) multiple efforts that make nontrivial use of Spark, I do think this is a required change: the ability to expose Spark internals so that power users can experiment with them, leading to subsequent enhancements to Spark later on. This specific PR is just one example of it, and the idiom is not limited to collections, by the way. This is how all of our changes that made it back to Spark started off - of course Spark was much more open when we started, with very little private[spark].

Marking things private[spark] when they are not volatile, are not core Spark internals, and do not touch Spark data structures and state simply shuts off access to a large body of good code. Not everything can or should make it back to core Spark - we should allow advanced users to play around with these pieces in their own experiments, to iterate and arrive at the best solution or enhancement - and this usually leads to good contributions back to Spark, without needing to build private Spark forks for everything.

To address the specifics in your comment:

a) Changes in Spark do not happen in isolation but in response to something else changing - for example a change in GC patterns, serde changes, bugs, etc. It is use-at-your-own-risk, with the worst case being having to duplicate the code when versions change.
Closing these off with private[spark] means there is no way to get to them at all, and everything has to be duplicated up front.

b) If the view is that Experimental is the wrong annotation and sets the wrong expectation, I am fine with changing that. We can add a new annotation to Spark and subsequently circle back to this PR to apply it.

c) Clarifying the Guava point is slightly involved, but the tl;dr is: 1) we do not use Guava in our own code, 2) we are having to use it to work around a limitation in Spark's top, and 3) we actually only need Utils, since it does what Spark (and we) need without using Guava directly.

Having said that, even though I am a PMC member, I have not been active on Spark for a while now and so might have missed how changes are managed nowadays. I believe it is extremely suboptimal to maintain private copies/forks for our use, for obvious reasons.

@srowen
Member

srowen commented Mar 25, 2015

I don't think the culture of deciding on changes is different here than in any other Apache-like project. I don't think it helps to declare this a "required" change, and I find overriding the apparent conclusion of, what, four other committers aggressive. I'm sure it wasn't quite intended that way, just sayin'. OK, it can't hurt to revisit this a little more, but it is also important that we try to converge quickly everywhere, and often the converged answer is 'not this change, not now.'

Forking is not a good answer to anything, yes. If Guava does what you want then use Guava, rather than digging it out of Spark's Utils. Why is a purpose-built utility library not an option? I also need low-memory, high-performance collections and there are plenty of libs for that already; I prefer Koloboke at the moment. What about those? Probably faster than Spark's utility in the general case.

I disagree that there's a new state of code in which it's important enough to be exposed publicly for wide reuse, but which nobody can depend on even existing tomorrow. That's why I don't think a new or existing annotation would help. You are actually depending on this stuff. I believe what you want is simply a way to collaborate on one upstream copy of these classes.

A number of classes or methods could be dusted off for reuse. If there were enough to reach a critical mass for spark-collections or spark-util (Spark Core Core), that would be the way to go. I don't see enough of a coherent chunk, or enough desire to maintain these for general use, versus just freely messing with them internally with only Spark's needs in mind.

@pwendell
Contributor

Yeah I agree with Sean and what pretty much everyone else said.

My feeling is that we don't want typical Spark users relying on unstable APIs, since that creates a lot of upgrade friction for users and/or pressure on maintainers to stabilize this stuff. We annotate things as unstable and expose them for two reasons: (a) to get feedback on new APIs we intend to stabilize shortly, and (b) to expose internal hooks meant for third-party integrations (similar to kernel modules), such as data sources, that are not expected to be needed by end users.

Exposing internal utilities to end users is, IMO, just not a good idea. I don't think slapping an annotation on something means there is no cost to doing this. Users will rely on this stuff and it will create upgrade friction, which is why we are careful about what we expose.

@mridulm
Contributor Author

mridulm commented Mar 27, 2015

Closing based on internal discussions

@mridulm mridulm closed this Mar 27, 2015