[spark] [SPARK-6168] Expose some of the collection classes as experimental #5084
Conversation
Test build #28831 has started for PR 5084 at commit
Test build #28831 has finished for PR 5084 at commit
Test PASSed.
@pwendell Can we merge this into 1.3 as well? Otherwise we will have to wait for 1.4 ...
Would we ever expect to expose these classes in a stable fashion? Exposing generic Scala collection functionality seems somewhat orthogonal to Spark's goals. Unless this kind of functionality is uniquely useful in the context of Spark?
These are not generic Scala collections, but classes specific to using Spark at scale.
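For context on what "specific to using Spark at scale" means in practice, here is a minimal, illustrative sketch (not Spark's actual CompactBuffer; the class name is made up) of the style of collection in question: an append-only buffer that keeps its first two elements in plain fields, so the very common tiny groups seen in group-by-key style aggregations never allocate a backing array at all.

```scala
// Illustrative only: a CompactBuffer-style append buffer, not Spark's code.
// The first two elements live in fields; a backing array is allocated only
// when a group grows past two elements, keeping per-key overhead and GC
// pressure low at scale.
class SmallBuffer[T] {
  private var first: T = _
  private var second: T = _
  private var rest: Array[AnyRef] = _
  private var curSize = 0

  def +=(value: T): this.type = {
    curSize match {
      case 0 => first = value
      case 1 => second = value
      case n =>
        if (rest == null) {
          rest = new Array[AnyRef](8)
        } else if (n - 2 == rest.length) {
          // Grow the backing array geometrically once it fills up.
          val grown = new Array[AnyRef](rest.length * 2)
          System.arraycopy(rest, 0, grown, 0, rest.length)
          rest = grown
        }
        rest(n - 2) = value.asInstanceOf[AnyRef]
    }
    curSize += 1
    this
  }

  def apply(i: Int): T = {
    if (i == 0) first
    else if (i == 1) second
    else rest(i - 2).asInstanceOf[T]
  }

  def size: Int = curSize
}
```

Spark's real CompactBuffer differs in the details, but the point is the same: this is performance plumbing tuned for Spark's workloads, not a general-purpose collection.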
I think
I agree with Sean and Sandy here. I don't think we should just expose internal utilities like this. The collection classes are simple enough that someone can just copy our code if they want to use it. The reason we have @Experimental and @DeveloperApi is to provide public APIs we expect to eventually stabilize or that we want feedback on from the community. They still come with a reasonable expectation of stability and an implicit endorsement that we expect people to use them. [EDIT - an earlier version mistakenly thought this exposed our entire Utils object, not the collections utils]
(Oh I also didn't see that this is
(Edit: I also thought it was the
I'm also in favor of not exposing this and having users copy these classes themselves, which should be pretty easy since these files are more-or-less self-contained.
I think this is a WontFix then and this PR can be closed.
(I have reopened the JIRA and would strongly vote for pushing this in.) There are a few issues to consider here:
a) I am perfectly fine with the classes going away or being modified in incompatible ways across future releases, which is why I was OK with them being Experimental and not DeveloperApi, etc.
b) Relying on the generic collection API is not always an option, which is why Spark has CompactBuffer, etc. to begin with, and those reasons apply to our use cases here as well. Given our scale of processing, even very small enhancements have a marked impact on GC, etc.
c) There is also the issue of shading, which is why Utils was exposed.
While I am all for exposing a reasonable API and minimizing the surface exposed to users to ensure API stability across versions, it should not come at the cost of power users having to needlessly fork or duplicate the code. That category of user already has the right expectations of the codebase.
I think a lot of that is already clear, and I read this as several vetoes, which is why it was closed. Sure, another round of discussion, but unless that changes everyone's views, I think that has to be reflected as a resolution rather than keeping this open as a wish, even if one person really wants it. There are lots of things lots of people really want Spark to do for them. Why would you need something from Spark, but not mind if it were taken away?
@srowen Making things private[spark] when they are not volatile, are not core Spark internals, and do not touch Spark data structures and state simply shuts off access to a large body of good code. Not everything can or should make it back into core Spark; we should allow advanced users to play around with these classes in their own experiments, to iterate and arrive at the best solution or enhancement, which usually leads to good contributions back to Spark without needing to build private Spark forks for everything.
To address the specifics in your comment:
a) Changes in Spark do not happen in isolation, but because something else changed, for example a change in GC patterns, serde changes, bugs, etc. It is use-at-your-own-risk, with the worst case being having to duplicate the code when versions change.
b) If the intent is that Experimental is the wrong annotation classification and sets the wrong expectation, I am fine with changing that. We can add a new classification to Spark's annotations and subsequently circle back to this PR to change it appropriately.
c) Clarifying the Guava point is slightly involved, but the tl;dr is: 1) we do not use Guava in our code, 2) we are having to use it to work around a limitation in Spark's top, and 3) we actually only need Utils, since it does what Spark (and we) need without using Guava.
Having said that, even though I am a PMC member, I have not been active on Spark for a while now and so might have missed how changes are managed nowadays. I believe it is extremely suboptimal for us to maintain private copies/forks, for obvious reasons.
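To make point (c) above concrete: the collection utility in question is essentially a bounded top-K. The sketch below is hypothetical (the object and method names are not Spark's API) and assumes only the Scala standard library; it shows the kind of small helper a user would otherwise have to copy or re-implement to avoid depending on Guava directly or on Spark-internal classes.

```scala
import scala.collection.mutable

// Hypothetical sketch, not Spark's API: a bounded top-K that keeps only
// k elements in a heap instead of sorting or materializing the full input.
object TopKSketch {
  /** Returns the k smallest elements of `input` according to `ord`, ascending. */
  def takeOrdered[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
    // Max-heap over the retained elements: the current "worst" element sits at
    // the head and can be evicted in O(log k) when a better element arrives.
    val heap = mutable.PriorityQueue.empty[T](ord)
    input.foreach { x =>
      if (heap.size < k) {
        heap.enqueue(x)
      } else if (k > 0 && ord.lt(x, heap.head)) {
        heap.dequeue()
        heap.enqueue(x)
      }
    }
    // Drain the heap (largest first) and reverse to get ascending order.
    val out = mutable.ArrayBuffer.empty[T]
    while (heap.nonEmpty) out += heap.dequeue()
    out.reverse.toSeq
  }
}
```

Something along these lines already exists inside Spark; the debate here is whether users should be able to call it directly or should maintain their own copy.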
I don't think the culture of deciding on changes is different here than in any other Apache-like project. I don't think it helps to declare this a "required" change, and I find overriding the apparent conclusion of, what, four other committers aggressive. I'm sure it wasn't quite intended that way, just sayin'. OK, it can't hurt to revisit a little more, but it's also important that everywhere we try to rapidly converge, and often the converged answer is "not this change, not now."
Forking is not a good answer to anything, yes. If Guava does what you want then use Guava, rather than digging it out of Spark's
I disagree that there's a new state of code in which it's important enough to be exposed publicly for wide reuse, but which nobody can depend on even existing tomorrow. That's why I don't think a new or existing annotation would help. You are actually depending on this stuff. I believe what you want is simply a way to collaborate on one upstream copy of these classes. A number of classes or methods could be dusted off for reuse. If there were enough to reach a critical mass for
Yeah, I agree with Sean and what pretty much everyone else said. My feeling is that we don't want typical Spark users relying on unstable APIs, or else it creates a lot of upgrade friction for users and/or pressure on maintainers to stabilize this stuff. We annotate things as unstable and expose them for two reasons: (a) to get feedback on new APIs we intend to stabilize shortly, and (b) to expose internal hooks meant for third-party integrations (similar to kernel modules), such as data sources, that are not expected to be needed by end users. Exposing internal utilities to end users is, IMO, just not a good idea. I don't think slapping an annotation on something means there is no cost to doing this. Users will rely on this stuff and it will create upgrade friction, which is why we are careful about what we expose.
Closing based on internal discussions.
Currently these are private[spark]
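For readers unfamiliar with the mechanics being debated, the sketch below (class name hypothetical, not taken from the actual diff) shows the shape of the change the PR proposes, assuming Spark's org.apache.spark.annotation.Experimental annotation: package-private visibility is replaced by a public but explicitly unstable API.

```scala
package org.apache.spark.util.collection

import org.apache.spark.annotation.Experimental

// Hypothetical class, used only to illustrate the visibility change.
// Before the proposed change, the declaration is package-private:
//
//   private[spark] class ExampleBuffer[T] { ... }
//
// which makes it invisible outside the org.apache.spark package tree.
// The PR's approach is to drop the modifier and mark the class as
// experimental instead, signalling that it may change or disappear
// in any release without deprecation:
@Experimental
class ExampleBuffer[T] {
  // ... implementation unchanged ...
}
```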