
Spark 1271 (1320) cogroup and groupby should pass iterator[x] #242

Conversation

holdenk
Contributor

@holdenk holdenk commented Mar 26, 2014

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13484/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13492/

@holdenk
Contributor Author

holdenk commented Mar 27, 2014

Jenkins, retest this please


@mridulm
Contributor

mridulm commented Mar 27, 2014

What is the immediate need for this change?
I'm not disagreeing with it, and it would definitely be needed for disk-backed data, etc., but what is the current need?

@aarondav
Contributor

We are trying to cement the API for 1.0. This will allow us to provide an actual iterator-based implementation later that does not materialize all the values for a single group-by key at once, for instance. With our current Seq-based solution, we have basically no choice but to OOM now and for any future implementation if the values are too large to fit in memory.
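The distinction aarondav is drawing can be sketched in plain Python (hypothetical illustration, not Spark code): a Seq-style group-by must hold every value for a key in memory at once, while an iterator-style API leaves room for an implementation that streams each group.

```python
from itertools import groupby
from operator import itemgetter

pairs = [("a", 1), ("b", 3), ("a", 2)]

def group_eagerly(pairs):
    # Seq-style: all values for each key are materialized in memory at once,
    # so a single huge group can OOM regardless of implementation.
    out = {}
    for k, v in pairs:
        out.setdefault(k, []).append(v)
    return out

def group_lazily(sorted_pairs):
    # Iterator-style: assuming the input is sorted by key, each group can be
    # streamed one element at a time without materializing it.
    for k, grp in groupby(sorted_pairs, key=itemgetter(0)):
        yield k, (v for _, v in grp)

eager = group_eagerly(pairs)
lazy = {k: list(vs) for k, vs in group_lazily(sorted(pairs))}
```

Both produce the same groups here; the point is that only the iterator-shaped API *permits* a later implementation that never builds the full list.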

@mridulm
Contributor

mridulm commented Mar 27, 2014

Fair enough; I just wanted to ensure it was being done in preparation for a disk-backed iterator (or at least that such an iterator could be implemented using this).
Actually, instead of Iterator, it would be good if we exposed a restartable iterator; that allows for interesting use cases.

Btw, this is an incompatible API change, right? Any attempt to preserve the existing behavior (with a deprecation warning)?

@holdenk
Contributor Author

holdenk commented Mar 27, 2014

It would be difficult to preserve the old return type since the call signatures are the same. It's probably worth calling out in the release notes: I had to change some code that compiled fine against the new API but failed at run time (mostly because in Scala the iterator type has a size method which works, but consumes all of the elements).
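The run-time pitfall holdenk describes has a direct pure-Python analogue (illustrative only): counting an iterator's elements succeeds, but drains it, so any later pass over the same data silently sees nothing.

```python
values = iter([1, 2, 3])

# Counting works, just like Scala's Iterator.size...
count = sum(1 for _ in values)

# ...but the iterator is now exhausted, so reusing it yields no data.
leftover = list(values)
```

This is exactly the class of bug that compiles (or parses) fine and only shows up at run time.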

@AmplabJenkins

Merged build triggered. One or more automated tests failed

@AmplabJenkins

Merged build started. One or more automated tests failed

@pwendell
Contributor

@mridulm I think type erasure will make it difficult to preserve compatibility.

@holdenk holdenk changed the title [WIP] Spark 1271 (1320) cogroup and groupby should pass iterator[x] Spark 1271 (1320) cogroup and groupby should pass iterator[x] Mar 27, 2014
@holdenk holdenk changed the title Spark 1271 (1320) cogroup and groupby should pass iterator[x] [WIP] Spark 1271 (1320) cogroup and groupby should pass iterator[x] Mar 27, 2014
@holdenk
Contributor Author

holdenk commented Mar 27, 2014

Do we also want these changes for the python API?

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13529/

@mridulm
Contributor

mridulm commented Mar 28, 2014

@pwendell agreed; I was not referring to method overloading.
There is a lot of existing code that will require a rewrite if the interface changes (how significant, I am not sure, but the cost would be the same as with Seq today, unfortunately).
I am thinking about how to minimize the cost of migration.

@pwendell
Contributor

@mridulm if not method overloading, do you have a specific proposal in mind? Downstream consumers will have to add toSeq, which is a bit unfortunate (and in Java it will be clunkier).
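The toSeq migration pwendell mentions amounts to one small change at each call site. A hypothetical Python analogue (list() playing the role of Scala's toSeq):

```python
def first_plus_last(values):
    # Old downstream code, written against the Seq API:
    # it indexes and re-reads the values freely.
    return values[0] + values[-1]

grouped = iter([10, 20, 30])   # the new API hands back an iterator
materialized = list(grouped)   # one-line fix, analogue of Scala's toSeq
result = first_plus_last(materialized)
```

The cost is a single explicit materialization per call site, paid only by callers that actually need sequence semantics; callers that just loop once need no change.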

@AmplabJenkins

Can one of the admins verify this patch?

@holdenk
Contributor Author

holdenk commented Mar 28, 2014

So I made the python API match as well.

@holdenk
Contributor Author

holdenk commented Mar 28, 2014

@mridulm Did you want something like a legacyGroupByKey method that keeps the old behaviour?

@AmplabJenkins

Merged build triggered. One or more automated tests failed.

@pwendell
Contributor

I think moving your function to legacyGroupByKey is more annoying than writing toSeq :)

@AmplabJenkins

Merged build started. One or more automated tests failed.

@holdenk
Contributor Author

holdenk commented Mar 28, 2014

@pwendell True, but it's also probably easier to do quickly if you have a lot of code, and the Java users can't use toSeq. That said, I'd rather not add the legacy functions unless there is a strong use case for them.

@pwendell
Contributor

pwendell commented Apr 8, 2014

@holdenk this needs to be up-merged to master

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13910/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13911/

@pwendell
Contributor

pwendell commented Apr 9, 2014

Thanks @holdenk - merged!

@asfgit asfgit closed this in ce8ec54 Apr 9, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Author: Holden Karau <holden@pigscanfly.ca>

Closes apache#242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:

f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
77048f8 [Holden Karau] Fix merge up to master
d3fe909 [Holden Karau] use toSeq instead
7a092a3 [Holden Karau] switch resultitr to resultiterable
eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
c5075aa [Holden Karau] If guava 14 had iterables
2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
11e730c [Holden Karau] Fix streaming tests
66b583d [Holden Karau] Fix the core test suite to compile
4ed579b [Holden Karau] Refactor from iterator to iterable
d052c07 [Holden Karau] Python tests now pass with iterator pandas
3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
cd1e81c [Holden Karau] Try and make pickling list iterators work
c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
a5ee714 [Holden Karau] oops, was checking wrong iterator
e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
ec8cc3e [Holden Karau] Fix test issues!
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
b692868 [Holden Karau] Revert
7e533f7 [Holden Karau] Fix the bug
8a5153a [Holden Karau] Revert me, but we have some stuff to debug
b4e86a9 [Holden Karau] Add a join based on the problem in SVD
c4510e2 [Holden Karau] Revert this but for now put things in list pandas
b4e0b1d [Holden Karau] Fix style issues
71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
37888ec [Holden Karau] core/tests now pass
249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
fe992fe [Holden Karau] hmmm try and fix up basic operation suite
172705c [Holden Karau] Fix Java API suite
caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
4991af6 [Holden Karau] Fix some tests
be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
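As the squash log shows, the design ultimately moved from Iterator to Iterable (commit f289536), and on the Python side a dedicated re-iterable, picklable wrapper was needed (the "resultitr to resultiterable" commit). A minimal hypothetical sketch of that idea, with all names and details assumed rather than taken from the actual PySpark implementation:

```python
class ResultIterableSketch:
    """Hypothetical sketch of a re-iterable result wrapper.

    Unlike a bare iterator, it can be traversed more than once, and because
    the data is held in a plain list it pickles without trouble.
    """

    def __init__(self, data):
        self.data = list(data)  # materialize once from any iterable

    def __iter__(self):
        return iter(self.data)  # each call starts a fresh pass

    def __len__(self):
        return len(self.data)   # len() does not consume anything

r = ResultIterableSketch(iter([1, 2, 3]))
first_pass = list(r)
second_pass = list(r)  # a second pass still sees the data
```

This trades away the streaming benefit (the values are materialized) in exchange for seq-like ergonomics, which is consistent with the PR's final Iterable-based API.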
davies pushed a commit to davies/spark that referenced this pull request Apr 14, 2015
[SPARKR-92] Phase 2: implement sum(rdd)
asfgit pushed a commit that referenced this pull request Apr 17, 2015
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
….sql

## What changes were proposed in this pull request?

This patch moves all Vacuum-related code from `org.apache.spark` to `com.databricks.sql` as part of the general task of clearly separating Edge issues in order to reduce merge conflicts with OSS.

`AclCommandParser` is renamed to a more general `DatabricksSqlParser`, to be used for all DB-specific syntax and is moved to a new package called `com.databricks.sql.parser`.

`VacuumTableCommand` is moved from `org.apache.spark.sql.execution.command` to `com.databricks.sql.transaction`.

## How was this patch tested?

Tests in project `spark-sql` pass.

Author: Adrian Ionescu <adrian@databricks.com>

Closes apache#242 from adrian-ionescu/SC-5840.
liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
The problem with PR apache#242 was that it renamed, but didn't completely decouple `DatabricksSqlParser` from Acls. As such, the Vacuum command was only recognized if Acl support was enabled (via `spark.session.extensions = AclExtensions` and `spark.databricks.acl.enabled = true`)

## What changes were proposed in this pull request?

- extract out all Acl-related commands from `DatabricksSqlCommandBuilder` into `AclCommandBuilder`
  - separate related test suites accordingly
- make Acl client optional for `DatabricksSqlParser`
- create new `DatabricksExtensions` class that injects `DatabricksSqlParser` without Acl support
- apply `DatabricksExtensions` by default

## How was this patch tested?

Ran all tests in `spark-sql`
Manually tested `Vacuum` from `sparkShell` and `sparkShellAcl`

Author: Adrian Ionescu <adrian@databricks.com>

Closes apache#248 from adrian-ionescu/db-parser.
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
* MapR [SPARK-194] Redirect to Spark History server
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019