
Spark 1271 (1320) cogroup and groupby should pass iterator[x] #242

Conversation

holdenk
Contributor

@holdenk holdenk commented Mar 26, 2014

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13484/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13492/

@holdenk
Contributor Author

holdenk commented Mar 27, 2014

Jenkins, retest this please


@mridulm
Contributor

mridulm commented Mar 27, 2014

What is the immediate need for this change?
I'm not disagreeing with it, and it would definitely be needed for disk-backed data, etc., but what is the current need?

@aarondav
Contributor

We are trying to cement the API for 1.0. This will allow us to provide an actual iterator-based implementation later that does not materialize all the values for a single group-by key at once, for instance. With our current Seq-based solution, we have basically no choice but to OOM now and for any future implementation if the values are too large to fit in memory.
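The distinction aarondav is drawing can be sketched in plain Python (hypothetical illustration, not Spark code): a Seq-style group-by must hold every value for a key in memory at once, while an iterator-style API leaves room for an implementation that streams each group.

```python
from itertools import groupby
from operator import itemgetter

pairs = [("a", 1), ("b", 3), ("a", 2)]

def group_eagerly(pairs):
    # Seq-style: all values for each key are materialized in memory at once,
    # so a single huge group can OOM regardless of implementation.
    out = {}
    for k, v in pairs:
        out.setdefault(k, []).append(v)
    return out

def group_lazily(sorted_pairs):
    # Iterator-style: assuming the input is sorted by key, each group can be
    # streamed one element at a time without materializing it.
    for k, grp in groupby(sorted_pairs, key=itemgetter(0)):
        yield k, (v for _, v in grp)

eager = group_eagerly(pairs)
lazy = {k: list(vs) for k, vs in group_lazily(sorted(pairs))}
```

Both produce the same groups here; the point is that only the iterator-shaped API *permits* a later implementation that never builds the full list.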

@mridulm
Contributor

mridulm commented Mar 27, 2014

Fair enough; I just wanted to ensure it was being done in preparation for a disk-backed iterator (or at least that such an iterator could be implemented using this).
Actually, instead of Iterator, it would be good if we exposed a restartable iterator; that allows for interesting use cases.

Btw, this is an incompatible API change, right? Any attempt to preserve the existing behavior (with a deprecation warning)?

@holdenk
Contributor Author

holdenk commented Mar 27, 2014

It would be difficult to preserve the old return type since the call signatures are the same. It's probably worth calling out in the release notes: I had to change some code that compiled fine against the new API but failed at run time (mostly because in Scala the iterator type has a size method which works, but consumes all of the elements).
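The run-time pitfall holdenk describes has a direct pure-Python analogue (illustrative only): counting an iterator's elements succeeds, but drains it, so any later pass over the same data silently sees nothing.

```python
values = iter([1, 2, 3])

# Counting works, just like Scala's Iterator.size...
count = sum(1 for _ in values)

# ...but the iterator is now exhausted, so reusing it yields no data.
leftover = list(values)
```

This is exactly the class of bug that compiles (or parses) fine and only shows up at run time.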

@AmplabJenkins

Merged build triggered. One or more automated tests failed

@AmplabJenkins

Merged build started. One or more automated tests failed

@pwendell
Contributor

@mridulm I think type erasure will make it difficult to preserve compatibility.

@holdenk holdenk changed the title [WIP] Spark 1271 (1320) cogroup and groupby should pass iterator[x] Spark 1271 (1320) cogroup and groupby should pass iterator[x] Mar 27, 2014
@holdenk holdenk changed the title Spark 1271 (1320) cogroup and groupby should pass iterator[x] [WIP] Spark 1271 (1320) cogroup and groupby should pass iterator[x] Mar 27, 2014
@holdenk
Contributor Author

holdenk commented Mar 27, 2014

Do we also want these changes for the python API?

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13529/

@mridulm
Contributor

mridulm commented Mar 28, 2014

@pwendell agreed; I was not referring to method overloading.
There is a lot of existing code that will require a rewrite if the interface changes (how significant, I am not sure, but the cost would be the same as with Seq today, unfortunately).
I am thinking about how to minimize the cost of migration.

@pwendell
Contributor

@mridulm if not method overloading, do you have a specific proposal in mind? Downstream consumers will have to add toSeq, which is a bit unfortunate (and in Java it will be clunkier).
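The toSeq migration pwendell mentions amounts to one small change at each call site. A hypothetical Python analogue (list() playing the role of Scala's toSeq):

```python
def first_plus_last(values):
    # Old downstream code, written against the Seq API:
    # it indexes and re-reads the values freely.
    return values[0] + values[-1]

grouped = iter([10, 20, 30])   # the new API hands back an iterator
materialized = list(grouped)   # one-line fix, analogue of Scala's toSeq
result = first_plus_last(materialized)
```

The cost is a single explicit materialization per call site, paid only by callers that actually need sequence semantics; callers that just loop once need no change.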

@AmplabJenkins

Can one of the admins verify this patch?

@holdenk
Contributor Author

holdenk commented Mar 28, 2014

So I made the python API match as well.

@holdenk
Contributor Author

holdenk commented Mar 28, 2014

@mridulm Did you want something like a legacyGroupByKey method that keeps the old behaviour?

@AmplabJenkins

Merged build triggered. One or more automated tests failed.

@pwendell
Contributor

I think moving your function to legacyGroupByKey is more annoying than writing toSeq :)

@AmplabJenkins

Merged build started. One or more automated tests failed.

@holdenk
Contributor Author

holdenk commented Mar 28, 2014

@pwendell True, but it's also probably easier to do quickly if you have a lot of code, and the Java users can't use toSeq. That said, I'd rather not add the legacy functions unless there is a strong use case for them.

@pwendell
Contributor

pwendell commented Apr 8, 2014

@holdenk this needs to be up-merged to master

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13910/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13911/

@pwendell
Contributor

pwendell commented Apr 9, 2014

Thanks @holdenk - merged!

@asfgit asfgit closed this in ce8ec54 Apr 9, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Author: Holden Karau <holden@pigscanfly.ca>

Closes apache#242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:

f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
77048f8 [Holden Karau] Fix merge up to master
d3fe909 [Holden Karau] use toSeq instead
7a092a3 [Holden Karau] switch resultitr to resultiterable
eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
c5075aa [Holden Karau] If guava 14 had iterables
2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
11e730c [Holden Karau] Fix streaming tests
66b583d [Holden Karau] Fix the core test suite to compile
4ed579b [Holden Karau] Refactor from iterator to iterable
d052c07 [Holden Karau] Python tests now pass with iterator pandas
3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
cd1e81c [Holden Karau] Try and make pickling list iterators work
c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
a5ee714 [Holden Karau] oops, was checking wrong iterator
e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
ec8cc3e [Holden Karau] Fix test issues!
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
b692868 [Holden Karau] Revert
7e533f7 [Holden Karau] Fix the bug
8a5153a [Holden Karau] Revert me, but we have some stuff to debug
b4e86a9 [Holden Karau] Add a join based on the problem in SVD
c4510e2 [Holden Karau] Revert this but for now put things in list pandas
b4e0b1d [Holden Karau] Fix style issues
71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
37888ec [Holden Karau] core/tests now pass
249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
fe992fe [Holden Karau] hmmm try and fix up basic operation suite
172705c [Holden Karau] Fix Java API suite
caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
4991af6 [Holden Karau] Fix some tests
be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
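As the squash log shows, the design ultimately moved from Iterator to Iterable (commit f289536), and on the Python side a dedicated re-iterable, picklable wrapper was needed (the "resultitr to resultiterable" commit). A minimal hypothetical sketch of that idea, with all names and details assumed rather than taken from the actual PySpark implementation:

```python
class ResultIterableSketch:
    """Hypothetical sketch of a re-iterable result wrapper.

    Unlike a bare iterator, it can be traversed more than once, and because
    the data is held in a plain list it pickles without trouble.
    """

    def __init__(self, data):
        self.data = list(data)  # materialize once from any iterable

    def __iter__(self):
        return iter(self.data)  # each call starts a fresh pass

    def __len__(self):
        return len(self.data)   # len() does not consume anything

r = ResultIterableSketch(iter([1, 2, 3]))
first_pass = list(r)
second_pass = list(r)  # a second pass still sees the data
```

This trades away the streaming benefit (the values are materialized) in exchange for seq-like ergonomics, which is consistent with the PR's final Iterable-based API.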
davies pushed a commit to davies/spark that referenced this pull request Apr 14, 2015
[SPARKR-92] Phase 2: implement sum(rdd)
asfgit pushed a commit that referenced this pull request Apr 17, 2015
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
….sql

## What changes were proposed in this pull request?

This patch moves all Vacuum-related code from `org.apache.spark` to `com.databricks.sql` as part of the general task of clearly separating Edge issues in order to reduce merge conflicts with OSS.

`AclCommandParser` is renamed to a more general `DatabricksSqlParser`, to be used for all DB-specific syntax and is moved to a new package called `com.databricks.sql.parser`.

`VacuumTableCommand` is moved from `org.apache.spark.sql.execution.command` to `com.databricks.sql.transaction`.

## How was this patch tested?

Tests in project `spark-sql` pass.

Author: Adrian Ionescu <adrian@databricks.com>

Closes apache#242 from adrian-ionescu/SC-5840.
liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
The problem with PR apache#242 was that it renamed, but didn't completely decouple `DatabricksSqlParser` from Acls. As such, the Vacuum command was only recognized if Acl support was enabled (via `spark.session.extensions = AclExtensions` and `spark.databricks.acl.enabled = true`)

## What changes were proposed in this pull request?

- extract out all Acl-related commands from `DatabricksSqlCommandBuilder` into `AclCommandBuilder`
  - separate related test suites accordingly
- make Acl client optional for `DatabricksSqlParser`
- create new `DatabricksExtensions` class that injects `DatabricksSqlParser` without Acl support
- apply `DatabricksExtensions` by default

## How was this patch tested?

Ran all tests in `spark-sql`
Manually tested `Vacuum` from `sparkShell` and `sparkShellAcl`

Author: Adrian Ionescu <adrian@databricks.com>

Closes apache#248 from adrian-ionescu/db-parser.
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
* MapR [SPARK-194] Redirect to Spark History server
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019