Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20371][R] Add wrappers for collect_list and collect_set #17672

Closed
wants to merge 1 commit into from

Conversation

zero323
Copy link
Member

@zero323 zero323 commented Apr 18, 2017

What changes were proposed in this pull request?

Adds wrappers for collect_list and collect_set.

How was this patch tested?

Unit tests, check-cran.sh

@SparkQA
Copy link

SparkQA commented Apr 18, 2017

Test build #75906 has finished for PR 17672 at commit 3b580ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 18, 2017

Test build #75909 has finished for PR 17672 at commit e5bcb67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 18, 2017

Test build #75911 has finished for PR 17672 at commit a02a119.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Copy link
Member Author

zero323 commented Apr 18, 2017

cc @felixcheung


#' collect_list
#'
#' Aggregate function: returns a list of objects with duplicates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does other function has Aggregate function: in the description or this is carried over from Scala doc?
if latter we could go without - there's already a @family tag

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Scala docs. Removed.

#'
#' @rdname collect_list
#' @name collect_list
#' @family aggregate_functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an existing agg_funcs in R

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, corrected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: my other comments, my preference would be full text without underscore like @family aggregate functions. This isn't a key or id and it shows up in generated doc text. Anyway, that's a bigger change all around.

@@ -918,6 +918,14 @@ setGeneric("cbrt", function(x) { standardGeneric("cbrt") })
#' @export
setGeneric("ceil", function(x) { standardGeneric("ceil") })

#' @rdname collect_list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are under ###################### Expression Function Methods ##########################
which doesn't seem like the right group

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue here #17674 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it's good.

agg(gd3, collect_set(df8$age), collect_list(df8$age))
)

testthat::expect_equal(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why testthat::?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a habit. Corrected.

@SparkQA
Copy link

SparkQA commented Apr 19, 2017

Test build #75940 has finished for PR 17672 at commit fdc0395.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' @rdname collect_list
#' @name collect_list
#' @family agg_func
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it has an 's' though agg_funcs - not to nit pick but it needs to match exactly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad.

On a side note, do you think we should provide detailed examples for each function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really should - it's a long standing concern that something like

@examples \dontrun{collect_list(df$x)}

is not useful at all. At least we should include how this goes into a select statement, and in some cases what the output looks like. However, with the number of sql functions we have it's a bit of ongoing work to improve this and we definitely can use more help on that

@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #75982 has finished for PR 17672 at commit a089495.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #75988 has finished for PR 17672 at commit 944f5b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

felixcheung commented Apr 20, 2017

btw, fyi, you don't have to rebase or squash each time you push - it's actually easier to review if you don't - so reviewer can track comments and see the diff from the last time.

but I get that you might be rebasing because of conflict.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zero323
Copy link
Member Author

zero323 commented Apr 20, 2017

btw, fyi, you don't have to rebase or squash each time you push

Noted :) But yeah, there was a conflict after SPARK-20375.

At least we should include how this goes into a select statement, and in some cases what the output looks like.

It would be great.

@felixcheung
Copy link
Member

merged to master.

@asfgit asfgit closed this in fd648bf Apr 21, 2017
@zero323
Copy link
Member Author

zero323 commented Apr 21, 2017

Thanks @felixcheung

@zero323
Copy link
Member Author

zero323 commented Apr 22, 2017

BTW @felixcheung - is there any deeper reason behind current stat of generics.R? I mean:

  • Inconsistent usage of standard and roxygen comments.
  • Marking functions which are not to be exported with @export.
  • Slightly mixed up order (both in groups and between groups).
  • Some minor inconsistencies (like marking asc as @rdname columnfunctions).

@felixcheung
Copy link
Member

felixcheung commented Apr 22, 2017 via email

@zero323
Copy link
Member Author

zero323 commented Apr 22, 2017

Yeah, I have this feeling that it could be deliberate, but I cannot figure out what is the purpose. Removing @exports should be enough, shouldn't it?

I thought about cleaning this up, but I wonder if it is better to wait for SPARK-16693.

@felixcheung
Copy link
Member

felixcheung commented Apr 22, 2017 via email

@zero323 zero323 deleted the SPARK-20371 branch April 26, 2017 13:18
@zero323
Copy link
Member Author

zero323 commented May 15, 2017

@felixcheung Do you know by any chance what is the policy about adding new datasets to Spark? License restrictions, file size and such?

@felixcheung
Copy link
Member

@zero323 I think that its license needs to be compatible with Apache 2.0 and it can't be big (since example data is in the release; no more than a few MB?) https://www.apache.org/licenses/

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

Adds wrappers for `collect_list` and `collect_set`.

## How was this patch tested?

Unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes apache#17672 from zero323/SPARK-20371.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants