[SPARK-20532][SPARKR] Implement grouping and grouping_id #17807

zero323 · 2017-04-29T05:12:52Z

What changes were proposed in this pull request?

Adds R wrappers for:

o.a.s.sql.functions.grouping as o.a.s.sql.functions.is_grouping (to avoid shadowing base::grouping)
o.a.s.sql.functions.grouping_id

How was this patch tested?

Existing unit tests, additional unit tests. check-cran.sh.

SparkQA · 2017-04-29T05:17:38Z

Test build #76294 has finished for PR 17807 at commit 4f13b61.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-29T06:12:35Z

Test build #76296 has started for PR 17807 at commit 10fdaa5.

zero323 · 2017-04-29T07:36:01Z

Jenkins retest this please.

SparkQA · 2017-04-29T08:10:35Z

Test build #76299 has finished for PR 17807 at commit 10fdaa5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-04-29T17:34:59Z

R/pkg/NAMESPACE

please sort this. you might have forgotten to update this after changing the name

felixcheung · 2017-04-29T17:35:41Z

R/pkg/R/functions.R

is it not more intuitive if it is TRUE/FALSE? I guess this is the behavior in Scala?

As far as I remember, this is actually the SQL:1999 standard, and it relates directly to the binary encoding used for the grouping id.

Maybe is_grouping is not the most fortunate choice of name, but I don't have a better idea. Theoretically we can use just grouping and let users worry about disambiguation.

Or maybe grouping_col?

Let's go with grouping_col. It doesn't suggest boolean type, and doesn't conflict with built-ins.

felixcheung · 2017-04-29T17:37:21Z

R/pkg/R/functions.R

I'd simplify this: since this is in @family agg_funcs, the generated doc already has links to most of these. Unless there is a very direct relation, say to grouping_id, I don't think we need extra links here

... because it gets very messy (which is something we need to clean up actually)

If you think it is better... Though @family based lists grow pretty fast and it is hard to find anything useful there.

which is something we need to clean up actually

There are like four functions left for full parity so we can try to figure this out after that. I would like to take a look at the overall structure as well. There is a lot of boilerplate there.

I'm OK either way. We have a pending item to clean up @family, though: we are seeing duplicated entries (which is why it's so long) because of the use of @aliases to keep the CRAN check happy

felixcheung · 2017-04-29T17:38:00Z

R/pkg/R/generics.R

felixcheung · 2017-04-29T17:38:11Z

R/pkg/R/functions.R

ditto links

felixcheung · 2017-04-29T17:38:38Z

R/pkg/R/functions.R

minor: add (optional)?

Technically speaking it is true, but grouping_id with single column is just grouping :)

felixcheung · 2017-04-29T17:39:30Z

R/pkg/R/functions.R

does << get handled properly?

also perhaps it's more appropriate to use the R-equivalent operator instead?

Actually I am not aware of any base implementation. Is there one?

We could use LaTeX

#' Equals to \eqn{c_1 2^{n - 1} + c_2 2^{n - 2} + ... + c_n}

but it is not rendered in HTML.

or bitwShiftL? https://stat.ethz.ch/R-manual/R-devel/library/base/html/bitwise.html
it's not as readable, I agree, but it's syntactically correct R code.

If we prefer code then this should be exactly what we need.

grouping_col(c1) * 2^(n - 1) + grouping_col(c2) * 2^(n - 2) + ... + grouping_col(cn)

It is valid R / SparkR code, and arguably pretty clear.
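As a quick sanity check of the formula above, the arithmetic can be run in plain R without a Spark session. This is only a sketch: `grouping_id_from_bits` is a hypothetical helper, not part of SparkR, and the 0/1 vector stands in for the per-column grouping indicators (1 when the column is aggregated away, 0 when it is used for grouping).

```r
# Hypothetical helper (not a SparkR function): combines per-column
# indicators c1..cn into c1 * 2^(n - 1) + c2 * 2^(n - 2) + ... + cn.
grouping_id_from_bits <- function(bits) {
  n <- length(bits)
  # Exponents run from n - 1 for the first column down to 0 for the last.
  exponents <- rev(seq_len(n)) - 1
  sum(bits * 2^exponents)
}

grouping_id_from_bits(c(1, 0, 1))  # 1 * 4 + 0 * 2 + 1 * 1 = 5
grouping_id_from_bits(c(0, 0))     # both columns used for grouping: 0
grouping_id_from_bits(c(1))        # single column: the indicator itself
```

The single-column case illustrates the earlier remark that grouping_id over one column reduces to the per-column indicator.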

SparkQA · 2017-04-29T19:36:04Z

Test build #76304 has finished for PR 17807 at commit 93e1c10.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-04-29T21:14:13Z

Didn't we add something_col recently? I worry that the meaning diverges from the earlier case. Also, if grouping_id(a$col) == grouping(a$col), maybe we just need one, and grouping_id is good enough?

zero323 · 2017-04-30T00:48:43Z

Didn't we add something_col recently?

I don't think we did. There was itemsCol in fpGrowth, but it is not relevant here. git grep "_col" | grep "\.R" shows only a bunch of _colors, _collections and _collected used for tests and examples.

is_grouping is probably not such a fortunate choice, because if a column is used for grouping, it is set to 0.

Also if grouping_id(a$col) == grouping(a$col) maybe we just need one, grouping_id is good enough?

This is certainly a good point, but we won't get away with it. I don't know if it is standard or not, but the columns in grouping_id(...) have to match those in groupBy(df, ...).

Luckily we don't have to worry about group_id :)

felixcheung · 2017-04-30T05:39:31Z

right, there's a bunch of *Col in ml. I think people would associate col with a reference to a/another column in the dataframe

I do see your point about is_grouping, and also that grouping_id is not a substitute. It is unfortunate that R's grouping conflicts with this. I'm not sure I have a good name suggestion. On occasion we have worked around issues like this: since base::grouping is an S3 method, we could add a generic etc. to route the call and keep it compatible. Do you think it is worthwhile in this case, though?

zero323 · 2017-04-30T06:42:39Z

What happens if we just mask base::grouping? It is just a gut feeling but I don't think it is the most widely used function ever. I doubt I ever used it in my own code, and cannot recall seeing it anywhere else.

We won't break any packages, and accessing the base version via its fully qualified name will work just fine. And it won't be the first function we mask.
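The masking behavior described here can be reproduced in plain R. In this sketch the global `grouping` definition is a hypothetical stand-in for a SparkR wrapper, not the actual implementation; the point is only that the base version stays reachable through its qualified name.

```r
# Hypothetical stand-in for a wrapper that masks base::grouping.
grouping <- function(x) {
  "masked version"
}

# An unqualified call now resolves to the masking definition.
grouping(c(3L, 1L, 2L))

# The base implementation is still available via its qualified name
# and returns a permutation with one entry per input element.
perm <- base::grouping(c(3L, 1L, 2L))
length(perm)
```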

felixcheung · 2017-04-30T19:09:56Z

I understand what you are saying, and for frequent R users this might not really be a big issue, but so far we have taken the view of avoiding masking and any inconvenience, if we at all can.

Ideally, I'd prefer to eliminate any conflict with cov, filter, and sample as well.
This is what I'm referring to re: the workaround for drop. If the signature isn't too strange or different, we should try to export a generic + stub that makes the base:: function still callable without a namespace prefix.

From a quick check I think what we have for drop could apply to grouping as well - would you like to give it a shot?
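A minimal sketch of the generic + stub idea, assuming S4 dispatch on the first argument as described for drop. `FakeFrame` is a hypothetical stand-in for a Spark DataFrame class, and the actual SparkR definitions may differ in detail.

```r
library(methods)

# Hypothetical stand-in for a Spark DataFrame: just a set of column names.
setClass("FakeFrame", representation(cols = "character"))

# Generic that takes over the name `drop`.
setGeneric("drop", function(x, ...) standardGeneric("drop"))

# Method for the new class: remove a column by name.
setMethod("drop", signature(x = "FakeFrame"), function(x, col) {
  new("FakeFrame", cols = setdiff(x@cols, col))
})

# Stub for everything else: route the call back to base::drop so that
# existing code keeps working without a namespace prefix.
setMethod("drop", signature(x = "ANY"), function(x) {
  base::drop(x)
})

df <- new("FakeFrame", cols = c("a", "b"))
drop(df, "a")@cols            # dispatches to the FakeFrame method
drop(matrix(1:3, nrow = 1))   # falls through to base::drop
```

The stub relies on dispatching on a named first argument, which is why a function whose only formal is `...` is harder to wrap this way.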

zero323 · 2017-04-30T19:59:28Z

I thought about it before, but there are two or three problems:

base::grouping has a ... signature, so it is not possible without changing behavior or adding a nullary variant.
It makes it impossible to add a SparkR::grouping which accepts characterOrColumn (and this is something that we should do to achieve a consistent API).
Finally it won't work that well if we decide to use some form of NSE.

There can be some other problems I am not aware of.

In general I fully agree that we should avoid conflicts when possible, but I am skeptical about dispatching hacks, which in some border cases could actually break user code.

There are some other options we can explore:

Adding a common prefix like sql.some_name - a bit too verbose for my taste.
Using a _ suffix. This is fine but can be confusing for advanced users, who may expect it to signal standard vs. non-standard evaluation.

I trust your judgment so if you think that generic option is the best I'll go with it.

SparkQA · 2017-04-30T21:35:45Z

Test build #76330 has finished for PR 17807 at commit abee444.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zero323 · 2017-04-30T22:12:05Z

To keep this option open: zero323@a67c77e.

One caveat is that it messes up the docs a bit:

felixcheung · 2017-05-01T00:49:26Z

yea, I agree it's not ideal and I'm not in favor of dispatch hacking either.

I think the latest you have is reasonable... but as you say, perhaps it isn't worthwhile to iterate on this that much. Either that or what if we name this something else, like perhaps grouping_bit to distinguish it from grouping_id?

felixcheung · 2017-05-01T00:49:40Z

strange error with AppVeyor..

zero323 · 2017-05-01T01:43:30Z

strange error with AppVeyor..

Indeed strange. tempfile conflict is not likely, is it?

zero323 · 2017-05-01T01:50:25Z

What I like about masking is that it is relatively explicit:

> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, grouping,
    intersect, rank, rbind, sample, startsWith, subset, summary,
    transform, union

When we use generic tricks things get fuzzy. Like drop above. Is it masked? Based on what I see above I would expect it is.

While grouping is probably of the least concern, I think this is actually an important discussion.

felixcheung · 2017-05-01T04:57:30Z

I'm not sure there is one rule for this - we would need to look at it on a case-by-case basis. Generally we should try to avoid conflict, because it is inconvenient and breaks any existing code users might have.

The "masked" message unfortunately isn't that accurate either - in reality, we only masked 3 methods (and we have to document them and so on) and not the 23 listed there. For many, there isn't any hack - just the mere act of adding a generic is triggering inclusion in this message, and in which case nothing is "wrong" - say, predict for example. That's why we have tests here and here to make sure it is not broken.

zero323 · 2017-05-01T05:12:57Z

Don't take away my hope :)

I think that grouping_bit is OK. It is descriptive, doesn't introduce a generic _col, and we can avoid masking. I'll resolve the conflicts and update the PR in a moment.

felixcheung · 2017-05-01T05:36:48Z

let's just say, happy to discuss :)

SparkQA · 2017-05-01T06:00:15Z

Test build #76340 has finished for PR 17807 at commit b4cccd9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

LGTM

felixcheung · 2017-05-02T04:39:57Z

merged to master, thanks!

felixcheung reviewed Apr 29, 2017

zero323 force-pushed the SPARK-20532 branch from cf8e0e4 to 93e1c10 Compare April 29, 2017 19:01

zero323 force-pushed the SPARK-20532 branch from 93e1c10 to abee444 Compare April 30, 2017 20:58

zero323 added 8 commits May 1, 2017 07:14

Implement grouping and grouping_id

42df216

Add missing rdnames

ae47854

Remove notes

5729e65

Rename is_grouping to grouping_col

fa21d6c

Note that additional columns are optional

7c2b12a

Adjust comments

5b7f9bf

Rewrite grouping_id formula as a SparkR expression

927a628

Rename grouping_col to grouping_bit

b4cccd9

zero323 force-pushed the SPARK-20532 branch from abee444 to b4cccd9 Compare May 1, 2017 05:24

felixcheung approved these changes May 2, 2017

asfgit closed this in 90d77e9 May 2, 2017

zero323 deleted the SPARK-20532 branch May 8, 2017 09:08
