Skip to content

Conversation

@zero323
Copy link
Member

@zero323 zero323 commented Apr 29, 2017

What changes were proposed in this pull request?

Adds R wrappers for:

  • o.a.s.sql.functions.grouping as o.a.s.sql.functions.is_grouping (to avoid shading base::grouping
  • o.a.s.sql.functions.grouping_id

How was this patch tested?

Existing unit tests, additional unit tests. check-cran.sh.

@SparkQA
Copy link

SparkQA commented Apr 29, 2017

Test build #76294 has finished for PR 17807 at commit 4f13b61.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 29, 2017

Test build #76296 has started for PR 17807 at commit 10fdaa5.

@zero323
Copy link
Member Author

zero323 commented Apr 29, 2017

Jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Apr 29, 2017

Test build #76299 has finished for PR 17807 at commit 10fdaa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/NAMESPACE Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please sort this. you might have forgotten to update this after changing the name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not more intuitive if it is TRUE/FALSE? I guess this is the behavior in Scala?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I remember this is actually SQL:1999 standard. And has direct relation to binary encoding used for grouping id.

Maybe is_grouping is not the most fortunate choice of name, but I don't have a better idea. Theoretically we can use just grouping and let users worry about disambiguation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe grouping_col?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with grouping_col. It doesn't suggest boolean type, and doesn't conflict with built-ins.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd simplify this - since this is the in @family agg_funcs generated doc already has links to most of these. unless there is a very direct relation, say to grouping_id, I don't think we need extra links here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... because it gets very messy (which is something we need to clean up actually)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think it is better... Though @family based lists grow pretty fast and it is hard to find anything useful there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is something we need to clean up actually

There are like four functions left for full parity so we can try to figure this out after that. I would like to take a look at the overall structure as well. There is a lot of boilerplate there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok either way. We have a pending item to clean up @family though, we are seeing duplicated entries (which is why it's so long) because of the use of @aliases to get CRAN check happy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto links

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: add (optional)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically speaking it is true, but grouping_id with single column is just grouping :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does <<; gets handled properly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also perhaps it's more appropriate to use the R-equivalent operator instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I am not aware of any base implementation. Is there one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK:

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use LaTeX

#' Equals to \eqn{c_1 2^{n - 1} + c_2 2^{n -1} + ... + c_n}

but it no rendered in HTML:

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or bitwShiftL? https://stat.ethz.ch/R-manual/R-devel/library/base/html/bitwise.html
it's not as readable I agree, but it's syntax correct R code..

Copy link
Member Author

@zero323 zero323 Apr 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we prefer code then this should be exactly what we need.

grouping_col(c1) * 2^(n - 1) + grouping_col(c2) * 2^(n - 2)  + ... +  grouping_col(cn)

It is a valid code R / SparkR code, and arguably pretty clear.

@SparkQA
Copy link

SparkQA commented Apr 29, 2017

Test build #76304 has finished for PR 17807 at commit 93e1c10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

felixcheung commented Apr 29, 2017 via email

@zero323
Copy link
Member Author

zero323 commented Apr 30, 2017

Didn't we added something_col recently?

I don't think we did. There was itemsCol in fpGrowth, but it is not relevant here. git grep "_col" | grep "\.R" shows only a bunch of _colors, _collections and _collected used for tests and examples.

is_grouping is probably not so fortunate choice, because if column is used for grouping, it is set to 0.

Also if grouping_id(a$col) == grouping(a$col) maybe we just need one, grouping_id is good enough?

This is certainly a good point, but we won't get away with it. I don't know if is standard or not but grouping_id(...) has to match groupBy(df, ....).

Luckily we don't have to worry about group_id :)

@felixcheung
Copy link
Member

felixcheung commented Apr 30, 2017

right, there's a bunch of *Col in ml. I think people would associate col to a reference to a/another column in the dataframe

I do see your point about is_grouping, and also that grouping_id is not a substitute. It is unfortunate R's grouping is conflicting with this. I'm not sure if I have a good name suggestion. We in occasion have workaround issue like this - since that base::grouping is S3 method we could add a generic etc to route the call to get it to be compatible. do think in this case it is worthwhile to do though?

@zero323
Copy link
Member Author

zero323 commented Apr 30, 2017

What happens if we just mask base::grouping? It is just a gut feeling but I don't think it is the most widely used function ever. I doubt I ever used it in my own code, and cannot recall seeing it anywhere else.

We won't brake any packages, and accessing base version via it's full qualified name will work just fine. And it won't be the the first function we mask.

@felixcheung
Copy link
Member

I understanding what you are saying and for frequent R users this might not really be a big issue, so far we are taking the view of avoiding masking any inconvenience, if we if at all could.

Ideally, I'd prefer to eliminate any conflict with cov, filter, and sample as well.
This is what I'm referring to re: the workaround for drop. If the signature isn't too strange or different, we should try to export a generic + stub that makes the base:: function still callable without a namespace prefix.

From a quick check I think what we have for drop could apply to grouping as well - would you like to give it a shot?

@zero323
Copy link
Member Author

zero323 commented Apr 30, 2017

I thought about it before, but there are two or three problems:

  • base::grouping has ... signature so it is not possible without changing behavior or adding nullary variant.
  • It makes impossible to add SparkR::grouping which accepts characterOrColumn (and this something that we should do to achieve consistent API).
  • Finally it won't work that well if we decide to use some form of NSE.

There can be some other problems I am not aware of.

In general I fully agree that we should avoid conflicts when possible, but I am skeptical about dispatching hacks, which in some border cases could actually brake user code.

There are some other options we can explore:

  • Adding common prefix like sql.some_name - a bit to verbose for my taste.
  • Using _ suffix. This is fine but can be confusing for advanced user who may expect standard vs. non standard evaluation.

I trust your judgment so if you think that generic option is the best I'll go with it.

@SparkQA
Copy link

SparkQA commented Apr 30, 2017

Test build #76330 has finished for PR 17807 at commit abee444.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Copy link
Member Author

zero323 commented Apr 30, 2017

To keep this option open zero323@a67c77e.

One caveat is that it is messing docs a bit:

image

@felixcheung
Copy link
Member

yea, I agree it's not ideal and I'm not in favor of dispatch hacking either.

I think the latest you have is reasonable... but as you say, perhaps this isn't worthwhile to iterate that much with. Either that or what if we name this something else, like perhaps grouping_bit to distinguish with grouping_id?

@felixcheung
Copy link
Member

strange error with AppVeyor..

@zero323
Copy link
Member Author

zero323 commented May 1, 2017

strange error with AppVeyor..

Indeed strange. tempfile conflict is not likely, is it?

@zero323
Copy link
Member Author

zero323 commented May 1, 2017

What I like about masking is that is relatively explicit:

> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, grouping,
    intersect, rank, rbind, sample, startsWith, subset, summary,
    transform, union

When we use generic tricks things get fuzzy. Like drop above. Is it masked? Based on what I see above I would expect it is.

While grouping is probably of the least concern, I think this is actually an important discussion.

@felixcheung
Copy link
Member

I'm not sure there is one rule for this - we would need to look at it on a case-by-case basis. Generally we should try to avoid conflict, because it is inconvenient and breaks any existing code users might have.

The "masked" message unfortunately isn't that accurate either - in reality, we only masked 3 methods (and we have to document them and so on) and not the 23 listed there. For many, there isn't any hack - just the mere act of adding a generic is triggering inclusion in this message, and in which case nothing is "wrong" - say, predict for example. That's why we have tests here and here to make sure it is not broken.

@zero323
Copy link
Member Author

zero323 commented May 1, 2017

Don't take away my hope :)

i think that grouping_bitis OK. It is descriptive, doesn't introduce generic _col and we can avoid masking. I'll resolve the conflicts and update PR in a moment.

@felixcheung
Copy link
Member

let's just say, happy to discuss :)

@SparkQA
Copy link

SparkQA commented May 1, 2017

Test build #76340 has finished for PR 17807 at commit b4cccd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@felixcheung
Copy link
Member

merged to master, thanks!

@asfgit asfgit closed this in 90d77e9 May 2, 2017
@zero323 zero323 deleted the SPARK-20532 branch May 8, 2017 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants