[SPARK-20532][SPARKR] Implement grouping and grouping_id #17807
|
Test build #76294 has finished for PR 17807 at commit
|
|
Test build #76296 has started for PR 17807 at commit |
|
Jenkins retest this please. |
|
Test build #76299 has finished for PR 17807 at commit
|
R/pkg/NAMESPACE
Outdated
please sort this. you might have forgotten to update this after changing the name
R/pkg/R/functions.R
Outdated
is it not more intuitive if it is TRUE/FALSE? I guess this is the behavior in Scala?
As far as I remember this is actually SQL:1999 standard. And has direct relation to binary encoding used for grouping id.
Maybe is_grouping is not the most fortunate choice of name, but I don't have a better idea. Theoretically we can use just grouping and let users worry about disambiguation.
Or maybe grouping_col?
Let's go with grouping_col. It doesn't suggest boolean type, and doesn't conflict with built-ins.
R/pkg/R/functions.R
Outdated
I'd simplify this. Since this is in @family agg_funcs, the generated doc already has links to most of these. Unless there is a very direct relation, say to grouping_id, I don't think we need extra links here
... because it gets very messy (which is something we need to clean up actually)
If you think it is better... Though @family based lists grow pretty fast and it is hard to find anything useful there.
which is something we need to clean up actually
There are like four functions left for full parity so we can try to figure this out after that. I would like to take a look at the overall structure as well. There is a lot of boilerplate there.
I'm OK either way. We have a pending item to clean up @family, though: we are seeing duplicated entries (which is why it's so long) because of the use of @aliases to keep the CRAN check happy
R/pkg/R/generics.R
Outdated
sort
R/pkg/R/functions.R
Outdated
ditto links
R/pkg/R/functions.R
Outdated
minor: add (optional)?
Technically speaking it is true, but grouping_id with a single column is just grouping :)
R/pkg/R/functions.R
Outdated
does << get handled properly?
also perhaps it's more appropriate to use the R-equivalent operator instead?
Actually I am not aware of any base implementation. Is there one?
or bitwShiftL? https://stat.ethz.ch/R-manual/R-devel/library/base/html/bitwise.html
it's not as readable, I agree, but it's syntactically correct R code.
If we prefer code then this should be exactly what we need.
grouping_col(c1) * 2^(n - 1) + grouping_col(c2) * 2^(n - 2) + ... + grouping_col(cn)
It is valid R / SparkR code, and arguably pretty clear.
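For readers who want to check the encoding, here is a small base-R sketch (no Spark session needed) that evaluates the expression above on plain integers and compares it with the bitwShiftL form suggested earlier. The bit values are illustrative assumptions, not output from SparkR; each bit stands for what the per-column grouping indicator would return (1 when the column is aggregated, 0 when it is grouped).

```r
# Illustrative grouping bits for n = 3 columns (c1, c2, c3).
# 1 = column is aggregated in this row, 0 = column is part of the grouping.
bits <- c(1L, 0L, 1L)
n <- length(bits)

# Arithmetic form: bit_1 * 2^(n - 1) + bit_2 * 2^(n - 2) + ... + bit_n
id_arith <- sum(bits * 2^(rev(seq_len(n)) - 1))

# Equivalent form with base R bitwise helpers: shift left, then OR in the next bit.
id_shift <- Reduce(function(acc, b) bitwOr(bitwShiftL(acc, 1L), b), bits, 0L)

print(id_arith)  # 5
print(id_shift)  # 5
```

Both forms agree, so the choice between them in the docs is mostly a readability question.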
|
Test build #76304 has finished for PR 17807 at commit
|
|
Didn't we add something_col recently? I worry that the meaning diverges from the earlier case.
Also if grouping_id(a$col) == grouping(a$col) maybe we just need one, grouping_id is good enough?
|
I don't think we did. There was
This is certainly a good point, but we won't get away with it. I don't know if is standard or not but Luckily we don't have to worry about |
|
right, there's a bunch of *Col in ml. I think people would associate I do see your point about |
|
What happens if we just mask We won't break any packages, and accessing |
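As a rough illustration of what masking means in practice, the base-R sketch below defines a hypothetical stand-in named grouping (not the SparkR implementation) and shows that the base version remains reachable through the base:: prefix. It assumes an R version that ships base::grouping.

```r
# Hypothetical stand-in that shadows base::grouping in the current environment.
# This is NOT the SparkR function; it only demonstrates name masking.
grouping <- function(x, ...) "masked version"

# Unqualified calls now resolve to the stand-in.
masked_result <- grouping(c(3, 1, 2))

# The base implementation is still available with an explicit namespace prefix.
base_result <- base::grouping(c(3, 1, 2))

print(masked_result)
print(as.integer(base_result))

# Remove the stand-in so unqualified calls reach base::grouping again.
rm(grouping)
```

Code that already calls base::grouping explicitly is unaffected; only unqualified calls in user scripts change meaning.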
|
I understand what you are saying and for frequent R users this might not really be a big issue, so far we are taking the view of avoiding masking any inconvenience, if we at all could. Ideally, I'd prefer to eliminate any conflict with From a quick check I think what we have for |
|
I thought about it before, but there are two or three problems:
There can be some other problems I am not aware of. In general I fully agree that we should avoid conflicts when possible, but I am skeptical about dispatching hacks, which in some border cases could actually break user code. There are some other options we can explore:
I trust your judgment so if you think that generic option is the best I'll go with it. |
|
Test build #76330 has finished for PR 17807 at commit
|
|
To keep this option open zero323@a67c77e. One caveat is that it is messing docs a bit: |
|
yea, I agree it's not ideal and I'm not in favor of dispatch hacking either. I think the latest you have is reasonable... but as you say, perhaps this isn't worthwhile to iterate that much with. Either that or what if we name this something else, like perhaps |
|
strange error with AppVeyor.. |
Indeed strange. |
|
What I like about masking is that it is relatively explicit: When we use generic tricks things get fuzzy. Like While |
|
I'm not sure there is one rule for this - we would need to look at it on a case-by-case basis. Generally we should try to avoid conflict, because it is inconvenient and breaks any existing code users might have. The "masked" message unfortunately isn't that accurate either - in reality, we only masked 3 methods (and we have to document them and so on) and not the 23 listed there. For many, there isn't any hack - just the mere act of adding a generic is triggering inclusion in this message, and in which case nothing is "wrong" - say, |
|
Don't take away my hope :) I think that |
|
let's just say, happy to discuss :) |
|
Test build #76340 has finished for PR 17807 at commit
|
felixcheung
left a comment
LGTM
|
merged to master, thanks! |



What changes were proposed in this pull request?
Adds R wrappers for:
- o.a.s.sql.functions.grouping as o.a.s.sql.functions.is_grouping (to avoid shading base::grouping)
- o.a.s.sql.functions.grouping_id

How was this patch tested?
Existing unit tests, additional unit tests.
check-cran.sh.