-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13778: [R] Handle complex summarize expressions #11108
Conversation
fcbc30d
to
72b67fe
Compare
72b67fe
to
acb0461
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, a few comments/suggestions. One more substantive one about if someone does have a nefariously named column like ..temp1
in their input already
@@ -42,33 +49,63 @@ summarise.arrow_dplyr_query <- function(.data, ..., .engine = c("arrow", "duckdb | |||
} | |||
summarise.Dataset <- summarise.ArrowTabular <- summarise.arrow_dplyr_query | |||
|
|||
# This is the Arrow summarize implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great — finding these can be tricky in this code sometimes
# TODO: support in ARROW-13926 | ||
# TODO: Add "because" arg to explain _why_ it's not supported? | ||
# TODO: this message could also say "not supported in summarize()" | ||
# since some of these expressions may be legal elsewhere |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last TODO I think is important — I think anyone who gets a not supported message will assume that expression is not supported in Arrow at all anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah maybe so, but all of the examples I can think of are just bad form (like summarize(new_column = dbl - int)
, which really should be expressed in mutate()
). Anyway I don't think this PR is the last word on any of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# so arrow_eval the expression on the data and give it a ..temp name prefix, | ||
# then insert that name (symbol) back into the expression so that we can | ||
# mutate() on the result of the aggregation and reference this field. | ||
tmpname <- paste0("..temp", length(ctx$aggregations)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is vanishingly rare, but could we at this point check for any fields named in tmpname
here? I can't imagine anyone would have one, but it would be better to error. Or would this error later without proactive checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only way this would fail that I can think of is that if you deliberately did something like summarize(..temp1 = mean(x), avg = sum(x) / n())
, and, well, can't you just not do that? Open to other suggestions, the names don't matter as long as they're unique and valid as names (so you can't just use the expression like "sum(x)"
as the name).
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Merging; happy to address any other feedback in a followup. |
This handles `summarize()` queries like `avg = sum(x) / n()` by extracting the aggregations and evaluating them first, then implicitly doing `mutate()` afterwards. It does not support things like `stddev = sqrt(sum((x - mean(x)) ^ 2) / n())` because `x - mean(x)` implies a grouped aggregation -> left join -> mutate; that will be ARROW-13926 (after we can do joins). TODO: - [x] More testing and better error handling for unsupported cases - [x] Add more explanatory discussion in comments because the logic gets complex Closes apache#11108 from nealrichardson/complex-exprs Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This handles
summarize()
queries likeavg = sum(x) / n()
by extracting the aggregations and evaluating them first, then implicitly doingmutate()
afterwards. It does not support things likestddev = sqrt(sum((x - mean(x)) ^ 2) / n())
becausex - mean(x)
implies a grouped aggregation -> left join -> mutate; that will be ARROW-13926 (after we can do joins).TODO: