Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13778: [R] Handle complex summarize expressions #11108

Closed
wants to merge 5 commits into from

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Sep 8, 2021

This handles summarize() queries like avg = sum(x) / n() by extracting the aggregations and evaluating them first, then implicitly doing mutate() afterwards. It does not support things like stddev = sqrt(sum((x - mean(x)) ^ 2) / n()) because x - mean(x) implies a grouped aggregation -> left join -> mutate; that will be ARROW-13926 (after we can do joins).

TODO:

  • More testing and better error handling for unsupported cases
  • Add more explanatory discussion in comments because the logic gets complex

@github-actions
Copy link

github-actions bot commented Sep 8, 2021

@nealrichardson nealrichardson force-pushed the complex-exprs branch 3 times, most recently from fcbc30d to 72b67fe Compare September 9, 2021 18:50
@nealrichardson nealrichardson marked this pull request as ready for review September 9, 2021 18:50
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few comments/suggestions. One more substantive one about if someone does have a nefariously named column like ..temp1 in their input already

@@ -42,33 +49,63 @@ summarise.arrow_dplyr_query <- function(.data, ..., .engine = c("arrow", "duckdb
}
summarise.Dataset <- summarise.ArrowTabular <- summarise.arrow_dplyr_query

# This is the Arrow summarize implementation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great — finding these can be tricky in this code sometimes

# TODO: support in ARROW-13926
# TODO: Add "because" arg to explain _why_ it's not supported?
# TODO: this message could also say "not supported in summarize()"
# since some of these expressions may be legal elsewhere
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last TODO I think is important — I think anyone who gets a not supported message will assume that expression is not supported in Arrow at all anywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah maybe so, but all of the examples I can think of are just bad form (like summarize(new_column = dbl - int), which really should be expressed in mutate()). Anyway I don't think this PR is the last word on any of this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# so arrow_eval the expression on the data and give it a ..temp name prefix,
# then insert that name (symbol) back into the expression so that we can
# mutate() on the result of the aggregation and reference this field.
tmpname <- paste0("..temp", length(ctx$aggregations))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is vanishingly rare, but could we at this point check for any fields named in tmpname here? I can't imagine anyone would have one, but it would be better to error. Or would this error later without proactive checking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way this would fail that I can think of is that if you deliberately did something like summarize(..temp1 = mean(x), avg = sum(x) / n()), and, well, can't you just not do that? Open to other suggestions, the names don't matter as long as they're unique and valid as names (so you can't just use the expression like "sum(x)" as the name).

r/R/dplyr-summarize.R Outdated Show resolved Hide resolved
r/R/dplyr-summarize.R Show resolved Hide resolved
r/R/dplyr-summarize.R Show resolved Hide resolved
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
@nealrichardson
Copy link
Member Author

Merging; happy to address any other feedback in a followup.

@nealrichardson nealrichardson deleted the complex-exprs branch September 10, 2021 12:40
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This handles `summarize()` queries like `avg = sum(x) / n()` by extracting the aggregations and evaluating them first, then implicitly doing `mutate()` afterwards. It does not support things like `stddev = sqrt(sum((x - mean(x)) ^ 2) / n())` because `x - mean(x)` implies a grouped aggregation -> left join -> mutate; that will be ARROW-13926 (after we can do joins).

TODO:

- [x] More testing and better error handling for unsupported cases
- [x] Add more explanatory discussion in comments because the logic gets complex

Closes apache#11108 from nealrichardson/complex-exprs

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants