New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16549: [C++] Simplify AggregateNodeOptions aggregates/targets #13150
Conversation
FWIW, you can use "draft" status when something is WIP but you still want CI runs. |
@lidavidm Yes, I want to run the CIs and see what is failing. Is it possible with a draft PR? (I mean selecting the draft option, I remember it pausing the CIs, may be I am mistaken.) |
Draft runs CI. Only "WIP" skips CI. |
C Glib is WIP, but appreciate thoughts on the core change. |
Thank you @kou !!! |
I've fixed C GLib failures. |
Thank you @kou |
r/R/query-engine.R
Outdated
@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan", | |||
x | |||
}) | |||
} | |||
target_names <- names(.data$aggregations) | |||
for (i in seq_len(length(target_names))) { | |||
.data$aggregations[[i]][["name"]] <- .data$aggregations[[i]][["target"]] <- target_names[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "target" and "name" are always the same, why are we passing them twice? Can't we reuse the same std::vector<std::string>
in the C++ bindings?
Also, why put target names inside the aggregations elements when we're just going to pull them back out and make a vector in C++? If you pass target_names
directly, you already have std::vector<std::string>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah my_bad, I should have added an affix here.
The idea of this PR was to simplify the usage of AggregateNodeOptions
. In that case, (referring to the JIRA) we thought it is better to put what is relevant to an aggregation within the object itself. It is not always the same, you're correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, the equivalent SQL is...
SELECT function_1(target_1) as name_1, function_2(target_2) as name_2, ... FROM ... GROUP BY ...
Each aggregate has (up to) three parts. The C++ is changing from passing in three vectors of strings (which is error-prone) to passing in one vectors of structs where each item has up to three parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah got it, you're right, I misread, I see that you're making a vector of structs now, so disregard my last comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nealrichardson Since you pointed out the names
and target
being the same, in previous implementation they were set to the same value. As far as I understood, the test cases are also following that. Is it wise to affix this or just leave it as it is? I have doubts about which one to select.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...do we have any test case doing something like...
mtcars %>% group_by(cyl)
%>% summarise(
disp_mean = mean(disp),
hp_mean = mean(hp)
)
I would expect target
to be disp
and name
to be disp_mean
. It's always possible we are renaming the columns somewhere else in the generated plan as well in which case name
is meaningless here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think what's happening (IIRC) is that we Project before Aggregate. So if someone does
mtcars %>% group_by(cyl)
%>% summarise(
disp_mean = mean(disp / hp)
)
we first project disp_mean = disp / hp
and them aggregate over that.
@nealrichardson I will work on these suggestions, thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor suggestions but overall this is looking good.
@@ -480,6 +480,12 @@ struct ARROW_EXPORT Aggregate { | |||
|
|||
/// options for the aggregation function | |||
const FunctionOptions* options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should convert this to std::unique_ptr<FunctionOptions>
but I don't mind if we leave that for a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up JIRA created: https://issues.apache.org/jira/browse/ARROW-16686
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should it be unique_ptr
? It's not obvious there's a need for ownership here; also, the caller might want to keep ownership as well.
@@ -372,22 +374,12 @@ static inline void PrintToImpl(const std::string& factory_name, | |||
for (const auto& agg : o->aggregates) { | |||
*os << agg.function << "<"; | |||
if (agg.options) PrintTo(*agg.options, os); | |||
*os << agg.target.ToString() << "<"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is correct. Ideally we would have something like...
aggregates=[{function=hash_sum<>,target=a,name=sum(a)}]
I think the way it is now it would print:
aggregates={hash_sum<a<sum_a<>,},
r/R/query-engine.R
Outdated
@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan", | |||
x | |||
}) | |||
} | |||
target_names <- names(.data$aggregations) | |||
for (i in seq_len(length(target_names))) { | |||
.data$aggregations[[i]][["name"]] <- .data$aggregations[[i]][["target"]] <- target_names[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...do we have any test case doing something like...
mtcars %>% group_by(cyl)
%>% summarise(
disp_mean = mean(disp),
hp_mean = mean(hp)
)
I would expect target
to be disp
and name
to be disp_mean
. It's always possible we are renaming the columns somewhere else in the generated plan as well in which case name
is meaningless here.
{"hash_sum", nullptr}, | ||
{"hash_sum", nullptr}, | ||
{"hash_mean", nullptr}, | ||
{"hash_mean", nullptr}, | ||
{"hash_product", nullptr}, | ||
{"hash_product", nullptr}, | ||
{"hash_sum", nullptr, "agg_0", "hash_sum"}, | ||
{"hash_sum", nullptr, "agg_1", "hash_sum"}, | ||
{"hash_mean", nullptr, "agg_2", "hash_mean"}, | ||
{"hash_mean", nullptr, "agg_3", "hash_mean"}, | ||
{"hash_product", nullptr, "agg_4", "hash_product"}, | ||
{"hash_product", nullptr, "agg_5", "hash_product"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have to make these changes and they do not make the test easier to read. Could we change the definition of GroupByTest
so that instead of taking in const std::vector<internal::Aggregate>& aggregates
it takes in const std::vector<TestAggregate>& aggregates
where we define:
struct TestAggregate {
std::string function;
const FunctionOptions* options;
};
The old style makes sense for these unit tests since we don't really care about what we are naming the columns.
c1d7572
to
d47bfee
Compare
@nealrichardson regarding the following file=r/R/query-engine.R,line=18,col=1,functions should have cyclomatic complexity of less than 26, this has 28. Should we create a few util functions external to the |
@nealrichardson I think it resolved the CI issue. Thank you. |
@westonpace should we take the |
Yes. |
@kou need some help with the C-Glib, when I rebased I tried to fix it, but didn't succeed. Appreciate your help. |
Done. |
Thank you @kou |
r/R/query-engine.R
Outdated
) | ||
} | ||
} | ||
config_agg <- private$.set_aggregation(node, .data, grouped, group_vars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't like this refactoring. I think it obfuscates what is happening here, and that's not worth doing just to try to trick a misguided linter. Would you mind reverting it? I'd rather tune the linting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I will. Do you mind just making it one function or do you want a full revert where there are no helper functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full revert, the helpers don't seem to be helping with the lint warning
I just pushed a commit applying the suggestion from #13150 (comment), and that makes the lint warning go away. Also updated the lintr config and added a reference to an issue about cyclocomp for R6 classes. |
Thank you @nealrichardson I have missed that comment. |
@nealrichardson the CI is failing here: https://github.com/apache/arrow/runs/7029848711?check_suite_focus=true#step:5:1040 Error in linters_with_defaults(line_length_linter = line_length_linter(120), :
could not find function "linters_with_defaults"
Calls: <Anonymous> -> read_settings -> get_setting -> eval -> eval |
I backed out the lintr changes (they were just cleaning up warnings that show on the latest release of lintr, but we have pinned an old version elsewhere apparently). Will handle in ARROW-16900. |
Thanks @nealrichardson |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sticking with this cleanup.
Appreciate the support !. |
…pache#13150) This PR is simplifying the existing `AggregateNodeOptions` usage. This work is still in progress and need to evaluate the existing refactor and usage. Todos - [x] Test - [ ] Update documentation - [ ] Update function docs - [x] Evaluate CI failures (only tested on Mac M1 with C++/Python, need to check if the change breaks other language bindings Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
@github-actions crossbow submit verify-rc-source-cpp-macos-arm64 wheel-macos-big-sur-cp38-arm64 |
Revision: 0a09d22 Submitted crossbow builds: ursacomputing/crossbow @ actions-17830e587b
|
…pache#13150) This PR is simplifying the existing `AggregateNodeOptions` usage. This work is still in progress and need to evaluate the existing refactor and usage. Todos - [x] Test - [ ] Update documentation - [ ] Update function docs - [x] Evaluate CI failures (only tested on Mac M1 with C++/Python, need to check if the change breaks other language bindings Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
This PR is simplifying the existing
AggregateNodeOptions
usage. This work is still in progress and need to evaluate the existing refactor and usage.Todos