ARROW-16549: [C++] Simplify AggregateNodeOptions aggregates/targets #13150

vibhatha · 2022-05-13T13:10:57Z

This PR simplifies the existing AggregateNodeOptions usage. The work is still in progress; the existing refactor and usage still need to be evaluated.

Todos

Test
Update documentation
Update function docs
Evaluate CI failures (only tested on Mac M1 with C++/Python; need to check whether the change breaks other language bindings)

github-actions · 2022-05-13T13:12:43Z

https://issues.apache.org/jira/browse/ARROW-16549

lidavidm · 2022-05-16T16:48:13Z

FWIW, you can use "draft" status when something is WIP but you still want CI runs.

vibhatha · 2022-05-17T09:40:26Z

@lidavidm Yes, I want to run the CIs and see what is failing. Is that possible with a draft PR? (I mean selecting the draft option; I remember it pausing the CIs, but maybe I am mistaken.)

lidavidm · 2022-05-17T11:46:07Z

Draft runs CI. Only "WIP" skips CI.

vibhatha · 2022-05-20T02:29:06Z

cc @westonpace @lidavidm @nealrichardson

vibhatha · 2022-05-20T02:29:52Z

C GLib is WIP, but I'd appreciate thoughts on the core change.

vibhatha · 2022-05-20T12:21:51Z

Thank you @kou !!!
It is still failing for a different reason now.

kou · 2022-05-22T13:15:22Z

I've fixed C GLib failures.

vibhatha · 2022-05-22T13:21:14Z

Thank you @kou

nealrichardson · 2022-05-23T13:25:11Z

r/R/query-engine.R

@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan",
            x
          })
        }
+        target_names <- names(.data$aggregations)
+        for (i in seq_len(length(target_names))) {
+          .data$aggregations[[i]][["name"]] <- .data$aggregations[[i]][["target"]] <- target_names[i]


If "target" and "name" are always the same, why are we passing them twice? Can't we reuse the same std::vector<std::string> in the C++ bindings?

Also, why put target names inside the aggregations elements when we're just going to pull them back out and make a vector in C++? If you pass target_names directly, you already have std::vector<std::string>.

Ah, my bad, I should have added an affix here.

The idea of this PR is to simplify the usage of AggregateNodeOptions. To that end (referring to the JIRA), we thought it was better to put everything relevant to an aggregation within the object itself. You're correct that the two are not always the same.

For reference, the equivalent SQL is...

SELECT function_1(target_1) as name_1, function_2(target_2) as name_2, ... FROM ... GROUP BY ...

Each aggregate has (up to) three parts. The C++ is changing from passing in three vectors of strings (which is error-prone) to passing in one vector of structs where each item has up to three parts.
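
To make the difference concrete, here is a minimal sketch of the two shapes. It does not use the Arrow headers: `SketchAggregate`, `ParallelAggregates`, and `ZipAggregates` are hypothetical names invented for illustration, and the field names simply follow the function/target/name wording used in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the struct discussed above; not the Arrow definition.
struct SketchAggregate {
  std::string function;  // aggregation function, e.g. "hash_mean"
  std::string target;    // input column the function reads
  std::string name;      // output column name
};

// Old style: three parallel vectors that the caller must keep aligned by hand.
struct ParallelAggregates {
  std::vector<std::string> functions;
  std::vector<std::string> targets;
  std::vector<std::string> names;
};

// Zip the parallel vectors into one vector of structs.
// Returns an empty vector if the lengths disagree, which is the kind of
// mistake the struct form rules out by construction.
std::vector<SketchAggregate> ZipAggregates(const ParallelAggregates& old_style) {
  std::vector<SketchAggregate> out;
  const std::size_t n = old_style.functions.size();
  if (old_style.targets.size() != n || old_style.names.size() != n) return out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back({old_style.functions[i], old_style.targets[i], old_style.names[i]});
  }
  return out;
}
```

With the struct form, each entry carries its own target and name, so a length or ordering mismatch between the three lists cannot occur.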

Ah, got it, you're right. I misread; I see that you're making a vector of structs now, so disregard my last comment.

@nealrichardson Since you pointed out the name and target being the same: in the previous implementation they were set to the same value. As far as I understand, the test cases also follow that. Is it wise to add an affix here, or just leave it as it is? I am unsure which one to choose.

Hmm...do we have any test case doing something like...

mtcars %>% group_by(cyl) %>% summarise( disp_mean = mean(disp), hp_mean = mean(hp) )

I would expect target to be disp and name to be disp_mean. It's always possible we are renaming the columns somewhere else in the generated plan as well in which case name is meaningless here.

Yeah I think what's happening (IIRC) is that we Project before Aggregate. So if someone does

mtcars %>% group_by(cyl) %>% summarise( disp_mean = mean(disp / hp) )

we first project disp_mean = disp / hp and then aggregate over that.
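
As a rough illustration of that ordering (project first, then aggregate), the sketch below uses plain standard-library containers rather than the Arrow exec plan. `CarRow`, `ProjectRatio`, and `MeanByKey` are hypothetical names, and the real plan works on columnar batches, not row structs.

```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical row type standing in for a few mtcars columns.
struct CarRow {
  int cyl;
  double disp;
  double hp;
};

// Step 1 (project): emit one (group key, value) pair per row, value = disp / hp.
std::vector<std::pair<int, double>> ProjectRatio(const std::vector<CarRow>& rows) {
  std::vector<std::pair<int, double>> projected;
  projected.reserve(rows.size());
  for (const CarRow& row : rows) {
    projected.emplace_back(row.cyl, row.disp / row.hp);
  }
  return projected;
}

// Step 2 (aggregate): mean of the already-projected value per group key.
std::map<int, double> MeanByKey(const std::vector<std::pair<int, double>>& projected) {
  std::map<int, std::pair<double, int>> sums;  // key -> (running sum, count)
  for (const auto& kv : projected) {
    sums[kv.first].first += kv.second;
    sums[kv.first].second += 1;
  }
  std::map<int, double> means;
  for (const auto& entry : sums) {
    means[entry.first] = entry.second.first / entry.second.second;
  }
  return means;
}
```

In this arrangement the aggregate step only ever sees a column that already carries the output name, which is consistent with target and name being identical at that point.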

vibhatha · 2022-05-23T13:46:09Z

@nealrichardson I will work on these suggestions, thank you!

westonpace

Some minor suggestions but overall this is looking good.

westonpace · 2022-05-25T22:28:54Z

cpp/src/arrow/compute/api_aggregate.h

@@ -480,6 +480,12 @@ struct ARROW_EXPORT Aggregate {

  /// options for the aggregation function
  const FunctionOptions* options;


We should convert this to std::unique_ptr<FunctionOptions> but I don't mind if we leave that for a follow-up PR.

Follow-up JIRA created: https://issues.apache.org/jira/browse/ARROW-16686

Why should it be unique_ptr? It's not obvious there's a need for ownership here; also, the caller might want to keep ownership as well.

westonpace · 2022-05-25T22:42:15Z

cpp/src/arrow/compute/exec/test_util.cc

@@ -372,22 +374,12 @@ static inline void PrintToImpl(const std::string& factory_name,
    for (const auto& agg : o->aggregates) {
      *os << agg.function << "<";
      if (agg.options) PrintTo(*agg.options, os);
+      *os << agg.target.ToString() << "<";


I'm not sure this is correct. Ideally we would have something like...

aggregates=[{function=hash_sum<>,target=a,name=sum(a)}]

I think the way it is now it would print:

aggregates={hash_sum<a<sum_a<>,},
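
One way to produce the suggested format is sketched below. This is not the code in test_util.cc: `PrintableAggregate` and `FormatAggregates` are hypothetical, and the options are represented as an already-formatted string (`options_repr`) instead of calling `PrintTo` on a `FunctionOptions`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified view of one aggregate for printing purposes.
struct PrintableAggregate {
  std::string function;
  std::string options_repr;  // pre-formatted options; empty when there are none
  std::string target;
  std::string name;
};

// Formats aggregates as: aggregates=[{function=f<opts>,target=t,name=n},...]
std::string FormatAggregates(const std::vector<PrintableAggregate>& aggregates) {
  std::ostringstream os;
  os << "aggregates=[";
  for (std::size_t i = 0; i < aggregates.size(); ++i) {
    if (i > 0) os << ",";  // separator between entries, none after the last
    const PrintableAggregate& agg = aggregates[i];
    os << "{function=" << agg.function << "<" << agg.options_repr << ">"
       << ",target=" << agg.target << ",name=" << agg.name << "}";
  }
  os << "]";
  return os.str();
}
```

Labelling each field and closing the `<...>` around the options keeps the output unambiguous when target and name differ.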

westonpace · 2022-05-25T22:47:27Z

r/R/query-engine.R

@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan",
            x
          })
        }
+        target_names <- names(.data$aggregations)
+        for (i in seq_len(length(target_names))) {
+          .data$aggregations[[i]][["name"]] <- .data$aggregations[[i]][["target"]] <- target_names[i]


Hmm...do we have any test case doing something like...

mtcars %>% group_by(cyl) %>% summarise( disp_mean = mean(disp), hp_mean = mean(hp) )

I would expect target to be disp and name to be disp_mean. It's always possible we are renaming the columns somewhere else in the generated plan as well in which case name is meaningless here.

westonpace · 2022-05-25T22:57:55Z

cpp/src/arrow/compute/kernels/hash_aggregate_test.cc

-                                   {"hash_sum", nullptr},
-                                   {"hash_sum", nullptr},
-                                   {"hash_mean", nullptr},
-                                   {"hash_mean", nullptr},
-                                   {"hash_product", nullptr},
-                                   {"hash_product", nullptr},
+                                   {"hash_sum", nullptr, "agg_0", "hash_sum"},
+                                   {"hash_sum", nullptr, "agg_1", "hash_sum"},
+                                   {"hash_mean", nullptr, "agg_2", "hash_mean"},
+                                   {"hash_mean", nullptr, "agg_3", "hash_mean"},
+                                   {"hash_product", nullptr, "agg_4", "hash_product"},
+                                   {"hash_product", nullptr, "agg_5", "hash_product"},


I don't think we should have to make these changes and they do not make the test easier to read. Could we change the definition of GroupByTest so that instead of taking in const std::vector<internal::Aggregate>& aggregates it takes in const std::vector<TestAggregate>& aggregates where we define:

struct TestAggregate { std::string function; const FunctionOptions* options; };

The old style makes sense for these unit tests since we don't really care about what we are naming the columns.
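
A minimal sketch of how such a helper could fill in the extra fields is shown below. Everything except the `TestAggregate` shape is an assumption: `SketchFunctionOptions`, `SketchFullAggregate`, and `ExpandTestAggregates` are hypothetical names, and the generated target/name values simply mirror the `agg_N` and function-name pattern from the diff above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical placeholder for the real FunctionOptions type.
struct SketchFunctionOptions {};

// Compact form used by the tests, as proposed above.
struct TestAggregate {
  std::string function;
  const SketchFunctionOptions* options;
};

// Hypothetical full form; field order follows the initializers in the diff.
struct SketchFullAggregate {
  std::string function;
  const SketchFunctionOptions* options;
  std::string target;
  std::string name;
};

// Expand the compact test form, generating "agg_<index>" as the target and
// reusing the function name as the output name.
std::vector<SketchFullAggregate> ExpandTestAggregates(
    const std::vector<TestAggregate>& aggregates) {
  std::vector<SketchFullAggregate> out;
  out.reserve(aggregates.size());
  for (std::size_t i = 0; i < aggregates.size(); ++i) {
    out.push_back({aggregates[i].function, aggregates[i].options,
                   "agg_" + std::to_string(i), aggregates[i].function});
  }
  return out;
}
```

With a helper along these lines, the test bodies can keep the short `{"hash_sum", nullptr}` initializers while the generated names stay in one place.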

vibhatha · 2022-06-01T04:27:59Z

@nealrichardson regarding the following

file=r/R/query-engine.R,line=18,col=1,functions should have cyclomatic complexity of less than 26, this has 28.

Should we create a few utility functions outside the R6Class to mitigate this issue? What's the best approach?

vibhatha · 2022-06-02T23:36:46Z

@nealrichardson I think it resolved the CI issue. Thank you.

vibhatha · 2022-06-02T23:37:49Z

@westonpace should we take Aggregate out of the internal namespace?

westonpace · 2022-06-02T23:57:56Z

Yes.

vibhatha · 2022-06-23T03:01:26Z

@kou I need some help with C GLib. I tried to fix it when I rebased, but didn't succeed. I'd appreciate your help.

kou · 2022-06-23T05:41:55Z

Done.

vibhatha · 2022-06-23T07:02:37Z

Thank you @kou

nealrichardson · 2022-06-23T14:17:50Z

r/R/query-engine.R

-            )
-          }
-        }
+        config_agg <- private$.set_aggregation(node, .data, grouped, group_vars)


Sorry, I don't like this refactoring. I think it obfuscates what is happening here, and that's not worth doing just to try to trick a misguided linter. Would you mind reverting it? I'd rather tune the linting.

Sure I will. Do you mind just making it one function or do you want a full revert where there are no helper functions?

Full revert; the helpers don't seem to be helping with the lint warning.

nealrichardson · 2022-06-23T18:44:13Z

I just pushed a commit applying the suggestion from #13150 (comment), and that makes the lint warning go away. Also updated the lintr config and added a reference to an issue about cyclocomp for R6 classes.

vibhatha · 2022-06-24T02:24:04Z

Thank you @nealrichardson, I had missed that comment.

vibhatha · 2022-06-24T02:35:39Z

@nealrichardson the CI is failing here: https://github.com/apache/arrow/runs/7029848711?check_suite_focus=true#step:5:1040

Error in linters_with_defaults(line_length_linter = line_length_linter(120),  : 
  could not find function "linters_with_defaults"
Calls: <Anonymous> -> read_settings -> get_setting -> eval -> eval

nealrichardson · 2022-06-24T12:22:15Z

I backed out the lintr changes (they were just cleaning up warnings that show on the latest release of lintr, but we have pinned an old version elsewhere apparently). Will handle in ARROW-16900.

vibhatha · 2022-06-24T12:24:26Z

Thanks @nealrichardson

westonpace

Thanks for sticking with this cleanup.

vibhatha · 2022-06-27T17:02:46Z

Appreciate the support!

…pache#13150)

This PR is simplifying the existing `AggregateNodeOptions` usage. This work is still in progress and need to evaluate the existing refactor and usage.

Todos
- [x] Test
- [ ] Update documentation
- [ ] Update function docs
- [x] Evaluate CI failures (only tested on Mac M1 with C++/Python, need to check if the change breaks other language bindings

Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>

assignUser · 2022-07-04T10:25:53Z

@github-actions crossbow submit verify-rc-source-cpp-macos-arm64 wheel-macos-big-sur-cp38-arm64

github-actions · 2022-07-04T10:45:02Z

Revision: 0a09d22

Submitted crossbow builds: ursacomputing/crossbow @ actions-17830e587b

Task	Status
verify-rc-source-cpp-macos-arm64
wheel-macos-big-sur-cp38-arm64

github-actions bot added the Component: C++ label May 13, 2022

vibhatha marked this pull request as draft May 17, 2022 11:51

github-actions bot added the Component: R label May 19, 2022

github-actions bot added the Component: GLib label May 20, 2022

vibhatha marked this pull request as ready for review May 20, 2022 12:21

westonpace self-requested a review May 21, 2022 01:25

nealrichardson requested changes May 23, 2022

vibhatha force-pushed the arrow-16549 branch from 06effde to ac27bbc Compare May 24, 2022 10:42

westonpace requested changes May 25, 2022

vibhatha force-pushed the arrow-16549 branch 3 times, most recently from c1d7572 to d47bfee Compare May 31, 2022 08:16

nealrichardson reviewed Jun 1, 2022

vibhatha requested review from nealrichardson, westonpace and pitrou June 2, 2022 23:38

vibhatha added 2 commits June 22, 2022 09:25

fixed the test issue

fd443e0

rebase

d53c685

vibhatha force-pushed the arrow-16549 branch from 9d46827 to d53c685 Compare June 22, 2022 05:22

vibhatha added 3 commits June 22, 2022 11:16

fix R format issue

3ebc09a

test fix for cglib

5754ebb

typo fix in tpch bench script

dfef8ec

partition the function further

038595b

vibhatha force-pushed the arrow-16549 branch from 443b750 to 038595b Compare June 23, 2022 07:29

added the overwritten patch

764927a

nealrichardson reviewed Jun 23, 2022

vibhatha and others added 2 commits June 23, 2022 21:17

removing the splitted function

4ab7b83

Apply suggested imap() patch to reduce (detected) complexity. Update …

5c367b5

…lintr config for 3.0.0

vibhatha requested a review from nealrichardson June 24, 2022 02:36

Back out lintr changes (will PR separately)

0a09d22

westonpace approved these changes Jun 27, 2022

westonpace merged commit bb67f8d into apache:master Jun 27, 2022

vibhatha mentioned this pull request Jan 3, 2023

ARROW-16212: [C++][Python] Register Multiple Kernels for a UDF #14320

Closed
