Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16549: [C++] Simplify AggregateNodeOptions aggregates/targets #13150

Merged
merged 32 commits into from Jun 27, 2022

Conversation

vibhatha
Copy link
Collaborator

@vibhatha vibhatha commented May 13, 2022

This PR is simplifying the existing AggregateNodeOptions usage. This work is still in progress and need to evaluate the existing refactor and usage.

Todos

  • Test
  • Update documentation
  • Update function docs
  • Evaluate CI failures (only tested on Mac M1 with C++/Python, need to check if the change breaks other language bindings

@github-actions
Copy link

@lidavidm
Copy link
Member

FWIW, you can use "draft" status when something is WIP but you still want CI runs.

@vibhatha
Copy link
Collaborator Author

@lidavidm Yes, I want to run the CIs and see what is failing. Is it possible with a draft PR? (I mean selecting the draft option, I remember it pausing the CIs, may be I am mistaken.)

@lidavidm
Copy link
Member

Draft runs CI. Only "WIP" skips CI.

@vibhatha vibhatha marked this pull request as draft May 17, 2022 11:51
@vibhatha
Copy link
Collaborator Author

@vibhatha
Copy link
Collaborator Author

C Glib is WIP, but appreciate thoughts on the core change.

@vibhatha
Copy link
Collaborator Author

Thank you @kou !!!
It is still failing for a different reason now.

@vibhatha vibhatha marked this pull request as ready for review May 20, 2022 12:21
@westonpace westonpace self-requested a review May 21, 2022 01:25
@kou
Copy link
Member

kou commented May 22, 2022

I've fixed C GLib failures.

@vibhatha
Copy link
Collaborator Author

I've fixed C GLib failures.

Thank you @kou

r/src/compute-exec.cpp Outdated Show resolved Hide resolved
r/R/query-engine.R Outdated Show resolved Hide resolved
r/R/query-engine.R Outdated Show resolved Hide resolved
@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan",
x
})
}
target_names <- names(.data$aggregations)
for (i in seq_len(length(target_names))) {
.data$aggregations[[i]][["name"]] <- .data$aggregations[[i]][["target"]] <- target_names[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "target" and "name" are always the same, why are we passing them twice? Can't we reuse the same std::vector<std::string> in the C++ bindings?

Also, why put target names inside the aggregations elements when we're just going to pull them back out and make a vector in C++? If you pass target_names directly, you already have std::vector<std::string>.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my_bad, I should have added an affix here.

The idea of this PR was to simplify the usage of AggregateNodeOptions. In that case, (referring to the JIRA) we thought it is better to put what is relevant to an aggregation within the object itself. It is not always the same, you're correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, the equivalent SQL is...

SELECT function_1(target_1) as name_1, function_2(target_2) as name_2, ... FROM ... GROUP BY ...

Each aggregate has (up to) three parts. The C++ is changing from passing in three vectors of strings (which is error-prone) to passing in one vectors of structs where each item has up to three parts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it, you're right, I misread, I see that you're making a vector of structs now, so disregard my last comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nealrichardson Since you pointed out the names and target being the same, in previous implementation they were set to the same value. As far as I understood, the test cases are also following that. Is it wise to affix this or just leave it as it is? I have doubts about which one to select.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...do we have any test case doing something like...

mtcars %>% group_by(cyl)
    %>% summarise(
  disp_mean = mean(disp),
  hp_mean = mean(hp)
)

I would expect target to be disp and name to be disp_mean. It's always possible we are renaming the columns somewhere else in the generated plan as well in which case name is meaningless here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think what's happening (IIRC) is that we Project before Aggregate. So if someone does

mtcars %>% group_by(cyl)
    %>% summarise(
  disp_mean = mean(disp / hp)
)

we first project disp_mean = disp / hp and them aggregate over that.

@vibhatha
Copy link
Collaborator Author

@nealrichardson I will work on these suggestions, thank you!

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions but overall this is looking good.

@@ -480,6 +480,12 @@ struct ARROW_EXPORT Aggregate {

/// options for the aggregation function
const FunctionOptions* options;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should convert this to std::unique_ptr<FunctionOptions> but I don't mind if we leave that for a follow-up PR.

Copy link
Collaborator Author

@vibhatha vibhatha May 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should it be unique_ptr? It's not obvious there's a need for ownership here; also, the caller might want to keep ownership as well.

cpp/src/arrow/compute/api_aggregate.h Show resolved Hide resolved
cpp/src/arrow/compute/api_aggregate.h Show resolved Hide resolved
@@ -372,22 +374,12 @@ static inline void PrintToImpl(const std::string& factory_name,
for (const auto& agg : o->aggregates) {
*os << agg.function << "<";
if (agg.options) PrintTo(*agg.options, os);
*os << agg.target.ToString() << "<";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct. Ideally we would have something like...

aggregates=[{function=hash_sum<>,target=a,name=sum(a)}]

I think the way it is now it would print:

aggregates={hash_sum<a<sum_a<>,},

@@ -121,11 +119,13 @@ ExecPlan <- R6Class("ExecPlan",
x
})
}
target_names <- names(.data$aggregations)
for (i in seq_len(length(target_names))) {
.data$aggregations[[i]][["name"]] <- .data$aggregations[[i]][["target"]] <- target_names[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...do we have any test case doing something like...

mtcars %>% group_by(cyl)
    %>% summarise(
  disp_mean = mean(disp),
  hp_mean = mean(hp)
)

I would expect target to be disp and name to be disp_mean. It's always possible we are renaming the columns somewhere else in the generated plan as well in which case name is meaningless here.

r/src/compute-exec.cpp Show resolved Hide resolved
Comment on lines 978 to 972
{"hash_sum", nullptr},
{"hash_sum", nullptr},
{"hash_mean", nullptr},
{"hash_mean", nullptr},
{"hash_product", nullptr},
{"hash_product", nullptr},
{"hash_sum", nullptr, "agg_0", "hash_sum"},
{"hash_sum", nullptr, "agg_1", "hash_sum"},
{"hash_mean", nullptr, "agg_2", "hash_mean"},
{"hash_mean", nullptr, "agg_3", "hash_mean"},
{"hash_product", nullptr, "agg_4", "hash_product"},
{"hash_product", nullptr, "agg_5", "hash_product"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have to make these changes and they do not make the test easier to read. Could we change the definition of GroupByTest so that instead of taking in const std::vector<internal::Aggregate>& aggregates it takes in const std::vector<TestAggregate>& aggregates where we define:

struct TestAggregate {
  std::string function;
  const FunctionOptions* options;
};

The old style makes sense for these unit tests since we don't really care about what we are naming the columns.

@vibhatha vibhatha force-pushed the arrow-16549 branch 3 times, most recently from c1d7572 to d47bfee Compare May 31, 2022 08:16
@vibhatha
Copy link
Collaborator Author

vibhatha commented Jun 1, 2022

@nealrichardson regarding the following

file=r/R/query-engine.R,line=18,col=1,functions should have cyclomatic complexity of less than 26, this has 28.

Should we create a few util functions external to theR6Class to mitigate this issue? What's the best?

r/R/query-engine.R Outdated Show resolved Hide resolved
@vibhatha
Copy link
Collaborator Author

vibhatha commented Jun 2, 2022

@nealrichardson I think it resolved the CI issue. Thank you.

@vibhatha
Copy link
Collaborator Author

vibhatha commented Jun 2, 2022

@westonpace should we take the Aggregate out from internal namespace?

@westonpace
Copy link
Member

@westonpace should we take the Aggregate out from internal namespace?

Yes.

@vibhatha
Copy link
Collaborator Author

@kou need some help with the C-Glib, when I rebased I tried to fix it, but didn't succeed. Appreciate your help.

@kou
Copy link
Member

kou commented Jun 23, 2022

Done.

@vibhatha
Copy link
Collaborator Author

Done.

Thank you @kou

)
}
}
config_agg <- private$.set_aggregation(node, .data, grouped, group_vars)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't like this refactoring. I think it obfuscates what is happening here, and that's not worth doing just to try to trick a misguided linter. Would you mind reverting it? I'd rather tune the linting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I will. Do you mind just making it one function or do you want a full revert where there are no helper functions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full revert, the helpers don't seem to be helping with the lint warning

@nealrichardson
Copy link
Member

I just pushed a commit applying the suggestion from #13150 (comment), and that makes the lint warning go away. Also updated the lintr config and added a reference to an issue about cyclocomp for R6 classes.

@vibhatha
Copy link
Collaborator Author

I just pushed a commit applying the suggestion from #13150 (comment), and that makes the lint warning go away. Also updated the lintr config and added a reference to an issue about cyclocomp for R6 classes.

Thank you @nealrichardson I have missed that comment.

@vibhatha
Copy link
Collaborator Author

@nealrichardson the CI is failing here: https://github.com/apache/arrow/runs/7029848711?check_suite_focus=true#step:5:1040

Error in linters_with_defaults(line_length_linter = line_length_linter(120),  : 
  could not find function "linters_with_defaults"
Calls: <Anonymous> -> read_settings -> get_setting -> eval -> eval

@nealrichardson
Copy link
Member

@nealrichardson the CI is failing here: https://github.com/apache/arrow/runs/7029848711?check_suite_focus=true#step:5:1040

Error in linters_with_defaults(line_length_linter = line_length_linter(120),  : 
  could not find function "linters_with_defaults"
Calls: <Anonymous> -> read_settings -> get_setting -> eval -> eval

I backed out the lintr changes (they were just cleaning up warnings that show on the latest release of lintr, but we have pinned an old version elsewhere apparently). Will handle in ARROW-16900.

@vibhatha
Copy link
Collaborator Author

@nealrichardson the CI is failing here: https://github.com/apache/arrow/runs/7029848711?check_suite_focus=true#step:5:1040

Error in linters_with_defaults(line_length_linter = line_length_linter(120),  : 
  could not find function "linters_with_defaults"
Calls: <Anonymous> -> read_settings -> get_setting -> eval -> eval

I backed out the lintr changes (they were just cleaning up warnings that show on the latest release of lintr, but we have pinned an old version elsewhere apparently). Will handle in ARROW-16900.

Thanks @nealrichardson

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this cleanup.

@westonpace westonpace merged commit bb67f8d into apache:master Jun 27, 2022
@vibhatha
Copy link
Collaborator Author

Appreciate the support !.

djnavarro pushed a commit to djnavarro/arrow that referenced this pull request Jun 28, 2022
…pache#13150)

This PR is simplifying the existing `AggregateNodeOptions` usage. This work is still in progress and need to evaluate the existing refactor and usage. 

Todos

- [x] Test 
- [ ] Update documentation
- [ ] Update function docs
- [x] Evaluate CI failures (only tested on Mac M1 with C++/Python, need to check if the change breaks other language bindings


Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@assignUser
Copy link
Member

assignUser commented Jul 4, 2022

@github-actions crossbow submit verify-rc-source-cpp-macos-arm64 wheel-macos-big-sur-cp38-arm64

@github-actions
Copy link

github-actions bot commented Jul 4, 2022

Revision: 0a09d22

Submitted crossbow builds: ursacomputing/crossbow @ actions-17830e587b

Task Status
verify-rc-source-cpp-macos-arm64 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions

vibhatha added a commit to vibhatha/arrow that referenced this pull request Jul 5, 2022
…pache#13150)

This PR is simplifying the existing `AggregateNodeOptions` usage. This work is still in progress and need to evaluate the existing refactor and usage. 

Todos

- [x] Test 
- [ ] Update documentation
- [ ] Update function docs
- [x] Evaluate CI failures (only tested on Mac M1 with C++/Python, need to check if the change breaks other language bindings


Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants