GH-35363: [C++] Fix Substrait schema names and for segmented aggregation #35364

rtpsw · 2023-04-28T05:47:10Z

See #35363

Closes: [C++] Fix Substrait schema names and for segmented aggregation #35363

github-actions · 2023-04-28T05:47:34Z

⚠️ GitHub issue #35363 has been automatically assigned in GitHub to PR creator.

rtpsw · 2023-04-28T10:51:52Z

cc @westonpace, @icexelloss

cpp/src/arrow/engine/substrait/options.cc

icexelloss · 2023-04-28T18:07:51Z

cpp/src/arrow/engine/substrait/serde_test.cc

@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
            }]
          }
        },
-        "names": ["A", "B", "C", "D"]


Why this change?

This seems wrong. The plan has 4 output columns:
direct references 0, 1, 2 and a new column with a scalar function.

Without this change, this check raises. IIUC, the plan has 3 outputs because its root is a projection with 3 emit indices.

These names here (A, B, C) don't match the "output_schema" below (A, A1, A+1). Is this expected?

(Not that I expect your code change to have anything to do with this, but it looks quite confusing to me, so I asked.)

Right, not related to my changes here. The invocations look like CheckRoundTripResult -> AssertTablesEqualIgnoringOrder -> AssertTablesEqual, and the latter (surprisingly, I guess) ignores column names for some reason.
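The check discussed above compares the number of names on the plan root with the number of columns the root relation produces. Below is a minimal sketch of that idea in plain C++. `CheckOutputNames` is a hypothetical helper, not Arrow's API, and it assumes a flat output schema with no nested struct fields.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; this is not Arrow's actual check.
// Returns an error message when the number of root names differs from the
// number of columns produced by the root relation, and std::nullopt otherwise.
// Assumption: the output schema is flat (no nested struct fields).
std::optional<std::string> CheckOutputNames(
    std::size_t num_output_columns, const std::vector<std::string>& names) {
  if (names.size() != num_output_columns) {
    return "plan root has " + std::to_string(names.size()) +
           " names but the root relation produces " +
           std::to_string(num_output_columns) + " columns";
  }
  return std::nullopt;
}
```

With a projection that emits 3 columns, the names A, B, C pass this check, while A, B, C, D produce an error message, which matches the reason given for the test change.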

icexelloss · 2023-04-28T18:09:53Z

cpp/src/arrow/engine/substrait/options.cc

@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
      ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                agg_measure, ext_set, conv_opts,
                                                /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;


Why do we need this logic?

Without this logic, the aggregate schema ends up with an empty name for each aggregate. This is because the aggregate decoding is returning an aggregate with an empty name, which in turn is likely because it has no access to the input schema against which to resolve the field reference (*arg_ref). The code here does have access to this schema and resolves the field reference.
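As a rough illustration of the naming fallback described here, the sketch below gives each unnamed aggregate a name equal to its function name. `AggregateSketch` and `FillDefaultNames` are hypothetical stand-ins, not Arrow types or functions. Unlike the one-line diff above, which assigns the name unconditionally, this variation only fills names that are empty.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a decoded aggregate; the real Arrow type differs.
struct AggregateSketch {
  std::string function;  // aggregate function name, e.g. "sum"
  std::string name;      // output column name; may be empty after decoding
};

// Give every unnamed aggregate a fallback name equal to its function name.
// Aggregates that already carry a name are left untouched.
void FillDefaultNames(std::vector<AggregateSketch>& aggregates) {
  for (auto& aggregate : aggregates) {
    if (aggregate.name.empty()) {
      aggregate.name = aggregate.function;
    }
  }
}
```

Applied to an aggregate with function "sum" and an empty name, the output column would be called "sum" instead of being left blank.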

Without this logic, the aggregate schema ends up with an empty name for each aggregate

While an empty name is not great, I wonder if it actually breaks anything. Substrait doesn't use names to refer to intermediate columns, so IIUC the names here don't matter for correctness (although I agree meaningful names are better for debugging).

Does any code break without this change?

I think you're right and this is mostly useful in debugging. However, note that the schema produced by the pre-PR code is inconsistent: aggregate columns have empty names while other columns have "normal" names (derived from the input). So this change makes sense at least for consistency.

Does any code break without this change?

No. However, after adding this logic I needed to change one test case that checks the aggregate column name. There must be a reason for the existence of this test case, but I don't know it. @westonpace ?

@rtpsw I am happy to merge the code that fixes the segmented aggregation bug (the segment key was not passed) and the change that adds validation of the number of output names. However, for the internal aggregation names change, we would need to validate that this naming is consistent with the other aggregation code paths and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure), so there is more work. I would suggest reverting that change for now so we can move forward with the bug fix faster, since the naming change seems cosmetic.

icexelloss · 2023-04-28T22:17:10Z

@rtpsw LGTM. Can you fix the lint error?

rtpsw · 2023-05-01T08:28:26Z

@westonpace, @icexelloss: is this good to go?

icexelloss · 2023-05-01T12:45:29Z

Thanks @rtpsw - LGTM.

icexelloss

LGTM

ursabot · 2023-05-01T18:24:22Z

Benchmark runs are scheduled for baseline = 41adf00 and contender = 3b48834. 3b48834 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.81% ⬆️0.18%] test-mac-arm
[Failed ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.72% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3b488349 ec2-t3-xlarge-us-east-2
[Failed] 3b488349 test-mac-arm
[Failed] 3b488349 ursa-i9-9960x
[Finished] 3b488349 ursa-thinkcentre-m75q
[Finished] 41adf006 ec2-t3-xlarge-us-east-2
[Failed] 41adf006 test-mac-arm
[Failed] 41adf006 ursa-i9-9960x
[Finished] 41adf006 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…gregation (apache#35364) See apache#35363 * Closes: apache#35363 Authored-by: Yaron Gvili <rtpsw@hotmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com>

apacheGH-35363: [C++] Minor code cleanup

cc792f3

rtpsw requested a review from westonpace as a code owner April 28, 2023 05:47

github-actions bot added the Component: C++ and awaiting review labels Apr 28, 2023

apacheGH-35363: [C++] Fix Substrait schema names and for segmented aggregation

2098885

rtpsw changed the title ~~GH-35363: [C++] Minor code cleanup~~ GH-35363: [C++] Fix Substrait schema names and for segmented aggregation Apr 28, 2023

icexelloss reviewed Apr 28, 2023

github-actions bot added the awaiting changes label and removed the awaiting review label Apr 28, 2023

icexelloss reviewed Apr 28, 2023

consistent name in segmented and regular aggregates

3c84123

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Apr 28, 2023

revert name fixes

529ec68

github-actions bot added the awaiting change review label and removed the awaiting changes label Apr 28, 2023

add back output number validation

b2acc6b

lint

b817b46

icexelloss merged commit 3b48834 into apache:main May 1, 2023

icexelloss reviewed May 1, 2023

github-actions bot removed the awaiting change review label May 1, 2023

github-actions bot added the awaiting changes label May 1, 2023

ElenaHenderson mentioned this pull request May 1, 2023

[CI] Some TPCH benchmarks started failing #35383

Closed

rtpsw deleted the GH-35363 branch May 9, 2023 09:04
