Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35363: [C++] Fix Substrait schema names and for segmented aggregation #35364

Merged
merged 6 commits into from
May 1, 2023

Conversation

rtpsw
Copy link
Contributor

@rtpsw rtpsw commented Apr 28, 2023

@rtpsw rtpsw requested a review from westonpace as a code owner April 28, 2023 05:47
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #35363 has been automatically assigned in GitHub to PR creator.

@rtpsw rtpsw changed the title GH-35363: [C++] Minor code cleanup GH-35363: [C++] Fix Substrait schema names and for segmented aggregation Apr 28, 2023
@rtpsw
Copy link
Contributor Author

rtpsw commented Apr 28, 2023

cc @westonpace, @icexelloss

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Apr 28, 2023
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
}]
}
},
"names": ["A", "B", "C", "D"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. The plan has 4 output columns:
direct reference 0, 1, 2 and a new column with scalar Function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, this check raises. IIUC, the plan has 3 outputs because its root is a projection with 3 emit indices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names here (A, B, C) doesn't match the “output_schema" below (A, A1, A+1). Is this expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not that I expect your code change has anything to do with this, but looks quite confusing to me so I asked)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, not related to my changes herer. The invocations look like CheckRoundTripResult -> AssertTablesEqualIgnoringOrder -> AssertTablesEqual, and the latter (surprisingly, I guess) ignores column names for some reason.

@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
agg_measure, ext_set, conv_opts,
/*is_hash=*/!keys.empty(), input_schema));
aggregate.name = aggregate.function;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this logic, the aggregate schema ends up with an empty name for each aggregate. This is because the aggregate decoding is returning an aggregate with an empty name, which in turn is likely because it has no access to the input schema against which to resolve the field reference (*arg_ref). The code here does have access to this schema and resolves the field reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this logic, the aggregate schema ends up with an empty name for each aggregate

While empty name is not great. I wonder if it actually breaks anything - substrait doesn't use names to refer to intermediate columns so IIUC the names here doesn't matter for correctness. (Although I agree meanful names are better for debug).

Does any code breaks without this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right and this is mostly useful in debugging. However, note that the schema produced by the pre-PR code is inconsistent - aggregate columns have empty names while other column have "normal" names (derived from the input). So, this change make sense at least for consistency.

Does any code breaks without this change?

No. However, after adding this logic I needed to change one test case that checks the aggregate column name. There must be a reason for the existence of this test case, but I don't know it. @westonpace ?

Copy link
Contributor

@icexelloss icexelloss Apr 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rtpsw I am happy to merge the code that fixes the the segmented aggregation bug (didn't pass the segmented key). and the change that addes validation of number of output names However for the internal aggregation names change we would need validate that this naming is consistent with other aggregation codepath and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure), so there is more work. I would suggest revert that change for now so we can move forward with the bug fix faster. (Since the naming change seems cosmetic)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 28, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 28, 2023
@icexelloss
Copy link
Contributor

@rtpsw LGTM. Can you fix the lint error?

@rtpsw
Copy link
Contributor Author

rtpsw commented May 1, 2023

@westonpace, @icexelloss: is this good to go?

@icexelloss icexelloss merged commit 3b48834 into apache:main May 1, 2023
@icexelloss
Copy link
Contributor

Thanks @rtpsw - LGTM.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot removed the awaiting change review Awaiting change review label May 1, 2023
@github-actions github-actions bot added the awaiting changes Awaiting changes label May 1, 2023
@ursabot
Copy link

ursabot commented May 1, 2023

Benchmark runs are scheduled for baseline = 41adf00 and contender = 3b48834. 3b48834 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.81% ⬆️0.18%] test-mac-arm
[Failed ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.72% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3b488349 ec2-t3-xlarge-us-east-2
[Failed] 3b488349 test-mac-arm
[Failed] 3b488349 ursa-i9-9960x
[Finished] 3b488349 ursa-thinkcentre-m75q
[Finished] 41adf006 ec2-t3-xlarge-us-east-2
[Failed] 41adf006 test-mac-arm
[Failed] 41adf006 ursa-i9-9960x
[Finished] 41adf006 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@rtpsw rtpsw deleted the GH-35363 branch May 9, 2023 09:04
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…gregation (apache#35364)

See apache#35363
* Closes: apache#35363

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Li Jin <ice.xelloss@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…gregation (apache#35364)

See apache#35363
* Closes: apache#35363

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Li Jin <ice.xelloss@gmail.com>
rtpsw added a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…gregation (apache#35364)

See apache#35363
* Closes: apache#35363

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Li Jin <ice.xelloss@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Fix Substrait schema names and for segmented aggregation
3 participants