GH-35363: [C++] Fix Substrait schema names and for segmented aggregation #35364
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
}]
}
},
"names": ["A", "B", "C", "D"]
Why this change?
This seems wrong. The plan has 4 output columns: direct references 0, 1, 2 and a new column computed by a scalar function.
Without this change, this check raises. IIUC, the plan has 3 outputs because its root is a projection with 3 emit indices.
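The counting argument in this comment can be sketched as follows. This is a minimal, self-contained illustration and not Arrow's actual implementation: `OutputColumnCount` and `NamesMatchOutput` are hypothetical helpers. It assumes a flat (non-nested) schema and the Substrait convention that a projection without an emit outputs the input fields followed by the new expressions, while a projection with an emit outputs exactly one column per emit index.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: number of columns a projection outputs.
// Without an emit mapping: all input fields plus all expressions.
// With an emit mapping: exactly one column per emit index.
std::size_t OutputColumnCount(std::size_t num_input_fields,
                              std::size_t num_expressions,
                              const std::optional<std::vector<int>>& emit) {
  if (emit.has_value()) {
    return emit->size();
  }
  return num_input_fields + num_expressions;
}

// Hypothetical check: for a flat schema, the root "names" list must
// contain exactly one name per output column.
bool NamesMatchOutput(const std::vector<std::string>& names,
                      std::size_t num_output_columns) {
  return names.size() == num_output_columns;
}
```

With three emit indices the count is 3 regardless of how many input fields or expressions the projection has, so a four-entry names list would fail the check.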
I see
The names here (A, B, C) don't match the "output_schema" below (A, A1, A+1). Is this expected?
(Not that I expect your code change has anything to do with this, but it looks quite confusing to me, so I asked.)
Right, not related to my changes here. The invocations look like CheckRoundTripResult -> AssertTablesEqualIgnoringOrder -> AssertTablesEqual, and the latter (surprisingly, I guess) ignores column names for some reason.
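To illustrate why such a test can pass despite mismatched names, here is a toy sketch of a comparison that looks only at column data. It is not the Arrow test helper: `ToyTable`, `DataEqualIgnoringNames`, and `DataAndNamesEqual` are hypothetical names introduced for this example only.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a table: parallel vectors of column names and column data.
struct ToyTable {
  std::vector<std::string> names;
  std::vector<std::vector<int>> columns;
};

// Compares column data only; names are deliberately not inspected.
bool DataEqualIgnoringNames(const ToyTable& a, const ToyTable& b) {
  return a.columns == b.columns;
}

// Stricter variant that also requires matching column names.
bool DataAndNamesEqual(const ToyTable& a, const ToyTable& b) {
  return a.names == b.names && a.columns == b.columns;
}
```

Two tables holding the same data under the names (A, B, C) and (A, A1, A+1) compare equal under the first function and unequal under the second.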
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
agg_measure, ext_set, conv_opts,
/*is_hash=*/!keys.empty(), input_schema));
aggregate.name = aggregate.function;
Why do we need this logic?
Without this logic, the aggregate schema ends up with an empty name for each aggregate. This is because the aggregate decoding is returning an aggregate with an empty name, which in turn is likely because it has no access to the input schema against which to resolve the field reference (*arg_ref). The code here does have access to this schema and resolves the field reference.
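The idea of resolving a field reference against the input schema to fill in an empty name could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ToyAggregate` and `ResolveAggregateName` are hypothetical, and the naming rule (reuse the argument field's name) is not necessarily the one used in this PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a decoded aggregate; not Arrow's Aggregate type.
struct ToyAggregate {
  std::string function;   // e.g. "sum"
  std::size_t arg_index;  // index of the argument field in the input schema
  std::string name;       // output column name; empty after decoding
};

// Fills in an empty output name by resolving the argument's field index
// against the input schema's field names. The naming rule (reuse the
// argument field's name) is an assumption made for illustration only.
void ResolveAggregateName(const std::vector<std::string>& input_field_names,
                          ToyAggregate& aggregate) {
  if (!aggregate.name.empty()) return;  // keep a name that is already set
  if (aggregate.arg_index < input_field_names.size()) {
    aggregate.name = input_field_names[aggregate.arg_index];
  }
}
```

An out-of-range index leaves the name empty, which mirrors the situation where no schema is available to resolve against.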
Without this logic, the aggregate schema ends up with an empty name for each aggregate
While an empty name is not great, I wonder if it actually breaks anything. Substrait doesn't use names to refer to intermediate columns, so IIUC the names here don't matter for correctness (although I agree meaningful names are better for debugging).
Does any code break without this change?
I think you're right and this is mostly useful in debugging. However, note that the schema produced by the pre-PR code is inconsistent: aggregate columns have empty names while other columns have "normal" names (derived from the input). So, this change makes sense at least for consistency.
Does any code break without this change?
No. However, after adding this logic I needed to change one test case that checks the aggregate column name. There must be a reason for the existence of this test case, but I don't know it. @westonpace?
@rtpsw I am happy to merge the code that fixes the segmented aggregation bug (it didn't pass the segmented key) and the change that adds validation of the number of output names. However, for the internal aggregation names change, we would need to validate that this naming is consistent with the other aggregation code paths and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure), so there is more work. I would suggest reverting that change for now so we can move forward with the bug fix faster, since the naming change seems cosmetic.
Done.
@rtpsw LGTM. Can you fix the lint error?
@westonpace, @icexelloss: is this good to go?
Thanks @rtpsw - LGTM.
LGTM
Benchmark runs are scheduled for baseline = 41adf00 and contender = 3b48834. 3b48834 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…gregation (apache#35364) See apache#35363 * Closes: apache#35363 Authored-by: Yaron Gvili <rtpsw@hotmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com>
See #35363