Fix ExpressionVirtualColumn capabilities; fix groupBy's improper uses of StorageAdapter#getColumnCapabilities. #8013
1) A usage in "isArrayAggregateApplicable" that could incorrectly use array-based aggregation on a virtual column that shadows a real column.
2) A usage in "process" that could use the more expensive multi-value aggregation path on a singly-valued virtual column. (No correctness issue, but a performance issue.)
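To illustrate the shadowing problem in item 1, here is a minimal sketch. It is not Druid code: `Caps`, `ShadowingSketch`, `physicalOnly`, `virtualFirst`, and `arrayAggregationEligible` are hypothetical names, and the eligibility rule (a single-valued STRING column) is a simplified assumption. The point is only that a lookup consulting physical columns alone returns the capabilities of the shadowed real column, while a lookup that checks virtual columns first does not.

```java
import java.util.Map;

/** Hypothetical stand-in for column capabilities; only the fields discussed here. */
final class Caps {
    final String type;
    final boolean multiValued;

    Caps(String type, boolean multiValued) {
        this.type = type;
        this.multiValued = multiValued;
    }
}

final class ShadowingSketch {
    /** Consults only the physical columns, ignoring virtual columns entirely. */
    static Caps physicalOnly(Map<String, Caps> physical, String column) {
        return physical.get(column);
    }

    /** Lets a virtual column take precedence over a physical column of the same name. */
    static Caps virtualFirst(Map<String, Caps> virtual, Map<String, Caps> physical, String column) {
        Caps v = virtual.get(column);
        return v != null ? v : physical.get(column);
    }

    /** Simplified, assumed rule: array-based aggregation needs a single-valued STRING column. */
    static boolean arrayAggregationEligible(Caps caps) {
        return caps != null && "STRING".equals(caps.type) && !caps.multiValued;
    }

    public static void main(String[] args) {
        // A physical STRING column "country" is shadowed by a LONG virtual column of the same name.
        Map<String, Caps> physical = Map.of("country", new Caps("STRING", false));
        Map<String, Caps> virtual = Map.of("country", new Caps("LONG", false));

        // Physical-only lookup sees the real column and wrongly reports eligibility.
        System.out.println(arrayAggregationEligible(physicalOnly(physical, "country"))); // true
        // Virtual-first lookup sees the column the query will actually read.
        System.out.println(arrayAggregationEligible(virtualFirst(virtual, physical, "country"))); // false
    }
}
```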
Thanks for fixing this. +1 after CI.
@gianm I think this is failing UT.
It looks like the tests were failing because ExpressionVirtualColumn always set its capabilities to be singly-valued, which is a bug: since #7588, expression virtual columns might be multi-valued. However, that bug probably went undetected because it was masked by this one (which prevents groupBy from using its all-singly-valued-dimension optimization if some of the columns involved are virtual columns). I pushed a fix for the ExpressionVirtualColumn issue and updated the top comment. In this fix I just set the multi-value flag to always be "true". This isn't ideal, since it means singly-valued optimizations won't work on top of it, but I didn't see an easy way for the ExpressionVirtualColumn to determine upfront if it will be singly-valued or not. I think this should be possible in the future as we add more upfront type info to the expression system, so I added a comment saying as much. /cc @clintropolis
👍 I think this makes sense for now. I will follow this up with a fix that lets us determine when a single input column will produce a scalar or array output, so we can have this optimization again where possible.
🤘
… of StorageAdapter#getColumnCapabilities. (apache#8013)
* GroupBy: Fix improper uses of StorageAdapter#getColumnCapabilities.
  1) A usage in "isArrayAggregateApplicable" that would potentially incorrectly use array-based aggregation on a virtual column that shadows a real column.
  2) A usage in "process" that would potentially use the more expensive multi-value aggregation path on a singly-valued virtual column. (No correctness issue, but a performance issue.)
* Add addl javadoc.
* ExpressionVirtualColumn: Set multi-value flag.
Also makes ExpressionVirtualColumn always report that it is multi-valued. Previously,
it always set its capabilities to be singly-valued, which has been a bug since #7588,
because it might actually be multi-valued.
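The trade-off of always reporting multi-valued can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual groupBy engine: `ColumnInfo`, `GroupByPathSketch`, `expressionVirtualColumn`, and `choosePath` are hypothetical names. It shows that a single dimension flagged as possibly multi-valued is enough to force the more expensive path, which is why the conservative flag is safe but gives up the singly-valued optimization.

```java
import java.util.List;

/** Hypothetical description of a grouping dimension; not a Druid class. */
final class ColumnInfo {
    final String name;
    final boolean hasMultipleValues;

    ColumnInfo(String name, boolean hasMultipleValues) {
        this.name = name;
        this.hasMultipleValues = hasMultipleValues;
    }
}

final class GroupByPathSketch {
    /** Conservative rule from this change: an expression virtual column always reports multi-valued. */
    static ColumnInfo expressionVirtualColumn(String name) {
        return new ColumnInfo(name, true);
    }

    /** The cheaper path is only safe when every dimension is known to be singly-valued. */
    static boolean allSingleValued(List<ColumnInfo> dimensions) {
        for (ColumnInfo d : dimensions) {
            if (d.hasMultipleValues) {
                return false;
            }
        }
        return true;
    }

    /** Returns a label for the aggregation path that would be chosen (assumed labels). */
    static String choosePath(List<ColumnInfo> dimensions) {
        return allSingleValued(dimensions) ? "single-value" : "multi-value";
    }

    public static void main(String[] args) {
        ColumnInfo real = new ColumnInfo("country", false);
        ColumnInfo expr = expressionVirtualColumn("v0");

        System.out.println(choosePath(List.of(real)));       // single-value
        System.out.println(choosePath(List.of(real, expr))); // multi-value
    }
}
```

Correctness is preserved either way; the cost is that queries grouping on expression virtual columns cannot take the cheaper path until the expression system can report its output type upfront.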