Fail if inputs contain nested nulls in min/max/min_by/max_by Presto aggregates #6723
Conversation
@duanmeng Thank you for working on this fix. Some comments below. Also, would it be possible to add a test that triggers the new check?
@duanmeng Let's also update documentation for min/max to explain that this function doesn't allow input values of complex types that contain nulls.
@mbasmanova Sure, done.
@mbasmanova I've adjusted the complex-type tests to a pattern that would not throw an exception without this patch:

```cpp
auto data = makeRowVector({
    makeNullableMapVector<int64_t, int64_t>({
        {{{1, 1}, {2, 2}}},
        {{{2, std::nullopt}, {2, 3}}},
        {{{4, 50}}},
    }),
});

VELOX_ASSERT_THROW(
    testAggregations({data}, {}, {"min(c0)", "max(c0)"}, {expected}),
    "MAP comparison not supported for values that contain nulls");
```
Some comments. CC: @kagamiori
Somehow I don't see these changes. Would you double check? BTW, in a follow-up we are going to remove support for MAP inputs altogether. Hence, it might be better to use ARRAY or ROW for these tests.
@mbasmanova Hi Masha, I've resolved all the comments according to your guidance.
Looks great, modulo a couple of minor comments.
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova merged this pull request in 05386f6.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
@duanmeng It looks like there is one more place that needs updating. NonNumericMinMaxAggregateBase::toIntermediate also needs to check complex type inputs and throw if there are nested nulls. Any chance you could help make this change?
@mbasmanova Got it, I will fix this. I should read the doc more carefully. https://facebookincubator.github.io/velox/develop/aggregate-functions.html#:~:text=By%20default%2C%20to,the%20input%20unmodified.
@mbasmanova Hi Masha,
Do I understand correctly?
@duanmeng Thanks. You got it perfectly. This is exactly what happens.
Summary:
`toIntermediate` is an adaptive optimization that abandons partial aggregation when it achieves no meaningful cardinality reduction. For min/max aggregates, `toIntermediate` acts as follows:

- Raw data in, raw data out (sometimes with dictionary encoding).
- It happens in step `Step::kPartial`.
- Its result is the input of a later step, `Step::kIntermediate` or `Step::kFinal`, which does not check nested nulls of the input, so the rest of the raw input is never null-checked.

This is a follow-up to #6723. `NonNumericMinMaxAggregateBase::toIntermediate` also needs to check complex-type inputs and throw if there are nested nulls.

Pull Request resolved: #6790
Reviewed By: xiaoxmeng
Differential Revision: D49831258
Pulled By: mbasmanova
fbshipit-source-id: a6f1dd2d8e72faedbe75ff1f48d0f24fbcf63248
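The pass-through path described above can be modeled with a minimal sketch. This is hypothetical code, not Velox's actual API: `ArrayValue`, `containsNestedNull`, and `toIntermediate` here are illustrative stand-ins showing how the fast path forwards raw values unchanged and why the fix adds a nested-null check before forwarding.

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical model (names are illustrative, not Velox's API) of the
// toIntermediate fast path: when partial aggregation is abandoned, raw
// complex-type values are passed through unchanged as intermediate results.
// Without a check here, nested nulls would skip validation entirely.
using ArrayValue = std::vector<std::optional<int>>;

bool containsNestedNull(const ArrayValue& value) {
  for (const auto& e : value) {
    if (!e.has_value()) {
      return true;
    }
  }
  return false;
}

// Pass raw inputs through as intermediate results, rejecting nested nulls.
std::vector<ArrayValue> toIntermediate(const std::vector<ArrayValue>& input) {
  for (const auto& value : input) {
    if (containsNestedNull(value)) {
      throw std::runtime_error(
          "ARRAY comparison not supported for arrays with null elements");
    }
  }
  return input; // raw data in, raw data out
}

// Helper for checking the throwing behavior.
bool toIntermediateThrows(const std::vector<ArrayValue>& input) {
  try {
    toIntermediate(input);
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

With the check in place, a nested null fails at `Step::kPartial` regardless of whether a later step would ever compare it.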
…ggregates (facebookincubator#6723)

Summary:
In terms of min/max, processing a single stream of values [1, 2], [2, 3], [3, null], [3, 4] goes like this:

- acc = [1, 2]
- compare(acc, [2, 3]) -> keep acc = [1, 2]
- compare(acc, [3, null]) -> detect that 1 < 3 and do not continue comparing 2 with null -> keep acc = [1, 2]

But processing 2 streams, ([1, 2], [2, 3]) and ([3, null], [3, 4]), goes differently. When processing the 2nd stream:

- acc = [3, null]
- compare(acc, [3, 4]) triggers the comparison null vs. 4 and fails.

For example,

```SQL
presto> select min(col0) from (values (array [1, 2]), (array [2, 3]), (array [3, null]), (array [3, 4])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)
```

But it will fail if I break them into two aggregates,

```SQL
presto> select min(col0) from (values (array [1, 2]), (array [2, 3])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)

presto> select min(col0) from (values (array [3, null]), (array [3, 4])) as tbl(col0);
Query 20230919_044628_00012_6yp5g failed: ARRAY comparison not supported for arrays with null elements
```

This behavior is problematic because when a query runs on a cluster it can go either way; the same query may either succeed or fail, and failures are intermittent and non-deterministic.

We need to make the query always fail, so the behavior is deterministic and semantically clear. This PR adds a check that all inputs to min/max (and other similar functions) do not contain nested nulls before serializing them to the accumulator.

Part of facebookincubator#6314

Pull Request resolved: facebookincubator#6723
Reviewed By: xiaoxmeng
Differential Revision: D49674252
Pulled By: mbasmanova
fbshipit-source-id: 37c71bf8a60471bca0dcb31a53042b417d9e0a77
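The order-dependence of the failure comes from short-circuiting lexicographic comparison: a null element is only "seen" if the comparison actually reaches it. A minimal sketch (illustrative only, not the Velox implementation) makes this concrete:

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

// Illustrative sketch of lexicographic "less than" over arrays with nullable
// elements. Comparison stops at the first differing element, so a null only
// triggers an error if the comparison reaches its position.
using ArrayValue = std::vector<std::optional<int>>;

bool lessThan(const ArrayValue& a, const ArrayValue& b) {
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    if (!a[i].has_value() || !b[i].has_value()) {
      throw std::runtime_error(
          "ARRAY comparison not supported for arrays with null elements");
    }
    if (*a[i] != *b[i]) {
      return *a[i] < *b[i]; // decided here; later elements never examined
    }
  }
  return a.size() < b.size();
}

// Helper for checking the throwing behavior.
bool compareThrows(const ArrayValue& a, const ArrayValue& b) {
  try {
    lessThan(a, b);
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

Comparing [1, 2] with [3, null] succeeds because 1 < 3 decides the result before the null is reached, while comparing [3, null] with [3, 4] must compare null with 4 and throws. Which case occurs depends entirely on how the input is split into streams, which is why the commit adds an unconditional nested-null check on input instead.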
…y Presto aggregates (#6991)

Summary:
`min_by`/`max_by` suffer from the same issue described in #6723. For example,

```SQL
presto> select min_by(col0, col1) from (values (1, array [1, 2]), (2, array [2, 3]), (3, array [3, null]), (4, array [3, 4])) as tbl(col0, col1);
 _col0
-------
 1

presto> select min_by(col0, col1) from (values (1, array [1, 2]), (2, array [2, 3])) as tbl(col0, col1);
 _col0
-------
 1

presto> select min_by(col0, col1) from (values (1, array [3, null]), (2, array [3, 4])) as tbl(col0, col1);
Query 20231010_190419_00057_2vkwe failed: ARRAY comparison not supported for arrays with null elements
```

This PR checks for nested nulls in the comparison values when the comparison type is a complex type. Similar to #6723, this is part of #6314.

Pull Request resolved: #6991
Reviewed By: Yuhta
Differential Revision: D50211257
Pulled By: mbasmanova
fbshipit-source-id: 25b5840c2bb1e3c325652a5d11bddb25406ed75a
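For `min_by`/`max_by`, the comparison happens on the key while the returned value is a separate column, so the nested-null check applies to the key. A hypothetical sketch (the names `Key`, `checkNoNestedNulls`, and `minBy` are illustrative, not Velox's API) of this shape:

```cpp
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical min_by over (value, key) pairs where the key is an array with
// nullable elements. The fix corresponds to checking every key for nested
// nulls up front, so failure no longer depends on comparison order.
using Key = std::vector<std::optional<int>>;

void checkNoNestedNulls(const Key& key) {
  for (const auto& e : key) {
    if (!e.has_value()) {
      throw std::runtime_error(
          "ARRAY comparison not supported for arrays with null elements");
    }
  }
}

int minBy(const std::vector<std::pair<int, Key>>& rows) {
  bool first = true;
  int result = 0;
  Key minKey;
  for (const auto& [value, key] : rows) {
    checkNoNestedNulls(key); // throws deterministically, for every input row
    // Keys are null-free past the check; std::vector/std::optional provide
    // lexicographic ordering.
    if (first || key < minKey) {
      minKey = key;
      result = value;
      first = false;
    }
  }
  return result;
}

// Helper for checking the throwing behavior.
bool minByThrows(const std::vector<std::pair<int, Key>>& rows) {
  try {
    minBy(rows);
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

With the up-front check, the two-stream split shown in the SQL examples above fails on both streams instead of only on the one whose accumulator happens to hit the null during comparison.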
@duanmeng Meng, do you know if we have created an issue about this in the PrestoDB repo?
@mbasmanova Hi Masha, I think we do not. I just scanned issue #6314 and found no PrestoDB issue linked to it.