Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail if inputs contain nested nulls in min/max/min_by/max_by Presto aggregates #6723

Closed
wants to merge 2 commits into from

Conversation

duanmeng
Copy link
Collaborator

@duanmeng duanmeng commented Sep 24, 2023

In terms of min/max, processing a single stream of values [1, 2], [2, 3], [3, null], [3, 4] goes like this:

  • acc = [1, 2]
  • compare(acc, [2, 3]) -> keep acc = [1, 2]
  • compare (acc, [3, null]) -> detect that 1 < 3 and do not continue comparing 2 with null - > keep acc = [1, 2]

But processing 2 streams: ([1, 2], [2, 3]) and ([3, null], [3, 4]) goes differently. When processing 2nd stream:

  • acc = [3, null]
  • compare (acc, [3, 4]) triggers comparison null v. 4 and fails.

For example,

presto> select min(col0) from (values (array [1, 2]), (array [2, 3]), (array [3, null]), (array [3, 4])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)

But it will fail if I break them into two aggregates,

presto> select min(col0) from (values (array [1, 2]), (array [2, 3])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)

presto> select min(col0) from (values (array [3, null]), (array [3, 4])) as tbl(col0);
Query 20230919_044628_00012_6yp5g failed: ARRAY comparison not supported for arrays with null elements

This behavior is problematic because when a query runs on a cluster it can go either way and therefore, the same query may either succeed or fail. Failures are intermittent and non-deterministic.

We need to make sure that the query always fails to make the query always fail, to make it more semantically clear. This PR adds a check that all inputs to min/max (and other similar functions) do not contain nested nulls before serializing them to the accumulator.

Part of #6314

@netlify
Copy link

netlify bot commented Sep 24, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 0bbde39
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6513b71114ad8a0008cd34c6

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 24, 2023
@duanmeng duanmeng force-pushed the addCheckNull branch 3 times, most recently from 91eca6f to 96b5549 Compare September 26, 2023 03:29
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duanmeng Thank you for working on this fix. Some comments below. Also, would it be possible to add a test that triggers the new check?

velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duanmeng Let's also update documentation for min/max to explain that this function doesn't allow input values of complex types that contain nulls.

@duanmeng duanmeng force-pushed the addCheckNull branch 2 times, most recently from 4264201 to 35c1d0b Compare September 26, 2023 16:31
@duanmeng
Copy link
Collaborator Author

@duanmeng Let's also update documentation for min/max to explain that this function doesn't allow input values of complex types that contain nulls.

@mbasmanova Sure done.

@duanmeng
Copy link
Collaborator Author

duanmeng commented Sep 26, 2023

@duanmeng Thank you for working on this fix. Some comments below. Also, would it be possible to add a test that triggers the new check?

@mbasmanova I've adjusted the complex-type tests to the pattern that would not throw an exception without this patch.
For example,

auto data = makeRowVector({
      makeNullableMapVector<int64_t, int64_t>({
          {{{1, 1}, {2, 2}}},
          {{{2, std::nullopt}, {2, 3}}},
          {{{4, 50}}},
      }),
  });

VELOX_ASSERT_THROW(
      testAggregations({data}, {}, {"min(c0)", "max(c0)"}, {expected}),
      "MAP comparison not supported for values that contain nulls");

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. CC: @kagamiori

velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
@mbasmanova
Copy link
Contributor

I've adjusted the complex-type tests to the pattern that would not throw an exception without this patch.

Somehow I don't see these changes. Would you double check?

BTW, in a follow-up we are going to remove support for MAP inputs altogether. Hence, it might better to use ARRAY or ROW for these tests.

@mbasmanova mbasmanova changed the title Check nested null of inputs of min/max Presto aggregates Fail is inputs contain nested nulls in min/max/min_by/max_by Presto aggregates Sep 26, 2023
@duanmeng
Copy link
Collaborator Author

@mbasmanova Hi Masha, I've resolved all the comments according to your guidance.

  • Combine check null and check nested nulls into one function
  • Use isRawInput and throwOnNestedNulls
  • Add two more check null test cases following the principle that would not throw an exception without this patch.

@mbasmanova mbasmanova changed the title Fail is inputs contain nested nulls in min/max/min_by/max_by Presto aggregates Fail if inputs contain nested nulls in min/max/min_by/max_by Presto aggregates Sep 27, 2023
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great % a couple minor comments.

velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/aggregates/MinMaxAggregates.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/aggregates/tests/MinMaxTest.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 05386f6.

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit 05386f6a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@mbasmanova
Copy link
Contributor

@duanmeng It looks like there is one more place that needs updating. NonNumericMinMaxAggregateBase::toIntermediate also needs to check complex type inputs and throw if there are nested nulls. Any chance you could help make this change?

@duanmeng
Copy link
Collaborator Author

@duanmeng It looks like there is one more place that needs updating. NonNumericMinMaxAggregateBase::toIntermediate also needs to check complex type inputs and throw if there are nested nulls. Any chance you could help make this change?

@mbasmanova Got it, I will fix this. I should read the doc more carefully. https://facebookincubator.github.io/velox/develop/aggregate-functions.html#:~:text=By%20default%2C%20to,the%20input%20unmodified.

@duanmeng
Copy link
Collaborator Author

duanmeng commented Sep 28, 2023

@mbasmanova Hi Masha, toIntermediate is a kind of adaptive optimization by abandons partial aggregation that has no meaningful cardinality reduction. For min/max aggregates toIntermediate act as follows,

  • Raw data in raw data out (sometimes with dictionary encoding)
  • It happens in step ofStep::kPartial
  • Its result is the input of another step Step::kIntermediate\Step::kFinal, which would not check nested nulls of the input, causing no null checks for the rest of the raw input.

Do I understand correctly?

@mbasmanova
Copy link
Contributor

@duanmeng Thanks. You got it perfectly. This is exactly what happens.

facebook-github-bot pushed a commit that referenced this pull request Oct 3, 2023
Summary:
`toIntermediate`  is an adaptive optimization to abandon partial
aggregation that has no meaningful cardinality reduction. For min/max
aggregates `toIntermediate` act as follows,

- Raw data in raw data out (sometimes with dictionary encoding)
- It happens in step ofStep::kPartial
- Its result is the input of another step Step::kIntermediate\Step::kFinal,
  which would not check nested nulls of the input, causing no null checks for
  the rest of the raw input.

This is a follow-up to #6723. `NonNumericMinMaxAggregateBase::toIntermediate`
also needs to check complex type inputs and throw if there are nested nulls.

Pull Request resolved: #6790

Reviewed By: xiaoxmeng

Differential Revision: D49831258

Pulled By: mbasmanova

fbshipit-source-id: a6f1dd2d8e72faedbe75ff1f48d0f24fbcf63248
ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
…ggregates (facebookincubator#6723)

Summary:
In terms of min/max, processing a single stream of values [1, 2], [2, 3], [3, null], [3, 4] goes like this:

- acc = [1, 2]
- compare(acc, [2, 3]) -> keep acc = [1, 2]
- compare (acc, [3, null]) -> detect that 1 < 3 and do not continue comparing 2 with null - > keep acc = [1, 2]

But processing 2 streams: ([1, 2], [2, 3]) and ([3, null], [3, 4]) goes differently. When processing 2nd stream:

- acc = [3, null]
- compare (acc, [3, 4]) triggers comparison null v. 4 and fails.

For example,

```SQL
presto> select min(col0) from (values (array [1, 2]), (array [2, 3]), (array [3, null]), (array [3, 4])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)
```
But it will fail if I break them into two aggregates,
```SQL
presto> select min(col0) from (values (array [1, 2]), (array [2, 3])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)

presto> select min(col0) from (values (array [3, null]), (array [3, 4])) as tbl(col0);
Query 20230919_044628_00012_6yp5g failed: ARRAY comparison not supported for arrays with null elements
```

This behavior is problematic because when a query runs on a cluster it can go either way and therefore, the same query may either succeed or fail. Failures are intermittent and non-deterministic.

We need to make sure that the query always fails to make the query always fail, to make it more semantically clear. This PR adds a check that all inputs to min/max (and other similar functions) do not contain nested nulls before serializing them to the accumulator.

Part of facebookincubator#6314

Pull Request resolved: facebookincubator#6723

Reviewed By: xiaoxmeng

Differential Revision: D49674252

Pulled By: mbasmanova

fbshipit-source-id: 37c71bf8a60471bca0dcb31a53042b417d9e0a77
ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
…ookincubator#6790)

Summary:
`toIntermediate`  is an adaptive optimization to abandon partial
aggregation that has no meaningful cardinality reduction. For min/max
aggregates `toIntermediate` act as follows,

- Raw data in raw data out (sometimes with dictionary encoding)
- It happens in step ofStep::kPartial
- Its result is the input of another step Step::kIntermediate\Step::kFinal,
  which would not check nested nulls of the input, causing no null checks for
  the rest of the raw input.

This is a follow-up to facebookincubator#6723. `NonNumericMinMaxAggregateBase::toIntermediate`
also needs to check complex type inputs and throw if there are nested nulls.

Pull Request resolved: facebookincubator#6790

Reviewed By: xiaoxmeng

Differential Revision: D49831258

Pulled By: mbasmanova

fbshipit-source-id: a6f1dd2d8e72faedbe75ff1f48d0f24fbcf63248
facebook-github-bot pushed a commit that referenced this pull request Oct 17, 2023
…y Presto aggregates (#6991)

Summary:
`min_by/max_by` suffers the same issue described in #6723. For example,
```SQL
presto> select min_by(col0, col1) from (values (1, array [1, 2]),
(2, array [2, 3]), (3, array [3, null]), (4, array [3, 4])) as tbl(col0, col1);
 _col0
-------
     1
presto> select min_by(col0,col1) from (values (1, array [1, 2]),
(2, array [2, 3])) as tbl(col0,col1);
 _col0
-------
     1
presto> select min_by(col0,col1) from (values (1, array [3, null]),
(2, array [3, 4])) as tbl(col0,col1);
Query 20231010_190419_00057_2vkwe failed: ARRAY comparison
not supported for arrays with null elements
```
This PR checks nested nulls in compare values when the
compare type is a complex type.

Similar to #6723, part of #6314

Pull Request resolved: #6991

Reviewed By: Yuhta

Differential Revision: D50211257

Pulled By: mbasmanova

fbshipit-source-id: 25b5840c2bb1e3c325652a5d11bddb25406ed75a
@mbasmanova
Copy link
Contributor

@duanmeng Meng, do you know if we have create an issue about this in PrestoDB repo?

@duanmeng
Copy link
Collaborator Author

duanmeng commented Nov 17, 2023

@duanmeng Meng, do you know if we have create an issue about this in PrestoDB repo?

@mbasmanova Hi Masha, I think we do not. I just scanned issue #6314 and found no PrestoDB issue linked to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants