Restrict input argument as orderable in array_sort, array_sort_desc #6928

Closed
wants to merge 9 commits into from

duanmeng
Collaborator

@duanmeng duanmeng commented Oct 6, 2023

In Presto, inputs to array_sort and array_sort_desc are restricted to
orderable types. For example, the MAP type is not orderable.

This PR uses orderableTypeVariable to apply the restriction to array_sort.
The restriction covers the input argument type of the plain array_sort and
the return type of the lambda in array_sort with a transform lambda. It does
not apply to array_sort with a custom comparator.

For more details please check the blog https://velox-lib.io/blog/array-sort/

presto> SELECT array_sort(ARRAY [map(array['a', 'b', 'c'], array[1, 2, 3])]);
...
Expected: array_sort(array(E)) E:orderable, ...

presto> SELECT array_sort(ARRAY [map(array['a', 'b', 'c'], array[1, 2, 3])], 
(x, y) -> IF(cardinality(x) < cardinality(y), -1, IF(cardinality(x) = cardinality(y), 0, 1)));
       _col0
-------------------
 [{a=1, b=2, c=3}]
(1 row)

Part of #6718, resolve #6712
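
The orderable restriction described above can be illustrated with a small self-contained sketch. It uses a toy type model whose names (`ToyType`, `isOrderable`, `acceptsArraySort`) are hypothetical and are not Velox's or Presto's actual classes. It also assumes that a nested type is orderable only if none of its children is a MAP.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy type model for illustration only. These names are hypothetical and are
// not Velox's or Presto's actual classes.
enum class Kind { kBigint, kVarchar, kArray, kMap };

struct ToyType;
using ToyTypePtr = std::shared_ptr<const ToyType>;

struct ToyType {
  Kind kind;
  std::vector<ToyTypePtr> children;
};

ToyTypePtr scalar(Kind kind) {
  assert(kind == Kind::kBigint || kind == Kind::kVarchar);
  return std::make_shared<ToyType>(ToyType{kind, {}});
}

ToyTypePtr arrayOf(ToyTypePtr element) {
  return std::make_shared<ToyType>(
      ToyType{Kind::kArray, {std::move(element)}});
}

ToyTypePtr mapOf(ToyTypePtr key, ToyTypePtr value) {
  return std::make_shared<ToyType>(
      ToyType{Kind::kMap, {std::move(key), std::move(value)}});
}

// Assumption: MAP is never orderable, and a nested type is orderable only if
// all of its children are.
bool isOrderable(const ToyType& type) {
  if (type.kind == Kind::kMap) {
    return false;
  }
  for (const auto& child : type.children) {
    if (!isOrderable(*child)) {
      return false;
    }
  }
  return true;
}

// Models the signature array_sort(array(E)) where E is orderable.
bool acceptsArraySort(const ToyType& argument) {
  return argument.kind == Kind::kArray && isOrderable(*argument.children[0]);
}
```

Under these assumptions, `array(bigint)` is accepted, while `array(map(varchar, bigint))` is rejected at signature resolution, which corresponds to the first query above failing before execution.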

@netlify

netlify bot commented Oct 6, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit cda37e8
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/652e7d26d4ead40008dfa560

@facebook-github-bot added the CLA Signed label Oct 6, 2023

Contributor

@mbasmanova mbasmanova left a comment

Thanks. Looks good % some comments.

Contributor

@mbasmanova mbasmanova left a comment

Thank you for iterating. A couple remaining comments.

@duanmeng duanmeng force-pushed the resSort branch 2 times, most recently from dc7be2f to 3239206 October 6, 2023 08:18

Contributor

@mbasmanova mbasmanova left a comment

Thank you.

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

CI is red. Looks like we need to update AggregationFuzzer to respect "orderable" constraint.

I20231006 10:10:20.467994 202359 AggregationFuzzer.cpp:709] ==============================> Started iteration 18 (seed: 4059922645)
terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
  what():  Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Scalar function signature is not supported: array_sort(ARRAY<MAP<INTERVAL DAY TO SECOND,DOUBLE>>). Supported signatures: (array(T)) -> array(T), (array(T),constant function(T,U)) -> array(T), (array(T),constant function(T,T,bigint)) -> array(T).
Retriable: False
Function: resolveScalarFunctionType
File: ../../velox/parse/TypeResolver.cpp
Line: 99

@duanmeng
Collaborator Author

duanmeng commented Oct 6, 2023

CI is red. Looks like we need to update AggregationFuzzer to respect "orderable" constraint.

@mbasmanova We could add this constraint in ArgumentTypeFuzzer::determineUnboundedTypeVariables,
where the random types are generated.

@duanmeng
Collaborator Author

An alternative solution might be to apply array_sort after converting input to JSON: array_sort(json_format(cast(x as json))).

@mbasmanova This is more appropriate and cleaner (I should think more before coding :) ).
I just checked the result types of all the functions in customVerificationFunctions that use array_sort to transform results (array_agg, set_agg, set_union, map_agg, map_union, map_union_sum, multimap_agg). They are all arrays or maps and can be cast to JSON. I will update this PR to use this approach and rerun the fuzzer test.

PS: array_agg_partial, array_agg_merge, and array_agg_merge_extract seem useless, should we remove them?

@mbasmanova
Contributor

PS: array_agg_partial, array_agg_merge, and array_agg_merge_extract seem useless, should we remove them?

These are so-called companion functions. See #4566

These are not currently used though.

I do see lots of issues when testing companion functions in the fuzzer. I was thinking of modifying the registration APIs to allow excluding all companion functions. CC: @kagamiori

@duanmeng
Collaborator Author

An alternative solution might be to apply array_sort after converting input to JSON: array_sort(json_format(cast(x as json))).

@mbasmanova Hi Masha, it seems we cannot use the JSON type because DuckDB does not support it. I am trying other transform functions; if none of them work, I will fall back to the approach based on a specific function.

terminating with uncaught exception of type std::runtime_error: unsupported type for duckdb -> velox conversion: JSON

@duanmeng changed the title from "Restrict input argument as orderable in array_sort" to "Restrict input argument as orderable in array_sort, array_sort_desc" Oct 14, 2023

@duanmeng
Collaborator Author

@mbasmanova Hi Masha, I've updated this PR with the following changes. Could you please take a look? Thanks.

  • Use internal$array_sort as the name of the internal function ($internal$array_sort is not supported by the duckdb-libpq_query).
  • Add a check of the return type of the transform (comparison) lambda in rewriteArraySortCall.
  • Update the PR description and title to clarify that this change applies to array_sort_desc as well.

@mbasmanova
Contributor

@duanmeng

$internal$array_sort is not supported by the duckdb-libpq_query

You may need to put $internal$array_sort in double quotes: "$internal$array_sort"(x)

@mbasmanova
Contributor

$internal$array_sort

I wonder if maybe a better name for this function would be $internal$canonicalize, because this function transforms an array into canonical form that allows for comparison.

@@ -464,4 +489,12 @@ VELOX_DECLARE_STATEFUL_VECTOR_FUNCTION(
signatures(false),
createDesc);

// Add an internal version of array_sort for used by AggregationFuzzerTest to

Contributor

typos:

An internal function to canonicalize an array to allow for comparisons. Used in AggregationFuzzerTest.

Collaborator Author

Fixed

@@ -122,6 +122,8 @@ void registerArrayFunctions(const std::string& prefix) {
VELOX_REGISTER_VECTOR_FUNCTION(udf_array_sort, prefix + "array_sort");
VELOX_REGISTER_VECTOR_FUNCTION(
udf_array_sort_desc, prefix + "array_sort_desc");
VELOX_REGISTER_VECTOR_FUNCTION(

Contributor

Perhaps, move this registration into AggregationFuzzerTest. This function should not be available in production setup.

Collaborator Author

Got it, done.

Contributor

@mbasmanova mbasmanova left a comment

Looks good % a few remaining comments.

velox/functions/prestosql/ArraySort.cpp (outdated, resolved)
velox/exec/tests/AggregationFuzzerTest.cpp (resolved)
{"map_agg", "array_sort(map_keys({}))"},
{"map_union", "array_sort(map_keys({}))"},
{"map_union_sum", "array_sort(map_keys({}))"},
{"array_agg", "\"$internal$canonicalize\"({})"},

Contributor

Note that this change reduces coverage in the AggregationFuzzer. Before this change, the Fuzzer was able to verify results against DuckDB (since it supports the array_sort function), but now it cannot. We are going to face the same problem when using Presto as the reference query runner. A solution could be to change the Fuzzer to apply post-aggregation projections using Velox only (once to Velox results, once to results from the reference DB), then compare. This is a follow-up though.
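
The follow-up idea in this comment, applying the same canonicalizing projection to both result sets before comparing them, could look roughly like the sketch below. It is not the Fuzzer's code: `ArrayResult`, `canonicalize`, and `resultsMatch` are hypothetical names, plain integer arrays stand in for aggregation results, and both result sets are assumed to list groups in the same order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One array_agg-style result per group. Element order within an array is not
// guaranteed, so two engines can return the same elements in different order.
using ArrayResult = std::vector<int>;

// Stand-in for a canonicalizing projection: sorting gives every permutation
// of the same elements one representative form.
ArrayResult canonicalize(ArrayResult values) {
  std::sort(values.begin(), values.end());
  return values;
}

// Applies the same projection to both result sets, then compares them.
// Assumption: both sets list groups in the same order.
bool resultsMatch(
    const std::vector<ArrayResult>& actual,
    const std::vector<ArrayResult>& reference) {
  if (actual.size() != reference.size()) {
    return false;
  }
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (canonicalize(actual[i]) != canonicalize(reference[i])) {
      return false;
    }
  }
  return true;
}
```

Because the projection runs in one place for both inputs, the reference engine does not need to provide its own equivalent of the canonicalize function.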

Contributor

@mbasmanova mbasmanova left a comment

Thanks.

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mbasmanova merged this pull request in 5afb621.

@conbench-facebook

Conbench analyzed the 1 benchmark run on commit 5afb621b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot pushed a commit that referenced this pull request Oct 26, 2023
Summary:
In Presto, `array_sort` applied to arrays of complex types with nested
nulls may or may not fail depending on whether the sorting logic
needs to compare the nulls to decide the order.

```SQL
presto> SELECT array_sort(col0) FROM ( VALUES (array [array [1, 2, 3, 4],
array [2, null, 3]])) AS t(col0);
            _col0
------------------------------
 [[1, 2, 3, 4], [2, null, 3]]

presto> SELECT array_sort(col0) FROM ( VALUES (array [array [1, 2, 3, 4],
array [1, null, 3]])) AS t(col0);

Query 20230925_113531_00074_6r5h7 failed:
Array contains elements not supported for comparison
```

This PR checks for nested nulls only during complex type
comparison, matching Presto's implementation. It adds a throwOnNestedNullCompare
flag to distinguish the normal `array_sort[_desc]` functions from
the internal canonicalize function `$internal$canonicalize` introduced in #6928,
which uses the `NoStop` null-handling mode to compare and does not throw.
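
The behavior described above can be sketched with a small self-contained model. This is not the Velox implementation: `NullableArray`, `compareArrays`, and `sortArrays` are hypothetical names, arrays of nullable integers stand in for complex types, and the choice to order nulls before values in the non-throwing mode is an assumption made only to keep the sketch deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

using NullableArray = std::vector<std::optional<int>>;

// Lexicographic three-way comparison of two arrays. Returns a negative value,
// zero, or a positive value. A null is only inspected if all earlier elements
// were equal, so the error is raised only when a null actually decides order.
int compareArrays(
    const NullableArray& left,
    const NullableArray& right,
    bool throwOnNestedNullCompare) {
  const std::size_t common = std::min(left.size(), right.size());
  for (std::size_t i = 0; i < common; ++i) {
    const auto& l = left[i];
    const auto& r = right[i];
    if (!l.has_value() || !r.has_value()) {
      if (throwOnNestedNullCompare) {
        throw std::invalid_argument(
            "Array contains elements not supported for comparison");
      }
      // Assumption for the non-throwing mode: nulls sort before values.
      if (l.has_value() != r.has_value()) {
        return l.has_value() ? 1 : -1;
      }
      continue; // Both are null: treat as equal and keep going.
    }
    if (*l != *r) {
      return *l < *r ? -1 : 1;
    }
  }
  if (left.size() == right.size()) {
    return 0;
  }
  return left.size() < right.size() ? -1 : 1;
}

// Sorts an array of arrays in ascending order using the comparison above.
void sortArrays(
    std::vector<NullableArray>& arrays,
    bool throwOnNestedNullCompare) {
  std::sort(
      arrays.begin(),
      arrays.end(),
      [&](const NullableArray& a, const NullableArray& b) {
        return compareArrays(a, b, throwOnNestedNullCompare) < 0;
      });
}
```

With the flag set, sorting `[[1, 2, 3, 4], [2, null, 3]]` succeeds because the first elements already differ, while sorting `[[1, 2, 3, 4], [1, null, 3]]` throws because the null has to be compared. With the flag cleared, the second input sorts without an error.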

Resolve #6713

Pull Request resolved: #7229

Reviewed By: xiaoxmeng, kgpai

Differential Revision: D50644882

Pulled By: mbasmanova

fbshipit-source-id: 80883ed17da1cae1ec8fbb3fc66b899c422a4dd1
Labels: CLA Signed, Merged
Projects: None yet
Linked issue (may be closed by merging this pull request): Input type for array_sort should be restricted to "orderable" types
3 participants