Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: issue #8838 discard extra sort when sorted element is wrapped #9127

Merged
merged 8 commits into from
Feb 20, 2024

Conversation

Lordworms
Copy link
Contributor

@Lordworms Lordworms commented Feb 5, 2024

fix: issue #8838 discard extra sort when sorted element is wrapped

Which issue does this PR close?

improves the state in #8838

Closes #.

Rationale for this change

In the previous design, when we wanted to construct a dependency_map, we just used the order gotten from the previous node(In this issue, when we construct ProjectionExec, we just use the ordering gotten from CsvExec), however, previous design does not support a wrapper for the existing column, In this case, when we have orderings is [a DESC, b DESC] and the projection is CAST(a as BITINT), then we should change the ordering from a to CAST(a as BITINT) in order not to generate false dependency map(In this issue, when we construct the dependency map using the original expression, we would break prematurely and lose dependencies). In order to address this, we need to substitute original expression with those monotonic new expession

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Feb 15, 2024

Thanks @Lordworms for this PR. I have examined this PR and it is really neat. However,
I think we need to address a couple of points before merging:

  • During substitution you seem to be filtering out expressions unchanged. As an example, assume ordering is [a ASC, b ASC]. And projection expressions are a, b, CAST(a as BIGINT) as cast_a. In this case after projection valid orderings are [a ASC, b ASC] and [cast_a ASC, b ASC]. In this case, substitution should convert single ordering [a ASC, b ASC] possibly multiple orderings: [a ASC, b ASC], [cast_a ASC, b ASC]
  • Also substitution should work only in when function or operation has one-to-one mapping. e.g monotonicity alone is not a sufficient condition. As a counter example consider following table
a b
1.5 1
1.5 2
2.0 1
2.0 2

where table satisfies ordering [a ASC, b ASC]. If projection the projection (a, b, CEIL(a)) is applied to this table. Result would be

a b CEIL(a)
1.5 1 2.0
1.5 2 2.0
2.0 1 2.0
2.0 2 2.0

where ordering [a ASC, b ASC] is still satisfied, but [ceil(a) ASC, b ASC] is not satisfied. Hence necessary condition for substitution is monotonicity and one-to-one mapping. For the same reason, CAST should substitute only when target type is larger than the source type. (e.g cast from int16 to int32, etc.)

Currently, we do not have one-to-one mapping analysis for Scalar functions. For this reason, until we have that support adding substitution support for cast to larger type is enough.

@alamb
Copy link
Contributor

alamb commented Feb 16, 2024

@suremarc as the requester of this feature, do you have some time to review this PR?

)
LOCATION '../../testing/data/csv/aggregate_test_100.csv';

# test for substitute CAST senario
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a test for times when casting may not preserve ordering?

For exmaple if the input is INT

0
1
2
10

If that is cast to UTF8 the data is now

"0"
"1"
"2"
"10"

Which is no longer sorted correctly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add that (maybe on the weekend)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added now

@alamb
Copy link
Contributor

alamb commented Feb 16, 2024

cc @mustafasrepo

@Lordworms
Copy link
Contributor Author

Thanks @Lordworms for this PR. I have examined this PR and it is really neat. However, I think we need to address a couple of points before merging:

  • During substitution you seem to be filtering out expressions unchanged. As an example, assume ordering is [a ASC, b ASC]. And projection expressions are a, b, CAST(a as BIGINT) as cast_a. In this case after projection valid orderings are [a ASC, b ASC] and [cast_a ASC, b ASC]. In this case, substitution should convert single ordering [a ASC, b ASC] possibly multiple orderings: [a ASC, b ASC], [cast_a ASC, b ASC]
  • Also substitution should work only in when function or operation has one-to-one mapping. e.g monotonicity alone is not a sufficient condition. As a counter example consider following table

a b
1.5 1
1.5 2
2.0 1
2.0 2
where table satisfies ordering [a ASC, b ASC]. If projection the projection (a, b, CEIL(a)) is applied to this table. Result would be

a b CEIL(a)
1.5 1 2.0
1.5 2 2.0
2.0 1 2.0
2.0 2 2.0
where ordering [a ASC, b ASC] is still satisfied, but [ceil(a) ASC, b ASC] is not satisfied. Hence necessary condition for substitution is monotonicity and one-to-one mapping. For the same reason, CAST should substitute only when target type is larger than the source type. (e.g cast from int16 to int32, etc.)

Currently, we do not have one-to-one mapping analysis for Scalar functions. For this reason, until we have that support adding substitution support for cast to larger type is enough.

I got it, so in this case I just need to focus on CAST to bigger data type and ignore the ScalarFunction?

@mustafasrepo
Copy link
Contributor

I got it, so in this case I just need to focus on CAST to bigger data type and ignore the ScalarFunction?

Exactly

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Feb 19, 2024

Thanks @Lordworms for this PR. There are some small problems with this PR. See my comments below. I have filed another PR on top of the commits on this PR.

Maybe @Lordworms can bring the changes in the PR to this PR to address my comments above.

However, I think we can merge this PR as is. then continue discussion in the PR. I do not think, anything is blocking for this PR to merge.

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

c_customer_sk DESC,
c_current_cdemo_sk DESC
)
LOCATION '../../testing/data/csv/aggregate_test_100.csv';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema of the file, and this table is not consistent. Additionally file doesn't satisfy the WITH_ORDER invariant given during creation.

Comment on lines +34 to +36
SELECT
CAST(c_customer_sk AS BIGINT) AS c_customer_sk_big,
c_current_cdemo_sk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the select expressions were c_customer_sk, CAST(c_customer_sk AS BIGINT) AS c_customer_sk_big, c_current_cdemo_sk planner wouldn't remove the ORDER BY c_customer_sk_big DESC, c_current_cdemo_sk DESC sort from the plan. However, it should be possible. Because, in current implementation, substitution cannot generate more than 1 valid ordering from a single ordering.

@mustafasrepo
Copy link
Contributor

Thanks @Lordworms for this PR. Merging this PR. We can continue discussion in the PR which build on the commits of this PR.

@mustafasrepo mustafasrepo merged commit 453a45a into apache:main Feb 20, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants