fix(9870): common expression elimination optimization, should always re-find the correct expression during re-write. #9871

Merged
merged 5 commits into from
Mar 31, 2024

Conversation

wiedld

@wiedld wiedld commented Mar 29, 2024

Closes #9870

Which issue does this PR close?

We found an example where common-expression elimination mutated the logical plan incorrectly. Specifically, it replaced a logical plan node with an incorrect node, which led to a failure in the type check. We have a reproducer of this error (see the first commit's expected test errors).

Rationale for this change

We isolated this error to how the IdArray is generated and used. The IdArray is built by insertion-at-index during the first visitor's traversal. It is then read by the second visitor, also by index, but the second visitor increments that index differently as it traverses.

As a result, the second visitor finds the wrong expr symbol and then inserts the wrong expression. Small changes fixed our bug but broke other statements. We feel that index-based lookup may be inherently fragile to slight differences in traversal patterns across the two visitors; this PR therefore takes an alternative approach.
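To illustrate the failure mode described above, here is a minimal sketch on a toy tree. The `Node` type, `record_all`, and `position_with_jump` are hypothetical names made up for this example; they are not DataFusion's `Expr`, `IdArray`, or visitors. The first pass records an identifier for every node in pre-order, while the second pass skips the children of one kind of node, so its running index no longer lines up with the recorded array.

```rust
/// A toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Leaf(String),
    Call(String, Vec<Node>),
}

/// Stringified identifier; it depends only on the node's structure,
/// not on the order in which a traversal reaches the node.
fn identifier(node: &Node) -> String {
    match node {
        Node::Leaf(name) => name.clone(),
        Node::Call(f, args) => {
            let parts: Vec<String> = args.iter().map(identifier).collect();
            format!("{}({})", f, parts.join(","))
        }
    }
}

/// First pass: record an identifier for every node, in pre-order.
fn record_all(node: &Node, out: &mut Vec<String>) {
    out.push(identifier(node));
    if let Node::Call(_, args) = node {
        for a in args {
            record_all(a, out);
        }
    }
}

/// Second pass: same pre-order, but it does not descend into calls named
/// `skip` (a "jump"). Returns the running index at which `target` is reached.
fn position_with_jump(
    node: &Node,
    skip: &str,
    target: &str,
    counter: &mut usize,
) -> Option<usize> {
    let here = *counter;
    *counter += 1;
    if identifier(node) == target {
        return Some(here);
    }
    if let Node::Call(f, args) = node {
        if f == skip {
            return None; // children are never visited, so the index falls behind
        }
        for a in args {
            if let Some(p) = position_with_jump(a, skip, target, counter) {
                return Some(p);
            }
        }
    }
    None
}
```

For the tree `add(cast(a), b)`, the first pass records four identifiers, but a second pass that jumps over the children of `cast` reaches `b` at index 2, where the recorded array holds `a`. Looking the node up by its identifier instead is unaffected by the jump.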

What changes are included in this PR?

1st commit == reproducer, with the expected test errors

2nd-4th commits == our steps as we introduced the refactoring:

  • 2nd commit = remove the use of an indexed array, and instead rely on mapping by expr_identifier() (a.k.a. a stringified expr), which is constant across the two separate tree traversals.
  • 3rd commit = move the aliased symbol (which is NOT the same as the expr_identifier()) into the ExprSet.
  • 4th commit = encapsulate much of the logic within the ExprSet. (We anticipate that further cleanup may occur later.)

5th commit == update the reproducer tests, which are now passing.
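As a rough sketch of the identifier-keyed approach from the 2nd-4th commits: the struct below is a simplified stand-in, and its field layout and method names (`record`, `common_symbol`) are assumptions for illustration, not the code in this PR. The point it shows is that both traversals key into the same map by the stringified expression, and the alias symbol is stored as a value rather than used as the key.

```rust
use std::collections::HashMap;

type Identifier = String;

/// Simplified stand-in for an identifier-keyed expression set.
/// The real structure also carries the `Expr` and its `DataType`.
#[derive(Default)]
struct ExprSet {
    /// expr_identifier -> (occurrence count, alias symbol)
    map: HashMap<Identifier, (usize, Identifier)>,
}

impl ExprSet {
    /// First traversal: count one occurrence of the expression.
    /// The alias symbol is stored the first time the expression is seen.
    fn record(&mut self, expr_id: &str, symbol: &str) {
        let entry = self
            .map
            .entry(expr_id.to_string())
            .or_insert((0, symbol.to_string()));
        entry.0 += 1;
    }

    /// Second traversal: return the alias symbol only if the expression
    /// occurred more than once, i.e. it is a common expression.
    fn common_symbol(&self, expr_id: &str) -> Option<&str> {
        match self.map.get(expr_id) {
            Some((count, symbol)) if *count > 1 => Some(symbol.as_str()),
            _ => None,
        }
    }
}
```

Because the lookup key is derived from the expression itself, the second traversal can visit nodes in a different order, or skip some, and still find the right entry.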

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Mar 30, 2024
@wiedld wiedld force-pushed the 9870/common-expr-elimination-id-tracking branch from 101eb18 to 7a658b6 Compare March 30, 2024 01:13
@wiedld wiedld force-pushed the 9870/common-expr-elimination-id-tracking branch from 7a658b6 to ff7b3d6 Compare March 30, 2024 01:23
@wiedld

wiedld commented Mar 30, 2024

Outstanding question:

This change may affect performance. While consuming the IdArray, the second visitor (the rewriter) terminated early once all symbols were consumed. We could add logic to restore this early halting, but we're unsure whether there are additional performance considerations that we are missing. Do you mind sharing some of your insights @waynexia? 🙏🏼
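One way the early halting could be restored with identifier-keyed lookup is sketched below. This is only an illustration on a toy tree: `Node`, `rewrite`, and the `remaining` map of outstanding occurrences are hypothetical and do not reflect the rewriter in this PR. The rewriter stops visiting nodes as soon as no common expressions remain to be replaced.

```rust
use std::collections::HashMap;

/// A toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Leaf(String),
    Call(String, Vec<Node>),
}

/// Stringified identifier of a node.
fn identifier(node: &Node) -> String {
    match node {
        Node::Leaf(name) => name.clone(),
        Node::Call(f, args) => {
            let parts: Vec<String> = args.iter().map(identifier).collect();
            format!("{}({})", f, parts.join(","))
        }
    }
}

/// Replace every node whose identifier is in `remaining` with a column
/// reference, decrementing its outstanding count. Once `remaining` is
/// empty, subtrees are returned untouched and `visited` stops growing.
fn rewrite(node: Node, remaining: &mut HashMap<String, usize>, visited: &mut usize) -> Node {
    if remaining.is_empty() {
        return node; // early halt: nothing left to replace
    }
    *visited += 1;
    let id = identifier(&node);
    if let Some(n) = remaining.get_mut(&id) {
        *n -= 1;
        if *n == 0 {
            remaining.remove(&id);
        }
        return Node::Leaf(format!("col({})", id));
    }
    match node {
        Node::Call(f, args) => Node::Call(
            f,
            args.into_iter()
                .map(|a| rewrite(a, remaining, visited))
                .collect(),
        ),
        leaf => leaf,
    }
}
```

For `f(mul(a,b), mul(a,b), g(c,d))` with two outstanding occurrences of `mul(a,b)`, the rewriter visits three nodes and never looks at `g(c,d)`. Note that this sketch still stringifies each visited node, so it does not address the identifier cost itself.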


@alamb alamb left a comment


Thank you @wiedld -- I reviewed the code carefully and it looks quite good to me. Not only does it fix the bug, I think it also makes the structure easier to reason about. 🏆

I also ran the planning benchmarks, which showed no change due to this PR.

Benchmark Details

cargo bench --bench sql_planner

alamb@aal-dev:~/arrow-datafusion4$ critcmp main this_branch
group                                         main                                   this_branch
-----                                         ----                                   -----------
logical_aggregate_with_join                   1.00  1268.1±34.12µs        ? ?/sec    1.01  1275.3±74.80µs        ? ?/sec
logical_plan_tpch_all                         1.00     17.1±0.20ms        ? ?/sec    1.02     17.5±0.37ms        ? ?/sec
logical_select_all_from_1000                  1.01     94.7±0.43ms        ? ?/sec    1.00     94.0±0.57ms        ? ?/sec
logical_select_one_from_700                   1.00   738.4±28.13µs        ? ?/sec    1.01   743.6±11.76µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   792.5±14.03µs        ? ?/sec    1.00    788.7±9.50µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    756.6±9.69µs        ? ?/sec    1.00   759.3±29.50µs        ? ?/sec
physical_plan_tpch_all                        1.00    133.6±0.79ms        ? ?/sec    1.01    135.0±1.28ms        ? ?/sec
physical_plan_tpch_q1                         1.00      7.6±0.04ms        ? ?/sec    1.02      7.7±0.04ms        ? ?/sec
physical_plan_tpch_q10                        1.00      6.3±0.05ms        ? ?/sec    1.01      6.4±0.04ms        ? ?/sec
physical_plan_tpch_q11                        1.00      5.0±0.04ms        ? ?/sec    1.03      5.1±0.09ms        ? ?/sec
physical_plan_tpch_q12                        1.00      4.1±0.03ms        ? ?/sec    1.01      4.1±0.03ms        ? ?/sec
physical_plan_tpch_q13                        1.00      2.7±0.03ms        ? ?/sec    1.01      2.7±0.02ms        ? ?/sec
physical_plan_tpch_q14                        1.00      3.4±0.02ms        ? ?/sec    1.03      3.6±0.03ms        ? ?/sec
physical_plan_tpch_q16                        1.00      5.3±0.04ms        ? ?/sec    1.01      5.3±0.04ms        ? ?/sec
physical_plan_tpch_q17                        1.00      4.9±0.07ms        ? ?/sec    1.00      4.9±0.04ms        ? ?/sec
physical_plan_tpch_q18                        1.00      5.4±0.03ms        ? ?/sec    1.01      5.5±0.06ms        ? ?/sec
physical_plan_tpch_q19                        1.00     10.1±0.08ms        ? ?/sec    1.01     10.2±0.06ms        ? ?/sec
physical_plan_tpch_q2                         1.00     12.2±0.08ms        ? ?/sec    1.01     12.3±0.09ms        ? ?/sec
physical_plan_tpch_q20                        1.00      6.4±0.06ms        ? ?/sec    1.01      6.5±0.06ms        ? ?/sec
physical_plan_tpch_q21                        1.00      9.5±0.14ms        ? ?/sec    1.01      9.6±0.10ms        ? ?/sec
physical_plan_tpch_q22                        1.00      4.7±0.05ms        ? ?/sec    1.01      4.7±0.04ms        ? ?/sec
physical_plan_tpch_q3                         1.00      4.1±0.02ms        ? ?/sec    1.02      4.2±0.03ms        ? ?/sec
physical_plan_tpch_q4                         1.00      3.4±0.03ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec
physical_plan_tpch_q5                         1.00      6.1±0.04ms        ? ?/sec    1.01      6.1±0.04ms        ? ?/sec
physical_plan_tpch_q6                         1.00      2.0±0.02ms        ? ?/sec    1.02      2.1±0.02ms        ? ?/sec
physical_plan_tpch_q7                         1.00      8.6±0.06ms        ? ?/sec    1.01      8.7±0.07ms        ? ?/sec
physical_plan_tpch_q8                         1.00     12.1±0.08ms        ? ?/sec    1.02     12.3±0.23ms        ? ?/sec
physical_plan_tpch_q9                         1.00      9.2±0.06ms        ? ?/sec    1.01      9.2±0.06ms        ? ?/sec
physical_select_all_from_1000                 1.01    691.9±1.14ms        ? ?/sec    1.00    687.8±1.48ms        ? ?/sec
physical_select_one_from_700                  1.00      4.1±0.02ms        ? ?/sec    1.01      4.2±0.05ms        ? ?/sec

cc @waynexia @peter-toth in case you would like to review

/// - counter
/// - DataType of this expression.
/// - symbol used as the identifier in the alias.
map: HashMap<Identifier, (Expr, usize, DataType, Identifier)>,

The amount of copying this requires is unfortunate (maybe it is N^2 in the number of exprs?) -- however, I don't think your changes make it any worse (or better).

I filed #9873 to track the issue. Maybe now that you know this code much better, a fun side project would be to make it much faster 🚀 (🎣 )

@alamb alamb mentioned this pull request Mar 30, 2024
8 tasks
@alamb

alamb commented Mar 30, 2024

Given this is a regression, I plan to merge this PR tomorrow unless someone else would like time to review.


@waynexia waynexia left a comment


Thanks @wiedld and @alamb. I love the structure of this commit history and the code!

For the performance part, I learned that the overall plan optimization phase is costly, but I haven't looked into the per-rule details. The planner benchmark is very helpful 👍

This specific rule needs many traversal passes and many stringify operations at its base. I agree with @alamb that this fix doesn't make it worse. But finding some way to alleviate the overhead would be very meaningful, especially when we extend this rule to run across plans. And the way we identify an expr might also need to change later, considering #9776.

By the way, this might conflict with #9719.

@alamb

alamb commented Mar 31, 2024

For the performance part, I learned that the overall plan optimization phase is costly, but I haven't looked into the per-rule details. The planner benchmark is very helpful 👍

Yes, I agree. It is key to our ability to make performance improvements in planning.

This specific rule needs many traversal passes and many stringify operations at its base. I agree with @alamb that this fix doesn't make it worse. But finding some way to alleviate the overhead would be very meaningful, especially when we extend this rule to run across plans. And the way we identify an expr might also need to change later, considering #9776.

💯 to this as well. I think avoiding creating String identifiers will be the key.
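As one possible direction for avoiding String identifiers, the sketch below computes a structural hash bottom-up, so each node combines its own label with its children's hashes instead of concatenating strings. This is an illustration only: the toy `Node` type and `structural_hash` are hypothetical, this is not what the PR or DataFusion does, and a real implementation would still need an equality check to guard against hash collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A toy expression tree (hypothetical; not DataFusion's `Expr`).
enum Node {
    Leaf(String),
    Call(String, Vec<Node>),
}

/// Bottom-up structural hash. No intermediate identifier string is built;
/// a parent only hashes its function name and its children's 64-bit hashes.
fn structural_hash(node: &Node) -> u64 {
    let mut h = DefaultHasher::new();
    match node {
        Node::Leaf(name) => {
            0u8.hash(&mut h); // tag so a leaf never hashes like a call
            name.hash(&mut h);
        }
        Node::Call(f, args) => {
            1u8.hash(&mut h);
            f.hash(&mut h);
            for a in args {
                structural_hash(a).hash(&mut h);
            }
        }
    }
    h.finish()
}
```

Two structurally identical subtrees produce the same hash regardless of where or in which order a traversal encounters them, which is the property the stringified identifier provides today. As written, the function recomputes child hashes on every call; caching each node's hash during a single bottom-up pass would avoid that.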

@alamb

alamb commented Mar 31, 2024

Thank you for the review @waynexia

@alamb alamb merged commit cd7a00b into apache:main Mar 31, 2024
24 checks passed
andygrove pushed a commit that referenced this pull request Mar 31, 2024
…re-find the correct expression during re-write. (#9871)

* test(9870): reproducer of error with jumping traversal patterns in common-expr-elimination traversals

* refactor: remove the IdArray ordered idx, since the idx ordering does not always stay in sync with the updated TreeNode traversal

* refactor: use the only reproducible key (expr_identifer) for expr_set, while keeping the (stack-popped) symbol used for alias.

* refactor: encapsulate most of the logic within ExprSet, and delineate the expr_identifier from the alias symbol

* test(9870): demonstrate that the sqllogictests are now passing
@Lordworms

I think this one may need a change in SqlParser. Currently, SqlParser parses syntax like "struct(xx, xx, xx, xx)" into Expr::Struct, but when it encounters a structure like "t2.struct(xx, xx, xx, xx)", it parses it into two idents. I think it is not ready yet.

@alamb

alamb commented Apr 1, 2024

I think the issue is tracked in #9891

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Apr 1, 2024
wiedld added a commit to wiedld/arrow-datafusion that referenced this pull request Apr 1, 2024
wiedld added a commit to influxdata/arrow-datafusion that referenced this pull request Apr 1, 2024
wiedld added a commit to wiedld/arrow-datafusion that referenced this pull request Apr 2, 2024
peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 1, 2024
… always re-find the correct expression during re-write. (apache#9871)"

This reverts commit cd7a00b.
peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 2, 2024
peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 2, 2024
peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 6, 2024
alamb added a commit that referenced this pull request May 8, 2024
* Revert "fix(9870): common expression elimination optimization, should always re-find the correct expression during re-write. (#9871)"

This reverts commit cd7a00b.

* expr id should always contain the full expr structure, cleaner expr ids, better JumpMark handling, better variable names, code cleanup, some new todos

* move `Expr` from `expr_set`s to `affected_id`s

* better naming, docs fixes

* introduce `CommonExprs` type alias, minor todo fix

* add test

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Common expression elimination should also re-find the correct expression, during re-write.
4 participants