fix(9870): common expression elimination optimization, should always re-find the correct expression during re-write. #9871

wiedld · 2024-03-29T23:55:23Z

Which issue does this PR close?

We found an example where common-expr-elimination was mutating the logical plan incorrectly. Specifically, it would replace a logical plan node with an incorrect node, which led to a failure in the type check. We have a reproducer of this error (see the first commit's expected test errors).

Rationale for this change

We isolated this error to how the IdArray is generated and used. The IdArray is generated by insertion-at-index during traversal of a first visitor. This IdArray is then read during the second visitor, also by an index, but this index is incremented differently as this second visitor traverses.

As such, the second visitor finds the wrong expr symbol, and then inserts the wrong expression. We tried small changes that fixed our bug, but they broke other statements. We feel that the index-based lookup may be inherently fragile to slight changes in traversal patterns across the two visitors; as such, this PR is an alternative approach.
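To illustrate the fragility described above, here is a deliberately exaggerated toy sketch, not the DataFusion code. The `Expr` enum, `identifier`, `first_pass`, `second_pass`, and `compare_lookups` are all hypothetical names; `identifier` only stands in for the stringified expr. The first pass records identifiers in post-order, and the second pass walks in pre-order, so a position-based read returns the wrong identifier while a key-based lookup still finds every node.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug)]
enum Expr {
    Col(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Stringified form of an expression, standing in for an expr identifier.
fn identifier(e: &Expr) -> String {
    match e {
        Expr::Col(name) => name.clone(),
        Expr::Add(l, r) => format!("({} + {})", identifier(l), identifier(r)),
    }
}

/// First pass (post-order): record each node both by position and by key.
fn first_pass(e: &Expr, by_index: &mut Vec<String>, by_key: &mut HashMap<String, usize>) {
    if let Expr::Add(l, r) = e {
        first_pass(l, by_index, by_key);
        first_pass(r, by_index, by_key);
    }
    let id = identifier(e);
    by_index.push(id.clone());
    *by_key.entry(id).or_insert(0) += 1;
}

/// Second pass (pre-order): returns (wrong reads by index, misses by key).
fn second_pass(
    e: &Expr,
    by_index: &[String],
    by_key: &HashMap<String, usize>,
    next: &mut usize,
) -> (usize, usize) {
    let id = identifier(e);
    // Position-based read: only correct if both passes advance in lockstep.
    let mut wrong_by_index = usize::from(by_index[*next] != id);
    // Key-based read: independent of the order in which nodes are visited.
    let mut missing_by_key = usize::from(!by_key.contains_key(&id));
    *next += 1;
    if let Expr::Add(l, r) = e {
        for child in [l, r] {
            let (w, m) = second_pass(child, by_index, by_key, next);
            wrong_by_index += w;
            missing_by_key += m;
        }
    }
    (wrong_by_index, missing_by_key)
}

/// Runs both passes over `e` and reports the two error counts.
fn compare_lookups(e: &Expr) -> (usize, usize) {
    let mut by_index = Vec::new();
    let mut by_key = HashMap::new();
    first_pass(e, &mut by_index, &mut by_key);
    second_pass(e, &by_index, &by_key, &mut 0)
}
```

For `a + b`, `compare_lookups` reports that all three positional reads are wrong and no key lookup misses. The real bug involved subtler differences in traversal (jumps), but the failure mode is the same kind of desynchronization.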

What changes are included in this PR?

1st commit == reproducer, with the expected test errors

2-4th commits == our steps as we introduced the refactoring:

2nd commit = remove the use of an indexed array, and instead rely upon mapping by expr_identifier() (a.k.a. a stringified expr) which is constant across the two separate tree traversals.
3rd commit = move the aliased symbol (which is NOT the same as the expr_identifier()) into the ExprSet
4th commit = encapsulate much of the logic within the ExprSet. (we anticipate further cleanup may occur later.)

5th commit == update the reproducer tests, which are now passing.
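The refactoring steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's code: the `record` and `alias_if_common` methods and the symbol strings are hypothetical, and the tuple here keeps only a counter and the alias symbol.

```rust
use std::collections::HashMap;

/// Identifier type; a stringified expr in this sketch.
type Identifier = String;

/// Simplified stand-in for an expression set keyed by the expr identifier.
#[derive(Debug, Default)]
struct ExprSet {
    /// expr identifier -> (occurrence counter, symbol used for the alias)
    map: HashMap<Identifier, (usize, Identifier)>,
}

impl ExprSet {
    /// Record one occurrence. The alias symbol is stored on first sight only,
    /// so later visits with a different symbol do not change it.
    fn record(&mut self, expr_id: &str, alias_symbol: &str) {
        let entry = self
            .map
            .entry(expr_id.to_string())
            .or_insert_with(|| (0, alias_symbol.to_string()));
        entry.0 += 1;
    }

    /// Return the alias symbol if the expression was seen more than once.
    fn alias_if_common(&self, expr_id: &str) -> Option<&str> {
        match self.map.get(expr_id) {
            Some((count, alias)) if *count > 1 => Some(alias.as_str()),
            _ => None,
        }
    }
}
```

Because both traversals compute the same key for the same expression, the rewriter can look an expression up without tracking how far the first visitor had progressed.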

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

wiedld · 2024-03-30T01:52:46Z

Outstanding question:

There may be some changes in the performance. While consuming the IdArray, the second visitor (the rewriter) had an early termination once all symbols were consumed. We could add logic to include this early halting, but we're unclear if there are additional performance considerations that we are also missing. Do you mind sharing some of your insights @waynexia ? 🙏🏼
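One way the early halting could look with key-based lookup is sketched below. This is an assumption-laden toy, not the rewriter in this PR: `Expr`, `identifier`, and `rewrite` are hypothetical, and the `remaining` counter assumes the first pass already knows how many replacements are expected.

```rust
/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Stringified form of an expression, standing in for an expr identifier.
fn identifier(e: &Expr) -> String {
    match e {
        Expr::Col(name) => name.clone(),
        Expr::Add(l, r) => format!("({} + {})", identifier(l), identifier(r)),
    }
}

/// Replace occurrences of `target` with a column named `alias`.
/// `remaining` counts replacements still expected; once it reaches zero the
/// rest of the tree is returned untouched. `visited` counts inspected nodes.
fn rewrite(
    e: Expr,
    target: &str,
    alias: &str,
    remaining: &mut usize,
    visited: &mut usize,
) -> Expr {
    if *remaining == 0 {
        return e; // early halt: nothing left to replace in this subtree
    }
    *visited += 1;
    if identifier(&e) == target {
        *remaining -= 1;
        return Expr::Col(alias.to_string());
    }
    match e {
        Expr::Add(l, r) => {
            let l = rewrite(*l, target, alias, remaining, visited);
            let r = rewrite(*r, target, alias, remaining, visited);
            Expr::Add(Box::new(l), Box::new(r))
        }
        leaf => leaf,
    }
}
```

On `((a + b) + c) + d` with one expected replacement of `(a + b)`, this inspects three of the seven nodes. Note that the sketch still stringifies every inspected node, so it addresses only the traversal cost, not the identifier cost.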

alamb

Thank you @wiedld -- I reviewed the code carefully and it looks quite good to me. Not only does it fix the bug I think it makes the structure easier to reason about too. 🏆

I also ran the planning benchmarks which showed no change due to this PR

Benchmark Details

cargo bench --bench sql_planner

alamb@aal-dev:~/arrow-datafusion4$ critcmp main this_branch
group                                         main                                   this_branch
-----                                         ----                                   -----------
logical_aggregate_with_join                   1.00  1268.1±34.12µs        ? ?/sec    1.01  1275.3±74.80µs        ? ?/sec
logical_plan_tpch_all                         1.00     17.1±0.20ms        ? ?/sec    1.02     17.5±0.37ms        ? ?/sec
logical_select_all_from_1000                  1.01     94.7±0.43ms        ? ?/sec    1.00     94.0±0.57ms        ? ?/sec
logical_select_one_from_700                   1.00   738.4±28.13µs        ? ?/sec    1.01   743.6±11.76µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   792.5±14.03µs        ? ?/sec    1.00    788.7±9.50µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    756.6±9.69µs        ? ?/sec    1.00   759.3±29.50µs        ? ?/sec
physical_plan_tpch_all                        1.00    133.6±0.79ms        ? ?/sec    1.01    135.0±1.28ms        ? ?/sec
physical_plan_tpch_q1                         1.00      7.6±0.04ms        ? ?/sec    1.02      7.7±0.04ms        ? ?/sec
physical_plan_tpch_q10                        1.00      6.3±0.05ms        ? ?/sec    1.01      6.4±0.04ms        ? ?/sec
physical_plan_tpch_q11                        1.00      5.0±0.04ms        ? ?/sec    1.03      5.1±0.09ms        ? ?/sec
physical_plan_tpch_q12                        1.00      4.1±0.03ms        ? ?/sec    1.01      4.1±0.03ms        ? ?/sec
physical_plan_tpch_q13                        1.00      2.7±0.03ms        ? ?/sec    1.01      2.7±0.02ms        ? ?/sec
physical_plan_tpch_q14                        1.00      3.4±0.02ms        ? ?/sec    1.03      3.6±0.03ms        ? ?/sec
physical_plan_tpch_q16                        1.00      5.3±0.04ms        ? ?/sec    1.01      5.3±0.04ms        ? ?/sec
physical_plan_tpch_q17                        1.00      4.9±0.07ms        ? ?/sec    1.00      4.9±0.04ms        ? ?/sec
physical_plan_tpch_q18                        1.00      5.4±0.03ms        ? ?/sec    1.01      5.5±0.06ms        ? ?/sec
physical_plan_tpch_q19                        1.00     10.1±0.08ms        ? ?/sec    1.01     10.2±0.06ms        ? ?/sec
physical_plan_tpch_q2                         1.00     12.2±0.08ms        ? ?/sec    1.01     12.3±0.09ms        ? ?/sec
physical_plan_tpch_q20                        1.00      6.4±0.06ms        ? ?/sec    1.01      6.5±0.06ms        ? ?/sec
physical_plan_tpch_q21                        1.00      9.5±0.14ms        ? ?/sec    1.01      9.6±0.10ms        ? ?/sec
physical_plan_tpch_q22                        1.00      4.7±0.05ms        ? ?/sec    1.01      4.7±0.04ms        ? ?/sec
physical_plan_tpch_q3                         1.00      4.1±0.02ms        ? ?/sec    1.02      4.2±0.03ms        ? ?/sec
physical_plan_tpch_q4                         1.00      3.4±0.03ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec
physical_plan_tpch_q5                         1.00      6.1±0.04ms        ? ?/sec    1.01      6.1±0.04ms        ? ?/sec
physical_plan_tpch_q6                         1.00      2.0±0.02ms        ? ?/sec    1.02      2.1±0.02ms        ? ?/sec
physical_plan_tpch_q7                         1.00      8.6±0.06ms        ? ?/sec    1.01      8.7±0.07ms        ? ?/sec
physical_plan_tpch_q8                         1.00     12.1±0.08ms        ? ?/sec    1.02     12.3±0.23ms        ? ?/sec
physical_plan_tpch_q9                         1.00      9.2±0.06ms        ? ?/sec    1.01      9.2±0.06ms        ? ?/sec
physical_select_all_from_1000                 1.01    691.9±1.14ms        ? ?/sec    1.00    687.8±1.48ms        ? ?/sec
physical_select_one_from_700                  1.00      4.1±0.02ms        ? ?/sec    1.01      4.2±0.05ms        ? ?/sec

cc @waynexia @peter-toth in case you would like to review

alamb · 2024-03-30T12:34:33Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+    /// - counter
+    /// - DataType of this expression.
+    /// - symbol used as the identifier in the alias.
+    map: HashMap<Identifier, (Expr, usize, DataType, Identifier)>,


The amount of copying this requires is unfortunate (maybe it is N^2 in the number of exprs?) -- however, I don't think your changes make it any worse (or better).

I filed #9873 to track the issue. Maybe now that you know this code much better, a fun side project would be to make it much faster 🚀 (🎣 )

alamb · 2024-03-30T20:01:50Z

Given this is a regression I will plan to merge this PR tomorrow unless someone else would like time to review

waynexia

Thanks @wiedld and @alamb. I love the structure of this commit history and the code!

For the performance part, I learned the overall plan optimization phase is costly, but haven't looked into the per rule detail. The planner benchmark is very helpful 👍

For this specific rule, it needs many traversal passes and stringifies expressions many times at its core. I agree with @alamb that this fix doesn't make it worse. But finding some way to reduce the overhead would be very meaningful, especially when we extend this rule to run across plans. The way we identify an expr might also need to change later, considering #9776.

By the way, this might be in conflict with #9719.

alamb · 2024-03-31T12:08:59Z

For the performance part, I learned the overall plan optimization phase is costly, but haven't looked into the per rule detail. The planner benchmark is very helpful 👍

Yes, I agree. It is key to our ability to make performance improvements in planning

For this specific rule, it needs many traversal passes and stringifies expressions many times at its core. I agree with @alamb that this fix doesn't make it worse. But finding some way to reduce the overhead would be very meaningful, especially when we extend this rule to run across plans. The way we identify an expr might also need to change later, considering #9776.

💯 to this as well. I think avoiding creating String identifiers will be the key.
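As one possible direction (an assumption for illustration, not something this PR or any follow-up is claimed to do), identifiers could be computed bottom-up as hashes so that each node reuses its children's ids instead of re-stringifying the whole subtree. `Expr` and `hash_ids` below are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
enum Expr {
    Col(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Compute a `u64` id for every node in post-order, appending each to `out`.
/// A parent's id is derived from its children's ids, so each node does a
/// constant amount of hashing work beyond its own payload.
fn hash_ids(e: &Expr, out: &mut Vec<u64>) -> u64 {
    let mut h = DefaultHasher::new();
    match e {
        Expr::Col(name) => {
            0u8.hash(&mut h); // variant tag for columns
            name.hash(&mut h);
        }
        Expr::Add(l, r) => {
            let lh = hash_ids(l, out);
            let rh = hash_ids(r, out);
            1u8.hash(&mut h); // variant tag for additions
            lh.hash(&mut h);
            rh.hash(&mut h);
        }
    }
    let id = h.finish();
    out.push(id);
    id
}
```

Equal subtrees get equal ids, so they can still key a map. Hash collisions are possible in principle, so a real implementation would need to confirm equality of the expressions when two ids match.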

alamb · 2024-03-31T12:09:04Z

Thank you for the review @waynexia

…re-find the correct expression during re-write. (#9871)

* test(9870): reproducer of error with jumping traversal patterns in common-expr-elimination traversals
* refactor: remove the IdArray ordered idx, since the idx ordering does not always stay in sync with the updated TreeNode traversal
* refactor: use the only reproducible key (expr_identifer) for expr_set, while keeping the (stack-popped) symbol used for alias.
* refactor: encapsulate most of the logic within ExprSet, and delineate the expr_identifier from the alias symbol
* test(9870): demonstrate that the sqllogictests are now passing

Lordworms · 2024-04-01T00:57:48Z

I think this one may need a change in SqlParser. Currently SqlParser parses syntax like "struct(xx, xx, xx, xx)" into Expr::Struct, but when it encounters something like "t2.struct(xx, xx, xx, xx)" it parses it into two idents. I think it is not ready yet.

alamb · 2024-04-01T11:16:04Z

I think the issue is tracked in #9891

* Revert "fix(9870): common expression elimination optimization, should always re-find the correct expression during re-write. (#9871)" This reverts commit cd7a00b.
* expr id should always contain the full expr structure, cleaner expr ids, better JumpMark handling, better variable names, code cleanup, some new todos
* move `Expr` from `expr_set`s to `affected_id`s
* better naming, docs fixes
* introduce `CommonExprs` type alias, minor todo fix
* add test

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

test(9870): reproducer of error with jumping traversal patterns in co…

efb8c8d

…mmon-expr-elimination traversals

github-actions bot added the sqllogictest label Mar 29, 2024

wiedld added 2 commits March 29, 2024 16:55

refactor: remove the IdArray ordered idx, since the idx ordering does…

a4f9acf

… not always stay in sync with the updated TreeNode traversal

refactor: use the only reproducible key (expr_identifer) for expr_set…

87eb784

…, while keeping the (stack-popped) symbol used for alias.

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Mar 30, 2024

wiedld force-pushed the 9870/common-expr-elimination-id-tracking branch from 101eb18 to 7a658b6 Compare March 30, 2024 01:13

wiedld added 2 commits March 29, 2024 18:22

refactor: encapsulate most of the logic within ExprSet, and delineate…

5eb7a77

… the expr_identifier from the alias symbol

test(9870): demonstrate that the sqllogictests are now passing

ff7b3d6

wiedld force-pushed the 9870/common-expr-elimination-id-tracking branch from 7a658b6 to ff7b3d6 Compare March 30, 2024 01:23

wiedld marked this pull request as ready for review March 30, 2024 02:11

alamb mentioned this pull request Mar 30, 2024

Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination #9873

Open

alamb approved these changes Mar 30, 2024

alamb mentioned this pull request Mar 30, 2024

Release DataFusion 37.0.0 #9682

Closed

8 tasks

waynexia approved these changes Mar 31, 2024

alamb merged commit cd7a00b into apache:main Mar 31, 2024
24 checks passed

Omega359 mentioned this pull request Mar 31, 2024

expr.slt test failing in main #9891

Closed

peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 1, 2024

Revert "fix(9870): common expression elimination optimization, should…

89eeaf0

… always re-find the correct expression during re-write. (apache#9871)" This reverts commit cd7a00b.

peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 2, 2024

Revert "fix(9870): common expression elimination optimization, should…

cb1bc2e

… always re-find the correct expression during re-write. (apache#9871)" This reverts commit cd7a00b.

peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 2, 2024

Revert "fix(9870): common expression elimination optimization, should…

d4e78c8

… always re-find the correct expression during re-write. (apache#9871)" This reverts commit cd7a00b.

peter-toth added a commit to peter-toth/arrow-datafusion that referenced this pull request May 6, 2024

Revert "fix(9870): common expression elimination optimization, should…

9dd9a2f

… always re-find the correct expression during re-write. (apache#9871)" This reverts commit cd7a00b.

peter-toth mentioned this pull request May 6, 2024

Fix and improve CommonSubexprEliminate rule #10396

Merged

peter-toth mentioned this pull request May 11, 2024

Make CommonSubexprEliminate faster by stop copying so many strings #10426

Open
