Better CSE identifier #10473

peter-toth · 2024-05-12T18:11:59Z

This is a draft PR that implements the ideas from #10426 (comment).

Which issue does this PR close?

Closes #10426.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

This is looking very exciting @peter-toth

alamb · 2024-05-14T13:18:33Z

datafusion/common/src/tree_node.rs

@@ -204,6 +214,24 @@ pub trait TreeNode: Sized {
        apply_impl(self, &mut f)
    }

+    fn apply_ref<'n, F: FnMut(&'n Self) -> Result<TreeNodeRecursion>>(


This API would be helpful in other areas to avoid cloning -- I am very much in favor of adding it
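To illustrate why a reference-returning visitor avoids cloning, here is a minimal sketch on a hypothetical two-variant `Node` type (not DataFusion's `Expr` or `TreeNode`). It simplifies the signature from the diff: the closure returns nothing instead of `Result<TreeNodeRecursion>`. The point is the `&'n Node` argument, which is tied to the tree's lifetime, so the caller can keep the references after the traversal ends.

```rust
// Hypothetical miniature tree; a stand-in for a real expression type.
#[derive(Debug)]
pub enum Node {
    Leaf(String),
    Add(Box<Node>, Box<Node>),
}

impl Node {
    /// Visits every node top-down. The closure receives `&'n Node`, a reference
    /// that lives as long as the tree, not just as long as the call.
    pub fn apply_ref<'n, F: FnMut(&'n Node)>(&'n self, mut f: F) {
        // Inner helper so the recursion can reuse one `&mut F`.
        fn apply_ref_impl<'n, F: FnMut(&'n Node)>(node: &'n Node, f: &mut F) {
            f(node);
            if let Node::Add(left, right) = node {
                apply_ref_impl(left, f);
                apply_ref_impl(right, f);
            }
        }
        apply_ref_impl(self, &mut f)
    }
}

/// Collects borrowed leaf names without cloning any `String`.
pub fn leaf_names(root: &Node) -> Vec<&str> {
    let mut names = Vec::new();
    root.apply_ref(|node| {
        if let Node::Leaf(name) = node {
            // `name` borrows from the tree, so it can outlive the closure.
            names.push(name.as_str());
        }
    });
    names
}
```

With a visitor whose closure only receives a short-lived `&Node`, `leaf_names` would have to return `Vec<String>` and clone each name.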

alamb · 2024-05-14T13:20:58Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+            (
+                Identifier {
+                    hash: 0,
+                    expr: &c_plus_a,
+                },
+                (1, DataType::UInt32),
+            ),


I wonder if we could make these tests more readable with something like this:

Suggested change

-            (
-                Identifier {
-                    hash: 0,
-                    expr: &c_plus_a,
-                },
-                (1, DataType::UInt32),
-            ),
+            (
+                Identifier::from(&c_plus_a).with_hash(0),
+                (1, DataType::UInt32),
+            ),

or

Suggested change

-            (
-                Identifier {
-                    hash: 0,
-                    expr: &c_plus_a,
-                },
-                (1, DataType::UInt32),
-            ),
+            (
+                make_identifier(&c_plus_a, 0),
+                (1, DataType::UInt32),
+            ),

peter-toth · 2024-05-14

Sure, I can add Identifier::from to the PR.
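A minimal sketch of what the first suggestion could look like. Everything here is an assumption for illustration: `Expr` is a stand-in type, the `hash` field is assumed to be `u64`, and `From` is assumed to start from a zero hash (the PR's constructor may well compute the hash instead). `with_hash` is the hypothetical builder-style setter from the suggestion.

```rust
// Stand-in for DataFusion's `Expr`; only here to keep the sketch self-contained.
#[derive(Debug, PartialEq, Eq)]
pub struct Expr(pub String);

/// Simplified identifier: a hash plus a borrowed expression.
#[derive(Debug, PartialEq, Eq)]
pub struct Identifier<'n> {
    pub hash: u64, // assumed width
    pub expr: &'n Expr,
}

impl<'n> From<&'n Expr> for Identifier<'n> {
    /// Hypothetical constructor: borrows the expression and starts at hash 0.
    fn from(expr: &'n Expr) -> Self {
        Identifier { hash: 0, expr }
    }
}

impl<'n> Identifier<'n> {
    /// Hypothetical setter that lets a test pin the hash to a known value.
    pub fn with_hash(mut self, hash: u64) -> Self {
        self.hash = hash;
        self
    }
}
```

In a test, `Identifier::from(&c_plus_a).with_hash(0)` then replaces the four-line struct literal.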

alamb · 2024-05-15T19:10:42Z

I think @peter-toth plans to break this PR up into smaller ones, so I am marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know.

peter-toth · 2024-05-16T10:56:10Z

> I think @peter-toth plans to break this PR up into smaller ones, so marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know

Yes, here is the first part that adds the new TreeNode APIs: #10543

erratic-pattern · 2024-05-17T01:13:58Z

datafusion/expr/src/expr.rs

@@ -1389,6 +1390,201 @@ impl Expr {
            | Expr::Placeholder(..) => false,
        }
    }
+
+    pub fn hash_node(&self, hasher: &mut AHasher) {


What's the benefit of this vs. using the derived Hash impl? I think a comment explaining the differences might be useful.

peter-toth · 2024-05-17

Yeah, this is a good point. I will add some comments here explaining why CSE uses special hashing.

erratic-pattern · 2024-05-17T01:19:02Z

datafusion/expr/src/expr.rs

@@ -1389,6 +1390,201 @@ impl Expr {
            | Expr::Placeholder(..) => false,
        }
    }
+
+    pub fn hash_node(&self, hasher: &mut AHasher) {


Would it be easier to use std::mem::discriminant?

Suggested change

-    pub fn hash_node(&self, hasher: &mut AHasher) {
+    pub fn hash_node(&self, hasher: &mut AHasher) {
+        std::mem::discriminant(self).hash(hasher);

peter-toth · 2024-05-17

I'm not familiar with std::mem::discriminant, but according to its docs it doesn't take into account the data that an enum variant carries. We need to take that data into account (except for the subexpressions) to avoid hash collisions as much as we can.

E.g. in the case of Expr::BinaryExpr we want to take the operator into account, but not the left and right subexpressions: the identifiers of those subexpressions are calculated separately, and they contribute to the parent's identifier when we build it in CSE.

erratic-pattern · 2024-05-17

std::mem::discriminant(self).hash(hasher) would replace all of the hasher.write_u8(incrementing_number) calls with something that is less error-prone and more robust to API changes. The rest of the hashing logic would be unchanged.
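The two positions in this thread combine naturally, as the following sketch tries to show: the discriminant replaces the hand-numbered variant tag, and a `match` still hashes the node-local data while skipping children. The `Expr` and `Operator` types are hypothetical miniatures, and the standard library's `DefaultHasher` stands in for `AHasher` so the example needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::mem::discriminant;

#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq)]
pub enum Operator {
    Plus,
    Minus,
}

// Hypothetical miniature expression type; not DataFusion's `Expr`.
#[derive(Debug)]
pub enum Expr {
    Column(String),
    Negative(Box<Expr>),
    BinaryExpr { left: Box<Expr>, op: Operator, right: Box<Expr> },
}

impl Expr {
    /// Hashes this node only: the variant tag plus node-local data.
    /// Children are deliberately skipped; a caller that builds identifiers
    /// bottom-up would fold their hashes in separately.
    pub fn hash_node<H: Hasher>(&self, hasher: &mut H) {
        // Replaces a manual `hasher.write_u8(<variant number>)` call.
        discriminant(self).hash(hasher);
        match self {
            Expr::Column(name) => name.hash(hasher),
            Expr::Negative(_) => {}
            Expr::BinaryExpr { op, .. } => op.hash(hasher),
        }
    }
}

/// Convenience wrapper that returns the node-only hash as a `u64`.
pub fn node_hash(expr: &Expr) -> u64 {
    let mut hasher = DefaultHasher::new();
    expr.hash_node(&mut hasher);
    hasher.finish()
}
```

Under this scheme `a + b` and `c + d` get the same node hash because they differ only in their children, while `a + b` and `a - b` differ in the operator and so feed different data to the hasher.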

erratic-pattern · 2024-05-17T02:57:21Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

            {
                self.down_index += 1;
            }

            let expr_name = expr.display_name()?;
-            self.common_exprs.insert(expr_id.clone(), expr);
+            self.common_exprs.insert(expr_id, expr);
            // Alias this `Column` expr to it original "expr name",
            // `projection_push_down` optimizer use "expr name" to eliminate useless
            // projections.
            // TODO: do we really need to alias here?


Suggested change

-            // TODO: do we really need to alias here?

peter-toth · 2024-05-17

As far as I know, we still don't know exactly why this alias is needed. Please see this thread: #10396 (comment). I suspect that it is not needed in all cases...

erratic-pattern · 2024-05-17

My thinking behind removing this TODO is that, with explain verbose, it is easier to troubleshoot issues if you can see common_1234 as <expr> instead of just common_1234, but maybe I am confused about how it works.

erratic-pattern · 2024-05-17T03:00:33Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+
+impl From<Identifier<'_>> for String {
+    fn from(id: Identifier<'_>) -> Self {
+        format!("common_{}", id.hash)


Maybe a "cse" prefix is more specific than "common"?

Suggested change

-        format!("common_{}", id.hash)
+        format!("cse_{}", id.hash)

I also think this would be better as a method like .into_column_name() rather than a From impl, to indicate its purpose and to prevent it from being accidentally misused.

peter-toth · 2024-05-17

I'm still undecided whether we should use CSE identifiers for aliases too. It is OK to use them in CSE's data structures for the elimination logic: these identifiers also contain a reference to the Expr, so if there is a hash collision the equality check can save us. But as soon as we use only the hash part for aliasing, alias collisions can happen.
I feel we should probably inspect the schema and generate unique aliases based on the existing columns...
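The distinction drawn above can be sketched as follows. Inside a `HashMap`, an identifier that hashes only its precomputed `hash` but compares `expr` for equality stays correct under collisions; an alias built from the hash alone does not. The `unique_alias` helper is a hypothetical illustration of "inspect the existing columns", not something the PR contains, and `Expr` is again a stand-in type.

```rust
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

// Stand-in for DataFusion's `Expr`.
#[derive(Debug, PartialEq, Eq)]
pub struct Expr(pub String);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Identifier<'n> {
    pub hash: u64,
    pub expr: &'n Expr,
}

impl Hash for Identifier<'_> {
    // Only the precomputed hash feeds the hasher; `Eq` still compares `expr`,
    // so two colliding identifiers land in the same bucket but stay distinct keys.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

/// Counts how often each identifier occurs.
pub fn count_occurrences<'n>(ids: &[Identifier<'n>]) -> HashMap<Identifier<'n>, usize> {
    let mut counts = HashMap::new();
    for id in ids {
        *counts.entry(*id).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical alias generator: starts from `common_<hash>` and appends a
/// numeric suffix until the name does not clash with an existing column.
pub fn unique_alias(id: &Identifier<'_>, existing: &HashSet<String>) -> String {
    let base = format!("common_{}", id.hash);
    if !existing.contains(&base) {
        return base;
    }
    let mut n = 1;
    loop {
        let candidate = format!("{base}_{n}");
        if !existing.contains(&candidate) {
            return candidate;
        }
        n += 1;
    }
}
```

Two different expressions that both hash to 42 are counted separately by `count_occurrences`, yet both would be named `common_42` without the uniqueness check.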

erratic-pattern · 2024-05-17T03:10:24Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+    expr: &'n Expr,
+}
+
+const SEED: RandomState = RandomState::with_seeds(0, 0, 0, 0);


Do we need to worry about DoS here? The docs for with_seeds state that one of the numbers should be random if you need DoS resistance.

Some alternatives:

1. Use RandomState::new to automatically create the seed from a random source.

2. Change Identifier::new to take a Hasher as a second argument. The Hasher can then be instantiated on the CommonSubexprEliminate struct.

With the second option you can use ~~DefaultHasher::new~~ RandomState::new from the standard library, unless we specifically want to use ahash for some reason that I don't understand.
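A rough sketch of the second alternative, using only the standard library's `RandomState`. The `CseHasher` struct and its methods are hypothetical names standing in for state that would live on the optimizer rule. Hashes are consistent within one instance, which is all the elimination logic needs, but they are not stable across instances or runs.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

/// Hypothetical holder for the hashing state, created once per rule instance.
pub struct CseHasher {
    state: RandomState,
}

impl CseHasher {
    pub fn new() -> Self {
        // `RandomState::new` draws random keys instead of fixed seeds.
        Self { state: RandomState::new() }
    }

    /// Hashes any `Hash` value with this instance's keys.
    pub fn hash_of<T: Hash>(&self, value: &T) -> u64 {
        self.state.hash_one(value)
    }
}
```

One consequence worth weighing: anything derived from the hash and exposed to users, such as a `common_<hash>` alias, would change from run to run with random seeds.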

erratic-pattern · 2024-05-17T03:18:53Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+// fn expr_identifier<'n>(expr: &'n Expr, sub_expr_identifier: Identifier<'n>) -> Identifier<'n> {
+//     format!("{{{expr}{sub_expr_identifier}}}")
+// }


Suggested change

-// fn expr_identifier<'n>(expr: &'n Expr, sub_expr_identifier: Identifier<'n>) -> Identifier<'n> {
-//     format!("{{{expr}{sub_expr_identifier}}}")
-// }

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest labels May 12, 2024

peter-toth mentioned this pull request May 12, 2024

Make CommonSubexprEliminate faster by stop copying so many strings #10426

Open

peter-toth force-pushed the better-cse-identifier branch from f4f5383 to b89c56d Compare May 13, 2024 11:38

add reference visitor APIs

38beb1a

peter-toth force-pushed the better-cse-identifier branch 2 times, most recently from 8f9ffee to 878cc04 Compare May 14, 2024 09:17

implement hash based CSE identifier

4b0608c

peter-toth force-pushed the better-cse-identifier branch from 878cc04 to 4b0608c Compare May 14, 2024 11:04

alamb reviewed May 14, 2024

alamb mentioned this pull request May 14, 2024

Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination #9873

Open

alamb marked this pull request as draft May 15, 2024 19:10

erratic-pattern reviewed May 17, 2024

peter-toth mentioned this pull request May 21, 2024

Add reference visitor TreeNode APIs, change ExecutionPlan::children() and PhysicalExpr::children() return references #10543

Merged

This was referenced Jun 7, 2024

Rewrite CommonSubexprEliminate to avoid copies using TreeNode #10067

Closed

CSE shorthand alias #10868

Draft
