Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better CSE identifier #10473

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

peter-toth
Copy link
Contributor

This is a draft PR that implements the ideas from #10426 (comment).

Which issue does this PR close?

Closes #10426.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@peter-toth peter-toth force-pushed the better-cse-identifier branch 2 times, most recently from 8f9ffee to 878cc04 Compare May 14, 2024 09:17
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very exciting @peter-toth

@@ -204,6 +214,24 @@ pub trait TreeNode: Sized {
apply_impl(self, &mut f)
}

fn apply_ref<'n, F: FnMut(&'n Self) -> Result<TreeNodeRecursion>>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API would be helpful in other areas to avoid cloning -- I am very much in favor of adding it

Comment on lines +1287 to +1293
(
Identifier {
hash: 0,
expr: &c_plus_a,
},
(1, DataType::UInt32),
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could make these tests more readable with something like this or similar

Suggested change
(
Identifier {
hash: 0,
expr: &c_plus_a,
},
(1, DataType::UInt32),
),
(
Identifier::from(&c_plus_a).with_hash(0)
(1, DataType::UInt32),
),

or

Suggested change
(
Identifier {
hash: 0,
expr: &c_plus_a,
},
(1, DataType::UInt32),
),
(
make_identifier(&c_plus_a, 0),
(1, DataType::UInt32),
),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can add Identifier::from to the PR.

@alamb
Copy link
Contributor

alamb commented May 15, 2024

I think @peter-toth plans to break this PR up into smaller ones, so marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know

@alamb alamb marked this pull request as draft May 15, 2024 19:10
@peter-toth
Copy link
Contributor Author

I think @peter-toth plans to break this PR up into smaller ones, so marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know

Yes, here is the first part that adds the new TreeNode APIs: #10543

@@ -1389,6 +1390,201 @@ impl Expr {
| Expr::Placeholder(..) => false,
}
}

pub fn hash_node(&self, hasher: &mut AHasher) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the benefit of this vs using the derived Hash impl. I think a comment explaining the differences might be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a good point, I will add some comments here why CSE uses special hashing.

@@ -1389,6 +1390,201 @@ impl Expr {
| Expr::Placeholder(..) => false,
}
}

pub fn hash_node(&self, hasher: &mut AHasher) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be easier to use std::mem::discriminant?

Suggested change
pub fn hash_node(&self, hasher: &mut AHasher) {
pub fn hash_node(&self, hasher: &mut AHasher) {
std::mem::discriminant(self).hash(hasher);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with std::mem::discriminant but according to its docs it doesn't seem to take into account the data that an enum carries. But we need to take the enum's data into account (except for the subexpressions) to avoid hash collisions as much as we can.

E.g. in the case of Expr::BinaryExpr we want to take into account the operator, but we don't want to take into account the left and right subexpressions as the identifier of those subexpressions are calculated separately and those identifiers contribute to their parent's identifier when we build the parent's id in CSE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::mem::discriminant(self).hash(hasher) would replace all of the hasher.write_u8(incrementing_number) calls with something that is less error-prone and more robust to API changes. The rest of the hashing logic would be unchanged.

{
self.down_index += 1;
}

let expr_name = expr.display_name()?;
self.common_exprs.insert(expr_id.clone(), expr);
self.common_exprs.insert(expr_id, expr);
// Alias this `Column` expr to it original "expr name",
// `projection_push_down` optimizer use "expr name" to eliminate useless
// projections.
// TODO: do we really need to alias here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: do we really need to alias here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know we still don't know why this alias is exactly needed. Please see this thread here: #10396 (comment). I suspect that it is not needed in all cases...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts behind removing this TODO is that, with explain verbose, it is easier to troubleshoot issues if instead of common_1234 you can see common_1234 as <expr> but maybe I am confused about how it works.


impl From<Identifier<'_>> for String {
fn from(id: Identifier<'_>) -> Self {
format!("common_{}", id.hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "cse" prefix is more specific than "common"?

Suggested change
format!("common_{}", id.hash)
format!("cse_{}", id.hash)

I also think this would be better as a method like .into_column_name() rather than a From impl to indicate its purpose and to prevent it from being accidentally misused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still undecided if we should use CSE identifiers for aliases too. It is ok to use them in the data structures of CSE for the elimination logic as these identifiers contain a reference to Expr too so in case we have hash collision the equality check can save us, but as soon as we use only the hash part for aliasing alias collision can happen.
I feel we should probably inspect the schema and generate unique aliases based on the existing columns...

expr: &'n Expr,
}

const SEED: RandomState = RandomState::with_seeds(0, 0, 0, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to worry about DOS here? docs for with_seeds states that one of the number should be random if you need DOS resistance.

Some alternatives:

  • use RandomState::new to automatically create seed from a random source
  • change Identifier::new to take a Hasher as a second argument. Then the Hasher can be instantiated on the CommonSubexprEliminate struct.

With second option you can use DefaultHasher::newRandomState::new from the standard library. Unless we want to use specifically ahash for some reason that I don't understand.

Comment on lines +771 to +773
// fn expr_identifier<'n>(expr: &'n Expr, sub_expr_identifier: Identifier<'n>) -> Identifier<'n> {
// format!("{{{expr}{sub_expr_identifier}}}")
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// fn expr_identifier<'n>(expr: &'n Expr, sub_expr_identifier: Identifier<'n>) -> Identifier<'n> {
// format!("{{{expr}{sub_expr_identifier}}}")
// }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make CommonSubexprEliminate faster by stop copying so many strings
3 participants