-
Notifications
You must be signed in to change notification settings - Fork 992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better CSE identifier #10473
base: main
Are you sure you want to change the base?
Better CSE identifier #10473
Conversation
f4f5383
to
b89c56d
Compare
8f9ffee
to
878cc04
Compare
878cc04
to
4b0608c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very exciting @peter-toth
@@ -204,6 +214,24 @@ pub trait TreeNode: Sized { | |||
apply_impl(self, &mut f) | |||
} | |||
|
|||
fn apply_ref<'n, F: FnMut(&'n Self) -> Result<TreeNodeRecursion>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API would be helpful in other areas to avoid cloning -- I am very much in favor of adding it
( | ||
Identifier { | ||
hash: 0, | ||
expr: &c_plus_a, | ||
}, | ||
(1, DataType::UInt32), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could make these tests more readable with something like this or similar
( | |
Identifier { | |
hash: 0, | |
expr: &c_plus_a, | |
}, | |
(1, DataType::UInt32), | |
), | |
( | |
Identifier::from(&c_plus_a).with_hash(0) | |
(1, DataType::UInt32), | |
), |
or
( | |
Identifier { | |
hash: 0, | |
expr: &c_plus_a, | |
}, | |
(1, DataType::UInt32), | |
), | |
( | |
make_identifier(&c_plus_a, 0), | |
(1, DataType::UInt32), | |
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can add Identifier::from
to the PR.
I think @peter-toth plans to break this PR up into smaller ones, so marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know |
Yes, here is the first part that adds the new |
@@ -1389,6 +1390,201 @@ impl Expr { | |||
| Expr::Placeholder(..) => false, | |||
} | |||
} | |||
|
|||
pub fn hash_node(&self, hasher: &mut AHasher) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the benefit of this vs using the derived Hash
impl. I think a comment explaining the differences might be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a good point, I will add some comments here why CSE uses special hashing.
@@ -1389,6 +1390,201 @@ impl Expr { | |||
| Expr::Placeholder(..) => false, | |||
} | |||
} | |||
|
|||
pub fn hash_node(&self, hasher: &mut AHasher) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be easier to use std::mem::discriminant
?
pub fn hash_node(&self, hasher: &mut AHasher) { | |
pub fn hash_node(&self, hasher: &mut AHasher) { | |
std::mem::discriminant(self).hash(hasher); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with std::mem::discriminant
but according to its docs it doesn't seem to take into account the data that an enum carries. But we need to take the enum's data into account (except for the subexpressions) to avoid hash collisions as much as we can.
E.g. in the case of Expr::BinaryExpr
we want to take into account the operator, but we don't want to take into account the left and right subexpressions as the identifier of those subexpressions are calculated separately and those identifiers contribute to their parent's identifier when we build the parent's id in CSE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::mem::discriminant(self).hash(hasher)
would replace all of the hasher.write_u8(incrementing_number)
calls with something that is less error-prone and more robust to API changes. The rest of the hashing logic would be unchanged.
{ | ||
self.down_index += 1; | ||
} | ||
|
||
let expr_name = expr.display_name()?; | ||
self.common_exprs.insert(expr_id.clone(), expr); | ||
self.common_exprs.insert(expr_id, expr); | ||
// Alias this `Column` expr to it original "expr name", | ||
// `projection_push_down` optimizer use "expr name" to eliminate useless | ||
// projections. | ||
// TODO: do we really need to alias here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO: do we really need to alias here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know we still don't know why this alias is exactly needed. Please see this thread here: #10396 (comment). I suspect that it is not needed in all cases...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts behind removing this TODO is that, with explain verbose
, it is easier to troubleshoot issues if instead of common_1234
you can see common_1234 as <expr>
but maybe I am confused about how it works.
|
||
impl From<Identifier<'_>> for String { | ||
fn from(id: Identifier<'_>) -> Self { | ||
format!("common_{}", id.hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "cse" prefix is more specific than "common"?
format!("common_{}", id.hash) | |
format!("cse_{}", id.hash) |
I also think this would be better as a method like .into_column_name()
rather than a From
impl to indicate its purpose and to prevent it from being accidentally misused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still undecided if we should use CSE identifiers for aliases too. It is ok to use them in the data structures of CSE for the elimination logic as these identifiers contain a reference to Expr
too so in case we have hash collision the equality check can save us, but as soon as we use only the hash part for aliasing alias collision can happen.
I feel we should probably inspect the schema and generate unique aliases based on the existing columns...
expr: &'n Expr, | ||
} | ||
|
||
const SEED: RandomState = RandomState::with_seeds(0, 0, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to worry about DOS here? docs for with_seeds
states that one of the number should be random if you need DOS resistance.
Some alternatives:
- use
RandomState::new
to automatically create seed from a random source - change
Identifier::new
to take aHasher
as a second argument. Then theHasher
can be instantiated on theCommonSubexprEliminate
struct.
With second option you can use DefaultHasher::new
RandomState::new
from the standard library. Unless we want to use specifically ahash
for some reason that I don't understand.
// fn expr_identifier<'n>(expr: &'n Expr, sub_expr_identifier: Identifier<'n>) -> Identifier<'n> { | ||
// format!("{{{expr}{sub_expr_identifier}}}") | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// fn expr_identifier<'n>(expr: &'n Expr, sub_expr_identifier: Identifier<'n>) -> Identifier<'n> { | |
// format!("{{{expr}{sub_expr_identifier}}}") | |
// } |
This is a draft PR that implements the ideas from #10426 (comment).
Which issue does this PR close?
Closes #10426.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?