-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Implement basic common subexpression eliminate optimization #792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Great @waynexia let me know when it's ready for review! In the TCP-H benchmarks there are also a few that could benefit from CSE, so might make sense to inspect a few of those plans if it's implemented. Although it might not be affecting performance that much for those queries maybe. |
|
Thanks! I'm also curious to see how much impact CSE will bring. I'll try to gather the TCP-H bench result once I finish this. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking cool @waynexia - I didn't have a chance to go through the identifier bit but it looks reasonable.
|
The implementation is almost finished. I plan to add more tests and make all cases pass next. I tried to run the tpc-h benchmark and got a bad result 😕 with cse:
master:
|
Could be within the noise. Maybe you can check / log whether the optimization is applied at all and for which query? |
|
This is the result of Q1, run with this command:
I've run it a couple of times and can observe a stable difference of ~100ms.
|
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
| } => Expr::InList { | ||
| expr: rewrite_boxed(expr, rewriter)?, | ||
| list, | ||
| list: rewrite_vec(list, rewriter)?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an accident.
Since this PR now contains two changes about "expr" (this one and RewriteRecursion below), perhaps it would be clearer to move these changes to another PR?
cc @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree two PRs is almost always clearer :)
please do let me know when you would like another look at this PR
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
|
Sorry for the late update. I'm afraid I got some problems when trying to fix the last two failed cases (Logs here) . They failed in the same place where the result table's column name doesn't match. The optimized will produce a column with qualified name which isn't expected. Like When performing common subexpression eliminate optimization, the common expression will be replaced by a single And in many other places we will use an expr's name generated by Everything works as expected so far. It becomes wired when comes to physical plan phase. In this phase, And for the root cause here, I've tried a few ways to work around this but they end up breaking other parts. I haven't come up with an elegant solution so far. It's really appreciated if someone could put up some discussion on this. |
|
Let me propose some relatively "big" changes:
I haven't tried these, just put them into the discussion as some of them might be unrealistic. |
|
Hi @waynexia -- I plan to review your question/comments carefully tomorrow |
|
Hi @waynexia -- I noticed that the failures in I think @houqp might be the one with the most knowledge of aliasing / fully qualified column names at the moment. |
|
I will take a closer look at it this weekend, my hunch on this is we only want to strip qualifier to direct column projections, i.e. |
AFAIK, It's general problem. A SQL without window function can also run into this problem, like SELECT AVG(c1 + 1), SUM(c1 + 1) FROM test GROUP BY c2;it is reproducible if there is some common sub-expression for this optimizer to eliminate, without the need to have a window expression.
The current implementation of |
Yes, this is done mainly for direct column projections so relations are not saved as part of the output. Take the following query as an example: SELECT table1.id FROM table1If we don't strip out the relation from the output, saving the output to a csv file will result in a column named However, for projection on compound expressions, this is not a problem because In short, I think it should be fine to include qualifiers in compound projection columns. The only thing is we need to make sure is the behavior should be consistent before and after the subexpression elimination rule has been applied. |
Agree. This optimize won't touch a single column expr, so only those with extra operators are affected by rewriting. And since a column name like |
|
Sounds good @waynexia , thanks for taking care of this! If you don't mind, it would be great if you could also add the new proposed semantic to the specification at https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md. |
|
Hi @waynexia -- I am sorry -- I have been on vacation for a while and am now catching up. I will try and review this over the next few days. I think the python test has been fixed, so if you rebase this PR against master the tests should all pass now |
|
Thanks, @alamb! No problem, hope you had a great vacation 😃 I have merged the master, let's see how CI goes. I'm not sure if the document and test are concrete. Please let me know if anything is not clear :) |
|
Well, it still fails... 😢 |
|
Actually, never mind, it is related :P |
|
OK, so now we are running into a limitation in datafusion. Datafusion doesn't support filtering on unprojected columns yet. The rule here optimized the compound expressions into a single dummy projected column, which removes the individual column projection. I bet if you change it to the following it would work: df = df.select(
f.col("a"),
f.col("a") + f.col("b"),
f.col("a") - f.col("b"),
).filter(f.col("a") > f.lit(2)) This means the impact of this change is bigger than we expected and could break queries that are working today. The design and implementation of this change is correct, it's just it exposed a limitation of datafusion to broader range of queries :( |
|
Sorry for the late reply. I haven't set up the test environment for python yet so I tried a SQL like this and got the expected result. SELECT c1 + c2, c1 * c2 from test where c1 > 2;I'll run the python test on local later. But for this case, I'm not sure whether it is related to this change. As |
|
You are right @waynexia , the optimization rule shouldn't have kicked in for this query 🤔 |
|
It's also interesting that the adding an explicit column df = df.select(
f.col("a") + f.col("b"),
f.col("a") - f.col("b"),
f.col("a"),
).filter(f.col("a") > f.lit(2)) I am not able to reproduce this error with the Rust SQL API though. |
|
@waynexia I forced a stack trace using panic to pin point the issue: More specifically, this line: let data_type = predicate.get_type(input.schema())?;I think we can't just use the input schema here. For this failing query, the input (the projection plan) only contains those two compound expressions. You may want to use Would be good if we can create a reproducible test case using Rust's dataframe API first. |
|
Brilliant job @houqp! Appreciate for pining this down. I'll look at this tonight. |
|
Hi @houqp, here are some updates. The reproducer with rust dataframe API is here. IMO the problem we are facing is that I changed the behavior of getting schema in 64330bd. It now will get all the schemas under the |
|
Sorry for the delay @waynexia . I think going through all the schemas from the input tree is the right thing to do here. Filter and other plans like Projection should be able to access all fields from its full input query plan tree, not just the immediate query plan node. I do agree that we currently don't have a convenient way to access the full schema without asking the user to manually write a schema merge operation or changing the consumer code to work with the schema slice instead of merged schema. I think this ergonomics problem is something we can address in a separate PR. On top of this, I think it would be beneficial to also reorder the optimizer run to run the filter push down before this optimization rule for a slight performance gain. But I think the subexpression elimination optimizer should work by itself too without having a hard dependency on the filter push down optimizer rule. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an epic PR @waynexia - very nicely done. I went through the changes and tests carefully and they look good to me.
I also tried running IOx (and its tests) against this branch and they all passed which is increases my confidence for it being correct.
🏅
|
Thank you all! You do help me a lot in carrying this out ❤️ |
|
Thank you @waynexia for being patient with this PR, this is a really cool optimization rule. |
|
Thanks @waynexia ! |
Which issue does this PR close?
Resolve #566.
Rationale for this change
This pull request trying to implement the common subexpression eliminate optimization. The current implementation only supports detecting & eliminating common subexpressions under one logical plan, not cross plans.
This optimizes is consists of two parts: detecting and replacing. It first scans all expressions we are going to rewrite (all expressions under one plan here), generates identifiers for each
Exprnode for later comparison. Then for the same group of expressions, we extract common expressions, put them into aProjectionplan, and replace the original expression with a columnExpr.This implementation doesn't cover the situation of eliminating the whole plan's common subexpression. This requires considering not only the expression itself, but also sharing data with different plans or some plan will modify the data (e.g. filter) or other aspects I haven't realize. In the beginning I plan to make it cross plans but it seems better to me to do them in another PR.
What changes are included in this PR?
Add optimize rule
CommonSubexprEliminate.Are there any user-facing changes?
I think there is, as the optimization rule will rewrite input SQL.
Status
This is still a work in progress. I have only covered part of plans and am trying to make it work with the entire execution procedure.