Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consolidate all the InList rewrites into a single pass #9140

Closed
guojidan opened this issue Feb 6, 2024 · 14 comments · Fixed by #9692
Closed

consolidate all the InList rewrites into a single pass #9140

guojidan opened this issue Feb 6, 2024 · 14 comments · Fixed by #9692
Labels
good first issue Good for newcomers

Comments

@guojidan
Copy link
Contributor

guojidan commented Feb 6, 2024

          Thank you @guojidan  -- this looks good to me. I think as a follow on PR it would be nice to consolidate all the `InList` rewrites into a single pass (rather than `ShortenInListSimplifier` and `InListSimplifier` but this PR seems like a good step in that direction

Thanks again

Originally posted by @alamb in #9037 (review)

@alamb alamb added the good first issue Good for newcomers label Feb 6, 2024
@alamb
Copy link
Contributor

alamb commented Feb 6, 2024

I am not sure if @guojidan plans to work on this one. If not I think it would make a good first issue for someone

@guojidan
Copy link
Contributor Author

guojidan commented Feb 6, 2024

I don't have much time recently because the Spring Festival holiday has arrived. So If everyone is interested, let's implement this 😄

@manoj-inukolunu
Copy link
Contributor

@guojidan @alamb , would love to help here.thx. 😄

@manoj-inukolunu
Copy link
Contributor

manoj-inukolunu commented Feb 11, 2024

Hello @alamb @guojidan , I was working through a PR for this . I have a question on ShortenInListSimplifier what is the significance of THRESHOLD_INLINE_INLIST ? and why is it limited to 3? Thank you.

@alamb
Copy link
Contributor

alamb commented Feb 11, 2024

I think it is a heuristic that avoids pathalogical cases when there are large numbers constants in the IN list (sometimes machine generated SQL can have 1000s of elements)

For example:

https://github.com/apache/arrow-datafusion/blob/afb169cd069e0227fb0ef6d39f44d5eabbdc21a2/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L53-L59

(it would be great if you had a chance to add some comments with rationale)

@manoj-inukolunu
Copy link
Contributor

manoj-inukolunu commented Feb 14, 2024

Thank you @alamb , I will add a comment. I attempted to do the in list simplifier in one pass . I am not really sure if its possible in a single pass. Here is the test which is failing if i try that .

// c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) AND c1 IN (3,6) -> c1 = 3 OR c1 = 6

After the first pass it simplifies to

c1 IN (3,6)

After the ShortenInListSimplifier it simplifies to

 c1 = 3 OR c1 = 6

If i try to combile the ShortenInListSimplifier with the InListSimplifier , in the first pass itself

c1 IN (3,6)

is getting simplified to c1 = 3 OR c1 = 6 according to the ShortenInListSimplifier which is breaking the test . Additionaly i attempted a benchmark , the additional pass seems to be adding around 6% overhead. There are additional tests too which are failing too if we try a single pass although this is the one which majorly stood out.

@alamb
Copy link
Contributor

alamb commented Feb 20, 2024

If i try to combile the ShortenInListSimplifier with the InListSimplifier , in the first pass itself

Ths sounds like the logic for the two simplifications is being invoked in a different order (or maybe only one is being invoked? 🤔 ). Could we perhaps invoke both simplications but invoke the code to simplify first and then the code to shorten?

Additionaly i attempted a benchmark , the additional pass seems to be adding around 6% overhead.

That is good data and more reason I think to consolidate the passes. Thank you

@guojidan
Copy link
Contributor Author

hi @manoj-inukolunu , ShortenInListSimplifier should be executed after InListSimplifier has executed. one Expr will be executed recursively at each Simplifier
example:

c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) AND c1 IN (3,6)

InListSimplifier

  1. c1 IN (1,2,3,4,5,6) -> c1 IN (1,2,3,4,5,6)
  2. c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  3. c1 IN (3,6) -> c1 IN (3,6)
  4. c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  5. c1 IN (1,3,5,6) AND c1 IN (3,6)
  6. c1 IN (3,6)

ShortenInListSimplifier
8. c1 = 3 OR c1 = 6

if we combile the ShortenInListSimplifier with the InListSimplifier, it will become below workflow:

  1. c1 IN (1,2,3,4,5,6) -> c1 IN (1,2,3,4,5,6)
  2. c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  3. c1 IN (3,6) -> c1 = 3 OR c1 = 6
  4. c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  5. c1 IN (1,3,5,6) AND c1 = 3 OR c1 = 6

maybe you can view older version inlist simplifier logical(before my pr), I can got more inspiration

@guojidan
Copy link
Contributor Author

So, the key of this issue is how to determine the time to execute ShortenInListSimplifier, Separate execution is the simplest 🤔

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 15, 2024

@manoj-inukolunu @guojidan Is anyone working on this currently? If not, I will take a a look on this issue.

@guojidan
Copy link
Contributor Author

@manoj-inukolunu @guojidan Is anyone working on this currently? If not, I will take a a look on this issue.

I am not working on this, thank you

@alamb
Copy link
Contributor

alamb commented Mar 15, 2024

FYI @jayzhan211 I had started playing with this at some point but didn't finish

One approach might be to port individual parts of the inlist rewrite in multiple PRs (the entire logic was quite tricky, so doing it incrementally would likely be easier to code and easier to review)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 15, 2024

port individual parts of the inlist rewrite in multiple PRs

@alamb Are you talking about removing cloned or moving short inlist into orlist simplifier

I'm looking into the issue of moving short inlist, and I found that it may not able to move into the inlist simplifier. short inlist into orlist simplifier should be started after the other 3 rules have been done recursively But, it does not seem there is a way to enforce the order in one f_up function. Do you have an idea to merge these rules?

@alamb
Copy link
Contributor

alamb commented Mar 15, 2024

@alamb Are you talking about removing cloned or moving short inlist into orlist simplifier

I was thinking in general to simplify the problem by moving as many rules as possible into the large match statement here

https://github.com/apache/arrow-datafusion/blob/81b0a011705b17a09f494f550a5382b0c3414597/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L630

short inlist into orlist simplifier should be started after the other 3 rules have been done recursively But, it does not seem there is a way to enforce the order in one f_up function.

I agree -- f_up gets run on each Expr of the tree so you can't express "apply three rules after this other"

So maybe we could move the three rules first into the large match and then figure out what to do with the short inlist into orlist simplifier rule

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants