[Improvement]: Perform file filtering as early as possible during the optimizing plan process #1883

Closed
3 tasks done
Tracked by #1930
wangtaohz opened this issue Aug 24, 2023 · 0 comments · Fixed by #1886

Comments

@wangtaohz
Contributor

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Currently, the optimizing plan evaluates all files first, then splits tasks, and performs file filtering after task splitting. This is not intuitive and can cause several issues, such as:

  • The evaluation result differs from the result after task splitting, so the pending data differs from the data that is actually optimized
  • Bin-packing during task splitting is inaccurate because some of the packed files are filtered out afterwards
  • Determining which DataFile files were involved in the optimizing requires recalculating after splitting tasks, which hurts performance

How should we improve?

We should filter files as early as possible: filtering should be done before splitting tasks and not be done again after.
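
The difference between the two orderings can be sketched as follows. This is only an illustration under assumptions: `DataFile`, `bin_pack`, `plan_filter_late`, `plan_filter_early`, and the `needs_optimizing` predicate are hypothetical names, and the greedy first-fit packing stands in for whatever bin-packing the planner really uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a table data file, not the project's class.
    path: str
    size: int


Task = List[DataFile]


def bin_pack(files: List[DataFile], target_size: int) -> List[Task]:
    """Greedy first-fit packing: largest files first, each task <= target_size."""
    tasks: List[Task] = []
    for f in sorted(files, key=lambda x: x.size, reverse=True):
        for task in tasks:
            if sum(x.size for x in task) + f.size <= target_size:
                task.append(f)
                break
        else:
            # No existing task has room, so open a new one.
            tasks.append([f])
    return tasks


def plan_filter_late(files: List[DataFile],
                     needs_optimizing: Callable[[DataFile], bool],
                     target_size: int) -> List[Task]:
    """Ordering described as current: split first, filter each task afterwards."""
    tasks = bin_pack(files, target_size)
    filtered = [[f for f in task if needs_optimizing(f)] for task in tasks]
    return [task for task in filtered if task]


def plan_filter_early(files: List[DataFile],
                      needs_optimizing: Callable[[DataFile], bool],
                      target_size: int) -> List[Task]:
    """Proposed ordering: filter once, then split only the surviving files."""
    selected = [f for f in files if needs_optimizing(f)]
    return bin_pack(selected, target_size)


# Assumed rule for the demo: only files smaller than 50 need optimizing.
files = [
    DataFile("big-1", 60), DataFile("big-2", 60),
    DataFile("small-1", 40), DataFile("small-2", 40), DataFile("small-3", 20),
]
is_small = lambda f: f.size < 50

late = plan_filter_late(files, is_small, target_size=100)
early = plan_filter_early(files, is_small, target_size=100)
print([[f.size for f in t] for t in late])   # [[40], [40], [20]]
print([[f.size for f in t] for t in early])  # [[40, 40, 20]]
```

In this toy input both orderings pick the same three files, but filtering late leaves three under-filled tasks because the packing reserved space for files that were dropped afterwards. Filtering early packs the same files into a single full task, and the set of files seen at evaluation time is the set that is actually optimized.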

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct
