[Rust] [DataFusion] Optimize nested joins #26887

Closed

asfimport opened this issue Dec 18, 2020 · 2 comments

Labels

Component: Rust - DataFusion Type: enhancement

asfimport commented Dec 18, 2020

Once #8961 is merged, we will have an optimization for a JOIN that operates on two tables.

The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in #8961 for context.

Reporter: Andy Grove / @andygrove

Note: This issue was originally created as ARROW-10964. Please see the migration documentation for further details.

asfimport commented Dec 22, 2020

Daniël Heres / @Dandandan:
Found some nice material from Spark on this:
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

Basically, the idea is to use column-level statistics such as:

min/max
number of distinct values
null count

to estimate, e.g., the selectivity of a filter.
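
As a rough illustration of how these statistics could feed a selectivity estimate, here is a minimal sketch. It assumes values are uniformly distributed between min and max, which is a simplification. The `ColumnStats` struct and its method names are hypothetical and are not part of DataFusion's API.

```rust
/// Hypothetical column-level statistics (names are illustrative only,
/// not DataFusion's actual types).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: f64,
    max: f64,
    distinct_count: u64,
    null_count: u64,
    row_count: u64,
}

impl ColumnStats {
    /// Fraction of rows whose value is not NULL.
    fn non_null_fraction(&self) -> f64 {
        if self.row_count == 0 {
            return 0.0;
        }
        1.0 - self.null_count as f64 / self.row_count as f64
    }

    /// Estimated selectivity of `col = value`.
    /// Assumption: every distinct value is equally frequent.
    fn selectivity_eq(&self, value: f64) -> f64 {
        if self.distinct_count == 0 || value < self.min || value > self.max {
            return 0.0;
        }
        self.non_null_fraction() / self.distinct_count as f64
    }

    /// Estimated selectivity of `col < value`.
    /// Assumption: values are spread uniformly over [min, max].
    fn selectivity_lt(&self, value: f64) -> f64 {
        if value <= self.min {
            return 0.0;
        }
        if value > self.max {
            return self.non_null_fraction();
        }
        // Here min < value <= max, so the denominator is positive.
        self.non_null_fraction() * (value - self.min) / (self.max - self.min)
    }
}
```

Multiplying a selectivity by the table's row count gives an estimated output row count for the filter, which is the number a planner would compare across alternative plans.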

Also, there is a formula for (inner) join cardinality, where IJ denotes the inner join and k is the join key:

num(A IJ B) = num(A)*num(B)/max(distinct(A.k),distinct(B.k))
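
The formula above can be sketched directly in code. The function name and signature below are hypothetical (not taken from DataFusion or Spark), and the zero-distinct-values guard is an added assumption for empty inputs.

```rust
/// Estimated row count of an inner equi-join of A and B on key k:
/// num(A) * num(B) / max(distinct(A.k), distinct(B.k)).
/// Hypothetical helper for illustration only.
fn estimate_inner_join_rows(
    rows_a: u64,
    rows_b: u64,
    distinct_a: u64,
    distinct_b: u64,
) -> u64 {
    let max_distinct = distinct_a.max(distinct_b);
    if max_distinct == 0 {
        // Assumption: no distinct key values means no matching rows.
        return 0;
    }
    // Widen to u128 so the product of two large row counts cannot overflow.
    let product = rows_a as u128 * rows_b as u128;
    (product / max_distinct as u128) as u64
}
```

For example, `estimate_inner_join_rows(1000, 500, 100, 50)` evaluates to 1000 * 500 / 100 = 5000. For nested joins, such an estimate could be applied to each candidate pair in turn, feeding the estimated output size into the next join's estimate.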

asfimport commented Apr 26, 2021

Andrew Lamb / @alamb:
Migrated to github: apache/datafusion#128

asfimport closed this as completed