[Rust] [DataFusion] Optimize nested joins #26887

Closed
asfimport opened this issue Dec 18, 2020 · 2 comments

@asfimport

Once #8961 is merged, we will have an optimization for a JOIN that operates on two tables.

The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in #8961 for context.

 

Reporter: Andy Grove / @andygrove

Note: This issue was originally created as ARROW-10964. Please see the migration documentation for further details.

@asfimport

Daniël Heres / @Dandandan:
Found some nice material from Spark on this:
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

The basic idea is to use column-level statistics such as:

  • min/max

  • number of distinct values

  • null count

to estimate, for example, the selectivity of a filter.

There is also a formula for (inner) join cardinality, where IJ denotes an inner join and k is the join key:

num(A IJ B) = num(A)*num(B)/max(distinct(A.k),distinct(B.k))
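
As a rough illustration of the formula above, here is a minimal Rust sketch. The `ColumnStats` struct and the `estimate_inner_join_rows` function are hypothetical names made up for this example and are not part of the DataFusion API; the sketch also assumes the statistics are exact and that the join is on a single key column.

```rust
/// Statistics for the join-key column of one input table.
/// Hypothetical type for illustration only, not a DataFusion struct.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    /// Total number of rows in the table: num(A).
    pub row_count: u64,
    /// Number of distinct values in the join key: distinct(A.k).
    pub distinct_count: u64,
}

/// Estimate the number of rows produced by an inner join using
/// num(A) * num(B) / max(distinct(A.k), distinct(B.k)).
pub fn estimate_inner_join_rows(left: ColumnStats, right: ColumnStats) -> u64 {
    let max_distinct = left.distinct_count.max(right.distinct_count);
    if max_distinct == 0 {
        // No distinct key values on either side (e.g. both inputs empty):
        // treat the join as producing no rows instead of dividing by zero.
        return 0;
    }
    // Multiply in u128 so the product of two u64 row counts cannot overflow.
    let product = (left.row_count as u128) * (right.row_count as u128);
    let estimate = product / (max_distinct as u128);
    // Saturate at u64::MAX if the estimate does not fit in u64.
    estimate.min(u64::MAX as u128) as u64
}
```

For example, joining a 1,000-row table with 1,000 distinct keys against a 5,000-row table with 1,000 distinct keys gives an estimate of 1,000 * 5,000 / 1,000 = 5,000 rows. For nested joins, an estimate like this could be compared across candidate join orders, although how to propagate statistics through intermediate joins is left open here.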

@asfimport

Andrew Lamb / @alamb:
Migrated to github: apache/datafusion#128
