Implement semi/anti join output statistics estimation #9800

korowa · 2024-03-25T21:17:38Z

Which issue does this PR close?

Closes #.

Rationale for this change

While #9676 found that NLJoin input reordering in benchmarks doesn't happen if one of inputs is AntiJoin -- the reason is that output statistics for all Semi/Anti join types is always set as Absent. This PR propagates "outer" input statistics to Anti/Semi join output.

What changes are included in this PR?

"outer" input statistics for Anti/Semi join cases is now propagated as output statistics (worst case assumption -- filtering not occurs at all, an alternative might be either to use default_filter_statistics or use similar to regular joins algorithm based on estimated distinct values, which though could be misleading as it assumes that "all values from smaller side are present in larger side" -- not sure if it's fine for filter estimation)
the only exception is disjoint inputs for semi-joins case -- if inputs statistics values are non-overlapping, the output cardinality will be estimated as zero
avoid early returns on absent statistics value while non-overlapping statistics checks

Are these changes tested?

Added test coverage for SemiJoin cardinality estimation

Are there any user-facing changes?

Execution plans containing Semi/Anti joins might change

Dandandan · 2024-03-26T07:34:10Z

/benchmark

github-actions · 2024-03-26T07:44:43Z

Benchmark results

Benchmarks comparing 349c586 (main) and 29f1cca (PR)

Comparing 349c586 and 29f1cca
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  349c586 ┃  29f1cca ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 442.56ms │ 447.23ms │    no change │
│ QQuery 2     │  57.74ms │  58.90ms │    no change │
│ QQuery 3     │ 145.56ms │ 149.88ms │    no change │
│ QQuery 4     │  89.29ms │ 121.83ms │ 1.36x slower │
│ QQuery 5     │ 201.46ms │ 247.68ms │ 1.23x slower │
│ QQuery 6     │ 105.37ms │ 110.09ms │    no change │
│ QQuery 7     │ 284.15ms │ 291.72ms │    no change │
│ QQuery 8     │ 198.73ms │ 202.66ms │    no change │
│ QQuery 9     │ 310.13ms │ 313.32ms │    no change │
│ QQuery 10    │ 234.28ms │ 251.45ms │ 1.07x slower │
│ QQuery 11    │  62.78ms │  64.24ms │    no change │
│ QQuery 12    │ 129.96ms │ 130.64ms │    no change │
│ QQuery 13    │ 177.83ms │ 187.69ms │ 1.06x slower │
│ QQuery 14    │ 128.79ms │ 135.48ms │ 1.05x slower │
│ QQuery 15    │ 192.30ms │ 197.11ms │    no change │
│ QQuery 16    │  52.21ms │  51.71ms │    no change │
│ QQuery 17    │ 313.30ms │ 328.26ms │    no change │
│ QQuery 18    │ 455.63ms │ 469.28ms │    no change │
│ QQuery 19    │ 236.01ms │ 237.34ms │    no change │
│ QQuery 20    │ 194.17ms │ 202.15ms │    no change │
│ QQuery 21    │ 335.41ms │ 334.95ms │    no change │
│ QQuery 22    │  54.90ms │  55.99ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (349c586)   │ 4402.55ms │
│ Total Time (29f1cca)   │ 4589.59ms │
│ Average Time (349c586) │  200.12ms │
│ Average Time (29f1cca) │  208.62ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         5 │
│ Queries with No Change │        17 │
└────────────────────────┴───────────┘

korowa · 2024-03-26T18:20:54Z

Checked on the fork (1 and 2) for the same base/PR commits -- the results don't seem to be reliable

Dandandan · 2024-03-26T18:23:14Z

/benchmark

Dandandan · 2024-03-26T18:29:49Z

Checked on the fork (1 and 2) for the same base/PR commits -- the results don't seem to be reliable

Yes, it seems not too reliable at the moment unfortunately. Based on that, seems there is no significant difference.
I started another run for a last check.

github-actions · 2024-03-26T18:34:04Z

Benchmark results

Benchmarks comparing 39f4aaf (main) and 29f1cca (PR)

Comparing 39f4aaf and 29f1cca
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  39f4aaf ┃  29f1cca ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 446.24ms │ 505.53ms │ 1.13x slower │
│ QQuery 2     │  58.17ms │  77.06ms │ 1.32x slower │
│ QQuery 3     │ 148.50ms │ 151.10ms │    no change │
│ QQuery 4     │  89.58ms │  90.13ms │    no change │
│ QQuery 5     │ 204.90ms │ 209.82ms │    no change │
│ QQuery 6     │ 110.61ms │ 109.62ms │    no change │
│ QQuery 7     │ 281.84ms │ 293.58ms │    no change │
│ QQuery 8     │ 194.27ms │ 203.05ms │    no change │
│ QQuery 9     │ 319.48ms │ 312.89ms │    no change │
│ QQuery 10    │ 249.44ms │ 245.80ms │    no change │
│ QQuery 11    │  62.59ms │  63.01ms │    no change │
│ QQuery 12    │ 127.26ms │ 127.85ms │    no change │
│ QQuery 13    │ 181.71ms │ 191.56ms │ 1.05x slower │
│ QQuery 14    │ 132.04ms │ 132.48ms │    no change │
│ QQuery 15    │ 212.20ms │ 209.90ms │    no change │
│ QQuery 16    │  51.83ms │  51.61ms │    no change │
│ QQuery 17    │ 326.16ms │ 350.28ms │ 1.07x slower │
│ QQuery 18    │ 451.92ms │ 482.47ms │ 1.07x slower │
│ QQuery 19    │ 236.19ms │ 237.95ms │    no change │
│ QQuery 20    │ 191.50ms │ 199.72ms │    no change │
│ QQuery 21    │ 326.33ms │ 329.98ms │    no change │
│ QQuery 22    │  54.79ms │  53.14ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (39f4aaf)   │ 4457.57ms │
│ Total Time (29f1cca)   │ 4628.56ms │
│ Average Time (39f4aaf) │  202.62ms │
│ Average Time (29f1cca) │  210.39ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         5 │
│ Queries with No Change │        17 │
└────────────────────────┴───────────┘

Dandandan · 2024-03-26T18:38:05Z

Queries running slower seem inconsistent again :)

korowa · 2024-03-26T18:45:52Z

Probably this is the case described here as a concern regarding GH runners.

korowa · 2024-03-26T18:50:15Z

On the other side, as far as I understand, at this moment this benchmarking is aimed to detect larger scale performance regressions like x3+ (while default benchmark report has 5% diff threshold as "no change")

alamb

Thank you @korowa -- other than the anti-join calculation this PR looks good to me.

In response to @Dandandan 's comments about potential performance degredation: In general I think our join ordering optimization / statistics are fairly simplistic

Rather than trying to make the built on models for datafusion more and more sophisticated, I think a better longer term strategy is to focus on extensibe APIs (with some reasonable default implementation / cost model behind them)

I don't have much time to drive this kind of refactoring for a while (I am pretty tied up with the functions extraction and other things for InfluxData), though I think it would really help DataFusion's long term story for more complex joins

alamb · 2024-03-27T13:34:03Z

datafusion/physical-plan/src/joins/utils.rs

+///
+/// The estimation result is either zero, in cases inputs statistics are non-overlapping
+/// or equal to number of rows for outer input.
+fn estimate_semi_join_cardinality(


Long term it would be really nice to pull these types of calculations into some trait (aka an extensibility API)

alamb · 2024-03-27T13:36:47Z

datafusion/physical-plan/src/joins/utils.rs

-        | JoinType::LeftAnti
-        | JoinType::RightAnti => None,
+        JoinType::LeftSemi | JoinType::LeftAnti => {
+            let cardinality = estimate_semi_join_cardinality(


it doesn't seem correct to me that the same calculation is used for both Semi and Anti joins (shouldn't they be the inverse of each other?)

Indeed, they were not correct. I've changed estimations a bit -- now disjoint statistics affects only semi-joins (filtering outer table should produce zero rows). For anti-joins, disjoint inputs don't seem to make much sense -- if statistics are non-overlapping the result will be equal to outer num_rows side, otherwise (having no info or overlapping statistics) -- it still will be estimated as outer side, since we know nothing about actual distribution besides min/max, and assuming that all rows will be filtered out is too much (may significantly affect further planning)

Dandandan · 2024-03-27T22:24:01Z

Just to aid the discussions on benchmark variance, let's try and run the newer version once more
/benchmark

Dandandan · 2024-03-28T15:06:02Z

/benchmark

gruuya · 2024-03-28T20:37:53Z

/benchmark

I think the SF 10 in-memory bench OOM-ed the run, so we should remove it for now.

alamb

Thank you @korowa -- I think this is a good step forward.

In general, I believe that the cardinality estimation / join reordering is too tightly tied into the core logic at the moment. My experience is that these cost models are very hard to get right and different systems will have divergent needs for their models (e.g. accuracy vs robustness)

What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API? For example

pub trait CardinalityEstimator {
fn estimate_join_cardinality(
    &self, 
    join_type: &JoinType,
    left_stats: Statistics,
    right_stats: Statistics,
    on: &JoinOn) -> Option<PartialJoinStatistics> ;

  fn estimate_filter_selectivity(..);
  ..
}

maybe @ozankabak / @metesynnada have some other ideas

alamb · 2024-03-29T15:01:11Z

What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API? For example

TO be clear this is a suggestion for some other potential future PR, not this one

Dandandan · 2024-03-29T15:03:05Z

/benchmark
Final one

github-actions · 2024-03-29T15:30:46Z

Benchmark results

Benchmarks comparing 179179c (main) and 9bd30ea (PR)

Comparing 179179c and 9bd30ea
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_mem_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_mem_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf10.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf10.json does not exist

gruuya · 2024-03-29T15:41:02Z

Benchmark results

Benchmarks comparing 179179c (main) and 9bd30ea (PR)

Comparing 179179c and 9bd30ea
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_mem_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_mem_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf10.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf10.json does not exist

So what seem to have happened here is that the step benchmarking the PR changes uses it's version of the benchmarks/bench.sh file, and that is different from the one on main now due to this https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/benchmarks/bench.sh#L317

I needed to add a distinction between the results since previously SF1/SF10 would overwrite the same file.

So the resolution is just to rebase on/merge from main and re-run the benches again. 👀

github-actions · 2024-03-29T16:07:22Z

Benchmark results

Benchmarks comparing 179179c (main) and 9bd30ea (PR)

Comparing 179179c and 9bd30ea
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_mem_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_mem_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf10.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf10.json does not exist

korowa · 2024-03-31T18:05:49Z

What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API?

@alamb, I don't have any strong opinion here (probably I'm lacking knowledge of usecases for this), and if I got the idea right -- on one side it might help by adding versatility to DF usage (there is already available, AFAIU, an option to customize physical optimizer, and this API should allow to reuse optimizer with custom cost model), but on the other side, if end goal is an (extensible/customizable) API, providing data required for physical plan optimization, I'm not sure that statistics estimation API will be enough, as there are more attributes affecting phyiscal plan significantly (e.g. partitioning and ordering related attributes), and as a result, to provide all required inputs random external planner needs, we may end up with +- same ExecutionPlan.

Maybe it'll be better to start with internal estimator API (maybe not "API", but just set of functions, like we have now across multiple various utility-files, but better organized), and, for now, provide statistics through operators (as it's working right now) using this utility functional?

* semi/anti join output statistics * fix antijoin cardinality estimation

alamb · 2024-04-02T18:43:54Z

🤔 In my mind the way a cost based optimizer (CBO) typically works is that there are:

A set of heuristics that take things like partitioning/ordering into account to create potential plans (with potentially different join orders)
A cost model that is then used to pick between the potential plans

I was thinking if we could decouple the "make some potential plans" and "what would it cost to run this query" parts, we could let people implement their own cost based optimizer (and we could pull the basic cardinality estimation code into the "build in" cost model)

I don't have time to pursue the idea now, however

semi/anti join output statistics

29f1cca

korowa mentioned this pull request Mar 26, 2024

Run DataFusion benchmarks regularly and track performance history over time #5504

Open

alamb changed the title ~~fix: semi/anti join output statistics~~ Implement semi/anti join output statistics estimation Mar 27, 2024

alamb reviewed Mar 27, 2024

View reviewed changes

alamb mentioned this pull request Mar 27, 2024

DataFusion weekly project plan (Andrew Lamb) - March 25, 2024 #9796

Closed

6 tasks

fix antijoin cardinality estimation

9bd30ea

korowa force-pushed the semijoin_statistics branch from 8cdb65e to 9bd30ea Compare March 28, 2024 19:07

alamb approved these changes Mar 29, 2024

View reviewed changes

gruuya mentioned this pull request Mar 29, 2024

Remove vestigal conbench integration #9855

Merged

alamb mentioned this pull request Mar 29, 2024

Minor: Remove TPCH scale factor 10 from PR benchmark checks #9856

Closed

gruuya mentioned this pull request Mar 29, 2024

Minor: Remove the bench most likely to cause OOM in CI #9858

Merged

Dandandan approved these changes Mar 29, 2024

View reviewed changes

Dandandan merged commit 21fe0b7 into apache:main Mar 30, 2024
24 checks passed

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Apr 1, 2024

Implement semi/anti join output statistics estimation (apache#9800)

7d427a9

* semi/anti join output statistics * fix antijoin cardinality estimation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement semi/anti join output statistics estimation #9800

Implement semi/anti join output statistics estimation #9800

korowa commented Mar 25, 2024 •

edited

Loading

Dandandan commented Mar 26, 2024

github-actions bot commented Mar 26, 2024

korowa commented Mar 26, 2024

Dandandan commented Mar 26, 2024

Dandandan commented Mar 26, 2024 •

edited

Loading

github-actions bot commented Mar 26, 2024

Dandandan commented Mar 26, 2024

korowa commented Mar 26, 2024

korowa commented Mar 26, 2024

alamb left a comment

alamb Mar 27, 2024

alamb Mar 27, 2024

korowa Mar 28, 2024 •

edited

Loading

Dandandan commented Mar 27, 2024

Dandandan commented Mar 28, 2024

gruuya commented Mar 28, 2024

alamb left a comment

alamb commented Mar 29, 2024

Dandandan commented Mar 29, 2024

github-actions bot commented Mar 29, 2024

gruuya commented Mar 29, 2024

Benchmark results

github-actions bot commented Mar 29, 2024

korowa commented Mar 31, 2024

alamb commented Apr 2, 2024 •

edited

Loading

Implement semi/anti join output statistics estimation #9800

Implement semi/anti join output statistics estimation #9800

Conversation

korowa commented Mar 25, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Dandandan commented Mar 26, 2024

github-actions bot commented Mar 26, 2024

Benchmark results

korowa commented Mar 26, 2024

Dandandan commented Mar 26, 2024

Dandandan commented Mar 26, 2024 • edited Loading

github-actions bot commented Mar 26, 2024

Benchmark results

Dandandan commented Mar 26, 2024

korowa commented Mar 26, 2024

korowa commented Mar 26, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb Mar 27, 2024

Choose a reason for hiding this comment

alamb Mar 27, 2024

Choose a reason for hiding this comment

korowa Mar 28, 2024 • edited Loading

Choose a reason for hiding this comment

Dandandan commented Mar 27, 2024

Dandandan commented Mar 28, 2024

gruuya commented Mar 28, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb commented Mar 29, 2024

Dandandan commented Mar 29, 2024

github-actions bot commented Mar 29, 2024

Benchmark results

gruuya commented Mar 29, 2024

Benchmark results

github-actions bot commented Mar 29, 2024

Benchmark results

korowa commented Mar 31, 2024

alamb commented Apr 2, 2024 • edited Loading

korowa commented Mar 25, 2024 •

edited

Loading

Dandandan commented Mar 26, 2024 •

edited

Loading

korowa Mar 28, 2024 •

edited

Loading

alamb commented Apr 2, 2024 •

edited

Loading