Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement semi/anti join output statistics estimation #9800

Merged
merged 2 commits into from
Mar 30, 2024

Conversation

korowa
Copy link
Contributor

@korowa korowa commented Mar 25, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

While #9676 found that NLJoin input reordering in benchmarks doesn't happen if one of inputs is AntiJoin -- the reason is that output statistics for all Semi/Anti join types is always set as Absent. This PR propagates "outer" input statistics to Anti/Semi join output.

What changes are included in this PR?

  • "outer" input statistics for Anti/Semi join cases is now propagated as output statistics (worst case assumption -- filtering not occurs at all, an alternative might be either to use default_filter_statistics or use similar to regular joins algorithm based on estimated distinct values, which though could be misleading as it assumes that "all values from smaller side are present in larger side" -- not sure if it's fine for filter estimation)
  • the only exception is disjoint inputs for semi-joins case -- if inputs statistics values are non-overlapping, the output cardinality will be estimated as zero
  • avoid early returns on absent statistics value while non-overlapping statistics checks

Are these changes tested?

Added test coverage for SemiJoin cardinality estimation

Are there any user-facing changes?

Execution plans containing Semi/Anti joins might change

@Dandandan
Copy link
Contributor

/benchmark

Copy link

Benchmark results

Benchmarks comparing 349c586 (main) and 29f1cca (PR)
Comparing 349c586 and 29f1cca
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  349c586 ┃  29f1cca ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 442.56ms │ 447.23ms │    no change │
│ QQuery 2     │  57.74ms │  58.90ms │    no change │
│ QQuery 3     │ 145.56ms │ 149.88ms │    no change │
│ QQuery 4     │  89.29ms │ 121.83ms │ 1.36x slower │
│ QQuery 5     │ 201.46ms │ 247.68ms │ 1.23x slower │
│ QQuery 6     │ 105.37ms │ 110.09ms │    no change │
│ QQuery 7     │ 284.15ms │ 291.72ms │    no change │
│ QQuery 8     │ 198.73ms │ 202.66ms │    no change │
│ QQuery 9     │ 310.13ms │ 313.32ms │    no change │
│ QQuery 10    │ 234.28ms │ 251.45ms │ 1.07x slower │
│ QQuery 11    │  62.78ms │  64.24ms │    no change │
│ QQuery 12    │ 129.96ms │ 130.64ms │    no change │
│ QQuery 13    │ 177.83ms │ 187.69ms │ 1.06x slower │
│ QQuery 14    │ 128.79ms │ 135.48ms │ 1.05x slower │
│ QQuery 15    │ 192.30ms │ 197.11ms │    no change │
│ QQuery 16    │  52.21ms │  51.71ms │    no change │
│ QQuery 17    │ 313.30ms │ 328.26ms │    no change │
│ QQuery 18    │ 455.63ms │ 469.28ms │    no change │
│ QQuery 19    │ 236.01ms │ 237.34ms │    no change │
│ QQuery 20    │ 194.17ms │ 202.15ms │    no change │
│ QQuery 21    │ 335.41ms │ 334.95ms │    no change │
│ QQuery 22    │  54.90ms │  55.99ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (349c586)   │ 4402.55ms │
│ Total Time (29f1cca)   │ 4589.59ms │
│ Average Time (349c586) │  200.12ms │
│ Average Time (29f1cca) │  208.62ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         5 │
│ Queries with No Change │        17 │
└────────────────────────┴───────────┘

@korowa
Copy link
Contributor Author

korowa commented Mar 26, 2024

Checked on the fork (1 and 2) for the same base/PR commits -- the results don't seem to be reliable

@Dandandan
Copy link
Contributor

/benchmark

@Dandandan
Copy link
Contributor

Dandandan commented Mar 26, 2024

Checked on the fork (1 and 2) for the same base/PR commits -- the results don't seem to be reliable

Yes, it seems not too reliable at the moment unfortunately. Based on that, seems there is no significant difference.
I started another run for a last check.

Copy link

Benchmark results

Benchmarks comparing 39f4aaf (main) and 29f1cca (PR)
Comparing 39f4aaf and 29f1cca
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  39f4aaf ┃  29f1cca ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 446.24ms │ 505.53ms │ 1.13x slower │
│ QQuery 2     │  58.17ms │  77.06ms │ 1.32x slower │
│ QQuery 3     │ 148.50ms │ 151.10ms │    no change │
│ QQuery 4     │  89.58ms │  90.13ms │    no change │
│ QQuery 5     │ 204.90ms │ 209.82ms │    no change │
│ QQuery 6     │ 110.61ms │ 109.62ms │    no change │
│ QQuery 7     │ 281.84ms │ 293.58ms │    no change │
│ QQuery 8     │ 194.27ms │ 203.05ms │    no change │
│ QQuery 9     │ 319.48ms │ 312.89ms │    no change │
│ QQuery 10    │ 249.44ms │ 245.80ms │    no change │
│ QQuery 11    │  62.59ms │  63.01ms │    no change │
│ QQuery 12    │ 127.26ms │ 127.85ms │    no change │
│ QQuery 13    │ 181.71ms │ 191.56ms │ 1.05x slower │
│ QQuery 14    │ 132.04ms │ 132.48ms │    no change │
│ QQuery 15    │ 212.20ms │ 209.90ms │    no change │
│ QQuery 16    │  51.83ms │  51.61ms │    no change │
│ QQuery 17    │ 326.16ms │ 350.28ms │ 1.07x slower │
│ QQuery 18    │ 451.92ms │ 482.47ms │ 1.07x slower │
│ QQuery 19    │ 236.19ms │ 237.95ms │    no change │
│ QQuery 20    │ 191.50ms │ 199.72ms │    no change │
│ QQuery 21    │ 326.33ms │ 329.98ms │    no change │
│ QQuery 22    │  54.79ms │  53.14ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (39f4aaf)   │ 4457.57ms │
│ Total Time (29f1cca)   │ 4628.56ms │
│ Average Time (39f4aaf) │  202.62ms │
│ Average Time (29f1cca) │  210.39ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         5 │
│ Queries with No Change │        17 │
└────────────────────────┴───────────┘

@Dandandan
Copy link
Contributor

Queries running slower seem inconsistent again :)

@korowa
Copy link
Contributor Author

korowa commented Mar 26, 2024

Probably this is the case described here as a concern regarding GH runners.

@korowa
Copy link
Contributor Author

korowa commented Mar 26, 2024

On the other side, as far as I understand, at this moment this benchmarking is aimed to detect larger scale performance regressions like x3+ (while default benchmark report has 5% diff threshold as "no change")

@alamb alamb changed the title fix: semi/anti join output statistics Implement semi/anti join output statistics estimation Mar 27, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @korowa -- other than the anti-join calculation this PR looks good to me.

In response to @Dandandan 's comments about potential performance degredation: In general I think our join ordering optimization / statistics are fairly simplistic

Rather than trying to make the built on models for datafusion more and more sophisticated, I think a better longer term strategy is to focus on extensibe APIs (with some reasonable default implementation / cost model behind them)

I don't have much time to drive this kind of refactoring for a while (I am pretty tied up with the functions extraction and other things for InfluxData), though I think it would really help DataFusion's long term story for more complex joins

///
/// The estimation result is either zero, in cases inputs statistics are non-overlapping
/// or equal to number of rows for outer input.
fn estimate_semi_join_cardinality(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term it would be really nice to pull these types of calculations into some trait (aka an extensibility API)

| JoinType::LeftAnti
| JoinType::RightAnti => None,
JoinType::LeftSemi | JoinType::LeftAnti => {
let cardinality = estimate_semi_join_cardinality(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't seem correct to me that the same calculation is used for both Semi and Anti joins (shouldn't they be the inverse of each other?)

Copy link
Contributor Author

@korowa korowa Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, they were not correct. I've changed estimations a bit -- now disjoint statistics affects only semi-joins (filtering outer table should produce zero rows). For anti-joins, disjoint inputs don't seem to make much sense -- if statistics are non-overlapping the result will be equal to outer num_rows side, otherwise (having no info or overlapping statistics) -- it still will be estimated as outer side, since we know nothing about actual distribution besides min/max, and assuming that all rows will be filtered out is too much (may significantly affect further planning)

@Dandandan
Copy link
Contributor

Just to aid the discussions on benchmark variance, let's try and run the newer version once more
/benchmark

@Dandandan
Copy link
Contributor

/benchmark

@gruuya
Copy link
Contributor

gruuya commented Mar 28, 2024

/benchmark

I think the SF 10 in-memory bench OOM-ed the run, so we should remove it for now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @korowa -- I think this is a good step forward.

In general, I believe that the cardinality estimation / join reordering is too tightly tied into the core logic at the moment. My experience is that these cost models are very hard to get right and different systems will have divergent needs for their models (e.g. accuracy vs robustness)

What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API? For example

pub trait CardinalityEstimator {
fn estimate_join_cardinality(
    &self, 
    join_type: &JoinType,
    left_stats: Statistics,
    right_stats: Statistics,
    on: &JoinOn) -> Option<PartialJoinStatistics> ;

  fn estimate_filter_selectivity(..);
  ..
}

maybe @ozankabak / @metesynnada have some other ideas

@alamb
Copy link
Contributor

alamb commented Mar 29, 2024

What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API? For example

TO be clear this is a suggestion for some other potential future PR, not this one

@Dandandan
Copy link
Contributor

/benchmark
Final one

Copy link

Benchmark results

Benchmarks comparing 179179c (main) and 9bd30ea (PR)
Comparing 179179c and 9bd30ea
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_mem_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_mem_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf10.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf10.json does not exist

@gruuya
Copy link
Contributor

gruuya commented Mar 29, 2024

Benchmark results

Benchmarks comparing 179179c (main) and 9bd30ea (PR)

Comparing 179179c and 9bd30ea
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_mem_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_mem_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf10.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf10.json does not exist

So what seem to have happened here is that the step benchmarking the PR changes uses it's version of the benchmarks/bench.sh file, and that is different from the one on main now due to this https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/benchmarks/bench.sh#L317

I needed to add a distinction between the results since previously SF1/SF10 would overwrite the same file.

So the resolution is just to rebase on/merge from main and re-run the benches again. 👀

Copy link

Benchmark results

Benchmarks comparing 179179c (main) and 9bd30ea (PR)
Comparing 179179c and 9bd30ea
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_mem_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_mem_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf1.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf1.json does not exist
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/179179c/tpch_sf10.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/9bd30ea/tpch_sf10.json does not exist

@Dandandan Dandandan merged commit 21fe0b7 into apache:main Mar 30, 2024
24 checks passed
@korowa
Copy link
Contributor Author

korowa commented Mar 31, 2024

What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API?

@alamb, I don't have any strong opinion here (probably I'm lacking knowledge of usecases for this), and if I got the idea right -- on one side it might help by adding versatility to DF usage (there is already available, AFAIU, an option to customize physical optimizer, and this API should allow to reuse optimizer with custom cost model), but on the other side, if end goal is an (extensible/customizable) API, providing data required for physical plan optimization, I'm not sure that statistics estimation API will be enough, as there are more attributes affecting phyiscal plan significantly (e.g. partitioning and ordering related attributes), and as a result, to provide all required inputs random external planner needs, we may end up with +- same ExecutionPlan.

Maybe it'll be better to start with internal estimator API (maybe not "API", but just set of functions, like we have now across multiple various utility-files, but better organized), and, for now, provide statistics through operators (as it's working right now) using this utility functional?

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Apr 1, 2024
* semi/anti join output statistics

* fix antijoin cardinality estimation
@alamb
Copy link
Contributor

alamb commented Apr 2, 2024

🤔 In my mind the way a cost based optimizer (CBO) typically works is that there are:

  1. A set of heuristics that take things like partitioning/ordering into account to create potential plans (with potentially different join orders)
  2. A cost model that is then used to pick between the potential plans

I was thinking if we could decouple the "make some potential plans" and "what would it cost to run this query" parts, we could let people implement their own cost based optimizer (and we could pull the basic cardinality estimation code into the "build in" cost model)

I don't have time to pursue the idea now, however

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants