Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: choosing worse plans with stats present on TPC-H queries than without stats #46701

Open
6 tasks
yuzefovich opened this issue Mar 28, 2020 · 18 comments
Open
6 tasks
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Mar 28, 2020

After encountering a suboptimal plan for query 6 (as described in #46677), I decided to run all TPC-H queries without automatic stats present and with automatic stats present. All queries ran on a single node cluster on my laptop. I extracted all queries for which the runtimes seem to significantly increase once statistics are collected on tables.

  • Query 5

Without stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=5 --vectorize=off
I200328 01:31:19.495950 24 workload/tpch/tpch.go:465  [q5] returned 5 rows after 6.95 seconds
I200328 01:25:02.100306 24 workload/tpch/tpch.go:465  [q5] returned 5 rows after 6.92 seconds
I200328 01:18:47.022713 24 workload/tpch/tpch.go:465  [q5] returned 5 rows after 6.84 seconds

With stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=5 --vectorize=off
I200328 02:08:52.579355 26 workload/tpch/tpch.go:465  [q5] returned 5 rows after 11.78 seconds
I200328 02:09:04.705098 26 workload/tpch/tpch.go:465  [q5] returned 5 rows after 12.13 seconds
I200328 02:09:17.202039 26 workload/tpch/tpch.go:465  [q5] returned 5 rows after 12.50 seconds
  • Query 6

Without stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=6 --vectorize=off
I200328 01:18:53.018741 24 workload/tpch/tpch.go:465  [q6] returned 1 rows after 6.00 seconds
I200328 01:25:08.091755 24 workload/tpch/tpch.go:465  [q6] returned 1 rows after 5.99 seconds
I200328 01:31:25.576590 24 workload/tpch/tpch.go:465  [q6] returned 1 rows after 6.08 seconds

With stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=6 --vectorize=off
I200328 02:09:56.601256 27 workload/tpch/tpch.go:465  [q6] returned 1 rows after 9.36 seconds
I200328 02:10:06.343776 27 workload/tpch/tpch.go:465  [q6] returned 1 rows after 9.74 seconds
I200328 02:10:16.183090 27 workload/tpch/tpch.go:465  [q6] returned 1 rows after 9.84 seconds
  • Query 11

Without stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=11 --vectorize=off
I200328 01:34:57.235773 24 workload/tpch/tpch.go:465  [q11] returned 1048 rows after 1.19 seconds
I200328 01:28:39.006315 24 workload/tpch/tpch.go:465  [q11] returned 1048 rows after 1.16 seconds
I200328 01:22:21.042526 24 workload/tpch/tpch.go:465  [q11] returned 1048 rows after 1.17 seconds

With stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=11 --vectorize=off
I200328 02:10:59.227409 52 workload/tpch/tpch.go:465  [q11] returned 1048 rows after 1.74 seconds
I200328 02:11:00.848808 52 workload/tpch/tpch.go:465  [q11] returned 1048 rows after 1.62 seconds
I200328 02:11:02.515854 52 workload/tpch/tpch.go:465  [q11] returned 1048 rows after 1.67 seconds
  • Query 12

Without stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=12 --vectorize=off
I200328 01:22:28.445019 24 workload/tpch/tpch.go:465  [q12] returned 2 rows after 7.40 seconds
I200328 01:35:04.550451 24 workload/tpch/tpch.go:465  [q12] returned 2 rows after 7.31 seconds
I200328 01:28:46.288655 24 workload/tpch/tpch.go:465  [q12] returned 2 rows after 7.28 seconds

With stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=12 --vectorize=off
I200328 02:13:54.828276 39 workload/tpch/tpch.go:465  [q12] returned 2 rows after 10.76 seconds
I200328 02:14:05.913452 39 workload/tpch/tpch.go:465  [q12] returned 2 rows after 11.09 seconds
I200328 02:14:17.418031 39 workload/tpch/tpch.go:465  [q12] returned 2 rows after 11.50 seconds
  • Query 18

Without stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=18 --vectorize=off
I200328 01:35:47.223437 24 workload/tpch/tpch.go:465  [q18] returned 57 rows after 8.63 seconds
I200328 01:23:10.517948 24 workload/tpch/tpch.go:465  [q18] returned 57 rows after 8.58 seconds
I200328 01:29:28.141919 24 workload/tpch/tpch.go:465  [q18] returned 57 rows after 8.50 seconds

With stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=18 --vectorize=off
I200328 02:15:00.433611 44 workload/tpch/tpch.go:465  [q18] returned 57 rows after 10.46 seconds
I200328 02:15:10.924977 44 workload/tpch/tpch.go:465  [q18] returned 57 rows after 10.49 seconds
I200328 02:15:21.399717 44 workload/tpch/tpch.go:465  [q18] returned 57 rows after 10.47 seconds
  • Query 20 (note this was with vectorize=on)

Without stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20 --vectorize=on
I200328 02:02:54.919533 24 workload/tpch/tpch.go:465  [q20] returned 186 rows after 7.50 seconds
I200328 02:03:02.351887 24 workload/tpch/tpch.go:465  [q20] returned 186 rows after 7.43 seconds
I200328 02:03:09.847841 24 workload/tpch/tpch.go:465  [q20] returned 186 rows after 7.50 seconds

With stats:

Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20 --vectorize=on
I200328 02:02:03.829129 40 workload/tpch/tpch.go:465  [q20] returned 186 rows after 13.21 seconds
I200328 02:02:16.951691 40 workload/tpch/tpch.go:465  [q20] returned 186 rows after 13.12 seconds
I200328 02:02:30.060743 40 workload/tpch/tpch.go:465  [q20] returned 186 rows after 13.11 seconds

Jira issue: CRDB-5065

@yuzefovich yuzefovich added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 28, 2020
@yuzefovich yuzefovich added this to Triage in BACKLOG, NO NEW ISSUES: SQL Optimizer via automation Mar 28, 2020
@RaduBerinde RaduBerinde moved this from Triage to Planned for 20.2 in BACKLOG, NO NEW ISSUES: SQL Optimizer Mar 30, 2020
@andy-kimball
Copy link
Contributor

Q18: The sum(l_quantity) > 300 clause is estimated by the optimizer to return 500,000 rows, when in reality it returns 57 rows. This causes the optimizer to use merge + hash joins rather than lookup joins, as in the no-stats case. I don't see a good way to do a better estimate of the selectivity of that clause, and I actually think the optimizer is making a reasonable choice here. Instead, we should try to make lookup-join cheaper to execute. If it was about 2x cheaper, I think we'd select it rather than hash/merge (based on reducing row estimate to 250,000 and having plan switch to use lookup joins).

@andy-kimball
Copy link
Contributor

andy-kimball commented Apr 15, 2020

Q5: This is due to our reorder_joins_limit=4 default setting. If that is increased to 6, then we pick the best plan. This is a problem of our join ordering analysis not being good enough yet, rather than a stats problem (no-stats just gets lucky).

@andy-kimball
Copy link
Contributor

Q6: The stats are quite accurate on this one (estimate=938K rows, actual=909K rows). This means it's the coster that's at fault. In particular, it appears that vectorize=on has made scanning the primary index much faster than it was. Since the coster was tuned for vectorize=off, it's returning the wrong cost.

@yuzefovich
Copy link
Member Author

Note that all queries (except for Q20) were run with vectorize=off. (Q20 was run with vectorize=on because it was hitting memory budget error with vectorize=off at that time; this has been fixed, but I haven't updated the issue.)

@andy-kimball
Copy link
Contributor

Q11: Again, the stats are quite accurate on this one (estimate=32,258 rows, actual=31,680 rows). This is another case where the primary index scan is now faster, but the coster was not updated to reflect that.

@yuzefovich
Copy link
Member Author

yuzefovich commented Apr 15, 2020

Actually Radu already took a deeper look at Q6 in #46677.

@andy-kimball
Copy link
Contributor

OK, then it's a problem with the coster, even without vectorized=off. Sounds like Radu's working theory is that it's due to mis-costing on tables with many columns, which seems plausible to me.

@andy-kimball
Copy link
Contributor

Both Q6 & Q11 are the same issue: using a secondary index + index-join vs. using a primary index.

@andy-kimball
Copy link
Contributor

Q12: Looks like similar issue to Q6 and Q11 - mis-costing of secondary index + index-join vs. primary index.

@andy-kimball
Copy link
Contributor

Q20: Same as Q12, Q6, and Q11.

@andy-kimball
Copy link
Contributor

andy-kimball commented Apr 15, 2020

In summary:
Q6, Q11, Q12, Q20: mis-costing of secondary index + index-join vs. primary index - hopefully #46677 will help here
Q5: join-reorder-limit needs to be at least 6 - hopefully opt team can invest in join ordering performance improvements, so that we can increase this safely
Q18: if we can speed up lookup join by 2x, it may fix this one; otherwise, I don't see a good way to fix it

@rytaft
Copy link
Collaborator

rytaft commented Apr 15, 2020

Thanks for doing this analysis, @andy-kimball!

@RaduBerinde RaduBerinde moved this from Planned for 20.2 to Triage in BACKLOG, NO NEW ISSUES: SQL Optimizer Apr 18, 2020
@RaduBerinde RaduBerinde moved this from Triage to Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Apr 18, 2020
@rytaft
Copy link
Collaborator

rytaft commented Aug 7, 2020

Has the recent TPC-H testing changed the status of this issue?

@yuzefovich
Copy link
Member Author

I no longer see "no stats" case being faster on Q5, Q11, Q18, and Q20.

Q12

  • no stats
Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=12
I200810 19:27:51.554477 1 workload/cli/run.go:356  creating load generator...
I200810 19:27:51.554971 1 workload/cli/run.go:387  creating load generator... done (took 497µs)
I200810 19:27:58.836772 31 workload/tpch/tpch.go:479  [q12] returned 2 rows after 7.28 seconds
I200810 19:28:06.004504 31 workload/tpch/tpch.go:479  [q12] returned 2 rows after 7.17 seconds
I200810 19:28:13.434918 31 workload/tpch/tpch.go:479  [q12] returned 2 rows after 7.43 seconds
  • with stats
Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=12
I200810 19:30:52.136604 1 workload/cli/run.go:356  creating load generator...
I200810 19:30:52.136831 1 workload/cli/run.go:387  creating load generator... done (took 228µs)
I200810 19:31:03.119541 12 workload/tpch/tpch.go:479  [q12] returned 2 rows after 10.98 seconds
I200810 19:31:13.731810 12 workload/tpch/tpch.go:479  [q12] returned 2 rows after 10.61 seconds
I200810 19:31:24.540449 12 workload/tpch/tpch.go:479  [q12] returned 2 rows after 10.81 seconds

We also seem to have regressed on query 20 (in absolute terms, not in "no stats" vs "with stats" comparison).

  • no stats
Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20
I200810 19:28:47.987395 1 workload/cli/run.go:356  creating load generator...
I200810 19:28:47.987735 1 workload/cli/run.go:387  creating load generator... done (took 342µs)
I200810 19:29:05.335575 26 workload/tpch/tpch.go:479  [q20] returned 186 rows after 17.35 seconds
I200810 19:29:20.484648 26 workload/tpch/tpch.go:479  [q20] returned 186 rows after 15.15 seconds
I200810 19:29:35.486445 26 workload/tpch/tpch.go:479  [q20] returned 186 rows after 15.00 seconds
  • with stats
Yahors-MacBook-Pro:cockroach yuzefovich$ ./bin/workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20
I200810 19:32:24.109137 1 workload/cli/run.go:356  creating load generator...
I200810 19:32:24.109614 1 workload/cli/run.go:387  creating load generator... done (took 481µs)
I200810 19:32:39.805945 82 workload/tpch/tpch.go:479  [q20] returned 186 rows after 15.70 seconds
I200810 19:32:56.295731 82 workload/tpch/tpch.go:479  [q20] returned 186 rows after 16.49 seconds
I200810 19:33:11.850279 82 workload/tpch/tpch.go:479  [q20] returned 186 rows after 15.55 seconds

@yuzefovich
Copy link
Member Author

I looked over a recent run of tpchvec/perf and tpchvec/perf_no_stats , and there appears to be also problems on Q7:

  • no stats [4.9 4.97 4.98]
  • with stats [7.21 7.4 7.41]

@andy-kimball
Copy link
Contributor

+@DrewKimball, who can provide insight into what's causing these issues. Some of these TPC-H queries are deliberately built to fool standard stats algorithms.

@DrewKimball
Copy link
Collaborator

DrewKimball commented Aug 20, 2020

Poor plan-choosing for Q7 is tracked in #52159. It looks to be due to mis-costing, since stats are actually pretty accurate.

@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@DrewKimball DrewKimball removed this from Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Sep 14, 2023
@RaduBerinde RaduBerinde removed their assignment Sep 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Status: Cold Storage
Development

No branches or pull requests

6 participants