opt: reoptimize main query with scalar subquery results #51820

xinyuliu12 · 2020-07-23T04:16:27Z

Hello everyone!

First off, I am sorry for gestures vaguely all of this. Second, I think we have found a performance bug caused by suboptimal plan selection when evaluating a predicate.

Here is a pair of TPC-H queries that exhibit this behavior. They both ask the database system to project a column in the orders table only if the predicate on the nation table is true.

SELECT o_orderpriority
FROM   orders
WHERE  (SELECT n_comment
        FROM   nation
        ORDER  BY n_comment
        LIMIT  1) IS NOT DISTINCT FROM 'random string';

SELECT o_orderpriority
FROM   orders
LEFT OUTER JOIN
       (SELECT *
        FROM   nation
        ORDER  BY n_comment
        LIMIT  1) AS t ON TRUE
WHERE  t.n_comment IS NOT DISTINCT FROM 'random string';

Actual Behavior
The first query takes ~8.3 seconds on v20.1.3, while the second query only takes ~44 milliseconds.
For the first query, the database system scans the large orders table (7,500,000 rows), even though the predicate on the nation table (only 25 rows) can be evaluated within a few milliseconds. Its "EXPLAIN ANALYZE" result is here.
For the second query, the database system evaluates the predicate on the nation table first, which avoids the unnecessary scan of the orders table if the predicate turns out to be false. Its "EXPLAIN ANALYZE" result is here.

Expected Behavior
I would have expected the database system to run these two queries with similar execution times, given that they have the same semantics. To this end, the optimizer should generate a better execution plan for the first query, such as the plan generated for the second query.
Here are the steps for reproducing our observations:

Test environment

Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
CockroachDB v20.1.3
Database: TPC-H benchmark (with scale factor 5)

Reproduce Bug

Install CockroachDB v20.1.3

    $ wget -qO- https://binaries.cockroachdb.com/cockroach-v20.1.3.src.tgz | tar xvz
    $ cd cockroach-v20.1.3
    $ make build
    $ sudo make install

Start a CockroachDB node in your preferred working directory

    $ cockroach start --insecure --store=node1 --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259 --background
    $ cockroach init --insecure --host=localhost:26257

Set up TPC-H test benchmark (if you already have a TPC-H benchmark set up, you can skip step 2; starting from step 2, all commands should be run in the directory ./tpch5 after downloading and extracting it)

Download TPC-H (scale factor of 5) and extract it
Download the dataset from the link: https://drive.google.com/file/d/1ZHldLW2iPfpsRh6lL-slcUoKazfai4P7/view?usp=sharing

    $ tar xzvf tpch5.tar.gz
    $ cd tpch5
    $ chmod +x setup.sh

Create DB and TPC-H schema

    $ cockroach sql --insecure --host=localhost:26257

In CockroachDB's built-in SQL client, create the database:

    > create database tpch5;
    > exit

Create TPC-H Schema

    $ cockroach sql --insecure --host=localhost:26257 -d tpch5 < dss.ddl

Import benchmark

    $ ./setup.sh

Test SQL query that exhibits performance issue

Execute the queries

    $ cockroach sql --insecure --host=localhost:26257 -d tpch5 < query.sql

Jira issue: CRDB-4007

Epic CRDB-1491

RaduBerinde · 2020-07-24T19:20:04Z

Hi! Thank you for the detailed information.

This is a pretty strange query - we either return the entire table or nothing. I think the fact that the second plan runs faster is to some extent an accident - the join process exits early when it sees that one side has no rows. I guess we could add some extra checking in the scan if the filter turns out to be constant and false. This is part of a greater problem that we currently can't reoptimize parts of the query with the result from a subquery (#23264).

Is there a reason you can't check the condition on the client side instead?
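A minimal sketch of what checking the condition on the client side could look like. It uses Python's built-in `sqlite3` purely as a stand-in for a real CockroachDB connection, and the helper name `fetch_order_priorities` and the sample rows are assumptions made for illustration. The cheap lookup on nation runs first, and the orders scan is issued only when the comparison holds.

```python
import sqlite3


def fetch_order_priorities(conn, expected_comment):
    """Run the cheap nation lookup first; scan orders only if it matches."""
    row = conn.execute(
        "SELECT n_comment FROM nation ORDER BY n_comment LIMIT 1"
    ).fetchone()
    # An empty nation table behaves like a NULL scalar subquery result.
    first_comment = row[0] if row is not None else None
    # Comparing str/None with != mirrors IS NOT DISTINCT FROM for this case:
    # None vs. a string is "distinct", so no rows are returned.
    if first_comment != expected_comment:
        return []
    return [r[0] for r in conn.execute("SELECT o_orderpriority FROM orders")]


# Tiny in-memory stand-in for the TPC-H tables (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (n_comment TEXT);
    CREATE TABLE orders (o_orderpriority TEXT);
    INSERT INTO nation VALUES ('b comment'), ('a comment');
    INSERT INTO orders VALUES ('1-URGENT'), ('5-LOW');
    """
)
```

With this shape the client pays for the orders scan only when the predicate is true, at the cost of one extra round trip.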

You could also write it this way, which I think is clearer, and it should work as expected:

SELECT
  o_orderpriority
FROM
  orders
LIMIT
  CASE
  WHEN (SELECT n_comment FROM nation ORDER BY n_comment LIMIT 1) IS NOT DISTINCT FROM 'random string'
  THEN 1000000000
  ELSE 0
  END

xinyuliu12 · 2020-08-05T03:00:11Z

Thanks for your detailed reply!
It would be great if you could comment on the challenges associated with optimizing these queries wherein there is a dependency on the sub-query.

RaduBerinde · 2020-08-05T03:42:21Z

We optimize the entire expression tree (which includes any subqueries) once, and then we run the subqueries, followed by the "main" query. We don't go back and re-optimize the main query with the actual results of the subquery (which in this case would either result in removing the scan operator altogether, or removing the filter). This is mainly because of architectural reasons that were carried over from previous iterations of our system. It's something we will fix at some point.
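The difference described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Plan`, `plan_once`, `reoptimize`, and `execute` are hypothetical names invented here and do not correspond to CockroachDB's optimizer code. It shows how folding the subquery result back into the plan would either drop the scan or drop the filter.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Plan:
    scan_table: Optional[str]    # None means the scan was removed entirely
    filter_on_subquery: bool     # True while a filter still depends on the subquery


def plan_once() -> Plan:
    """Plan built before the subquery result is known: scan, then filter."""
    return Plan(scan_table="orders", filter_on_subquery=True)


def reoptimize(plan: Plan, predicate_is_true: bool) -> Plan:
    """Fold the now-constant predicate into the plan."""
    if not plan.filter_on_subquery:
        return plan
    if predicate_is_true:
        # Filter is constant true: keep the scan, remove the filter.
        return Plan(scan_table=plan.scan_table, filter_on_subquery=False)
    # Filter is constant false: remove the scan altogether.
    return Plan(scan_table=None, filter_on_subquery=False)


def execute(
    plan: Plan, tables: Dict[str, List[str]], predicate_is_true: bool
) -> Tuple[List[str], int]:
    """Return (result rows, number of rows scanned)."""
    if plan.scan_table is None:
        return [], 0
    rows = tables[plan.scan_table]
    scanned = len(rows)
    if plan.filter_on_subquery and not predicate_is_true:
        return [], scanned  # every row was read, then discarded
    return list(rows), scanned


tables = {"orders": ["1-URGENT", "5-LOW", "3-MEDIUM"]}
static_plan = plan_once()
```

In this model the statically planned query reads all three rows even when the predicate is false, while the re-optimized plan reads none.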

xinyuliu12 · 2020-08-05T23:49:55Z

Thank you so much for your explanation!

RaduBerinde · 2020-08-31T16:56:09Z

See #53653 for another case.

mgartner · 2022-06-30T12:54:29Z

See #82503 (comment) for another case.

Subqueries that are in equality expressions with a variable are now hoisted. When these expressions exist in a filter, hoisting the subquery can allow the main query to plan a lookup join, rather than an inefficient full-table scan. For example, consider the table and query:

    CREATE TABLE t (
      a INT,
      INDEX (a)
    );

    SELECT * FROM t WHERE a = (SELECT max(a) FROM t);

Prior to this commit, the query plan for this query required a full table scan:

    select
     ├── columns: a:1
     ├── scan t@t_a_idx
     │    ├── columns: a:1
     │    └── constraint: /1/2: (/NULL - ]
     └── filters
          └── eq
               ├── a:1
               └── subquery
                    └── scalar-group-by
                         ├── columns: max:9
                         ├── scan t@t_a_idx,rev
                         │    ├── columns: a:5
                         │    ├── constraint: /5/6: (/NULL - ]
                         │    └── limit: 1(rev)
                         └── aggregations
                              └── const-agg [as=max:9, outer=(5)]
                                   └── a:5

By hoisting the subquery, the full table scan is replaced with a lookup join:

    project
     ├── columns: a:1
     └── inner-join (lookup t@t_a_idx)
          ├── columns: a:1 max:9
          ├── key columns: [9] = [1]
          ├── scalar-group-by
          │    ├── columns: max:9
          │    ├── scan t@t_a_idx,rev
          │    │    ├── columns: a:5
          │    │    ├── constraint: /5/6: (/NULL - ]
          │    │    └── limit: 1(rev)
          │    └── aggregations
          │         └── const-agg [as=max:9, outer=(5)]
          │              └── a:5
          └── filters (true)

This hoisting is enabled by default, but can be disabled by setting the `optimizer_hoist_uncorrelated_equality_subqueries` session setting to `false`.

Fixes cockroachdb#83392
Informs cockroachdb#51820
Informs cockroachdb#93829
Informs cockroachdb#100855

Release note (performance improvement): Queries that have subqueries in equality expressions are now more efficiently planned by the optimizer.
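As a rough illustration of why hoisting preserves results, the sketch below compares the subquery form with a hand-written join form of the same query. It runs against Python's built-in `sqlite3` as a stand-in, so it says nothing about CockroachDB's planner or its lookup join; the join rewrite, the index name, and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t (a INT);
    CREATE INDEX t_a_idx ON t (a);
    INSERT INTO t VALUES (1), (3), (3), (NULL);
    """
)

# Original form: the scalar subquery sits inside the equality filter.
subquery_form = "SELECT a FROM t WHERE a = (SELECT max(a) FROM t)"

# Hand-written "hoisted" form: the aggregate becomes a one-row join input,
# and the equality becomes the join condition.
hoisted_form = """
    SELECT t.a
    FROM (SELECT max(a) AS m FROM t) AS s
    JOIN t ON t.a = s.m
"""

subquery_rows = sorted(conn.execute(subquery_form).fetchall())
hoisted_rows = sorted(conn.execute(hoisted_form).fetchall())
```

The rewrite is safe here because the aggregate subquery always yields exactly one row, so the join cannot duplicate rows from `t`.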

Subqueries that are in equality expressions with a variable are now hoisted. When these expressions exist in a filter, hoisting the subquery can allow the main query to plan a lookup join, rather than an inefficient full-table scan. For example, consider the table and query:

    CREATE TABLE t (
      a INT,
      INDEX (a)
    );

    SELECT * FROM t WHERE a = (SELECT max(a) FROM t);

Prior to this commit, the query plan for this query required a full table scan:

    select
     ├── columns: a:1
     ├── scan t@t_a_idx
     │    ├── columns: a:1
     │    └── constraint: /1/2: (/NULL - ]
     └── filters
          └── eq
               ├── a:1
               └── subquery
                    └── scalar-group-by
                         ├── columns: max:9
                         ├── scan t@t_a_idx,rev
                         │    ├── columns: a:5
                         │    ├── constraint: /5/6: (/NULL - ]
                         │    └── limit: 1(rev)
                         └── aggregations
                              └── const-agg [as=max:9, outer=(5)]
                                   └── a:5

By hoisting the subquery, the full table scan is replaced with a lookup join:

    project
     ├── columns: a:1
     └── inner-join (lookup t@t_a_idx)
          ├── columns: a:1 max:9
          ├── key columns: [9] = [1]
          ├── scalar-group-by
          │    ├── columns: max:9
          │    ├── scan t@t_a_idx,rev
          │    │    ├── columns: a:5
          │    │    ├── constraint: /5/6: (/NULL - ]
          │    │    └── limit: 1(rev)
          │    └── aggregations
          │         └── const-agg [as=max:9, outer=(5)]
          │              └── a:5
          └── filters (true)

This hoisting is disabled by default, but can be enabled by setting the `optimizer_hoist_uncorrelated_equality_subqueries` session setting to `true`.

Fixes cockroachdb#83392
Informs cockroachdb#51820
Informs cockroachdb#93829
Informs cockroachdb#100855

Release note (performance improvement): Queries that have subqueries in equality expressions are now more efficiently planned by the optimizer when `optimizer_hoist_uncorrelated_equality_subqueries` is set to `true`.

blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jul 23, 2020

blathers-crl bot added this to Triage in Disaster Recovery Backlog Jul 23, 2020

blathers-crl bot added this to Triage in SQL Features (Deprecated - use SQL Experience board) Jul 23, 2020

dt added this to Triage in BACKLOG, NO NEW ISSUES: SQL Optimizer via automation Jul 23, 2020

dt removed this from Triage in Disaster Recovery Backlog Jul 23, 2020

dt removed this from Triage in SQL Features (Deprecated - use SQL Experience board) Jul 23, 2020

RaduBerinde added the C-investigation Further steps needed to qualify. C-label will change. label Jul 24, 2020

cockroachdb deleted a comment from blathers-crl bot Jul 24, 2020

awoods187 added the A-apollo label Aug 5, 2020

RaduBerinde mentioned this issue Aug 5, 2020

Accelerate queries using eager predicate evaluation #51361

Closed

RaduBerinde mentioned this issue Aug 31, 2020

OPT: Query planner chooses a suboptimal plan for case statement evaluation #53653

Closed

RaduBerinde changed the title ~~OPT: Query planner chooses a suboptimal plan for predicate evaluation~~ opt: reoptimize main query with scalar subquery results Aug 31, 2020

RaduBerinde moved this from Triage to Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Aug 31, 2020

jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021

RaduBerinde mentioned this issue Jan 25, 2022

Push LIMIT into scan of outer table in uncorrelated subquery #75301

Closed

mgartner mentioned this issue Jun 30, 2022

opt: fold volatile functions in uncorrelated subqueries #83227

Open

mgartner mentioned this issue Jun 30, 2022

[optimizer] query optimizer fail to optimize gen_random_uuid() #82503

Closed

andy-kimball mentioned this issue Jul 6, 2022

opt: hoist uncorrelated scalar subquery with equality #83392

Closed

mgartner mentioned this issue Jan 6, 2023

sql: investigate using apply joins to execute subqueries #93829

Open

This was referenced Apr 6, 2023

opt: hoist uncorrelated equality subqueries #100881

Merged

opt: explore lookup join with IS NOT DISTINCT FROM equality #100855

Open

blathers-crl bot mentioned this issue Apr 17, 2023

release-23.1: opt: hoist uncorrelated equality subqueries #101686

Closed

mgartner mentioned this issue Apr 18, 2023

release-23.1.0: opt: hoist uncorrelated equality subqueries #101738

Closed

mgartner mentioned this issue Apr 18, 2023

release-22.2: opt: hoist uncorrelated equality subqueries #101744

Merged

mgartner mentioned this issue Apr 18, 2023

release-23.1.0: opt: hoist uncorrelated equality subqueries #101751

Closed

mgartner mentioned this issue Apr 18, 2023

release-23.1: opt: hoist uncorrelated equality subqueries #101753

Merged

mgartner added this to Triage in SQL Queries via automation Jul 20, 2023

mgartner removed this from Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Jul 20, 2023

mgartner added this to Triage in Disaster Recovery Backlog via automation Jul 20, 2023

blathers-crl bot added the T-disaster-recovery label Jul 20, 2023

mgartner removed this from Triage in Disaster Recovery Backlog Jul 20, 2023

mgartner moved this from Triage to New Backlog in SQL Queries Jul 20, 2023

cockroachdb deleted a comment from blathers-crl bot Jul 20, 2023

shermanCRL removed the T-disaster-recovery label Jul 20, 2023
