Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: collect stats for virtual computed columns #68254

Closed
mgartner opened this issue Jul 29, 2021 · 4 comments · Fixed by #120875
Closed

opt: collect stats for virtual computed columns #68254

mgartner opened this issue Jul 29, 2021 · 4 comments · Fixed by #120875
Assignees
Labels
A-sql-table-stats Table statistics (and their automatic refresh). branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) docs-done docs-known-limitation GA-blocker T-sql-queries SQL Queries Team
Projects

Comments

@mgartner
Copy link
Collaborator

mgartner commented Jul 29, 2021

Once #68186 is addressed, we will no longer collect stats for virtual computed columns. This can prevent the optimizer from accurately calculating the cost of scanning an index on a virtual column, and, transitively, the cost of scanning an expression index.

Here's an example:

CREATE TABLE t (k INT PRIMARY KEY, a INT, INDEX ((a%3)));

INSERT INTO t SELECT i, 1 FROM generate_series (1, 10) as s(i);
INSERT INTO t SELECT i, 2 FROM generate_series (11, 100) as s(i);
INSERT INTO t SELECT i, 3 FROM generate_series (101, 1000) as s(i);

ANALYZE t;

EXPLAIN SELECT k FROM t WHERE a%3 = 0;
                                      info
---------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • scan
    estimated row count: 10 (1.0% of the table; stats collected 12 seconds ago)
    table: t@t_expr_idx
    spans: [/0 - /0]
(7 rows)

EXPLAIN SELECT k FROM t WHERE a%3 = 1;
                                      info
---------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • scan
    estimated row count: 10 (1.0% of the table; stats collected 15 seconds ago)
    table: t@t_expr_idx
    spans: [/1 - /1]
(7 rows)

EXPLAIN SELECT k FROM t WHERE a%3 = 2;
                                      info
---------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • scan
    estimated row count: 10 (1.0% of the table; stats collected 17 seconds ago)
    table: t@t_expr_idx
    spans: [/2 - /2]
(7 rows)

Notice that the estimated row count for each scan is the same, despite the fact that there are 10 rows where a%3 = 1, 90 rows where a%3 = 2, and 900 rows where a%3 = 0.

In order to better estimated row counts in queries like this, we will have to:

  1. Collect statistics for virtual computed columns. Currently column statistics are sampled from the primary index. Virtual columns are not included in primary indexes, so there will be somewhere to sample virtual columns from indexes on them.
  2. Propagate virtual column statistics in statistics builder. A canonical scan on a primary index does not produce a virtual column. Instead, virtual column values are calculated at query execution time by wrapping a scan with a projection that produces the virtual column. Scans only contain statistics for columns they produce, so the challenge will be either producing virtual column statistics from a scan or from the projection that produces them.

Jira issue: CRDB-13888

Epic: CRDB-8949

@mgartner mgartner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-table-stats Table statistics (and their automatic refresh). labels Jul 29, 2021
@mgartner mgartner added this to Triage in SQL Queries via automation Jul 29, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 29, 2021
@mgartner mgartner moved this from Triage to Backlog in SQL Queries Jul 29, 2021
@rytaft rytaft moved this from Backlog to 22.2 Medium Likelihood (60%) in SQL Queries May 13, 2022
@rytaft rytaft moved this from 22.2 Release to 23.1 Release in SQL Queries Jul 21, 2022
@mgartner mgartner moved this from 23.1 Release to Backlog in SQL Queries Jan 5, 2023
@rytaft rytaft moved this from Backlog to 23.2 Release in SQL Queries Feb 17, 2023
@rytaft
Copy link
Collaborator

rytaft commented Feb 17, 2023

We have some customer requests to get this done in 23.2

@michae2
Copy link
Collaborator

michae2 commented Mar 7, 2023

Also see #85176 and #78181 which are similar.

@michae2
Copy link
Collaborator

michae2 commented Oct 6, 2023

Another example of this coming up with an expression index:

CREATE TABLE t (i INT PRIMARY KEY, j JSONB);
INSERT INTO t SELECT i, json_build_object('a', i % 10) FROM generate_series(0, 9999) AS s(i);

-- Say we wanted to find rows with j->'a' = '3' but we did not want to index all of j.
-- In 23.2 we could use a forward index on j->'a' instead of an inverted index:
CREATE INDEX ON t ((j->'a'));
ANALYZE t;

-- This does use the index, but the lack of stats means we estimate 10 rows instead of 1000:
EXPLAIN SELECT * FROM t WHERE j->'a' = '3';

-- A workaround until stats are collected on virtual computed columns is to index a stored computed column:
DROP INDEX t_expr_idx;
ALTER TABLE t ADD COLUMN ja JSONB AS (j->'a') STORED;
CREATE INDEX ON t (ja);
ANALYZE t;

-- Now this uses the index and has good stats:
EXPLAIN SELECT * FROM t WHERE j->'a' = '3';

michae2 added a commit to michae2/cockroach that referenced this issue Jan 23, 2024
Add rendering of virtual computed column expressions to the CREATE
STATISTICS distsql plan. This rendering should always be added as post
processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed
columns. Future PRs will make use of the statistsics in statistics
builder, and will add settings to enable or disable collection of stats
on virtual computed columns.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Jan 30, 2024
Add rendering of virtual computed column expressions to the CREATE
STATISTICS distsql plan. This rendering should always be added as post
processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed
columns. A future PR will make use of the statistsics in statistics
builder.

Collection of partial statistics (USING EXTREMES) on virtual computed
columns simply works after this change because partial statistics
collection utilizes secondary indexes, where virtual computed columns
are regular stored columns.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting,
`sql.stats.virtual_computed_columns.enabled`, which when set enables
collection of table statistics on virtual computed columns.
michae2 added a commit to michae2/cockroach that referenced this issue Jan 31, 2024
Add rendering of virtual computed column expressions to the CREATE
STATISTICS distsql plan. This rendering should always be added as post
processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed
columns. A future PR will make use of the statistsics in statistics
builder.

Collection of partial statistics (USING EXTREMES) on virtual computed
columns simply works after this change because partial statistics
collection utilizes secondary indexes, where virtual computed columns
are regular stored columns.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting,
`sql.stats.virtual_computed_columns.enabled`, which when set enables
collection of table statistics on virtual computed columns.
craig bot pushed a commit that referenced this issue Jan 31, 2024
118241: sql: collect table statistics on virtual computed columns r=yuzefovich,rytaft a=michae2

Add rendering of virtual computed column expressions to the CREATE STATISTICS distsql plan. This rendering should always be added as post processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed columns. A future PR will make use of the statistsics in statistics builder.

Collection of partial statistics (USING EXTREMES) on virtual computed columns simply works after this change because partial statistics collection utilizes secondary indexes, where virtual computed columns are regular stored columns.

Informs: #68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting, `sql.stats.virtual_computed_columns.enabled`, which when set enables collection of table statistics on virtual computed columns.

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
Add rendering of virtual computed column expressions to the CREATE
STATISTICS distsql plan. This rendering should always be added as post
processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed
columns. A future PR will make use of the statistsics in statistics
builder.

Collection of partial statistics (USING EXTREMES) on virtual computed
columns simply works after this change because partial statistics
collection utilizes secondary indexes, where virtual computed columns
are regular stored columns.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting,
`sql.stats.virtual_computed_columns.enabled`, which when set enables
collection of table statistics on virtual computed columns.

This comment was marked as resolved.

@michae2 michae2 added the branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 label Mar 6, 2024
@craig craig bot closed this as completed in de846df Mar 26, 2024
michae2 added a commit to michae2/cockroach that referenced this issue Mar 26, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 26, 2024
Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql change): Add new session variable
`optimizer_use_virtual_computed_column_stats`. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.
michae2 added a commit to michae2/cockroach that referenced this issue Mar 26, 2024
Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 26, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
cockroach-dev-inf pushed a commit that referenced this issue Mar 27, 2024
With optimizer_use_virtual_computed_column_stats set to false,
constrained scans were still sometimes using stats on virtual computed
columns. This commit adds a check to makeTableStatistics which prevents
creation of any statistics referencing a virtual computed column, which
is a stronger check than existed before.

With this check, the VirtualCols sets will always be empty when
optimizer_use_virtual_computed_column_stats is false.

Informs: #68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 27, 2024
Add rendering of virtual computed column expressions to the CREATE
STATISTICS distsql plan. This rendering should always be added as post
processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed
columns. A future PR will make use of the statistsics in statistics
builder.

Collection of partial statistics (USING EXTREMES) on virtual computed
columns simply works after this change because partial statistics
collection utilizes secondary indexes, where virtual computed columns
are regular stored columns.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting,
`sql.stats.virtual_computed_columns.enabled`, which when set enables
collection of table statistics on virtual computed columns.
craig bot pushed a commit that referenced this issue Mar 27, 2024
121171: opt/memo: fix optimizer_use_virtual_computed_column_stats r=DrewKimball a=michae2

With optimizer_use_virtual_computed_column_stats set to false, constrained scans were still sometimes using stats on virtual computed columns. This commit adds a check to makeTableStatistics which prevents creation of any statistics referencing a virtual computed column, which is a stronger check than existed before.

With this check, the VirtualCols sets will always be empty when optimizer_use_virtual_computed_column_stats is false.

Informs: #68254

Epic: CRDB-8949

Release note: None

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql): Add new session variable
`optimizer_use_virtual_computed_column_stats`. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
With optimizer_use_virtual_computed_column_stats set to false,
constrained scans were still sometimes using stats on virtual computed
columns. This commit adds a check to makeTableStatistics which prevents
creation of any statistics referencing a virtual computed column, which
is a stronger check than existed before.

With this check, the VirtualCols sets will always be empty when
optimizer_use_virtual_computed_column_stats is false.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql): Add new session variable
`optimizer_use_virtual_computed_column_stats`. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this issue Mar 29, 2024
With optimizer_use_virtual_computed_column_stats set to false,
constrained scans were still sometimes using stats on virtual computed
columns. This commit adds a check to makeTableStatistics which prevents
creation of any statistics referencing a virtual computed column, which
is a stronger check than existed before.

With this check, the VirtualCols sets will always be empty when
optimizer_use_virtual_computed_column_stats is false.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-table-stats Table statistics (and their automatic refresh). branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) docs-done docs-known-limitation GA-blocker T-sql-queries SQL Queries Team
Projects
Archived in project
SQL Queries
23.2 Release
Development

Successfully merging a pull request may close this issue.

4 participants