
sql: loose index scan #24584

Open
godofdream opened this issue Apr 7, 2018 · 11 comments
Labels
A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-nostale Marks an issue/pr that should be ignored by the stale bot

Comments

@godofdream commented Apr 7, 2018

MySQL supports loose index scans, and in Postgres they can be simulated with a recursive CTE.
See: https://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html

One example use case:

SELECT DISTINCT ON (a) a, timestamp, c FROM mytable ORDER BY a, timestamp DESC, c;

Index over (a, timestamp DESC, c).
a and timestamp are not unique, but c is.
Normally a full table scan is used, but a loose index scan would be beneficial.
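
For concreteness, here is roughly what that recursive-CTE simulation could look like for this use case (a sketch only, assuming the table is mytable as above and a is never NULL; the names distinct_a, d, and grp are just illustrative). The recursive part seeks to each distinct value of a, and the LATERAL part fetches the top row per group:

WITH RECURSIVE distinct_a AS (
    -- seek to the smallest a
    (SELECT a FROM mytable ORDER BY a LIMIT 1)
  UNION ALL
    -- then repeatedly seek to the next distinct a; stops once none is left
    SELECT (SELECT a FROM mytable WHERE a > d.a ORDER BY a LIMIT 1)
    FROM distinct_a AS d
    WHERE d.a IS NOT NULL
)
SELECT grp.*
FROM distinct_a AS d
CROSS JOIN LATERAL (
    -- for each distinct a, fetch its newest row (ties broken by c)
    SELECT a, "timestamp", c
    FROM mytable
    WHERE a = d.a
    ORDER BY "timestamp" DESC, c
    LIMIT 1
) AS grp;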

Importance: Nice-To-Have performance enhancement

Jira issue: CRDB-5753

@jordanlewis jordanlewis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Apr 7, 2018
@jordanlewis (Member)

For future readers, a loose index scan skips over an index using multiple seeks. For example, take the following schema and query:

CREATE TABLE t (a int, b int, c string, PRIMARY KEY(a, b));
SELECT DISTINCT(a) FROM t;

That SELECT currently does a full table scan. If there are a ton of different b values per value of a, that full table scan will be less efficient than doing seeks to a+1 after every fresh value of a that's found.
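
For illustration, those seeks correspond roughly to the following SQL (a hypothetical sketch; inside the database they would be KV seeks within a single scan rather than separate statements, and the literal values are made up):

SELECT a FROM t ORDER BY a LIMIT 1;               -- first seek: returns the smallest a, say 3
SELECT a FROM t WHERE a > 3 ORDER BY a LIMIT 1;   -- seek just past a = 3: returns the next distinct a, say 7
SELECT a FROM t WHERE a > 7 ORDER BY a LIMIT 1;   -- ...and so on, until nothing comes back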

@petermattis (Collaborator)

To see benefit, the "loose index scan" would need to be implemented down at the level of the KV Scan operation. @andreimatei this is an example of additional intelligence (beyond filtering, projection and aggregation) that we may want to add to the Scan op.

@knz knz added the O-community Originated from the community label May 14, 2018
@knz knz added this to Triage in (DEPRECATED) SQL execution via automation May 14, 2018
@knz knz added the A-sql-execution Relating to SQL execution. label May 14, 2018
@jordanlewis jordanlewis moved this from Triage to Backlog in (DEPRECATED) SQL execution Jun 20, 2018
@jordanlewis jordanlewis moved this from Triage to Lower priority backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here May 7, 2019
@jordanlewis (Member)

I don't agree with @petermattis that this would need to be implemented at the kv layer to see benefit.

Imagine a table that has 3 columns: a, b, c. There are 10,000,000 values of b for every value of a, but just 2 values of a.

In this case, today, a query like

SELECT * from t WHERE b = 2000

would need to scan 20,000,000 rows, which works out to be a lot of roundtrips to the kv layer.

If we implemented a loose index scan without kv support, we'd need only 5 roundtrips:

  1. Search for [minKey of a, 2000]. This will return the first value in the table, or a match if a contains minKey. If there's a match, we return the row, increment a, and try again. If there's no match, we save the first value of a, and move on to step 2.
  2. Assuming there's no match, we search for [first value of a, 2000]. If there's a match, we return all rows that match a, 2000.
  3. Then, regardless of whether there's a match in step 2, we increment a and search again.

Assuming the table has a values 0 and 10, here's what the request flow would look like:

  1. search for [minKey, 2000]. get back [0, 0] and 1 batch of rows. Assume that the batch size was smaller than 2000.
  2. search for [0, 2000]. Get back a match, return it.
  3. search for [1, 2000]. Get back [10, 0] and 1 batch of rows.
  4. search for [10, 2000]. Get back a match, return it.
  5. search for [11, 2000]. Get back end of table. End.
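
For illustration, the same access pattern can be written by hand today with a recursive CTE (a sketch, assuming the table is t with PRIMARY KEY (a, b) as in the earlier example and that a is never NULL; the names a_vals and v are illustrative). The CTE enumerates the distinct values of a with one seek each, and the join then does one point lookup per value for b = 2000, which works out to about five seeks instead of a 20,000,000-row scan:

WITH RECURSIVE a_vals AS (
    -- seek to the smallest a
    (SELECT a FROM t ORDER BY a LIMIT 1)
  UNION ALL
    -- seek to the next distinct a; stops once none is left
    SELECT (SELECT a FROM t WHERE a > v.a ORDER BY a LIMIT 1)
    FROM a_vals AS v
    WHERE v.a IS NOT NULL
)
-- one point lookup per distinct a
SELECT t.*
FROM a_vals AS v
JOIN t ON t.a = v.a AND t.b = 2000;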

@petermattis (Collaborator)

@jordanlewis The approach you describe here is interesting. With per-column histograms, or perhaps even just distinct counts, I think the optimizer could determine when this type of index scan is beneficial. Pushing support into the KV layer would still be desirable at some point, as I think this type of scan could be used in more situations (i.e. when the win isn't as clear cut as your scenario).
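
For reference, a back-of-the-envelope version of that costing decision might look like this (a rough sketch, not the actual cost model; cost_per_seek and cost_per_row are made-up names):

-- A loose index scan does roughly one seek per distinct value of the prefix
-- column, so it should win roughly when
--     distinct_count(a) * cost_per_seek  <<  row_count * cost_per_row
-- The distinct counts it needs are the same ones collected by table statistics:
SHOW STATISTICS FOR TABLE t;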

@awoods187 (Contributor)

Linking to #38082

@andreimatei (Contributor)

Oracle calls this an "index skip scan":

https://gerardnico.com/db/oracle/index_scans#index_skip_scans

@rohany rohany self-assigned this Jun 18, 2019
@rohany rohany moved this from Lower priority backlog to 19.2.4 in [DEPRECATED] Old SQLExec board. Don't move stuff here Jun 18, 2019
rohany added a commit to rohany/cockroach that referenced this issue Jun 25, 2019
Requested in cockroachdb#24584.

Release note (sql change): Added support for a table reader
that performs a loose index scan over the underlying table.
The index scan table reader uses information about the index
being scanned to skip unnecessary rows while scanning the table
allowing for some optimizations to be used for some types of queries.
@rohany (Contributor) commented Jun 25, 2019

Implemented in #38216.

@rohany rohany closed this as completed Jun 25, 2019
@andreimatei (Contributor)

This appears to have been closed prematurely. Fixing it needs #39668, which is stalled at the moment.

@andreimatei andreimatei reopened this Jul 8, 2020
craig bot pushed a commit that referenced this issue Jul 9, 2020
51178: sql/rowexec: delete indexSkipTableReader r=andreimatei a=andreimatei

This processor is not used. It was implemented for #24584, but the PR to
start planning it stalled since Ridwan left, a year ago (#39668).
This guy is marginally in my way, and, worse, it's dead code. Unless
someone promises to finish that PR imminently :).

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@mgartner (Collaborator) commented Aug 5, 2020

A loose index scan can also be beneficial in a simpler case than the one described by @jordanlewis above—querying for distinct values.

CREATE TABLE t (a INT, INDEX (a));
SELECT DISTINCT(a) FROM t;

If there are, for example, 1,000 unique values of a in a table of 1,000,000 rows, you can use a method similar to the one described above to avoid a full table scan. This should be beneficial whenever the ratio of distinct values to total rows is low.

Example:

Assume a has values 0, 100, 200, 300, 400, 500...

  1. Search for the minimum key in [minKey, maxKey], batch size of 1. Get back 0.
  2. Search for the minimum key in (0, maxKey], batch size of 1. Get back 100.
  3. Search for the minimum key in (100, maxKey], batch size of 1. Get back 200.
  4. Search for the minimum key in (200, maxKey], batch size of 1. Get back 300.
    ...
    n. When you get nothing back, you've found all unique values of a.

Postgres doesn't have loose index scans, but you can mimic one with a recursive CTE: https://malisper.me/the-missing-postgres-scan-the-loose-index-scan
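
For reference, that CTE has roughly the following shape for the table above (a sketch, assuming a is never NULL; each recursive step is a single index seek):

WITH RECURSIVE distinct_a AS (
    (SELECT a FROM t ORDER BY a LIMIT 1)                        -- seek to the smallest a
  UNION ALL
    SELECT (SELECT a FROM t WHERE a > d.a ORDER BY a LIMIT 1)   -- seek to the next distinct a
    FROM distinct_a AS d
    WHERE d.a IS NOT NULL                                       -- stop once no larger a exists
)
SELECT a FROM distinct_a WHERE a IS NOT NULL;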

@yuzefovich yuzefovich added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Sep 27, 2020
@asubiotto asubiotto moved this from Triage to [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Sep 29, 2020
@yuzefovich yuzefovich added this to Triage in SQL Queries via automation Apr 28, 2021
@yuzefovich yuzefovich removed this from [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Apr 28, 2021
@yuzefovich yuzefovich moved this from Triage to Backlog in SQL Queries Apr 28, 2021
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@mgartner (Collaborator)

Here's another type of query that would benefit from loose index scans. Consider the table and query:

CREATE TABLE t (a INT, b INT, c INT, INDEX a_b_c_idx (a, b, c));
SELECT * FROM t WHERE a = 1 AND c = 3;

Currently, we would scan all KVs in a_b_c_idx where a=1, then filter on c=3. With a loose index scan approach, we don't need to scan all KVs and apply a filter. We can skip over repeated values of b where c!=3. This could be more efficient in cases where a = 1 AND c = 3 is highly selective and the cardinality of b for rows that satisfy that filter is low.
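
A hypothetical manual rewrite along the same lines (a sketch, assuming a_b_c_idx above and that b is never NULL; the names b_vals and v are illustrative): enumerate the distinct values of b under a = 1 with one seek each, then do one constrained lookup per value for c = 3, rather than scanning and filtering every a = 1 row:

WITH RECURSIVE b_vals AS (
    -- seek to the smallest b under a = 1
    (SELECT b FROM t WHERE a = 1 ORDER BY b LIMIT 1)
  UNION ALL
    -- seek to the next distinct b under a = 1
    SELECT (SELECT b FROM t WHERE a = 1 AND b > v.b ORDER BY b LIMIT 1)
    FROM b_vals AS v
    WHERE v.b IS NOT NULL
)
-- one constrained lookup per (1, b) prefix
SELECT t.*
FROM b_vals AS v
JOIN t ON t.a = 1 AND t.b = v.b AND t.c = 3;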

@mgartner mgartner moved this from Backlog to New Backlog in SQL Queries Feb 16, 2023


@michae2 michae2 added the X-nostale Marks an issue/pr that should be ignored by the stale bot label May 9, 2024