Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: don't touch ranges unnecessarily during limited scans #47044

Open
RaduBerinde opened this issue Apr 4, 2020 · 12 comments
Open

sql: don't touch ranges unnecessarily during limited scans #47044

RaduBerinde opened this issue Apr 4, 2020 · 12 comments
Labels
A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. E-quick-win Likely to be a quick win for someone experienced. T-sql-queries SQL Queries Team
Projects

Comments

@RaduBerinde
Copy link
Member

RaduBerinde commented Apr 4, 2020

A user saw a case where selecting all rows from a (small) partitioned table was significantly faster than selecting one row using LIMIT 1.

The reason was that the first region was also the farthest away, and it only had one row. There is a single TableReader planned in that region (because of the limit); the kv fetcher requests two keys in this case, and the scan ends up going to a range on another region.

In general, we need to fetch one more key so that we're sure we got all keys for a row (if there are multiple column families). But we could get the same signal if we knew that we hit the end of a range.

Having a KV API that allows stopping a scan at the end of the range would be useful here. CC @tbg @andreimatei who have been thinking about the APIs between KV and SQL.

EDIT (@erikgrinaker): The KV API is available as of #70763.

Jira issue: CRDB-5048

@RaduBerinde RaduBerinde added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Apr 4, 2020
@RaduBerinde RaduBerinde added this to Incoming in KV via automation Apr 4, 2020
@drewdeally drewdeally assigned drewdeally and unassigned drewdeally Apr 4, 2020
@tbg
Copy link
Member

tbg commented Apr 6, 2020

You really need an option to stop the scan at a range boundary once there's been a result, right? If the first range it hits is empty you want to keep going.

@tbg
Copy link
Member

tbg commented Apr 6, 2020

Or were you thinking you'd have the fetcher do the pagination directly?

@tbg
Copy link
Member

tbg commented Apr 6, 2020

Maybe what we should do is abstract over all of the various ways you may want to consume results and let the author of the batch decide. For example, you could pass in a closure that gets to view the new result every time you get one (i.e. roughly once per range, absent limits), and decides whether that's enough.

@RaduBerinde
Copy link
Member Author

You really need an option to stop the scan at a range boundary once there's been a result, right?

Correct.

@yuzefovich yuzefovich moved this from Triage to [GENERAL BACKLOG] Enhancements/Features in BACKLOG, NO NEW ISSUES: SQL Execution Apr 17, 2020
@RaduBerinde
Copy link
Member Author

@drewdeally ran into this again. It would be great if we could get something done in this release. Could find an assignee for the issue?

@yuzefovich
Copy link
Member

Hm, to me it seems like the work is mostly in the KV layer with SQL Execution being responsible for reusing the new API. cc @asubiotto @jordanlewis

@andreimatei
Copy link
Contributor

FWIW, we used to have an API for scanning until the end of a range, but we deleted it cause it wasn't used. But that's at the level of a request to a single range. It seems to me that what we want here, more importantly, is control over how the DistSender fans out requests with limits - should it speculatively fan them out with parallel requests or should it read range by range in some order.

@lunevalex lunevalex added the A-kv-client Relating to the KV client and the KV interface. label Jul 29, 2020
@lunevalex lunevalex moved this from Incoming to Transactions in KV Jul 29, 2020
@lunevalex lunevalex moved this from Transactions to Cold storage in KV Apr 23, 2021
@jlinder jlinder added T-sql-queries SQL Queries Team T-kv KV Team labels Jun 16, 2021
@erikgrinaker
Copy link
Contributor

I'm working on some related changes for #70564 and #68050, will see if I can address this while I'm at it. From what's said above, it sounds like #70763 might fit the bill. I'm curious though:

In general, we need to fetch one more key so that we're sure we got all keys for a row (if there are multiple column families). But we could get the same signal if we knew that we hit the end of a range.

Could you elaborate a bit on this? How would the range boundary help? Presumably we're using a prefix-style scan to get all keys for a row, and we either need to get all of them or none of them? And if we're not using a prefix-style scan, what's stopping the keys of interest from straddling the range boundary?

@tbg
Copy link
Member

tbg commented Sep 28, 2021 via email

@erikgrinaker
Copy link
Contributor

A KV API to return partial results on range boundaries has been implemented in #70763. Leaving this issue open to update the SQL bits.

@erikgrinaker erikgrinaker removed their assignment Nov 13, 2021
@erikgrinaker erikgrinaker removed this from On Hold in KV Nov 13, 2021
@erikgrinaker erikgrinaker added A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. and removed T-kv KV Team A-kv-client Relating to the KV client and the KV interface. labels Nov 13, 2021
@craig craig bot closed this as completed in 71a7d6b Nov 13, 2021
@erikgrinaker erikgrinaker reopened this Nov 13, 2021
BACKLOG, NO NEW ISSUES: SQL Execution automation moved this from [GENERAL BACKLOG] Enhancements/Features/Investigations to Triage Nov 13, 2021
@yuzefovich yuzefovich added this to Triage in SQL Queries via automation Nov 18, 2021
@yuzefovich yuzefovich added the E-quick-win Likely to be a quick win for someone experienced. label Nov 23, 2021
@yuzefovich yuzefovich moved this from Triage to Backlog in SQL Queries Nov 23, 2021
@mgartner mgartner moved this from Backlog (DO NOT ADD NEW ISSUES) to New Backlog in SQL Queries Jun 22, 2023
@mgartner
Copy link
Collaborator

@yuzefovich any thoughts on if this is relevant/impactful?

@yuzefovich
Copy link
Member

This is still present, and it is the most impactful in multi-region deployments when the table has multiple column families (which is probably an edge case, so we can wait until a customer explicitly complains about it).

Here is a quick example to show the impact:

CREATE TABLE t (k INT PRIMARY KEY, v1 INT, v2 INT, FAMILY (k, v1), FAMILY (v2));
ALTER TABLE t SPLIT AT VALUES (2);
INSERT INTO t VALUES (1, 1, NULL), (2, 2, NULL);

EXPLAIN ANALYZE SELECT * FROM t LIMIT 1;
-- KV pairs read: 2

To satisfy this query we had to fetch both keys present in the table, even though the second key is on a separate range and cannot be part of the row contained within the first range.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. E-quick-win Likely to be a quick win for someone experienced. T-sql-queries SQL Queries Team
Projects
Status: Backlog
SQL Queries
New Backlog
Development

No branches or pull requests

9 participants