
distsql: uncertainty reads under DistSQL don't benefit from read span refresh mechanism #24798

Open
andreimatei opened this issue Apr 13, 2018 · 11 comments
Labels
A-sql-execution Relating to SQL execution.
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
C-performance Perf of queries or internals. Solution not expected to change functional behavior.
T-sql-queries SQL Queries Team.

Comments

@andreimatei
Contributor

andreimatei commented Apr 13, 2018

When a regular Scan encounters a ReadWithinUncertaintyInterval error, the TxnCoordSender will immediately try to refresh the txn's read spans and, if successful, retry the batch. This doesn't apply to DistSQL reads, which don't go through the TxnCoordSender.
We should figure out another level at which to retry.
Separately, if the whole flow is scheduled on the gateway, everything could go through the TxnCoordSender, I think.

Jira issue: CRDB-5744
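
To make the mechanism concrete, here is a minimal Go sketch of the refresh-and-retry loop described above; all names (`coordSender`, `uncertaintyError`, `refreshSpans`) are hypothetical stand-ins, not the actual TxnCoordSender API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real KV types; names are illustrative only.
type Batch struct{ spans []string }

type uncertaintyError struct{ existingTS int64 }

func (e *uncertaintyError) Error() string {
	return fmt.Sprintf("ReadWithinUncertaintyInterval: value exists at ts=%d", e.existingTS)
}

type coordSender struct {
	readTS    int64
	readSpans []string // spans read so far at readTS
	attempts  int
}

// send retries a batch after an uncertainty error, provided all previously
// read spans can be refreshed (re-validated) at the bumped timestamp.
func (tcs *coordSender) send(b Batch) error {
	for {
		err := tcs.sendOnce(b)
		var ue *uncertaintyError
		if !errors.As(err, &ue) {
			return err // success, or a non-retryable error
		}
		newTS := ue.existingTS + 1
		if !tcs.refreshSpans(newTS) {
			return err // can't refresh; the retry error reaches the client
		}
		tcs.readTS = newTS // refreshed: bump the read timestamp and retry the batch
	}
}

func (tcs *coordSender) sendOnce(b Batch) error {
	tcs.readSpans = append(tcs.readSpans, b.spans...)
	tcs.attempts++
	if tcs.attempts == 1 {
		return &uncertaintyError{existingTS: 15} // first attempt hits uncertainty
	}
	return nil
}

// refreshSpans would re-scan tcs.readSpans, checking that nothing was written
// in (tcs.readTS, newTS]; here it just pretends the refresh succeeds.
func (tcs *coordSender) refreshSpans(newTS int64) bool { return true }

func main() {
	tcs := &coordSender{readTS: 10}
	fmt.Println(tcs.send(Batch{spans: []string{"a-b"}}), tcs.readTS) // <nil> 16
}
```

The point of the issue is that DistSQL reads historically bypassed this loop entirely.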

@andreimatei andreimatei added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 13, 2018
@andreimatei andreimatei added this to the 2.1 milestone Apr 13, 2018
@andreimatei andreimatei self-assigned this Apr 13, 2018
@andreimatei andreimatei added A-kv-client Relating to the KV client and the KV interface. A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels May 4, 2018
@andreimatei
Contributor Author

As the only-visible-to-crdb referenced issue above shows, this is suspected to cause a significant regression in high-percentile SELECT latency on a customer workload.

@knz knz added this to Triage in (DEPRECATED) SQL execution via automation May 14, 2018
@jordanlewis
Member

Hmm, @andreimatei can we talk about this? Seems like something we should tackle soon.

@jordanlewis jordanlewis moved this from Triage to Backlog in (DEPRECATED) SQL execution Aug 21, 2018
@tbg tbg removed the A-kv-client Relating to the KV client and the KV interface. label Aug 21, 2018
@jordanlewis jordanlewis self-assigned this Aug 21, 2018
@jordanlewis jordanlewis moved this from Backlog to Bugfix milestone in (DEPRECATED) SQL execution Aug 21, 2018
@jordanlewis jordanlewis moved this from Bugfix milestone to Backlog in (DEPRECATED) SQL execution Sep 11, 2018
@jordanlewis jordanlewis modified the milestones: 2.1, 2.2 Sep 26, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@jordanlewis jordanlewis moved this from Backlog to Lower Priority Backlog in (DEPRECATED) SQL execution Oct 17, 2018
@andreimatei
Contributor Author

Months later, DistSQL reads go through the TxnCoordSender, but the txnSpanRefresher is neutered.
Remote flows do return their read spans to the gateway so, amusingly, the root can attempt a refresh if an error is encountered later by some other query, but not when the error is encountered by DistSQL itself.
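
A rough sketch of that asymmetry, with hypothetical types (`rootTxn`, `leafFinalState`) rather than the real kvclient ones: leaf read spans only reach the root when the flow drains, so they can back a later refresh but not one needed mid-flow.

```go
package main

import "fmt"

// Hypothetical stand-ins; not the real kv/kvclient types.
type rootTxn struct{ readSpans []string }

type leafFinalState struct{ readSpans []string }

// ingestLeaf is what the gateway does when a DistSQL flow drains:
// leaf read spans are merged into the root so that *later* refreshes cover them.
func (r *rootTxn) ingestLeaf(fs leafFinalState) {
	r.readSpans = append(r.readSpans, fs.readSpans...)
}

func main() {
	root := &rootTxn{readSpans: []string{"a-b"}}
	// While the flow is still running, root.readSpans does not yet include the
	// leaf's reads, so an uncertainty error hit inside the flow cannot be
	// handled by refreshing at the root.
	root.ingestLeaf(leafFinalState{readSpans: []string{"c-d"}})
	fmt.Println(root.readSpans) // [a-b c-d]: refreshable only after draining
}
```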

@jordanlewis jordanlewis removed their assignment Feb 15, 2019
@jordanlewis jordanlewis moved this from Triage to Lower priority backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here May 15, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 30, 2019
Before this patch, races between ingesting leaf txn metadata into the
root and the root performing span refreshes could lead to the failure to
refresh some spans and thus write skew (see cockroachdb#41173).
This patch fixes that by suspending root refreshes while there are
leaves in operation - namely while DistSQL flows that use leaves (either
remotely or locally) are running. So, with this patch, while a
distributed query is running there will be no refreshes, but once
it finishes and all leaf metadata has been collected, refreshes are
enabled again.

Refreshes are disabled at different levels depending on the reason:
- they're disabled at the DistSQLPlanner.Run() level for distributed
queries
- they're disabled at the FlowBase level for flows that use leaves
because of concurrency between Processors
- they're disabled at the vectorizedFlow level for vectorized flows that
use leaves internally in their operators

The first two bullets build on facilities built in the previous commit
for detecting concurrency within flows.

Fixes cockroachdb#41173
Touches cockroachdb#24798

Release justification: bug fix

Release note (bug fix): Fix a bug possibly leading to write skew after distributed
queries (cockroachdb#41173).
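
As a rough illustration of the gating described in the commit message above (names like `spanRefresher`, `leafStarted`, and `leafIngested` are invented for the sketch, not the real code):

```go
package main

import "fmt"

// Hypothetical sketch: the refresher refuses to refresh while any leaf txns
// are outstanding, and is re-armed once all leaf metadata has been ingested
// back into the root.
type spanRefresher struct {
	outstandingLeaves int
	readSpans         []string
}

func (sr *spanRefresher) leafStarted() { sr.outstandingLeaves++ }

func (sr *spanRefresher) leafIngested(spans []string) {
	sr.readSpans = append(sr.readSpans, spans...)
	sr.outstandingLeaves--
}

// canRefresh reports whether a refresh may be attempted right now.
func (sr *spanRefresher) canRefresh() bool { return sr.outstandingLeaves == 0 }

func main() {
	sr := &spanRefresher{}
	sr.leafStarted()
	fmt.Println(sr.canRefresh()) // false: a distributed flow is running
	sr.leafIngested([]string{"a-b"})
	fmt.Println(sr.canRefresh()) // true: leaf metadata collected, refreshes re-enabled
}
```
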
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 9, 2019
Before this patch, a DistSQL flow running on its gateway node would use
the RootTxn for all its processors (for row-based flows) / all of its
operators (for vectorized flows) if there were no remote flows. Some of
these processors/operators can execute concurrently with one another.
RootTxns don't support concurrent requests (see cockroachdb#25329), resulting in
some reads possibly failing to see the transaction's own writes.

This patch fixes things by using a LeafTxn on the gateway if there's
concurrency on the gateway or if there are any remote flows. In
other words, the Root is used only if there are no remote flows and no
concurrency. This is sufficient for supporting mutations (which need the
Root), because mutations force everything to be planned on the gateway
and so, thanks to the previous commit, there's no concurrency in that
case.

Fixes cockroachdb#40487
Touches cockroachdb#24798

Release justification: Fixes bad bugs.

Release note: Fix a bug possibly leading to transactions failing to see
their own previous writes (cockroachdb#40487).
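
A tiny sketch of the decision rule described in this commit message, with invented names (`txnForGateway` is not a real function):

```go
package main

import "fmt"

// Hypothetical sketch of the rule: the RootTxn is used on the gateway only
// when there are no remote flows and no concurrency between local
// processors/operators; otherwise a LeafTxn is used.
type txnKind int

const (
	rootTxnKind txnKind = iota
	leafTxnKind
)

func txnForGateway(hasRemoteFlows, localConcurrency bool) txnKind {
	if hasRemoteFlows || localConcurrency {
		return leafTxnKind // leaves tolerate concurrent requests
	}
	return rootTxnKind // mutations need the root; they force this case
}

func main() {
	fmt.Println(txnForGateway(false, false) == rootTxnKind) // true
	fmt.Println(txnForGateway(true, false) == leafTxnKind)  // true
}
```
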
@andreimatei
Contributor Author

I've been thinking about this again, because I'm trying to also think through how refreshes should work in a world where we use transactions concurrently (and DistSQL is concurrent + distributed).
The crux of the problem here is the fundamental difference between having a single TxnCoordSender that all reads go through before their results are presented to clients, versus not having it. In the local case, when one request wants the read timestamp to be advanced, the TCS can refresh all the previous reads and then simply retry the respective batch. However, in the distributed case, nobody knows all the read spans until DistSQL drains all the metadata at the end of the query.

Distilling more, if we want our queries to use a consistent snapshot of the data, then a read r can only use a forwarded read timestamp ts2 if:
a) all the results that have been delivered to the client before any results resulting from r can be refreshed to ts2.
b) all the results that will be delivered to the client after any results resulting from r either come from reads at ts2 or can be refreshed to ts2.

How can we get that in a DistSQL setting, where nodes read independently and results flow through various paths from where they're read to the gateway? The only way I see is by attaching a token to every row flowing through DistSQL, tracking the highest timestamp of a read that contributed to that row. For example, if a row was computed by joining something read at ts 10 with something read at ts 20, we'd say that the resulting row is tagged with 20.
Upon receiving such a row, the DistSQLReceiver would have to make sure that all the nodes that contributed to it can refresh their reads to 20, by contacting all the nodes in the flow and asking them to refresh. Obviously, all the tracking here would be very coarse. Then, leaves could be allowed to refresh independently (and it can also be the leaves that initiate the refreshing on all the other leaves).
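
To illustrate, a hypothetical Go sketch of the tagging idea (`taggedRow`, `receiver`, and `refreshAll` are invented names; the real DistSQLReceiver has no such hook):

```go
package main

import "fmt"

// taggedRow carries the highest read timestamp among all reads that
// contributed to the row.
type taggedRow struct {
	datums []string
	maxTS  int64
}

// joinRows combines two inputs; the output is tagged with the max of the inputs.
func joinRows(l, r taggedRow) taggedRow {
	ts := l.maxTS
	if r.maxTS > ts {
		ts = r.maxTS
	}
	return taggedRow{datums: append(append([]string{}, l.datums...), r.datums...), maxTS: ts}
}

// receiver mimics the gateway-side consumer: before emitting a row read at a
// higher timestamp than what was already delivered, it asks every node in the
// flow to refresh its reads up to that timestamp (coarse-grained).
type receiver struct {
	deliveredTS int64
	refreshAll  func(ts int64) bool
}

func (rc *receiver) push(row taggedRow) error {
	if row.maxTS > rc.deliveredTS {
		if !rc.refreshAll(row.maxTS) {
			return fmt.Errorf("retryable: cannot refresh flow to ts=%d", row.maxTS)
		}
		rc.deliveredTS = row.maxTS
	}
	fmt.Println("deliver", row.datums, "at ts", rc.deliveredTS)
	return nil
}

func main() {
	rc := &receiver{deliveredTS: 10, refreshAll: func(int64) bool { return true }}
	row := joinRows(taggedRow{datums: []string{"x"}, maxTS: 10}, taggedRow{datums: []string{"y"}, maxTS: 20})
	_ = rc.push(row) // triggers a (pretend) refresh of all nodes to ts=20
}
```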

It's worth mentioning that there appears to be a way to completely eliminate the need for refreshes mid-DistSQL flows. Being read-only, it's only uncertainty that causes a DistSQL flow to want to forward its read timestamp. So, before starting a distributed query we could observe the timestamps on all the nodes involved (one round trip), then ratchet everybody to the highest observed timestamp (second round trip), then forward our txn's timestamp to the highest observed one (generally, by refreshing) and only then start the flow with no uncertainty remaining. But it seems pretty pessimistic and expensive...
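
And a sketch of that two-round-trip alternative, again with invented names (`prepareFlow` is not a real function):

```go
package main

import "fmt"

// Hypothetical sketch: before starting the flow, observe each participating
// node's clock, ratchet everyone to the max, refresh the txn to that
// timestamp, and only then run with no uncertainty remaining.
func prepareFlow(nodeClocks map[string]int64, txnReadTS int64,
	refresh func(ts int64) bool) (int64, error) {
	// Round trip 1: observe timestamps on all nodes involved.
	maxTS := txnReadTS
	for _, ts := range nodeClocks {
		if ts > maxTS {
			maxTS = ts
		}
	}
	// Round trip 2 (not shown): ratchet every node's clock up to maxTS.
	// Then forward the txn's read timestamp, generally by refreshing.
	if maxTS > txnReadTS && !refresh(maxTS) {
		return 0, fmt.Errorf("refresh to %d failed; must restart txn", maxTS)
	}
	return maxTS, nil // the flow can now run with no uncertainty remaining
}

func main() {
	clocks := map[string]int64{"n1": 12, "n2": 15, "n3": 11}
	ts, err := prepareFlow(clocks, 10, func(int64) bool { return true })
	fmt.Println(ts, err) // 15 <nil>
}
```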

cc my friends @knz, @bdarnell, @nvanbenschoten, @tbg to see if there are opinions

@knz
Contributor

knz commented Nov 25, 2019

The only way I see is by attaching a token to every row flowing through DistSQL, tracking the highest timestamp of a read that contributed to that row.

What do you do with a filter, an aggregation or an anti-join, where the row carrying the tag is filtered out?

@andreimatei
Contributor Author

I would carry forward the tag even when a row is filtered out, by infecting all the subsequent rows. I'd have each processor keep track of the highest timestamp that any of its input rows have been tagged with, and I'd tag every output row with that (and also tag the "absence of any output rows" by including this timestamp in each processor's trailing metadata, collected when processors drain). I think that works?

Now that I think about it again, I'm not sure why I phrased this as "tagging rows" rather than describing it in terms of broadcasting metadata and taking advantage of the DistSQL ordered message streams: processors that do KV operations (TableReader, IndexJoiner, etc.) would notice when a scan they've done was actually performed at a new (higher) timestamp and would broadcast this information to all their consumers as a ProducerMetadata record before sending any more rows to any consumer (the "all consumers" part is important; for example, a hash router would send this along on all its output streams). Then, every other processor would respect the convention that such a metadata record is forwarded immediately (as opposed to how we currently handle metadata, by deferring its forwarding until later).
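
A sketch of this broadcast convention, using invented types (`message`, `stream`) rather than the real ProducerMetadata plumbing:

```go
package main

import "fmt"

// A KV-reading processor that notices its scan ran at a higher timestamp
// pushes a "timestamp advanced" record on *all* of its output streams before
// sending any more rows; downstream processors forward it immediately.
type message struct {
	row        []string
	advancedTS int64 // 0 means this is a normal data message
}

type stream chan message

// broadcastAdvance sends the timestamp record on every output stream ahead of
// further rows (a hash router, for example, would do this on all outputs).
func broadcastAdvance(outputs []stream, ts int64) {
	for _, out := range outputs {
		out <- message{advancedTS: ts}
	}
}

func main() {
	outs := []stream{make(stream, 2), make(stream, 2)}
	broadcastAdvance(outs, 20)
	outs[0] <- message{row: []string{"x"}}
	fmt.Println((<-outs[0]).advancedTS, (<-outs[1]).advancedTS) // 20 20
}
```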

@knz
Contributor

knz commented Nov 25, 2019

A distsql processor can have no output rows. Indeed, it seems like something that's not part of the flow but instead part of the "metadata".

(I really think this word "metadata" is bad and should never have been used. The better abstraction is the distinction between control plane and data plane. You're playing with the control plane here, regardless of what flows data-wise.)

@knz
Contributor

knz commented Nov 25, 2019

There's another challenge in there though. Suppose you have two concurrent processors A and B.

Processor A fails with a logic error (say some SQL built-in function errors out).
Concurrently B is scanning ahead some data.

Today the repatriation of the "metadata" payload will cause the logic error to cancel out whatever result comes from B. That would trash the information bits needed in your algorithm.

If we ever implement savepoint rollbacks in combination with txn refreshes, it's important that the magic that you want to implement does not get invalidated by such a logic error.

@asubiotto asubiotto moved this from Lower priority backlog to [TENT] SQL Exec in [DEPRECATED] Old SQLExec board. Don't move stuff here Apr 2, 2020
@github-actions

github-actions bot commented Jun 6, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz knz added this to Triage in SQL Queries via automation Jun 7, 2021
@jordanlewis jordanlewis removed this from Triage in SQL Queries Jun 7, 2021
@jordanlewis jordanlewis added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Jun 7, 2021
@jordanlewis jordanlewis moved this from Triage to [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Jun 7, 2021
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Oct 10, 2023
@knz
Contributor

knz commented Oct 10, 2023

seems still relevant

@knz knz reopened this Oct 10, 2023
@knz knz added T-sql-queries SQL Queries Team and removed X-stale T-sql-queries SQL Queries Team no-issue-activity labels Oct 10, 2023
@knz knz removed this from [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Oct 10, 2023