
distsql: uncertainty reads under DistSQL don't benefit from read span refresh mechanism #24798

Open
andreimatei opened this issue Apr 13, 2018 · 11 comments
Labels
A-sql-execution Relating to SQL execution.
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
C-performance Perf of queries or internals. Solution not expected to change functional behavior.
T-sql-queries SQL Queries Team.

Comments

@andreimatei
Contributor

andreimatei commented Apr 13, 2018

When a regular Scan encounters a ReadWithinUncertaintyInterval error, the TxnCoordSender will immediately try to refresh the txn's read spans and, if successful, retry the batch. This doesn't apply to DistSQL reads, which don't go through the TxnCoordSender.
We should figure out another level at which to retry.
Separately, if the whole flow is scheduled on the gateway, everything could go through the TxnCoordSender, I think.

Jira issue: CRDB-5744
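
To make the mechanism concrete, here is a minimal Go sketch of the refresh-and-retry loop described above; all names (`coordSender`, `uncertaintyError`, `refreshSpans`) are hypothetical stand-ins, not the actual TxnCoordSender API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real KV types; names are illustrative only.
type Batch struct{ spans []string }

type uncertaintyError struct{ existingTS int64 }

func (e *uncertaintyError) Error() string {
	return fmt.Sprintf("ReadWithinUncertaintyInterval: value exists at ts=%d", e.existingTS)
}

type coordSender struct {
	readTS    int64
	readSpans []string // spans read so far at readTS
	attempts  int
}

// send retries a batch after an uncertainty error, provided all previously
// read spans can be refreshed (re-validated) at the bumped timestamp.
func (tcs *coordSender) send(b Batch) error {
	for {
		err := tcs.sendOnce(b)
		var ue *uncertaintyError
		if !errors.As(err, &ue) {
			return err // success, or a non-retryable error
		}
		newTS := ue.existingTS + 1
		if !tcs.refreshSpans(newTS) {
			return err // can't refresh; the retry error reaches the client
		}
		tcs.readTS = newTS // refreshed: bump the read timestamp and retry the batch
	}
}

func (tcs *coordSender) sendOnce(b Batch) error {
	tcs.readSpans = append(tcs.readSpans, b.spans...)
	tcs.attempts++
	if tcs.attempts == 1 {
		return &uncertaintyError{existingTS: 15} // first attempt hits uncertainty
	}
	return nil
}

// refreshSpans would re-scan tcs.readSpans, checking that nothing was written
// in (tcs.readTS, newTS]; here it just pretends the refresh succeeds.
func (tcs *coordSender) refreshSpans(newTS int64) bool { return true }

func main() {
	tcs := &coordSender{readTS: 10}
	fmt.Println(tcs.send(Batch{spans: []string{"a-b"}}), tcs.readTS) // <nil> 16
}
```

The point of the issue is that DistSQL reads historically bypassed this loop entirely.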

@andreimatei andreimatei added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 13, 2018
@andreimatei andreimatei added this to the 2.1 milestone Apr 13, 2018
@andreimatei andreimatei self-assigned this Apr 13, 2018
@andreimatei andreimatei added A-kv-client Relating to the KV client and the KV interface. A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels May 4, 2018
@andreimatei
Contributor Author

As the only-visible-to-crdb referenced issue above shows, this is suspected to cause a significant regression in high-percentile SELECT latency on a customer workload.

@knz knz added this to Triage in (DEPRECATED) SQL execution via automation May 14, 2018
@jordanlewis
Member

Hmm, @andreimatei can we talk about this? Seems like something we should tackle soon.

@jordanlewis jordanlewis moved this from Triage to Backlog in (DEPRECATED) SQL execution Aug 21, 2018
@tbg tbg removed the A-kv-client Relating to the KV client and the KV interface. label Aug 21, 2018
@jordanlewis jordanlewis self-assigned this Aug 21, 2018
@jordanlewis jordanlewis moved this from Backlog to Bugfix milestone in (DEPRECATED) SQL execution Aug 21, 2018
@jordanlewis jordanlewis moved this from Bugfix milestone to Backlog in (DEPRECATED) SQL execution Sep 11, 2018
@jordanlewis jordanlewis modified the milestones: 2.1, 2.2 Sep 26, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@jordanlewis jordanlewis moved this from Backlog to Lower Priority Backlog in (DEPRECATED) SQL execution Oct 17, 2018
@andreimatei
Contributor Author

Months later, DistSQL reads go through the TxnCoordSender, but the txnSpanRefresher is neutered.
Remote flows do return their read spans to the gateway so, amusingly, the root can attempt a refresh if an error is encountered later by some other query, but not when the error is encountered by DistSQL itself.
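
A rough sketch of that asymmetry, with hypothetical types (`rootTxn`, `leafFinalState`) rather than the real kvclient ones: leaf read spans only reach the root when the flow drains, so they can back a later refresh but not one needed mid-flow.

```go
package main

import "fmt"

// Hypothetical stand-ins; not the real kv/kvclient types.
type rootTxn struct{ readSpans []string }

type leafFinalState struct{ readSpans []string }

// ingestLeaf is what the gateway does when a DistSQL flow drains:
// leaf read spans are merged into the root so that *later* refreshes cover them.
func (r *rootTxn) ingestLeaf(fs leafFinalState) {
	r.readSpans = append(r.readSpans, fs.readSpans...)
}

func main() {
	root := &rootTxn{readSpans: []string{"a-b"}}
	// While the flow is still running, root.readSpans does not yet include the
	// leaf's reads, so an uncertainty error hit inside the flow cannot be
	// handled by refreshing at the root.
	root.ingestLeaf(leafFinalState{readSpans: []string{"c-d"}})
	fmt.Println(root.readSpans) // [a-b c-d]: refreshable only after draining
}
```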

@jordanlewis jordanlewis removed their assignment Feb 15, 2019
@jordanlewis jordanlewis moved this from Triage to Lower priority backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here May 15, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 30, 2019
Before this patch, races between ingesting leaf txn metadata into the
root and the root performing span refreshes could lead to the failure to
refresh some spans and thus write skew (see cockroachdb#41173).
This patch fixes that by suspending root refreshes while there are
leaves in operation - namely while DistSQL flows that use leaves (either
remotely or locally) are running. So, with this patch, while a
distributed query is running there will be no refreshes, but once
it finishes and all leaf metadata has been collected, refreshes are
enabled again.

Refreshes are disabled at different levels depending on the reason:
- they're disabled at the DistSQLPlanner.Run() level for distributed
queries
- they're disabled at the FlowBase level for flows that use leaves
because of concurrency between Processors
- they're disabled at the vectorizedFlow level for vectorized flows that
use leaves internally in their operators

The first two bullets build on facilities built in the previous commit
for detecting concurrency within flows.

Fixes cockroachdb#41173
Touches cockroachdb#24798

Release justification: bug fix

Release note (bug fix): Fix a bug possibly leading to write skew after distributed
queries (cockroachdb#41173).
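
As a rough illustration of the gating described in the commit message above (names like `spanRefresher`, `leafStarted`, and `leafIngested` are invented for the sketch, not the real code):

```go
package main

import "fmt"

// Hypothetical sketch: the refresher refuses to refresh while any leaf txns
// are outstanding, and is re-armed once all leaf metadata has been ingested
// back into the root.
type spanRefresher struct {
	outstandingLeaves int
	readSpans         []string
}

func (sr *spanRefresher) leafStarted() { sr.outstandingLeaves++ }

func (sr *spanRefresher) leafIngested(spans []string) {
	sr.readSpans = append(sr.readSpans, spans...)
	sr.outstandingLeaves--
}

// canRefresh reports whether a refresh may be attempted right now.
func (sr *spanRefresher) canRefresh() bool { return sr.outstandingLeaves == 0 }

func main() {
	sr := &spanRefresher{}
	sr.leafStarted()
	fmt.Println(sr.canRefresh()) // false: a distributed flow is running
	sr.leafIngested([]string{"a-b"})
	fmt.Println(sr.canRefresh()) // true: leaf metadata collected, refreshes re-enabled
}
```
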
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 9, 2019
Before this patch, a DistSQL flow running on its gateway node would use
the RootTxn for all its processors (for row-based flows) / all of its
operators (for vectorized flows) if there were no remote flows. Some of
these processors/operators can execute concurrently with one another.
RootTxns don't support concurrent requests (see cockroachdb#25329), resulting in
some reads possibly failing to see the transaction's own writes.

This patch fixes things by using a LeafTxn on the gateway if there's
concurrency on the gateway or if there are any remote flows. In
other words, the Root is used only if there are no remote flows and no
concurrency. This is sufficient for supporting mutations (which need the
Root), because mutations force everything to be planned on the gateway
and so, thanks to the previous commit, there's no concurrency in that
case.

Fixes cockroachdb#40487
Touches cockroachdb#24798

Release justification: Fixes bad bugs.

Release note: Fix a bug possibly leading to transactions failing to see
their own previous writes (cockroachdb#40487).
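
A tiny sketch of the decision rule described in this commit message, with invented names (`txnForGateway` is not a real function):

```go
package main

import "fmt"

// Hypothetical sketch of the rule: the RootTxn is used on the gateway only
// when there are no remote flows and no concurrency between local
// processors/operators; otherwise a LeafTxn is used.
type txnKind int

const (
	rootTxnKind txnKind = iota
	leafTxnKind
)

func txnForGateway(hasRemoteFlows, localConcurrency bool) txnKind {
	if hasRemoteFlows || localConcurrency {
		return leafTxnKind // leaves tolerate concurrent requests
	}
	return rootTxnKind // mutations need the root; they force this case
}

func main() {
	fmt.Println(txnForGateway(false, false) == rootTxnKind) // true
	fmt.Println(txnForGateway(true, false) == leafTxnKind)  // true
}
```
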
@andreimatei
Contributor Author

I've been thinking about this again, because I'm trying to also think through how refreshes should work in a world where we use transactions concurrently (and DistSQL is concurrent + distributed).
The crux of the problem here is the fundamental difference between having a single TxnCoordSender that all reads go through before their results are presented to clients, versus not having it. In the local case, when one request wants the read timestamp to be advanced, the TCS can refresh all the previous reads and then simply retry the respective batch. However, in the distributed case, nobody knows all the read spans until DistSQL drains all the metadata at the end of the query.

Distilling more, if we want our queries to use a consistent snapshot of the data, then a read r can only use a forwarded read timestamp ts2 if:
a) all the results that have been delivered to the client before any results resulting from r can be refreshed to ts2.
b) all the results that will be delivered to the client after any results resulting from r either come from reads at ts2 or can be refreshed to ts2.

How can we get that in a DistSQL setting, where nodes read independently and results flow through various paths from where they're read to the gateway? The only way I see is by attaching a token to every row flowing through DistSQL, tracking the highest timestamp of a read that contributed to that row. For example, if a row was computed by joining something read at ts 10 with something read at ts 20, we'd say that the resulting row is tagged with 20.
Upon receiving such a row, the DistSQLReceiver would have to make sure that all the nodes that contributed to it can refresh their reads to 20, by contacting all the nodes in the flow and asking them to refresh. Obviously, all the tracking here would be very coarse. Then, leaves could be allowed to refresh independently (and it can also be the leaves that initiate the refreshing on all the other leaves).
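
To illustrate, a hypothetical Go sketch of the tagging idea (`taggedRow`, `receiver`, and `refreshAll` are invented names; the real DistSQLReceiver has no such hook):

```go
package main

import "fmt"

// taggedRow carries the highest read timestamp among all reads that
// contributed to the row.
type taggedRow struct {
	datums []string
	maxTS  int64
}

// joinRows combines two inputs; the output is tagged with the max of the inputs.
func joinRows(l, r taggedRow) taggedRow {
	ts := l.maxTS
	if r.maxTS > ts {
		ts = r.maxTS
	}
	return taggedRow{datums: append(append([]string{}, l.datums...), r.datums...), maxTS: ts}
}

// receiver mimics the gateway-side consumer: before emitting a row read at a
// higher timestamp than what was already delivered, it asks every node in the
// flow to refresh its reads up to that timestamp (coarse-grained).
type receiver struct {
	deliveredTS int64
	refreshAll  func(ts int64) bool
}

func (rc *receiver) push(row taggedRow) error {
	if row.maxTS > rc.deliveredTS {
		if !rc.refreshAll(row.maxTS) {
			return fmt.Errorf("retryable: cannot refresh flow to ts=%d", row.maxTS)
		}
		rc.deliveredTS = row.maxTS
	}
	fmt.Println("deliver", row.datums, "at ts", rc.deliveredTS)
	return nil
}

func main() {
	rc := &receiver{deliveredTS: 10, refreshAll: func(int64) bool { return true }}
	row := joinRows(taggedRow{datums: []string{"x"}, maxTS: 10}, taggedRow{datums: []string{"y"}, maxTS: 20})
	_ = rc.push(row) // triggers a (pretend) refresh of all nodes to ts=20
}
```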

It's worth mentioning that there appears to be a way to completely eliminate the need for refreshes mid-DistSQL flows. Being read-only, it's only uncertainty that causes a DistSQL flow to want to forward its read timestamp. So, before starting a distributed query we could observe the timestamps on all the nodes involved (one round trip), then ratchet everybody to the highest observed timestamp (second round trip), then forward our txn's timestamp to the highest observed one (generally, by refreshing) and only then start the flow with no uncertainty remaining. But it seems pretty pessimistic and expensive...
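
And a sketch of that two-round-trip alternative, again with invented names (`prepareFlow` is not a real function):

```go
package main

import "fmt"

// Hypothetical sketch: before starting the flow, observe each participating
// node's clock, ratchet everyone to the max, refresh the txn to that
// timestamp, and only then run with no uncertainty remaining.
func prepareFlow(nodeClocks map[string]int64, txnReadTS int64,
	refresh func(ts int64) bool) (int64, error) {
	// Round trip 1: observe timestamps on all nodes involved.
	maxTS := txnReadTS
	for _, ts := range nodeClocks {
		if ts > maxTS {
			maxTS = ts
		}
	}
	// Round trip 2 (not shown): ratchet every node's clock up to maxTS.
	// Then forward the txn's read timestamp, generally by refreshing.
	if maxTS > txnReadTS && !refresh(maxTS) {
		return 0, fmt.Errorf("refresh to %d failed; must restart txn", maxTS)
	}
	return maxTS, nil // the flow can now run with no uncertainty remaining
}

func main() {
	clocks := map[string]int64{"n1": 12, "n2": 15, "n3": 11}
	ts, err := prepareFlow(clocks, 10, func(int64) bool { return true })
	fmt.Println(ts, err) // 15 <nil>
}
```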

cc my friends @knz, @bdarnell, @nvanbenschoten, @tbg to see if there are opinions

@knz
Contributor

knz commented Nov 25, 2019

The only way I see is by attaching a token to every row flowing through DistSQL, tracking the highest timestamp of a read that contributed to that row.

What do you do with a filter, an aggregation or an anti-join, where the row carrying the tag is filtered out?

@andreimatei
Contributor Author

I would carry forward the tag even when a row is filtered out, by infecting all the subsequent rows. I'd have each processor keep track of the highest timestamp that any of its input rows have been tagged with, and I'd tag every output row with that (and also tag the "absence of any output rows" by including this timestamp in each processor's trailing metadata, collected when processors drain). I think that works?

Now that I think about it again, I'm not sure why I phrased this as "tagging rows" rather than describing it in terms of broadcasting metadata and taking advantage of the DistSQL ordered message streams: processors that do KV operations (TableReader, IndexJoiner, etc.) would notice when a scan they've done was actually performed at a new (higher) timestamp and would broadcast this information to all their consumers as a ProducerMetadata record before sending any more rows to any consumer (the "all consumers" part is important; for example, a hash router would send this along on all its output streams). Then, every other processor would respect the convention that such a metadata record is forwarded immediately (as opposed to how we currently handle metadata, by deferring its forwarding until later).
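
A sketch of this broadcast convention, using invented types (`message`, `stream`) rather than the real ProducerMetadata plumbing:

```go
package main

import "fmt"

// A KV-reading processor that notices its scan ran at a higher timestamp
// pushes a "timestamp advanced" record on *all* of its output streams before
// sending any more rows; downstream processors forward it immediately.
type message struct {
	row        []string
	advancedTS int64 // 0 means this is a normal data message
}

type stream chan message

// broadcastAdvance sends the timestamp record on every output stream ahead of
// further rows (a hash router, for example, would do this on all outputs).
func broadcastAdvance(outputs []stream, ts int64) {
	for _, out := range outputs {
		out <- message{advancedTS: ts}
	}
}

func main() {
	outs := []stream{make(stream, 2), make(stream, 2)}
	broadcastAdvance(outs, 20)
	outs[0] <- message{row: []string{"x"}}
	fmt.Println((<-outs[0]).advancedTS, (<-outs[1]).advancedTS) // 20 20
}
```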

@knz
Contributor

knz commented Nov 25, 2019

A distsql processor can have no output rows. Indeed, it seems like something that's not part of the flow but instead part of the "metadata".

(I really think this word "metadata" is bad and should never have been used. The better abstraction is the distinction between control plane and data plane. You're playing with the control plane here, regardless of what flows data-wise.)

@knz
Contributor

knz commented Nov 25, 2019

There's another challenge in there though. Suppose you have two concurrent processors A and B.

Processor A fails with a logic error (say some SQL built-in function errors out).
Concurrently B is scanning ahead some data.

Today the repatriation of the "metadata" payload will cause the logic error to cancel out whatever result comes from B. That would trash the information bits needed in your algorithm.

If we ever implement savepoint rollbacks in combination with txn refreshes, it's important that the magic that you want to implement does not get invalidated by such a logic error.

@asubiotto asubiotto moved this from Lower priority backlog to [TENT] SQL Exec in [DEPRECATED] Old SQLExec board. Don't move stuff here Apr 2, 2020
@github-actions

github-actions bot commented Jun 6, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz knz added this to Triage in SQL Queries via automation Jun 7, 2021
@jordanlewis jordanlewis removed this from Triage in SQL Queries Jun 7, 2021
@jordanlewis jordanlewis added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Jun 7, 2021
@jordanlewis jordanlewis moved this from Triage to [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Jun 7, 2021
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Oct 10, 2023
@knz
Contributor

knz commented Oct 10, 2023

seems still relevant

@knz knz reopened this Oct 10, 2023
@knz knz added T-sql-queries SQL Queries Team and removed X-stale T-sql-queries SQL Queries Team no-issue-activity labels Oct 10, 2023
@knz knz removed this from [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Oct 10, 2023