
kvserver: circuit-break requests to unavailable ranges #33007

Closed
derkan opened this issue Dec 11, 2018 · 18 comments · Fixed by #71806
Assignees
Labels
A-kv-recovery A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) N-followup Needs followup. O-community Originated from the community O-postmortem Originated from a Postmortem action item. O-sre For issues SRE opened or otherwise cares about tracking. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-kv KV Team

Comments

@derkan

derkan commented Dec 11, 2018

Describe the problem

I'm testing a multi-region deployment. Following the performance tuning docs, I installed a multi-datacenter cluster. The DC names are DCPRI and DCSEC, with 3 virtual machines in each.
When I shut down the systems in DCSEC, the cluster in DCPRI becomes unresponsive: no SQL statements complete, they just wait. The admin GUI also times out.

Is this caused by my configuration, is it a bug, or is it expected behaviour?

Here are the start parameters of the cockroach instances:

# DCPRI/Node 1
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.101 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCPRI/Node 4
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.102 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCPRI/Node 3
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.103 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# Node 4 DCSEC
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.104 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# Node 5 DCSEC
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.105 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# Node 6 DCSEC
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.106 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25

Logs from the nodes in DCPRI
DCPRI/Node1:

W181210 15:30:45.781750 136672 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.105:26257: context canceled; please retry.
W181210 15:30:45.781790 139 storage/store.go:3910  [n1,s1] handle raft ready: 1.3s [processed=0]
I181210 15:30:46.795946 136580 server/authentication.go:374  Web session error: http: named cookie not present
I181210 15:30:46.799351 136228 server/authentication.go:374  Web session error: http: named cookie not present
W181210 15:30:46.909608 136729 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.104:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.104:26257: connect: no route to host". Reconnecting...
W181210 15:30:46.909690 136729 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.104:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:30:46.909704 136729 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.104:26257: grpc: the connection is closing; please retry.
W181210 15:30:46.909709 142 storage/store.go:3910  [n1,s1] handle raft ready: 1.8s [processed=0]

DCPRI/Node4:

W181210 15:31:22.968999 71750 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host". Reconnecting...
I181210 15:31:22.969066 136 rpc/nodedialer/nodedialer.go:91  [n4] unable to connect to n6: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host"
W181210 15:31:22.969068 71750 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:22.969084 136 storage/store.go:3910  [n4,s4] handle raft ready: 1.1s [processed=0]
W181210 15:31:22.969095 71750 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.105:26257: context canceled; please retry.
W181210 15:31:23.772921 71770 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.106:26257: connect: no route to host". Reconnecting...
W181210 15:31:23.772980 71770 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:23.772988 71770 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.106:26257: grpc: the connection is closing; please retry.
W181210 15:31:23.773029 112 storage/store.go:3910  [n4,s4] handle raft ready: 1.3s [processed=0]

DCPRI/Node3:

I181210 15:31:50.649944 139 server/status/runtime.go:465  [n3] runtime stats: 253 MiB RSS, 224 goroutines, 140 MiB/18 MiB/175 MiB GO alloc/idle/total, 56 MiB/72 MiB CGO alloc/total, 24.1 CGO/sec, 0.7/0.3 %(u/s)time, 0.0 %gc (0x), 108 KiB/80 KiB (r/w)net
W181210 15:31:50.941579 69897 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.106:26257: connect: no route to host". Reconnecting...
W181210 15:31:50.941670 69897 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:50.941683 69897 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.106:26257: grpc: the connection is closing; please retry.
W181210 15:31:51.217542 69854 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host". Reconnecting...
W181210 15:31:51.217615 31 storage/store.go:3910  [n3,s3] handle raft ready: 1.2s [processed=0]
W181210 15:31:51.217622 69854 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:51.217648 69854 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.105:26257: context canceled; please retry.

Epic: CRDB-2553

Jira issue: CRDB-6349

@derkan
Author

derkan commented Dec 11, 2018

We had a chat on Gitter with @mberhault. He kindly explained how a CockroachDB cluster works across multiple datacenters. @mberhault suggested this doc and this doc, which were helpful.

Unfortunately, with 2 datacenters, if one DC is lost the cluster in the other DC is no longer available to our services, which means no data is accessible afterwards. (The same applies to a 3-DC cluster if 2 of the 3 DCs are lost.)

I think CockroachDB needs a mechanism to promote the last available DC to stay online when the others are down, whether manually or automatically. I hit issue #32524 in a single-DC, 3-node cluster and I still cannot recover some rows from some tables; SQL statements on those tables wait forever. This forever-wait behaviour should not exist either: instead, an exception should be raised saying that some ranges are not accessible, because applications are blocked by these waits.
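Until such an error exists server-side, the wait can at least be bounded on the client side. Below is a minimal Go sketch using the standard `database/sql` package; the driver, connection string, table name, and 5-second budget are placeholders, not taken from this deployment:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // any PostgreSQL-compatible driver should work
)

func main() {
	// Placeholder connection string; adjust for the actual cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@10.35.14.101:26257/defaultdb?sslmode=verify-full")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the wait: if the range backing this read is unavailable, the
	// query is cancelled after 5 seconds instead of hanging indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var count int
	if err := db.QueryRowContext(ctx, "SELECT count(*) FROM mytable").Scan(&count); err != nil {
		log.Fatalf("query failed (possibly an unavailable range): %v", err)
	}
	fmt.Println("rows:", count)
}
```

A session-level `statement_timeout`, where the cluster version supports it, achieves a similar bound without changes to application code.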

@knz knz added O-community Originated from the community C-question A question rather than an issue. No code/spec/doc change needed. labels Dec 11, 2018
@knz
Contributor

knz commented Dec 11, 2018

@tim-o please follow up on this discussion

@tim-o
Contributor

tim-o commented Dec 11, 2018

Hey @derkan - it sounds like there are two requests here:

  1. Raise an exception if ranges are not accessible, and
  2. Support a use case where a cluster can continue running after losing half of its replicas.

Is that correct? If so, the first certainly seems reasonable but the second would be a very large change. We do have work ongoing to allow single replicas to return reads if quorum can't be reached ("follower reads"), but this would not allow writes.

Let me know if that would suit your use case and if I've captured your feedback above.

Thanks!

@derkan
Author

derkan commented Dec 12, 2018

Thanks @tim-o

  • Raise an exception for inaccessible ranges: Yes. That is better than digging through logs for the cause of waiting and of locked services using the DB (of course, with a reasonable timeout).
  • Make the cluster continue running if half of the replicas are lost: Yes. We're 'cockroach', right? :-) It makes sense to allow reads in this situation, giving the DBA time to add new nodes to the cluster. Just as importantly, it would let the DBA take a backup. If possible, maybe the active nodes could keep accepting writes for the replicas they own?

@tim-o
Contributor

tim-o commented Dec 13, 2018

The first definitely seems uncontroversial to me. @awoods187 can you comment on the second request above? Does this fall under an existing project or is it net new / unspecced?

@derkan we definitely do want to be fault tolerant, but I'd be concerned that there's no way to break a 'tie' in a world where we allow a cluster to operate with consensus from only 50% of the nodes. I'll leave it to someone more familiar with core to comment on that though.

@tbg
Member

tbg commented Jan 15, 2019

Falling back to some kind of "inconsistent reads" mode when quorum is lost is very difficult to implement correctly. For starters, none of the internal invariants of the SQL layer are expected to hold in that situation.

@knz knz changed the title Multi-region install, if one DC is down, other cluster is frozen. storage: make requests to unavailable ranges time out with a clear error message Jan 22, 2019
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. A-kv-replication Relating to Raft, consensus, and coordination. and removed C-question A question rather than an issue. No code/spec/doc change needed. labels Jan 22, 2019
@knz knz added this to Incoming in KV via automation Jan 22, 2019
@knz
Contributor

knz commented Jan 22, 2019

I have reclassified this issue as a feature request for a timeout on requests to unavailable ranges.

Unfortunately, our internal work backlog is a bit congested at the moment, but we'll keep this UX improvement on our radar. Thank you for the suggestion.

@awoods187 @piyush-singh this request pertains to the UX of partially unavailable clusters. Maybe this fits into a larger roadmap picture you already have?

@tbg
Member

tbg commented Jan 22, 2019

FWIW, this is straightforward to implement. Tentatively putting this on my plate for 2.2; it would also aid our debugging immensely.

@tbg tbg self-assigned this Jan 22, 2019
craig bot pushed a commit that referenced this issue Feb 11, 2019
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order
in the `system.jobs` table. Automatic statistics were causing this
test to be flaky since `CreateStats` jobs were present in the jobs
table as well, in an unpredictable order. This commit fixes the problem
by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is,
log entries which are already reflected in the state of the snapshot.
Fundamentally, this is because we have historically used a replicated
TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term). If the TruncatedState cannot diverge
across replicas, we *must* send the whole log in snapshots, as the first
log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily, small. Log truncations are
driven by a queue and use a complex decision process. That decision process
can be faulty, and even if it isn't, the queue could be held up. Besides,
even when the Raft log contains only a few entries, these entries may be
quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log
as part of snapshots, and in turn we need the TruncatedState to be
unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does
not yet allow snapshots to avoid sending the past Raft log, but that is a
relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. at any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of them exists.

2. When a log truncation evaluates under the new cluster version, it
initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write to
the new, unreplicated, key (atomic with the deletion of the old key).

4. Future log truncations don't write any replicated data any more, but
(like before) send along the TruncatedState which is written downstream of
Raft atomically with the deletion of the log entries. This actually uses
the same code as 3. What's new is that the truncated state needs to be
verified before replacing a previous one. If replicas disagree about their
truncated state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated state
update must be ignored (this is straightforward downstream-of-Raft code).

5. When a split trigger evaluates, it seeds the RHS with the legacy key iff
the LHS uses the legacy key, and the unreplicated key otherwise. This makes
sure that the invariant that all replicas agree on the state of the
migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is part
of the snapshot metadata) in its unreplicated version. Otherwise it doesn't
have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState and
OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
   │                        Bootstrap under old version
   │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
   │                                     │
   │                                     │     Bootstrap under new version
   │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
   │                                     │
   │─ ─ ┐
   │    | Log truncation under old version
   │< ─ ┘
   │                                     │
   │─ ─ ┐                                │
   │    | Snapshot                       │
   │< ─ ┘                                │
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Snapshot
   │                                     │< ─ ┘
   │                                     │
   │   Log truncation under new version  │
   │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under new version
   │                                     │< ─ ┘
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under old version
   │                                     │< ─ ┘ (necessarily running new binary)
```

Release note: None
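To make the migration steps above concrete, here is a toy Go sketch of steps 2 through 4. Every type and function name is hypothetical and does not correspond to the actual CockroachDB code; it only models the decision logic.

```go
package main

import "fmt"

// Hypothetical, simplified model of the TruncatedState migration.
type truncatedState struct {
	Index uint64 // first index of the log after truncation
	Term  uint64
}

type replicaState struct {
	legacyKey       *truncatedState // replicated key (old world); nil once migrated
	unreplicatedKey *truncatedState // unreplicated key (new world)
}

// applyTruncation models steps 2-4: under the new cluster version the legacy
// key is deleted (initiating the migration) and the unreplicated key is
// written, but only if it moves the first index forward; a regression
// proposed by a replica with a "longer" historical log is ignored.
func applyTruncation(r *replicaState, newVersionActive bool, proposed truncatedState) {
	if !newVersionActive {
		r.legacyKey = &proposed // old behavior: update the replicated state
		return
	}
	if r.legacyKey != nil {
		r.legacyKey = nil // delete the old key; signals the downstream-of-Raft write
	}
	if r.unreplicatedKey == nil || proposed.Index > r.unreplicatedKey.Index {
		r.unreplicatedKey = &proposed
	}
}

func main() {
	r := &replicaState{legacyKey: &truncatedState{Index: 10, Term: 5}}
	applyTruncation(r, true, truncatedState{Index: 100, Term: 7})
	applyTruncation(r, true, truncatedState{Index: 50, Term: 7}) // ignored regression
	fmt.Printf("migrated=%v firstIndex=%d\n", r.legacyKey == nil, r.unreplicatedKey.Index)
}
```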

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if 2 inputs to a UNION ALL had identical post processing
except for renders, further post processing on top of that union all
could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations
that had projections, filters or renders directly on top of the UNION
ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that
we put in the cache of DiskBackedIndexedRowContainer and simply
discard the memory underlying the row that we remove from the cache.
Now, we're reusing that memory.

Release note: None

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing
mechanism. I left in tpch-no-stats since we also have that for tpcc,
just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
@lunevalex lunevalex moved this from Incoming to Replication in KV Jul 27, 2020
@tbg
Member

tbg commented Feb 16, 2021

@nvanbenschoten also points out that with parallel commits (which commit in a single phase), even txns whose first write goes to an unavailable range may not realize they should stop there, due to pipelining. From that point of view it also makes sense to fail fast, so that these txns don't keep their unrelated intents alive. Also, our deadlock detection won't work if txns are anchored on unavailable ranges.

@tbg tbg added the O-postmortem Originated from a Postmortem action item. label Feb 22, 2021
@tbg tbg changed the title storage: make requests to unavailable ranges time out with a clear error message kvserver: make requests to unavailable ranges time out with a clear error message Feb 23, 2021
@tbg tbg added this to To be considered in KV 21.1 Stability Period via automation Feb 23, 2021
@tbg tbg moved this from To be considered to STABILITY PERIOD in KV 21.1 Stability Period Feb 23, 2021
tbg added a commit to tbg/cockroach that referenced this issue Mar 3, 2022
This commit revamps an earlier implementation (cockroachdb#71806) of per-Replica
circuit breakers (cockroachdb#33007). The earlier implementation relied on context
cancellation and coarsely failed all requests addressing the Replica
when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead
for implementing the cancellation that was paid even in the common case
of a healthy Replica. Second, and more importantly, the coarseness meant
that we'd potentially fail many requests that would otherwise succeed,
and in particular follower reads.

@nvanbenschoten suggested in cockroachdb#74799 that latching could be extended with
the concept of "poisoning" and that this could result in fine-grained
circuit breaker behavior where only requests that are truly affected by
unavailability (at the replication layer) would be rejected.

This commit implements that strategy:

A request's latches are poisoned if its completion is predicated on the
replication layer being healthy.  In other words, when the breaker
trips, all inflight proposals have their latches poisoned and new
proposals are failed fast. However, and this is the big difference,
reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker
  status, and also regardless of inflight proposals (follower reads
  don't observe latches).
- a read that can be served under the current lease and which does
  not conflict with any of the stuck proposals in the replication
  layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or
need to request a lease. (If they opted out of fail-fast behavior,
they behave as today).

Latch poisoning is added as a first-class concept in the `concurrency`
package, and a structured error `PoisonError` is introduced. This error
in particular contains the span and timestamp of the poisoned latch that
prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the
fail-fast behavior to the caller. This simplifies things since multiple
callers with their own `poison.Policy` can multiplex onto a single
inflight lease proposal.

Addresses cockroachdb#74799.

Release note: None
Release justification: 22.1 project work
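For illustration, a toy Go model of the fail-fast behavior described above. The types and method names are hypothetical and deliberately far simpler than the real `concurrency` package; it only shows why writes fail fast while non-conflicting reads can still be served.

```go
package main

import (
	"errors"
	"fmt"
)

var errPoisoned = errors.New("poisoned latch: replication unavailable")

type latch struct {
	span     [2]int // [start, end) key span held by an inflight proposal
	poisoned bool
}

type replica struct {
	breakerTripped bool
	latches        []*latch
}

// tripBreaker poisons the latches of all inflight proposals.
func (r *replica) tripBreaker() {
	r.breakerTripped = true
	for _, l := range r.latches {
		l.poisoned = true
	}
}

// write fails fast whenever the breaker is tripped: a new proposal would get
// stuck in the replication layer anyway.
func (r *replica) write(start, end int) error {
	if r.breakerTripped {
		return errPoisoned
	}
	r.latches = append(r.latches, &latch{span: [2]int{start, end}})
	return nil
}

// read fails fast only if it overlaps a poisoned latch (or, not modeled here,
// if it would have to acquire a new lease); otherwise it is still served.
func (r *replica) read(start, end int) error {
	for _, l := range r.latches {
		if l.poisoned && start < l.span[1] && l.span[0] < end {
			return errPoisoned
		}
	}
	return nil
}

func main() {
	r := &replica{}
	_ = r.write(10, 20) // inflight proposal holding [10, 20)
	r.tripBreaker()
	fmt.Println("write [0,5):  ", r.write(0, 5))  // fails fast
	fmt.Println("read  [30,40):", r.read(30, 40)) // still served: no conflict
	fmt.Println("read  [15,25):", r.read(15, 25)) // fails: overlaps poisoned latch
}
```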
tbg added a commit to tbg/cockroach that referenced this issue Mar 3, 2022
This adds a roachtest that runs TPCC (with a low warehouse count and
`--tolerate-errors`), loses quorum half-way through, and checks the
prometheus metrics of the workload for fail-fast behavior.

Touches cockroachdb#33007.

Release note: None
Release justification: testing-only change
craig bot pushed a commit that referenced this issue Mar 3, 2022
76858: kvserver: allow circuit-breaker to serve reads r=erikgrinaker a=tbg

This commit revamps an earlier implementation (#71806) of per-Replica
circuit breakers (#33007). The earlier implementation relied on context
cancellation and coarsely failed all requests addressing the Replica
when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead
for implementing the cancellation that was paid even in the common case
of a healthy Replica. Second, and more importantly, the coarseness meant
that we'd potentially fail many requests that would otherwise succeed,
and in particular follower reads.

`@nvanbenschoten` suggested in #74799 that latching could be extended with
the concept of "poisoning" and that this could result in fine-grained
circuit breaker behavior where only requests that are truly affected by
unavailability (at the replication layer) would be rejected.

This commit implements that strategy:

A request's latches are poisoned if its completion is predicated on the
replication layer being healthy.  In other words, when the breaker
trips, all inflight proposals have their latches poisoned and new
proposals are failed fast. However, and this is the big difference,
reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker
  status, and also regardless of inflight proposals (follower reads
  don't observe latches).
- a read that can be served under the current lease and which does
  not conflict with any of the stuck proposals in the replication
  layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or
need to request a lease. (If they opted out of fail-fast behavior,
they behave as today).

Latch poisoning is added as a first-class concept in the `concurrency`
package, and a structured error `PoisonError` is introduced. This error
in particular contains the span and timestamp of the poisoned latch that
prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the
fail-fast behavior to the caller. This simplifies things since multiple
callers with their own `poison.Policy` can multiplex onto a single
inflight lease proposal.

Addresses #74799.

Release note: None
Release justification: 22.1 project work

77221: changefeedccl: Fix flaky test. r=miretskiy a=miretskiy

Fix flaky TestChangefeedHandlesDrainingNodes test.
The source of the flake was that cluster setting updates propagate
asynchronously to the other nodes in the cluster.  Thus, it was possible
for the test to flake because some of the nodes were observing the
old value for the setting.

The flake is fixed by introducing a testing utility function that
sets the setting and ensures it propagates to all nodes in
the test cluster.

Fixes #76806

Release Notes: none

77317: util/log: remove Safe in favor of redact.Safe r=yuzefovich a=yuzefovich

My desire to make this change is to break the dependency of `treewindow`
on `util/log` (which is a part of the effort to clean up the
dependencies of `execgen`).

Addresses: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies
a bit.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this issue Mar 4, 2022
`ReplicaUnavailableError` was previously a leaf error, so the only
way to attach additional information "on the way out" was to wrap *it*
with wrapper errors. This made it difficult to read the messages since
the result and cause were reversed.

This became exacerbated with the recent addition of `PoisonedError`,
which should really be a cause of `ReplicaUnavailableError` too.

In this commit, we make ReplicaUnavailableError a wrapper error and
rearrange breaker errors such that if a PoisonError occurs, it is
a cause of the ReplicaUnavailableError.

The change in `pkg/kv/kvserver/testdata/replica_unavailable_error.txt`
illustrates the improved error users will see (and which will thus be
reported to us).

As a wrapping error, I needed to register the error with
`cockroachdb/errors` to allow for proper encoding/decoding. I'm unsure
whether this worked properly before, but now it definitely does (as it
is tested).

Testing was improved to check for presence of `ReplicaUnavailableError`
in all breaker errors.

Touches cockroachdb#33007.

Release justification: UX improvement for existing functionality
Release note: None
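The leaf-versus-wrapper distinction can be illustrated with a self-contained Go sketch that uses only the standard `errors` package; the error type below is a hypothetical stand-in, not the real `roachpb` type.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for a replica-unavailable error as a wrapping error.
type replicaUnavailableError struct {
	rangeID int
	cause   error
}

func (e *replicaUnavailableError) Error() string {
	return fmt.Sprintf("replica of r%d is unavailable: %v", e.rangeID, e.cause)
}

// Unwrap makes this a wrapper, so the underlying cause (e.g. a poison error)
// stays inspectable via errors.Is / errors.As.
func (e *replicaUnavailableError) Unwrap() error { return e.cause }

var errPoisoned = errors.New("encountered poisoned latch")

func main() {
	err := &replicaUnavailableError{rangeID: 17, cause: errPoisoned}

	// The message now reads result-then-cause...
	fmt.Println(err)

	// ...and callers can still detect both the wrapper and the cause.
	var rue *replicaUnavailableError
	fmt.Println(errors.As(err, &rue), errors.Is(err, errPoisoned)) // true true
}
```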
@tbg
Member

tbg commented Mar 4, 2022

Moving the earlier checklist comment here, for better visibility.

22.1:

Other issues, related either to per-replica circuit breakers or to the broader problem area of handling "slow"/unservable requests across the stack, are not currently in scope:

RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
It's usually 1, but it can't hurt to detect cases in which we're looping
around. This came up when writing tests for cockroachdb#33007.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
`RangeFeed`s are long-running operations with a high fixed cost
(catch-up scan). They can fall behind for various reasons, replicas
becoming unavailable being one of them. However, when a replica's
circuit breaker trips, this should not abort any pending (or new, for
that matter) RangeFeeds.

Add a test that asserts that this is the case.

Touches cockroachdb#33007.
Touches cockroachdb#76146.

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
While running the nightly roachtest suite with circuit breakers enabled
(cockroachdb#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, the breaker is bypassed by any batch that contains
any of the following request types:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches cockroachdb#33007.

Release note: None
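As a rough sketch of that bypass decision (plain strings stand in for the real request types; the function below is hypothetical):

```go
package main

import "fmt"

// Request types that bypass the replica circuit breaker, per the list above.
var bypassBreaker = map[string]bool{
	"Export": true, "AddSSTable": true, "RevertRange": true,
	"ClearRange": true, "GC": true, "Probe": true,
}

// batchBypassesBreaker reports whether a batch contains at least one
// breaker-bypassing request.
func batchBypassesBreaker(requests []string) bool {
	for _, req := range requests {
		if bypassBreaker[req] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(batchBypassesBreaker([]string{"Put", "AddSSTable"})) // true
	fmt.Println(batchBypassesBreaker([]string{"Get", "Put"}))        // false
}
```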
craig bot pushed a commit that referenced this issue Mar 7, 2022
77365: roachpb,kvserver: make ReplicaUnavailableError a wrapping error r=erikgrinaker a=tbg

`ReplicaUnavailableError` was previously a leaf error, so the only
way to attach additional information "on the way out" was to wrap *it*
with wrapper errors. This made it difficult to read the messages since
the result and cause were reversed.

This became exacerbated with the recent addition of `PoisonedError`,
which should really be a cause of `ReplicaUnavailableError` too.

In this commit, we make ReplicaUnavailableError a wrapper error and
rearrange breaker errors such that if a PoisonError occurs, it is
a cause of the ReplicaUnavailableError.

The change in `pkg/kv/kvserver/testdata/replica_unavailable_error.txt`
illustrates the improved error users will see (and which will thus be
reported to us).

As a wrapping error, I needed to register the error with
`cockroachdb/errors` to allow for proper encoding/decoding. I'm unsure
whether this worked properly before, but now it definitely does (as it
is tested).

Testing was improved to check for presence of `ReplicaUnavailableError`
in all breaker errors.

Touches #33007.

Release justification: UX improvement for existing functionality
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
craig bot pushed a commit that referenced this issue Mar 7, 2022
76989: importccl: import avro files with logical time types r=HonoreDB,stevendanna a=msbutler

Previously, the user had to stringify all time columns in their avro file before
importing into a CRDB cluster. This change allows avro columns with the following
avro types to be imported directly: long.time-micros, int.time-millis,
long.timestamp-micros, long.timestamp-millis, and int.date.

Notably, we still don't allow local-timestamp-micros and local-timestamp-millis
to be imported directly because the avro file reader we use, goAvro, does not
support them.

If there's demand, future work could be done to support importing the decimal
avro type, the only other avro logical type goAvro supports.

Release justification: low risk, high benefit changes to existing functionality

Release note (sql change): IMPORT INTO with AVRO now supports avro files with
the following avro types: long.time-micros, int.time-millis,
long.timestamp-micros, long.timestamp-millis, and int.date. This feature only
works if the user has created a crdb table with column types that match certain
avro types. Specifically:

AVRO | CRDB
time-* | TIME
timestamp_* | TIMESTAMP
date | DATE

77360: roachpb: modernize AmbiguousResultError r=erikgrinaker a=tbg

AmbiguousResultError previously had a `WrappedErr *Error` field.
Wrapping `Error` inside of itself is a really bad idea! Luckily,
it wasn't used by anyone, but it is being used now, with replica
circuit breakers.

Before we cement this in a production release, give this error type a
do-over and let it "properly" wrap an error. I performed an audit to
make sure we were always checking for ambiguous results first, i.e.
looked for bugs of the form

```go
switch err.(type) {
  case *roachpb.SomeRetriableError:
    // Retry request.
    continue
  case *AmbiguousResultError:
    return err
}
```

where `err` might be an `AmbiguousResultError` wrapping a
`NotRetriableError`. I didn't find anything; we always carry out the
ambiguous result check first. Also, very few errors are wrapped in an
ambiguous result, and most of them could actually take precedence. If
for example an ambiguous result wraps a lease error, the result is not
ambiguous. We don't try to exploit that at all in this commit, and
likely never will since this isn't a common situation.

Release justification: improvement to new functionality (#33007)
Release note: None
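For reference, a hypothetical sketch of the ordering that audit was checking for, using standard-library errors rather than the real `roachpb` types: the ambiguous-result case must be handled before any retry decision, since an ambiguous error may wrap an otherwise-retriable cause.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for an ambiguous-result error that wraps its cause.
type ambiguousResultError struct{ cause error }

func (e *ambiguousResultError) Error() string { return "result is ambiguous: " + e.cause.Error() }
func (e *ambiguousResultError) Unwrap() error { return e.cause }

var errRetriable = errors.New("retriable error")

func handle(err error) string {
	var ambiguous *ambiguousResultError
	switch {
	case errors.As(err, &ambiguous):
		// Checked first: the outcome is unknown, so a blind retry is unsafe.
		return "return ambiguous result to client"
	case errors.Is(err, errRetriable):
		return "retry request"
	default:
		return "return error"
	}
}

func main() {
	// An ambiguous result wrapping a retriable cause is still ambiguous.
	fmt.Println(handle(&ambiguousResultError{cause: errRetriable}))
	fmt.Println(handle(errRetriable))
}
```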

77416: kvserver: change invariant checking in raftLogQueue r=erikgrinaker a=sumeerbhola

The meaning of truncateDecisionIndex.FirstIndex has changed
in the presence of loosely coupled truncation. We now consider
the raft log to be empty for purposes of truncation when
[FirstIndex,LastIndex] is an empty interval.

Fixes #77394

Release justification: Bug fix by changing an assertion that was
no longer correct.

Release note: None

77437: sql/catalog/lease: update a comment r=ajwerner a=ajwerner

The comment referenced a variable which has since been renamed.

Release justification: non-production code change

Release note: None

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@tbg
Member

tbg commented Mar 24, 2022

Replica circuit breakers are enabled on master and release-22.1. Follow-up issues have been filed for future extensions; see above.

@tbg tbg closed this as completed Mar 24, 2022
Replication automation moved this from Current Milestone / In Progress to Done Mar 24, 2022
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2022
This adds a roachtest that runs TPCC (with a low warehouse count and
`--tolerate-errors`), loses quorum half-way through, and checks the
prometheus metrics of the workload for fail-fast behavior. It also
checks that follower reads and bounded staleness reads work as
expected at timestamps at which the cluster was known to have been
healthy.

Touches cockroachdb#33007.

Release note: None