db: WAL failover to deal with transient unavailability #3230

sumeerbhola · 2024-01-18T03:21:11Z

We see transient write unavailability of block devices in the cloud (< 60s) that is sometimes detected as a disk stall, resulting in node crashes. Whether the node crashes or not, this negatively impacts the user workload. Reads have not been observed to stall in this manner, and additionally reads can often be satisfied using the Pebble block cache or the OS page cache.

WAL failover relies on more than one block device being configured for the node, say two block devices and two Pebble DBs. The WAL for one Pebble DB can temporarily fail over to the block device of the other. Flushes and compactions will stall, but most workloads write at a rate low enough that we can afford to buffer 60s of data in memtables. More details in https://docs.google.com/document/d/1vAsftzyPG-kDy-A2Ic1fZeKd4OKJIf6N7KvNXRpDAFA/edit#heading=h.8n1r6sehoqgk (internal doc).

Also see CRDB-35401

Starting a list to track the remaining Pebble work:

Integrate wal.failoverManager with Pebble
Metrics: cumulative number of switches; cumulative duration writing to primary and secondary
Config: Make failover threshold (FailoverOptions.UnhealthyOperationLatencyThreshold) dynamically changeable
wal: synchronously verify secondary is writable #3463
Testing: metamorphic/randomized testing of WAL failover
[Optional] make excises flushable ingests (to not stall the commit pipeline now that we excise for range snapshots) ingest: IngestAndExcise should use flushableIngest for ingests #3335
[Misc CockroachDB, Optional] AC: if storage AC is active, it will see a drop in compaction throughput out of L0 and start limiting new writes. Could allow unlimited tokens while writing to secondary?

Informs cockroachdb#3230

Move some of the logic related to the batch representation to a new batchrepr package. For now, this is mostly just the BatchReader type. Future work may move additional logic related to writing new batch mutations and sorted iteration over a batch's contents, assuming there's no impact on performance. This move is motivated by cockroachdb#3230. The planned wal package will need to inspect batch sequence numbers for deduplication when reconstructing the logical contents of a virtual WAL. Moving the logic outside the pebble package avoids duplicating the logic.

Informs cockroachdb#3230

Move some of the logic related to the batch representation to a new batchrepr package. For now, this is primarily the BatchReader type and a few very small facilities around the Header. Future work may move additional logic related to writing new batch mutations and sorted iteration over a batch's contents, assuming there's no impact on performance. This move is motivated by cockroachdb#3230. The planned wal package will need to inspect batch sequence numbers for deduplication when reconstructing the logical contents of a virtual WAL. Moving the logic outside the pebble package avoids duplicating the logic.

Move some of the logic related to the batch representation to a new batchrepr package. For now, this is primarily the BatchReader type and a few very small facilities around the Header. Future work may move additional logic related to writing new batch mutations and sorted iteration over a batch's contents, assuming there's no impact on performance. This move is motivated by cockroachdb#3230. The planned wal package will need to inspect batch sequence numbers for deduplication when reconstructing the logical contents of a virtual WAL. Moving the logic outside the pebble package avoids duplication.

Move some of the logic related to the batch representation to a new batchrepr package. For now, this is primarily the BatchReader type and a few very small facilities around the Header. Future work may move additional logic related to writing new batch mutations and sorted iteration over a batch's contents, assuming there's no impact on performance. This move is motivated by #3230. The planned wal package will need to inspect batch sequence numbers for deduplication when reconstructing the logical contents of a virtual WAL. Moving the logic outside the pebble package avoids duplication.
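To illustrate the deduplication the planned wal package needs, here is a minimal sketch rather than the batchrepr API. It assumes a batch's encoded representation begins with a 12-byte header holding an 8-byte little-endian sequence number followed by a 4-byte little-endian count. The names `readHeader`, `dedupe`, and `makeBatch` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// headerLen is the assumed header size: 8 bytes of sequence number
// followed by 4 bytes of entry count, both little-endian.
const headerLen = 12

// readHeader extracts the sequence number and count from the front of an
// encoded batch. It is a stand-in for whatever batchrepr exposes.
func readHeader(repr []byte) (seqNum uint64, count uint32, err error) {
	if len(repr) < headerLen {
		return 0, 0, errors.New("batch too short to contain a header")
	}
	seqNum = binary.LittleEndian.Uint64(repr[:8])
	count = binary.LittleEndian.Uint32(repr[8:headerLen])
	return seqNum, count, nil
}

// dedupe keeps the first batch seen for each sequence number and preserves
// input order. A batch may appear in more than one physical log file after
// a failover, so the logical WAL contents must skip the repeats.
func dedupe(batches [][]byte) ([][]byte, error) {
	seen := make(map[uint64]struct{})
	var out [][]byte
	for _, b := range batches {
		seq, _, err := readHeader(b)
		if err != nil {
			return nil, err
		}
		if _, dup := seen[seq]; dup {
			continue
		}
		seen[seq] = struct{}{}
		out = append(out, b)
	}
	return out, nil
}

// makeBatch builds a header-only batch for the demonstration below.
func makeBatch(seqNum uint64, count uint32) []byte {
	b := make([]byte, headerLen)
	binary.LittleEndian.PutUint64(b[:8], seqNum)
	binary.LittleEndian.PutUint32(b[8:], count)
	return b
}

func main() {
	// Sequence number 10 was written to both the primary and the secondary.
	logical, err := dedupe([][]byte{makeBatch(10, 1), makeBatch(10, 1), makeBatch(11, 2)})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, b := range logical {
		seq, count, _ := readHeader(b)
		fmt.Println("seqnum", seq, "count", count)
	}
}
```

Keeping the header parsing in one small package means both the pebble package and the wal package can share it, which is the motivation stated above.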

failover_manager.go contains the failoverManager (which implements wal.Manager) for the write path, and helper types. dirProber monitors the primary dir while failed over to the secondary, to decide when to fail back to the primary. failoverMonitor uses the latency and errors seen by the current *LogWriter, and the probing state, to decide when to switch to a different dir. failover_writer.go contains the failoverWriter, which can switch across a sequence of record.LogWriters. record.LogWriter is changed to accommodate both standalone and failover mode, without affecting the synchronization in standalone mode. In failover mode there is some additional synchronization, in the failoverWriter queue, which is not lock-free, but is hopefully fast enough given that the fast path will use read locks. Informs cockroachdb#3230

Adjust the FailoverOptions defaults to match our initial choices for use in Cockroach. Additionally, this provides a default for ElevatedWriteStallThresholdLag which previously had none. Informs cockroachdb#3230.

Introduce support for configuring a multi-store CockroachDB node to failover a store's write-ahead log (WAL) to another store's data directory. Failing over the write-ahead log may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. Customers must opt into WAL failover by passing `--wal-failover=among-stores` to `cockroach start` or setting the env var `COCKROACH_WAL_FAILOVER=among-stores`. On start, cockroach will assign each store another store to be its failover destination. Cockroach will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the storage.wal_failover.unhealthy_op_threshold cluster setting, Cockroach will attempt to write WAL entries to its secondary store's volume. If a user wishes to disable WAL failover, they must restart the node setting `--wal-failover=disabled`. Close cockroachdb#119418. Informs cockroachdb/pebble#3230 Epic: CRDB-35401 Release note (ops change): Introduces a new start option (--wal-failover or COCKROACH_WAL_FAILOVER env var) to opt into failing over WALs between stores in multi-store nodes. Introduces a new storage.wal_failover.unhealthy_op_threshold cluster setting for configuring the latency threshold at which a WAL write is considered unhealthy.

120509: storage: add WAL failover configuration r=sumeerbhola a=jbowens Introduce support for configuring a multi-store CockroachDB node to failover a store's write-ahead log (WAL) to another store's data directory. Failing over the write-ahead log may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. Customers must opt into WAL failover by passing `--wal-failover=among-stores` to `cockroach start` or setting the env var `COCKROACH_WAL_FAILOVER=among-stores`. On start, cockroach will assign each store another store to be its failover destination. Cockroach will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the storage.wal_failover.unhealthy_op_threshold cluster setting, Cockroach will attempt to write WAL entries to its secondary store's volume. If a user wishes to disable WAL failover, they must restart the node setting `--wal-failover=disabled`. Close #119418. Informs cockroachdb/pebble#3230 Epic: CRDB-35401 Release note (ops change): Introduces a new start option (--wal-failover or COCKROACH_WAL_FAILOVER env var) to opt into failing over WALs between stores in multi-store nodes. Introduces a new storage.wal_failover.unhealthy_op_threshold cluster setting for configuring the latency threshold at which a WAL write is considered unhealthy. 120515: workflows: make more builds mandatory: `check_generated_code`, ... r=celiala a=rickystewart ...`docker_image_amd64`, and `examples_orms` Epic: CRDB-8308 Release note: None 120636: release: released CockroachDB version 24.1.0-alpha.3. Next version: 24.1.0-alpha.4 r=kvoli a=cockroach-teamcity Release note: None Epic: None Release justification: non-production (release infra) change. Co-authored-by: Jackson Owens <jackson@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Justin Beaver <teamcity@cockroachlabs.com>

This commit expands on cockroachdb#120509, introducing a WAL failover mode that allows an operator of a node with a single store to configure WAL failover to failover to a particular path (rather than another store's directory). This is configured via the --wal-failover flag: --wal-failover=path=/mnt/data2 When disabling or changing the path, the operator is required to pass the previous path. Eg, --wal-failover=path=/mnt/data3,prev_path=/mnt/data2 or --wal-failover=disabled,prev_path=/mnt/data2 Informs cockroachdb#119418. Informs cockroachdb/pebble#3230 Epic: CRDB-35401 Release note (ops change): Adds an additional option to the new (in 24.1) --wal-failover CLI flag allowing an operator to specify an explicit path for WAL failover for single-store nodes.
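The flag values shown above (`among-stores`, `disabled`, `path=...`, and an optional trailing `prev_path=...`) could be parsed along the following lines. This is a sketch under assumptions, not CockroachDB's parser: `walFailoverConfig` and `parseWALFailover` are hypothetical names, and the real flag may accept or reject combinations that this sketch does not.

```go
package main

import (
	"fmt"
	"strings"
)

// walFailoverConfig is a hypothetical representation of a --wal-failover value.
type walFailoverConfig struct {
	Mode     string // "among-stores", "disabled", or "path"
	Path     string // set only when Mode is "path"
	PrevPath string // previous explicit path, if the operator passed one
}

// parseWALFailover parses values such as "among-stores",
// "path=/mnt/data2", or "disabled,prev_path=/mnt/data2".
func parseWALFailover(v string) (walFailoverConfig, error) {
	var cfg walFailoverConfig
	for i, part := range strings.Split(v, ",") {
		key, val, hasVal := strings.Cut(part, "=")
		switch {
		case i == 0 && !hasVal && (key == "among-stores" || key == "disabled"):
			cfg.Mode = key
		case i == 0 && hasVal && key == "path" && val != "":
			cfg.Mode, cfg.Path = "path", val
		case i > 0 && hasVal && key == "prev_path" && val != "":
			cfg.PrevPath = val
		default:
			return walFailoverConfig{}, fmt.Errorf("unrecognized --wal-failover component %q", part)
		}
	}
	return cfg, nil
}

func main() {
	for _, v := range []string{
		"among-stores",
		"path=/mnt/data3,prev_path=/mnt/data2",
		"disabled,prev_path=/mnt/data2",
	} {
		cfg, err := parseWALFailover(v)
		fmt.Printf("%q -> %+v (err: %v)\n", v, cfg, err)
	}
}
```

Requiring the mode as the first component keeps the previous path clearly subordinate to whichever mode the operator is moving to.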

120783: storage: support WAL failover to an explicit path r=sumeerbhola a=jbowens This commit expands on #120509, introducing a WAL failover mode that allows an operator of a node with a single store to configure WAL failover to failover to a particular path (rather than another store's directory). This is configured via the --wal-failover flag: --wal-failover=path=/mnt/data2 When disabling or changing the path, the operator is required to pass the previous path. E.g., --wal-failover=path=/mnt/data3,prev_path=/mnt/data2 or --wal-failover=disabled,prev_path=/mnt/data2 Informs #119418. Informs cockroachdb/pebble#3230 Epic: CRDB-35401 Release note (ops change): Adds an additional option to the new (in 24.1) --wal-failover CLI flag allowing an operator to specify an explicit path for WAL failover for single-store nodes. Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>

Introduce a new roachtest that simulates disk stalls on one store of a 3-node cluster with two stores per node, and the --wal-failover=among-stores configuration set. The WAL failover configuration should ensure the workload continues uninterrupted until it becomes blocked on disk reads. Informs cockroachdb#119418. Informs cockroachdb/pebble#3230 Epic: CRDB-35401

119975: kv: allow DeleteRangeRequests to be pipelined r=nvanbenschoten a=arulajmani Previously, ranged requests could not be pipelined. However, there is no good reason to not allow them to be pipelined -- we just have to take extra care to correctly update in-flight writes tracking on the response path. We do so now. As part of this patch, we introduce two new flags -- canPipeline and canParallelCommit. We use these flags to determine whether batches can be pipelined or committed using parallel commits. This is in contrast to before, where we derived this information from other flags (isIntentWrite, !isRange). This wasn't strictly necessary for this change, but helps clean up the concepts. As a consequence of this change, we now have a distinction between requests that can be pipelined and requests that can be part of a batch that can be committed in parallel. Notably, this applies to DeleteRangeRequests -- they can be pipelined, but not be committed in parallel. That's because we need to have the entire write set upfront when performing a parallel commit, lest we need to perform recovery -- we don't have this for DeleteRange requests. In the future, we'll extend the concept of canPipeline (and !canParallelCommit) to other locking ranged requests as well. In particular, (replicated) locking {,Reverse}ScanRequests who want to pipeline their lock acquisitions. Closes #64723 Informs #117978 Release note: None 120812: changefeedccl: deflake TestAlterChangefeedAddTargetsDuringBackfill r=rharding6373 a=andyyang890 This patch deflakes `TestAlterChangefeedAddTargetsDuringBackfill` by increasing the max batch size used for changefeed initial scans. Previously, if we were unlucky, the batch sizes could be too small leading to a timeout while waiting for the initial scan to complete.
Fixes #120744 Release note: None 121023: roachtest: add disk-stalled/wal-failover/among-stores test r=sumeerbhola a=jbowens Introduce a new roachtest that simulates disk stalls on one store of a 3-node cluster with two stores per node, and the --wal-failover=among-stores configuration set. The WAL failover configuration should ensure the workload continues uninterrupted until it becomes blocked on disk reads. Informs #119418. Informs cockroachdb/pebble#3230 Epic: CRDB-35401 121073: master: Update pkg/testutils/release/cockroach_releases.yaml r=rail a=github-actions[bot] Update pkg/testutils/release/cockroach_releases.yaml with recent values. Epic: None Release note: None Release justification: test-only updates Co-authored-by: Arul Ajmani <arulajmani@gmail.com> Co-authored-by: Andy Yang <yang@cockroachlabs.com> Co-authored-by: Jackson Owens <jackson@cockroachlabs.com> Co-authored-by: CRL Release bot <teamcity@cockroachlabs.com>

When initializing the WAL failover manager, synchronously verify that we can write to the secondary directory by writing some human-readable metadata about the Pebble instance using it as a secondary. Informs cockroachdb#3230.

jbowens · 2024-03-27T19:41:01Z

I'm going to close this out as everything has been merged except for the punchlist item #3463.

blathers-crl bot added A-storage T-storage labels Jan 18, 2024

blathers-crl bot added this to Incoming in Storage Jan 18, 2024

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 18, 2024

wal: initial interfaces for wal package

24908b3

Informs cockroachdb#3230

sumeerbhola mentioned this issue Jan 18, 2024

wal: initial interfaces for wal package #3231

Merged

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 18, 2024

wal: initial interfaces for wal package

391f0ce

Informs cockroachdb#3230

This was referenced Jan 18, 2024

batchrepr: new package #3232

Merged

db: flush/compact memtables to in-memory sstables #3233

Open

sumeerbhola added a commit that referenced this issue Jan 18, 2024

wal: initial interfaces for wal package

4e6dda4

Informs #3230

sumeerbhola mentioned this issue Jan 22, 2024

wal: failover write path code #3224

Merged

sumeerbhola assigned sumeerbhola and jbowens Jan 23, 2024

nicktrav moved this from Incoming to In Progress (this milestone) in Storage Jan 23, 2024

jbowens mentioned this issue Jan 24, 2024

db: SeekPrefixGE lazy positioning, or GetPrefix #2002

Open

jbowens mentioned this issue Mar 15, 2024

wal: adjust FailoverOptions defaults #3413

Merged

jbowens mentioned this issue Mar 20, 2024

storage: support WAL failover to an explicit path cockroachdb/cockroach#120783

Merged

jbowens mentioned this issue Mar 25, 2024

roachtest: add disk-stalled/wal-failover/among-stores test cockroachdb/cockroach#121023

Merged

jbowens mentioned this issue Mar 25, 2024

storage,kv: calculate approximate MVCC stats during compactions cockroachdb/cockroach#121066

Open

jbowens mentioned this issue Mar 27, 2024

wal: synchronously verify secondary is writable #3463

Merged

jbowens closed this as completed Mar 27, 2024

Storage automation moved this from In Progress (this milestone) to Done Mar 27, 2024
