Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: support WAL failover to an explicit path #120783

Merged
merged 1 commit into from Mar 23, 2024

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Mar 20, 2024

This commit expands on #120509, introducing a WAL failover mode that allows an operator of a node with a single store to configure WAL failover to failover to a particular path (rather than another store's directory).

This is configured via the --wal-failover flag:

--wal-failover=path=/mnt/data2

When disabling or changing the path, the operator is required to pass the previous path. Eg,

--wal_failover=path=/mnt/data3,prev_path=/mnt/data2

or

--wal_failover=disabled,prev_path=/mnt/data2

Informs #119418.
Informs cockroachdb/pebble#3230
Epic: CRDB-35401
Release note (ops change): Adds an additional option to the new (in 24.1) --wal-failover CLI flag allowing an operator to specify an explicit path for WAL failover for single-store nodes.

Copy link

blathers-crl bot commented Mar 20, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens marked this pull request as ready for review March 20, 2024 22:42
@jbowens jbowens requested review from a team as code owners March 20, 2024 22:42
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @RaduBerinde)


-- commits line 15 at r1:
nit: wal-failover


-- commits line 19 at r1:
Do you know if the potential use cases in CC are using encryption-at-rest, in which case it would be worth the complexity of adding support for that configuration?

Can you add a substantial comment somewhere, say flags.go, with the various ways to set --wal-failover. As far as I can tell, master does not have one place which describes --wal-failover={among-stores,disabled} (it is spread across flags.go and storage/open.go), and now we are adding more.

Specifically, the distinction between --wal_failover=disabled,prev_path=/mnt/data2 and --wal_failover=disabled would be good to clarify, so that the operator knows that they can do the latter only after among-stores.


pkg/base/config.go line 897 at r1 (raw file):

	// WALFailoverExplicitPath (when changing the secondary path) or
	// WALFailoverDisabled (when disabling WAL failover after it was previously
	// enabled). It must be empty for other modes.

I think it is worth clarifying that this is also only set when disabling after explicit path was used. It must not be set when among-stores was used.


pkg/storage/open.go line 232 at r1 (raw file):

	if len(storeEnvs) == 1 {
		switch walCfg.Mode {
		case base.WALFailoverDefault, base.WALFailoverAmongStores:

arguably, base.WALFailoverAmongStores should be an error.


pkg/storage/open.go line 250 at r1 (raw file):

			// WALFailoverAmongStores. If there's only 1 store, then WAL failover was
			// effectively disabled and we noop.
			return noopConfigOption

error?


pkg/storage/open.go line 282 at r1 (raw file):

		// Check if the user provided an explicit previous path. If they did, they
		// were previously using WALFailoverExplicitPath and are now disabling it.
		// We need to add the explicilt path to WALRecoveryDirs.

I didn't quite understand the "We need to add the explicit path ..." given this is an error case -- they can't be using explicit path with multi store.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @sumeerbhola)


-- commits line 15 at r1:

Previously, sumeerbhola wrote…

nit: wal-failover

Done.


-- commits line 19 at r1:

Previously, sumeerbhola wrote…

Do you know if the potential use cases in CC are using encryption-at-rest, in which case it would be worth the complexity of adding support for that configuration?

Can you add a substantial comment somewhere, say flags.go, with the various ways to set --wal-failover. As far as I can tell, master does not have one place which describes --wal-failover={among-stores,disabled} (it is spread across flags.go and storage/open.go), and now we are adding more.

Specifically, the distinction between --wal_failover=disabled,prev_path=/mnt/data2 and --wal_failover=disabled would be good to clarify, so that the operator knows that they can do the latter only after among-stores.

Yeah, CC allows any cluster to use CMEK, so I think we will need to handle it. It actually completely slipped my mind when I was putting this up. One thought; can we use the same encrypted FS and just convert the path= to an absolute path before handing to Pebble?

I updated the actual flag description for --wal-failover to expand on the usage requirements. Do you think that's sufficient, or do you think we should have a comment discussing the internal requirements that necessitate it too?


pkg/base/config.go line 897 at r1 (raw file):

Previously, sumeerbhola wrote…

I think it is worth clarifying that this is also only set when disabling after explicit path was used. It must not be set when among-stores was used.

Done.


pkg/storage/open.go line 232 at r1 (raw file):

Previously, sumeerbhola wrote…

arguably, base.WALFailoverAmongStores should be an error.

Yeah, I'm worried about one customer in particular that intends to set COCKROACH_WAL_FAILOVER=among-stores in a fleet-wide environment that may apply to both nodes with single stores and nodes with multiple stores.


pkg/storage/open.go line 250 at r1 (raw file):

Previously, sumeerbhola wrote…

error?

Added a comment here; we will ultimately error if there was a WAL failover secondary that wasn't provided, but I think we should allow them to explicitly set it to disabled, even if they never enabled it. I can imagine a customer wanting to do this in advance of 24.2 if we do change the default of multi-store nodes to among-stores.


pkg/storage/open.go line 282 at r1 (raw file):

Previously, sumeerbhola wrote…

I didn't quite understand the "We need to add the explicit path ..." given this is an error case -- they can't be using explicit path with multi store.

Oops, done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens and @RaduBerinde)


-- commits line 19 at r1:

can we use the same encrypted FS and just convert the path= to an absolute path before handing to Pebble?

I didn't understand. "same" as what? We need the file registry etc. to function on the secondary even if the primary is stalled. I must be missing something.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @sumeerbhola)


-- commits line 19 at r1:

Previously, sumeerbhola wrote…

can we use the same encrypted FS and just convert the path= to an absolute path before handing to Pebble?

I didn't understand. "same" as what? We need the file registry etc. to function on the secondary even if the primary is stalled. I must be missing something.

You weren't missing anything; I'm just confusing myself. Updated to handle encryption-at-rest by using the same store key to initialize a separate encrypted FS for the wal dir.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 3 files at r2, 10 of 10 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @RaduBerinde)


-- commits line 19 at r1:

Do you think that's sufficient, or do you think we should have a comment discussing the internal requirements that necessitate it too?

I don't think the requirements are needed for the code comments.


pkg/ccl/baseccl/encryption_spec.go line 222 at r3 (raw file):

			}
			if len(externalPath.EncryptionOptions) > 0 {
				return fmt.Errorf("WAL failover path %s already has an encryption setting",

For multi-store, the outer loop can cause this error, since a previous iteration may have set it, yes?

Can you add a code comment that the external paths will only be set in a single store config. Otherwise this code by itself looks buggy.


pkg/cli/flags.go line 1268 at r3 (raw file):

	if serverCfg.WALFailover.Path.IsSet() {
		absPath, err := base.GetAbsoluteFSPath("wal-failover.path", serverCfg.WALFailover.Path.Path)

I'm forgetting something here. Why do we need the absolute path? Is that something to do with the file registry lookups?


pkg/storage/open.go line 232 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I'm worried about one customer in particular that intends to set COCKROACH_WAL_FAILOVER=among-stores in a fleet-wide environment that may apply to both nodes with single stores and nodes with multiple stores.

ack


pkg/storage/open.go line 250 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Added a comment here; we will ultimately error if there was a WAL failover secondary that wasn't provided, but I think we should allow them to explicitly set it to disabled, even if they never enabled it. I can imagine a customer wanting to do this in advance of 24.2 if we do change the default of multi-store nodes to among-stores.

sounds good


pkg/storage/open.go line 226 at r3 (raw file):

	// be encrypted so that the user doesn't accidentally leak data unencrypted
	// onto the filesystem.
	if engineCfg.Env.Encryption != nil && len(externalDir.EncryptionOptions) == 0 {

what about the engineCfg.Env.Encryption == nil && len(externalDir.EncryptionOptions) != 0 case?

This commit expands on cockroachdb#120509, introducing a WAL failover mode that allows an
operator of a node with a single store to configure WAL failover to failover to
a particular path (rather than another store's directory).

This is configured via the --wal-failover flag:

    --wal-failover=path=/mnt/data2

When disabling or changing the path, the operator is required to pass the
previous path. Eg,

    --wal-failover=path=/mnt/data3,prev_path=/mnt/data2

or

    --wal-failover=disabled,prev_path=/mnt/data2

Informs cockroachdb#119418.
Informs cockroachdb/pebble#3230
Epic: CRDB-35401
Release note (ops change): Adds an additional option to the new (in 24.1)
--wal-failover CLI flag allowing an operator to specify an explicit path for
WAL failover for single-store nodes.
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @sumeerbhola)


pkg/ccl/baseccl/encryption_spec.go line 222 at r3 (raw file):

Previously, sumeerbhola wrote…

For multi-store, the outer loop can cause this error, since a previous iteration may have set it, yes?

Can you add a code comment that the external paths will only be set in a single store config. Otherwise this code by itself looks buggy.

Done


pkg/cli/flags.go line 1268 at r3 (raw file):

Previously, sumeerbhola wrote…

I'm forgetting something here. Why do we need the absolute path? Is that something to do with the file registry lookups?

I believe we do it to produce "canonical" paths that can be compared (eg, when matching encryption specs to directories), and just so logged store directories are consistent regardless of how the path is specified at CLI start. IIRC, file registry lookups will convert full absolute paths back into paths relative to the data directory.


pkg/storage/open.go line 226 at r3 (raw file):

Previously, sumeerbhola wrote…

what about the engineCfg.Env.Encryption == nil && len(externalDir.EncryptionOptions) != 0 case?

Done.

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 23, 2024

TFTR!

bors r=sumeerbhola

@craig
Copy link
Contributor

craig bot commented Mar 23, 2024

@craig craig bot merged commit 0e93bbb into cockroachdb:master Mar 23, 2024
21 of 22 checks passed
@jbowens jbowens deleted the explicit-path branch March 23, 2024 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants