
kvserver: disable {split,replicate,mvccGC} queues until subscribed to span configs #98422

Merged: 1 commit into cockroachdb:master on Mar 13, 2023

Conversation

irfansharif (Contributor)

...subscribed to span configs. Do the same for the store rebalancer. We applied this treatment to the merge queue back in #78122, since the fallback behavior, if not subscribed, is to use the statically defined span config for every operation (a sketch of the gating check follows the list below).

  • For the replicate queue this meant obtusely applying a replication factor of 3, regardless of configuration. This was typically possible post node restart, before the subscription was first established; we saw it in #98385 (roachtest: replicate/wide failed). It was then possible to ignore configured voter/non-voter/lease constraints.
  • For the split queue, we wouldn't actually compute any split keys if unsubscribed, so the missing check was somewhat benign. But we would be obtusely applying the default range sizes [128 MiB, 512 MiB], so for clusters configured with larger range sizes, this could lead to a burst of splitting post node-restart.
  • For the MVCC GC queue, it would mean applying the statically configured default GC TTL and ignoring any set protected timestamps. The latter is best-effort protection, but could result in internal operations that rely on protection (like backups and changefeeds) failing, albeit informatively. For clusters configured with a GC TTL greater than the default, post node-restart it could lead to a burst of MVCC GC activity and AOST queries failing to find expected data.
  • For the store rebalancer, ignoring span configs could result in violating lease preferences and voter/non-voter constraints.
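For illustration, the gating boils down to a subscription check ahead of queue processing. Below is a minimal, self-contained sketch of that idea — all names here are simplified stand-ins, not the actual kvserver types (the real change threads this through the shared queue machinery and Store.GetConfReader):

package queuesketch

import (
	"context"
	"errors"
)

// Mirrors the described behavior: the conf reader is unavailable until
// the KVSubscriber has received its first span config snapshot.
var errSpanConfigsUnavailable = errors.New("span configs not available")

type store struct{ subscribed bool }

func (s *store) getConfReader() error {
	if !s.subscribed {
		return errSpanConfigsUnavailable
	}
	return nil
}

type queue struct {
	store            *store
	needsSpanConfigs bool
}

// maybeProcess skips queue processing entirely while unsubscribed,
// rather than falling back to the statically defined span config.
func (q *queue) maybeProcess(ctx context.Context) error {
	if q.needsSpanConfigs {
		if err := q.store.getConfReader(); err != nil {
			return err
		}
	}
	// ... queue-specific processing (split/replicate/MVCC GC) ...
	return nil
}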

Fixes #98421.
Fixes #98385.

Release note (bug fix): It was previously possible for CockroachDB to not respect non-default zone configs. This only happened for a short window after nodes with existing replicas were restarted, and self-rectified within seconds. This manifested in a few ways:

  • If num_replicas was set to something other than 3, we would still add or remove replicas to get to 3x replication.
    • If num_voters was set explicitly to get a mix of voting and non-voting replicas, it would be ignored. CockroachDB could possibly remove non-voting replicas.
  • If range_min_bytes or range_max_bytes were changed from 128 MiB and 512 MiB respectively, we would instead try to size ranges to be within [128 MiB, 512 MiB]. This could appear as an excess amount of range splits or merges, as visible in the Replication Dashboard under "Range Operations".
  • If gc.ttlseconds was set to something other than 90000 seconds, we would still only GC data older than 90000s (25h). If the GC TTL was set to something larger than 25h, AOST queries going further back could start failing. For GC TTLs less than the 25h default, clusters would observe increased disk usage due to more retained garbage.
  • If constraints, lease_preferences or voter_constraints were set, they would be ignored. Range data and leases could be moved outside where prescribed. These issues only lasted a few seconds post node-restart, and any zone config violations were rectified shortly after.

@irfansharif irfansharif added the backport-22.2.x Flags PRs that need to be backported to 22.2. label Mar 11, 2023
@irfansharif irfansharif requested a review from a team as a code owner March 11, 2023 01:33
blathers-crl bot commented Mar 11, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

cockroach-teamcity (Member)
This change is Reviewable

kvoli (Collaborator) left a comment

This only happened for a short window after nodes with existing replicas were restarted

It seems as though it could also occur if the rangefeed couldn't be established? Is that unlikely to occur in practice?

Overall :lgtm:. Some minor nits. Looks like the patch needs crlfmt, updated setting names, and an implementation of *mockSpanConfigSubscriber.LastUpdated to pass CI.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @irfansharif)


-- commits line 16 at r1:
Could you explain this further? If there are no split keys being computed how could there be a burst of splitting post node restart?

Code quote:

  - For the split queue, we wouldn't actually compute any split keys if
    unsubscribed, so the missing check was somewhat benign. But we would
    be obtusely applying the default range sizes [128MiB,512MiB], so for
    clusters configured with larger range sizes, this could lead to a
    burst of splitting post node-restart.

-- commits line 27 at r1:
nit: Generally it affected constraints, not just voter.

for i := 0; i < len(finalNonVoterTargets); i++ {
	add, remove, _, shouldRebalance := sr.allocator.RebalanceTarget(
		ctx,
		sr.storePool,
		rbCtx.conf,
		rbCtx.candidateReplica.RaftStatus(),
		finalVoterTargets,
		finalNonVoterTargets,
		rbCtx.candidateReplica.RangeUsageInfo(),
		storepool.StoreFilterSuspect,
		allocatorimpl.NonVoterTarget,
		options,
	)


pkg/kv/kvserver/store_rebalancer.go line 158 at r1 (raw file):

		storePool = rq.store.cfg.StorePool
	}

nit: empty line added.


pkg/kv/kvserver/kvserverbase/base.go line 43 at r1 (raw file):

var ReplicateQueueEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.replicate.queue_enabled",

nit: change these to be kv.xxxx.queue.enabled
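For reference, the names that landed (see the amended commit message below) follow the kv.<queue>.enabled pattern. A sketch of the renamed registration — the description string and default value here are assumptions, not the actual diff:

var ReplicateQueueEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.replicate_queue.enabled",
	"whether the replicate queue is enabled", // description assumed
	true,                                     // default assumed
)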

@irfansharif irfansharif requested a review from a team as a code owner March 13, 2023 16:17
The commit message was updated; beyond the description and release note above, it now also notes:
While here, we also introduce the following non-public cluster settings
to selectively enable/disable KV queues:
- kv.mvcc_gc_queue.enabled
- kv.split_queue.enabled
- kv.replicate_queue.enabled

irfansharif (Author) left a comment


I'm only mentioning the node-restart case since it's a release note -- rangefeeds are an internal mechanism. But yes, it can also happen if rangefeeds reliably fail to get established, say when there's loss of quorum. But these are centralized system database ranges, and if you've lost quorum there, things are pretty bad already. I don't think it's worth mentioning, and it's rare given our routine 5x replication of these ranges.

Repurposed queueConfig.needsSystemConfig to capture this needs-subscription logic; will wait for CI to go green.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @kvoli)


-- commits line 16 at r1:

Previously, kvoli (Austen) wrote…

Could you explain this further? If there are no split keys being computed how could there be a burst of splitting post node restart?

The burst of splitting would be due to the larger configured range sizes, not the computed split keys (which, by the way, are only used to find schema boundaries, or, under coalesced ranges, the points where adjacent schemas have different span configs). So if the restarted store has 4 GiB ranges, post-restart we might try to slice them back down to 512 MiB ranges.
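To make the mismatch concrete, here is a toy stand-in for the split queue's size check (the helper is hypothetical; only the 512 MiB default and the 4 GiB example come from the discussion above):

// shouldSplitBySize stands in for the split queue's size-based check.
func shouldSplitBySize(rangeSize, rangeMaxBytes int64) bool {
	return rangeSize > rangeMaxBytes
}

// shouldSplitBySize(4<<30, 4<<30)    == false (subscribed: configured 4 GiB max)
// shouldSplitBySize(4<<30, 512<<20)  == true  (unsubscribed: static 512 MiB default)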

irfansharif (Author)

bors r+


craig bot commented Mar 13, 2023

Build failed (retrying...):


craig bot commented Mar 13, 2023

Build failed (retrying...):

arulajmani (Collaborator) left a comment

Nice find! I see this is already in the bors queue, so no need to stop the train for me, but please do address comments in a followup. Specifically, the one around expanding our testing.

Reviewed 19 of 19 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @kvoli)


pkg/kv/kvserver/mvcc_gc_queue.go line 686 at r2 (raw file):

) (processed bool, err error) {
	if !mgcq.enabled() {
		log.VEventf(ctx, 2, "skipping mvcc gc: queue has been disabled")

nit: Consider logging louder, but doing so under log.Every(10 seconds).
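A sketch of what this nit suggests, assuming CRDB's log.Every rate limiter (the message text and placement follow the quoted snippet; the Infof severity is an assumption):

var queueDisabledLogEvery = log.Every(10 * time.Second)

func (mgcq *mvccGCQueue) process(
	ctx context.Context, /* ... */
) (processed bool, err error) {
	if !mgcq.enabled() {
		// Log at default visibility, but at most once every 10s.
		if queueDisabledLogEvery.ShouldLog() {
			log.Infof(ctx, "skipping mvcc gc: queue has been disabled")
		}
		return false, nil
	}
	// ...
}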


pkg/kv/kvserver/store.go line 2066 at r2 (raw file):

}

// GetConfReader exposes access to a configuration reader.

Please add a sentence about returning errSpanConfigUnavailable up here as well.


pkg/kv/kvserver/store.go line 2090 at r2 (raw file):

		// surface explicitly that we don't have any span configs instead of
		// falling back to the statically configured one.
		// - enabling range merges would be extremely dangerous -- we could

Instead of having these words here, they're better placed around where you're setting needsSpanConfigs for the respective queues.


pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber_client_test.go line 32 at r2 (raw file):

// TestBlockedKVSubscriberDisablesMerges ensures that the merge queue is
// disabled until the KVSubscriber has some snapshot.
func TestBlockedKVSubscriberDisablesMerges(t *testing.T) {

Let's expand this test to cover the other queues we've identified in this commit as well.
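(This became #98807; see below.) A rough shape for the expanded test — the helper and the asserted trace message are hypothetical stand-ins for the real test plumbing:

// Run each gated queue against a store whose KVSubscriber is blocked
// and assert that it refuses to process replicas.
for _, q := range []string{"merge", "split", "replicate", "mvccGC"} {
	t.Run(q, func(t *testing.T) {
		// forceQueueAndCollectTrace is a hypothetical helper that manually
		// runs the named queue and returns its recorded trace.
		trace := forceQueueAndCollectTrace(t, store, q)
		require.Contains(t, trace, "span configs not available") // message assumed
	})
}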


craig bot commented Mar 13, 2023

Build succeeded:

craig bot merged commit c31c1ac into cockroachdb:master on Mar 13, 2023. 7 checks passed.
blathers-crl bot commented Mar 13, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 234bcd0 to blathers/backport-release-22.2-98422: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

irfansharif (Author) left a comment

please do address comments in a followup. Specifically, the one around expanding our testing.

See #98807.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 21, 2023
Regression test for cockroachdb#98422 which disables {split,replicate,mvccGC}
queues until subscribed to span configs. We're only slightly modifying
the existing merge-queue specific TestBlockedKVSubscriberDisablesMerges.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 21, 2023
…99123

98712: storage: expose separate WriteBatch interface r=sumeerbhola a=jbowens

Adapt NewUnindexedBatch to no longer take a writeOnly parameter. Instead, a new NewWriteBatch method is exposed that returns a WriteBatch type that does not provide Reader facilities. Future work may remove UnindexedBatch altogether, updating callers to explicitly maintain separate Readers and WriteBatches.
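Sketching the design choice with simplified, assumed interfaces (only NewWriteBatch, NewUnindexedBatch, and the writeOnly parameter come from the description above): omitting Reader from the batch's interface turns accidental reads into compile-time errors.

type Writer interface {
	Put(key, value []byte) error
}

type Reader interface {
	Get(key []byte) ([]byte, error)
}

// WriteBatch deliberately embeds only Writer, so it cannot be used as a
// Reader; callers that need both maintain a separate Reader alongside it.
type WriteBatch interface {
	Writer
	Commit(sync bool) error
	Close()
}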

Epic: None
Release note: None

98807: spanconfig: add TestBlockedKVSubscriberDisablesQueues r=irfansharif a=irfansharif

Regression test for #98422 which disables {split,replicate,mvccGC} queues until subscribed to span configs. We're only slightly modifying the existing merge-queue specific TestBlockedKVSubscriberDisablesMerges.

Release note: None

98823: sql: address a couple of old TODOs r=yuzefovich a=yuzefovich

This commit addresses a couple of old TODOs in the `connExecutor.dispatchToExecutionEngine` method.

First, it simply removes the now-stale TODO about needing to `Step()` the txn before executing the query. The TODO mentioned things about the old way FK checks were performed, but we now always run checks in a deferred fashion, and perform the `Step`s correctly there for cascades and checks, thus, the TODO can be removed. (The TODO itself explicitly mentions that it can be removed once we do checks and cascades in the deferred fashion.)

Second, it addresses a TODO about how some plan information is saved. Up until 961e66f the cleanup of `planNode` tree and `flow` infra was intertwined, so in 6ae4881 in order to "sample" the plan with correct info (some of which is only available _after_ the execution is done) we delegated that sampling to `planTop.close`. However, with `planNode` tree and `flow` cleanups refactored and de-coupled, we no longer have to close the plan early on. As a result, we now close the plan in a defer right after we create it (now it's the only place we do so), and this commit makes it so that we record the plan info explicitly right after returning from the execution engine, thus, addressing the second TODO.

Epic: None

Release note: None

98896: tracing: fix WithEventListeners option with verbose parent r=yuzefovich a=yuzefovich

Previously, when using `WithEventListeners` span option and not explicitly specifying `WithRecording` option, we could incorrectly use the structured recording. In particular, this was the case when the parent has verbose recording, and this is now fixed. At the moment, this span option is used only by the backup and restore processors, so the impact would be that their verbose recording wouldn't be included into the session recording (perhaps we don't actually expose that anyway).

Epic: None

Release note: None

99079: sql: fix hint for unknown udfs in views and funcs r=rharding6373 a=rharding6373

Fixes the hint for some instances of an "unknown function" error; previously the hint was not properly applied.

Informs: #99002
Epic: None

Release note: None

99095: ui: fix inflight request replacement, cleanup database details api r=THardy98 a=THardy98

The changes allowing in-flight request replacement in #98331 caused breaking changes to `KeyedCachedDataReducer`s. Despite the added `allowReplacementRequests` flag being `false`, in-flight requests would be replaced (specifically, when fetching initial state). This would prevent all requests but the last one from being stored in the reducer state and leave the remaining requests in a permanent `inFlight` status. Consequently, our pages would break, as the permanent `inFlight` status on these requests would prevent us from ever fetching data, putting these pages in a permanent loading state.

This PR adds checks to ensure `allowReplacementRequests` is enabled when we receive a response, before deciding whether to omit it from the reducer state.

It's important to note that this is still not an ideal solution for `KeyedCachedDataReducer`s, should we ever want to replace in-flight requests for one. The checks to replace an in-flight request still occur at the reducer level, whereas for a `KeyedCachedDataReducer` we'd like them to occur at the state level (i.e. instead of replacement checks for all requests across the reducer, we should check for requests that impact a specific key's state). This way we scope replacements to each key in the reducer, allowing us to fetch data for each key independently (which is the intended functionality). I believe enabling this for a `KeyedCachedDataReducer` without making that change will cause the breaking behaviour above.

This PR also includes some cleanup to the database details API, namely:
- camel casing response fields
- encapsulating the database span stats response into its own object (instead of across multiple objects), this helps scope any errors deriving from the query to the single stats object

**BEFORE**
https://www.loom.com/share/57c9f278376a4551977398316d6ed7b6

**AFTER**
https://www.loom.com/share/b2d4e074afca4b42a4f711171de3fd72

Release note (bug fix): Fixed replacement of in-flight requests for `KeyedCachedDataReducer`s to prevent permanent loading on requests stuck on an `inFlight` status.

99097: upgrades: fix backfilling of special roles in user ID migrations r=rafiss a=andyyang890

upgrades: fix system.privileges user ID migration

This patch adds a missing case for the public role in the backfilling
portion of the system.privileges user ID migration.

Release note: None

---

upgrades: fix system.database_role_settings user ID migration

This patch adds a missing case for the empty role in the backfilling
portion of the system.database_role_settings user ID migration.

Release note: None

---

Part of #87079

99113: server: use no-op cost controller for shared-process tenants r=knz a=stevendanna

In the long run, all rate limiting of tenants should be controlled by
a tenant capability.

However, at the moment, we do not have the infrastructure to read
tenant capabilities from the tenant process.

Since the main user of shared-process tenants is the new, experimental
unified architecture, where we do not anticipate most users needing the
cost controller, this PR injects a no-op cost controller for
shared-process tenants.
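The injection pattern, as a hedged sketch (the interface shape is assumed, not the actual multitenant cost-control API):

// costController is an assumed stand-in for the tenant cost-control
// interface consumed at request time.
type costController interface {
	OnRequest(ctx context.Context, cost float64) error
}

// noopCostController satisfies the interface but never throttles --
// which is what shared-process tenants are wired up with here.
type noopCostController struct{}

func (noopCostController) OnRequest(context.Context, float64) error {
	return nil
}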

Epic: CRDB-23559

Release note: None

99120: tracing: allow collection of stack history for Structured spans as well r=dt a=adityamaru

Previously, we mandated that the span must be Verbose to be able to collect and emit a Structured event containing stack history. This change permits Structured spans to also collect stack history.

Release note: None
Epic: none

99123: sql: skip flaky TestSQLStatsDataDriven r=ericharmeling a=ericharmeling

This commit skips TestSQLStatsDataDriven. This test is flaky.

Part of #89861.

Epic: none

Release note: None

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Co-authored-by: Thomas Hardy <thardy@cockroachlabs.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Labels
backport-22.2.x Flags PRs that need to be backported to 22.2.
Development

Successfully merging this pull request may close these issues.

  • kvserver: default span config is used temporarily on startup (#98421)
  • roachtest: replicate/wide failed (#98385)