Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: avoid clobbering replica conf #75233

Conversation

irfansharif
Copy link
Contributor

Fixes #75109. There are two ways to read the configuration applied over
a given replica:
(a) the copy embedded in each replica struct
(b) spanconfig.StoreReader, hanging off the store struct

The interface in (b) is implemented by both the span configs
infrastructure and the legacy system config span it's designed to
replace; it's typically used by KV queues (see #69172). When switching
between subsystems in KV (controlled by spanconfig.store.enabled), for
we transparently switch the source for (b). We also use then use the
right source to refresh (a) for every replica. Incremental updates
thereafter mutate (a) directly. As you'd expect, we need to take special
care that only one subsystem is writing to (a) at a given point in time,
a guarantee that was not upheld before this commit. This bug manifested
after we enabled span configs by default (see #73876), and likely
affected many tests.

To avoid the system config span from clobbering (a) when span configs
are enabled, we just need to check what spanconfig.store.enabled
holds at the right spot. To repro:

# Reliably fails with 1-2m on my GCE worker before this patch,
# doesn't after.
dev test pkg/kv/kvserver \
  -f TestBackpressureNotAppliedWhenReducingRangeSize \
  -v --show-logs --timeout 15m --stress

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

PS: Looking at the CI failures here, merging this bug fix will probably surface more CI failures/improper test assumptions that were masked until now. Take a look at the following for e.g., definitely does not hold true anymore.

// Disable the replication queues so that the range we're about to create
// doesn't get down-replicated too soon.
tc.ToggleReplicateQueues(false)
testKey := tc.ScratchRange(t)
desc := tc.LookupRangeOrFatal(t, testKey)
// At the end of StartTestCluster(), all ranges have 5 replicas since they're
// all "system ranges". When the ScratchRange() splits its range, it also
// starts up with 5 replicas. Since it's not a system range, its default zone
// config asks for 3x replication, and the replication queue will
// down-replicate it.
require.Len(t, desc.Replicas().Descriptors(), 5)
// Re-enable the replication queue.
tc.ToggleReplicateQueues(true)

@irfansharif irfansharif force-pushed the 220120.deflake-TestBackpressureNotAppliedWhenReducingRangeSize branch from 1a702c7 to 61e1bc8 Compare January 20, 2022 21:49
@irfansharif
Copy link
Contributor Author

Outstanding CI failure is #75212.

Fixes cockroachdb#75109. There are two ways to read the configuration applied over
a given replica:
  (a) the copy embedded in each replica struct
  (b) spanconfig.StoreReader, hanging off the store struct

The interface in (b) is implemented by both the span configs
infrastructure and the legacy system config span it's designed to
replace; it's typically used by KV queues (see cockroachdb#69172). When switching
between subsystems in KV (controlled by spanconfig.store.enabled), for
we transparently switch the source for (b). We also use then use the
right source to refresh (a) for every replica. Incremental updates
thereafter mutate (a) directly. As you'd expect, we need to take special
care that only one subsystem is writing to (a) at a given point in time,
a guarantee that was not upheld before this commit. This bug manifested
after we enabled span configs by default (see cockroachdb#73876), and likely
affected many tests.

To avoid the system config span from clobbering (a) when span configs
are enabled, we just need to check what spanconfig.store.enabled
holds at the right spot. To repro:

    # Reliably fails with 1-2m on my GCE worker before this patch,
    # doesn't after.
    dev test pkg/kv/kvserver \
      -f TestBackpressureNotAppliedWhenReducingRangeSize \
      -v --show-logs --timeout 15m --stress

Release note: None
@irfansharif irfansharif force-pushed the 220120.deflake-TestBackpressureNotAppliedWhenReducingRangeSize branch from 09b8f9f to 2e2894b Compare January 21, 2022 20:53
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple questions for the second commit but otherwise everything else looks good.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


-- commits, line 38 at r2:
It's worth adding more words around what those assumptions were here.


pkg/kv/kvserver/testing_knobs.go, line 378 at r4 (raw file):

	// is used only for (old) tests written with the system config span in mind.
	//
	// TODO(irfansharif): Get rid of this knob, maybe by first move

"maybe by first move"


pkg/kv/kvserver/client_store_test.go, line 30 at r3 (raw file):

// TestStoreSetRangesMaxBytes creates a set of ranges via splitting and then
// sets the config zone to a custom max bytes value to verify the ranges' max
// bytes are updated appropriately. // XXX:

// XXX: ?


pkg/kv/kvserver/replicate_queue_test.go, line 275 at r2 (raw file):

	// config asks for 3x replication, and the replication queue will
	// down-replicate it.
	require.Len(t, desc.Replicas().Descriptors(), 5)

Do we know why this worked before and doesn't work now? Is that something we want to address or are we happy changing this test?

This test made assumptions about where the scratch range was carved out
(the system table ranges) from and what config was originally applied to
it (database system's zone config), assumptions that no longer hold with
the span configs infrastructure. This commit partially rewrites the test
to be compatible with span configs (cockroachdb#73876), though unfortunately making
it a bit slower (~4.5s to ~8.5s). The failure mode was masked until now
due to the bug fixed in the preceding commit.

Release note: None
It's a pretty old test and did not age well in lieu of the span configs
infrastructure. We re-write it such that it's no longer so intimately
coupled to the gossiped system config span internals.

Release note: None
@irfansharif irfansharif force-pushed the 220120.deflake-TestBackpressureNotAppliedWhenReducingRangeSize branch from 2e2894b to dedcec7 Compare January 24, 2022 14:02
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits, line 38 at r2:

Previously, arulajmani (Arul Ajmani) wrote…

It's worth adding more words around what those assumptions were here.

Done.


pkg/kv/kvserver/testing_knobs.go, line 378 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

"maybe by first move"

Done.


pkg/kv/kvserver/client_store_test.go, line 30 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

// XXX: ?

Woops, forgot to remove. Done.


pkg/kv/kvserver/replicate_queue_test.go, line 275 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do we know why this worked before and doesn't work now? Is that something we want to address or are we happy changing this test?

Yes, added to the commit message. Basically this test relied on the the internal behavior of the system config span where scratch range was split off from the system table ranges, and it "originally" applying the system database zone/span config. With "ConfigureScratchRange", we don't have any such lineage.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r5, 3 of 3 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

@irfansharif
Copy link
Contributor Author

irfansharif commented Jan 24, 2022

CI failures are #75227 (unsure if span configs fallout fixed by #75300) and #75080 (a bit surprising, this PR is rebased on top of #75146; will have a look soon).

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 24, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 24, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

Build succeeded:

@craig craig bot merged commit c956f4c into cockroachdb:master Jan 25, 2022
@irfansharif irfansharif deleted the 220120.deflake-TestBackpressureNotAppliedWhenReducingRangeSize branch January 25, 2022 07:46
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 27, 2022
This test relies on a second node's span config rangefeed to have caught
up sufficiently with the global state such that a snapshot, when
applied, would construct a replica with the config this test attempts to
declare over the originating replica. Though very unlikely to occur
(repro-ed only once in hours GCE worker runs), it's possible. This patch
is a speculative fix for a failure observed only with manual stressing
on a SHA that included cockroachdb#75233.

(Also make some prominent span config logging a bit more consistent with
one another.)

Release note: None
craig bot pushed a commit that referenced this pull request Jan 27, 2022
75429: parser: implement ALTER TABLE ... RESET (...)  r=rafiss a=otan

Only last two commits for review (rest is #75262).
See individual commits for details.

75464: *: use bootstrap.BootstrappedSystemIDChecker in tests r=postamar a=postamar

Instead of relying on hard-coded constants, tests now rely on the
bootstrapped system schema to determine which IDs belong to system
tables or not.

Release note: None

75561: sql/tests: skip st_buffer in randomized test r=otan a=rafiss

fixes #74078

Release note: None

75598: kvserver: deflake TestBackpressureNotAppliedWhenReducingRangeSize r=irfansharif a=irfansharif

This test relies on a second node's span config rangefeed to have caught
up sufficiently with the global state such that a snapshot, when
applied, would construct a replica with the config this test attempts to
declare over the originating replica. Though very unlikely to occur
(repro-ed only once in hours GCE worker runs), it's possible. This patch
is a speculative fix for a failure observed only with manual stressing
on a SHA that included #75233.

(Also make some prominent span config logging a bit more consistent with
one another.)

Release note: None

Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestBackpressureNotAppliedWhenReducingRangeSize failed
3 participants