Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: TestUpgradeHappensAfterMigrations failed [unexpected stopped rangefeed processor] #76649

Closed
cockroach-teamcity opened this issue Feb 16, 2022 · 7 comments · Fixed by #76825
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

server.TestUpgradeHappensAfterMigrations failed with artifacts on master @ 512320f0d3c6230edcbe8ca6eab33ed9183bfbfb:

=== RUN   TestUpgradeHappensAfterMigrations
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

/cc @cockroachdb/server

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 16, 2022
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Feb 16, 2022
@cameronnunez
Copy link
Contributor

panic: unexpected Stopped processor

goroutine 369588 [running]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).registerWithRangefeedRaftMuLocked(0xc001bdaa80, {0x50c3e38, 0xc0081b8810}, {{0xc007afc078, 0x1, 0x8}, {0xc007afc080, 0x1, 0x8}}, {0x16d4384b954857e6, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:417 +0x758
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).rangeFeedWithRangeID(0xc001bdaa80, 0x2c, 0xc007ee37a0, {0x5120af0, 0xc00d3eaea0})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:240 +0x6a5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).RangeFeed(...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:148
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).RangeFeed(0xc0050a1600, 0xc007ee37a0, {0x5120af0, 0xc00d3eaea0})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3036 +0xdc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).RangeFeed(0xc0056bb450, 0xc007ee37a0, {0x5120af0, 0xc00d3eaea0})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:216 +0xe8
github.com/cockroachdb/cockroach/pkg/server.(*Node).RangeFeed(0x5120af0, 0xc0021a3b00, {0x5120af0, 0xc00d3eaea0})
	github.com/cockroachdb/cockroach/pkg/server/node.go:1194 +0x38
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.RangeFeed.func1()
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:702 +0xeb
created by github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.RangeFeed
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:699 +0x1cf

@cameronnunez
Copy link
Contributor

cc @cockroachdb/kv-prs for triage

@cameronnunez cameronnunez removed this from To do in DB Server & Security Feb 16, 2022
@cameronnunez cameronnunez added T-kv KV Team and removed T-server-and-security DB Server & Security labels Feb 16, 2022
@blathers-crl blathers-crl bot added this to Incoming in KV Feb 16, 2022
@tbg tbg changed the title server: TestUpgradeHappensAfterMigrations failed server: TestUpgradeHappensAfterMigrations failed [unexpected stopped rangefeed processor] Feb 17, 2022
@tbg tbg added this to Incoming in Replication via automation Feb 17, 2022
@tbg tbg removed this from Incoming in KV Feb 17, 2022
@irfansharif irfansharif self-assigned this Feb 18, 2022
@rafiss
Copy link
Collaborator

rafiss commented Feb 18, 2022

Also happened in #76724 and #76655

@rafiss
Copy link
Collaborator

rafiss commented Feb 18, 2022

I was able to get pretty good repros with

make stress PKG=./pkg/sql/pgwire TESTS="TestPGTest/pgjdbc" TESTTIMEOUT=15m0s STRESSFLAGS="-stderr -maxfails 1 -maxtime 20m0s -p 16"

Usually would fail pretty quick, but sometimes it could take 3-5 minutes to hit the bug.

@erikgrinaker
Copy link
Contributor

This was bisected to #76445 in #76655 (comment).

@erikgrinaker
Copy link
Contributor

Reverting f048ab0 makes this much less likely to trigger. I suspect because of this client-side change:

f048ab0#diff-063a1c1eae06f7b653f7f19bdb2f948df949b0039a7d6bcfb6194e53829294e0R283-R290

The old code would check the stopper before sending the RangeFeed request, while the new code relies on context cancellation instead.

However, even after reverting that change, I still got panic: unexpected Stopped processor after about 50k runs of TestPGTest/pgjdbc, so I think that only masks the underlying problem by making it less likely to trigger when the client and the server derive from the same stopper.

@erikgrinaker
Copy link
Contributor

I think I found the problem.stoppedC gets closed here:

Because RunAsyncTask returns NodeUnavailableError, most likely from here:

if !s.runPrelude() {
return ErrUnavailable
}

The condition there is:

return s.mu.stopping || s.mu.quiescing

s.mu.quiescing is only set when s.quiescer is also closed:

if !s.mu.quiescing {
s.mu.quiescing = true
close(s.quiescer)
}

If this happens, then it gets caught by the ShouldQuiesce() case:

case <-r.store.Stopper().ShouldQuiesce():
errC <- roachpb.NewError(&roachpb.NodeUnavailableError{})
return nil

However, s.mu.stopping gets set in a separate critical section from s.mu.quiescing:

s.mu.Lock()
stopCalled := s.mu.stopping
s.mu.stopping = true
s.mu.Unlock()

The following sequence of events is therefore possible:

  • T1: s.mu.stopping = true
  • T2: Processor.Start()RunAsyncTask()s.mu.stopping == trueNodeUnavailableErrorclose(stoppedC)
  • T2: Processor.Register()stoppedCfalse
  • T2: ShouldQuiesce()s.mu.quiescer not closed → panic
  • T1: s.mu.quiescing = true, close(s.mu.quiescer)

I'll make Processor.Start() return an error, and propagate that back to the caller, which I think should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants