roachtest: failover/chaos/read-only failed #126927

Closed
cockroach-teamcity opened this issue Jul 10, 2024 · 3 comments
Labels
branch-release-24.1.2-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team X-stale

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 10, 2024

roachtest.failover/chaos/read-only failed with artifacts on release-24.1.2-rc @ 7e81be6de75205c3d08b0d8dcc6ca188306abc27:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1466
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:344
	            				main/pkg/cmd/roachtest/monitor.go:120
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight load-value:authinfo-roachprod-2-2: context deadline exceeded
	Test:       	failover/chaos/read-only
(require.go:1357).NoError: FailNow called
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/kv-triage


Jira issue: CRDB-40187

@cockroach-teamcity cockroach-teamcity added branch-release-24.1.2-rc C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jul 10, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Jul 10, 2024
@andrewbaptist andrewbaptist added P-3 Issues/test failures with no fix SLA C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 15, 2024
@andrewbaptist
Contributor

I'm removing the release blocker. This appears to be a case where the test injects two back-to-back failures, a combination that isn't supported with epoch leases. Specifically:

09:27:00 failover.go:293: chaos iteration 5
09:28:12 failover.go:343: failing n8 (blackhole-recv)
09:28:12 failover.go:343: failing n9 (deadlock)

The deadlock doesn't go through because the blackhole has left the cluster in a bad state. With epoch leases, the cluster doesn't always regain availability after a blackhole, so the attempt to induce the deadlock fails.

This is a test problem where we need to disallow this combination.

Assigning myself and setting P3: this isn't a real issue, but it's also hard to fix without either crippling the test for epoch leases (allowing only a single failure) or manually figuring out which combinations can't be run together in a metamorphic-like test.

@arulajmani
Collaborator

> manually figuring out the combinations that can't be done together

This sounds promising. It seems this issue has come up before; @andrewbaptist do you think it would help to at least list the incompatible combinations? Even if we don't address the issue by automatically selecting from only the compatible operations, having a list would make for quick triage.

@arulajmani arulajmani added P-2 Issues/test failures with a fix SLA of 3 months and removed P-3 Issues/test failures with no fix SLA labels Jul 31, 2024
@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024

github-actions bot commented Sep 2, 2024

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
