roachtest: cluster monitor hanging after unexpected node closure #118563

Closed
cockroach-teamcity opened this issue Feb 1, 2024 · 6 comments · Fixed by #119535
Labels
  • branch-master: Failures on the master branch.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-1: Issues/test failures with a fix SLA of 1 month
  • T-testeng: TestEng Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 1, 2024

roachtest.import/tpch/nodes=8 failed with artifacts on master @ ed3a25e3c9459cede2f80babbfc9d44a836b6c12:

VMs preempted during the test run: projects/cockroach-ephemeral/zones/us-east1-b/instances/teamcity-13762710-1706682693-107-n8cpu4-0005

**Other Failure**
(test_runner.go:1136).runTest: test timed out (10h0m0s)
test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-35788

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team labels Feb 1, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Feb 1, 2024
@DrewKimball
Collaborator

The import seems to have succeeded in ~2 hours, while node 5 died about 1.5 hours in, probably due to VM preemption. I wonder if the test runner isn't able to handle dead nodes, and hangs instead?

@DrewKimball DrewKimball removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 1, 2024
@DrewKimball
Collaborator

The test uses a roachtest/cluster.Monitor to watch for unexpected node deaths, but maybe something about the nature of this node death (ssh failures due to VM preemption) prevents that from working correctly. I see this bit in the stack trace repeated 7 times (indicating that one node is already finished):

goroutine 7338029 [IO wait, 599 minutes]:
internal/poll.runtime_pollWait(0x7f9740938e98, 0x72)
	GOROOT/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc00478c960?, 0xc0011d3000?, 0x1)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00478c960, {0xc0011d3000, 0x1000, 0x1000})
	GOROOT/src/internal/poll/fd_unix.go:164 +0x27a
os.(*File).read(...)
	GOROOT/src/os/file_posix.go:29
os.(*File).Read(0xc003cbaeb8, {0xc0011d3000?, 0x2?, 0xc004af0db0?})
	GOROOT/src/os/file.go:118 +0x52
bufio.(*Reader).fill(0xc004af0f50)
	GOROOT/src/bufio/bufio.go:113 +0x103
bufio.(*Reader).ReadSlice(0xc004af0f50, 0x1?)
	GOROOT/src/bufio/bufio.go:379 +0x29
bufio.(*Reader).ReadLine(0xc004af0f50)
	GOROOT/src/bufio/bufio.go:408 +0x25
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2.3({0x85bfb40?, 0xc003cbaeb8?})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:956 +0x1e5
created by github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2 in goroutine 7337904
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:952 +0xea5

There are another 7 of these:

goroutine 7337904 [semacquire, 599 minutes]:
sync.runtime_Semacquire(0xc0053640c0?)
	GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0042fc220?)
	GOROOT/src/sync/waitgroup.go:116 +0x48
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2(0x0)
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1021 +0xf59
created by github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor in goroutine 7337903
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:759 +0x19b

...but 8 of these:

goroutine 7337979 [chan receive, 599 minutes]:
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2.4()
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1017 +0x35
created by github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2 in goroutine 7338018
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1016 +0xf4c

I'm suspicious of this line:

if err == io.EOF {
	return
}

which never invokes sendEvent to inform the caller that a node has closed.

@DrewKimball DrewKimball added the P-3 Issues/test failures with no fix SLA label Feb 1, 2024
@DrewKimball DrewKimball changed the title roachtest: import/tpch/nodes=8 failed roachtest: cluster monitor hanging after unexpected node closure Feb 1, 2024
@DrewKimball
Collaborator

cc @cockroach-dev-inf does this look like something your team might address?

@jlinder
Collaborator

jlinder commented Feb 7, 2024

This looks more like something @cockroachdb/test-eng would address as they own roachprod + roachtest.

@renatolabs
Collaborator

Ack, thanks for redirecting, we'll look into it.

Also, it's pretty weird that this issue wasn't filed under #118528 with the other VM preemption failures. Might have something to do with the test timeout.

@renatolabs renatolabs added the T-testeng TestEng Team label Feb 7, 2024

blathers-crl bot commented Feb 7, 2024

cc @cockroachdb/test-eng

@blathers-crl blathers-crl bot added this to Triage in Test Engineering Feb 7, 2024
@renatolabs renatolabs removed the T-sql-queries SQL Queries Team label Feb 7, 2024
@renatolabs renatolabs added P-2 Issues/test failures with a fix SLA of 3 months P-1 Issues/test failures with a fix SLA of 1 month and removed P-3 Issues/test failures with no fix SLA P-2 Issues/test failures with a fix SLA of 3 months labels Feb 20, 2024
renatolabs added a commit to renatolabs/cockroach that referenced this issue Feb 22, 2024
This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream sees
an EOF when the associated context is *not* canceled); and 2) return
an error when the abrupt termination event is received.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources and leading to confusing test failures
being reported on GitHub.

Fixes: cockroachdb#118563

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this issue Feb 22, 2024
This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream
sees an EOF when the associated context is *not* canceled); and 2)
return any errors encountered by the roachprod monitor in roachtest,
causing the currently running test to fail. The error has TestEng
ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources and leading to confusing test failures
being reported on GitHub.

Fixes: cockroachdb#118563.

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this issue Feb 23, 2024
craig bot pushed a commit that referenced this issue Feb 28, 2024
119535: roachtest: fail tests if monitor encounters an error r=srosenberg a=renatolabs

This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream
sees an EOF when the associated context is *not* canceled); and 2)
return any errors encountered by the roachprod monitor in roachtest,
causing the currently running test to fail. The error has TestEng
ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources and leading to confusing test failures
being reported on GitHub.

Fixes: #118563.

Release note: None

119725: changefeedccl: disable TestChangefeedOnlyInitialScanCSV with pulsar r=rharding6373 a=jayshrivastava

Informs: #119289
Release note: None
Epic: None

119732: master: Update pkg/testutils/release/cockroach_releases.yaml r=rail a=github-actions[bot]

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None
Release note: None
Release justification: test-only updates

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: CRL Release bot <teamcity@cockroachlabs.com>
@craig craig bot closed this as completed in 29892b5 Feb 28, 2024
Test Engineering automation moved this from Triage to Done Feb 28, 2024

4 participants