roachtest: cluster monitor hanging after unexpected node closure #118563

cockroach-teamcity · 2024-02-01T03:08:18Z

roachtest.import/tpch/nodes=8 failed with artifacts on master @ ed3a25e3c9459cede2f80babbfc9d44a836b6c12:

VMs preempted during the test run : projects/cockroach-ephemeral/zones/us-east1-b/instances/teamcity-13762710-1706682693-107-n8cpu4-0005

**Other Failure**
(test_runner.go:1136).runTest: test timed out (10h0m0s)
test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1

Parameters:

ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-queries

Jira issue: CRDB-35788

DrewKimball · 2024-02-01T06:15:24Z

The import seems to have succeeded in ~2 hours, while node 5 died about 1.5 hours in, probably due to VM preemption. I wonder if the test runner isn't able to handle dead nodes, and hangs instead?

DrewKimball · 2024-02-01T06:43:56Z

The test uses a roachtest/cluster.Monitor to watch for unexpected node deaths, but maybe something about the nature of this node death (ssh failures due to VM preemption) prevents that from working correctly. I see this bit in the stack trace repeated 7 times (indicating that one node is already finished):

goroutine 7338029 [IO wait, 599 minutes]:
internal/poll.runtime_pollWait(0x7f9740938e98, 0x72)
	GOROOT/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc00478c960?, 0xc0011d3000?, 0x1)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00478c960, {0xc0011d3000, 0x1000, 0x1000})
	GOROOT/src/internal/poll/fd_unix.go:164 +0x27a
os.(*File).read(...)
	GOROOT/src/os/file_posix.go:29
os.(*File).Read(0xc003cbaeb8, {0xc0011d3000?, 0x2?, 0xc004af0db0?})
	GOROOT/src/os/file.go:118 +0x52
bufio.(*Reader).fill(0xc004af0f50)
	GOROOT/src/bufio/bufio.go:113 +0x103
bufio.(*Reader).ReadSlice(0xc004af0f50, 0x1?)
	GOROOT/src/bufio/bufio.go:379 +0x29
bufio.(*Reader).ReadLine(0xc004af0f50)
	GOROOT/src/bufio/bufio.go:408 +0x25
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2.3({0x85bfb40?, 0xc003cbaeb8?})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:956 +0x1e5
created by github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2 in goroutine 7337904
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:952 +0xea5

There are another 7 of these:

goroutine 7337904 [semacquire, 599 minutes]:
sync.runtime_Semacquire(0xc0053640c0?)
	GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0042fc220?)
	GOROOT/src/sync/waitgroup.go:116 +0x48
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2(0x0)
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1021 +0xf59
created by github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor in goroutine 7337903
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:759 +0x19b

...but 8 of these:

goroutine 7337979 [chan receive, 599 minutes]:
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2.4()
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1017 +0x35
created by github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Monitor.func2 in goroutine 7338018
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1016 +0xf4c

I'm suspicious of this line:

cockroach/pkg/roachprod/install/cluster_synced.go

Lines 957 to 959 in cc4fdff

if err == io.EOF {
	return
}

which never invokes sendEvent to inform the caller that a node has closed.

DrewKimball · 2024-02-06T20:01:14Z

cc @cockroach-dev-inf does this look like something your team might address?

jlinder · 2024-02-07T19:43:11Z

This looks more like something @cockroachdb/test-eng would address as they own roachprod + roachtest.

renatolabs · 2024-02-07T20:25:09Z

Ack, thanks for redirecting, we'll look into it.

Also, it's odd that this issue wasn't filed under #118528 along with the other VM preemption failures. That might have something to do with the test timeout.

blathers-crl · 2024-02-07T20:25:27Z

cc @cockroachdb/test-eng

This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., reader stream sees an EOF when the associated context is *not* canceled); and 2) return an error when the abrupt termination event is received.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources and leading to confusing test failures being reported on GitHub.

Fixes: cockroachdb#118563

Release note: None

This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., reader stream sees an EOF when the associated context is *not* canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources and leading to confusing test failures being reported on GitHub.

Fixes: cockroachdb#118563.

Release note: None

119535: roachtest: fail tests if monitor encounters an error r=srosenberg a=renatolabs

This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., reader stream sees an EOF when the associated context is *not* canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources and leading to confusing test failures being reported on GitHub.

Fixes: #118563.

Release note: None

119725: changefeedccl: disable TestChangefeedOnlyInitialScanCSV with pulsar r=rharding6373 a=jayshrivastava

Informs: #119289

Release note: None

Epic: None

119732: master: Update pkg/testutils/release/cockroach_releases.yaml r=rail a=github-actions[bot]

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None

Release note: None

Release justification: test-only updates

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: CRL Release bot <teamcity@cockroachlabs.com>

cockroach-teamcity added this to the 24.1 milestone Feb 1, 2024

DrewKimball removed the release-blocker (Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.) label Feb 1, 2024

DrewKimball added the P-3 (Issues/test failures with no fix SLA) label Feb 1, 2024

DrewKimball changed the title ~~roachtest: import/tpch/nodes=8 failed~~ roachtest: cluster monitor hanging after unexpected node closure Feb 1, 2024

DrewKimball mentioned this issue Feb 4, 2024

roachtest: import/tpcc/warehouses=1000/nodes=32 failed #118739

Closed

renatolabs added the T-testeng (TestEng Team) label Feb 7, 2024

blathers-crl bot added this to Triage in Test Engineering Feb 7, 2024

renatolabs removed the T-sql-queries (SQL Queries Team) label Feb 7, 2024

renatolabs added the P-2 (Issues/test failures with a fix SLA of 3 months) and P-1 (Issues/test failures with a fix SLA of 1 month) labels and removed the P-3 (Issues/test failures with no fix SLA) and P-2 (Issues/test failures with a fix SLA of 3 months) labels Feb 20, 2024

renatolabs mentioned this issue Feb 20, 2024

roachtest: import/tpch/nodes=8 failed [timeout after successful run] #119281

Closed

yuzefovich mentioned this issue Feb 22, 2024

roachtest: import/tpch/nodes=8 failed #119498

Closed

renatolabs mentioned this issue Feb 22, 2024

roachtest: fail tests if monitor encounters an error #119535

Merged

renatolabs mentioned this issue Feb 28, 2024

roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #119717

Closed

craig bot closed this as completed in 29892b5 Feb 28, 2024

Test Engineering automation moved this from Triage to Done Feb 28, 2024

DrewKimball mentioned this issue Mar 6, 2024

roachtest: tpch_concurrency/no_streamer failed #119983

Closed
