roachtest: cluster monitor hanging after unexpected node closure #118563
Comments
The import seems to have succeeded in ~2 hours, while node 5 died about 1.5 hours in, probably due to VM preemption. I wonder if the test runner isn't able to handle dead nodes, and hangs instead?
The test uses a
There are another 7 of these:
...but 8 of these:
I'm suspicious of this line: cockroach/pkg/roachprod/install/cluster_synced.go, lines 957 to 959 in cc4fdff, which never invokes sendEvent to inform the caller that a node has closed.
cc @cockroach-dev-inf does this look like something your team might address?
This looks more like something @cockroachdb/test-eng would address as they own roachprod + roachtest.
Ack, thanks for redirecting, we'll look into it. Also, it's pretty weird that this issue wasn't filed under #118528 with the rest of the VM preemption failures. Might have something to do with the test timeout.
cc @cockroachdb/test-eng
This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., the reader stream sees an EOF when the associated context is *not* canceled); and 2) return an error when the abrupt termination event is received.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources and leading to confusing test failures being reported on GitHub.

Fixes: cockroachdb#118563
Release note: None
This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., the reader stream sees an EOF when the associated context is *not* canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources and leading to confusing test failures being reported on GitHub.

Fixes: cockroachdb#118563.
Release note: None
119535: roachtest: fail tests if monitor encounters an error r=srosenberg a=renatolabs
This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., the reader stream sees an EOF when the associated context is *not* canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.
The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources and leading to confusing test failures being reported on GitHub.
Fixes: #118563.
Release note: None

119725: changefeedccl: disable TestChangefeedOnlyInitialScanCSV with pulsar r=rharding6373 a=jayshrivastava
Informs: #119289
Release note: None
Epic: None

119732: master: Update pkg/testutils/release/cockroach_releases.yaml r=rail a=github-actions[bot]
Update pkg/testutils/release/cockroach_releases.yaml with recent values.
Epic: None
Release note: None
Release justification: test-only updates

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: CRL Release bot <teamcity@cockroachlabs.com>
roachtest.import/tpch/nodes=8 failed with artifacts on master @ ed3a25e3c9459cede2f80babbfc9d44a836b6c12:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-35788