
roachtest: fail tests if monitor encounters an error #119535

Merged

Conversation

@renatolabs renatolabs (Collaborator) commented Feb 22, 2024

This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream
sees an EOF when the associated context is not canceled); and 2)
return any errors encountered by the roachprod monitor in roachtest,
causing the currently running test to fail. The error has TestEng
ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources, and resulting in confusing test failures
being reported on GitHub.

Fixes: #118563.

Release note: None
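
For readers unfamiliar with the monitor internals, here is a minimal Go sketch of the first half of this change: emitting an error event when the reader stream hits EOF while the context is still live. All names (`MonitorEvent`, `monitorNode`) are hypothetical and simplified; the real roachprod monitor API differs.

```go
package main

import (
	"bufio"
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
)

// MonitorEvent is a hypothetical stand-in for the events the roachprod
// monitor sends on its channel; the real types live elsewhere.
type MonitorEvent struct {
	Node int
	Msg  string
	Err  error
}

// monitorNode reads status lines from r (standing in for a node's event
// stream) and forwards them as events. If the stream hits EOF while ctx is
// still active, that is treated as an abrupt termination and reported as an
// error event instead of being silently dropped.
func monitorNode(ctx context.Context, node int, r io.Reader, events chan<- MonitorEvent) {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		events <- MonitorEvent{Node: node, Msg: scanner.Text()}
	}
	// Stream ended: distinguish an expected shutdown (context canceled) from
	// an unexpected one (e.g., the VM was preempted by the cloud provider).
	if ctx.Err() == nil {
		events <- MonitorEvent{
			Node: node,
			Err:  errors.New("monitor stream closed unexpectedly (possible VM preemption)"),
		}
	}
}

func main() {
	ctx := context.Background()
	events := make(chan MonitorEvent, 8)
	// Simulate a stream that ends while the context is still live.
	go func() {
		monitorNode(ctx, 1, strings.NewReader("cockroach: alive\n"), events)
		close(events)
	}()
	for ev := range events {
		if ev.Err != nil {
			fmt.Printf("n%d: monitor error: %v\n", ev.Node, ev.Err)
			continue
		}
		fmt.Printf("n%d: %s\n", ev.Node, ev.Msg)
	}
}
```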

@cockroach-teamcity (Member)

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from 10c3894 to 566e549 on February 22, 2024 20:42
@renatolabs renatolabs changed the title roachtest: fail if monitor dies unexpectedly roachtest: fail tests if monitor encounters an error Feb 22, 2024
@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from 566e549 to e1fa59a on February 22, 2024 20:45
@renatolabs renatolabs marked this pull request as ready for review February 22, 2024 20:45
@renatolabs renatolabs requested a review from a team as a code owner February 22, 2024 20:45
@renatolabs renatolabs requested review from herkolategan and DarrylWong and removed request for a team February 22, 2024 20:45
@renatolabs (Collaborator, Author)

Verified that the logic here works by simulating a VM preemption (i.e., running a roachtest and manually deleting one of the VMs on GCE). 0.1 build is also currently in progress.

Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.

Let me know!
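
To make the second half of the change concrete (monitor errors failing the running test with TestEng ownership), a self-contained sketch follows; `ownedError` and `failTestOnMonitorError` are invented for illustration and do not mirror roachtest's actual ownership plumbing.

```go
package main

import (
	"errors"
	"fmt"
)

// ownedError tags an error with an owning team, loosely mirroring the idea
// that monitor failures should be routed to Test Eng rather than to the team
// that owns the test. Illustrative only.
type ownedError struct {
	owner string
	err   error
}

func (e ownedError) Error() string { return fmt.Sprintf("[%s] %v", e.owner, e.err) }
func (e ownedError) Unwrap() error { return e.err }

// failTestOnMonitorError captures the shape of the chosen approach: rather
// than polling the cloud provider for preempted VMs, any error surfaced by
// the monitor immediately fails the running test, attributed to Test Eng.
func failTestOnMonitorError(monitorErr error, failf func(error)) {
	if monitorErr == nil {
		return // nothing to report; the test keeps running
	}
	failf(ownedError{owner: "test-eng", err: monitorErr})
}

func main() {
	fail := func(err error) { fmt.Println("test failed:", err) }
	failTestOnMonitorError(errors.New("monitor: node terminated unexpectedly"), fail)
	failTestOnMonitorError(nil, fail) // no monitor error: no failure recorded
}
```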

@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from e1fa59a to e2ef466 on February 23, 2024 14:10
@herkolategan herkolategan (Collaborator) left a comment


Nice, hopefully this will reduce some noise for other teams!

// `failureMsg` used when reporting the issue. In addition,
// `failure_N.log` files should also already exist at this
// point.
t.resetFailures()
Member


Nice!

@srosenberg (Member)

> Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.
>
> Let me know!

I like the current approach for its parsimony! It's also more general, i.e., not specific to preemption. Thus, we can now monitor (pun intended) for these types of "infra flake" while at the same time reducing the noise due to unattributed test failures.

@srosenberg srosenberg self-requested a review February 27, 2024 16:43
@srosenberg srosenberg (Member) left a comment


Thanks for continuing to improve the monitor! :)

@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch 3 times, most recently from 35ca806 to 752362f on February 28, 2024 15:15
This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream
sees an EOF when the associated context is *not* canceled); and 2)
return any errors encountered by the roachprod monitor in roachtest,
causing the currently running test to fail. The error has TestEng
ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources, and resulting in confusing test failures
being reported on GitHub.

Fixes: cockroachdb#118563.

Release note: None
@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from 752362f to 29892b5 on February 28, 2024 15:26
@renatolabs (Collaborator, Author)

TFTR!

bors r=srosenberg

@craig craig bot (Contributor) commented Feb 28, 2024

Build succeeded:

@craig craig bot merged commit 0ce84d4 into cockroachdb:master Feb 28, 2024
17 checks passed
@renatolabs renatolabs deleted the rc/roachtest-monitor-vm-preemption branch February 28, 2024 20:38
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Mar 14, 2024
In cockroachdb#119535, we introduced a `resetFailures` method that is called
after a test failure when the test runner identifies that a VM was
preempted while the test was running. This function makes sure that
the VM preemption error is the only one visible to the issue poster,
ensuring that whenever a VM is preempted, the test failure is routed
to Test Eng.

However, there was one situation where we still would not get the
routing right: when the test times out. In this situation, we set the
internal `failuresSuppressed` field to true; in other words,
`resetFailures` followed by the VM preemption error would not have
the desired effect of making the VM preemption error visible to the
issue poster. While cockroachdb#119535 solves the most common source of test
timeouts due to preemption (a bug in the monitor), tests can run
arbitrary code and can time out if a VM is preempted.

In this commit, we reset `failuresSuppressed` in `resetFailures` to
make sure that the VM preemption error is taken into account when
reporting failures in arbitrary test timeouts.

Fixes: cockroachdb#120381

Release note: None
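
A minimal sketch of the bookkeeping described in this commit, using hypothetical names (`testFailures`, `addFailure`); only the interaction between `resetFailures` and `failuresSuppressed` is meant to reflect the fix, not the real test-runner state.

```go
package main

import (
	"errors"
	"fmt"
)

// testFailures is a hypothetical stand-in for the failure state tracked by
// the roachtest test runner.
type testFailures struct {
	failures           []error
	failuresSuppressed bool
}

// addFailure records a failure unless reporting has been suppressed, as it
// is when a test times out.
func (t *testFailures) addFailure(err error) {
	if t.failuresSuppressed {
		return
	}
	t.failures = append(t.failures, err)
}

// resetFailures discards previously recorded failures and clears the
// suppression flag, so that a VM-preemption error recorded afterwards is the
// one visible to the issue poster even after a timeout.
func (t *testFailures) resetFailures() {
	t.failures = nil
	t.failuresSuppressed = false // the fix: without this, addFailure below would be a no-op after a timeout
}

func main() {
	t := &testFailures{}
	t.addFailure(errors.New("some test assertion failed"))
	t.failuresSuppressed = true // the test timed out; further failures are suppressed

	// The runner then detects that a VM was preempted.
	t.resetFailures()
	t.addFailure(errors.New("VM preempted by cloud provider"))

	fmt.Println(t.failures) // [VM preempted by cloud provider]
}
```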
craig bot pushed a commit that referenced this pull request Mar 14, 2024
120497: roachtest: reset `failuresSuppressed` in `resetFailures` r=DarrylWong a=renatolabs

In #119535, we introduced a `resetFailures` method that is called after a test failure when the test runner identifies that a VM was preempted while the test was running. This function makes sure that the VM preemption error is the only one visible to the issue poster, ensuring that whenever a VM is preempted, the test failure is routed to Test Eng.

However, there was one situation where we still would not get the routing right: when the test times out. In this situation, we set the internal `failuresSuppressed` field to true; in other words, `resetFailures` followed by the VM preemption error would not have the desired effect of making the VM preemption error visible to the issue poster. While #119535 solves the most common source of test timeouts due to preemption (a bug in the monitor), tests can run arbitrary code and can time out if a VM is preempted.

In this commit, we reset `failuresSuppressed` in `resetFailures` to make sure that the VM preemption error is taken into account when reporting failures in arbitrary test timeouts.

Fixes: #120381

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Successfully merging this pull request may close these issues.

roachtest: cluster monitor hanging after unexpected node closure