
roachtest: fail tests if monitor encounters an error #119535

Merged

Conversation

@renatolabs renatolabs (Collaborator) commented Feb 22, 2024

This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream
sees an EOF when the associated context is not canceled); and 2)
return any errors encountered by the roachprod monitor in roachtest,
causing the currently running test to fail. The error has TestEng
ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources, and resulting in confusing test failures
being reported on GitHub.

Fixes: #118563.

Release note: None
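
For readers unfamiliar with the monitor internals, here is a minimal Go sketch of the first half of this change: emitting an error event when the reader stream hits EOF while the context is still live. All names (`MonitorEvent`, `monitorNode`) are hypothetical and simplified; the real roachprod monitor API differs.

```go
package main

import (
	"bufio"
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
)

// MonitorEvent is a hypothetical stand-in for the events the roachprod
// monitor sends on its channel; the real types live elsewhere.
type MonitorEvent struct {
	Node int
	Msg  string
	Err  error
}

// monitorNode reads status lines from r (standing in for a node's event
// stream) and forwards them as events. If the stream hits EOF while ctx is
// still active, that is treated as an abrupt termination and reported as an
// error event instead of being silently dropped.
func monitorNode(ctx context.Context, node int, r io.Reader, events chan<- MonitorEvent) {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		events <- MonitorEvent{Node: node, Msg: scanner.Text()}
	}
	// Stream ended: distinguish an expected shutdown (context canceled) from
	// an unexpected one (e.g., the VM was preempted by the cloud provider).
	if ctx.Err() == nil {
		events <- MonitorEvent{
			Node: node,
			Err:  errors.New("monitor stream closed unexpectedly (possible VM preemption)"),
		}
	}
}

func main() {
	ctx := context.Background()
	events := make(chan MonitorEvent, 8)
	// Simulate a stream that ends while the context is still live.
	go func() {
		monitorNode(ctx, 1, strings.NewReader("cockroach: alive\n"), events)
		close(events)
	}()
	for ev := range events {
		if ev.Err != nil {
			fmt.Printf("n%d: monitor error: %v\n", ev.Node, ev.Err)
			continue
		}
		fmt.Printf("n%d: %s\n", ev.Node, ev.Msg)
	}
}
```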

@cockroach-teamcity (Member)

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from 10c3894 to 566e549 on February 22, 2024 20:42
@renatolabs renatolabs changed the title roachtest: fail if monitor dies unexpectedly roachtest: fail tests if monitor encounters an error Feb 22, 2024
@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from 566e549 to e1fa59a on February 22, 2024 20:45
@renatolabs renatolabs marked this pull request as ready for review February 22, 2024 20:45
@renatolabs renatolabs requested a review from a team as a code owner February 22, 2024 20:45
@renatolabs renatolabs requested review from herkolategan and DarrylWong and removed request for a team February 22, 2024 20:45
@renatolabs (Collaborator, Author)

Verified that the logic here works by simulating a VM preemption (i.e., running a roachtest and manually deleting one of the VMs on GCE). 0.1 build is also currently in progress.

Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.

Let me know!
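
To make the second half of the change concrete (monitor errors failing the running test with TestEng ownership), a self-contained sketch follows; `ownedError` and `failTestOnMonitorError` are invented for illustration and do not mirror roachtest's actual ownership plumbing.

```go
package main

import (
	"errors"
	"fmt"
)

// ownedError tags an error with an owning team, loosely mirroring the idea
// that monitor failures should be routed to Test Eng rather than to the team
// that owns the test. Illustrative only.
type ownedError struct {
	owner string
	err   error
}

func (e ownedError) Error() string { return fmt.Sprintf("[%s] %v", e.owner, e.err) }
func (e ownedError) Unwrap() error { return e.err }

// failTestOnMonitorError captures the shape of the chosen approach: rather
// than polling the cloud provider for preempted VMs, any error surfaced by
// the monitor immediately fails the running test, attributed to Test Eng.
func failTestOnMonitorError(monitorErr error, failf func(error)) {
	if monitorErr == nil {
		return // nothing to report; the test keeps running
	}
	failf(ownedError{owner: "test-eng", err: monitorErr})
}

func main() {
	fail := func(err error) { fmt.Println("test failed:", err) }
	failTestOnMonitorError(errors.New("monitor: node terminated unexpectedly"), fail)
	failTestOnMonitorError(nil, fail) // no monitor error: no failure recorded
}
```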

@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from e1fa59a to e2ef466 on February 23, 2024 14:10
@herkolategan herkolategan (Collaborator) left a comment


Nice, hopefully this will reduce some noise for other teams!

// `failureMsg` used when reporting the issue. In addition,
// `failure_N.log` files should also already exist at this
// point.
t.resetFailures()
Member


Nice!

@srosenberg (Member)

> Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.
>
> Let me know!

I like the current approach for its parsimony! It's also more general, i.e., not specific to preemption. Thus, we can now monitor (pun intended) for these types of "infra flake" while at the same time reducing the noise due to unattributed test failures.

@srosenberg srosenberg self-requested a review February 27, 2024 16:43
@srosenberg srosenberg (Member) left a comment


Thanks for continuing to improve the monitor! :)

@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch 3 times, most recently from 35ca806 to 752362f on February 28, 2024 15:15
This commit updates the roachprod and roachtest monitors to 1) send an
event when the monitor is abruptly terminated (i.e., reader stream
sees an EOF when the associated context is *not* canceled); and 2)
return any errors encountered by the roachprod monitor in roachtest,
causing the currently running test to fail. The error has TestEng
ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in
situations where the monitored node is preempted by the cloud
provider. Previously, these events would be ignored, leading to a test
timeout, wasting resources, and resulting in confusing test failures
being reported on GitHub.

Fixes: cockroachdb#118563.

Release note: None
@renatolabs renatolabs force-pushed the rc/roachtest-monitor-vm-preemption branch from 752362f to 29892b5 on February 28, 2024 15:26
@renatolabs (Collaborator, Author)

TFTR!

bors r=srosenberg

@craig craig bot (Contributor) commented Feb 28, 2024

Build succeeded:

@craig craig bot merged commit 0ce84d4 into cockroachdb:master Feb 28, 2024
17 checks passed
@renatolabs renatolabs deleted the rc/roachtest-monitor-vm-preemption branch February 28, 2024 20:38
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Mar 14, 2024
In cockroachdb#119535, we introduced a `resetFailures` method that is called
after a test failure when the test runner identifies that a VM was
preempted while the test was running. This function makes sure that
the VM preemption error is the only one visible to the issue poster,
ensuring that whenever a VM is preempted, the test failure is routed
to Test Eng.

However, there was one situation where we still would not get the
routing right: when the test times out. In this situation, we set the
internal `failuresSuppressed` field to true; in other words,
`resetFailures` followed by the VM preemption error would not have
the desired effect of making the VM preemption error visible to the
issue poster. While cockroachdb#119535 solves the most common source of test
timeouts due to preemption (a bug in the monitor), tests can run
arbitrary code and can time out if a VM is preempted.

In this commit, we reset `failuresSuppressed` in `resetFailures` to
make sure that the VM preemption error is taken into account when
reporting failures in arbitrary test timeouts.

Fixes: cockroachdb#120381

Release note: None
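
A minimal sketch of the bookkeeping described in this commit, using hypothetical names (`testFailures`, `addFailure`); only the interaction between `resetFailures` and `failuresSuppressed` is meant to reflect the fix, not the real test-runner state.

```go
package main

import (
	"errors"
	"fmt"
)

// testFailures is a hypothetical stand-in for the failure state tracked by
// the roachtest test runner.
type testFailures struct {
	failures           []error
	failuresSuppressed bool
}

// addFailure records a failure unless reporting has been suppressed, as it
// is when a test times out.
func (t *testFailures) addFailure(err error) {
	if t.failuresSuppressed {
		return
	}
	t.failures = append(t.failures, err)
}

// resetFailures discards previously recorded failures and clears the
// suppression flag, so that a VM-preemption error recorded afterwards is the
// one visible to the issue poster even after a timeout.
func (t *testFailures) resetFailures() {
	t.failures = nil
	t.failuresSuppressed = false // the fix: without this, addFailure below would be a no-op after a timeout
}

func main() {
	t := &testFailures{}
	t.addFailure(errors.New("some test assertion failed"))
	t.failuresSuppressed = true // the test timed out; further failures are suppressed

	// The runner then detects that a VM was preempted.
	t.resetFailures()
	t.addFailure(errors.New("VM preempted by cloud provider"))

	fmt.Println(t.failures) // [VM preempted by cloud provider]
}
```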
craig bot pushed a commit that referenced this pull request Mar 14, 2024
120497: roachtest: reset `failuresSuppressed` in `resetFailures` r=DarrylWong a=renatolabs

In #119535, we introduced a `resetFailures` method that is called after a test failure when the test runner identifies that a VM was preempted while the test was running. This function makes sure that the VM preemption error is the only one visible to the issue poster, ensuring that whenever a VM is preempted, the test failure is routed to Test Eng.

However, there was one situation where we still would not get the routing right: when the test times out. In this situation, we set the internal `failuresSuppressed` field to true; in other words, `resetFailures` followed by the VM preemption error would not have the desired effect of making the VM preemption error visible to the issue poster. While #119535 solves the most common source of test timeouts due to preemption (a bug in the monitor), tests can run arbitrary code and can time out if a VM is preempted.

In this commit, we reset `failuresSuppressed` in `resetFailures` to make sure that the VM preemption error is taken into account when reporting failures in arbitrary test timeouts.

Fixes: #120381

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Successfully merging this pull request may close these issues.

roachtest: cluster monitor hanging after unexpected node closure