roachtest: fail tests if monitor encounters an error #119535
Verified that the logic here works by simulating a VM preemption (i.e., running a roachtest and manually deleting one of the VMs on GCE). 0.1 build is also currently in progress. Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption. Let me know!
Nice, hopefully this will reduce some noise for other teams!
// `failureMsg` used when reporting the issue. In addition,
// `failure_N.log` files should also already exist at this
// point.
t.resetFailures()
Nice!
I like the current approach for its parsimony! It's also more general, i.e., not specific to preemption. Thus, we can now monitor (pun intended) for these types of "infra flake" while at the same time reducing the noise due to unattributed test failures.
Thanks for continuing to improve the monitor! :)
This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., the reader stream sees an EOF when the associated context is *not* canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events would be ignored, leading to a test timeout, wasting resources, and causing confusing test failures to be reported on GitHub.

Fixes: cockroachdb#118563.

Release note: None
TFTR!

bors r=srosenberg

Build succeeded:
In cockroachdb#119535, we introduced a `resetFailures` method that is called after a test failure when the test runner identifies that a VM was preempted while the test was running. This function makes sure that the VM preemption error is the only one visible to the issue poster, ensuring that whenever a VM is preempted, the test failure is routed to Test Eng.

However, there was one situation where we still would not get the routing right: when the test times out. In this situation, we set the internal `failuresSuppressed` field to true; in other words, `resetFailures` followed by the VM preemption error would not have the desired effect of making the VM preemption error visible to the issue poster.

While cockroachdb#119535 solves the most common source of test timeouts due to preemption (a bug in the monitor), tests can run arbitrary code and can time out if a VM is preempted. In this commit, we reset `failuresSuppressed` in `resetFailures` to make sure that the VM preemption error is taken into account when reporting failures in arbitrary test timeouts.

Fixes: cockroachdb#120381

Release note: None
120497: roachtest: reset `failuresSuppressed` in `resetFailures` r=DarrylWong a=renatolabs

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
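The interaction between `failuresSuppressed` and `resetFailures` described in the follow-up commit can be illustrated with a small sketch. Only those two names come from the commit message; `testState`, `addFailure`, `suppressFailures`, `reported`, and the two error values are hypothetical, and the assumption that suppression simply drops later failures is a simplification of whatever roachtest actually does.

```go
package main

import (
	"errors"
	"sync"
)

// Hypothetical errors used only in this sketch.
var (
	errTimeout     = errors.New("test timed out")
	errVMPreempted = errors.New("VM preempted")
)

// testState is a simplified, hypothetical stand-in for roachtest's test
// bookkeeping.
type testState struct {
	mu                 sync.Mutex
	failures           []error
	failuresSuppressed bool
}

// addFailure records err unless failures are currently suppressed.
func (t *testState) addFailure(err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.failuresSuppressed {
		return
	}
	t.failures = append(t.failures, err)
}

// suppressFailures models the timeout path: failures added afterwards
// are ignored (an assumption made for this sketch).
func (t *testState) suppressFailures() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failuresSuppressed = true
}

// resetFailures clears recorded failures and, as in the follow-up
// commit, also clears the suppression flag so the next failure counts.
func (t *testState) resetFailures() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures = nil
	t.failuresSuppressed = false
}

// reported returns a copy of the failures an issue poster would see.
func (t *testState) reported() []error {
	t.mu.Lock()
	defer t.mu.Unlock()
	return append([]error(nil), t.failures...)
}
```

In this model, a timeout followed by `resetFailures()` and `addFailure(errVMPreempted)` leaves the preemption error as the only reported failure. Without the flag reset inside `resetFailures`, the preemption error would be dropped and nothing would be reported.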