-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: identify apt
failures automatically and mark them as flakes
#103316
Comments
cc @cockroachdb/test-eng |
This change adds retries to `Cluster.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: cockroachdb#103316 Informs: cockroachdb#107088 Release note: None Epic: None
This change adds retries to `roachprod.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: cockroachdb#103316 Informs: cockroachdb#107088 Closes: cockroachdb#71934 Release note: None Epic: None
107271: roachtest/cdc: add clear error message when a golang cannot be installed r=jayshrivastava a=jayshrivastava ### roachtest/cdc: add clear error message when a golang cannot be installed Previously, the test would fail immediately if installing golang, which is currently required to run the mock webhook sink, would fail. This change adds retries for up to 5 mins. If installation still fails after 5 minutes, the error is wrapped with a more clear message. Informs: #107088 Epic: None Release note: None --- ### roachtest: add retries to Cluster.Install This change adds retries to `Cluster.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: #103316 Informs: #107088 Release note: None Epic: None 107366: randgen: skip virtual columns in generate_test_objects r=yuzefovich a=yuzefovich If we don't, then `validateTableIndexes` will fail. Found when running `local` logic tests against the test tenant. Epic: None Release note: None Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
This change adds retries to `roachprod.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: #103316 Informs: #107088 Closes: #71934 Release note: None Epic: None
This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: cockroachdb#103316 Release note: None
121788: roachprod: better identification of transient failures r=srosenberg a=renatolabs This PR introduces a roachprod API to allow certain errors to be labeled as _transient_. When using `roachprod` in the command line, these errors are returned to the caller as usual (and include a `TRANSIENT_ERROR` string). More importantly, when `roachtest` sees a transient error from roachprod, the resulting test failure is marked as a test-flake and routed to Test Eng. The following commits then use this API to resolve a few related issues: **roachprod: mark internal scp failures as transient**\ This uses the transient failures mechanism recently introduced to mark failures from internal `scp` commands (when setting up a cluster or starting the cockroach process) as transient. Fixes: #121126 **roachprod: mark failures to stage binaries as transient** This commit uses the recently introduced transient failure mechanism in roachprod to automatically detect when staging a cockroach (or workload) binary on a node fails in order to mark that as a transient failure. Note that 404 errors (i.e., binary not found on the GCS bucket) is *not* marked as a transient failure as that often corresponds to a programming error where an invalid version was requested. Fixes: #88044 **roachprod: mark failures to find an open port transient** These failures should not happen in regular clusters, and if they do, they shouldn't fail tests as it indicates a problem in the test infrastructure itself. Fixes: #120339 **roachtest: mark failures in cluster.Install as transient** This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: #103316 The final commit in this PR also removes some outdated `install` targets; it's a bit unrelated with the rest of the PR, but I decided to clean that up while I was changing that file. 122013: allocatorimpl: increase mean threshold from 10% to 75% for lease shedding r=sumeerbhola a=kvoli The allocator will block rebalancing, lease transfers and also actively shed leases when a store's IO overload score exceeds some threshold and a % above the mean. The % check ensures that when multiple stores are IO overloaded, they don't needlessly shed leases for no benefit. The same also applies to removing a lease, when both the mean and absolute check are exceeded. The absolute check is higher for shedding by default. In #122008, we saw that leases shed across multiple stores in a cycle. Increase the mean % check to 75% for lease shedding, requiring that stores have 75% more than the mean IO overload before shedding leases. Note that under normal operation, we rarely see nodes exceed 0.1 IO overload, such that an IO overloaded node which exceeds the default shed threshold of 0.4 would always exceed +75% of the mean. In other words, optimize for the outlier node case, rather than multiple (>2) overloaded nodes. Resolves: #122008 Release note: None 122128: sqlstats: delete sqlstats.StatsCollector interface r=xinhaoz a=xinhaoz There is one implementation of this interface. Let us simply use the concrete type. Epic: none Release note: None Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com> Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: #103316 Release note: None
This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: cockroachdb#103316 Release note: None
Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: cockroachdb#103316 Release note: None
123725: roachprod: identify apt failures automatically and mark them as flakes r=vidit-bhat a=vidit-bhat Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: #103316 Release note: None Co-authored-by: Vidit Bhat <vidit.bhat@cockroachlabs.com>
Using pre-baked images instead of using |
Indeed, we even have a story for it [1]. Complexity-wise, it's definitely easier to execute than getting rid of public ips, but the effort/lift trade-off is fuzzy. We don't see a high frequency of apt-related flakes; otherwise, we would have most likely prioritized using pre-baked images. [1] #77644 |
Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: #103316 Release note: None
Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: #103316 Release note: None
This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: cockroachdb#103316 Release note: None
Every couple of weeks, we get a few roachtest failures because the Ubuntu mirrors were down for a period of time (for the latest occurrence at the time of writing, see #103297).
Ideally, we want these issues to be marked as infrastructure flakes automatically, without pinging teams (just like we do with cluster creation failures and SSH errors). Luckily,
apt
flakes are fairly easy to identify: the process exists with code 100.Interestingly, the Jepsen tests already try to detect this scenario [1], but the check only happens when running
apt-get install
; the failures linked in the first paragraph happened duringapt-get upgrade
. This highlights the need for a more general solution that doesn't involve every test adding their own logic to handle these flakes.Related issues (created when this behavior was noted in the past):
--retry
option torun
command #71929: although I'm unsure retrying will be of any help in this case. These "outages" typically last more than a few seconds.Perhaps we should close the issues above if handling flakes in roachtest is deemed sufficient.
[1]
cockroach/pkg/cmd/roachtest/tests/jepsen.go
Lines 168 to 170 in 1ceb218
Jira issue: CRDB-27941
The text was updated successfully, but these errors were encountered: