roachtest: identify `apt` failures automatically and mark them as flakes #103316

renatolabs · 2023-05-15T15:58:37Z

Every couple of weeks, we get a few roachtest failures because the Ubuntu mirrors were down for a period of time (for the latest occurrence at the time of writing, see #103297).

Ideally, we want these issues to be marked as infrastructure flakes automatically, without pinging teams (just like we do with cluster creation failures and SSH errors). Luckily, apt flakes are fairly easy to identify: the process exists with code 100.

Interestingly, the Jepsen tests already try to detect this scenario [1], but the check only happens when running apt-get install; the failures linked in the first paragraph happened during apt-get upgrade. This highlights the need for a more general solution that doesn't involve every test adding their own logic to handle these flakes.

Related issues (created when this behavior was noted in the past):

roachprod: add --retry option to run command #71929: although I'm unsure retrying will be of any help in this case. These "outages" typically last more than a few seconds.
roachprod: configure apt-get on cluster hosts to retry getting files on failures #71934: another retry-based approach. Once again, I don't believe retrying, at the roachprod level, will be the right solution.

Perhaps we should close the issues above if handling flakes in roachtest is deemed sufficient.

[1]

cockroach/pkg/cmd/roachtest/tests/jepsen.go

Lines 168 to 170 in 1ceb218

    
           if result.RemoteExitStatus == 100 { 
        
           	t.Skip("apt-get failure (#31944)", result.Stdout+result.Stderr) 
        
           }

Jira issue: CRDB-27941

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-05-15T15:58:39Z

cc @cockroachdb/test-eng

This change adds retries to `Cluster.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: cockroachdb#103316 Informs: cockroachdb#107088 Release note: None Epic: None

This change adds retries to `roachprod.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: cockroachdb#103316 Informs: cockroachdb#107088 Closes: cockroachdb#71934 Release note: None Epic: None

107271: roachtest/cdc: add clear error message when a golang cannot be installed r=jayshrivastava a=jayshrivastava ### roachtest/cdc: add clear error message when a golang cannot be installed Previously, the test would fail immediately if installing golang, which is currently required to run the mock webhook sink, would fail. This change adds retries for up to 5 mins. If installation still fails after 5 minutes, the error is wrapped with a more clear message. Informs: #107088 Epic: None Release note: None --- ### roachtest: add retries to Cluster.Install This change adds retries to `Cluster.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: #103316 Informs: #107088 Release note: None Epic: None 107366: randgen: skip virtual columns in generate_test_objects r=yuzefovich a=yuzefovich If we don't, then `validateTableIndexes` will fail. Found when running `local` logic tests against the test tenant. Epic: None Release note: None Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

This change adds retries to `roachprod.Install` to mitigate test failures due to package installation failures. The webhook test in `cdc.go` is updated to use `Cluster.Install` to install `go`. The install command for `go` is just `sudo apt --yes install golang-go;` now. No previously existing uses of `Cluster.Install` used this command, so changing it should be safe. Informs: #103316 Informs: #107088 Closes: #71934 Release note: None Epic: None

This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: cockroachdb#103316 Release note: None

121788: roachprod: better identification of transient failures r=srosenberg a=renatolabs This PR introduces a roachprod API to allow certain errors to be labeled as _transient_. When using `roachprod` in the command line, these errors are returned to the caller as usual (and include a `TRANSIENT_ERROR` string). More importantly, when `roachtest` sees a transient error from roachprod, the resulting test failure is marked as a test-flake and routed to Test Eng. The following commits then use this API to resolve a few related issues: **roachprod: mark internal scp failures as transient**\ This uses the transient failures mechanism recently introduced to mark failures from internal `scp` commands (when setting up a cluster or starting the cockroach process) as transient. Fixes: #121126 **roachprod: mark failures to stage binaries as transient** This commit uses the recently introduced transient failure mechanism in roachprod to automatically detect when staging a cockroach (or workload) binary on a node fails in order to mark that as a transient failure. Note that 404 errors (i.e., binary not found on the GCS bucket) is *not* marked as a transient failure as that often corresponds to a programming error where an invalid version was requested. Fixes: #88044 **roachprod: mark failures to find an open port transient** These failures should not happen in regular clusters, and if they do, they shouldn't fail tests as it indicates a problem in the test infrastructure itself. Fixes: #120339 **roachtest: mark failures in cluster.Install as transient** This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: #103316 The final commit in this PR also removes some outdated `install` targets; it's a bit unrelated with the rest of the PR, but I decided to clean that up while I was changing that file. 122013: allocatorimpl: increase mean threshold from 10% to 75% for lease shedding r=sumeerbhola a=kvoli The allocator will block rebalancing, lease transfers and also actively shed leases when a store's IO overload score exceeds some threshold and a % above the mean. The % check ensures that when multiple stores are IO overloaded, they don't needlessly shed leases for no benefit. The same also applies to removing a lease, when both the mean and absolute check are exceeded. The absolute check is higher for shedding by default. In #122008, we saw that leases shed across multiple stores in a cycle. Increase the mean % check to 75% for lease shedding, requiring that stores have 75% more than the mean IO overload before shedding leases. Note that under normal operation, we rarely see nodes exceed 0.1 IO overload, such that an IO overloaded node which exceeds the default shed threshold of 0.4 would always exceed +75% of the mean. In other words, optimize for the outlier node case, rather than multiple (>2) overloaded nodes. Resolves: #122008 Release note: None 122128: sqlstats: delete sqlstats.StatsCollector interface r=xinhaoz a=xinhaoz There is one implementation of this interface. Let us simply use the concrete type. Epic: none Release note: None Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com> Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>

This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: #103316 Release note: None

This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: cockroachdb#103316 Release note: None

Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: cockroachdb#103316 Release note: None

123725: roachprod: identify apt failures automatically and mark them as flakes r=vidit-bhat a=vidit-bhat Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: #103316 Release note: None Co-authored-by: Vidit Bhat <vidit.bhat@cockroachlabs.com>

rickystewart · 2024-05-09T16:15:41Z

Using pre-baked images instead of using apt to setup nodes would be an interesting alternative (though not an easy one to implement).

srosenberg · 2024-05-09T17:33:10Z

Using pre-baked images instead of using apt to setup nodes would be an interesting alternative (though not an easy one to implement).

Indeed, we even have a story for it [1]. Complexity-wise, it's definitely easier to execute than getting rid of public ips, but the effort/lift trade-off is fuzzy. We don't see a high frequency of apt-related flakes; otherwise, we would have most likely prioritized using pre-baked images.

[1] #77644

Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time. This PR identifies these apt failures and marks them as a flake. Epic: none Fixes: #103316 Release note: None

This function only runs a fixed set of installation commands. If they fail, we create an issue assigned to Test Eng and label the issue as an infra flake. Informs: cockroachdb#103316 Release note: None

renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels May 15, 2023

blathers-crl bot added this to Triage in Test Engineering May 15, 2023

renatolabs mentioned this issue Jul 25, 2023

roachtest: clock/jump/large_backward_disabled failed #107522

Closed

blathers-crl bot mentioned this issue Jul 25, 2023

release-23.1: roachtest/cdc: add clear error message when a golang cannot be installed #107579

Merged

renatolabs mentioned this issue Nov 29, 2023

roachtest: jepsen/monotonic/majority-ring-start-kill-2 failed #115247

Closed

This was referenced Jan 18, 2024

roachtest: jepsen/multi-register/parts-start-kill-2 failed #117878

Closed

roachtest: jepsen/multi-register/split failed #117879

Closed

renatolabs mentioned this issue Jan 25, 2024

roachtest: introduce registry.ErrorWithOwner #118324

Merged

srosenberg added the E-quick-win Likely to be a quick win for someone experienced. label Apr 2, 2024

This was referenced Apr 2, 2024

roachtest: jepsen/register/majority-ring failed #121557

Closed

roachtest: jepsen/register/majority-ring-start-kill-2 failed #121558

Closed

This was referenced Apr 5, 2024

roachprod: better identification of transient failures #121788

Merged

roachtest: jepsen/bank/majority-ring-subcritical-skews failed #121909

Closed

blathers-crl bot mentioned this issue Apr 11, 2024

release-24.1: roachprod: better identification of transient failures #122216

Merged

renatolabs mentioned this issue Apr 11, 2024

release-23.2: roachprod: better identification of transient failures #122239

Merged

renatolabs mentioned this issue Apr 15, 2024

roachtest: jepsen/multi-register/subcritical-skews-start-kill-2 failed #122370

Closed

vidit-bhat mentioned this issue May 7, 2024

roachprod: identify apt failures automatically and mark them as flakes #123725

Merged

craig bot closed this as completed in 488551c May 9, 2024

Test Engineering automation moved this from Triage to Done May 9, 2024

blathers-crl bot mentioned this issue May 13, 2024

release-24.1: roachprod: identify apt failures automatically and mark them as flakes #124031

Merged

blathers-crl bot mentioned this issue May 13, 2024

release-23.2: roachprod: identify apt failures automatically and mark them as flakes #124032

Merged

exalate-issue-sync bot assigned vidit-bhat Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: identify `apt` failures automatically and mark them as flakes #103316

roachtest: identify `apt` failures automatically and mark them as flakes #103316

renatolabs commented May 15, 2023 •

edited by cockroach-jira-scripts

Loading

blathers-crl bot commented May 15, 2023

rickystewart commented May 9, 2024

srosenberg commented May 9, 2024

roachtest: identify apt failures automatically and mark them as flakes #103316

roachtest: identify apt failures automatically and mark them as flakes #103316

Comments

renatolabs commented May 15, 2023 • edited by cockroach-jira-scripts Loading

blathers-crl bot commented May 15, 2023

rickystewart commented May 9, 2024

srosenberg commented May 9, 2024

roachtest: identify `apt` failures automatically and mark them as flakes #103316

roachtest: identify `apt` failures automatically and mark them as flakes #103316

renatolabs commented May 15, 2023 •

edited by cockroach-jira-scripts

Loading