Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: identify apt failures automatically and mark them as flakes #103316

Closed
renatolabs opened this issue May 15, 2023 · 3 comments · Fixed by #123725
Closed

roachtest: identify apt failures automatically and mark them as flakes #103316

renatolabs opened this issue May 15, 2023 · 3 comments · Fixed by #123725
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. E-quick-win Likely to be a quick win for someone experienced. T-testeng TestEng Team

Comments

@renatolabs
Copy link
Collaborator

renatolabs commented May 15, 2023

Every couple of weeks, we get a few roachtest failures because the Ubuntu mirrors were down for a period of time (for the latest occurrence at the time of writing, see #103297).

Ideally, we want these issues to be marked as infrastructure flakes automatically, without pinging teams (just like we do with cluster creation failures and SSH errors). Luckily, apt flakes are fairly easy to identify: the process exists with code 100.

Interestingly, the Jepsen tests already try to detect this scenario [1], but the check only happens when running apt-get install; the failures linked in the first paragraph happened during apt-get upgrade. This highlights the need for a more general solution that doesn't involve every test adding their own logic to handle these flakes.

Related issues (created when this behavior was noted in the past):

Perhaps we should close the issues above if handling flakes in roachtest is deemed sufficient.

[1]

if result.RemoteExitStatus == 100 {
t.Skip("apt-get failure (#31944)", result.Stdout+result.Stderr)
}

Jira issue: CRDB-27941

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels May 15, 2023
@blathers-crl
Copy link

blathers-crl bot commented May 15, 2023

cc @cockroachdb/test-eng

@blathers-crl blathers-crl bot added this to Triage in Test Engineering May 15, 2023
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jul 20, 2023
This change adds retries to `Cluster.Install` to mitigate
test failures due to package installation failures.

The webhook test in `cdc.go` is updated to use `Cluster.Install` to
install `go`.

The install command for `go` is just `sudo apt --yes install golang-go;`
now. No previously existing uses of `Cluster.Install` used this command,
so changing it should be safe.

Informs: cockroachdb#103316
Informs: cockroachdb#107088
Release note: None
Epic: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jul 21, 2023
This change adds retries to `roachprod.Install` to mitigate
test failures due to package installation failures.

The webhook test in `cdc.go` is updated to use `Cluster.Install` to
install `go`.

The install command for `go` is just `sudo apt --yes install golang-go;`
now. No previously existing uses of `Cluster.Install` used this command,
so changing it should be safe.

Informs: cockroachdb#103316
Informs: cockroachdb#107088
Closes: cockroachdb#71934
Release note: None
Epic: None
craig bot pushed a commit that referenced this issue Jul 21, 2023
107271: roachtest/cdc: add clear error message when a golang cannot be installed r=jayshrivastava a=jayshrivastava

### roachtest/cdc: add clear error message when a golang cannot be installed 
Previously, the test would fail immediately if installing golang, which is currently required to run the mock webhook sink, would fail. This change adds retries for up to 5 mins. If installation still fails after 5 minutes, the error is wrapped with a more clear message.

Informs: #107088
Epic: None
Release note: None

---

### roachtest: add retries to Cluster.Install
This change adds retries to `Cluster.Install` to mitigate
test failures due to package installation failures.

The webhook test in `cdc.go` is updated to use `Cluster.Install` to
install `go`.

The install command for `go` is just `sudo apt --yes install golang-go;`
now. No previously existing uses of `Cluster.Install` used this command,
so changing it should be safe.

Informs: #103316
Informs: #107088
Release note: None
Epic: None

107366: randgen: skip virtual columns in generate_test_objects r=yuzefovich a=yuzefovich

If we don't, then `validateTableIndexes` will fail. Found when running `local` logic tests against the test tenant.

Epic: None

Release note: None

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Jul 25, 2023
This change adds retries to `roachprod.Install` to mitigate
test failures due to package installation failures.

The webhook test in `cdc.go` is updated to use `Cluster.Install` to
install `go`.

The install command for `go` is just `sudo apt --yes install golang-go;`
now. No previously existing uses of `Cluster.Install` used this command,
so changing it should be safe.

Informs: #103316
Informs: #107088
Closes: #71934
Release note: None
Epic: None
@srosenberg srosenberg added the E-quick-win Likely to be a quick win for someone experienced. label Apr 2, 2024
renatolabs added a commit to renatolabs/cockroach that referenced this issue Apr 11, 2024
This function only runs a fixed set of installation commands. If they
fail, we create an issue assigned to Test Eng and label the issue as
an infra flake.

Informs: cockroachdb#103316

Release note: None
craig bot pushed a commit that referenced this issue Apr 11, 2024
121788: roachprod: better identification of transient failures r=srosenberg a=renatolabs

This PR introduces a roachprod API to allow certain errors to be labeled as _transient_. When using `roachprod` in the command line, these errors are returned to the caller as usual (and include a `TRANSIENT_ERROR` string). More importantly, when `roachtest` sees a transient error from roachprod, the resulting test failure is marked as a test-flake and routed to Test Eng.

The following commits then use this API to resolve a few related issues:

**roachprod: mark internal scp failures as transient**\
This uses the transient failures mechanism recently introduced to mark
failures from internal `scp` commands (when setting up a cluster or
starting the cockroach process) as transient.

Fixes: #121126

**roachprod: mark failures to stage binaries as transient**
This commit uses the recently introduced transient failure mechanism
in roachprod to automatically detect when staging a cockroach (or
workload) binary on a node fails in order to mark that as a transient
failure.

Note that 404 errors (i.e., binary not found on the GCS bucket)
is *not* marked as a transient failure as that often corresponds to a
programming error where an invalid version was requested.

Fixes: #88044

**roachprod: mark failures to find an open port transient**
These failures should not happen in regular clusters, and if they do,
they shouldn't fail tests as it indicates a problem in the test
infrastructure itself.

Fixes: #120339

**roachtest: mark failures in cluster.Install as transient**
This function only runs a fixed set of installation commands. If they
fail, we create an issue assigned to Test Eng and label the issue as
an infra flake.

Informs: #103316

The final commit in this PR also removes some outdated `install` targets; it's a bit unrelated with the rest of the PR, but I decided to clean that up while I was changing that file.

122013: allocatorimpl: increase mean threshold from 10% to 75% for lease shedding r=sumeerbhola a=kvoli

The allocator will block rebalancing, lease transfers and also actively
shed leases when a store's IO overload score exceeds some threshold and
a % above the mean. The % check ensures that when multiple stores are IO
overloaded, they don't needlessly shed leases for no benefit. The same
also applies to removing a lease, when both the mean and absolute check
are exceeded. The absolute check is higher for shedding by default.

In #122008, we saw that leases shed across multiple stores in a cycle.
Increase the mean % check to 75% for lease shedding, requiring that
stores have 75% more than the mean IO overload before shedding leases.

Note that under normal operation, we rarely see nodes exceed 0.1 IO
overload, such that an IO overloaded node which exceeds the default shed
threshold of 0.4 would always exceed +75% of the mean. In other words,
optimize for the outlier node case, rather than multiple (>2) overloaded
nodes.

Resolves: #122008
Release note: None

122128: sqlstats: delete sqlstats.StatsCollector interface r=xinhaoz a=xinhaoz

There is one implementation of this interface. Let us simply use the concrete type.

Epic: none

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Apr 11, 2024
This function only runs a fixed set of installation commands. If they
fail, we create an issue assigned to Test Eng and label the issue as
an infra flake.

Informs: #103316

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this issue Apr 12, 2024
This function only runs a fixed set of installation commands. If they
fail, we create an issue assigned to Test Eng and label the issue as
an infra flake.

Informs: cockroachdb#103316

Release note: None
vidit-bhat added a commit to vidit-bhat/cockroach that referenced this issue May 7, 2024
Previously, we used to get roachtest failures when the
Ubuntu mirrors were down for a period of time.

This PR identifies these apt failures and marks them as a flake.

Epic: none
Fixes: cockroachdb#103316

Release note: None
craig bot pushed a commit that referenced this issue May 9, 2024
123725: roachprod: identify apt failures automatically and mark them as flakes r=vidit-bhat a=vidit-bhat

Previously, we used to get roachtest failures when the Ubuntu mirrors were down for a period of time.

This PR identifies these apt failures and marks them as a flake.

Epic: none
Fixes: #103316
Release note: None

Co-authored-by: Vidit Bhat <vidit.bhat@cockroachlabs.com>
@craig craig bot closed this as completed in 488551c May 9, 2024
Test Engineering automation moved this from Triage to Done May 9, 2024
@rickystewart
Copy link
Collaborator

Using pre-baked images instead of using apt to setup nodes would be an interesting alternative (though not an easy one to implement).

@srosenberg
Copy link
Member

Using pre-baked images instead of using apt to setup nodes would be an interesting alternative (though not an easy one to implement).

Indeed, we even have a story for it [1]. Complexity-wise, it's definitely easier to execute than getting rid of public ips, but the effort/lift trade-off is fuzzy. We don't see a high frequency of apt-related flakes; otherwise, we would have most likely prioritized using pre-baked images.

[1] #77644

blathers-crl bot pushed a commit that referenced this issue May 13, 2024
Previously, we used to get roachtest failures when the
Ubuntu mirrors were down for a period of time.

This PR identifies these apt failures and marks them as a flake.

Epic: none
Fixes: #103316

Release note: None
blathers-crl bot pushed a commit that referenced this issue May 13, 2024
Previously, we used to get roachtest failures when the
Ubuntu mirrors were down for a period of time.

This PR identifies these apt failures and marks them as a flake.

Epic: none
Fixes: #103316

Release note: None
andrew-delph pushed a commit to andrew-delph/cockroach that referenced this issue Jun 20, 2024
This function only runs a fixed set of installation commands. If they
fail, we create an issue assigned to Test Eng and label the issue as
an infra flake.

Informs: cockroachdb#103316

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. E-quick-win Likely to be a quick win for someone experienced. T-testeng TestEng Team
Projects
Development

Successfully merging a pull request may close this issue.

4 participants