
roachtest: improve handling of cluster creation errors from cloud provider #114523

Open
renatolabs opened this issue Nov 15, 2023 · 1 comment
Labels
A-testing: Testing tools and infrastructure
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-testeng: TestEng Team

renatolabs commented Nov 15, 2023

We have been seeing a number of cluster creation errors in the roachtest nightly runs. Part of this is due to GCP not having enough resources in us-east1-b, where we create VMs by default [1]. These errors typically look like:

The zone 'projects/cockroach-ephemeral/zones/us-east1-b' does not have enough
  resources available to fulfill the request.  '(resource type:compute)'.

Recent example: #108629 (comment).

The error message includes a machine-readable payload that lists other AZs with resources available for the failed request; that information should be in errorInfo.metadatas.zonesAvailable:

- errorInfo:
    domain: compute.googleapis.com
    metadatas:
      attachment: local-ssd:16
      vmType: n2-highcpu-96
      zone: us-east1-b
      zonesAvailable: us-east1-c,us-east1-d
    reason: resource_availability

(extracted from the error message linked above).
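As a rough illustration of how that field could be read, here is a minimal Go sketch. It assumes the error text contains the `zonesAvailable: ...` line exactly as shown in the payload above; `parseZonesAvailable` is a hypothetical helper, not an existing roachprod function, and a structured error from the GCP client would be a better source if one is available.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// zonesAvailableRE captures the comma-separated value that follows
// "zonesAvailable:" in the error payload (assumed format, see lead-in).
var zonesAvailableRE = regexp.MustCompile(`zonesAvailable:\s*(\S+)`)

// parseZonesAvailable is a hypothetical helper. It returns the zones listed
// in errorInfo.metadatas.zonesAvailable, or nil if the field is not present.
func parseZonesAvailable(errText string) []string {
	m := zonesAvailableRE.FindStringSubmatch(errText)
	if m == nil {
		return nil
	}
	var zones []string
	for _, z := range strings.Split(m[1], ",") {
		if z = strings.TrimSpace(z); z != "" {
			zones = append(zones, z)
		}
	}
	return zones
}

func main() {
	const errText = `- errorInfo:
    domain: compute.googleapis.com
    metadatas:
      attachment: local-ssd:16
      vmType: n2-highcpu-96
      zone: us-east1-b
      zonesAvailable: us-east1-c,us-east1-d
    reason: resource_availability`
	fmt.Println(parseZonesAvailable(errText)) // prints [us-east1-c us-east1-d]
}
```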

Roachtest could be smarter about its cluster creation retry mechanism and take this information into account.

It also wouldn't hurt to rotate the default AZ (i.e., use us-east1-{b,c,d}), or even use us-central as well.

[1]

zones = []string{defaultZones[0]}

Jira issue: CRDB-33544

renatolabs added the C-enhancement, A-testing, and T-testeng labels Nov 15, 2023

blathers-crl bot commented Nov 15, 2023

cc @cockroachdb/test-eng

renatolabs added a commit to renatolabs/cockroach that referenced this issue Mar 19, 2024
As a stopgap measure to reduce the chances of "zone exhausted" errors
we see during roachtest runs[^1], we randomize the default zone used
when creating clusters with roachprod.

[^1]: for an example, see cockroachdb#120621 (comment)

Informs: cockroachdb#114523

Release note: None
craig bot pushed a commit that referenced this issue Mar 19, 2024
120714: roachprod: randomize default zone r=srosenberg a=renatolabs

As a stopgap measure to reduce the chances of "zone exhausted" errors we see during roachtest runs[^1], we randomize the default zone used when creating clusters with roachprod.

[^1]: for an example, see #120621 (comment)

Informs: #114523

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Mar 20, 2024
As a stopgap measure to reduce the chances of "zone exhausted" errors
we see during roachtest runs[^1], we randomize the default zone used
when creating clusters with roachprod.

[^1]: for an example, see #120621 (comment)

Informs: #114523

Release note: None