roachtest: improve handling of cluster creation errors from cloud provider #114523
Labels
A-testing
Testing tools and infrastructure
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-testeng
TestEng Team
We have been seeing a number of cluster creation errors every night on roachtest nightly runs. Part of this has to do with GCP not having enough resources on `us-east1-b`, where we create VMs by default [1]. For what these errors typically look like, see this recent example: #108629 (comment).
The error message includes a machine-readable payload that indicates other AZs where resources are available for the request that failed; that information should be in `errorInfo.metadatas.zonesAvailable` (extracted from the error message linked above).
Roachtest could be smarter about its cluster creation retry mechanism and take this information into account.
It also wouldn't hurt to rotate the default AZ (i.e., use `us-east1-{b,c,d}`), or even use `us-central` as well.
[1] cockroach/pkg/roachprod/vm/gce/gcloud.go, line 969 in bab4335
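The AZ rotation suggested above could be sketched as follows. This is only an illustration: `rotateZones` is a hypothetical helper, and the choice of offset (for example, a per-run or per-cluster counter) is an assumption rather than anything roachprod does today.

```go
package main

// rotateZones returns a copy of zones rotated so that the element at index
// (offset mod len(zones)) comes first. Callers would try zones in the
// returned order, so different offsets spread load across AZs.
func rotateZones(zones []string, offset int) []string {
	n := len(zones)
	if n == 0 {
		return nil
	}
	// Normalize the offset so that negative values also work.
	start := ((offset % n) + n) % n
	out := make([]string, 0, n)
	out = append(out, zones[start:]...)
	out = append(out, zones[:start]...)
	return out
}
```

For example, `rotateZones([]string{"us-east1-b", "us-east1-c", "us-east1-d"}, 1)` yields the order `us-east1-c`, `us-east1-d`, `us-east1-b`.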
Jira issue: CRDB-33544