Labels: A-testeng-infra, A-testing (Testing tools and infrastructure), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-testeng (TestEng Team)
Description
Problem Statement
Provisioning roachprod clusters has become increasingly slow because we configure each VM entirely at boot time via the cloud-init startup script:
- We install a growing list of Debian packages and test-only tools (e.g. observability exporters).
- Instance startup time keeps creeping up as the package list grows.
- Every boot depends on live external endpoints (Debian mirrors, GitHub tarballs, etc.). A transient outage or rate limit can fail roachtests, surfacing as infra flake.
- While dependency versions are pinned, downloading tarballs or building external dependencies at boot for each test means the same test run may not be perfectly reproducible over time.
See cockroachdb/cockroach#147352 for the ongoing effort to package several of these tools as first-class .deb artifacts instead of downloading or compiling from source at boot.
Proposal — Pre-baked Images with Packer + Ansible
- Image build phase (offline)
  - For each cloud provider (GCE, AWS, Azure, …) and each architecture we support (amd64, arm64, fips, s390x), use HashiCorp Packer to spin up a temporary VM.
  - Apply an Ansible role that:
    - Upgrades the OS.
    - Installs all baseline OS packages and our test dependencies.
    - Adds the pre-built `.deb` packages (created in cockroachdb/cockroach#147352).
    - Copies any custom binaries we cannot (yet) package.
  - Output an image/AMI ready for test workloads.
- Test run phase (online)
  - `roachprod` launches from the pre-baked image instead of a vanilla Ubuntu image. `cloud-init` now only needs to:
    - Attach extra disks.
    - Write CockroachDB configuration and cluster topology.
    - Start the node.
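The image build phase above could be sketched as a Packer template that hands provisioning off to the Ansible role. This is a minimal illustration, not the actual build config: the project ID, image family, and playbook path are hypothetical placeholders.

```hcl
# Sketch of the offline image build phase (GCE, amd64). All names are
# hypothetical; one template per cloud/arch would follow the same shape.
packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = ">= 1.0.0"
    }
  }
}

source "googlecompute" "roachprod" {
  project_id          = "cockroach-testeng"       # hypothetical project
  source_image_family = "ubuntu-2204-lts"
  machine_type        = "n2-standard-4"
  zone                = "us-east1-b"
  image_name          = "roachprod-amd64-{{timestamp}}"
  image_family        = "roachprod-amd64"         # consumers resolve "latest" via the family
  ssh_username        = "packer"
}

build {
  sources = ["source.googlecompute.roachprod"]

  # Apply the Ansible role: OS upgrade, baseline packages, test
  # dependencies, and the pre-built .deb artifacts from #147352.
  provisioner "ansible" {
    playbook_file = "./roachprod-image.yml"       # hypothetical playbook
  }
}
```

With images published under a family, the test run phase only needs the family name; the heavyweight installation work never happens at boot.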
Note
The software discussed in #147352 originally built ready-to-go AMIs via Packer in GCP Cloud Build behind a small REST API. Re-implementing that workflow for Test Engineering should be straightforward.
Benefits
- Boot time: instances start significantly faster because all heavy installation happens once, offline.
- Stability: test runs no longer depend on third-party package mirrors or GitHub availability.
- Reproducibility: the exact image is versioned and immutable; restoring an old CI run is as simple as pinning an AMI ID.
Drawbacks / Trade-offs
| Drawback | Mitigation |
|---|---|
| Images must be rebuilt whenever any dependency changes. | Automate nightly image builds in CI; only promote to “latest” when the Ansible role finishes successfully. |
| More images to manage (per cloud, per arch, per Ubuntu release). | Store metadata (build date, upstream Ubuntu SHA) in image tags to aid pruning; use Terraform data sources to fetch “latest-stable”. |
| Slightly higher storage cost for custom images. | Negligible. |
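The "latest-stable" lookup mentioned in the mitigation column could look like the following Terraform data source, shown here for GCE only. The family and project names are hypothetical and would mirror whatever the image build publishes.

```hcl
# Sketch: resolve the newest promoted image in a family (names hypothetical).
data "google_compute_image" "roachprod_latest" {
  family  = "roachprod-amd64"      # CI promotes successful builds into this family
  project = "cockroach-testeng"
}

# Instances reference data.google_compute_image.roachprod_latest.self_link,
# so new runs pick up the newest image automatically, while an old CI run can
# instead pin the exact image name recorded at the time for reproducibility.
```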
Acceptance Criteria
- Packer templates exist for each supported cloud / arch.
- Ansible role reproduces current cloud-init behavior minus CRDB setup.
- CI job builds an image on demand and publishes its ID/URI.
- `roachprod` can consume a `--image-family` flag (or similar) to use the new images.
Jira issue: CRDB-52590