roachprod: prebuild roachprod/roachtest cloud images #150144

@golgeek


Problem Statement

Provisioning roachprod clusters has become increasingly slow because we configure each VM entirely at boot time via the cloud-init startup script:

  • We install a growing list of Debian packages and test-only tools (e.g. observability exporters).
  • Instance startup time keeps creeping up as the package list grows.
  • Every boot depends on live external endpoints (Debian mirrors, GitHub tarballs, etc.). A transient outage or rate limit can fail roachtests as an infra flake.
  • Although dependency versions are pinned, downloading tarballs or building external dependencies for each test means that the same test run may not be perfectly reproducible over time.

See cockroachdb/cockroach#147352 for the ongoing effort to package several of these tools as first-class .deb artifacts instead of downloading or compiling from source at boot.


Proposal — Pre-baked Images with Packer + Ansible

  1. Image build phase (offline)

    1. For each cloud provider (GCE, AWS, Azure, …) and each architecture or build variant we support (amd64, arm64, fips, s390x), use HashiCorp Packer to spin up a temporary VM.
    2. Apply an Ansible role that:
    3. Output an image/AMI ready for test workloads.
  2. Test run phase (online)

    • roachprod launches from the pre-baked image instead of a vanilla Ubuntu image.
    • cloud-init now only needs to:
      • Attach extra disks.
      • Write CockroachDB configuration and cluster topology.
      • Start the node.
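
The image build phase implies one Packer build per cloud and architecture pair. A minimal sketch of enumerating that matrix is shown here. `BuildTarget`, `BuildMatrix`, the `roachprod-...` naming scheme, the `ubuntu-2204` release string, and the "unsupported" combination are all assumptions made for illustration; none of them come from the roachprod code base.

```go
package main

import "fmt"

// BuildTarget identifies one image to bake: a cloud provider plus an
// architecture or build variant. Hypothetical type for this sketch.
type BuildTarget struct {
	Cloud string
	Arch  string
}

// ImageName returns a deterministic name for the baked image.
// The naming scheme is an assumption, not an existing convention.
func (t BuildTarget) ImageName(ubuntuRelease, buildDate string) string {
	return fmt.Sprintf("roachprod-%s-%s-%s-%s", t.Cloud, ubuntuRelease, t.Arch, buildDate)
}

// BuildMatrix expands clouds x arches into a list of targets, skipping
// any combination listed in unsupported. Order follows the input slices.
func BuildMatrix(clouds, arches []string, unsupported map[BuildTarget]bool) []BuildTarget {
	var targets []BuildTarget
	for _, c := range clouds {
		for _, a := range arches {
			t := BuildTarget{Cloud: c, Arch: a}
			if unsupported[t] {
				continue
			}
			targets = append(targets, t)
		}
	}
	return targets
}

func main() {
	clouds := []string{"gce", "aws", "azure"}
	arches := []string{"amd64", "arm64", "fips", "s390x"}
	// Illustrative only: pretend one combination is not offered.
	unsupported := map[BuildTarget]bool{{Cloud: "azure", Arch: "s390x"}: true}

	for _, t := range BuildMatrix(clouds, arches, unsupported) {
		fmt.Println(t.ImageName("ubuntu-2204", "20250101"))
	}
}
```

Each resulting name could then be passed to a Packer invocation as a variable, so that the same template is reused across targets.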

Note
The software discussed in #147352 originally built ready-to-go AMIs via Packer in GCP Cloud Build behind a small REST API. Re-implementing that workflow for Test Engineering should be straightforward.


Benefits

  • Boot time: instances start significantly faster because all heavy installation happens once, offline.
  • Stability: test runs no longer depend on third-party package mirrors or GitHub availability.
  • Reproducibility: the exact image is versioned and immutable; restoring an old CI run is as simple as pinning an AMI ID.

Drawbacks / Trade-offs

| Drawback | Mitigation |
| --- | --- |
| Images must be rebuilt whenever any dependency changes. | Automate nightly image builds in CI; only promote to “latest” when the Ansible role finishes successfully. |
| More images to manage (per cloud, per arch, per Ubuntu release). | Store metadata (build date, upstream Ubuntu SHA) in image tags to aid pruning; use Terraform data sources to fetch “latest-stable”. |
| Slightly higher storage cost for custom images. | Negligible. |

Acceptance Criteria

  • Packer templates exist for each supported cloud / arch.
  • Ansible role reproduces current cloud-init behavior minus CRDB setup.
  • CI job builds an image on demand and publishes its ID/URI.
  • roachprod can consume a --image-family flag (or similar) to use the new images.
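
As a rough sketch of the last criterion, the flag could gate which startup steps remain at boot. This example uses Go's standard `flag` package to stay self-contained; the actual roachprod CLI may wire flags differently. `Options`, `ParseArgs`, `StartupSteps`, and the step descriptions are hypothetical names for illustration, and the fallback to a vanilla image when the flag is empty is an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Options holds the subset of settings relevant to this sketch.
type Options struct {
	ImageFamily string // empty means "boot a vanilla Ubuntu image" (assumption)
}

// ParseArgs parses a --image-family flag from args.
func ParseArgs(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("create", flag.ContinueOnError)
	fs.StringVar(&o.ImageFamily, "image-family", "",
		"pre-baked image family to boot from (empty uses a vanilla image)")
	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	return o, nil
}

// UsesPrebaked reports whether a pre-baked image family was requested.
func (o Options) UsesPrebaked() bool { return o.ImageFamily != "" }

// StartupSteps lists what cloud-init still has to do at boot. With a
// pre-baked image, the package installation step is dropped.
func StartupSteps(o Options) []string {
	steps := []string{
		"attach extra disks",
		"write CockroachDB configuration and cluster topology",
		"start the node",
	}
	if !o.UsesPrebaked() {
		steps = append([]string{"install packages and test-only tools"}, steps...)
	}
	return steps
}

func main() {
	opts, err := ParseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	for i, s := range StartupSteps(opts) {
		fmt.Printf("%d. %s\n", i+1, s)
	}
}
```

Keeping the vanilla path as the default would let the new images roll out incrementally without changing existing invocations.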

Jira issue: CRDB-52590

Labels

  • A-testeng-infra
  • A-testing: Testing tools and infrastructure
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • T-testeng: TestEng Team
