buildbench

Measure builds on the target hardware, not on your laptop.

buildbench launches a self-terminating spot instance, runs your build, collects metrics from two independent sources (CloudWatch Agent for live observability, sysstat for forensic depth), uploads the results to S3, and shuts down. Total cost per run is fractions of a penny. Total wall-clock time is "however long your build takes" plus about 90 seconds of bootstrap.

The problem this solves: Docker Desktop on macOS reports memory numbers that include the Linux VM's overhead — the build itself uses a fraction of what the hypervisor reports. If you size Kubernetes resource requests, Karpenter node pools, or self-hosted runners based on those numbers, you reserve memory that nothing ever uses.

Stop guessing. Measure on the target. It costs less than a penny.

Quickstart

# Provision the AWS side once
cd terraform && terraform init && terraform apply -auto-approve
export BUCKET=$(terraform output -raw bucket)
cd ..

# Run a benchmark
./bin/buildbench run \
  --instance-type r6g.large \
  --script ./examples/hello-build/build.sh \
  --bucket "$BUCKET"

The CLI prints the instance ID, a live CloudWatch URL, and an S3 path. The instance terminates itself when the build finishes. summary.json in S3 has the duration, peak memory, and exit code.
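
What a run's summary.json looks like, roughly (duration, peak memory, and exit code are what the harness records; the exact field names shown here are illustrative):

{
  "instance_type": "r6g.large",
  "duration_seconds": 412,
  "peak_mem_mib": 1490,
  "exit_code": 0
}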

Architecture

The harness runs two collection systems in parallel because they answer different questions.

CloudWatch Agent streams metrics to a custom namespace at 2-second resolution: memory, CPU, disk I/O, network, swap. The dashboard updates while the build runs. If something goes sideways — an OOM kill, a disk filling up, a network timeout — you see it before the instance terminates.
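
A CloudWatch Agent config matching this description looks roughly like the following (the namespace and the exact measurement list are assumptions, not necessarily what the harness ships):

{
  "agent": { "metrics_collection_interval": 2 },
  "metrics": {
    "namespace": "buildbench",
    "metrics_collected": {
      "mem":    { "measurement": ["mem_used", "mem_available"] },
      "cpu":    { "measurement": ["cpu_usage_active"], "totalcpu": true },
      "diskio": { "measurement": ["read_bytes", "write_bytes"] },
      "net":    { "measurement": ["bytes_sent", "bytes_recv"] },
      "swap":   { "measurement": ["swap_used"] }
    }
  }
}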

sysstat writes the same data into a binary sa file on local disk. After the build finishes, sar -r extracts memory, sar -u extracts CPU, sar -d extracts disk throughput. The file ships to S3 alongside summary.json, giving you full post-mortem depth without per-metric query costs.
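
Post-mortem analysis is then plain sar against the downloaded file (the sa filename and S3 layout here are illustrative):

aws s3 cp s3://my-bucket/<prefix>/<instance>/sa.bin ./sa.bin
sar -r -f ./sa.bin    # memory over the whole run
sar -u -f ./sa.bin    # CPU utilization
sar -d -f ./sa.bin    # per-device disk throughput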

This is the Brendan Gregg methodology applied to ephemeral instances: the kernel instruments everything, you turn on the collection daemon, and you query the binary later. No sampling loops, no while true; do cat /proc/meminfo, no per-metric API calls during the hot path.

What the CLI does

  1. Resolves the target hardware. If you pass an ARM instance type (r6g.*, c7g.*, m7g.*), it finds the latest Amazon Linux 2023 arm64 AMI. For x86, it finds the x86_64 AMI. (A lookup sketch follows this list.)
  2. Resolves the network. Defaults to the first default-VPC subnet in the region and that VPC's default security group. Override with --subnet-id and --security-group.
  3. Uploads your build script and the CloudWatch Agent config to S3.
  4. Renders the user-data template with your parameters.
  5. Calls aws ec2 run-instances with MarketType=spot and a tag of managed-by=buildbench.
  6. Prints the live CloudWatch URL and the S3 results path.
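
The AMI lookup in step 1 is equivalent in effect to something like this (the CLI may implement it differently, for example via an SSM public parameter):

aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=al2023-ami-2023*-arm64" "Name=state,Values=available" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
  --output text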

The user-data script installs sysstat, Docker, and the CloudWatch Agent; starts sysstat in the background; downloads your build script from S3; runs it; tears the collection down; assembles summary.json; uploads everything to S3; and calls terminate-instances on itself.
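
Condensed, that flow has roughly this shape (a sketch, not the repo's actual script; the bucket/prefix variables, file paths, and sa filename are illustrative):

dnf install -y sysstat docker amazon-cloudwatch-agent
systemctl start docker
aws s3 cp "s3://$BUCKET/$PREFIX/cwagent.json" /tmp/cwagent.json
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s -c file:/tmp/cwagent.json
sar -o /tmp/sa.bin 2 100000 >/dev/null 2>&1 &      # binary sysstat capture every 2 s
SAR_PID=$!
aws s3 cp "s3://$BUCKET/$PREFIX/build.sh" /tmp/build.sh && chmod +x /tmp/build.sh
start=$(date +%s)
/tmp/build.sh >>/var/log/buildbench.log 2>&1; rc=$?
end=$(date +%s)
kill "$SAR_PID" || true
printf '{"exit_code": %s, "duration_seconds": %s}\n' "$rc" "$((end - start))" > /tmp/summary.json   # the real summary also records peak memory
aws s3 cp /tmp/summary.json "s3://$BUCKET/$PREFIX/summary.json"
aws s3 cp /tmp/sa.bin "s3://$BUCKET/$PREFIX/sa.bin"
aws ec2 terminate-instances --region "$(imds placement/region)" \
  --instance-ids "$(imds instance-id)"             # imds(): IMDSv2 helper, sketched under Gotchas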

CLI reference

buildbench --help prints the full reference. The required flags are --instance-type, --script, and --bucket. Everything else has a default that works for a default-VPC AWS account.

Terraform module

terraform/ contains the module that creates the S3 bucket, the IAM role, and the instance profile. The self-terminate permission is scoped by tag: only instances with managed-by=buildbench can be terminated by the role. A misconfigured build script cannot terminate other workloads.

module "buildbench" {
  source                 = "github.com/fizz/buildbench//terraform"
  results_retention_days = 30
}

See terraform/README.md for inputs and outputs.
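
The tag scoping on the terminate permission amounts to an IAM condition along these lines (a sketch; the module's actual policy statement may differ in detail):

{
  "Effect": "Allow",
  "Action": "ec2:TerminateInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringEquals": { "aws:ResourceTag/managed-by": "buildbench" }
  }
}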

The pattern

This is not specific to Docker builds. The harness answers one question: "How fast does X run on Y, and how much memory does it need?" Swap the Dockerfile for a machine learning training script, a compiler benchmark, a database import, a video transcode. Swap the instance type for any hardware question — Graviton vs. Intel, GPU vs. CPU, io2 vs. gp3, current generation vs. previous.

The instrumentation is the same regardless of workload. The cost is always fractions of a penny because spot instances bill by the second and the instance doesn't survive past the measurement.

Diagnosing failed runs

The harness installs the AWS CLI as the first step and sets a trap EXIT that uploads /var/log/buildbench.log, /var/log/cloud-init.log, /var/log/cloud-init-output.log, and a small exit.txt (with cloud-init status --long output and the script's exit code) to a diagnostics/ subdirectory in S3. This happens regardless of whether the build reached summary.json. If the run looks unsuccessful, that's where to look.

aws s3 ls   s3://my-bucket/<prefix>/<instance>/
aws s3 sync s3://my-bucket/<prefix>/<instance>/diagnostics/ ./diagnostics/
cat diagnostics/exit.txt           # exit code + cloud-init status
less diagnostics/buildbench.log    # your build's stdout/stderr
less diagnostics/cloud-init.log    # bootstrap + IAM + network errors land here
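
On the instance side, the trap that produces these files is roughly this (a sketch; the paths match the list above, the variable names are illustrative):

upload_diagnostics() {
  { cloud-init status --long; echo "exit_code=${rc:-unknown}"; } > /tmp/exit.txt
  for f in /var/log/buildbench.log /var/log/cloud-init.log /var/log/cloud-init-output.log /tmp/exit.txt; do
    aws s3 cp "$f" "s3://$BUCKET/$PREFIX/diagnostics/" || true
  done
}
trap upload_diagnostics EXIT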

The presence-or-absence pattern tells you what happened:

| summary.json | diagnostics/exit.txt says | What it means |
|---|---|---|
| present | exit_code=0 | Build succeeded |
| present | exit_code≠0 | Harness completed but your build script failed; look in build.log |
| missing | exit_code≠0 | Harness died before the build (IAM not propagated, S3 unreachable, dnf failed); look in cloud-init.log |
| missing | missing | The AWS CLI install failed and nothing could upload. Symptom: instance terminated by self-shutdown but S3 is empty. Cause is almost always default-SG egress (see Gotchas) or a region/AMI mismatch. Look at the spot request itself in the EC2 console. |

Gotchas

A few things bite people the first time.

Spot capacity at launch. Larger instance types in popular AZs run out. If run-instances errors with InsufficientInstanceCapacity, try a different AZ or step down a size. Use aws ec2 get-spot-placement-scores --instance-types <T> --target-capacity 1 --single-availability-zone --region-names <R> to pick the best-scoring AZ before launching. For benchmarking, you only need one instance for a few minutes — pass --no-spot if you want guaranteed capacity.

Spot interruption mid-build. AWS reclaims spot capacity when it needs it back for on-demand customers, not because of your bid price. For builds expected to run longer than ~5 minutes in a constrained region, mid-build reclamation is a real risk regardless of how high your max-price is. Symptoms: BuildKit rpc error: code = Unavailable mid-compile, or any other process-died-unexpectedly pattern. Check the spot instance request afterwards — describe-spot-instance-requests --filters Name=instance-id,Values=<id> shows instance-terminated-no-capacity when this happens. For builds that need to complete, pass --no-spot. The cost difference at single-run scale is pennies. One consolation: AWS doesn't bill for spot instances that AWS itself reclaims within the first hour of runtime — so a failed bench attempt costs $0. The official policy: "If your Spot Instance is interrupted by Amazon EC2 in the first instance hour, you are not charged for the partial hour." Cheap to retry on a different AZ or instance size before falling back to on-demand.

Default security group egress. A fresh VPC's default security group allows all outbound traffic. If someone has tightened it, the instance can't reach Docker Hub, the CloudWatch Agent endpoint, or your S3 bucket. The build will hang on dnf install with no error. Pass --security-group with a known-good SG.

Subnet doesn't auto-assign public IPs. Some "public" subnets have IGW routing but MapPublicIpOnLaunch=false. The instance launches but can't reach the internet. Pass --associate-public-ip to force a public IP via --network-interfaces. Symptom: bootstrap hangs before the trap fires; no diagnostics in S3.

Root volume too small for big Docker builds. The default Amazon Linux 2023 root volume is 8 GiB. A multi-stage Docker build that compiles C/C++ libraries from source (tesseract, leptonica, etc.) plus a sizable runtime image plus dnf caches plus intermediate layer storage can easily exhaust this. Symptom: BuildKit rpc error: code = Unavailable mid-build with no other signal. Pass --volume-size N (e.g. --volume-size 30) to get a roomier root volume.

IMDSv2 enforcement. If your AWS account enforces IMDSv2 at the org or account level, unauthenticated metadata calls fail silently. buildbench's harness uses IMDSv2 tokens for every metadata call — this is required, not optional. If you're forking the harness for your own scripts, make sure your metadata reads route through the imds() helper, never bare curl 169.254.169.254.
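
A minimal version of such a helper, for reference (the helper in the harness may differ; treat this as the general shape rather than its exact code):

imds() {
  local token
  token=$(curl -sfX PUT "http://169.254.169.254/latest/api/token" \
            -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -sf -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/$1"
}

imds instance-id        # e.g. i-0abc123...
imds placement/region   # e.g. us-east-1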

Private repos. If your build script clones from a private repo, the spot instance has no credentials. Embed the source in the build script, use AWS CodeCommit with the instance role, or pre-stage the source in S3.

Background

buildbench started with a 16 GB Docker image that Docker Desktop reported as needing 5 GiB of memory. The actual build on Graviton needed under 1.5 GiB. The longform writeup is at ferkakta.dev — I was guessing build times from my laptop. A six-cent spot instance proved me wrong.

The sysstat lineage comes from a decade of Linux administration before "observability" was a vendor category. The kernel already instruments everything you need. sar -r, sar -u, sar -d — the data is there, in a compact binary format designed for exactly this. The CloudWatch Agent is the AWS-native version of the same principle.

License

MIT — see LICENSE.
