Appliance workflow scripts, not core y-cluster features #27
Open
solsson wants to merge 16 commits into
Brings every scripts/ change from agents/appliance-export-import to upstream main as a single bump. The Go and testdata side of this work landed in #19 (appliance-primitives); this commit is the operator-facing bash that drives the released binary through the appliance lifecycle, plus the .env-style config the scripts source.

What's new vs main:
- appliance-build-hetzner.sh / appliance-build-virtualbox.sh: interactive build flows producing a .qcow2 and a VirtualBox-importable .ova respectively, both via the released y-cluster binary's prepare-export and export subcommands.
- appliance-publish-hetzner.sh: pushes a built appliance to Hetzner Object Storage for handoff.
- appliance-qemu-to-gcp.sh: end-to-end qemu -> GCP custom image flow (export --format=gcp-tar -> gsutil cp -> compute images create) with the persistent /data/yolean disk preserved across redeploys, plus a teardown subcommand.
- gcp-bootstrap-credentials.sh: one-shot bootstrap for the service account / project / key file the GCP flow needs.
- e2e-appliance-export-import.sh: local qemu -> qemu round-trip exercising the full prepare-export / export / import cycle without any cloud-credential dependency.
- e2e-appliance-hetzner.{sh,pkr.hcl}: Packer-based snapshot flow; lays the snapshot down once, spins fresh servers on top to verify boot.
- e2e-appliance-qemu-to-gcp.sh: non-interactive driver of appliance-qemu-to-gcp.sh end to end, including teardown.
- .env.example + .gitignore: documents every overridable knob (GCP_PROJECT, GCP_KEY, H_S3_ENV_FILE, ENV_FILE) with a generic example path; .env stays out of git.

Configuration: required values are operator-supplied via env vars (no built-in defaults). Each script derives REPO_ROOT from BASH_SOURCE and sources $REPO_ROOT/.env via `set -o allexport` when present, so the .env path works regardless of CWD (including `cd /tmp && bash /path/to/script`). Missing required values fail fast with a clear "set $VAR in .env or shell env" message.
Scope: scripts/ + repo-root .env plumbing. The Go side is already on main via #19. Both `go build ./...` and `go test ./...` are unchanged-clean on this branch -- the scripts add no go.mod or testdata edits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
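The REPO_ROOT / .env plumbing described above can be sketched roughly like this (variable names come from the description; the exact guard wording and helper name are assumptions, not the scripts' literal code):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Resolve the repo root from this script's own location, not the CWD,
# so `cd /tmp && bash /path/to/script` still finds .env.
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"

# Source .env with allexport so every KEY=value line becomes an env var.
if [[ -f "$REPO_ROOT/.env" ]]; then
  set -o allexport
  # shellcheck source=/dev/null
  source "$REPO_ROOT/.env"
  set +o allexport
fi

# Fail fast on missing required values with an actionable message.
require_var() {
  local name="$1"
  [[ -n "${!name:-}" ]] || {
    echo "ERROR: set $name in .env or shell env" >&2
    return 1
  }
}
```

Because the check names the variable, the operator sees which knob is missing rather than a failure deep inside gcloud or qemu.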
Yolean dev / setup scripts that smoke-test the gateway expect a
host-side port that reaches guest 80. Today the qemu-side host-port
forwards default to 39080 in both the Go e2e helper and the bash
appliance-build scripts, so any consumer that hardcodes
http://localhost:80 has to remember the offset.
This host (and most modern Linux distros) ships
net.ipv4.ip_unprivileged_port_start=80, so qemu's user-mode
hostfwd inherits the ability to bind port 80 without root. Default
APP_HTTP_PORT and the e2e port-forward helper to 80 in lockstep:
- e2e/qemu_test.go: e2eUniqueForwards now takes both apiPort
and httpPort; every test passes its own pair (28443 / 28444 /
... vs 26443 / 26444 / ...) keyed off the apiPort so concurrent
test runs on the same host don't collide. Each test always gets
a guest-80 forward, matching what the appliance-build scripts
install.
- scripts/appliance-{qemu-to-gcp,build-hetzner,build-virtualbox}.sh
+ scripts/e2e-appliance-{export-import,qemu-to-gcp}.sh: the
APP_HTTP_PORT default flips from 39080 to 80, with YHELP /
inline curl examples updated to match. Override via env
(APP_HTTP_PORT=39080 ./scripts/...) on hosts that keep port 80
privileged.
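The commit defaults APP_HTTP_PORT to 80 unconditionally; the sysctl probe below is an illustrative addition showing the "port 80 may still be privileged" caveat the override exists for, not the scripts' actual logic:

```shell
# Default the host-side forward to 80; operators on hosts that keep
# port 80 privileged override it, e.g. APP_HTTP_PORT=39080 ./scripts/...
APP_HTTP_PORT="${APP_HTTP_PORT:-80}"

# Illustrative check of the sysctl that makes an unprivileged :80 bind work.
# Falls back to the kernel's classic 1024 boundary when the key is absent.
unpriv_start="$(sysctl -n net.ipv4.ip_unprivileged_port_start 2>/dev/null || echo 1024)"
if (( APP_HTTP_PORT < unpriv_start )) && (( EUID != 0 )); then
  echo "note: port $APP_HTTP_PORT is privileged here; set APP_HTTP_PORT to a high port" >&2
fi
```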
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Appliance e2e / build flows install workloads, build a seed
tarball, prepare-export, and re-boot from the prepared disk -- the
cumulative footprint pushes the 20G default disk into pressure on
the kubelet's image-gc thresholds, which surfaces as flaky pod
evictions mid-test or mid-build.
Bump to 40G everywhere a 20G default sat:
- e2e/qemu_test.go: e2eQEMURuntime overrides DiskSize to 40G so
every qemu e2e test boots with the larger disk by default.
- scripts/appliance-{qemu-to-gcp,build-hetzner,build-virtualbox}.sh
+ scripts/e2e-appliance-{export-import,qemu-to-gcp}.sh: the
generated y-cluster-provision.yaml now sets diskSize: "40G".
- scripts/appliance-qemu-to-gcp.sh: --boot-disk-size on
`gcloud compute instances create` flips from 20GB to 40GB so
the GCE VM doesn't reject the 40G custom image with "Requested
disk size cannot be smaller than the image size".
qcow2 is sparse, so the host-disk footprint only grows with actual
usage; the larger virtual size is a no-cost ceiling. The GCE side
similarly uses a thin-provisioned persistent disk.
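The sparse-size claim is easy to demonstrate in isolation (plain `truncate` instead of `qemu-img`, same principle: the virtual size is a ceiling, host usage tracks written data):

```shell
# Create a file with a 40G virtual size but no allocated blocks.
img="$(mktemp)"
truncate -s 40G "$img"

# Apparent (virtual) size: 40 GiB.
apparent="$(stat -c %s "$img" 2>/dev/null || stat -f %z "$img")"

# Actual on-disk usage: near zero until data is written.
actual_kb="$(du -k "$img" | cut -f1)"

echo "virtual: $apparent bytes, allocated: ${actual_kb}K"
rm -f "$img"
```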
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The appliance-build / e2e scripts each carried a defaults block:

    APP_HTTP_PORT="${APP_HTTP_PORT:-80}"
    APP_API_PORT="${APP_API_PORT:-39443}"
    APP_SSH_PORT="${APP_SSH_PORT:-2229}"

then interpolated those into the heredoc'd y-cluster-provision.yaml.
Three of the four values restate y-cluster's own defaults
(80/6443/2222 in pkg/provision/config); the bash defaults that
DIFFERED (39443 vs 6443; 2229 vs 2222) were chosen for collision
avoidance against an operator's regular y-cluster, but were quiet
duplicates of the same defaulting concept.
Replace the heredoc with a brace block that emits each port field
ONLY when the env var is set. Net behaviour:
- No env override -> minimal YAML; y-cluster fills 2222 +
{6443:6443, 80:80, 443:443}.
- APP_HTTP_PORT=N -> only the host:N -> guest:80 entry lands;
API/SSH still y-cluster-default.
- Multiple set -> all set entries land; requireHostAPIPort
validates that a guest:6443 entry exists.
Display refs (banner curl examples, ssh commands, smoketest
probes) use ${APP_*_PORT:-NN} inline so the printed URL/SSH
command shows the right port whether overridden or default.
YHELP entries reworded from "(default: 80)" to
"(y-cluster default: 80)" so the operator sees who owns the
default.
IMP_HTTP_PORT / IMP_SSH_PORT in e2e-appliance-export-import.sh
left as-is (test-only; the import-side qemu is started directly,
no y-cluster CLI involvement, so y-cluster's defaults don't
apply).
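The brace block could look roughly like this (function name and quoting are illustrative, not the scripts' literal code):

```shell
# Emit a portForwards block only for explicitly set env vars; when none
# are set, emit nothing so y-cluster's own defaults (2222 + 6443/80/443)
# apply untouched.
emit_port_forwards() {
  local entries=()
  [[ -n "${APP_API_PORT:-}"   ]] && entries+=("\"${APP_API_PORT}:6443\"")
  [[ -n "${APP_HTTP_PORT:-}"  ]] && entries+=("\"${APP_HTTP_PORT}:80\"")
  [[ -n "${APP_HTTPS_PORT:-}" ]] && entries+=("\"${APP_HTTPS_PORT}:443\"")
  (( ${#entries[@]} > 0 )) || return 0
  echo "portForwards:"
  local e
  for e in "${entries[@]}"; do echo "  - $e"; done
}
```

With no overrides the function prints nothing, which is exactly the "minimal YAML" case above.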
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with APP_HTTP_PORT / APP_API_PORT: a new
APP_HTTPS_PORT env var lets operators override the host port
forwarded to guest 443. Unset means "let y-cluster apply its
default" -- the YAML still omits the field when no port var is
set, which matches the behaviour for the other ports.
Without this, an operator who overrode any one of {HTTP, API}
silently lost 443 forwarding: the YAML's portForwards block
became canonical and didn't include 443, whereas previously
y-cluster's [6443:6443, 80:80, 443:443] default applied only
when the bash emitted no portForwards at all.
The host:guest match keeps standard ports inside the appliance
unchanged; the host-side ip_unprivileged_port_start sysctl on
modern Linux distros allows binding 443 without root the same
way 80 already does.
YHELP entries updated to surface the new knob.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a post-deploy step that offers to stand up a GCP regional
External Application Load Balancer in front of the appliance VM
with a self-signed cert covering operator-supplied FQDNs.
Idempotent (describe-then-create) so re-runs converge; teardown
integrated into the existing teardown subcommand.
Why a self-signed cert and a prompt-not-default
The cert-manager → upload-real-cert path is the eventual
production shape, but for the dev loop a self-signed cert lets
the operator verify the LB stack + HTTPRoute hostname matching
without DNS / CA dependencies. The opt-in prompt exists because
the LB is a billing meter (forwarding rule ~hourly, plus a
reserved IP) the operator should deliberately accept; we don't
want a forgotten ASSUME_YES run to silently provision one.
Operator UX
- Default: prompts after the HTTP probe with a one-paragraph
explainer (cost, self-signed cert, HTTPRoute prerequisite),
accepts comma-separated FQDNs, empty skips.
- TLS_DOMAINS env var preset: skips the prompt and runs.
- ASSUME_YES alone: skips silently (unattended e2e shouldn't
surprise-bill).
- Final banner prints the LB IP + a single /etc/hosts line
covering all FQDNs, marks the cert SELF-SIGNED, points at
the gcloud commands to swap in a real cert later.
Resources, all named ${NAME}-tls-*:
- proxy-only subnet (reuses any ACTIVE one in the region;
  creates per-build only when none exists)
- static regional IP
- SSL cert (uploaded, self-signed)
- HTTP health check on /q/envoy/echo
- zonal NEG with the VM as endpoint
- backend service (EXTERNAL_MANAGED) + add-backend
- URL map (default-service points at the backend)
- target HTTPS proxy
- forwarding rule on :443
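The describe-then-create idempotency can be factored as a tiny helper; the gcloud invocations in the usage comment are assumptions about naming, shown for shape only:

```shell
# Run the create command only when the describe command fails, so
# re-running the whole flow converges instead of erroring on
# already-existing resources.
ensure_resource() {
  local describe_cmd="$1" create_cmd="$2"
  if ! eval "$describe_cmd" >/dev/null 2>&1; then
    eval "$create_cmd"
  fi
}

# Hypothetical usage for the static regional IP:
# ensure_resource \
#   "gcloud compute addresses describe ${NAME}-tls-ip --region=$REGION" \
#   "gcloud compute addresses create ${NAME}-tls-ip --region=$REGION"
```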
Teardown
do_tls_teardown is invoked from the existing do_teardown so a
plain `appliance-qemu-to-gcp.sh teardown` cleans up the LB
stack alongside the VM/image/object/disk. Order forces the
forwarding rule first (stops the meter), then proxy / url-map /
backend / NEG / health-check / cert / IP. Subnet last and only
when it's the per-build one (we never delete a reused regional
subnet). Each delete is idempotent: missing resources are not
errors. The `Will DELETE:` inventory now lists `${NAME}-tls-*`
when a forwarding rule of that shape exists.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes for the GCP appliance smoke flow:
1. do_tls_frontend now creates a :80 forwarding rule that 301s
to :443. Previously the function set up only a :443 listener,
so any `curl http://<lb-ip>/...` against the LB IP hung at TCP
connect (no listener on 80). Hangs from `curl ... http://...`
were diagnosed against the live ext-app01-* LB stack which
has the same shape.
Mechanism: GCP regional EXTERNAL_MANAGED URL maps can either
forward (defaultService) or redirect (defaultUrlRedirect),
not both, so the redirect needs its own URL map. The chain:
:80 fwd -> tls-http-proxy -> tls-redirect URL map (301 to https)
:443 fwd -> tls-proxy -> tls-urlmap (existing, ->backend)
`gcloud compute url-maps create` has no flag for default-redirect,
hence the `url-maps import` from a heredoc.
Hostname-agnostic on both ports: every request, any Host:,
either redirects (on :80) or forwards to the VM (on :443).
The VM's envoy-gateway is the only Host-aware hop.
do_tls_teardown grew matching delete calls in dependency order
(forwarding rules -> proxies -> URL maps) so re-runs converge
cleanly.
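A sketch of the imported redirect URL map body (the resource name and exact field set are assumptions following GCP's UrlMap schema; the real heredoc may differ):

```yaml
# Fed to `gcloud compute url-maps import <name> --region <region>` via heredoc.
name: NAME-tls-redirect
defaultUrlRedirect:
  httpsRedirect: true                             # redirect to https, same host
  redirectResponseCode: MOVED_PERMANENTLY_DEFAULT # 301
  stripQuery: false                               # preserve the query string
```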
2. The post-deploy probe at the end of the GCP stage now
enumerates HTTPRoute + GRPCRoute hostnames via SSH +
`sudo k3s kubectl ... -o jsonpath` and probes each FQDN
through `--resolve <fqdn>:80:$PUBLIC_IP`. Replaces the
single-path `/q/envoy/echo` probe -- which only verified
"envoy answers anything", not "every advertised route is
reachable end-to-end".
Reachability == any HTTP status code (2xx/3xx/4xx/5xx),
not 200: a route that legitimately answers 302 / 401 / 404 is
still proof the firewall + klipper-lb + envoy-gateway chain
is working. Only `000` (timeout / refused) counts as
unreachable. On any unreachable route the script logs a
warning with diagnostic suggestions (firewall source-ranges
narrowed, backend Service not Ready, workload still rolling
out) and continues -- info-level surfacing today, gating /
strict mode is a deliberately deferred follow-up.
Falls back to the old `/q/envoy/echo` probe when the cluster has
no Gateway-bound routes (a workload that hasn't applied yet).
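The reachable-vs-000 distinction can be isolated as follows (helper names are illustrative, not the script's literal code):

```shell
# Any real HTTP status -- 2xx through 5xx -- proves the firewall ->
# klipper-lb -> envoy-gateway chain answered; curl writes 000 only on
# timeout / connection refused, the sole "unreachable" case.
route_reachable() {
  local code="$1"
  [[ "$code" != "000" ]]
}

# Illustrative probe of one advertised hostname through the VM's IP.
probe_route() {
  local fqdn="$1" ip="$2" code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
          --resolve "${fqdn}:80:${ip}" "http://${fqdn}/")" || true
  if route_reachable "$code"; then
    echo "$fqdn: HTTP $code"
  else
    echo "WARN: $fqdn unreachable via $ip (got 000)" >&2
    return 1
  fi
}
```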
Verified end-to-end against the live appliance: 4 routes
enumerated (dev.yolean.net, ext-app01.yolean.se, keycloak-admin,
keycloak-admin.ext-app01.yolean.se), all returned HTTP 302 on
the first attempt. The redirect chain itself is intentionally
NOT exercised against ext-app01-* in this commit (would require
mutating an in-use LB the operator owns); it lands on the next
do_tls_frontend run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…own-side PRESERVED message
State preservation across appliance redeploys is the overarching
design goal of the data-seed mechanism (commit f69addf +
APPLIANCE_MAINTENANCE.md). What was missing on the
operator-facing side: the QA-flow build script silently reused the
persistent disk on every redeploy, masking the seed-skip from
build-time-only operators who expected each run to validate the
seed end-to-end. Conversely, the production "preserve customer
state across upgrades" intent was never written down where it
mattered (the operator only saw a generic banner at deploy time,
not after teardown when the disk-keep decision is most actionable).
Changes:
- Build-flow `--reuse-disk=true|false` with an interactive
prompt (default Y -- preserve, matching the design goal).
On `--reuse-disk=false` the script delete-and-recreates the
persistent disk so the next boot's data-seed unit lands the
OS image's seed cleanly. Non-TTY callers MUST pass the flag
explicitly; ASSUME_YES + missing flag fails fast rather than
silently picking a default for an irreversible decision.
- Teardown `--keep-disk=true|false`. Default behavior is
unchanged (keep). Legacy `--delete-data-disk` continues to
work as `--keep-disk=false` with a one-line deprecation
notice, so any existing automation isn't broken.
- Decoupled the new disk decisions from the existing
`confirm()` helper (which consults ASSUME_YES). New
`prompt_yes_default()` helper requires a TTY or an
explicit flag, never falls back to ASSUME_YES. The umbrella
ASSUME_YES still covers the existing 'Proceed?' + TLS-LB
prompts.
- Moved the "Persistent data disk PRESERVED" message from
the build-success banner to the END of teardown when the
disk was kept. That's the moment the operator's mental
model needs the reminder ('what survived?' + 'how do I
delete it later?'). The build success block keeps a brief
one-line pointer to teardown's message instead of carrying
the full paragraph.
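A sketch of the `prompt_yes_default()` contract described above (the argument shape is an assumption; the key property is that it never consults ASSUME_YES):

```shell
# Returns 0 (yes) / 1 (no). An explicit flag value short-circuits; with
# no flag a TTY is required -- unattended callers must decide explicitly,
# and ASSUME_YES is deliberately ignored for irreversible choices.
prompt_yes_default() {
  local question="$1" flag="${2-}"   # flag: "", "true", or "false"
  case "$flag" in
    true)  return 0 ;;
    false) return 1 ;;
  esac
  if [[ ! -t 0 ]]; then
    echo "ERROR: '$question' needs a TTY or an explicit true/false flag" >&2
    return 2
  fi
  local reply
  read -r -p "$question [Y/n] " reply
  [[ -z "$reply" || "$reply" =~ ^[Yy] ]]   # default Y (preserve)
}
```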
Verified end-to-end against yo-sre-appliance-qa over the past
two days: --reuse-disk=false correctly recreates the disk and
the data-seed unit extracts the image's seed onto it; the
recreated disk + grastate.dat workaround round-tripped
mariadb's keycloak.REALM rows through prepare-export -> seed
-> fresh-disk -> boot, with `keycloak/auth/realms/ext-bfv01`
returning 200 from the resulting cluster.
Two follow-up fixes are lined up but not in this commit (kept in
the working tree for a separate commit): a `return 0` belt at the
end of do_tls_teardown so its trailing `[[ -n "$subnet" ]] && ...`
doesn't leak a non-zero exit and abort the caller before the new
PRESERVED block fires; and the revert of the route-enumeration
block that this same teardown-issue debugging surfaced as
post-import SSH+kubectl scope-creep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The build VM occasionally OOMs during heavier customer workloads applied at PROMPT 1 (mariadb + kafka + envoy + the bundled controllers all in 4GB is tight). 8GB matches the y-cluster default for stand-alone provisions, but the qemu-to-gcp script was overriding it down to 4GB to keep the host's headroom; the headroom is fine on the build host, so lift the override.

The y-cluster default itself is unchanged (8192 in config.QEMUConfig.applyDefaults), so other provisioner flows (multipass, docker, plain qemu) are not affected. Disk size stays at 40GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #20 changed prepare-export to require the cluster RUNNING: its live phase clears the per-deploy dns-hint-ip annotation and snapshots reconciled Gateway state into <cacheDir>/<name>-gateway-state.json (both need the apiserver up). prepare-export then stops the VM itself before its offline (virt-customize) phase.

The plan called for dropping `y-cluster stop` from the script ahead of prepare-export, but the script edit never landed. The result: every run of appliance-qemu-to-gcp.sh would stop the cluster, then crash with "VM not running; start the cluster first" when prepare-export ran against the stopped VM.

Drop the explicit stop call. Update the docstring stage list to reflect that prepare-export does its own stop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the parallel-list footgun: today the operator declares hostnames in HTTPRoute manifests AND in TLS_DOMAINS, and drift between the two means the LB cert covers hostnames the cluster doesn't serve, or vice versa.

Setting TLS_DOMAINS=auto now resolves the FQDN list by calling `y-cluster gateway hostnames --csv` against the just-provisioned cluster, immediately after PROMPT 1 confirmation. The cluster's reconciled HTTPRoute / GRPCRoute hostnames become the LB cert SAN list -- one source of truth. Resolution runs BEFORE prepare-export because by the TLS LB stage (after prepare-export + GCP deploy) the local apiserver is gone.

Other TLS_DOMAINS values (literal CSV / empty / prompt) are still handled at the LB stage as before. An empty result aborts with an explicit error (the operator asked for auto and none were found, which means something is wrong with the cluster state).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
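The auto-resolution step could be sketched as (the function name is illustrative; `y-cluster gateway hostnames --csv` is the subcommand the commit names):

```shell
# Resolve TLS_DOMAINS=auto from the just-provisioned cluster's reconciled
# HTTPRoute / GRPCRoute hostnames -- must run before prepare-export, while
# the local apiserver is still up.
resolve_tls_domains() {
  if [[ "${TLS_DOMAINS:-}" == "auto" ]]; then
    TLS_DOMAINS="$(y-cluster gateway hostnames --csv)"
    if [[ -z "$TLS_DOMAINS" ]]; then
      echo "ERROR: TLS_DOMAINS=auto but the cluster reports no hostnames" >&2
      return 1
    fi
  fi
}
```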
…hooks

The unattended flow had ASSUME_YES + TLS_DOMAINS=auto landed already, but no work-doing hook in PROMPT 1's hands-on window. Result: a build with ASSUME_YES=1 reached prepare-export with only the y-cluster echo HTTPRoute applied; TLS_DOMAINS=auto then aborted because the cluster had no non-wildcard hostnames to derive from.

Add the two hooks documented in specs/y-cluster/FEATURE_APPLIANCE_AUTOMATED_FLOW.md:
- APPLIANCE_SEED_CMD runs after echo install, before PROMPT 1. Customer workloads applied here populate /data/yolean for the data-seed extraction AND give TLS_DOMAINS=auto real hostnames.
- APPLIANCE_VERIFY_CMD runs at the end, after the GCP deploy + optional TLS LB. Receives the LB IP / VM IP / domains via the Y_CLUSTER_CURRENT_* surface so a remote probe can curl --resolve through the deployed VM without /etc/hosts.

Both fire via `bash -c "$cmd"` so the operator-supplied string can pipe / chain / cd freely. Both export a single, consistent Y_CLUSTER_CURRENT_* env surface (via the new current_env helper) -- a verify script running `printenv | grep ^Y_CLUSTER_CURRENT_` sees the full surface either way; vars not yet known at the seed hook (REMOTE_VM_IP, etc.) are exported as empty strings. Non-zero exit aborts under set -e. Local cluster / VM / LB stay up for inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
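A sketch of the hook-execution shape (current_env's exact variable list is an assumption; the names follow the Y_CLUSTER_CURRENT_* surface the commit describes):

```shell
# Export the full env surface every time; vars not yet known at this
# stage are exported as empty strings so consumers see a stable set.
current_env() {
  export Y_CLUSTER_CURRENT_LB_IP="${LB_IP:-}"
  export Y_CLUSTER_CURRENT_REMOTE_VM_IP="${REMOTE_VM_IP:-}"
  export Y_CLUSTER_CURRENT_DOMAINS="${TLS_DOMAINS:-}"
}

# Run an operator-supplied hook string; bash -c lets it pipe / chain / cd.
# A non-zero exit propagates, aborting the caller under set -e.
run_hook() {
  local cmd="$1"
  [[ -n "$cmd" ]] || return 0
  current_env
  bash -c "$cmd"
}
```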
…e stack)

Observed an appliance build that ran fine for ~2h at 91-93% memory on a 4 GiB e2-medium, then died at 100% CPU / 3807 MiB used: ssh banner exchange timed out, :443 + :6443 went REFUSED while :80 kept LISTEN with the userspace too starved to respond. Classic OOM spiral. The full appliance stack (k3s + containerd + keycloak + envoy gateway + envoy proxy + mysql + kafka) sits within ~300 MiB of the 4 GiB ceiling at idle; any workload spike pushes it over.

e2-standard-2 (2 vCPU / 8 GiB) gives the stack the headroom it needs. GCE machine types bundle CPU + memory, so there's no separate memory override -- that's spelled out in both the help text and the default-assignment comment so the next operator reading either spot sees why we don't surface a GCP_MEMORY knob. GCP_MACHINE_TYPE stays as the escape hatch for highmem / larger shapes.
The previous commit added "e2-medium's" and "there's" inside the single-quoted YHELP block. A single-quoted bash string cannot contain a literal single quote, so each apostrophe terminated the string mid-block; the text that resumed unquoted ("4 GiB OOMs ...") got parsed as a command, and any consumer that sourced or executed the help block saw "line 76: 4: command not found".

Reworded to avoid the apostrophes entirely. `bash -n` parses the file clean and --help renders the section as intended.
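For reference, bash offers an escape hatch here: a heredoc with a quoted delimiter takes its body literally, so apostrophes are safe without rewording (the help-text wording below is invented for illustration):

```shell
# Apostrophes are safe inside a quoted heredoc (<<'EOF') -- no expansion,
# no quote parsing -- unlike a single-quoted string, which a literal
# apostrophe terminates.
YHELP="$(cat <<'EOF'
GCP_MACHINE_TYPE  GCE machine type (default e2-standard-2); e2-medium's
                  4 GiB is too tight for the full appliance stack.
EOF
)"
```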
Both file-path-via-env-var inputs surfaced the same foot-gun:
a malformed value passed the existence check but failed deep
inside the tool we shelled out to, with a less helpful message:
- GCP_KEY pointing at a truncated / wrong-format JSON
  (e.g. a re-exported key that lost its private_key during
  a copy-paste) only failed at `gcloud auth
  activate-service-account`, by which point the operator
  had already proven the file exists. Now `jq -e` checks
  that the four fields GCP requires for service-account
  auth are populated -- type=service_account, project_id,
  client_email, private_key -- and errors with the missing
  field names so the operator knows what to fix.
- H_S3_REGION accepted any string and only surfaced "could
not resolve host" when the upload URL hit a non-existent
endpoint hostname. The help text already documents the
valid set (fsn1, hel1, nbg1); now the script enforces
it at config-load time with a message naming the valid
values.
Both checks fire BEFORE any cloud-side state change. Adds no
new dependency: jq is already required by the broader
appliance flow.
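The H_S3_REGION check is simple enough to sketch in pure bash (function name and error wording are assumptions; the valid set is the one the help text documents):

```shell
# Enforce the documented Hetzner Object Storage regions at config-load
# time, before any cloud-side state change, naming the valid values.
valid_regions=(fsn1 hel1 nbg1)
check_h_s3_region() {
  local r
  for r in "${valid_regions[@]}"; do
    [[ "$1" == "$r" ]] && return 0
  done
  echo "ERROR: H_S3_REGION='$1' is not one of: ${valid_regions[*]}" >&2
  return 1
}
```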
To get bearable dev loops I had to keep hosting-specific scripts in this repo.
Uses #19 and #21 to set up actual clusters, given an external "init" (the script that applies stuff) and "verify" (smoke test before export).