User request
Add CI for PR and push to main which applies all 3 stacks (k8s, system, platform) and makes sure all services deployed in platform stack are healthy.
Researcher specification (Emerson Gray)
- Triggers: pull_request targeting main; push to main.
- Runner/permissions: ubuntu-latest; Docker available; minimal permissions (contents: read). No secrets required if agynio/platform remains public.
- Tooling: install k3d CLI; setup Terraform 1.6.6; install kubectl (v1.28.x). Set KUBECONFIG to stacks/k8s/.kube/agyn-local-kubeconfig.yaml.
- Apply sequence:
- stacks/k8s: terraform init/apply (-input=false, -auto-approve)
- stacks/system: terraform init/apply
- stacks/platform: terraform init/apply; provide TF vars in CI (e.g., TF_VAR_platform_db_password, TF_VAR_litellm_db_password, TF_VAR_litellm_master_key, TF_VAR_litellm_salt_key, TF_VAR_docker_runner_shared_secret) even though defaults exist, for clarity.
- Health checks after platform apply:
- Kubernetes readiness in namespace platform: wait for Jobs complete; rollout status for Deployments and StatefulSets; ensure all pods are Running or Completed (15m timeout).
- Optional: Argo CD Application CRs (in argocd namespace) are Synced and Healthy (15m timeout).
- Cleanup for PR: destroy in reverse order (platform → system → k8s) with always() to avoid leaks. Preserve cluster on push to main.
- Concurrency: use concurrency groups (bootstrap-pr-${{ github.event.pull_request.number }} for PR; bootstrap-main for main) with cancel-in-progress true for PR.
- Timeouts: job 45m; health checks 10–15m; use TF_IN_AUTOMATION and lock-timeout 10m.
- Risks/assumptions: GH runner resources may be tight; image pulls can be slow; Argo reconciliation may lag. No secrets required unless platform charts become private.
Acceptance criteria
- On pull_request to main and push to main, CI applies k8s, system, platform stacks.
- CI installs required tools (Docker verified, k3d, Terraform, kubectl) and sets KUBECONFIG.
- CI verifies platform stack health by waiting for Jobs, rollout statuses, and pods readiness; optionally checks Argo Application health.
- PR runs perform cleanup (destroy platform, system, k8s) even on failure; main runs do not destroy.
- Concurrency prevents overlapping runs; clear, grouped logs are present.
Implementation plan
- Modify existing workflow .github/workflows/bootstrap.yml to add platform apply and health-check steps; add PR-only destroy for platform; install kubectl; set timeouts and concurrency.
- Keep Terraform versions aligned (1.6.6). No code changes to stacks are required.
User request
Add CI for PR and push to main which applies all 3 stacks (k8s, system, platform) and makes sure all services deployed in platform stack are healthy.
Researcher specification (Emerson Gray)
Acceptance criteria
Implementation plan