Skip to content

Add CI to apply k8s, system, and platform stacks; verify platform services health on PR and main #21

@rowan-stein

Description

@rowan-stein

User request

Add CI for PR and push to main which applies all 3 stacks (k8s, system, platform) and makes sure all services deployed in platform stack are healthy.

Researcher specification (Emerson Gray)

  • Triggers: pull_request targeting main; push to main.
  • Runner/permissions: ubuntu-latest; Docker available; minimal permissions (contents: read). No secrets required if agynio/platform remains public.
  • Tooling: install k3d CLI; setup Terraform 1.6.6; install kubectl (v1.28.x). Set KUBECONFIG to stacks/k8s/.kube/agyn-local-kubeconfig.yaml.
  • Apply sequence:
    1. stacks/k8s: terraform init/apply (-input=false, -auto-approve)
    2. stacks/system: terraform init/apply
    3. stacks/platform: terraform init/apply; provide TF vars in CI (e.g., TF_VAR_platform_db_password, TF_VAR_litellm_db_password, TF_VAR_litellm_master_key, TF_VAR_litellm_salt_key, TF_VAR_docker_runner_shared_secret) even though defaults exist, for clarity.
  • Health checks after platform apply:
    • Kubernetes readiness in namespace platform: wait for Jobs complete; rollout status for Deployments and StatefulSets; ensure all pods are Running or Completed (15m timeout).
    • Optional: Argo CD Application CRs (in argocd namespace) are Synced and Healthy (15m timeout).
  • Cleanup for PR: destroy in reverse order (platform → system → k8s) with always() to avoid leaks. Preserve cluster on push to main.
  • Concurrency: use concurrency groups (bootstrap-pr-${{ github.event.pull_request.number }} for PR; bootstrap-main for main) with cancel-in-progress true for PR.
  • Timeouts: job 45m; health checks 10–15m; use TF_IN_AUTOMATION and lock-timeout 10m.
  • Risks/assumptions: GH runner resources may be tight; image pulls can be slow; Argo reconciliation may lag. No secrets required unless platform charts become private.

Acceptance criteria

  • On pull_request to main and push to main, CI applies k8s, system, platform stacks.
  • CI installs required tools (Docker verified, k3d, Terraform, kubectl) and sets KUBECONFIG.
  • CI verifies platform stack health by waiting for Jobs, rollout statuses, and pods readiness; optionally checks Argo Application health.
  • PR runs perform cleanup (destroy platform, system, k8s) even on failure; main runs do not destroy.
  • Concurrency prevents overlapping runs; clear, grouped logs are present.

Implementation plan

  • Modify existing workflow .github/workflows/bootstrap.yml to add platform apply and health-check steps; add PR-only destroy for platform; install kubectl; set timeouts and concurrency.
  • Keep Terraform versions aligned (1.6.6). No code changes to stacks are required.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions