Add CI to apply k8s, system, and platform stacks; verify platform services health on PR and main

## User request
Add CI for PR and push to main which applies all 3 stacks (k8s, system, platform) and makes sure all services deployed in platform stack are healthy.

## Researcher specification (Emerson Gray)
- Triggers: pull_request targeting main; push to main.
- Runner/permissions: ubuntu-latest; Docker available; minimal permissions (contents: read). No secrets required if agynio/platform remains public.
- Tooling: install k3d CLI; setup Terraform 1.6.6; install kubectl (v1.28.x). Set KUBECONFIG to stacks/k8s/.kube/agyn-local-kubeconfig.yaml.
- Apply sequence:
  1) stacks/k8s: terraform init/apply (-input=false, -auto-approve)
  2) stacks/system: terraform init/apply
  3) stacks/platform: terraform init/apply; provide TF vars in CI (e.g., TF_VAR_platform_db_password, TF_VAR_litellm_db_password, TF_VAR_litellm_master_key, TF_VAR_litellm_salt_key, TF_VAR_docker_runner_shared_secret) even though defaults exist, for clarity.
- Health checks after platform apply:
  - Kubernetes readiness in namespace platform: wait for Jobs complete; rollout status for Deployments and StatefulSets; ensure all pods are Running or Completed (15m timeout).
  - Optional: Argo CD Application CRs (in argocd namespace) are Synced and Healthy (15m timeout).
- Cleanup for PR: destroy in reverse order (platform → system → k8s) with always() to avoid leaks. Preserve cluster on push to main.
- Concurrency: use concurrency groups (bootstrap-pr-${{ github.event.pull_request.number }} for PR; bootstrap-main for main) with cancel-in-progress true for PR.
- Timeouts: job 45m; health checks 10–15m; use TF_IN_AUTOMATION and lock-timeout 10m.
- Risks/assumptions: GH runner resources may be tight; image pulls can be slow; Argo reconciliation may lag. No secrets required unless platform charts become private.

## Acceptance criteria
- On pull_request to main and push to main, CI applies k8s, system, platform stacks.
- CI installs required tools (Docker verified, k3d, Terraform, kubectl) and sets KUBECONFIG.
- CI verifies platform stack health by waiting for Jobs, rollout statuses, and pods readiness; optionally checks Argo Application health.
- PR runs perform cleanup (destroy platform, system, k8s) even on failure; main runs do not destroy.
- Concurrency prevents overlapping runs; clear, grouped logs are present.

## Implementation plan
- Modify existing workflow .github/workflows/bootstrap.yml to add platform apply and health-check steps; add PR-only destroy for platform; install kubectl; set timeouts and concurrency.
- Keep Terraform versions aligned (1.6.6). No code changes to stacks are required.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CI to apply k8s, system, and platform stacks; verify platform services health on PR and main #21

User request

Researcher specification (Emerson Gray)

Acceptance criteria

Implementation plan

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add CI to apply k8s, system, and platform stacks; verify platform services health on PR and main #21

Description

User request

Researcher specification (Emerson Gray)

Acceptance criteria

Implementation plan

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions