feat: deploy app layer (Coder, Keycloak, GitLab, AI Gateway) on GovCloud EKS #5
Merged
…oud EKS
Brings up and validates the full demo stack on the live us-gov-west-1 cluster:
- Coder v2.34.0 (Helm) with Keycloak OIDC SSO, AI Governance license, and AI Gateway providers (anthropic + anthropic-bedrock via IRSA).
- Keycloak 26.6.3 with realm `coder` import (client + demo user).
- GitLab CE 19.0.1 single-container (embedded Postgres).
- claude-code workspace template (Coder Agents + Claude Code + AgentAPI).
- Platform layer: ingress-nginx + internet-facing NLB (AWS LB Controller), EBS CSI IRSA, gp3 StorageClass, RDS roles/dbs, workspace RBAC.
Fixes applied during bring-up:
- ingress-nginx: aws-load-balancer-type=external (standard EKS, not Auto Mode).
- keycloak realm: drop non-standard _comment_* keys that break realm import.
- coder values: AI provider name must be `anthropic` (AI Gateway routes by provider name; the claude-code module hardcodes /api/v2/aibridge/anthropic).
- claude-code template: allow_privilege_escalation=true so the agentapi module can sudo-install to /usr/local/bin.
- gitlab: gp3 StorageClass; remove mattermost key (removed in GitLab 19.0); add VPC CIDR to monitoring_whitelist so kubelet health probes pass.
NOTE: EKS Auto Mode node provisioning is broken in this GovCloud account, so the cluster runs as standard EKS. See STATUS.md and deploy/platform/README.md for the deviations to reconcile into Terraform.
Authored by Coder Agents on behalf of @ausbru87.
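The monitoring_whitelist fix above comes down to a CIDR membership test: the kubelet probe arrives from a node IP, so that IP has to fall inside one of the whitelisted networks. A minimal sketch of that check follows. The CIDR and node IP are hypothetical (the real VPC CIDR is not stated in this PR), and the loopback-only starting whitelist is an assumption about the defaults.

```python
import ipaddress


def probe_allowed(source_ip: str, whitelist: list[str]) -> bool:
    """Return True if source_ip falls inside any whitelisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in whitelist)


# Hypothetical values for illustration only.
LOOPBACK_ONLY = ["127.0.0.0/8"]                 # assumed starting whitelist
WITH_VPC = LOOPBACK_ONLY + ["10.0.0.0/16"]      # assumed VPC CIDR
NODE_IP = "10.0.42.17"                          # assumed kubelet source IP

print(probe_allowed(NODE_IP, LOOPBACK_ONLY))  # False: probe would be rejected
print(probe_allowed(NODE_IP, WITH_VPC))       # True: probe passes
```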
…rnal auth
Disable Coder's built-in github.com providers and route git through the in-cluster GitLab instead, so no auth path leaves the GovCloud boundary.
- CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE=false disables the default GitHub login (was enabled out-of-the-box via Coder's hosted GitHub app).
- Configure a GitLab external-auth provider (CODER_EXTERNAL_AUTH_0_*) against gitlab.usgov.coderdemo.io using an instance-wide OAuth app; id/secret come from Secret coder-external-auth. Declaring an explicit external-auth provider also suppresses Coder's default github.com external-auth injection.
Login is now Keycloak SSO + local password owner only.
Authored by Coder Agents on behalf of @ausbru87.
Harden the demo Coder deployment along three axes the user requested:
- Every workspace template now requires in-boundary GitLab login. The claude-code template declares `data "coder_external_auth" "gitlab"`, so a workspace must complete the GitLab OAuth flow before the agent is ready; the agent git credential helper then injects a short-lived token for clone/fetch/push. No PATs/SSH keys in the workspace, no out-of-boundary auth path.
- Disable path-based workspace apps (CODER_DISABLE_PATH_APPS=true). All templates serve apps with subdomain=true, so apps are now subdomain-only and the same-origin path-app surface is removed.
- Add scripts/set-appearance.sh to set the green "UNCLASSIFIED - USGOVCLOUD" classification banner. Appearance is a runtime DB setting (premium-gated), not a Helm value, so the script makes it reproducible and idempotent.
Verified live: template version /external-auth lists gitlab as required, deployment config disable_path_apps=true, GET /api/v2/appearance shows the banner.
Generated by Coder Agents.
Add docs/as-built/, the engineering record of what is deployed and how it is configured, produced by a fan-out of documentation agents and cross-checked against live read-only state:
- 00-overview: architecture, component map, topology diagram, core flows.
- 10-infrastructure: GovCloud substrate (VPC, EKS standard-not-Auto-Mode and why, node group, IRSA, RDS, ECR, Route53, ACM, NLB).
- 20-platform-kubernetes: namespaces, ingress, storage, workspace RBAC, Secrets.
- 30-coder-control-plane: values.yaml walkthrough, OIDC SSO, auth hardening, licensing, appearance.
- 40-identity-keycloak: realm coder, OIDC client, the no-group-sync gap.
- 50-gitlab-scm: in-boundary GitLab, the OAuth app, per-workspace git auth.
- 60-ai-gateway: AI Bridge providers, name-based routing, end-to-end flow, remaining action.
- 70-workspace-templates: the claude-code template and required GitLab auth.
- 80-iac-vs-imperative: declarative (Terraform) vs imperative ledger plus a reconciliation backlog.
- 90-operations-runbook: day-2 ops and known gaps.
Cross-linked from docs/00-INDEX.md and STATUS.md. Verified emdash/endash-free.
Generated by Coder Agents.
Model a true multi-tenant hierarchy in Keycloak and sync it into Coder via OIDC
IdP sync (organization + group + role), with personas for the demo.
Organizations: coder (display "Platform Engineering"), alpha ("Mission Partner
Alpha"), bravo ("Mission Partner Bravo").
Keycloak (realm coder): a hierarchical group tree plus one Group Membership
mapper emitting a full-path `groups` claim (ID + access + userinfo), and 8
persona users. Coder runs runtime per-org IdP sync (not legacy env vars):
- organization sync: field=groups, assign_default=false, /platform|/alpha|/bravo
- group sync (per org): team subgroups -> pre-created Coder groups
- role sync (per org): role subgroups -> organization-admin /
organization-template-admin / organization-auditor
Tenant orgs are functional: an org-scoped provisioner key + external provisioner
daemon per tenant (deploy/coder/provisioners.yaml, reusing the coder SA), and
the claude-code template pushed into all three orgs.
Verified end to end with scripts/verify-oidc-login.py: a real Keycloak login per
persona lands them in the correct org(s), group(s), and role(s), with tenant
isolation (Alpha vs Bravo vs Platform) and a cross-tenant ISSO/auditor.
New idempotent scripts:
- scripts/setup-keycloak-hierarchy.py (Keycloak Admin REST API)
- scripts/setup-coder-idp-sync.py (Coder API: orgs, groups, sync, no secrets)
- scripts/verify-oidc-login.py (real OIDC login -> org/role/group report)
Docs: docs/as-built/45-idp-sync-personas.md; updated 40-identity-keycloak.md,
as-built README, and STATUS.md.
Generated by Coder Agents.
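As a rough illustration of the organization sync described in this commit, the sketch below maps values of the `groups` claim to Coder organizations. It assumes claim values are compared as exact strings and that nothing is assigned when no value matches (the assign_default=false behavior). The `ORG_SYNC` table, the function name, and the subgroup paths in the examples are hypothetical, not taken from the real scripts.

```python
# Hypothetical mapping from top-level Keycloak group path to Coder org name,
# following the /platform|/alpha|/bravo layout named in the commit message.
ORG_SYNC = {"/platform": "coder", "/alpha": "alpha", "/bravo": "bravo"}


def orgs_for_claim(groups: list[str]) -> list[str]:
    """Return the org names a login would be synced into.

    Assumption: only exact matches on the claim value count, and with
    assign_default=false an unmatched user lands in no organization.
    """
    return sorted({ORG_SYNC[g] for g in groups if g in ORG_SYNC})


# Subgroup paths below are illustrative only.
print(orgs_for_claim(["/alpha", "/alpha/roles/org-admins"]))  # ['alpha']
print(orgs_for_claim(["/platform", "/alpha", "/bravo"]))      # ['alpha', 'bravo', 'coder']
print(orgs_for_claim(["/unrelated"]))                         # []
```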
Move the demo's runtime secrets to AWS Secrets Manager as the source of truth and sync them into Kubernetes with the External Secrets Operator over IRSA, so no secret material lives in git or in a local file.
- Mirror the ESO image into ECR (scripts/images.txt) and deploy ESO (chart 2.6.0, ns external-secrets) with deploy/platform/external-secrets/values.yaml.
- IRSA role usgov-coderdemo-external-secrets: least-privilege secretsmanager:GetSecretValue/DescribeSecret on usgov-coderdemo/* only, no static keys. Codified in terraform/secrets-hardening.tf.
- Migrate the 9 runtime app secrets (coder/keycloak/gitlab) into ASM with scripts/migrate-secrets-to-asm.py (values passed via mode-600 temp files).
- ClusterSecretStore aws-secretsmanager + one ExternalSecret per app secret (dataFrom extract, creationPolicy Owner). ESO adopted the existing Secrets in place with byte-identical data (no app disruption); store Valid, all 9 SecretSynced; delete/recreate recovery verified.
- EKS Secrets envelope encryption with a customer-managed KMS key is codified in terraform/secrets-hardening.tf but NOT applied (irreversible re-encrypt; needs a maintenance decision).
Docs: docs/as-built/85-secrets-management.md; updated 80-iac-vs-imperative.md, the example secret files, STATUS.md, and the docs index.
Generated by Coder Agents.
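One way to check a "byte-identical" adoption like the one claimed above is to fingerprint each Secret's data before and after the operator takes ownership and compare the digests. The sketch below is an illustration under that assumption, not the procedure actually used. The `fingerprint` helper and the sample keys and values are hypothetical.

```python
import hashlib


def fingerprint(data: dict[str, bytes]) -> str:
    """SHA-256 over the sorted key/value pairs of a Secret's data.

    Each part is length-prefixed so that different key/value splits
    cannot collide into the same byte stream.
    """
    digest = hashlib.sha256()
    for key in sorted(data):
        for part in (key.encode("utf-8"), data[key]):
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
    return digest.hexdigest()


# Hypothetical secret contents for illustration.
before = {"password": b"s3cr3t", "username": b"coder"}
after_adoption = {"username": b"coder", "password": b"s3cr3t"}
tampered = {"username": b"coder", "password": b"s3cr3t\n"}

print(fingerprint(before) == fingerprint(after_adoption))  # True: key order is irrelevant
print(fingerprint(before) == fingerprint(tampered))        # False: one extra byte changes it
```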
Add three design-only plans (nothing applied to the live environment) with companion GitHub issues, plus an index.
- plans/observability-aws-native.md: the production AWS-native target the in-cluster Prometheus/Grafana stack should evolve into (Amazon Managed Prometheus + Grafana for metrics; CloudWatch -> Firehose -> S3 -> Athena with an optional Amazon Security Lake OCSF path for audit/SIEM). Issues #13-#20. Grounded in read-only us-gov-west-1 calls: AMP managed scraper is absent in GovCloud (self-managed ADOT + SigV4), AMG auth via SAML to Keycloak (IAM Identity Center not enabled), Security Lake optional.
- plans/gitops-control-plane.md: Argo CD control plane sourced from the in-cluster GitLab, app-of-apps over the existing deploy/ paths, adopt-in-place (manual sync, no prune, no self-heal). Issues #6-#12.
- plans/gitops-adoption.md: per-workload GitOps adoption and the non-Argo state (Coder API via Argo PostSync Jobs, Keycloak via keycloak-config-cli, AWS stays Terraform). Issues #21-#29.
GitOps live migration is deliberately deferred: leave the current imperative state in place and adopt it later.
Generated by Coder Agents.
Add an in-boundary, in-cluster observability stack and wire Coder into it, so
the demo shows live control-plane metrics and dashboards without leaving the
GovCloud boundary. The AWS-native managed variant (AMP/AMG) is planned
separately in docs/plans/ and intentionally not built here.
Stack (deploy/observability/, Helm release kps, ns monitoring):
- kube-prometheus-stack 86.2.0 (Prometheus + Grafana + operator), trimmed for
the demo: Alertmanager, node-exporter, kube-state-metrics, bundled rules, and
the EKS control-plane ServiceMonitors are off; the kubelet ServiceMonitor is
kept for cAdvisor container CPU/memory. Images mirrored into ECR
(scripts/images.txt) and the chart overridden to the mirror.
- coder-metrics.yaml: a headless Service (ns coder, :2112) selecting only the
control-plane pod, plus ServiceMonitor/coder. Prometheus discovers it
(serviceMonitorSelectorNilUsesHelmValues=false); up{job="coder-metrics"}=1.
- dashboards-coder.yaml: six Prometheus-backed Coder dashboards from
github.com/coder/observability as sidecar-imported ConfigMaps, rendering live
data. Log-only panels and the agent-boundaries dashboard are omitted (no Loki).
- grafana-ingress.yaml: host grafana.usgov.coderdemo.io behind the existing NLB
(ACM wildcard TLS); HTTP 200 with valid TLS.
Coder server (deploy/coder/values.yaml): ADD only, to respect the coderd
AI-provider drift guard.
- CODER_PROMETHEUS_ENABLE=true, CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112,
CODER_PROMETHEUS_COLLECT_AGENT_STATS=true.
- Structured JSON logs for SIEM readiness: CODER_LOGGING_JSON=/dev/stderr and
CODER_LOGGING_HUMAN=/dev/null. Coder has no single CODER_LOG_FORMAT flag;
JSON is selected by pointing CODER_LOGGING_JSON at a sink.
Secrets: the Grafana admin password lives in AWS Secrets Manager
(usgov-coderdemo/observability/grafana) and is synced into the grafana-admin
Secret by a new ExternalSecret; no password in git.
Audit: licensed audit logging is already entitled and on (/audit); the JSON
server logs make coderd shippable to a downstream SIEM.
Verified live: coder Helm rev 5 healthy (1/1); monitoring pods Running
(grafana 3/3, prometheus 2/2, operator 1/1); grafana + dev hosts return 200;
the grafana-admin ExternalSecret is SecretSynced; the Coder Control Plane
dashboard renders live data end to end.
Docs: docs/as-built/55-observability.md; updated the as-built README, the docs
index, and STATUS.md.
Generated by Coder Agents.
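The `up{job="coder-metrics"}=1` check above can be automated against the Prometheus HTTP API, whose instant-query responses carry a `status` field and a `data.result` vector of samples. The sketch below only parses such a response body. How the body is fetched is left out, and the sample payload, including its `instance` label, is made up for illustration.

```python
import json


def target_is_up(response_body: str) -> bool:
    """Return True if an instant query for `up{...}` reports every sample as 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return False
    samples = payload["data"]["result"]
    # An empty vector means the target was never discovered, which is not "up".
    return bool(samples) and all(sample["value"][1] == "1" for sample in samples)


# Hypothetical response body in the Prometheus instant-query shape.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "coder-metrics",
                           "instance": "10.0.42.17:2112"},
                "value": [1700000000.0, "1"],
            }
        ],
    },
})

print(target_is_up(SAMPLE))  # True
```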
Make the demo one SSO: Grafana now logs in through the same Keycloak realm (coder) as Coder, instead of local-admin only. The local admin login form is kept enabled as break-glass.
- scripts/setup-grafana-oidc.py (idempotent): register a confidential OIDC client `grafana` in the realm (authorization-code + PKCE S256, redirect https://grafana.usgov.coderdemo.io/login/generic_oauth) with the same full-path `groups` group-membership mapper the coder client uses, then read the client secret and upsert it to AWS Secrets Manager at usgov-coderdemo/observability/grafana-oauth.
- ESO ExternalSecret grafana-oauth (ns monitoring) syncs that secret into a Kubernetes Secret; Grafana consumes it via the env var GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET (grafana.envValueFrom), so no secret is in git.
- kube-prometheus-stack-values.yaml: enable [auth.generic_oauth] against the realm auth/token/userinfo endpoints (scopes openid email profile) and map group membership to a Grafana org role: contains(groups[*], '/platform') && 'Admin' || 'Viewer'. allow_sign_up auto-provisions users; allow_assign_grafana_admin is off so the server-admin flag stays local.
Verified live (helm release kps upgraded, Grafana rolled out): the login page shows "Sign in with Keycloak"; /login/generic_oauth redirects to the realm with client_id=grafana and PKCE; a headless authorization-code login per persona confirms role mapping (pat.platform in /platform -> Admin, /api/org/users 200; dana.dev in /alpha -> Viewer, /api/org/users 403), both authLabels Generic OAuth and isExternallySynced.
Docs: docs/as-built/55-observability.md and deploy/observability/README.md gain an SSO section; STATUS.md notes the one-SSO Grafana login.
Generated by Coder Agents.
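The role expression above is JMESPath, where `contains` on an array is true only when some element equals the search value. A plain-Python equivalent, offered as a sketch rather than as Grafana's implementation, makes that exact-match behavior visible. The subgroup path in the last example is hypothetical.

```python
def grafana_org_role(groups: list[str]) -> str:
    """Mirror contains(groups[*], '/platform') && 'Admin' || 'Viewer'.

    JMESPath `contains` on an array tests element equality, so only a
    literal '/platform' entry in the claim yields Admin.
    """
    return "Admin" if "/platform" in groups else "Viewer"


print(grafana_org_role(["/platform"]))               # Admin  (pat.platform)
print(grafana_org_role(["/alpha"]))                  # Viewer (dana.dev)
# Hypothetical path: membership in a subgroup alone does not match.
print(grafana_org_role(["/platform/teams/example"]))  # Viewer
```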
Make GitLab sign in through the same Keycloak realm (coder) as Coder and Grafana, and give the demo a single SSO identity that is super admin across all three. Stays on GitLab Community Edition (no EE switch).
GitLab SSO (deploy/gitlab/statefulset.yaml):
- OmniAuth openid_connect provider in GITLAB_OMNIBUS_CONFIG (auth-code + PKCE, uid_field preferred_username, JIT sign-on). Auto-redirect is intentionally not set so the local root form remains as break-glass.
- scripts/setup-gitlab-oidc.py registers the confidential realm client `gitlab` and stores its secret in AWS Secrets Manager (usgov-coderdemo/gitlab/oidc); ESO syncs it to the gitlab-oidc Secret, injected as GITLAB_OIDC_CLIENT_SECRET.
CE role limitation, handled explicitly:
- GitLab CE does not implement OIDC group-to-role assignment (admin_groups is an EE feature; this gitlab-ce image has no openid_connect group code path). The admin_groups line is left as a documented no-op (EE-forward-compatible).
- scripts/setup-gitlab-users.py (idempotent, gitlab-rails) populates the eight personas, links each openid_connect identity (extern_uid = preferred_username), and sets GitLab instance admin only on pat.platform, mirroring the Coder org-admin mapping and preserving tenant isolation.
Unified super admin:
- scripts/grant-coder-owner.py grants the Coder site Owner role to pat.platform (site roles are not claim-driven and persist across logins). With the GitLab admin flag and the existing Grafana /platform -> Admin mapping, the single SSO identity pat.platform is super admin in Coder, GitLab, and Grafana.
- Local break-glass admins remain per app; GitLab root was given a known password (stored in ASM usgov-coderdemo/gitlab/secrets root_password and the local secrets file), since the first-boot random root password was gone.
Verified live: pat.platform SSO -> GitLab is_admin=true (/admin 200), Coder site roles [owner], Grafana org Admin; dana.dev -> regular/Viewer. Root login works with the reset password.
Docs: docs/as-built/50-gitlab-scm.md gains a Keycloak SSO section and the CE limitation; STATUS.md gains a single sign-on + super admin summary.
Generated by Coder Agents.
…orgs The unified super admin signs in via Keycloak but only saw one Coder org, because org membership is IdP-synced from the `groups` claim and pat.platform was only in /platform (-> the coder org). Add pat.platform to the /alpha and /bravo Keycloak groups (and their org-admin role subgroups) in scripts/setup-keycloak-hierarchy.py, so org sync makes Pat a member and organization-admin of all three orgs on login. Combined with the Coder site Owner role and GitLab/Grafana admin, one Keycloak login is now admin across the whole stack and the Coder org switcher shows Platform, Alpha, and Bravo. Verified live with scripts/verify-oidc-login.py (a real OIDC login, which runs the sync): pat.platform -> coder/alpha/bravo all organization-admin, site roles [owner]. Tenant isolation is unchanged for the mission-partner personas. Docs: STATUS.md and docs/as-built/45-idp-sync-personas.md updated to reflect pat.platform as the all-orgs super admin (deliberate exception to isolation). Generated by Coder Agents.
Add a dedicated operator account austen.platform (its own SUPERADMIN_PASSWORD) that is super admin across the stack through a single Keycloak login: Coder site Owner plus org-admin in all three orgs, GitLab instance admin, and Grafana org Admin (via the /platform group rule). Revert pat.platform to a normal Platform lead persona: Platform org-admin only, no Coder site Owner, not a GitLab admin.
- setup-keycloak-hierarchy.py: add austen.platform in the platform/alpha/bravo org and org-admins groups with a per-user password_env (SUPERADMIN_PASSWORD); trim pat.platform back to the /platform groups.
- setup-gitlab-users.py: provision austen.platform as instance admin, mark every demo persona (including pat.platform) a regular user, and support per-persona password env over stdin.
- grant-coder-owner.py: default target is now austen.platform.
- Docs (STATUS.md, 45-idp-sync-personas.md, 50-gitlab-scm.md, 55-observability.md): describe the operator super admin and the pat.platform revert.
Verified live via headless SSO: austen is owner and org-admin in all orgs, a GitLab admin, and a Grafana Admin; pat is org-admin in coder only, no Coder site role, and a GitLab non-admin.
Generated by Coder Agents.
Give the operator super admin austen.platform the Keycloak webauthn-register and CONFIGURE_TOTP required actions so its first Keycloak sign in forces WebAuthn passkey and TOTP enrollment. The actions are applied only while the matching credential is missing, so reconciles never re-force enrollment.
- setup-keycloak-hierarchy.py: add required_actions to the austen.platform spec plus an ensure_required_actions() reconciler keyed on existing credentials.
- Docs (STATUS.md, 45-idp-sync-personas.md): note the enforced enrollment and that the headless verify probe no longer applies to austen.platform.
Applied live: austen.platform requiredActions=[CONFIGURE_TOTP, webauthn-register] with only a password credential; the other personas are unaffected.
Generated by Coder Agents.
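The "only while the matching credential is missing" rule can be expressed as a small pure function. The sketch below is an assumption about how such a reconciler might decide, since the real ensure_required_actions() is not shown in this PR. The function name is hypothetical, and the credential type strings (`webauthn`, `otp`) are assumed to be what the Keycloak Admin API reports for enrolled credentials.

```python
# Assumed pairing of Keycloak credential type -> the required action that enrolls it.
ACTION_FOR_CREDENTIAL = {
    "webauthn": "webauthn-register",
    "otp": "CONFIGURE_TOTP",
}


def pending_required_actions(wanted: list[str], credential_types: set[str]) -> list[str]:
    """Return the wanted actions whose credential is not enrolled yet."""
    satisfied = {
        action
        for cred, action in ACTION_FOR_CREDENTIAL.items()
        if cred in credential_types
    }
    return sorted(action for action in wanted if action not in satisfied)


WANTED = ["webauthn-register", "CONFIGURE_TOTP"]

# Password only: both enrollments are still required.
print(pending_required_actions(WANTED, {"password"}))
# -> ['CONFIGURE_TOTP', 'webauthn-register']

# After both are enrolled, a reconcile asks for nothing more.
print(pending_required_actions(WANTED, {"password", "otp", "webauthn"}))
# -> []
```

Keying the decision on existing credentials rather than on a "done" flag is what keeps repeated runs from re-forcing enrollment.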
…Coder Demo Make the Coder dashboard application_name configurable via APP_NAME (default "USGOV Coder Demo") instead of empty, so the demo deployment shows a branded name in the UI title and login page. Applied live via PUT /api/v2/appearance; the UNCLASSIFIED announcement banner is preserved. Generated by Coder Agents.
Deploy a lean single-binary Grafana Loki (filesystem gp3 PVC, tsdb schema v13, 168h retention) and a node-level Promtail DaemonSet into the monitoring namespace, with both images mirrored to ECR. Add a Grafana datasource ConfigMap with uid "loki" so the generated Coder dashboards' log panels resolve to the live log store instead of erroring. Clean up the coder-status dashboard: replace the upstream LGTM "Observability Tools" row (distributed Loki, Grafana Agent, config reloaders, storage/CPU/RAM) with Prometheus, Loki, and Promtail up panels, and repoint the Workspace Builds and Postgres panels to coderd_* metrics that exist in this stack. Update the observability README, STATUS.md, and the as-built doc. Generated by Coder Agents.
Add a single Grafana dashboard (uid ai-governance, title "AI Governance") covering both the AI Gateway (AI Bridge) and the Agent Firewall (Boundary), replacing the two missing add-on dashboards. New ConfigMap deploy/observability/dashboards-ai-governance.yaml (ns monitoring, label grafana_dashboard: "1") so it never conflicts with dashboards-coder.yaml. AI Gateway panels read coder_aibridged_* (configured providers, reload health, provider inventory) and stream AI Bridge logs from Loki. Agent Firewall panels read agent_boundary_log_proxy_batches_forwarded_total and stream Boundary logs from Loki. Every panel targets datasource uid prometheus or uid loki. All ten query panels verified HTTP 200 through Grafana /api/ds/query; usage panels read 0 or stay sparse until live AI traffic occurs (placeholder Anthropic key). Documents the dashboard in docs/as-built/55-observability.md and STATUS.md. Generated by Coder Agents.
What this does
Deploys and validates the full demo stack on the live GovCloud EKS cluster (`us-gov-west-1`, `usgov.coderdemo.io`): Coder + Keycloak SSO + GitLab + AI Gateway + a Claude Code workspace template.
- `coder` imported; authorize flow verified
Verified end to end
- `coder` + redirect `/api/v2/users/oidc/callback` returns the login page (200).
- `POST /api/v2/aibridge/anthropic/v1/messages` routes through to api.anthropic.com (currently 502 "keys failed authentication", a placeholder key; see below). Providers `anthropic` + `anthropic-bedrock` (IRSA) are enabled.
- `claude-code` pushed; a test workspace built, the agent connected and went healthy, and Claude Code + AgentAPI + code-server installed.
- verified, EBS CSI via IRSA, RDS roles/dbs created (force_ssl).
Hardening and documentation (latest update)
- `claude-code` template declares `data "coder_external_auth" "gitlab"`; the active template version's `/external-auth` lists `gitlab` as required. The agent git credential helper then injects a short-lived token, so no PATs or SSH keys live in the workspace and no auth path leaves the boundary.
- `CODER_DISABLE_PATH_APPS=true`, Helm rev 4). All template apps use `subdomain = true`, so apps are subdomain-only.
- `UNCLASSIFIED - USGOVCLOUD` and the application name set to `USGOV Coder Demo` via the idempotent `scripts/set-appearance.sh` (runtime appearance applied over the API, not a Helm value; the name is configurable via `APP_NAME`).
- `docs/as-built/`: architecture and flows, per-component configuration, and a declarative-vs-imperative ledger with a Terraform reconciliation backlog. Indexed from `docs/00-INDEX.md`.
Multi-tenant IdP sync (Keycloak -> Coder)
Models a true multi-tenant hierarchy in Keycloak and syncs it into Coder via
OIDC IdP sync (organization + group + role). No org/group/role is assigned by
hand in Coder.
- `coder` (Platform Engineering), `alpha` (Mission Partner Alpha), `bravo` (Mission Partner Bravo). assign_default=false, so org membership is purely claim-driven.
- `groups` claim (Group Membership mapper on the `coder` client) drives org sync, per-org group sync, and per-org role sync (organization-admin / organization-template-admin / organization-auditor).
- scientist, cross-tenant ISSO/auditor).
- provisioner daemon per tenant (reusing the `coder` SA) and the `claude-code` template pushed into all three orgs.
- `scripts/verify-oidc-login.py` (a real Keycloak login per persona): each lands in the right org(s)/group(s)/role(s); Alpha, Bravo, and Platform stay isolated; the auditor spans both tenants read-only.
Idempotent scripts: `scripts/setup-keycloak-hierarchy.py`, `scripts/setup-coder-idp-sync.py`, `scripts/verify-oidc-login.py`. Details in `docs/as-built/45-idp-sync-personas.md`.
Secrets management (External Secrets Operator + AWS Secrets Manager)
Runtime secrets are now sourced from AWS Secrets Manager and synced into
Kubernetes by the External Secrets Operator over IRSA. ASM is the source
of truth; nothing sensitive is in git, and the app manifests are unchanged.
- `external-secrets`, ECR-mirrored image) authenticates via IRSA role `usgov-coderdemo-external-secrets` (least-privilege: read-only on `usgov-coderdemo/*`, no static keys).
- (`scripts/migrate-secrets-to-asm.py`); a `ClusterSecretStore` + per-secret `ExternalSecret` materialize them back with the same names/keys.
- sha256 before/after), so running pods were not disrupted; store `Valid`, all 9 `SecretSynced`, and delete/recreate recovery confirmed.
- in `terraform/secrets-hardening.tf` but not applied (irreversible re-encrypt; needs a maintenance decision).
Details in `docs/as-built/85-secrets-management.md`.
Observability (in-cluster Prometheus + Grafana)
In-boundary, in-cluster metrics and dashboards so the demo shows live
control-plane telemetry without leaving the GovCloud boundary. The AWS-native
managed variant (AMP/AMG, CloudWatch -> Security Lake) is the production target
and is planned separately (see below), not built here.
- `deploy/observability/`, Helm release `kps`, ns `monitoring`): `kube-prometheus-stack` 86.2.0 (Prometheus + Grafana + operator), ECR-mirrored images. Trimmed for the demo: Alertmanager, node-exporter, kube-state-metrics, bundled rules, and the EKS control-plane ServiceMonitors are off; the kubelet ServiceMonitor is kept for cAdvisor container CPU/memory.
- `deploy/coder/values.yaml`, ADD-only to respect the coderd drift guard): `CODER_PROMETHEUS_ENABLE=true`, `CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112`, agent stats on. A headless `coder-metrics` Service + ServiceMonitor scrapes the control plane; `up{job="coder-metrics"}` is `1`.
- `github.com/coder/observability`) render live data at `https://grafana.usgov.coderdemo.io` (HTTP 200, valid TLS). The Grafana admin password is sourced from AWS Secrets Manager (`usgov-coderdemo/observability/grafana`) and synced by ESO; no password in git. Log panels are wired to the in-cluster Loki datasource (below), and a dedicated AI Governance dashboard (`uid: ai-governance`, `deploy/observability/dashboards-ai-governance.yaml`) merges AI Gateway provider health (`coder_aibridged_provider_info`) with Agent Firewall (Boundary) proxy activity in one view.
- PVC) plus a Promtail DaemonSet aggregate all pod logs in-boundary (`deploy/observability/loki.yaml`, `promtail.yaml`); a `loki` Grafana datasource (`deploy/observability/loki-datasource.yaml`) feeds the log panels. Images are ECR-mirrored. Log aggregation stays entirely inside the GovCloud boundary; no external log sink is used.
- `coder`) as Coder via a confidential OIDC client `grafana` (`scripts/setup-grafana-oidc.py`, authorization-code + PKCE; secret in AWS Secrets Manager `usgov-coderdemo/observability/grafana-oauth`, ESO-synced, no secret in git). Keycloak group membership maps to a Grafana org role (`contains(groups[*], '/platform') && 'Admin' || 'Viewer'`); the local admin login is kept as break-glass. Verified per persona with a headless login: `pat.platform` (`/platform`) -> `Admin`, `dana.dev` (`/alpha`) -> `Viewer`.
- (`CODER_LOGGING_JSON=/dev/stderr`, `CODER_LOGGING_HUMAN=/dev/null`); licensed audit logging is already entitled and on (`/audit`). Coder has no single `CODER_LOG_FORMAT` flag, so JSON is selected by pointing `CODER_LOGGING_JSON` at a sink.
Verified live: coder Helm rev 5 healthy; monitoring pods Running (grafana 3/3, prometheus 2/2, operator 1/1); the `grafana-admin` ExternalSecret is `SecretSynced`; the Coder Control Plane dashboard renders end to end. Details in `docs/as-built/55-observability.md` and `deploy/observability/README.md`.
Planned (design + issues, nothing applied)
Forward-looking designs added under `docs/plans/` with companion GitHub issues. Nothing in these plans changes the live environment; GitOps live migration is deliberately deferred (adopt the current state in place later).
- `docs/plans/observability-aws-native.md`, issues #13 (observability: enable Coder Prometheus metrics and JSON audit logging (Phase 0)) through #20 (observability: migrate in-cluster demo stack to AMP/AMG and reconcile Terraform (Phase 7)).
- app-of-apps over the existing `deploy/` paths, adopt-in-place): `docs/plans/gitops-control-plane.md`, issues #6 (GitOps: choose and install the Argo CD control plane (decision + bootstrap)) through #12 (GitOps: non-disruptive adoption runbook + argocd app diff verification).
- Argo Jobs, Keycloak via keycloak-config-cli, AWS stays Terraform): `docs/plans/gitops-adoption.md`, issues #21 (gitops: adopt the coder Helm release into GitOps in place (chart 2.34.0)) through #29 (gitops: Terraform AWS substrate reconcile as a GitOps prerequisite (ordering cross-reference)).
GitLab SSO + one super admin
GitLab now signs in through the same Keycloak realm (`coder`) as Coder and Grafana, so the demo is one SSO. OmniAuth `openid_connect` is configured in the GitLab StatefulSet; the realm client `gitlab` is created by `scripts/setup-gitlab-oidc.py`, with its secret in AWS Secrets Manager and synced by ESO (no secret in git). The local root form stays as break-glass.
GitLab stays Community Edition, which does not implement OIDC group-to-role mapping (`admin_groups` is an EE feature). So persona users and the GitLab admin flag are provisioned explicitly by `scripts/setup-gitlab-users.py` (idempotent, `gitlab-rails`), linking each `openid_connect` identity and making only the operator account `austen.platform` an instance admin (the persona accounts stay non-admin, preserving tenant isolation).
A dedicated operator account `austen.platform` (separate from the demo personas) is the single super-admin SSO identity. `scripts/grant-coder-owner.py` (default username `austen.platform`) grants it the Coder site Owner role, and `scripts/setup-gitlab-users.py` makes it GitLab Administrator; it is also Grafana Admin, so one Keycloak identity is super admin across Coder (Owner), GitLab (Administrator), and Grafana (Admin). It is created in `scripts/setup-keycloak-hierarchy.py` with passkey (WebAuthn) + TOTP enrollment required at first login (Keycloak required actions `webauthn-register` + `CONFIGURE_TOTP`). `pat.platform` is reverted to a normal Platform lead (org-admin of the `coder` org only). Verified live per persona. Per-app local break-glass admins remain; credentials live in `generated-secrets.env` / AWS Secrets Manager, not git.
One action remains (external dependency)
No real Anthropic API key was available in the environment, so the `anthropic` provider is seeded with a placeholder. To make AI respond: sign in as owner → Admin > AI > Providers (`/ai/settings`) → set the real `sk-ant-...` key on the provider named `anthropic` (do this in the UI, not the `coder-ai` secret). Details in `STATUS.md`.
Decision log & deviations to reconcile into Terraform
Why standard EKS instead of Auto Mode: EKS Auto Mode node provisioning is broken in this GovCloud account. The AWS-managed SLR `AWSServiceRoleForAmazonEKS` lacks `iam:AddRoleToInstanceProfile` / `iam:TagInstanceProfile`, so Auto Mode NodeClass validation never succeeds. The cluster was converted to standard EKS.
Live-cluster changes not yet in `terraform/` (see `deploy/platform/README.md`):
- `mng` (3x m5.xlarge), node role `usgov-coderdemo-mngnode`.
- `usgov-coderdemo-ebs-csi` + addon `service-account-role-arn`.
- `gp3` default StorageClass.
- `coder-workspaces`.
Fixes folded into the manifests/template:
- `aws-load-balancer-type: external` (LB Controller, not Auto Mode NLB).
- `_comment_*` keys that abort realm import.
- `anthropic`; the gateway routes by provider name and the claude-code module hardcodes `/api/v2/aibridge/anthropic`.
- `allow_privilege_escalation: true` so the agentapi module can `sudo`-install to `/usr/local/bin`.
- `gp3` StorageClass; removed `mattermost` key (removed in GitLab 19.0); added the VPC CIDR to `gitlab_rails['monitoring_whitelist']` so kubelet health probes (from the node IP) pass instead of getting 404.
AI provider note: providers are DB-managed since v2.34 (env only seeds once). The running instance has `anthropic` (placeholder key) + `anthropic-bedrock`; a coderd restart was verified to skip re-seeding the soft-deleted old provider and not crash (no drift).
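To show why the provider had to be named `anthropic`, the sketch below models name-based routing as a lookup keyed on the path segment after `/api/v2/aibridge/`. It is an illustration of the idea only, not Coder's routing code. The function name and the `claude` provider name in the second example are hypothetical.

```python
from typing import Optional

AIBRIDGE_PREFIX = "/api/v2/aibridge/"


def provider_from_path(path: str) -> Optional[str]:
    """Extract the provider name segment that follows the AI Bridge prefix."""
    if not path.startswith(AIBRIDGE_PREFIX):
        return None
    name = path[len(AIBRIDGE_PREFIX):].split("/", 1)[0]
    return name or None


def is_routable(path: str, configured: set[str]) -> bool:
    """A request is routable only if its path segment names a configured provider."""
    return provider_from_path(path) in configured


# The path the claude-code module is described as hardcoding.
MODULE_PATH = "/api/v2/aibridge/anthropic/v1/messages"

print(is_routable(MODULE_PATH, {"anthropic", "anthropic-bedrock"}))  # True
# Hypothetical misnamed provider: the hardcoded path no longer matches anything.
print(is_routable(MODULE_PATH, {"claude", "anthropic-bedrock"}))     # False
```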
Authored by Coder Agents on behalf of @ausbru87.