fix(runner): add health probes and improve INITIAL_PROMPT error logging #1534
maknop wants to merge 60 commits into
Conversation
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Creates kustomize overlay for deploying to hcmais01ue1 via app-interface:
- Uses Konflux images from redhat-services-prod/hcm-eng-prod-tenant
- Scales down in-cluster databases (using external RDS from app-interface Phase 2)
- Scales down MinIO (using external S3 from app-interface Phase 2)
- Includes CRDs, RBAC, routes, and all application components
- Patches operator to use Konflux runner image
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert kustomize overlay to OpenShift Template format for app-interface SaaS deployment. Split into two templates:
1. template-operator.yaml (CRDs, ClusterRoles, operator deployment)
   - Operator and ambient-runner images
   - Cluster-scoped resources (CRDs, RBAC)
   - Operator deployment and its ConfigMaps
2. template-services.yaml (Application services)
   - Backend, frontend, public-api, ambient-api-server images
   - All deployments, services, routes, configmaps
   - Scales in-cluster services to 0 (minio, postgresql, unleash)
Both templates use the IMAGE_TAG parameter (auto-generated from the git commit SHA) and support Konflux image gating through app-interface. This allows app-interface to use provider: openshift-template with proper parameter substitution instead of the directory provider, which doesn't run kustomize build.
The objects field must be a YAML array with proper list indicators. The previous version was missing the '-' prefix on array items, causing: 'unable to decode STDIN: json: cannot unmarshal object into Go struct field Template.objects of type []runtime.RawExtension'
Changes:
- Rebuild templates using the Python yaml library for correct formatting
- Objects now properly formatted as a YAML array with '- apiVersion:'
- Add validate.sh script for testing with oc process
- Both templates validated successfully
Generated from kustomize overlay output with proper YAML structure.
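To illustrate the shape the decoder expects, here is a minimal template sketch (the object names and kinds are made up for the example); each entry under `objects` must be a YAML list item with a `-` prefix:

```yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: example-template
objects:
  - apiVersion: v1          # note the '-' prefix on each object
    kind: Service
    metadata:
      name: example-service
parameters:
  - name: IMAGE_TAG
    required: true
```

Without the `-` prefix, `objects` parses as a single mapping rather than an array, which is exactly the `[]runtime.RawExtension` unmarshal error above.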
Remove minio, postgresql, unleash, ambient-api-server-db. Using external RDS and S3 from app-interface.
Removed 12 resources (4 Deployments, 4 Services, 3 PVCs, 1 Secret).
Remaining: ambient-api-server, backend-api, frontend, public-api
Disables OTEL metrics export by commenting out the OTEL_EXPORTER_OTLP_ENDPOINT environment variable in operator deployment manifests.
The operator was configured to send metrics to otel-collector.ambient-code.svc:4317, but this service does not exist in the cluster, causing repeated gRPC connection failures every 30 seconds with error: "failed to upload metrics: context deadline exceeded: rpc error: code = Unavailable desc = name resolver error: produced zero addresses"
With OTEL_EXPORTER_OTLP_ENDPOINT unset, InitMetrics() will skip metrics export and log "metrics export disabled" instead of throwing connection errors.
Changes:
- Comment out OTEL_EXPORTER_OTLP_ENDPOINT in base operator deployment
- Comment out OTEL_EXPORTER_OTLP_ENDPOINT in OpenShift template
- Add clarifying comment about re-enabling when collector is deployed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Add oauth-proxy component to frontend deployment (dashboard-ui port on 8443)
- Enable SSL for ambient-api-server RDS connection (db-sslmode=require)
- Set AMBIENT_ENV to 'stage' for ambient-api-server
- Enable OpenShift service-ca for ambient-api-server TLS cert provisioning
- Regenerate templates with new oauth-proxy and api-server patches
This enables:
- Authenticated access to frontend via OpenShift OAuth
- Secure connections to external RDS database
- Automatic TLS certificate rotation for ambient-api-server
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove postgresql, minio, unleash, and ambient-api-server-db resources from the services template. These services are scaled to 0 via kustomize patches because we use external RDS and S3 instead. Including them in the template causes app-interface to try deploying them, which fails imagePattern validation and wastes resources.
Excluded resources:
- Deployment/postgresql, Service/postgresql
- Deployment/minio, Service/minio, PVC/minio-data
- Deployment/unleash, Service/unleash
- Deployment/ambient-api-server-db, Service/ambient-api-server-db
Template now has 21 service resources (down from 30).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Switch from custom vault secrets to OpenShift service account-based OAuth:
- Use Red Hat's official ose-oauth-proxy-rhel9 image
- Use the service account token for the cookie secret (no vault needed)
- Enable HTTPS on the OAuth proxy with OpenShift service-ca auto-generated certs
- Add system:auth-delegator ClusterRoleBinding for OAuth delegation
- Add OAuth redirect reference annotation to the frontend ServiceAccount
- Fix service account reference from 'nginx' to 'frontend'
- Add missing NAMESPACE and UPSTREAM_TIMEOUT parameters
Benefits:
- No manual vault secret management
- Automatic TLS cert rotation via service-ca
- Standard OpenShift OAuth integration pattern
- Follows app-interface team recommendations
Files changed:
- frontend-rbac.yaml: Added OAuth annotations and auth-delegator binding
- oauth-proxy component patches: Updated to new configuration
- Templates: Regenerated with OAuth fixes (27 operator, 21 service resources)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The RDS credentials secret should not be in the OpenShift template - it's
provided by the external resource provider (terraform) in app-interface.
The namespace's externalResources section already defines:
- provider: rds
output_resource_name: ambient-code-rds
This automatically creates the secret with the correct RDS credentials.
Including the secret in the template with VAULT_INJECTED placeholders
caused deployment failures.
Changes:
- Excluded ambient-code-rds secret from template generation
- Template now has 20 service resources (down from 21)
- Deployment still references the secret via volumeMount (correct)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
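For context, the reference that stays in the Deployment might look like this sketch; the mount path and container name are illustrative assumptions, while the `ambient-code-rds` secret name comes from the external resource definition above:

```yaml
# Sketch: the Secret object is omitted from the template, but the
# Deployment keeps mounting it; terraform/app-interface creates it.
spec:
  template:
    spec:
      volumes:
        - name: rds-credentials
          secret:
            secretName: ambient-code-rds   # provisioned by the RDS provider
      containers:
        - name: ambient-api-server         # illustrative container name
          volumeMounts:
            - name: rds-credentials
              mountPath: /etc/rds          # illustrative mount path
              readOnly: true
```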
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Changes GCP service account configuration to align with app-interface deployment, where credentials are provided via Vault.
Changes:
- template-services.yaml: Update backend vertex-credentials secret name from 'ambient-vertex' to 'stage-gcp-creds' (matches Vault secret)
- template-operator.yaml: Update GOOGLE_APPLICATION_CREDENTIALS path to match Vault secret key name 'itpc-gcp-hcm-pe-eng.json'
The secret is provided by app-interface via:
path: engineering-productivity/ambient-code/stage-gcp-creds
This allows the backend and operator to use Vertex AI for Claude and Gemini API calls with the service account configured with roles/aiplatform.user permissions.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Configure OAuth proxy sidecar to inject the authentication token into forwarded requests, fixing 401 errors on /api/projects endpoints.
Changes:
- Add --pass-access-token=true flag to inject the X-Forwarded-Access-Token header
- Change upstream from frontend-service:3000 to localhost:3000 (correct sidecar pattern)
- Remove --request-logging to reduce log noise
Backend logs showed: tokenSource=none hasAuthHeader=false hasFwdToken=false
The backend expects the X-Forwarded-Access-Token header, which is now injected by the OAuth proxy for all authenticated requests.
Flow:
1. User authenticates via OpenShift OAuth ✓
2. OAuth proxy injects token header ✓ (new)
3. Frontend forwards token to backend API ✓ (fixed)
This resolves the 401 authentication errors while maintaining the working OpenShift OAuth integration.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed the '--set-authorization-header=true' option from the configuration.
Removed the '--scope=user:full' option from the configuration.
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
chore: Update konflux deps
Switch OAuth proxy from service account authentication to explicit SSO client credentials to enable the user:full scope.
Changes:
- Replace --openshift-service-account with --client-id=ambient-code
- Mount client_secret from the stage-sso-client Kubernetes secret
- Add --scope=user:full to grant full user permissions
- Mount the /etc/oauth-client volume for the client secret file
This allows users to create resources (AgenticSessions, ConfigMaps) in their project namespaces by providing the necessary OAuth scope.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove the ambient-frontend-oauth-delegator ClusterRoleBinding from the operator template, as it is now deployed via app-interface openshiftResources for better separation of concerns. Cluster-scoped resources should be managed outside of saas file deployments because they have impact on the whole cluster.
This ClusterRoleBinding grants the frontend service account the system:auth-delegator role needed for OAuth proxy token delegation. It is now defined in app-interface at:
resources/services/ambient-code-platform/ambient-frontend-oauth-delegator.clusterrolebinding.yaml
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
OAuth client updates
The pathChanged() CEL function was using incorrect glob syntax that prevented pipelines from triggering on component changes:
- Changed `./components/*/***` to `components/*/**` (removed leading `./` and fixed triple-asterisk to double-asterisk for recursive matching)
- Removed invalid root `Dockerfile` check (Dockerfiles are in component subdirectories, already covered by component globs)
PipelinesAsCode pathChanged() expects standard glob patterns relative to the repository root, with `**` for recursive directory matching.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
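A hedged sketch of where such a glob lives — the PipelineRun name, event, and branch guard here are illustrative, not the repository's actual trigger:

```yaml
# Illustrative Pipelines-as-Code trigger annotation; only the
# "components/*/**" glob is taken from the fix above.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: component-pull-request        # illustrative name
  annotations:
    pipelinesascode.tekton.dev/on-cel-expression: |
      event == "pull_request" &&
      target_branch == "main" &&
      "components/*/**".pathChanged()
```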
fix(ci): correct Tekton pathChanged glob patterns
When OTEL_EXPORTER_OTLP_ENDPOINT is unset, InitMetrics() was returning early without initializing metric instruments, leaving them as nil. This caused nil pointer panics when reconciliation code called metric recording functions like RecordSessionCreatedByUser(). The panic occurred at otel_metrics.go:424 when sessionsByUser.Add() was called on a nil counter during the reconcilePending phase.
Fix:
- When the OTEL endpoint is unset, initialize a no-op meter from the global provider
- Create all metric instruments as no-ops (silently ignore all calls)
- Prevents nil pointer panics while maintaining the same API contract
- No-op instruments have all the same methods but do nothing
OpenTelemetry provides a built-in no-op MeterProvider as the global default, which creates no-op instruments that safely ignore all metric recording calls without panicking.
Error before fix:
panic: runtime error: invalid memory address or nil pointer dereference
at RecordSessionCreatedByUser (/app/internal/controller/otel_metrics.go:424)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
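The underlying idea is the null-object pattern: return an instrument that implements the same interface but does nothing, instead of leaving it unset. A language-agnostic sketch of that pattern (in Python for brevity; the operator's actual fix uses OpenTelemetry's built-in Go no-op MeterProvider, and these class and function names are made up for the illustration):

```python
class NoopCounter:
    """Null-object stand-in: same interface as a real counter, does nothing."""
    def add(self, value, attributes=None):
        pass  # silently ignore the recording call


class RealCounter(NoopCounter):
    """Toy 'real' instrument for contrast."""
    def __init__(self):
        self.total = 0

    def add(self, value, attributes=None):
        self.total += value


def init_counter(otlp_endpoint):
    # Mirrors the fix: with no endpoint configured, return a no-op object
    # rather than leaving the instrument nil/None for callers to trip over.
    return RealCounter() if otlp_endpoint else NoopCounter()


sessions_by_user = init_counter(otlp_endpoint=None)
sessions_by_user.add(1)  # safe no-op; before the fix this was a nil-pointer panic
```

Callers keep the exact same `add()` call site whether metrics export is enabled or not, which is the "same API contract" the commit refers to.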
fix: initialize no-op metrics instruments when OTEL is disabled
Add permissions for mlflow.kubeflow.org Experiments and Runs CRDs to
the agentic-operator ClusterRole. The operator unconditionally grants
these permissions to session runner service accounts via Roles, but
cannot grant permissions it doesn't hold itself.
Without these ClusterRole permissions, session creation fails with:
user "system:serviceaccount:ambient-code:agentic-operator" is attempting
to grant RBAC permissions not currently held:
{APIGroups:["mlflow.kubeflow.org"], Resources:["experiments"], Verbs:[...]}
These are namespace-scoped CRDs from the Kubeflow MLflow Operator, used
for ML experiment tracking with Kubernetes-native RBAC authentication.
Sessions use these to log ML training runs, parameters, and metrics to
the MLflow tracking server.
Note: MLflow tracing is optional (MLFLOW_TRACING_ENABLED env var), but
the operator code unconditionally includes these permissions in session
Roles regardless of whether tracing is enabled.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
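A minimal sketch of the kind of rule this implies for the ClusterRole — the verb list is an assumption, since the error message above elides it:

```yaml
# Illustrative fragment of the agentic-operator ClusterRole; only the
# mlflow.kubeflow.org API group is taken from the commit above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agentic-operator
rules:
  - apiGroups: ["mlflow.kubeflow.org"]
    resources: ["experiments", "runs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

Because of RBAC privilege-escalation protection, the operator can only put these verbs into the session runner Roles it creates if its own ClusterRole already holds them.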
fix: add MLflow CRD permissions to operator ClusterRole
Add mlflow.kubeflow.org CRD permissions to the agentic-operator ClusterRole. The operator creates Roles in user namespaces that include MLflow permissions, but due to Kubernetes RBAC privilege escalation protection, it can only grant permissions it holds itself.
Previous commit 2af8216 added MLflow permissions to the backend-api ClusterRole, but missed adding them to agentic-operator. This causes session creation to fail with:
user "system:serviceaccount:ambient-code:agentic-operator" is attempting to grant RBAC permissions not currently held: {APIGroups:["mlflow.kubeflow.org"], Resources:["experiments"], Verbs:[...]}
The agentic-operator service account needs these permissions to create session runner Roles that include MLflow access.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…sterrole fix: add MLflow permissions to agentic-operator ClusterRole
The operator needs to create NetworkPolicies in user namespaces to isolate runner pods. Without this permission, session creation fails with:
networkpolicies.networking.k8s.io is forbidden: User "system:serviceaccount:ambient-code:agentic-operator" cannot create resource "networkpolicies" in API group "networking.k8s.io" in the namespace "mknop-ws"
This adds create/delete/get/list permissions for NetworkPolicies to the agentic-operator ClusterRole.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Configure oauth-proxy to route /api/* requests to backend-service instead of the Next.js frontend. Without this routing, all requests including /api/* go to localhost:3000, causing 503 errors because Next.js doesn't handle backend API routes.
Changes:
- Add --upstream=http://backend-service:8080/api/ before the default upstream
- Requests to /api/* now route to backend-service:8080
- All other requests continue to the Next.js frontend at localhost:3000
OAuth2-proxy processes upstreams in order and uses the path portion as a matching key. The /api/ path in the upstream URL matches any request starting with /api/, and the full request path is forwarded to the backend.
Request flow example:
Browser: GET https://ambient.corp.stage.redhat.com/api/projects/foo/sessions/bar
→ OAuth-proxy checks auth via --openshift-delegate-urls
→ Matches --upstream=http://backend-service:8080/api/ (longest match)
→ Forwards to: http://backend-service:8080/api/projects/foo/sessions/bar
Fixes browser console errors:
GET /api/projects/.../git/status [503 Service Unavailable]
AG-UI stream error: Connection error
The connection to .../agui/events was interrupted
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
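Putting the flags from this and the earlier oauth-proxy commits together, the sidecar's argument list plausibly looks like this sketch (surrounding container fields and any flags not mentioned in the commits are omitted):

```yaml
# Illustrative oauth-proxy sidecar args; ordering matters, since
# upstreams are matched by path with the longest match winning.
args:
  - --upstream=http://backend-service:8080/api/  # /api/* → backend API
  - --upstream=http://localhost:3000             # everything else → Next.js frontend
  - --pass-access-token=true                     # inject X-Forwarded-Access-Token
```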
fix: add backend API routing to oauth-proxy upstream
Remove --openshift-delegate-urls parameter from oauth-proxy that was
blocking /api/* requests with "no resource mapped path" errors.
Issue:
- openshift-delegate-urls={"/api":{"resource":"projects","verb":"list"}}
only matches /api exactly, not /api/* subpaths
- All /api/* requests were returning 503 even though backend received
and processed them successfully (200 OK in backend logs)
- oauth-proxy logs showed: "no resource mapped path"
Solution:
OAuth-proxy still provides authentication (OAuth login required for all
requests) and passes the access token to the backend via --pass-access-token.
The backend handles its own fine-grained authorization based on the token,
so the blanket openshift-delegate-urls check is redundant and overly
restrictive.
Authorization flow after this change:
1. User authenticates via OAuth (enforced by oauth-proxy)
2. oauth-proxy passes access token to backend
3. Backend validates token and checks user permissions per endpoint
4. Backend returns appropriate response (200, 403, 404, etc.)
This matches the backend's existing authorization model where different
API endpoints have different permission requirements that can't be
expressed in a single openshift-delegate-urls pattern.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…urls fix: remove overly restrictive openshift-delegate-urls check
Increased the initial prompt deploy delay to 10 seconds
Kubernetes Health Probes:
- Added readiness probe (3s initial delay, 5s period)
- Added liveness probe (20s initial delay, 30s period)
- Prevents the Service routing traffic before FastAPI is ready
- Reduces 503 "runner unavailable" errors
Error Logging Improvements:
- Enhanced retry error logging to include the exception type
- Previously logged empty strings for exceptions like asyncio.TimeoutError
- Now logs: "error: TimeoutError: <details>" instead of "error: "
Benefits:
- Prevents premature traffic routing to starting pods
- More informative error logs for debugging
- Better system resilience through health probes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
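Given the stated timings and the /health endpoint named in the PR description, the probe configuration on the runner container plausibly looks like this sketch (the container name and port are assumptions):

```yaml
# Illustrative probe block for the runner container.
containers:
  - name: ambient-runner          # illustrative name
    readinessProbe:               # gates Service traffic until FastAPI is up
      httpGet:
        path: /health
        port: 8080                # illustrative port
      initialDelaySeconds: 3
      periodSeconds: 5
    livenessProbe:                # restarts the container if it wedges
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 20
      periodSeconds: 30
```

The readiness probe is what reduces the 503s: until /health responds, the pod is excluded from Service endpoints, so no traffic reaches a runner that is still starting.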
Wrong repository - creating PR on RedHatInsights/ambient-code-platform instead
Summary
This PR implements health probes for runner pods and improves error logging for INITIAL_PROMPT retries, matching the implementation from #1529.
Changes
Kubernetes Health Probes
- Added readiness and liveness probes using the /health endpoint on the runner's FastAPI server
Error Logging Improvements
- Enhanced retry error logging in app.py to include the exception type
- Exceptions like asyncio.TimeoutError previously logged as empty strings
- Now logs "error: TimeoutError: <details>" instead of "error: "
Benefits
- Prevents premature traffic routing to starting pods
- More informative error logs for debugging
- Better system resilience through health probes
Test Plan
- go vet passes
- gofmt passes
🤖 Generated with Claude Code
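The logging improvement can be sketched as follows. This is not the actual app.py code; the helper and logger names are illustrative. The point is that `str()` of an argument-less exception such as `asyncio.TimeoutError()` is an empty string, so the log line must include the exception's type name to stay informative:

```python
import asyncio
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("runner")


def format_error(exc: BaseException) -> str:
    # str(asyncio.TimeoutError()) is "", so prepend the type name.
    return f"{type(exc).__name__}: {exc}"


try:
    raise asyncio.TimeoutError()
except asyncio.TimeoutError as exc:
    # Logs "... error: TimeoutError: " instead of the old "... error: "
    log.warning("INITIAL_PROMPT retry failed, error: %s", format_error(exc))
```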