fix(runner): add health probes and improve INITIAL_PROMPT error logging #1534
maknop wants to merge 60 commits into
Conversation
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Creates kustomize overlay for deploying to hcmais01ue1 via app-interface:
- Uses Konflux images from redhat-services-prod/hcm-eng-prod-tenant
- Scales down in-cluster databases (using external RDS from app-interface Phase 2)
- Scales down MinIO (using external S3 from app-interface Phase 2)
- Includes CRDs, RBAC, routes, and all application components
- Patches operator to use Konflux runner image
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert kustomize overlay to OpenShift Template format for app-interface SaaS deployment. Split into two templates:
1. template-operator.yaml (CRDs, ClusterRoles, operator deployment)
   - Operator and ambient-runner images
   - Cluster-scoped resources (CRDs, RBAC)
   - Operator deployment and its ConfigMaps
2. template-services.yaml (Application services)
   - Backend, frontend, public-api, ambient-api-server images
   - All deployments, services, routes, configmaps
   - Scales in-cluster services to 0 (minio, postgresql, unleash)
Both templates use the IMAGE_TAG parameter (auto-generated from the git commit SHA) and support Konflux image gating through app-interface. This allows app-interface to use provider: openshift-template with proper parameter substitution instead of the directory provider, which doesn't run kustomize build.
The objects field must be a YAML array with proper list indicators. The previous version was missing the '-' prefix on array items, causing: 'unable to decode STDIN: json: cannot unmarshal object into Go struct field Template.objects of type []runtime.RawExtension'
Changes:
- Rebuild templates using the Python yaml library for correct formatting
- Objects now properly formatted as a YAML array with '- apiVersion:'
- Add validate.sh script for testing with oc process
- Both templates validated successfully
Generated from kustomize overlay output with proper YAML structure.
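To illustrate the shape the decoder expects, here is a minimal template sketch (the object names and kinds are made up for the example); each entry under `objects` must be a YAML list item with a `-` prefix:

```yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: example-template
objects:
  - apiVersion: v1          # note the '-' prefix on each object
    kind: Service
    metadata:
      name: example-service
parameters:
  - name: IMAGE_TAG
    required: true
```

Without the `-` prefix, `objects` parses as a single mapping rather than an array, which is exactly the `[]runtime.RawExtension` unmarshal error above.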
Remove minio, postgresql, unleash, ambient-api-server-db. Using external RDS and S3 from app-interface.
Removed 12 resources (4 Deployments, 4 Services, 3 PVCs, 1 Secret).
Remaining: ambient-api-server, backend-api, frontend, public-api
Disables OTEL metrics export by commenting out the OTEL_EXPORTER_OTLP_ENDPOINT environment variable in operator deployment manifests.
The operator was configured to send metrics to otel-collector.ambient-code.svc:4317, but this service does not exist in the cluster, causing repeated gRPC connection failures every 30 seconds with error: "failed to upload metrics: context deadline exceeded: rpc error: code = Unavailable desc = name resolver error: produced zero addresses"
With OTEL_EXPORTER_OTLP_ENDPOINT unset, InitMetrics() will skip metrics export and log "metrics export disabled" instead of throwing connection errors.
Changes:
- Comment out OTEL_EXPORTER_OTLP_ENDPOINT in base operator deployment
- Comment out OTEL_EXPORTER_OTLP_ENDPOINT in OpenShift template
- Add clarifying comment about re-enabling when collector is deployed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Add oauth-proxy component to frontend deployment (dashboard-ui port on 8443)
- Enable SSL for ambient-api-server RDS connection (db-sslmode=require)
- Set AMBIENT_ENV to 'stage' for ambient-api-server
- Enable OpenShift service-ca for ambient-api-server TLS cert provisioning
- Regenerate templates with new oauth-proxy and api-server patches
This enables:
- Authenticated access to frontend via OpenShift OAuth
- Secure connections to external RDS database
- Automatic TLS certificate rotation for ambient-api-server
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove postgresql, minio, unleash, and ambient-api-server-db resources from the services template. These services are scaled to 0 via kustomize patches because we use external RDS and S3 instead. Including them in the template causes app-interface to try deploying them, which fails imagePattern validation and wastes resources.
Excluded resources:
- Deployment/postgresql, Service/postgresql
- Deployment/minio, Service/minio, PVC/minio-data
- Deployment/unleash, Service/unleash
- Deployment/ambient-api-server-db, Service/ambient-api-server-db
Template now has 21 service resources (down from 30).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Switch from custom vault secrets to OpenShift service account-based OAuth:
- Use Red Hat's official ose-oauth-proxy-rhel9 image
- Use the service account token for the cookie secret (no vault needed)
- Enable HTTPS on the OAuth proxy with OpenShift service-ca auto-generated certs
- Add system:auth-delegator ClusterRoleBinding for OAuth delegation
- Add OAuth redirect reference annotation to the frontend ServiceAccount
- Fix service account reference from 'nginx' to 'frontend'
- Add missing NAMESPACE and UPSTREAM_TIMEOUT parameters
Benefits:
- No manual vault secret management
- Automatic TLS cert rotation via service-ca
- Standard OpenShift OAuth integration pattern
- Follows app-interface team recommendations
Files changed:
- frontend-rbac.yaml: Added OAuth annotations and auth-delegator binding
- oauth-proxy component patches: Updated to new configuration
- Templates: Regenerated with OAuth fixes (27 operator, 21 service resources)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The RDS credentials secret should not be in the OpenShift template - it's
provided by the external resource provider (terraform) in app-interface.
The namespace's externalResources section already defines:
- provider: rds
output_resource_name: ambient-code-rds
This automatically creates the secret with the correct RDS credentials.
Including the secret in the template with VAULT_INJECTED placeholders
caused deployment failures.
Changes:
- Excluded ambient-code-rds secret from template generation
- Template now has 20 service resources (down from 21)
- Deployment still references the secret via volumeMount (correct)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
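For context, the reference that stays in the Deployment might look like this sketch; the mount path and container name are illustrative assumptions, while the `ambient-code-rds` secret name comes from the external resource definition above:

```yaml
# Sketch: the Secret object is omitted from the template, but the
# Deployment keeps mounting it; terraform/app-interface creates it.
spec:
  template:
    spec:
      volumes:
        - name: rds-credentials
          secret:
            secretName: ambient-code-rds   # provisioned by the RDS provider
      containers:
        - name: ambient-api-server         # illustrative container name
          volumeMounts:
            - name: rds-credentials
              mountPath: /etc/rds          # illustrative mount path
              readOnly: true
```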
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Changes GCP service account configuration to align with app-interface deployment, where credentials are provided via Vault.
Changes:
- template-services.yaml: Update backend vertex-credentials secret name from 'ambient-vertex' to 'stage-gcp-creds' (matches Vault secret)
- template-operator.yaml: Update GOOGLE_APPLICATION_CREDENTIALS path to match Vault secret key name 'itpc-gcp-hcm-pe-eng.json'
The secret is provided by app-interface via:
path: engineering-productivity/ambient-code/stage-gcp-creds
This allows the backend and operator to use Vertex AI for Claude and Gemini API calls with the service account configured with roles/aiplatform.user permissions.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Configure OAuth proxy sidecar to inject the authentication token into forwarded requests, fixing 401 errors on /api/projects endpoints.
Changes:
- Add --pass-access-token=true flag to inject the X-Forwarded-Access-Token header
- Change upstream from frontend-service:3000 to localhost:3000 (correct sidecar pattern)
- Remove --request-logging to reduce log noise
Backend logs showed: tokenSource=none hasAuthHeader=false hasFwdToken=false
The backend expects the X-Forwarded-Access-Token header, which is now injected by the OAuth proxy for all authenticated requests.
Flow:
1. User authenticates via OpenShift OAuth ✓
2. OAuth proxy injects token header ✓ (new)
3. Frontend forwards token to backend API ✓ (fixed)
This resolves the 401 authentication errors while maintaining the working OpenShift OAuth integration.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed the '--set-authorization-header=true' option from the configuration.
Removed the '--scope=user:full' option from the configuration.
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
chore: Update konflux deps
Switch OAuth proxy from service account authentication to explicit SSO client credentials to enable the user:full scope.
Changes:
- Replace --openshift-service-account with --client-id=ambient-code
- Mount client_secret from the stage-sso-client Kubernetes secret
- Add --scope=user:full to grant full user permissions
- Mount the /etc/oauth-client volume for the client secret file
This allows users to create resources (AgenticSessions, ConfigMaps) in their project namespaces by providing the necessary OAuth scope.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove the ambient-frontend-oauth-delegator ClusterRoleBinding from the operator template, as it is now deployed via app-interface openshiftResources for better separation of concerns. Cluster-scoped resources should be managed outside of saas file deployments because they have impact on the whole cluster.
This ClusterRoleBinding grants the frontend service account the system:auth-delegator role needed for OAuth proxy token delegation. It is now defined in app-interface at:
resources/services/ambient-code-platform/ambient-frontend-oauth-delegator.clusterrolebinding.yaml
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
OAuth client updates
The pathChanged() CEL function was using incorrect glob syntax that prevented pipelines from triggering on component changes:
- Changed `./components/*/***` to `components/*/**` (removed leading `./` and fixed triple-asterisk to double-asterisk for recursive matching)
- Removed invalid root `Dockerfile` check (Dockerfiles are in component subdirectories, already covered by component globs)
PipelinesAsCode pathChanged() expects standard glob patterns relative to the repository root, with `**` for recursive directory matching.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
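A hedged sketch of where such a glob lives — the PipelineRun name, event, and branch guard here are illustrative, not the repository's actual trigger:

```yaml
# Illustrative Pipelines-as-Code trigger annotation; only the
# "components/*/**" glob is taken from the fix above.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: component-pull-request        # illustrative name
  annotations:
    pipelinesascode.tekton.dev/on-cel-expression: |
      event == "pull_request" &&
      target_branch == "main" &&
      "components/*/**".pathChanged()
```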
fix(ci): correct Tekton pathChanged glob patterns
When OTEL_EXPORTER_OTLP_ENDPOINT is unset, InitMetrics() was returning early without initializing metric instruments, leaving them as nil. This caused nil pointer panics when reconciliation code called metric recording functions like RecordSessionCreatedByUser(). The panic occurred at otel_metrics.go:424 when sessionsByUser.Add() was called on a nil counter during the reconcilePending phase.
Fix:
- When the OTEL endpoint is unset, initialize a no-op meter from the global provider
- Create all metric instruments as no-ops (silently ignore all calls)
- Prevents nil pointer panics while maintaining the same API contract
- No-op instruments have all the same methods but do nothing
OpenTelemetry provides a built-in no-op MeterProvider as the global default, which creates no-op instruments that safely ignore all metric recording calls without panicking.
Error before fix:
panic: runtime error: invalid memory address or nil pointer dereference
at RecordSessionCreatedByUser (/app/internal/controller/otel_metrics.go:424)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
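The underlying idea is the null-object pattern: return an instrument that implements the same interface but does nothing, instead of leaving it unset. A language-agnostic sketch of that pattern (in Python for brevity; the operator's actual fix uses OpenTelemetry's built-in Go no-op MeterProvider, and these class and function names are made up for the illustration):

```python
class NoopCounter:
    """Null-object stand-in: same interface as a real counter, does nothing."""
    def add(self, value, attributes=None):
        pass  # silently ignore the recording call


class RealCounter(NoopCounter):
    """Toy 'real' instrument for contrast."""
    def __init__(self):
        self.total = 0

    def add(self, value, attributes=None):
        self.total += value


def init_counter(otlp_endpoint):
    # Mirrors the fix: with no endpoint configured, return a no-op object
    # rather than leaving the instrument nil/None for callers to trip over.
    return RealCounter() if otlp_endpoint else NoopCounter()


sessions_by_user = init_counter(otlp_endpoint=None)
sessions_by_user.add(1)  # safe no-op; before the fix this was a nil-pointer panic
```

Callers keep the exact same `add()` call site whether metrics export is enabled or not, which is the "same API contract" the commit refers to.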
fix: initialize no-op metrics instruments when OTEL is disabled
Add permissions for mlflow.kubeflow.org Experiments and Runs CRDs to
the agentic-operator ClusterRole. The operator unconditionally grants
these permissions to session runner service accounts via Roles, but
cannot grant permissions it doesn't hold itself.
Without these ClusterRole permissions, session creation fails with:
user "system:serviceaccount:ambient-code:agentic-operator" is attempting
to grant RBAC permissions not currently held:
{APIGroups:["mlflow.kubeflow.org"], Resources:["experiments"], Verbs:[...]}
These are namespace-scoped CRDs from the Kubeflow MLflow Operator, used
for ML experiment tracking with Kubernetes-native RBAC authentication.
Sessions use these to log ML training runs, parameters, and metrics to
the MLflow tracking server.
Note: MLflow tracing is optional (MLFLOW_TRACING_ENABLED env var), but
the operator code unconditionally includes these permissions in session
Roles regardless of whether tracing is enabled.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
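A minimal sketch of the kind of rule this implies for the ClusterRole — the verb list is an assumption, since the error message above elides it:

```yaml
# Illustrative fragment of the agentic-operator ClusterRole; only the
# mlflow.kubeflow.org API group is taken from the commit above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agentic-operator
rules:
  - apiGroups: ["mlflow.kubeflow.org"]
    resources: ["experiments", "runs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

Because of RBAC privilege-escalation protection, the operator can only put these verbs into the session runner Roles it creates if its own ClusterRole already holds them.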
fix: add MLflow CRD permissions to operator ClusterRole
Add mlflow.kubeflow.org CRD permissions to the agentic-operator ClusterRole. The operator creates Roles in user namespaces that include MLflow permissions, but due to Kubernetes RBAC privilege escalation protection, it can only grant permissions it holds itself.
Previous commit 2af8216 added MLflow permissions to the backend-api ClusterRole, but missed adding them to agentic-operator. This causes session creation to fail with:
user "system:serviceaccount:ambient-code:agentic-operator" is attempting to grant RBAC permissions not currently held: {APIGroups:["mlflow.kubeflow.org"], Resources:["experiments"], Verbs:[...]}
The agentic-operator service account needs these permissions to create session runner Roles that include MLflow access.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…sterrole fix: add MLflow permissions to agentic-operator ClusterRole
The operator needs to create NetworkPolicies in user namespaces to isolate runner pods. Without this permission, session creation fails with:
networkpolicies.networking.k8s.io is forbidden: User "system:serviceaccount:ambient-code:agentic-operator" cannot create resource "networkpolicies" in API group "networking.k8s.io" in the namespace "mknop-ws"
This adds create/delete/get/list permissions for NetworkPolicies to the agentic-operator ClusterRole.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Configure oauth-proxy to route /api/* requests to backend-service instead of the Next.js frontend. Without this routing, all requests including /api/* go to localhost:3000, causing 503 errors because Next.js doesn't handle backend API routes.
Changes:
- Add --upstream=http://backend-service:8080/api/ before the default upstream
- Requests to /api/* now route to backend-service:8080
- All other requests continue to the Next.js frontend at localhost:3000
OAuth2-proxy processes upstreams in order and uses the path portion as a matching key. The /api/ path in the upstream URL matches any request starting with /api/, and the full request path is forwarded to the backend.
Request flow example:
Browser: GET https://ambient.corp.stage.redhat.com/api/projects/foo/sessions/bar
→ OAuth-proxy checks auth via --openshift-delegate-urls
→ Matches --upstream=http://backend-service:8080/api/ (longest match)
→ Forwards to: http://backend-service:8080/api/projects/foo/sessions/bar
Fixes browser console errors:
GET /api/projects/.../git/status [503 Service Unavailable]
AG-UI stream error: Connection error
The connection to .../agui/events was interrupted
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
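Putting the flags from this and the earlier oauth-proxy commits together, the sidecar's argument list plausibly looks like this sketch (surrounding container fields and any flags not mentioned in the commits are omitted):

```yaml
# Illustrative oauth-proxy sidecar args; ordering matters, since
# upstreams are matched by path with the longest match winning.
args:
  - --upstream=http://backend-service:8080/api/  # /api/* → backend API
  - --upstream=http://localhost:3000             # everything else → Next.js frontend
  - --pass-access-token=true                     # inject X-Forwarded-Access-Token
```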
fix: add backend API routing to oauth-proxy upstream
Remove --openshift-delegate-urls parameter from oauth-proxy that was
blocking /api/* requests with "no resource mapped path" errors.
Issue:
- openshift-delegate-urls={"/api":{"resource":"projects","verb":"list"}}
only matches /api exactly, not /api/* subpaths
- All /api/* requests were returning 503 even though backend received
and processed them successfully (200 OK in backend logs)
- oauth-proxy logs showed: "no resource mapped path"
Solution:
OAuth-proxy still provides authentication (OAuth login required for all
requests) and passes the access token to the backend via --pass-access-token.
The backend handles its own fine-grained authorization based on the token,
so the blanket openshift-delegate-urls check is redundant and overly
restrictive.
Authorization flow after this change:
1. User authenticates via OAuth (enforced by oauth-proxy)
2. oauth-proxy passes access token to backend
3. Backend validates token and checks user permissions per endpoint
4. Backend returns appropriate response (200, 403, 404, etc.)
This matches the backend's existing authorization model where different
API endpoints have different permission requirements that can't be
expressed in a single openshift-delegate-urls pattern.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…urls fix: remove overly restrictive openshift-delegate-urls check
Increased the initial prompt deploy delay to 10 seconds
Kubernetes Health Probes:
- Added readiness probe (3s initial delay, 5s period)
- Added liveness probe (20s initial delay, 30s period)
- Prevents the Service routing traffic before FastAPI is ready
- Reduces 503 "runner unavailable" errors
Error Logging Improvements:
- Enhanced retry error logging to include the exception type
- Previously logged empty strings for exceptions like asyncio.TimeoutError
- Now logs: "error: TimeoutError: <details>" instead of "error: "
Benefits:
- Prevents premature traffic routing to starting pods
- More informative error logs for debugging
- Better system resilience through health probes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
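Given the stated timings and the /health endpoint named in the PR description, the probe configuration on the runner container plausibly looks like this sketch (the container name and port are assumptions):

```yaml
# Illustrative probe block for the runner container.
containers:
  - name: ambient-runner          # illustrative name
    readinessProbe:               # gates Service traffic until FastAPI is up
      httpGet:
        path: /health
        port: 8080                # illustrative port
      initialDelaySeconds: 3
      periodSeconds: 5
    livenessProbe:                # restarts the container if it wedges
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 20
      periodSeconds: 30
```

The readiness probe is what reduces the 503s: until /health responds, the pod is excluded from Service endpoints, so no traffic reaches a runner that is still starting.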
Wrong repository - creating PR on RedHatInsights/ambient-code-platform instead
Summary
This PR implements health probes for runner pods and improves error logging for INITIAL_PROMPT retries, matching the implementation from #1529.
Changes
Kubernetes Health Probes
- Added readiness and liveness probes using the /health endpoint on the runner's FastAPI server
Error Logging Improvements
- Enhanced retry error logging in app.py to include the exception type
- Exceptions like asyncio.TimeoutError previously logged as empty strings
- Now logs "error: TimeoutError: <details>" instead of "error: "
Benefits
- Prevents premature traffic routing to starting pods
- More informative error logs for debugging
- Better system resilience through health probes
Test Plan
- go vet passes
- gofmt passes
🤖 Generated with Claude Code
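The logging improvement can be sketched as follows. This is not the actual app.py code; the helper and logger names are illustrative. The point is that `str()` of an argument-less exception such as `asyncio.TimeoutError()` is an empty string, so the log line must include the exception's type name to stay informative:

```python
import asyncio
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("runner")


def format_error(exc: BaseException) -> str:
    # str(asyncio.TimeoutError()) is "", so prepend the type name.
    return f"{type(exc).__name__}: {exc}"


try:
    raise asyncio.TimeoutError()
except asyncio.TimeoutError as exc:
    # Logs "... error: TimeoutError: " instead of the old "... error: "
    log.warning("INITIAL_PROMPT retry failed, error: %s", format_error(exc))
```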