fix(runner): add health probes and improve INITIAL_PROMPT error logging #1534

Closed
maknop wants to merge 60 commits into ambient-code:main from RedHatInsights:fix/add-health-probes-and-improve-logging

Conversation

Contributor

@maknop maknop commented May 8, 2026

Summary

This PR implements health probes for runner pods and improves error logging for INITIAL_PROMPT retries, matching the implementation from #1529.

Changes

Kubernetes Health Probes

  • Added readiness probe to runner container (3s initial delay, 5s period)
  • Added liveness probe to runner container (20s initial delay, 30s period)
  • Probes check /health endpoint on the runner's FastAPI server
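
The probe timings above map onto the standard Kubernetes probe fields (`httpGet`, `initialDelaySeconds`, `periodSeconds`). The actual change lives in the operator's Go code; the following is only a rough Python sketch of the resulting spec. `RUNNER_PORT` and `http_probe` are hypothetical names, and the port value is an assumption, since the PR only refers to "the runner port".

```python
# Sketch only: mirrors the probe timings listed in this PR as plain dicts.
# RUNNER_PORT is a hypothetical value; the PR does not state the port number.
RUNNER_PORT = 8000


def http_probe(initial_delay_seconds: int, period_seconds: int) -> dict:
    """Build a Kubernetes-style HTTP GET probe that targets /health."""
    return {
        "httpGet": {"path": "/health", "port": RUNNER_PORT},
        "initialDelaySeconds": initial_delay_seconds,
        "periodSeconds": period_seconds,
    }


# Readiness gates Service traffic; liveness triggers container restarts.
readiness_probe = http_probe(initial_delay_seconds=3, period_seconds=5)
liveness_probe = http_probe(initial_delay_seconds=20, period_seconds=30)

container_patch = {
    "readinessProbe": readiness_probe,
    "livenessProbe": liveness_probe,
}
```

The short readiness delay lets a pod start receiving traffic soon after FastAPI comes up, while the longer liveness delay is meant to avoid restarting a runner that is still starting.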

Error Logging Improvements

  • Enhanced retry error logging in app.py to include exception type
  • Previously logged empty strings for exceptions like asyncio.TimeoutError
  • Now logs: "error: TimeoutError: <details>" instead of "error: "
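
To illustrate why the old message was empty: `str()` of a bare `asyncio.TimeoutError` is an empty string, so only the type name carries information. A minimal sketch of the idea follows; `describe_error`, `slow_call`, and `attempt_once` are hypothetical helpers for illustration and are not taken from app.py.

```python
import asyncio
import logging

logger = logging.getLogger("runner.retry_example")


def describe_error(exc: BaseException) -> str:
    """Format an exception so the type is visible even when str(exc) is empty."""
    return f"{type(exc).__name__}: {exc}"


async def slow_call() -> None:
    # Stand-in for a request that takes longer than the allowed timeout.
    await asyncio.sleep(1)


async def attempt_once() -> str:
    try:
        await asyncio.wait_for(slow_call(), timeout=0.01)
    except asyncio.TimeoutError as e:
        # str(e) is "" here, so logging only the message would show "error: ".
        message = describe_error(e)
        logger.warning("attempt failed, error: %s", message)
        return message
    return "ok"


result = asyncio.run(attempt_once())
```

Including `type(e).__name__` keeps the log line useful for any exception whose message is empty, not just timeouts.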

Benefits

  • Prevents premature traffic routing: Service won't route to pods until FastAPI is ready
  • Reduces 503 errors: Eliminates "runner unavailable" errors during pod startup
  • Better debugging: More informative error logs with exception types
  • Self-healing: Liveness probe enables automatic pod restarts on failure

Test Plan

  • Code compiles successfully (go vet passes)
  • Code formatting is correct (gofmt passes)
  • Deploy to test cluster and verify health probes are configured
  • Verify no 503 errors during pod startup
  • Verify error logs include exception types during connection failures

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Infrastructure & Build Improvements

    • Added automated build pipelines for all components with security scanning and validation.
    • Enabled TLS/SSL support for database connections and service communication.
  • Security & Authentication

    • Implemented OAuth proxy authentication with OpenShift integration for frontend access.
    • Enhanced certificate management and TLS termination for API services.
  • Configuration Updates

    • Updated database configuration to support external RDS connectivity.
    • Added routes and service exposure for API, backend, and frontend services.

red-hat-konflux and others added 30 commits April 22, 2026 08:44
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Signed-off-by: red-hat-konflux <konflux@no-reply.konflux-ci.dev>
Creates kustomize overlay for deploying to hcmais01ue1 via app-interface:
- Uses Konflux images from redhat-services-prod/hcm-eng-prod-tenant
- Scales down in-cluster databases (using external RDS from app-interface Phase 2)
- Scales down MinIO (using external S3 from app-interface Phase 2)
- Includes CRDs, RBAC, routes, and all application components
- Patches operator to use Konflux runner image

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert kustomize overlay to OpenShift Template format for app-interface
SaaS deployment. Split into two templates:

1. template-operator.yaml (CRDs, ClusterRoles, operator deployment)
   - Operator and ambient-runner images
   - Cluster-scoped resources (CRDs, RBAC)
   - Operator deployment and its ConfigMaps

2. template-services.yaml (Application services)
   - Backend, frontend, public-api, ambient-api-server images
   - All deployments, services, routes, configmaps
   - Scales in-cluster services to 0 (minio, postgresql, unleash)

Both templates use IMAGE_TAG parameter (auto-generated from git commit SHA)
and support Konflux image gating through app-interface.

This allows app-interface to use provider: openshift-template with
proper parameter substitution instead of the directory provider which
doesn't run kustomize build.
Creates kustomize overlay for deploying to hcmais01ue1 via app-interface:
- Uses Konflux images from redhat-services-prod/hcm-eng-prod-tenant
- Scales down in-cluster databases (using external RDS from app-interface Phase 2)
- Scales down MinIO (using external S3 from app-interface Phase 2)
- Includes CRDs, RBAC, routes, and all application components
- Patches operator to use Konflux runner image

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The objects field must be a YAML array with proper list indicators.
Previous version was missing the '-' prefix on array items, causing:
'unable to decode STDIN: json: cannot unmarshal object into Go struct
field Template.objects of type []runtime.RawExtension'

Changes:
- Rebuild templates using Python yaml library for correct formatting
- Objects now properly formatted as YAML array with '- apiVersion:'
- Add validate.sh script for testing with oc process
- Both templates validated successfully

Generated from kustomize overlay output with proper YAML structure.
Remove minio, postgresql, unleash, ambient-api-server-db.
Using external RDS and S3 from app-interface.

Removed 12 resources (4 Deployments, 4 Services, 3 PVCs, 1 Secret)
Remaining: ambient-api-server, backend-api, frontend, public-api
Disables OTEL metrics export by commenting out OTEL_EXPORTER_OTLP_ENDPOINT
environment variable in operator deployment manifests.

The operator was configured to send metrics to otel-collector.ambient-code.svc:4317,
but this service does not exist in the cluster, causing repeated gRPC connection
failures every 30 seconds with error:
"failed to upload metrics: context deadline exceeded: rpc error: code = Unavailable
desc = name resolver error: produced zero addresses"

With OTEL_EXPORTER_OTLP_ENDPOINT unset, InitMetrics() will skip metrics export
and log "metrics export disabled" instead of throwing connection errors.

Changes:
- Comment out OTEL_EXPORTER_OTLP_ENDPOINT in base operator deployment
- Comment out OTEL_EXPORTER_OTLP_ENDPOINT in OpenShift template
- Add clarifying comment about re-enabling when collector is deployed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Add oauth-proxy component to frontend deployment (dashboard-ui port on 8443)
- Enable SSL for ambient-api-server RDS connection (db-sslmode=require)
- Set AMBIENT_ENV to 'stage' for ambient-api-server
- Enable OpenShift service-ca for ambient-api-server TLS cert provisioning
- Regenerate templates with new oauth-proxy and api-server patches

This enables:
- Authenticated access to frontend via OpenShift OAuth
- Secure connections to external RDS database
- Automatic TLS certificate rotation for ambient-api-server

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove postgresql, minio, unleash, and ambient-api-server-db resources
from the services template. These services are scaled to 0 via kustomize
patches because we use external RDS and S3 instead.

Including them in the template causes app-interface to try deploying
them, which fails imagePattern validation and wastes resources.

Excluded resources:
- Deployment/postgresql, Service/postgresql
- Deployment/minio, Service/minio, PVC/minio-data
- Deployment/unleash, Service/unleash
- Deployment/ambient-api-server-db, Service/ambient-api-server-db

Template now has 21 service resources (down from 30).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Switch from custom vault secrets to OpenShift service account-based OAuth:
- Use Red Hat's official ose-oauth-proxy-rhel9 image
- Use service account token for cookie secret (no vault needed)
- Enable HTTPS on OAuth proxy with OpenShift service-ca auto-generated certs
- Add system:auth-delegator ClusterRoleBinding for OAuth delegation
- Add OAuth redirect reference annotation to frontend ServiceAccount
- Fix service account reference from 'nginx' to 'frontend'
- Add missing NAMESPACE and UPSTREAM_TIMEOUT parameters

Benefits:
- No manual vault secret management
- Automatic TLS cert rotation via service-ca
- Standard OpenShift OAuth integration pattern
- Follows app-interface team recommendations

Files changed:
- frontend-rbac.yaml: Added OAuth annotations and auth-delegator binding
- oauth-proxy component patches: Updated to new configuration
- Templates: Regenerated with OAuth fixes (27 operator, 21 service resources)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The RDS credentials secret should not be in the OpenShift template - it's
provided by the external resource provider (terraform) in app-interface.

The namespace's externalResources section already defines:
  - provider: rds
    output_resource_name: ambient-code-rds

This automatically creates the secret with the correct RDS credentials.
Including the secret in the template with VAULT_INJECTED placeholders
caused deployment failures.

Changes:
- Excluded ambient-code-rds secret from template generation
- Template now has 20 service resources (down from 21)
- Deployment still references the secret via volumeMount (correct)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Changes GCP service account configuration to align with app-interface
deployment where credentials are provided via Vault.

Changes:
- template-services.yaml: Update backend vertex-credentials secret name
  from 'ambient-vertex' to 'stage-gcp-creds' (matches Vault secret)
- template-operator.yaml: Update GOOGLE_APPLICATION_CREDENTIALS path
  to match Vault secret key name 'itpc-gcp-hcm-pe-eng.json'

The secret is provided by app-interface via:
  path: engineering-productivity/ambient-code/stage-gcp-creds

This allows the backend and operator to use Vertex AI for Claude and
Gemini API calls with the service account configured with
roles/aiplatform.user permissions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Configure OAuth proxy sidecar to inject authentication token into
forwarded requests, fixing 401 errors on /api/projects endpoints.

Changes:
- Add --pass-access-token=true flag to inject X-Forwarded-Access-Token header
- Change upstream from frontend-service:3000 to localhost:3000 (correct sidecar pattern)
- Remove --request-logging to reduce log noise

Backend logs showed:
  tokenSource=none hasAuthHeader=false hasFwdToken=false

The backend expects the X-Forwarded-Access-Token header, which is now
injected by the OAuth proxy for all authenticated requests.

Flow:
1. User authenticates via OpenShift OAuth ✓
2. OAuth proxy injects token header ✓ (new)
3. Frontend forwards token to backend API ✓ (fixed)

This resolves the 401 authentication errors while maintaining the
working OpenShift OAuth integration.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed the '--set-authorization-header=true' option from the configuration.
maknop and others added 25 commits April 22, 2026 08:44
Removed the '--scope=user:full' option from the configuration.
Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
Switch OAuth proxy from service account authentication to explicit
SSO client credentials to enable user:full scope.

Changes:
- Replace --openshift-service-account with --client-id=ambient-code
- Mount client_secret from stage-sso-client Kubernetes secret
- Add --scope=user:full to grant full user permissions
- Mount /etc/oauth-client volume for client secret file

This allows users to create resources (AgenticSessions, ConfigMaps)
in their project namespaces by providing the necessary OAuth scope.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove ambient-frontend-oauth-delegator ClusterRoleBinding from the
operator template as it is now deployed via app-interface
openshiftResources for better separation of concerns.

Cluster-scoped resources should be managed outside of saas file
deployments as they have impact on the whole cluster.

This ClusterRoleBinding grants the frontend service account the
system:auth-delegator role needed for OAuth proxy token delegation.
It is now defined in app-interface at:
resources/services/ambient-code-platform/ambient-frontend-oauth-delegator.clusterrolebinding.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The pathChanged() CEL function was using incorrect glob syntax that
prevented pipelines from triggering on component changes:
- Changed `./components/*/***` to `components/*/**` (removed leading
  `./` and fixed triple-asterisk to double-asterisk for recursive matching)
- Removed invalid root `Dockerfile` check (Dockerfiles are in component
  subdirectories, already covered by component globs)

PipelinesAsCode pathChanged() expects standard glob patterns relative
to repository root, with `**` for recursive directory matching.
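
The difference between the two patterns can be sketched with a small glob matcher in Python. This is an illustration of the semantics described in this commit message (`*` stays within one path segment, `**` crosses directories, paths are relative to the repository root); it is not the matcher that Pipelines-as-Code actually uses, and `glob_to_regex` and `path_matches` are hypothetical helpers.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob where '**' crosses '/' and '*' stays in one segment."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # recursive: may include '/'
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # single segment: no '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def path_matches(pattern: str, path: str) -> bool:
    """Return True when the whole path matches the glob pattern."""
    return glob_to_regex(pattern).fullmatch(path) is not None


# Changed paths are assumed to be reported relative to the repository root.
changed = "components/operator/internal/handlers/sessions.go"
old_ok = path_matches("./components/*/***", changed)
new_ok = path_matches("components/*/**", changed)
```

Under these assumptions the old pattern fails because of its leading `./`, while the corrected pattern matches any file below a component directory.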

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
fix(ci): correct Tekton pathChanged glob patterns
When OTEL_EXPORTER_OTLP_ENDPOINT is unset, InitMetrics() was returning
early without initializing metric instruments, leaving them as nil.
This caused nil pointer panics when reconciliation code called metric
recording functions like RecordSessionCreatedByUser().

The panic occurred at otel_metrics.go:424 when sessionsByUser.Add()
was called on a nil counter during reconcilePending phase.

Fix:
- When OTEL endpoint is unset, initialize no-op meter from global provider
- Create all metric instruments as no-ops (silently ignore all calls)
- Prevents nil pointer panics while maintaining same API contract
- No-op instruments have all the same methods but do nothing

OpenTelemetry provides a built-in no-op MeterProvider as the global
default, which creates no-op instruments that safely ignore all metric
recording calls without panicking.
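
The same idea, expressed as a null-object pattern, might look like the Python sketch below. It is not the operator's Go code: `RecordingCounter`, `NoOpCounter`, and `init_counter` are hypothetical names used to show why callers no longer need a nil check.

```python
from typing import Optional


class RecordingCounter:
    """Stand-in for an exporting counter; keeps a running total in memory."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, amount: int, attributes: Optional[dict] = None) -> None:
        self.total += amount


class NoOpCounter:
    """Same interface as RecordingCounter, but every call is ignored."""

    def add(self, amount: int, attributes: Optional[dict] = None) -> None:
        return None


def init_counter(endpoint: Optional[str]):
    """Always return a usable counter, even when no endpoint is configured."""
    if not endpoint:
        return NoOpCounter()
    return RecordingCounter()


# With no endpoint, recording is a harmless no-op instead of a crash.
disabled = init_counter(None)
disabled.add(1, {"user": "example"})

# With an endpoint (hypothetical address), the same call is recorded.
enabled = init_counter("collector.example:4317")
enabled.add(2)
```

Because both classes expose the same `add` method, recording code can call it unconditionally regardless of whether export is enabled.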

Error before fix:
  panic: runtime error: invalid memory address or nil pointer dereference
  at RecordSessionCreatedByUser (/app/internal/controller/otel_metrics.go:424)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
fix: initialize no-op metrics instruments when OTEL is disabled
Add permissions for mlflow.kubeflow.org Experiments and Runs CRDs to
the agentic-operator ClusterRole. The operator unconditionally grants
these permissions to session runner service accounts via Roles, but
cannot grant permissions it doesn't hold itself.

Without these ClusterRole permissions, session creation fails with:
  user "system:serviceaccount:ambient-code:agentic-operator" is attempting
  to grant RBAC permissions not currently held:
  {APIGroups:["mlflow.kubeflow.org"], Resources:["experiments"], Verbs:[...]}

These are namespace-scoped CRDs from the Kubeflow MLflow Operator, used
for ML experiment tracking with Kubernetes-native RBAC authentication.
Sessions use these to log ML training runs, parameters, and metrics to
the MLflow tracking server.

Note: MLflow tracing is optional (MLFLOW_TRACING_ENABLED env var), but
the operator code unconditionally includes these permissions in session
Roles regardless of whether tracing is enabled.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
fix: add MLflow CRD permissions to operator ClusterRole
Add mlflow.kubeflow.org CRD permissions to the agentic-operator ClusterRole.
The operator creates Roles in user namespaces that include MLflow permissions,
but due to Kubernetes RBAC privilege escalation protection, it can only grant
permissions it holds itself.

Previous commit 2af8216 added MLflow permissions to backend-api ClusterRole,
but missed adding them to agentic-operator. This causes session creation to
fail with:

  user "system:serviceaccount:ambient-code:agentic-operator" is attempting
  to grant RBAC permissions not currently held:
  {APIGroups:["mlflow.kubeflow.org"], Resources:["experiments"], Verbs:[...]}

The agentic-operator service account needs these permissions to create
session runner Roles that include MLflow access.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…sterrole

fix: add MLflow permissions to agentic-operator ClusterRole
The operator needs to create NetworkPolicies in user namespaces to
isolate runner pods. Without this permission, session creation fails
with:

  networkpolicies.networking.k8s.io is forbidden:
  User "system:serviceaccount:ambient-code:agentic-operator"
  cannot create resource "networkpolicies" in API group
  "networking.k8s.io" in the namespace "mknop-ws"

This adds create/delete/get/list permissions for NetworkPolicies
to the agentic-operator ClusterRole.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Configure oauth-proxy to route /api/* requests to backend-service instead
of the Next.js frontend. Without this routing, all requests including /api/*
go to localhost:3000, causing 503 errors because Next.js doesn't handle
backend API routes.

Changes:
- Add --upstream=http://backend-service:8080/api/ before default upstream
- Requests to /api/* now route to backend-service:8080
- All other requests continue to Next.js frontend at localhost:3000

OAuth2-proxy processes upstreams in order and uses the path portion as a
matching key. The /api/ path in the upstream URL matches any request
starting with /api/, and the full request path is forwarded to the backend.

Request flow example:
  Browser: GET https://ambient.corp.stage.redhat.com/api/projects/foo/sessions/bar
  → OAuth-proxy checks auth via --openshift-delegate-urls
  → Matches --upstream=http://backend-service:8080/api/ (longest match)
  → Forwards to: http://backend-service:8080/api/projects/foo/sessions/bar
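
The routing described in this commit message can be modeled as a longest-prefix lookup. The Python sketch below only illustrates that described behavior and is not oauth-proxy's implementation; `UPSTREAMS` and `pick_upstream` are hypothetical names, and the `/` entry assumes the default upstream is the Next.js frontend at localhost:3000.

```python
from typing import Dict

# Path prefix -> upstream base URL, taken from the commit message above.
UPSTREAMS: Dict[str, str] = {
    "/api/": "http://backend-service:8080",
    "/": "http://localhost:3000",
}


def pick_upstream(request_path: str) -> str:
    """Return the forwarded URL using the longest matching path prefix."""
    matches = [prefix for prefix in UPSTREAMS if request_path.startswith(prefix)]
    best = max(matches, key=len)
    # The full request path is kept when forwarding to the chosen upstream.
    return UPSTREAMS[best] + request_path


api_target = pick_upstream("/api/projects/foo/sessions/bar")
page_target = pick_upstream("/projects")
```

Here `/api/...` requests resolve to the backend service, and everything else falls through to the frontend.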

Fixes browser console errors:
  GET /api/projects/.../git/status [503 Service Unavailable]
  AG-UI stream error: Connection error
  The connection to .../agui/events was interrupted

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Chris Mitchell <cmitchel@redhat.com>
fix: add backend API routing to oauth-proxy upstream
Remove --openshift-delegate-urls parameter from oauth-proxy that was
blocking /api/* requests with "no resource mapped path" errors.

Issue:
- openshift-delegate-urls={"/api":{"resource":"projects","verb":"list"}}
  only matches /api exactly, not /api/* subpaths
- All /api/* requests were returning 503 even though backend received
  and processed them successfully (200 OK in backend logs)
- oauth-proxy logs showed: "no resource mapped path"

Solution:
OAuth-proxy still provides authentication (OAuth login required for all
requests) and passes the access token to the backend via --pass-access-token.
The backend handles its own fine-grained authorization based on the token,
so the blanket openshift-delegate-urls check is redundant and overly
restrictive.

Authorization flow after this change:
1. User authenticates via OAuth (enforced by oauth-proxy)
2. oauth-proxy passes access token to backend
3. Backend validates token and checks user permissions per endpoint
4. Backend returns appropriate response (200, 403, 404, etc.)

This matches the backend's existing authorization model where different
API endpoints have different permission requirements that can't be
expressed in a single openshift-delegate-urls pattern.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…urls

fix: remove overly restrictive openshift-delegate-urls check
Increased the initial prompt delay (INITIAL_PROMPT_DELAY_SECONDS) to 10 seconds
Kubernetes Health Probes:
- Added readiness probe (3s initial delay, 5s period)
- Added liveness probe (20s initial delay, 30s period)
- Prevents Service routing traffic before FastAPI is ready
- Reduces 503 "runner unavailable" errors

Error Logging Improvements:
- Enhanced retry error logging to include exception type
- Previously logged empty strings for exceptions like asyncio.TimeoutError
- Now logs: "error: TimeoutError: <details>" instead of "error: "

Benefits:
- Prevents premature traffic routing to starting pods
- More informative error logs for debugging
- Better system resilience through health probes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

netlify Bot commented May 8, 2026

Deploy Preview for cheerful-kitten-f556a0 ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 0a4d259 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/69fdf695abec380009e6e203 |
| 😎 Deploy Preview | https://deploy-preview-1534--cheerful-kitten-f556a0.netlify.app |

Contributor

coderabbitai Bot commented May 8, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 68d85295-bf36-490f-870a-fddec3bb3802

📥 Commits

Reviewing files that changed from the base of the PR and between 070520c and 0a4d259.

📒 Files selected for processing (51)
  • .tekton/ambient-code-ambient-api-server-main-pull-request.yaml
  • .tekton/ambient-code-ambient-api-server-main-push.yaml
  • .tekton/ambient-code-ambient-runner-main-pull-request.yaml
  • .tekton/ambient-code-ambient-runner-main-push.yaml
  • .tekton/ambient-code-backend-main-pull-request.yaml
  • .tekton/ambient-code-backend-main-push.yaml
  • .tekton/ambient-code-frontend-main-pull-request.yaml
  • .tekton/ambient-code-frontend-main-push.yaml
  • .tekton/ambient-code-operator-main-pull-request.yaml
  • .tekton/ambient-code-operator-main-push.yaml
  • .tekton/ambient-code-public-api-main-pull-request.yaml
  • .tekton/ambient-code-public-api-main-push.yaml
  • components/ambient-api-server/templates/db-template.yml
  • components/manifests/README.md
  • components/manifests/base/core/ambient-api-server-service.yml
  • components/manifests/base/core/operator-deployment.yaml
  • components/manifests/base/platform/ambient-api-server-db.yml
  • components/manifests/base/platform/ambient-api-server-secrets.yml
  • components/manifests/base/rbac/frontend-rbac.yaml
  • components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml
  • components/manifests/components/ambient-api-server-db/ambient-api-server-init-db-patch.yaml
  • components/manifests/components/ambient-api-server-db/kustomization.yaml
  • components/manifests/components/oauth-proxy/frontend-oauth-deployment-patch.yaml
  • components/manifests/components/oauth-proxy/frontend-oauth-service-patch.yaml
  • components/manifests/overlays/app-interface/ambient-api-server-db-secret-patch.yaml
  • components/manifests/overlays/app-interface/ambient-api-server-env-patch.yaml
  • components/manifests/overlays/app-interface/ambient-api-server-route.yaml
  • components/manifests/overlays/app-interface/ambient-api-server-service-ca-patch.yaml
  • components/manifests/overlays/app-interface/ambient-api-server-ssl-patch.yaml
  • components/manifests/overlays/app-interface/backend-route.yaml
  • components/manifests/overlays/app-interface/kustomization.yaml
  • components/manifests/overlays/app-interface/namespace-patch.yaml
  • components/manifests/overlays/app-interface/namespace.yaml
  • components/manifests/overlays/app-interface/operator-config-openshift.yaml
  • components/manifests/overlays/app-interface/operator-runner-image-patch.yaml
  • components/manifests/overlays/app-interface/public-api-route.yaml
  • components/manifests/overlays/app-interface/route.yaml
  • components/manifests/overlays/kind/api-server-db-security-patch.yaml
  • components/manifests/overlays/kind/api-server-no-jwt-patch.yaml
  • components/manifests/overlays/local-dev/ambient-api-server-db-credentials-patch.yaml
  • components/manifests/overlays/local-dev/ambient-api-server-db-json-patch.yaml
  • components/manifests/overlays/local-dev/ambient-api-server-init-db-patch.yaml
  • components/manifests/overlays/production/ambient-api-server-jwt-args-patch.yaml
  • components/manifests/overlays/production/ambient-api-server-migration-ssl-patch.yaml
  • components/manifests/overlays/production/kustomization.yaml
  • components/manifests/templates/template-operator.yaml
  • components/manifests/templates/template-services.yaml
  • components/manifests/templates/validate.sh
  • components/operator/internal/controller/otel_metrics.go
  • components/operator/internal/handlers/sessions.go
  • components/runners/ambient-runner/ambient_runner/app.py

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting


📝 Walkthrough


This PR introduces Tekton CI/CD pipelines for building multiple service components, refactors database secret naming from ambient-api-server-db to ambient-code-rds across manifests, configures OpenShift OAuth proxy and RBAC for the frontend, sets up an app-interface staging overlay with routes and TLS, defines operator and service templates with CRDs, and adds runtime health probes and observability improvements.

Changes

Tekton CI/CD Pipeline Definitions

| Layer | File(s) | Summary |
| --- | --- | --- |
| Pipeline Configuration Structure | .tekton/ambient-code-*-{pull-request,push}.yaml (12 files) | Twelve new PipelineRun definitions for API server, runner, backend, frontend, operator, and public-api components on PR and push events. Each defines trigger annotations, parameters, task graph orchestration, and conditional security scan gating. |
| Task Graph & Execution Flow | .tekton/ambient-code-*-{pull-request,push}.yaml | Tasks chain init → clone → prefetch → build → index → (optional source image) → security checks → tag/push/sign, with skip-checks and Coverity availability gating for conditional scans. |
| Validation | components/manifests/templates/validate.sh | New bash script validates both template manifests using oc process with test image tag. |

Database Secret Refactoring (ambient-api-server-db → ambient-code-rds)

| Layer | File(s) | Summary |
| --- | --- | --- |
| Template Parameters | components/ambient-api-server/templates/db-template.yml | DATABASE_SERVICE_NAME default changed from ambient-api-server-db to ambient-code-rds, affecting generated resource names and labels. |
| Base Manifests | components/manifests/base/core/ambient-api-server-service.yml, components/manifests/base/platform/ambient-api-server-db.yml, components/manifests/base/platform/ambient-api-server-secrets.yml | Deployment/Service secret references and Secret metadata name updated to use ambient-code-rds; PostgreSQL env vars (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB) rewired to new secret. |
| Component Patches | components/manifests/components/ambient-api-server-db/*.yaml | JSON/container patches update PostgreSQL image, env vars, probes, and volume mounts to reference ambient-code-rds secret with POSTGRESQL_* naming convention. |
| Environment Overlays | components/manifests/overlays/{kind,local-dev}/*.yaml | Deployment patches update secret references for database credentials across kind and local-dev environments; init-db and security patches target ambient-code-rds. |
| Production Overlay | components/manifests/overlays/production/ambient-api-server-jwt-args-patch.yaml | Switches DB SSL mode from disable to require for external RDS; new migration SSL patch applies --db-sslmode=require. |
| App-Interface Overlay | components/manifests/overlays/app-interface/ambient-api-server-db-secret-patch.yaml | New Vault-injected Secret manifest for ambient-code-rds with RDS connection placeholders. |
| Documentation | components/manifests/README.md | Updated to reflect ambient-code-rds as the target secret and component scope. |

Frontend OAuth & RBAC Updates

| Layer | File(s) | Summary |
| --- | --- | --- |
| RBAC Binding Changes | components/manifests/base/rbac/frontend-rbac.yaml | ServiceAccount annotation added for OpenShift OAuth redirect; ClusterRoleBinding renamed from ambient-frontend-auth to ambient-frontend-oauth-delegator and bound to system:auth-delegator role. |
| OAuth Proxy Configuration | components/manifests/components/oauth-proxy/frontend-oauth-deployment-patch.yaml | OAuth proxy sidecar image upgraded; startup args rewritten to use OpenShift provider, TLS cert/key paths, service-account-token cookie secret, delegate URLs, and 5m upstream timeout; probes switched to HTTPS with adjusted timing. |
| Service TLS Certificate | components/manifests/components/oauth-proxy/frontend-oauth-service-patch.yaml | Service annotation key/value updated from service.beta.openshift.io/serving-cert-secret-name: dashboard-proxy-tls to service.alpha.openshift.io/serving-cert-secret-name: frontend-proxy-tls. |

App-Interface Staging Overlay

| Layer | File(s) | Summary |
| --- | --- | --- |
| Namespace & Labels | components/manifests/overlays/app-interface/{namespace,namespace-patch}.yaml | New Namespace ambient-code with environment: stage and service: ambient-code-platform labels; includes annotations for app identity. |
| Route Definitions | components/manifests/overlays/app-interface/{ambient-api-server-route,backend-route,public-api-route,route}.yaml | Four new OpenShift Routes: API server (HTTP + gRPC), backend, public-api, and frontend; all configured with TLS edge termination and redirect behavior. |
| API Server Configuration | components/manifests/overlays/app-interface/{ambient-api-server-env-patch,ambient-api-server-ssl-patch,ambient-api-server-service-ca-patch}.yaml | Patches set AMBIENT_ENV=stage, add --db-sslmode=require to migration/server containers, and configure TLS certificate auto-provisioning via ambient-api-server-tls secret. |
| Operator Configuration | components/manifests/overlays/app-interface/{operator-config-openshift,operator-runner-image-patch}.yaml | ConfigMap enables Vertex AI with Google credentials path; Deployment patch sets AMBIENT_CODE_RUNNER_IMAGE to quay.io/redhat-services-prod/.../ambient-code-ambient-runner-main:latest. |
| Kustomize Composition | components/manifests/overlays/app-interface/kustomization.yaml | Overlay composes base + routes + OAuth component; applies patches for replica scaling (0 for DB/storage), secret injection, TLS, and image overrides to quay.io/redhat-services-prod/.../latest. |

Operator & Service Templates

| Layer | File(s) | Summary |
| --- | --- | --- |
| Custom Resource Definitions | components/manifests/templates/template-operator.yaml | Introduces AgenticSession (namespaced, v1alpha1) with workflow, prompt, repo, timeout, and status fields; and ProjectSettings (singleton validated) with group RBAC, inactivity timeout, and runner secrets configuration. |
| RBAC for Operator | components/manifests/templates/template-operator.yaml | Primary ClusterRole (agentic-operator) grants CRUD on CRDs, pod/service/job/namespace management, RBAC admin, MLflow CRD, and NetworkPolicy permissions; aggregate roles for admin access; separate roles for frontend (tokenreviews), namespace viewing, and backend API. |
| Service Template | components/manifests/templates/template-services.yaml | OpenShift Template bundles Namespace, Services, PVCs, LimitRange, Deployments (API server with init migration, backend, frontend with OAuth proxy, public-api), PodDisruptionBudget, and Routes with parameterized image tags and upstream timeout. |
| Operator Deployment | components/manifests/templates/template-operator.yaml | Operator Deployment configures env vars from ConfigMaps/Secrets, probes, and mounts configmaps (ambient-agent-registry, ambient-models) for runner framework and model catalog. |
| Configuration | components/manifests/templates/template-operator.yaml | Five ConfigMaps: ambient-agent-registry (runner metadata), ambient-api-server-auth (ACL/JWKS), ambient-flags (feature flags), ambient-models (catalog), operator-config (Vertex settings). |

Runtime Configuration & Observability

| Layer | File(s) | Summary |
| --- | --- | --- |
| Metrics Initialization | components/operator/internal/controller/otel_metrics.go | When OTEL_EXPORTER_OTLP_ENDPOINT is unset, metrics now initialize a no-op meter and all instruments via initInstruments() instead of short-circuiting; logged as initialized with no-op provider. |
| Runner Health Probes | components/operator/internal/handlers/sessions.go | Runner container now includes ReadinessProbe and LivenessProbe (both HTTPGet /health on runner port with appropriate delays); adds INITIAL_PROMPT_DELAY_SECONDS=10 env var. |
| Logging Improvement | components/runners/ambient-runner/ambient_runner/app.py | HTTP retry failure log now includes exception type name (type(e).__name__) alongside the message for better debugging. |
| Operator OTEL Config | components/manifests/base/core/operator-deployment.yaml | OTEL_EXPORTER_OTLP_ENDPOINT env var removed; commented guidance added to enable when OpenTelemetry collector is available. |

Contributor Author

maknop commented May 8, 2026

Wrong repository - creating PR on RedHatInsights/ambient-code-platform instead

@maknop maknop closed this May 8, 2026
Contributor

coderabbitai Bot commented May 8, 2026

CodeRabbit chat interactions are restricted to organization members for this repository. Ask an organization member to interact with CodeRabbit, or set chat.allow_non_org_members: true in your configuration.


2 participants