Recover stuck service operations after transient DB failures #5011

Merged
johha merged 15 commits into cloudfoundry:main from sap-contributions:fix/recover-failed-delayed-jobs
Jun 3, 2026

@serdarozerr serdarozerr commented Apr 9, 2026

Problem

When CCDB becomes temporarily unreachable during a service broker polling cycle,
the polling job fails permanently even though the broker is still processing the
operation. This leaves the service resource in an inconsistent state:

  • last_operation.state = 'in progress' (broker still working or already finished)
  • pollable_job.state = FAILED (CC gave up)

Solution

Adds a new periodic scheduled job ServiceOperationsCreateInProgressCleanup that
detects stuck create operations whose polling job has permanently failed and
resolves them by marking the operation as failed and triggering orphan mitigation
to deprovision any broker-side resource, giving clients a definitive final state.

Detection chain:
service_instance/binding/key_operations.state = 'in progress'
AND type = 'create'
AND created_at within the max async poll window
→ JOIN service_instances/bindings/keys (via foreign key)
→ JOIN pollable_jobs (via resource_guid) WHERE state IN (POLLING, FAILED)
AND operation = 'service_instance/bindings/keys.create'
→ JOIN delayed_jobs (via delayed_job_guid) WHERE failed_at IS NOT NULL
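
The detection chain above can be sketched in plain Ruby over in-memory rows. This is only an illustration, not the query from the PR: the struct names, the method name `stuck_create_operations`, and the `max_poll_seconds` parameter are hypothetical, and "within the max async poll window" is assumed to mean "created no earlier than now minus the window". Only the service instance variant is shown.

```ruby
# In-memory stand-ins for the four tables. Column names follow the PR text;
# the struct and method names are hypothetical.
Operation   = Struct.new(:service_instance_id, :type, :state, :created_at, keyword_init: true)
Instance    = Struct.new(:id, :guid, keyword_init: true)
PollableJob = Struct.new(:resource_guid, :operation, :state, :delayed_job_guid, keyword_init: true)
DelayedJob  = Struct.new(:guid, :failed_at, keyword_init: true)

def stuck_create_operations(operations:, instances:, pollable_jobs:, delayed_jobs:,
                            max_poll_seconds:, now: Time.now)
  # Assumption: the poll window is measured backwards from the current time.
  cutoff = now - max_poll_seconds
  instances_by_id = instances.to_h { |i| [i.id, i] }
  # Delayed jobs that failed permanently (failed_at IS NOT NULL).
  dead_job_guids = delayed_jobs.reject { |d| d.failed_at.nil? }.map(&:guid)

  operations.select do |op|
    next false unless op.state == 'in progress' && op.type == 'create'
    next false unless op.created_at >= cutoff

    # JOIN service_instances via foreign key.
    instance = instances_by_id[op.service_instance_id]
    next false if instance.nil?

    # JOIN pollable jobs via resource_guid, then delayed jobs via delayed_job_guid.
    pollable_jobs.any? do |job|
      job.resource_guid == instance.guid &&
        %w[POLLING FAILED].include?(job.state) &&
        job.operation == 'service_instance.create' &&
        dead_job_guids.include?(job.delayed_job_guid)
    end
  end
end
```

An operation is reported only when every link in the chain holds, so an 'in progress' create whose delayed job has not failed is left alone.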

Recovery:
Marks the stuck operation as failed, sets the pollable job to FAILED, and
calls OrphanMitigator to enqueue a broker-side DELETE for the potentially
orphaned resource. A row-level FOR UPDATE SKIP LOCKED guard prevents double
mitigation when multiple CC instances run concurrently.
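
A minimal sketch of the recovery step follows, assuming an in-memory model in which each row carries a `Mutex` that stands in for the database row lock. `StuckRow`, `RecordingMitigator`, `enqueue_delete`, and `recover_stuck_rows` are hypothetical names for illustration; the real job uses OrphanMitigator and a `FOR UPDATE SKIP LOCKED` query, neither of which is reproduced here.

```ruby
# Hypothetical in-memory row; `lock` stands in for a database row lock.
StuckRow = Struct.new(:operation_state, :job_state, :resource_guid, :lock, keyword_init: true)

# Stand-in for OrphanMitigator: records which resources would get a
# broker-side DELETE instead of talking to a broker.
class RecordingMitigator
  attr_reader :enqueued

  def initialize
    @enqueued = []
  end

  def enqueue_delete(resource_guid)
    @enqueued << resource_guid
  end
end

def recover_stuck_rows(rows, mitigator)
  rows.each do |row|
    # Analogous to SKIP LOCKED: a row held by another worker is skipped,
    # not waited on.
    next unless row.lock.try_lock

    begin
      # Re-check under the lock so an already recovered row is not mitigated twice.
      next unless row.operation_state == 'in progress'

      row.operation_state = 'failed'
      row.job_state = 'FAILED'
      mitigator.enqueue_delete(row.resource_guid)
    ensure
      row.lock.unlock
    end
  end
end
```

Skipping locked rows rather than blocking means a second worker simply moves on to rows nobody is handling, and the state re-check under the lock keeps the mitigation to one call per resource in this model.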

Scope:
Covers service_instance.create, service_bindings.create, and
service_keys.create. Delete and update operations are intentionally excluded —
retrying a delete on the same resource GUID can cross-match with the old failed
pollable job, making safe recovery impossible without additional guards.
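
The cross-match concern can be illustrated with a small sketch. It assumes, as one reading of the paragraph above, that a retried delete reuses the resource GUID of the earlier attempt, so a lookup by resource GUID and operation alone cannot tell the old failed job from the new healthy one. `Job`, `delayed_job_failed`, and `matches_dead_job?` are hypothetical names, not part of the PR.

```ruby
# Hypothetical flattened view of a pollable job joined with its delayed job.
Job = Struct.new(:resource_guid, :operation, :delayed_job_failed, keyword_init: true)

# Matches by resource GUID and operation only, as in the detection chain.
def matches_dead_job?(jobs, resource_guid, operation)
  jobs.any? do |job|
    job.resource_guid == resource_guid &&
      job.operation == operation &&
      job.delayed_job_failed
  end
end

jobs = [
  # First delete attempt: its poller died permanently.
  Job.new(resource_guid: 'instance-1', operation: 'service_instance.delete', delayed_job_failed: true),
  # Retried delete on the same GUID: its poller is healthy.
  Job.new(resource_guid: 'instance-1', operation: 'service_instance.delete', delayed_job_failed: false)
]

# The retried delete is still reported as stuck because the old job matches too.
retry_flagged = matches_dead_job?(jobs, 'instance-1', 'service_instance.delete')
```

Under this assumption the in-progress retry would be flagged even though a live poller exists, which is the kind of false positive that would need an additional guard.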

…e operations

Introduces a new periodic recovery job that scans permanently failed delayed_jobs
and re-enqueues polling for service operations still in progress at the broker.

Recovers cases where a transient DB connection error caused the polling job to
fail permanently (max_attempts=1) while the broker operation was still running,
leaving the service instance stuck in 'in progress' with no active poller.

The previous implementation queried dead delayed_jobs then performed
separate lookups per row to find the pollable job, entity, and last
operation state. Replace with a single 4-table join across
service_instance_operations, service_instances, jobs, and delayed_jobs,
filtering all conditions in one query.
@serdarozerr serdarozerr force-pushed the fix/recover-failed-delayed-jobs branch from a3a10fe to f37a14c Compare May 5, 2026 15:21
Comment thread app/jobs/runtime/delayed_jobs_recover.rb Outdated
@philippthun
Member

When focusing on failed jobs where the pollable job is still POLLING, this PR could be extended to all async operations:

service_instance.create, service_instance.update, service_instance.delete
service_binding.create, service_binding.delete
service_key.create, service_key.delete
service_route_binding.create, service_route_binding.delete

Comment thread spec/migrations/20260505071445_add_jobs_operation_state_index_spec.rb Outdated
Comment thread service_operations_create_in_progress_cleanup Outdated
Comment thread app/jobs/runtime/delayed_jobs_recover.rb Outdated
Comment thread service_operations_create_in_progress_cleanup Outdated
…failed polling jobs

When a CC polling job permanently fails due to a transient error (e.g. a brief
DB connection flip), the client is left with no path to a final state: the
pollable job shows FAILED while the service instance operation remains stuck
'in progress' indefinitely. Previously this was addressed by re-enqueuing the
delayed job, but that approach was fragile and incomplete.

This cleanup job detects stuck create operations whose polling job has
permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by
marking the operation and pollable job as failed and triggering OrphanMitigator
to deprovision any broker-side resource, giving clients a definitive final state.

Extends coverage to service bindings and service keys. Renames the class and
file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to
reflect the correct scope.
@serdarozerr serdarozerr marked this pull request as ready for review May 26, 2026 09:08
Comment thread app/jobs/runtime/service_operations_create_in_progress_cleanup.rb
Comment thread spec/unit/jobs/runtime/service_operations_create_in_progress_cleanup.rb Outdated
Contributor

@jochenehret jochenehret left a comment

Looks good, just consider a lower job frequency.

Comment thread config/cloud_controller.yml Outdated
@johha johha merged commit b19c596 into cloudfoundry:main Jun 3, 2026
16 checks passed
@johha johha deleted the fix/recover-failed-delayed-jobs branch June 3, 2026 08:45
ari-wg-gitbot added a commit to cloudfoundry/capi-release that referenced this pull request Jun 3, 2026
Changes in cloud_controller_ng:

- Recover stuck service operations after transient DB failures
    PR: cloudfoundry/cloud_controller_ng#5011
    Author: serdar özer <serdar.oezer@sap.com>