Recover stuck service operations after transient DB failures #5011
Merged
johha merged 15 commits on Jun 3, 2026
…e operations Introduces a new periodic recovery job that scans permanently failed delayed_jobs and re-enqueues polling for service operations still in progress at the broker. Recovers cases where a transient DB connection error caused the polling job to fail permanently (max_attempts=1) while the broker operation was still running, leaving the service instance stuck in 'in progress' with no active poller.
The previous implementation queried dead delayed_jobs, then performed separate lookups per row to find the pollable job, entity, and last operation state. Replace this with a single 4-table join across service_instance_operations, service_instances, jobs, and delayed_jobs that filters on all conditions in one query.
a3a10fe to f37a14c
philippthun
reviewed
May 18, 2026
Member
When focusing on failed jobs where the pollable job is still …
johha
requested changes
May 19, 2026
…failed polling jobs When a CC polling job permanently fails due to a transient error (e.g. a brief DB connection flip), the client is left with no path to a final state: the pollable job shows FAILED while the service instance operation remains stuck 'in progress' indefinitely. Previously this was addressed by reenqueuing the delayed job, but that approach was fragile and incomplete. This cleanup job detects stuck create operations whose polling job has permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by marking the operation and pollable job as failed and triggering OrphanMitigator to deprovision any broker-side resource, giving clients a definitive final state. Extends coverage to service bindings and service keys. Renames the class and file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to reflect the correct scope.
johha
requested changes
Jun 1, 2026
jochenehret
reviewed
Jun 1, 2026
Contributor
jochenehret
left a comment
Looks good, just consider a lower job frequency.
johha
approved these changes
Jun 3, 2026
jochenehret
approved these changes
Jun 3, 2026
ari-wg-gitbot added a commit to cloudfoundry/capi-release that referenced this pull request
Jun 3, 2026
Changes in cloud_controller_ng:
- Recover stuck service operations after transient DB failures
PR: cloudfoundry/cloud_controller_ng#5011
Author: serdar özer <serdar.oezer@sap.com>
Problem
When CCDB becomes temporarily unreachable during a service broker polling cycle,
the polling job fails permanently even though the broker is still processing the
operation. This leaves the service resource in an inconsistent state:
- last_operation.state = 'in progress' (broker still working or already finished)
- pollable_job.state = FAILED (CC gave up)

Solution
Adds a new periodic scheduled job, ServiceOperationsCreateInProgressCleanup, that
detects stuck create operations whose polling job has permanently failed and
resolves them by marking the operation as failed and triggering orphan mitigation
to deprovision any broker-side resource, giving clients a definitive final state.
Detection chain:
service_instance/binding/key_operations.state = 'in progress'
AND type = 'create'
AND created_at within the max async poll window
→ JOIN service_instances/bindings/keys (via foreign key)
→ JOIN pollable_jobs (via resource_guid) WHERE state IN (POLLING, FAILED)
AND operation = 'service_instance/bindings/keys.create'
→ JOIN delayed_jobs (via delayed_job_guid) WHERE failed_at IS NOT NULL

Recovery:
Marks the stuck operation as failed, sets the pollable job to FAILED, and
calls OrphanMitigator to enqueue a broker-side DELETE for the potentially
orphaned resource. A row-level FOR UPDATE SKIP LOCKED guard prevents double
mitigation when multiple CC instances run concurrently.
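
For reviewers who want to see the detection chain and recovery step end to end, here is a minimal sketch in Python against an in-memory SQLite database. It is not the Ruby code from this PR. The schema is a hypothetical reduction to the columns named above, only the service instance case is covered, and the names build_demo_db, find_stuck, and cleanup are invented for illustration. SQLite has no FOR UPDATE SKIP LOCKED, so a conditional UPDATE stands in for the row-level guard, and a returned list stands in for the OrphanMitigator call.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns the detection chain mentions.
SCHEMA = """
CREATE TABLE service_instances (id INTEGER PRIMARY KEY, guid TEXT);
CREATE TABLE service_instance_operations (
    id INTEGER PRIMARY KEY, service_instance_id INTEGER,
    type TEXT, state TEXT, created_at TEXT);
CREATE TABLE jobs (guid TEXT, resource_guid TEXT, operation TEXT,
                   state TEXT, delayed_job_guid TEXT);
CREATE TABLE delayed_jobs (guid TEXT, failed_at TEXT);
"""

# One query that applies every condition of the detection chain.
DETECT_SQL = """
SELECT si.guid, op.id, j.guid
FROM service_instance_operations op
JOIN service_instances si ON si.id = op.service_instance_id
JOIN jobs j ON j.resource_guid = si.guid
JOIN delayed_jobs dj ON dj.guid = j.delayed_job_guid
WHERE op.state = 'in progress'
  AND op.type = 'create'
  AND op.created_at >= :cutoff
  AND j.state IN ('POLLING', 'FAILED')
  AND j.operation = 'service_instance.create'
  AND dj.failed_at IS NOT NULL
ORDER BY si.guid
"""


def build_demo_db():
    """Create three instances; only 'si-stuck' has a dead poller for a create."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO service_instances VALUES (?, ?)",
                     [(1, "si-stuck"), (2, "si-healthy"), (3, "si-update")])
    conn.executemany(
        "INSERT INTO service_instance_operations VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "create", "in progress", "2026-06-01T10:00:00"),
         (2, 2, "create", "in progress", "2026-06-01T10:00:00"),
         (3, 3, "update", "in progress", "2026-06-01T10:00:00")])
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [("job-1", "si-stuck", "service_instance.create", "POLLING", "dj-1"),
         ("job-2", "si-healthy", "service_instance.create", "POLLING", "dj-2"),
         ("job-3", "si-update", "service_instance.update", "FAILED", "dj-3")])
    conn.executemany(
        "INSERT INTO delayed_jobs VALUES (?, ?)",
        [("dj-1", "2026-06-01T10:05:00"),   # permanently failed
         ("dj-2", None),                    # still alive
         ("dj-3", "2026-06-01T10:05:00")])  # failed, but an update: out of scope
    conn.commit()
    return conn


def find_stuck(conn, cutoff):
    """Return (instance_guid, operation_id, job_guid) rows for stuck creates."""
    return conn.execute(DETECT_SQL, {"cutoff": cutoff}).fetchall()


def cleanup(conn, cutoff):
    """Fail stuck operations and return the instance GUIDs needing mitigation."""
    to_mitigate = []
    for instance_guid, operation_id, job_guid in find_stuck(conn, cutoff):
        # Stand-in for the row lock: only the worker whose UPDATE flips the
        # state from 'in progress' to 'failed' goes on to mitigate.
        claimed = conn.execute(
            "UPDATE service_instance_operations SET state = 'failed' "
            "WHERE id = ? AND state = 'in progress'", (operation_id,)).rowcount
        if claimed != 1:
            continue
        conn.execute("UPDATE jobs SET state = 'FAILED' WHERE guid = ?",
                     (job_guid,))
        to_mitigate.append(instance_guid)  # real code would enqueue a broker DELETE
    conn.commit()
    return to_mitigate


if __name__ == "__main__":
    demo = build_demo_db()
    print(cleanup(demo, "2026-06-01T00:00:00"))  # ['si-stuck']
    print(cleanup(demo, "2026-06-01T00:00:00"))  # []
```

The second call returns an empty list because the operation is no longer 'in progress', which is the property the locking guard is meant to preserve across concurrent CC instances.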
Scope:
Covers service_instance.create, service_bindings.create, and service_keys.create.
Delete and update operations are intentionally excluded: retrying a delete on the
same resource GUID can cross-match with the old failed pollable job, making safe
recovery impossible without additional guards.
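
The cross-match risk for deletes can be illustrated with a small Python and SQLite sketch. This is one plausible reading of the problem, not code from this PR: the two-table schema, the sample rows, and the helper names dead_poller_matches and live_pollers are assumptions made for illustration. A first delete attempt left a permanently failed poller behind, and a retried delete on the same resource GUID is being polled normally. Joining by resource_guid and operation alone still finds the old dead job.

```python
import sqlite3


def build_db():
    """Two delete jobs for the same resource: an old dead one and a live retry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE jobs (guid TEXT, resource_guid TEXT, operation TEXT,
                           state TEXT, delayed_job_guid TEXT);
        CREATE TABLE delayed_jobs (guid TEXT, failed_at TEXT);
    """)
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [("job-old", "si-1", "service_instance.delete", "FAILED", "dj-old"),
         ("job-new", "si-1", "service_instance.delete", "POLLING", "dj-new")])
    conn.executemany(
        "INSERT INTO delayed_jobs VALUES (?, ?)",
        [("dj-old", "2026-06-01T10:05:00"),  # first attempt failed permanently
         ("dj-new", None)])                  # retry is still being polled
    conn.commit()
    return conn


def dead_poller_matches(conn, resource_guid, operation):
    """Jobs the detection join would match: same resource, failed delayed job."""
    rows = conn.execute("""
        SELECT j.guid FROM jobs j
        JOIN delayed_jobs dj ON dj.guid = j.delayed_job_guid
        WHERE j.resource_guid = ? AND j.operation = ?
          AND j.state IN ('POLLING', 'FAILED')
          AND dj.failed_at IS NOT NULL
        ORDER BY j.guid""", (resource_guid, operation)).fetchall()
    return [row[0] for row in rows]


def live_pollers(conn, resource_guid, operation):
    """Jobs for the same resource whose delayed job has not failed."""
    rows = conn.execute("""
        SELECT j.guid FROM jobs j
        JOIN delayed_jobs dj ON dj.guid = j.delayed_job_guid
        WHERE j.resource_guid = ? AND j.operation = ?
          AND dj.failed_at IS NULL
        ORDER BY j.guid""", (resource_guid, operation)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_db()
    print(dead_poller_matches(db, "si-1", "service_instance.delete"))  # ['job-old']
    print(live_pollers(db, "si-1", "service_instance.delete"))         # ['job-new']
```

In this scenario the in-progress delete would be flagged as stuck because of job-old, even though job-new is actively polling, so failing the operation would be wrong.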
Related links
New Spec Parameter for Clock job capi-release#653
I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests