Recover stuck service operations after transient DB failures by serdarozerr · Pull Request #5011 · cloudfoundry/cloud_controller_ng

serdarozerr · 2026-04-09T15:34:16Z

Problem

When CCDB becomes temporarily unreachable during a service broker polling cycle,
the polling job fails permanently even though the broker is still processing the
operation. This leaves the service resource in an inconsistent state:

last_operation.state = 'in progress' (broker still working or already finished)
pollable_job.state = FAILED (CC gave up)

Solution

Adds a new periodic scheduled job ServiceOperationsCreateInProgressCleanup that
detects stuck create operations whose polling job has permanently failed and
resolves them by marking the operation as failed and triggering orphan mitigation
to deprovision any broker-side resource, giving clients a definitive final state.

Detection chain:
service_instance/binding/key_operations.state = 'in progress'
AND type = 'create'
AND created_at within the max async poll window
→ JOIN service_instances/bindings/keys (via foreign key)
→ JOIN pollable_jobs (via resource_guid) WHERE state IN (POLLING, FAILED)
AND operation = 'service_instance/bindings/keys.create'
→ JOIN delayed_jobs (via delayed_job_guid) WHERE failed_at IS NOT NULL
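
The detection chain above can be sketched in plain Ruby over in-memory records. This is only an illustration of the filter logic, not the query used in the PR: the struct names, the `stuck_create_operations` helper, and its parameters are hypothetical, the join to the resource table is collapsed into a `resource_guid` field on the operation, and "within the max async poll window" is assumed to mean `created_at >= now - window`.

```ruby
# Hypothetical in-memory stand-ins for the tables in the detection chain.
Operation   = Struct.new(:resource_guid, :state, :type, :created_at, keyword_init: true)
PollableJob = Struct.new(:resource_guid, :state, :operation, :delayed_job_guid, keyword_init: true)
DelayedJob  = Struct.new(:guid, :failed_at, keyword_init: true)

# Returns the operations that are 'in progress' creates, still inside the poll
# window, and linked to a pollable job whose delayed job has failed_at set.
def stuck_create_operations(operations:, pollable_jobs:, delayed_jobs:, job_operation:, now:, window:)
  failed_guids = delayed_jobs.reject { |dj| dj.failed_at.nil? }.map(&:guid)

  operations.select do |op|
    next false unless op.state == 'in progress' && op.type == 'create'
    next false unless op.created_at >= now - window # assumption: window in seconds

    pollable_jobs.any? do |pj|
      pj.resource_guid == op.resource_guid &&
        %w[POLLING FAILED].include?(pj.state) &&
        pj.operation == job_operation &&
        failed_guids.include?(pj.delayed_job_guid)
    end
  end
end

now = Time.at(1_000_000)
operations = [
  Operation.new(resource_guid: 'si-1', state: 'in progress', type: 'create', created_at: now - 600),
  Operation.new(resource_guid: 'si-2', state: 'in progress', type: 'create', created_at: now - 600)
]
pollable_jobs = [
  PollableJob.new(resource_guid: 'si-1', state: 'FAILED', operation: 'service_instance.create', delayed_job_guid: 'dj-1'),
  PollableJob.new(resource_guid: 'si-2', state: 'POLLING', operation: 'service_instance.create', delayed_job_guid: 'dj-2')
]
delayed_jobs = [
  DelayedJob.new(guid: 'dj-1', failed_at: now - 300), # permanently failed
  DelayedJob.new(guid: 'dj-2', failed_at: nil)        # still healthy
]

stuck = stuck_create_operations(
  operations: operations, pollable_jobs: pollable_jobs, delayed_jobs: delayed_jobs,
  job_operation: 'service_instance.create', now: now, window: 3600
)
stuck.map(&:resource_guid) # => ["si-1"]
```

Only `si-1` is reported, because its delayed job has `failed_at` set; `si-2` still has a live poller and is left alone.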

Recovery:
Marks the stuck operation as failed, sets the pollable job to FAILED, and
calls OrphanMitigator to enqueue a broker-side DELETE for the potentially
orphaned resource. A row-level FOR UPDATE SKIP LOCKED guard prevents double
mitigation when multiple CC instances run concurrently.
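
The recovery step and its concurrency guard can be sketched as follows. This is an in-process analogy under stated assumptions, not the PR's implementation: `StuckOperation`, `RecordingMitigator`, and `cleanup` are hypothetical names, and a non-blocking `Mutex#try_lock` stands in for the database's `FOR UPDATE SKIP LOCKED`, which likewise skips rows another worker has already locked instead of waiting for them.

```ruby
# Hypothetical in-memory row; the real job operates on database rows.
StuckOperation = Struct.new(:guid, :state, :job_state, :lock, keyword_init: true)

# Hypothetical stand-in for OrphanMitigator: records which resources would
# have a broker-side DELETE enqueued.
class RecordingMitigator
  attr_reader :mitigated

  def initialize
    @mitigated = []
    @mutex = Mutex.new
  end

  def mitigate(guid)
    @mutex.synchronize { @mitigated << guid }
  end
end

# Handles a row only if its lock can be taken without waiting, which is the
# in-process analogue of SELECT ... FOR UPDATE SKIP LOCKED.
def cleanup(rows, mitigator)
  rows.each do |row|
    next unless row.lock.try_lock # held by another worker: skip, do not wait

    begin
      # Re-check under the lock: another worker may have finished this row.
      next unless row.state == 'in progress'

      row.state = 'failed'
      row.job_state = 'FAILED'
      mitigator.mitigate(row.guid)
    ensure
      row.lock.unlock
    end
  end
end

rows = [StuckOperation.new(guid: 'si-1', state: 'in progress', job_state: 'POLLING', lock: Mutex.new)]
mitigator = RecordingMitigator.new

# Two passes, as if two CC instances ran the job one after the other.
cleanup(rows, mitigator)
cleanup(rows, mitigator)
mitigator.mitigated # => ["si-1"]
```

The second pass finds the operation already marked as failed and does nothing, so the resource is mitigated once. In a database the row lock would be held until the surrounding transaction ends rather than released explicitly.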

Scope:
Covers service_instance.create, service_bindings.create, and
service_keys.create. Delete and update operations are intentionally excluded —
retrying a delete on the same resource GUID can cross-match with the old failed
pollable job, making safe recovery impossible without additional guards.
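
The cross-match described above can be illustrated with a small sketch. It is one plausible reading of the problem rather than a reproduction of the actual query: the `Job` struct and `looks_stuck?` helper are hypothetical, and the match is reduced to resource GUID, operation name, and whether the delayed job has failed.

```ruby
# Hypothetical rows: pollable jobs are matched by resource GUID and operation.
Job = Struct.new(:guid, :resource_guid, :operation, :state, :delayed_job_failed, keyword_init: true)

# Mirrors the shape of the detection join: any job for this resource and
# operation whose delayed job has failed flags the current operation as stuck.
def looks_stuck?(jobs, resource_guid:, operation:)
  jobs.any? do |job|
    job.resource_guid == resource_guid &&
      job.operation == operation &&
      job.delayed_job_failed
  end
end

jobs = [
  # First delete attempt: its polling job failed permanently.
  Job.new(guid: 'job-1', resource_guid: 'si-1', operation: 'service_instance.delete',
          state: 'FAILED', delayed_job_failed: true),
  # The delete is retried on the same instance: a healthy poller is running.
  Job.new(guid: 'job-2', resource_guid: 'si-1', operation: 'service_instance.delete',
          state: 'POLLING', delayed_job_failed: false)
]

looks_stuck?(jobs, resource_guid: 'si-1', operation: 'service_instance.delete') # => true
```

The retried delete is reported as stuck even though `job-2` is polling normally, because the old `job-1` row still matches on the same GUID and operation. Distinguishing the two attempts would need an additional guard, which is why these operations are left out of scope here.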

Related links
New Spec Parameter for Clock job capi-release#653
I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests

…e operations

Introduces a new periodic recovery job that scans permanently failed delayed_jobs and re-enqueues polling for service operations still in progress at the broker.

Recovers cases where a transient DB connection error caused the polling job to fail permanently (max_attempts=1) while the broker operation was still running, leaving the service instance stuck in 'in progress' with no active poller.

The previous implementation queried dead delayed_jobs and then performed separate lookups per row to find the pollable job, entity, and last operation state. Replace this with a single 4-table join across service_instance_operations, service_instances, jobs, and delayed_jobs, filtering all conditions in one query.

philippthun · 2026-05-18T13:50:54Z

When focusing on failed jobs where the pollable job is still POLLING, this PR could be extended to all async operations:

service_instance.create, service_instance.update, service_instance.delete
service_binding.create, service_binding.delete
service_key.create, service_key.delete
service_route_binding.create, service_route_binding.delete

…failed polling jobs

When a CC polling job permanently fails due to a transient error (e.g. a brief DB connection flip), the client is left with no path to a final state: the pollable job shows FAILED while the service instance operation remains stuck 'in progress' indefinitely. Previously this was addressed by re-enqueuing the delayed job, but that approach was fragile and incomplete.

This cleanup job detects stuck create operations whose polling job has permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by marking the operation and pollable job as failed and triggering OrphanMitigator to deprovision any broker-side resource, giving clients a definitive final state.

Extends coverage to service bindings and service keys. Renames the class and file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to reflect the correct scope.

jochenehret

Looks good, just consider a lower job frequency.

Changes in cloud_controller_ng:
- Recover stuck service operations after transient DB failures
  PR: cloudfoundry/cloud_controller_ng#5011
  Author: serdar özer <serdar.oezer@sap.com>

serdarozerr added 5 commits May 5, 2026 17:20

fix: comment is fixed

b8da6a7

feat: new sheduling jobs test is added

0ec67e4

fix: warn added to mock logger

f37a14c

serdarozerr force-pushed the fix/recover-failed-delayed-jobs branch from a3a10fe to f37a14c Compare May 5, 2026 15:21

fix: removed the state condition, since it doesn't add any valued to …

655a7db

…query

philippthun reviewed May 18, 2026

johha requested changes May 19, 2026

serdarozerr added 8 commits May 21, 2026 10:44

fix: instead of reenqueuing the job we started orphan migration opera…

f5a1881

…tion

fix: removed test file in main folder

6a18d44

fix: config param added

d497a38

fix: delayed job recovery remainings are removed

d16821b

fix: func args namings were fixed

304ca98

fix: index check logic simplified

901150f

fix: fix wording

87742e0

serdarozerr marked this pull request as ready for review May 26, 2026 09:08

johha requested changes Jun 1, 2026

jochenehret reviewed Jun 1, 2026

Comment thread config/cloud_controller.yml Outdated

fix: test logic simplified

b3375c2

serdarozerr mentioned this pull request Jun 2, 2026

New Spec Parameter for Clock job cloudfoundry/capi-release#653

Merged

3 tasks

johha approved these changes Jun 3, 2026

jochenehret approved these changes Jun 3, 2026

johha merged commit b19c596 into cloudfoundry:main Jun 3, 2026
16 checks passed

johha deleted the fix/recover-failed-delayed-jobs branch June 3, 2026 08:45

johha merged 15 commits into cloudfoundry:main from sap-contributions:fix/recover-failed-delayed-jobs


4 participants
