Skip to content

fix: drop updated_at guard from reconcile work-item ack#1149

Merged
adityachoudhari26 merged 3 commits into
mainfrom
fix-reconcile-ack-stale-leak
May 26, 2026
Merged

fix: drop updated_at guard from reconcile work-item ack#1149
adityachoudhari26 merged 3 commits into
mainfrom
fix-reconcile-ack-stale-leak

Conversation

@adityachoudhari26
Copy link
Copy Markdown
Member

@adityachoudhari26 adityachoudhari26 commented May 26, 2026

DeleteClaimedReconcileWorkItem required s.updated_at <= the
row's updated_at captured at claim time. That guard was meant
to detect concurrent modifications, but in practice the worker's
own lease heartbeat (every 5s) advanced updated_at, invalidating
the worker's own delete whenever Process() ran longer than one
heartbeat tick.

Effect: AckSuccess returned Owned=true, Deleted=false silently;
the row persisted and was re-claimed on every subsequent worker
poll, with attempt_count never incrementing. Items in prod
cycled this way for 5+ days.

Fix: drop the staleness guard. claimed_by = me is the actual
ownership invariant, and no other writer touches a claimed row
in this codebase. Mirror the change in the in-memory queue.

Also surface Owned=true/Deleted=false as an explicit error log
plus OnDropped hook in worker.processClaimedItem, so any future
regression of this class is visible immediately rather than
silently leaking items.

Summary by CodeRabbit

  • Bug Fixes

    • Acknowledging completed work now deletes claimed items based on current ownership only, so ack succeeds despite lease/heartbeat-updated timestamps, reducing spurious “dropped” cases.
  • Refactor

    • Simplified the ack payload and deletion validation to rely solely on ownership identity.
  • Tests

    • Added regression and safety tests verifying ack deletes after heartbeats and workers handle non-deletion (invoke drop handling).

Review Change Stack

Copilot AI review requested due to automatic review settings May 26, 2026 19:54
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

📝 Walkthrough

Walkthrough

Removed the optimistic-concurrency ClaimedUpdatedAt check: ack requests now send only ItemID and WorkerID; SQL/generated bindings and memory/postgres implementations delete by (id, claimed_by) only; worker and tests validate deletion via ackResult.Deleted.

Changes

Work Item Deletion Simplification

Layer / File(s) Summary
AckSuccessParams contract
apps/workspace-engine/pkg/reconcile/workqueue.go
AckSuccessParams struct is reduced to ItemID and WorkerID, removing ClaimedUpdatedAt.
SQL query and generated bindings
apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql, apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go
DeleteClaimedReconcileWorkItem SQL no longer uses the updated_at argument; generated DeleteClaimedReconcileWorkItemParams drops UpdatedAt and callers stop passing it.
Memory queue implementation
apps/workspace-engine/pkg/reconcile/memory/memory.go, apps/workspace-engine/pkg/reconcile/memory/memory_test.go
In-memory AckSuccess removes the s.UpdatedAt.After(params.ClaimedUpdatedAt) check and tests now call AckSuccess with only ItemID and WorkerID.
Postgres implementation and integration tests
apps/workspace-engine/pkg/reconcile/postgres/pg.go, apps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.go
Postgres AckSuccess omits UpdatedAt when calling the delete; integration tests updated to omit ClaimedUpdatedAt; added TestQueue_AckSucceedsAfterHeartbeat.
Worker deletion validation
apps/workspace-engine/pkg/reconcile/worker.go
processClaimedItem captures ackResult/ackErr; on success, if ackResult.Deleted == false it logs, calls Hooks.OnDropped with an error, and returns without calling OnProcessed.
Test harness & worker tests
apps/workspace-engine/test/controllers/harness/helpers.go, apps/workspace-engine/pkg/reconcile/worker_test.go
DrainQueue uses the simplified AckSuccessParams; new unit test asserts worker behavior when AckSuccess returns Deleted=false.

Sequence Diagram (high-level ack/delete flow):

sequenceDiagram
  participant Worker
  participant Queue
  participant PostgresDB
  Worker->>Queue: AckSuccess(ItemID, WorkerID)
  Queue->>PostgresDB: DELETE ... WHERE id = ItemID AND claimed_by = WorkerID
  PostgresDB-->>Queue: ackResult{Deleted: bool}
  Queue-->>Worker: ackResult
  alt Deleted == false
    Worker->>Worker: log error
    Worker->>Hooks: OnDropped(error)
  else Deleted == true
    Worker->>Worker: continue / call OnProcessed
  end
Loading

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers:

  • jsbroks

"🐰 I hopped through timestamps, trimmed their thread,
Kept only who claimed it, and who said 'it's dead',
The queue now trusts ownership, not clock,
The worker checks the delete — no more tick-tock,
A simpler hop forward, carrot in my head."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: removing the updated_at staleness guard from the AckSuccess operation for reconcile work items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-reconcile-ack-stale-leak

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/workspace-engine/test/controllers/harness/helpers.go`:
- Around line 75-78: In DrainQueue, don't ignore the result of queue.AckSuccess:
capture the returned error/result from queue.AckSuccess(ctx,
reconcile.AckSuccessParams{ItemID: item.ID, WorkerID: testWorkerID}) and handle
failures explicitly (e.g., return the error or record/log and stop processing
the item so it can be retried) instead of discarding it; update the DrainQueue
logic to treat non-deleting/failed acks as unsuccessful by checking the error
and acting accordingly so failing acks won't be treated as processed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ebe4684e-6477-4968-91bf-b4a5f8a40402

📥 Commits

Reviewing files that changed from the base of the PR and between 4c39882 and 54fa9d9.

📒 Files selected for processing (9)
  • apps/workspace-engine/pkg/reconcile/memory/memory.go
  • apps/workspace-engine/pkg/reconcile/memory/memory_test.go
  • apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go
  • apps/workspace-engine/pkg/reconcile/postgres/pg.go
  • apps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.go
  • apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql
  • apps/workspace-engine/pkg/reconcile/worker.go
  • apps/workspace-engine/pkg/reconcile/workqueue.go
  • apps/workspace-engine/test/controllers/harness/helpers.go
💤 Files with no reviewable changes (1)
  • apps/workspace-engine/pkg/reconcile/postgres/pg.go

Comment thread apps/workspace-engine/test/controllers/harness/helpers.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes reconcile work-item acknowledgements failing after lease heartbeats by removing the updated_at-snapshot guard from AckSuccess across the reconcile queue API and its implementations (Postgres + in-memory). It also adds a defensive runtime check in the worker to surface unexpected “owned-but-not-deleted” ack results.

Changes:

  • Remove ClaimedUpdatedAt from reconcile.AckSuccessParams and update all call sites.
  • Update Postgres DeleteClaimedReconcileWorkItem to drop the updated_at predicate so lease heartbeats no longer block deletion.
  • Add worker-side handling/logging when AckSuccess returns Deleted=false without error.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
apps/workspace-engine/test/controllers/harness/helpers.go Updates test harness to call AckSuccess without the removed ClaimedUpdatedAt.
apps/workspace-engine/pkg/reconcile/workqueue.go Removes ClaimedUpdatedAt from the public reconcile queue ack params type.
apps/workspace-engine/pkg/reconcile/worker.go Uses ack result and adds a branch for Deleted=false to drop/log rather than marking processed.
apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql Removes the updated_at guard from the delete-claimed query and updates its documentation.
apps/workspace-engine/pkg/reconcile/postgres/pg.go Stops passing UpdatedAt into the sqlc-generated delete-claimed query.
apps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.go Updates ack calls for the new AckSuccessParams shape.
apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go Regenerates sqlc output for the updated delete-claimed query signature and SQL.
apps/workspace-engine/pkg/reconcile/memory/memory.go Removes the in-memory updated-at snapshot check from AckSuccess.
apps/workspace-engine/pkg/reconcile/memory/memory_test.go Updates ack calls for the new AckSuccessParams shape.
Files not reviewed (1)
  • apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -137,7 +140,6 @@ WITH target_scope AS (
DELETE FROM reconcile_work_scope AS s
USING target_scope AS t
WHERE s.id = t.id
Comment on lines 354 to 357
ack, err := queue.AckSuccess(ctx, reconcile.AckSuccessParams{
ItemID: item.ID,
WorkerID: "worker-a",
ClaimedUpdatedAt: item.UpdatedAt,
ItemID: item.ID,
WorkerID: "worker-a",
})
Comment on lines +299 to +303
if !ackResult.Deleted {
slog.ErrorContext(ctx,
"ack reported ownership but did not delete row — row will be re-claimed",
"item", item.ID,
"kind", item.Kind,
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/workspace-engine/pkg/reconcile/worker_test.go`:
- Around line 522-524: The mock ackFn in the subtest returns
AckSuccessResult{Deleted: false} but leaves Owned as the zero value; update the
mock return to AckSuccessResult{Owned: true, Deleted: false} in the ackFn used
by this subtest so it matches the intended regression shape (Owned=true,
Deleted=false) when invoking the function under test (ackFn with
AckSuccessParams).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e269e65-0b34-46c3-8820-1c0028c312a7

📥 Commits

Reviewing files that changed from the base of the PR and between 54fa9d9 and 750c6c2.

📒 Files selected for processing (2)
  • apps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.go
  • apps/workspace-engine/pkg/reconcile/worker_test.go

Comment on lines +522 to +524
ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) {
return AckSuccessResult{Deleted: false}, nil
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Match the mocked ack result to the intended regression shape (Owned=true, Deleted=false).

This subtest targets the Owned=true/Deleted=false case, but the mock currently leaves Owned at its zero value (false), which weakens the regression coverage.

Suggested fix
 		q := &fakeQueue{
 			ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) {
-				return AckSuccessResult{Deleted: false}, nil
+				return AckSuccessResult{Owned: true, Deleted: false}, nil
 			},
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) {
return AckSuccessResult{Deleted: false}, nil
},
ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) {
return AckSuccessResult{Owned: true, Deleted: false}, nil
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/pkg/reconcile/worker_test.go` around lines 522 - 524,
The mock ackFn in the subtest returns AckSuccessResult{Deleted: false} but
leaves Owned as the zero value; update the mock return to
AckSuccessResult{Owned: true, Deleted: false} in the ackFn used by this subtest
so it matches the intended regression shape (Owned=true, Deleted=false) when
invoking the function under test (ackFn with AckSuccessParams).

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql (1)

128-149: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Mirror this row-lock ownership check into the other claim-mutating queries.

This fixes the snapshot-to-row-lock race for DeleteClaimedReconcileWorkItem, but AckPermanentlyFailedReconcileWorkItem and RetryReconcileWorkItem below still resolve ownership in target_scope and then mutate on s.id = t.id only. After lease expiry, cleanup, and a re-claim by another worker, those statements can still delete or release a row the caller no longer owns.

Proposed follow-up
 -- name: AckPermanentlyFailedReconcileWorkItem :one
 WITH target_scope AS (
   SELECT s.id
   FROM reconcile_work_scope AS s
   WHERE s.id = sqlc.arg(id)
     AND s.claimed_by = sqlc.arg(claimed_by)
 ), deleted_scope AS (
   DELETE FROM reconcile_work_scope AS s
   USING target_scope AS t
   WHERE s.id = t.id
+    AND s.claimed_by = sqlc.arg(claimed_by)
   RETURNING s.id
 )
 SELECT
   EXISTS (SELECT 1 FROM target_scope) AS owned,
   EXISTS (SELECT 1 FROM deleted_scope) AS deleted;
@@
 -- name: RetryReconcileWorkItem :one
 WITH target_scope AS (
   SELECT s.id
   FROM reconcile_work_scope AS s
   WHERE s.id = sqlc.arg(id)
     AND s.claimed_by = sqlc.arg(claimed_by)
 ), released_scope AS (
   UPDATE reconcile_work_scope AS s
   SET
     attempt_count = s.attempt_count + 1,
     last_error = sqlc.arg(last_error),
     not_before = now() + make_interval(secs => sqlc.arg(retry_backoff_seconds)::int),
     claimed_by = NULL,
     claimed_until = NULL,
     updated_at = now()
   FROM target_scope AS t
-  WHERE s.id = t.id
+  WHERE s.id = t.id
+    AND s.claimed_by = sqlc.arg(claimed_by)
   RETURNING s.id
 )
 SELECT
   EXISTS (SELECT 1 FROM target_scope) AS owned,
   EXISTS (SELECT 1 FROM released_scope) AS released;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql`
around lines 128 - 149, The other claim-mutating SQL statements
(AckPermanentlyFailedReconcileWorkItem and RetryReconcileWorkItem) currently
resolve ownership only in the target_scope CTE and then mutate rows using s.id =
t.id, leaving a snapshot-to-row-lock race; update those queries to mirror the
pattern used in DeleteClaimedReconcileWorkItem: add/keep a WITH target_scope AS
(...) selecting id WHERE id = sqlc.arg(id) AND claimed_by =
sqlc.arg(claimed_by), then perform DELETE/UPDATE using target_scope AS t and
include AND s.claimed_by = sqlc.arg(claimed_by) in the WHERE so the claimed_by
is re-checked while the row is locked (apply the same deleted_scope/RETURNING
pattern for consistency).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql`:
- Around line 128-149: The other claim-mutating SQL statements
(AckPermanentlyFailedReconcileWorkItem and RetryReconcileWorkItem) currently
resolve ownership only in the target_scope CTE and then mutate rows using s.id =
t.id, leaving a snapshot-to-row-lock race; update those queries to mirror the
pattern used in DeleteClaimedReconcileWorkItem: add/keep a WITH target_scope AS
(...) selecting id WHERE id = sqlc.arg(id) AND claimed_by =
sqlc.arg(claimed_by), then perform DELETE/UPDATE using target_scope AS t and
include AND s.claimed_by = sqlc.arg(claimed_by) in the WHERE so the claimed_by
is re-checked while the row is locked (apply the same deleted_scope/RETURNING
pattern for consistency).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a3b0eda-b791-46c2-b022-9ca8a90c6e9f

📥 Commits

Reviewing files that changed from the base of the PR and between 750c6c2 and ece485e.

📒 Files selected for processing (3)
  • apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go
  • apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql
  • apps/workspace-engine/test/controllers/harness/helpers.go

@adityachoudhari26 adityachoudhari26 merged commit 79cd36c into main May 26, 2026
11 checks passed
@adityachoudhari26 adityachoudhari26 deleted the fix-reconcile-ack-stale-leak branch May 26, 2026 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants