fix: periodic reaper revives orphan jobs left stuck by lost events by ancongui · Pull Request #24 · firefly-operationOS/flydocs

ancongui · 2026-05-19T09:51:14Z

Summary

A second-pass audit on top of #23 found that the atomic-claim fix had a regression: when a worker crashes mid-RUN, the EDA redelivery's mark_running is rejected by the fresh lease, the bus cursor advances past the event, and the job is stuck in RUNNING forever. Four other orphan classes shared the same shape.

This adds a periodic reaper (JobReaper + BboxReaper) that closes the gap.
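
The regression can be reproduced in miniature with a lease-guarded claim. The sketch below is only an illustration under assumptions: the `jobs` table layout, the column names, and this `mark_running` signature are hypothetical, and SQLite stands in for Postgres. It shows why a redelivery that arrives while the lease is still fresh is rejected, and why the row stays claimable only after the lease has expired.

```python
import sqlite3

LEASE_S = 1260  # job_run_lease_s default after this PR


def mark_running(conn, job_id, now):
    """Atomically claim a job. Returns True only if this caller won the claim.

    Hypothetical schema: the real repository's columns may differ.
    """
    cur = conn.execute(
        """
        UPDATE jobs
           SET status = 'RUNNING', started_at = ?, attempts = attempts + 1
         WHERE id = ?
           AND (status = 'QUEUED'
                OR (status = 'RUNNING' AND started_at <= ?))
        """,
        (now, job_id, now - LEASE_S),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT, started_at REAL, attempts INTEGER)"
)
conn.execute("INSERT INTO jobs VALUES ('job-1', 'QUEUED', NULL, 0)")

# Worker A claims the job, then crashes mid-run.
first_claim = mark_running(conn, "job-1", now=0.0)
# The bus redelivers almost immediately: the lease is fresh, so the claim is rejected.
redelivery = mark_running(conn, "job-1", now=5.0)
# Only a later event (for example one republished by a reaper) can win the claim.
after_lease = mark_running(conn, "job-1", now=LEASE_S + 5.0)
attempts = conn.execute("SELECT attempts FROM jobs WHERE id = 'job-1'").fetchone()[0]
```

Without something republishing an event after the lease expires, nothing ever makes that third call, which is the gap the reaper closes.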

The five orphan classes (all confirmed against real Postgres before this change)

| # | Status | Cause | Reaper |
|---|--------|-------|--------|
| 1 | `QUEUED` | Submit handler crashed between row INSERT and outbox PUBLISH (the two are not co-transactional) | `JobReaper` |
| 2 | `RUNNING` | Worker crashed mid-extraction; fresh-lease redelivery is rejected, cursor advances | `JobReaper` |
| 3 | `QUEUED` | Worker's retry-path `_delayed_publish` task was killed before its `asyncio.sleep` completed | `JobReaper` |
| 4 | `PARTIAL_SUCCEEDED` | Main worker crashed between `mark_partial_succeeded` and the bbox-refine publish | `BboxReaper` |
| 5 | `REFINING_BBOXES` | Bbox worker crashed mid-grounding past its lease | `BboxReaper` |

How the reaper works

Each worker process now runs a reaper as a sidecar task alongside its main consume loop. Every reaper_sweep_interval_s (default 60s) the reaper queries for rows in stuck states past their threshold and republishes a fresh IDPJobSubmitted / IDPBboxRefineRequested event. Duplicate publishes from multiple replicas are deduped at claim time by the atomic mark_* transitions from #23 — running the reaper in every worker replica is safe.
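
A minimal sketch of one sweep cycle, assuming a repository that exposes a finder for stale rows and a publisher with an async `publish` method. `InMemoryRepo`, `RecordingPublisher`, `find_stale_job_ids`, `sweep_once`, and `run_reaper` are hypothetical names for illustration and are not taken from the codebase.

```python
import asyncio


class InMemoryRepo:
    """Hypothetical stand-in for the job repository."""

    def __init__(self, stale_ids):
        self._stale_ids = list(stale_ids)

    async def find_stale_job_ids(self):
        # The real finder would query for rows stuck past their threshold.
        return list(self._stale_ids)


class RecordingPublisher:
    """Hypothetical publisher that records what it was asked to publish."""

    def __init__(self):
        self.published = []

    async def publish(self, event_type, job_id):
        self.published.append((event_type, job_id))


async def sweep_once(repo, publisher):
    """Republish a fresh event for every stuck row; return how many were sent."""
    sent = 0
    for job_id in await repo.find_stale_job_ids():
        try:
            await publisher.publish("IDPJobSubmitted", job_id)
            sent += 1
        except Exception:
            # A failed publish must not abort the sweep; the next sweep retries.
            continue
    return sent


async def run_reaper(repo, publisher, interval_s, stop):
    """Sweep, then wait for the interval or the stop signal, whichever is first."""
    while not stop.is_set():
        await sweep_once(repo, publisher)
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass  # interval elapsed; run the next sweep


async def demo():
    repo = InMemoryRepo(["job-1", "job-2"])
    publisher = RecordingPublisher()
    stop = asyncio.Event()
    task = asyncio.create_task(run_reaper(repo, publisher, 0.01, stop))
    await asyncio.sleep(0.05)
    stop.set()
    await task  # clean shutdown: the loop exits once stop is set
    return publisher.published


published = asyncio.run(demo())
```

In this stub the same ids are republished on every sweep because nothing ever claims them. In the design described above, a republished row is claimed by exactly one worker through the atomic `mark_*` transition, so extra publishes from other replicas are discarded at claim time.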

Lease tweaks

Tightens defaults from 2 * timeout to timeout + 60s:

| Setting | Before | After |
|---------|--------|-------|
| `job_run_lease_s` | 2400 | 1260 |
| `bbox_refine_lease_s` | 1200 | 660 |

Crash-recovery time is now bounded by reaper_sweep_interval_s + job_run_lease_s (60 + 1260 = 1320 s, about 22 min at the defaults), and falls proportionally for tighter async_timeout_s overrides.
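
The arithmetic behind these numbers, as a small sketch. The helper names are hypothetical. The 1200 s job timeout is the shipped `async_timeout_s`; the 600 s bbox timeout is an inference from the before/after lease values (2 × 600 = 1200, 600 + 60 = 660) and is not stated in this PR.

```python
def lease_s(timeout_s, headroom_s=60):
    """New lease rule: timeout + 60 s (previously 2 * timeout)."""
    return timeout_s + headroom_s


def recovery_bound_s(sweep_interval_s, lease):
    """Worst case: the lease expires just after a sweep, so wait one more interval."""
    return sweep_interval_s + lease


job_lease = lease_s(1200)   # async_timeout_s = 1200
bbox_lease = lease_s(600)   # bbox timeout of 600 s is inferred, see above
bound = recovery_bound_s(60, job_lease)
bound_minutes = bound / 60

# A tighter async_timeout_s override shrinks the bound accordingly.
tight_bound = recovery_bound_s(60, lease_s(300))
```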

New settings

reaper_sweep_interval_s: int = 60
queued_orphan_threshold_s: int = 600           # 2x retry_max_delay_s
partial_succeeded_orphan_threshold_s: int = 1320  # async_timeout_s + 120

Test plan

tests/unit/test_reapers.py — 10 unit tests covering each new repository finder + reaper sweep behaviour + clean-shutdown + publisher-error resilience.
tests/integration/test_reaper_postgres.py — 3 real-Postgres tests proving end-to-end revival of all five orphan classes plus a full crash-recovery cycle (claim → backdate started_at → reaper → fresh worker claims with attempts=2).
Full unit + integration + pyfly EDA: 353 passed, 2 skipped (was 340; +13 net).
CLI smoke import for both cmd_worker and cmd_bbox_worker after the new asyncio.wait(FIRST_COMPLETED) wiring.
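
The `asyncio.wait(FIRST_COMPLETED)` wiring mentioned in the last item could look roughly like the sketch below. `run_with_sidecar`, `failing_worker`, and `reaper_forever` are hypothetical names; the actual `cmd_worker` and `cmd_bbox_worker` code may be structured differently. The point is that whichever task finishes first, the other is cancelled and any exception is re-raised so the process exits.

```python
import asyncio


async def run_with_sidecar(main_coro, sidecar_coro):
    """Run the consume loop and the reaper together; stop both when either ends."""
    main = asyncio.create_task(main_coro, name="consume-loop")
    sidecar = asyncio.create_task(sidecar_coro, name="reaper")
    done, pending = await asyncio.wait(
        {main, sidecar}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Wait for the cancelled task to unwind before propagating the outcome.
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raises the failure, if any


cancelled = []


async def failing_worker():
    await asyncio.sleep(0.01)
    raise RuntimeError("worker crashed")


async def reaper_forever():
    try:
        while True:
            await asyncio.sleep(0.005)
    except asyncio.CancelledError:
        cancelled.append("reaper")
        raise


try:
    asyncio.run(run_with_sidecar(failing_worker(), reaper_forever()))
    outcome = "clean exit"
except RuntimeError as exc:
    outcome = str(exc)
```

Letting the exception escape is what allows a container supervisor to restart the process, which resets the worker and the reaper together.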

The atomic-claim fix in #23 prevented duplicate processing but introduced a regression: when a worker crashed mid-RUN, the EDA redelivery's `mark_running` was rejected by the fresh lease, the bus cursor advanced past the event, and the job was left stuck in RUNNING forever.

Same shape orphan exists for:

1. QUEUED rows whose submit handler crashed between row INSERT and outbox PUBLISH (the two are not co-transactional);
2. QUEUED rows whose retry-path `_delayed_publish` task was killed before its `asyncio.sleep` completed;
3. RUNNING rows whose claimant crashed past `job_run_lease_s`;
4. PARTIAL_SUCCEEDED rows whose bbox event was never published (main worker crashed between `mark_partial_succeeded` and `publisher.publish` for the refine topic);
5. REFINING_BBOXES rows whose bbox claimant crashed past `bbox_refine_lease_s`.

All five were demonstrated against real Postgres before this change.

Fix: a `JobReaper` (RUNNING + QUEUED) and `BboxReaper` (REFINING_BBOXES + PARTIAL_SUCCEEDED-pending) periodically query for the stuck rows and republish a fresh EDA event for each. Duplicate publishes across replicas are deduped at claim time by the existing atomic `mark_*` transitions, so running a reaper in every worker container is safe. Recovery time is bounded by `reaper_sweep_interval_s + job_run_lease_s` (default ~22 min for the shipped `async_timeout_s=1200`; falls proportionally for tighter timeouts).

Tightens lease defaults from `2 * timeout` to `timeout + 60s` so crash recovery is faster while still leaving headroom for commit latency on the legitimate finalisation transitions.

Wires both reapers as a sidecar task to their respective worker processes via `asyncio.wait(FIRST_COMPLETED)` -- worker failure takes the reaper down with it and vice versa, so a container restart fully resets both.

Adds 4 new `find_stale_*` repository helpers + 10 unit tests covering each finder + reaper sweep behaviour + 3 real-Postgres integration tests proving end-to-end orphan revival.

353 passing (was 340).

* chore: clean up ruff findings introduced by #23 / #24 / #25

  The Lint check failed on all three merged PRs because the new files tripped ``F401`` (unused ``typing.Any``), ``SIM105`` (replace ``try``/``except``/``pass`` with ``contextlib.suppress``), ``UP041`` (replace ``asyncio.TimeoutError`` with builtin ``TimeoutError``), ``I001`` (import ordering), and ``F841`` (unused local). The other CI jobs (Unit tests, SDK Python, SDK Java, Typecheck, Docling) were all green on each PR; the merges weren't gated on Lint. This is the follow-up sweep so ``ruff check`` is clean on ``main``. 11 errors fixed (8 auto-fixed by ``ruff --fix``, 3 manual).

* chore: apply ``ruff format`` to the same files

  The Lint job runs both ``ruff check`` and ``ruff format --check``. The previous commit cleared the ``check`` half; this one runs ``ruff format`` over the 8 files in the same change set so the formatter half passes too. No behaviour change.

---------

Co-authored-by: ancongui <andres.contreras@soon.es>


ancongui merged commit ffa7316 into main May 19, 2026
5 of 6 checks passed

ancongui deleted the fix/orphan-revival-reapers branch May 19, 2026 09:51

ancongui mentioned this pull request May 19, 2026

chore: clean up ruff findings from #23 / #24 / #25 #26

Merged

ancongui mentioned this pull request May 19, 2026

docs: concurrency model + SDK 26.05.02 + new env settings #27

Merged
