
Reliability: Fix job completion idempotency and stale check grace period (#455, #456)#490

Merged
filthyrake merged 3 commits into dev from feature/455-456-reliability-improvements
Jan 3, 2026

Conversation

@filthyrake
Owner

Summary

This PR addresses two reliability issues identified during code review:

Issue #455: Job completion retry can overwrite metadata

  • Problem: Job completion is retried up to 3 times on network errors, but metadata fields (duration, source_width, source_height, streaming_format, primary_codec) were unconditionally overwritten on retry
  • Solution:
    • Added completion_token field to CompleteJobRequest for idempotency
    • Worker generates a unique token (job_id-UUID) per completion and reuses it across retries of that completion
    • Server checks Redis for duplicate tokens and returns early if found
    • Made all metadata updates idempotent (only set if not already set)
    • If job already marked completed in DB, returns success without re-processing
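The solution above can be sketched in plain Python. This is a simplified stand-in, not the actual handler in api/worker_api.py: a dict models Redis, a dataclass models the job row, and only two metadata fields are shown.

```python
from dataclasses import dataclass
from typing import Optional

# In-memory stand-ins so the sketch is runnable; the real code uses
# async Redis and a database transaction.
@dataclass
class Job:
    job_id: str
    status: str = "processing"
    duration: Optional[float] = None
    source_width: Optional[int] = None

def complete_job(job, metadata, completion_token, redis_store):
    """Idempotent completion: duplicate tokens and re-completion are no-ops."""
    token_key = f"vlog:completion_token:{job.job_id}:{completion_token}"
    if token_key in redis_store:
        return "duplicate"          # retry of an attempt we already processed
    if job.status == "completed":
        return "already_completed"  # completed earlier under a different token
    for name, value in metadata.items():
        if getattr(job, name) is None:   # only set fields that are still unset
            setattr(job, name, value)
    job.status = "completed"
    redis_store[token_key] = "completed"  # real code: SET ... EX <ttl>
    return "completed"
```

A retried request with the same token returns early and cannot overwrite metadata that the first attempt already wrote.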

Issue #456: Stale job checker grace period resets on API restart

  • Problem: The 2-minute grace period was based on API process uptime, not wall clock time. Frequent restarts prevented stale job cleanup from ever running
  • Solution:
    • Store last stale check time in Redis (stale_job_checker:last_run)
    • On startup, check Redis for recent checks before applying grace period
    • If another API instance recently checked, skip local grace period
    • After each successful check, record timestamp with 5-minute TTL
    • Applied same fix to orphan quality directory cleanup
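A minimal sketch of the grace-period logic, again with a dict standing in for Redis (function names are illustrative, not the actual ones in api/worker_api.py). Because the recorded timestamp carries a 5-minute TTL in the real system, the mere presence of the key means some API instance checked recently:

```python
import time

GRACE_PERIOD_SECONDS = 120   # local 2-minute grace period after startup
LAST_RUN_KEY = "stale_job_checker:last_run"

def grace_period_active(redis_store, started_at, now=None):
    """Return True while this instance should still wait before checking.

    redis_store stands in for Redis; the real key has a 5-minute TTL,
    so a present value means another instance checked recently and the
    local grace period can be skipped.
    """
    now = now if now is not None else time.monotonic()
    if redis_store.get(LAST_RUN_KEY) is not None:
        return False  # a recent check happened somewhere; no need to wait
    return (now - started_at) < GRACE_PERIOD_SECONDS

def record_check(redis_store, now=None):
    # Real code: SET stale_job_checker:last_run <timestamp> EX 300
    redis_store[LAST_RUN_KEY] = now if now is not None else time.time()
```

Since the decision keys off wall-clock state in Redis rather than process uptime, frequent API restarts no longer reset the grace period indefinitely.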

Test plan

  • Run linter (ruff check) - passes
  • Run job completion tests - passes
  • Run stale job detection tests - passes
  • Run HTTP client tests - passes

Files changed

  • api/worker_api.py - Add Redis-based idempotency and grace period tracking
  • api/worker_schemas.py - Add completion_token field
  • worker/http_client.py - Generate completion tokens
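The worker-side token generation in worker/http_client.py can be sketched as a one-liner (the function name here is hypothetical):

```python
import uuid

def make_completion_token(job_id: str) -> str:
    """One idempotency token per completion; the worker reuses it on retries.

    Format matches the server-side validation: {job_id}-{uuid4}.
    str(uuid.uuid4()) is already lowercase hex, matching the server's
    lowercase normalization.
    """
    return f"{job_id}-{uuid.uuid4()}"
```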

Closes #455
Closes #456

🤖 Generated with Claude Code

filthyrake and others added 2 commits January 3, 2026 07:53
…iod (#455, #456)

Issue #455: Job completion retry can overwrite metadata
- Add completion_token field to CompleteJobRequest for idempotency
- Worker generates unique token (job_id + UUID) on completion
- Server checks Redis for duplicate tokens, returns early if found
- Make metadata updates idempotent (only set if not already set)
- If job already completed, return success without re-processing

Issue #456: Stale job checker grace period resets on API restart
- Store last stale check time in Redis (key: stale_job_checker:last_run)
- On startup, check Redis for recent check before applying grace period
- If another API instance recently checked, skip local grace period
- After each successful check, record timestamp in Redis (5-min TTL)
- Apply same fix to orphan quality directory cleanup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes identified by specialist reviewers:

1. Fix .decode() bug - Redis client uses decode_responses=True, so
   redis.get() returns strings, not bytes. Removed .decode() call.

2. Add input validation to completion_token:
   - max_length=100 to prevent memory abuse
   - Regex pattern validation for format: {job_id}-{uuid4}
   - Normalize to lowercase

3. Scope token Redis key to job_id:
   - Key format: vlog:completion_token:{job_id}:{token}
   - Prevents cross-job token collisions

4. Use atomic SETNX for token storage:
   - SET with NX flag (set-if-not-exists) before transaction
   - Prevents race condition where two requests pass check before either stores
   - Token initially set to "processing", updated to "completed" after success

5. Increase token TTL from 5 to 15 minutes:
   - Covers extended retry scenarios with exponential backoff
   - New constant: COMPLETION_TOKEN_TTL = 900

6. Add Redis key namespace prefix:
   - All keys now prefixed with "vlog:" to prevent collisions
   - New constant: REDIS_KEY_PREFIX = "vlog:"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake
Owner Author

Code Review Fixes Applied

Based on the comprehensive specialist reviews, the following issues have been addressed in commit 60ea529:

🔴 Critical Fixes

  1. Fixed .decode() bug - Redis client uses decode_responses=True, so redis.get() returns strings, not bytes. Removed the erroneous .decode() call that would have caused AttributeError at runtime.

  2. Fixed race condition with atomic SETNX - Now using redis.set(key, value, nx=True) (set-if-not-exists) BEFORE the database transaction. This prevents the TOCTOU race where two concurrent requests could both pass the token check before either stores.

  3. Added input validation to completion_token - Now validates:

    • max_length=100 to prevent memory exhaustion attacks
    • Regex pattern: {job_id}-{uuid4} format
    • Normalizes UUID portion to lowercase
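The three validation rules above can be sketched as a standalone function (the real check lives in the Pydantic schema in api/worker_schemas.py; the job-id character set assumed here is illustrative):

```python
import re

MAX_TOKEN_LENGTH = 100

# {job_id}-{uuid4}; the job-id charset is an assumption for this sketch.
_TOKEN_RE = re.compile(
    r"^(?P<job_id>[A-Za-z0-9_-]+)-"
    r"(?P<uuid>[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def validate_completion_token(token: str) -> str:
    """Reject oversized or malformed tokens; lowercase the UUID portion."""
    if len(token) > MAX_TOKEN_LENGTH:
        raise ValueError("completion_token too long")
    m = _TOKEN_RE.fullmatch(token)
    if m is None:
        raise ValueError("completion_token must be {job_id}-{uuid4}")
    return f"{m.group('job_id')}-{m.group('uuid').lower()}"
```

The length check runs before the regex so an attacker cannot force expensive matching on a multi-megabyte string.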

🟠 High Priority Fixes

  1. Scoped token Redis key to job_id - Key format changed from completion_token:{token} to vlog:completion_token:{job_id}:{token}. Prevents cross-job token collisions.

  2. Increased token TTL to 15 minutes - Extended from 5 minutes to 900 seconds. Covers extended retry scenarios with exponential backoff (worker retries can exceed 5 minutes).

  3. Added Redis key namespace prefix - All keys now use vlog: prefix to prevent collisions if Redis is shared with other applications.

Implementation Details

The atomic token handling now works as follows:

  1. Before transaction: SET vlog:completion_token:{job_id}:{token} "processing" EX 900 NX
  2. If SET returns a falsy value (redis-py returns None when NX fails) → token already exists → return "Job already completed"
  3. If SET returns True → we own the token → proceed with transaction
  4. After successful transaction: Update token value to "completed"

This eliminates the race window identified by reviewers.
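The four-step lifecycle above can be sketched with a dict modeling Redis (a dict's check-then-set is not actually atomic across processes; the real code relies on Redis's NX flag for atomicity):

```python
COMPLETION_TOKEN_TTL = 900  # seconds; constant name from the PR

def try_claim_token(redis_store, job_id, token):
    """Stand-in for the atomic claim: True only for the first claimant.

    Real call: redis.set(key, "processing", ex=COMPLETION_TOKEN_TTL, nx=True)
    """
    key = f"vlog:completion_token:{job_id}:{token}"
    if key in redis_store:
        return False                     # step 2: duplicate, return early
    redis_store[key] = "processing"      # step 1: claim before the DB transaction
    return True                          # step 3: we own the token, proceed

def mark_token_completed(redis_store, job_id, token):
    # step 4: flip the value after the transaction commits
    redis_store[f"vlog:completion_token:{job_id}:{token}"] = "completed"

def release_token(redis_store, job_id, token):
    # on DatabaseLockedError: delete so the worker can retry with this token
    redis_store.pop(f"vlog:completion_token:{job_id}:{token}", None)
```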

If the database transaction fails (DatabaseLockedError), the completion
token was left in Redis with status "processing" for 15 minutes. This
blocked any retry attempts with the same token, even though the job
completion never actually succeeded.

Now we delete the token from Redis when the transaction fails, allowing
the worker to retry immediately with the same token.

Changes:
- Move token_key to outer scope for access in exception handler
- Add token cleanup in DatabaseLockedError exception handler
- Best-effort cleanup with logging (don't mask original error)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake
Owner Author

Additional Fix: Token Cleanup on Transaction Failure

Based on Gafton's review feedback, added token cleanup in commit 0bbacf6.

Problem: If the database transaction fails (DatabaseLockedError), the completion token was left in Redis with status "processing" for 15 minutes. This created a retry blackout period where subsequent attempts with the same token would be rejected, even though the job never actually completed.

Fix: Now we delete the token from Redis when the transaction fails:

```python
except DatabaseLockedError as e:
    # Clean up token so retry can succeed
    if token_acquired and token_key:
        try:
            redis = await get_redis()
            if redis:
                await redis.delete(token_key)
        except Exception as cleanup_err:
            logger.warning(f"Failed to clean up completion token: {cleanup_err}")
    raise HTTPException(...)
```

This allows the worker to retry immediately with the same token after a transient database failure.

@filthyrake filthyrake merged commit 3949db8 into dev Jan 3, 2026
5 checks passed
@filthyrake filthyrake deleted the feature/455-456-reliability-improvements branch January 3, 2026 16:28

Development

Successfully merging this pull request may close these issues:

  • Reliability: Stale job checker grace period resets on API restart
  • Reliability: Job completion retry can overwrite metadata
