
Reliability: Fix job completion idempotency and stale check grace period (#455, #456)#490

Merged
filthyrake merged 3 commits into dev from feature/455-456-reliability-improvements
Jan 3, 2026

Conversation

@filthyrake
Owner

Summary

This PR addresses two reliability issues identified during code review:

Issue #455: Job completion retry can overwrite metadata

  • Problem: Job completion is retried up to 3 times on network errors, but metadata fields (duration, source_width, source_height, streaming_format, primary_codec) were unconditionally overwritten on retry
  • Solution:
    • Added completion_token field to CompleteJobRequest for idempotency
    • Worker generates a unique token (job_id-UUID) per completion and reuses it across retries of that completion
    • Server checks Redis for duplicate tokens and returns early if found
    • Made all metadata updates idempotent (only set if not already set)
    • If job already marked completed in DB, returns success without re-processing
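The solution above can be sketched in plain Python. This is a simplified stand-in, not the actual handler in api/worker_api.py: a dict models Redis, a dataclass models the job row, and only two metadata fields are shown.

```python
from dataclasses import dataclass
from typing import Optional

# In-memory stand-ins so the sketch is runnable; the real code uses
# async Redis and a database transaction.
@dataclass
class Job:
    job_id: str
    status: str = "processing"
    duration: Optional[float] = None
    source_width: Optional[int] = None

def complete_job(job, metadata, completion_token, redis_store):
    """Idempotent completion: duplicate tokens and re-completion are no-ops."""
    token_key = f"vlog:completion_token:{job.job_id}:{completion_token}"
    if token_key in redis_store:
        return "duplicate"          # retry of an attempt we already processed
    if job.status == "completed":
        return "already_completed"  # completed earlier under a different token
    for name, value in metadata.items():
        if getattr(job, name) is None:   # only set fields that are still unset
            setattr(job, name, value)
    job.status = "completed"
    redis_store[token_key] = "completed"  # real code: SET ... EX <ttl>
    return "completed"
```

A retried request with the same token returns early and cannot overwrite metadata that the first attempt already wrote.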

Issue #456: Stale job checker grace period resets on API restart

  • Problem: The 2-minute grace period was based on API process uptime, not wall clock time. Frequent restarts prevented stale job cleanup from ever running
  • Solution:
    • Store last stale check time in Redis (stale_job_checker:last_run)
    • On startup, check Redis for recent checks before applying grace period
    • If another API instance recently checked, skip local grace period
    • After each successful check, record timestamp with 5-minute TTL
    • Applied same fix to orphan quality directory cleanup
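A minimal sketch of the grace-period logic, again with a dict standing in for Redis (function names are illustrative, not the actual ones in api/worker_api.py). Because the recorded timestamp carries a 5-minute TTL in the real system, the mere presence of the key means some API instance checked recently:

```python
import time

GRACE_PERIOD_SECONDS = 120   # local 2-minute grace period after startup
LAST_RUN_KEY = "stale_job_checker:last_run"

def grace_period_active(redis_store, started_at, now=None):
    """Return True while this instance should still wait before checking.

    redis_store stands in for Redis; the real key has a 5-minute TTL,
    so a present value means another instance checked recently and the
    local grace period can be skipped.
    """
    now = now if now is not None else time.monotonic()
    if redis_store.get(LAST_RUN_KEY) is not None:
        return False  # a recent check happened somewhere; no need to wait
    return (now - started_at) < GRACE_PERIOD_SECONDS

def record_check(redis_store, now=None):
    # Real code: SET stale_job_checker:last_run <timestamp> EX 300
    redis_store[LAST_RUN_KEY] = now if now is not None else time.time()
```

Since the decision keys off wall-clock state in Redis rather than process uptime, frequent API restarts no longer reset the grace period indefinitely.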

Test plan

  • Run linter (ruff check) - passes
  • Run job completion tests - passes
  • Run stale job detection tests - passes
  • Run HTTP client tests - passes

Files changed

  • api/worker_api.py - Add Redis-based idempotency and grace period tracking
  • api/worker_schemas.py - Add completion_token field
  • worker/http_client.py - Generate completion tokens
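The worker-side token generation in worker/http_client.py can be sketched as a one-liner (the function name here is hypothetical):

```python
import uuid

def make_completion_token(job_id: str) -> str:
    """One idempotency token per completion; the worker reuses it on retries.

    Format matches the server-side validation: {job_id}-{uuid4}.
    str(uuid.uuid4()) is already lowercase hex, matching the server's
    lowercase normalization.
    """
    return f"{job_id}-{uuid.uuid4()}"
```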

Closes #455
Closes #456

🤖 Generated with Claude Code

filthyrake and others added 2 commits January 3, 2026 07:53
…iod (#455, #456)

Issue #455: Job completion retry can overwrite metadata
- Add completion_token field to CompleteJobRequest for idempotency
- Worker generates unique token (job_id + UUID) on completion
- Server checks Redis for duplicate tokens, returns early if found
- Make metadata updates idempotent (only set if not already set)
- If job already completed, return success without re-processing

Issue #456: Stale job checker grace period resets on API restart
- Store last stale check time in Redis (key: stale_job_checker:last_run)
- On startup, check Redis for recent check before applying grace period
- If another API instance recently checked, skip local grace period
- After each successful check, record timestamp in Redis (5-min TTL)
- Apply same fix to orphan quality directory cleanup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes identified by specialist reviewers:

1. Fix .decode() bug - Redis client uses decode_responses=True, so
   redis.get() returns strings, not bytes. Removed .decode() call.

2. Add input validation to completion_token:
   - max_length=100 to prevent memory abuse
   - Regex pattern validation for format: {job_id}-{uuid4}
   - Normalize to lowercase

3. Scope token Redis key to job_id:
   - Key format: vlog:completion_token:{job_id}:{token}
   - Prevents cross-job token collisions

4. Use atomic SETNX for token storage:
   - SET with NX flag (set-if-not-exists) before transaction
   - Prevents race condition where two requests pass check before either stores
   - Token initially set to "processing", updated to "completed" after success

5. Increase token TTL from 5 to 15 minutes:
   - Covers extended retry scenarios with exponential backoff
   - New constant: COMPLETION_TOKEN_TTL = 900

6. Add Redis key namespace prefix:
   - All keys now prefixed with "vlog:" to prevent collisions
   - New constant: REDIS_KEY_PREFIX = "vlog:"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake
Owner Author

Code Review Fixes Applied

Based on the comprehensive specialist reviews, the following issues have been addressed in commit 60ea529:

🔴 Critical Fixes

  1. Fixed .decode() bug - Redis client uses decode_responses=True, so redis.get() returns strings, not bytes. Removed the erroneous .decode() call that would have caused AttributeError at runtime.

  2. Fixed race condition with atomic SETNX - Now using redis.set(key, value, nx=True) (set-if-not-exists) BEFORE the database transaction. This prevents the TOCTOU race where two concurrent requests could both pass the token check before either stores.

  3. Added input validation to completion_token - Now validates:

    • max_length=100 to prevent memory exhaustion attacks
    • Regex pattern: {job_id}-{uuid4} format
    • Normalizes UUID portion to lowercase
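The three validation rules above can be sketched as a standalone function (the real check lives in the Pydantic schema in api/worker_schemas.py; the job-id character set assumed here is illustrative):

```python
import re

MAX_TOKEN_LENGTH = 100

# {job_id}-{uuid4}; the job-id charset is an assumption for this sketch.
_TOKEN_RE = re.compile(
    r"^(?P<job_id>[A-Za-z0-9_-]+)-"
    r"(?P<uuid>[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def validate_completion_token(token: str) -> str:
    """Reject oversized or malformed tokens; lowercase the UUID portion."""
    if len(token) > MAX_TOKEN_LENGTH:
        raise ValueError("completion_token too long")
    m = _TOKEN_RE.fullmatch(token)
    if m is None:
        raise ValueError("completion_token must be {job_id}-{uuid4}")
    return f"{m.group('job_id')}-{m.group('uuid').lower()}"
```

The length check runs before the regex so an attacker cannot force expensive matching on a multi-megabyte string.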

🟠 High Priority Fixes

  1. Scoped token Redis key to job_id - Key format changed from completion_token:{token} to vlog:completion_token:{job_id}:{token}. Prevents cross-job token collisions.

  2. Increased token TTL to 15 minutes - Extended from 5 minutes to 900 seconds. Covers extended retry scenarios with exponential backoff (worker retries can exceed 5 minutes).

  3. Added Redis key namespace prefix - All keys now use vlog: prefix to prevent collisions if Redis is shared with other applications.

Implementation Details

The atomic token handling now works as follows:

  1. Before transaction: SET vlog:completion_token:{job_id}:{token} "processing" EX 900 NX
  2. If SET returns a falsy value (redis-py returns None when NX fails) → token already exists → return "Job already completed"
  3. If SET returns True → we own the token → proceed with transaction
  4. After successful transaction: Update token value to "completed"

This eliminates the race window identified by reviewers.
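The four-step lifecycle above can be sketched with a dict modeling Redis (a dict's check-then-set is not actually atomic across processes; the real code relies on Redis's NX flag for atomicity):

```python
COMPLETION_TOKEN_TTL = 900  # seconds; constant name from the PR

def try_claim_token(redis_store, job_id, token):
    """Stand-in for the atomic claim: True only for the first claimant.

    Real call: redis.set(key, "processing", ex=COMPLETION_TOKEN_TTL, nx=True)
    """
    key = f"vlog:completion_token:{job_id}:{token}"
    if key in redis_store:
        return False                     # step 2: duplicate, return early
    redis_store[key] = "processing"      # step 1: claim before the DB transaction
    return True                          # step 3: we own the token, proceed

def mark_token_completed(redis_store, job_id, token):
    # step 4: flip the value after the transaction commits
    redis_store[f"vlog:completion_token:{job_id}:{token}"] = "completed"

def release_token(redis_store, job_id, token):
    # on DatabaseLockedError: delete so the worker can retry with this token
    redis_store.pop(f"vlog:completion_token:{job_id}:{token}", None)
```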

If the database transaction fails (DatabaseLockedError), the completion
token was left in Redis with status "processing" for 15 minutes. This
blocked any retry attempts with the same token, even though the job
completion never actually succeeded.

Now we delete the token from Redis when the transaction fails, allowing
the worker to retry immediately with the same token.

Changes:
- Move token_key to outer scope for access in exception handler
- Add token cleanup in DatabaseLockedError exception handler
- Best-effort cleanup with logging (don't mask original error)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake
Owner Author

Additional Fix: Token Cleanup on Transaction Failure

Based on Gafton's review feedback, added token cleanup in commit 0bbacf6.

Problem: If the database transaction fails (DatabaseLockedError), the completion token was left in Redis with status "processing" for 15 minutes. This created a retry blackout period where subsequent attempts with the same token would be rejected, even though the job never actually completed.

Fix: Now we delete the token from Redis when the transaction fails:

```python
except DatabaseLockedError as e:
    # Clean up token so retry can succeed
    if token_acquired and token_key:
        try:
            redis = await get_redis()
            if redis:
                await redis.delete(token_key)
        except Exception as cleanup_err:
            logger.warning(f"Failed to clean up completion token: {cleanup_err}")
    raise HTTPException(...)
```

This allows the worker to retry immediately with the same token after a transient database failure.

@filthyrake filthyrake merged commit 3949db8 into dev Jan 3, 2026
5 checks passed
@filthyrake filthyrake deleted the feature/455-456-reliability-improvements branch January 3, 2026 16:28

Development

Successfully merging this pull request may close these issues:

  • Reliability: Stale job checker grace period resets on API restart
  • Reliability: Job completion retry can overwrite metadata
