Reliability: Fix job completion idempotency and stale check grace period (#455, #456) #490
…iod (#455, #456)
Issue #455: Job completion retry can overwrite metadata
- Add completion_token field to CompleteJobRequest for idempotency
- Worker generates unique token (job_id + UUID) on completion
- Server checks Redis for duplicate tokens, returns early if found
- Make metadata updates idempotent (only set if not already set)
- If job already completed, return success without re-processing
Issue #456: Stale job checker grace period resets on API restart
- Store last stale check time in Redis (key: stale_job_checker:last_run)
- On startup, check Redis for recent check before applying grace period
- If another API instance recently checked, skip local grace period
- After each successful check, record timestamp in Redis (5-min TTL)
- Apply same fix to orphan quality directory cleanup
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
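The startup behaviour described for Issue #456 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `FakeRedis`, `record_check`, and `should_apply_grace_period` are hypothetical names, and an in-memory dict stands in for Redis so the sketch runs on its own. Only the key name and the 5-minute TTL come from the commit message.

```python
from typing import Optional

LAST_RUN_KEY = "stale_job_checker:last_run"  # key name from the commit message
LAST_RUN_TTL = 300  # 5-minute TTL, per the commit message


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self) -> None:
        # Maps key -> (value, absolute expiry time in seconds).
        self._data: dict[str, tuple[str, float]] = {}

    def set(self, key: str, value: str, ex: int, now: float) -> None:
        # Store the value together with the time at which it expires.
        self._data[key] = (value, now + ex)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]


def record_check(store: FakeRedis, now: float) -> None:
    """After a successful stale check, record when it ran."""
    store.set(LAST_RUN_KEY, str(now), ex=LAST_RUN_TTL, now=now)


def should_apply_grace_period(store: FakeRedis, now: float) -> bool:
    """On startup, apply the local grace period only if no instance checked recently."""
    return store.get(LAST_RUN_KEY, now=now) is None
```

Because the timestamp lives in shared storage rather than in process memory, a restarted API instance sees the previous check and does not start a fresh grace period.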
Fixes identified by specialist reviewers:
1. Fix .decode() bug - Redis client uses decode_responses=True, so
redis.get() returns strings, not bytes. Removed .decode() call.
2. Add input validation to completion_token:
- max_length=100 to prevent memory abuse
- Regex pattern validation for format: {job_id}-{uuid4}
- Normalize to lowercase
3. Scope token Redis key to job_id:
- Key format: vlog:completion_token:{job_id}:{token}
- Prevents cross-job token collisions
4. Use atomic SETNX for token storage:
- SET with NX flag (set-if-not-exists) before transaction
- Prevents race condition where two requests pass check before either stores
- Token initially set to "processing", updated to "completed" after success
5. Increase token TTL from 5 to 15 minutes:
- Covers extended retry scenarios with exponential backoff
- New constant: COMPLETION_TOKEN_TTL = 900
6. Add Redis key namespace prefix:
- All keys now prefixed with "vlog:" to prevent collisions
- New constant: REDIS_KEY_PREFIX = "vlog:"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
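Items 2, 3, and 6 above (token validation, job-scoped keys, and the key prefix) might look something like the sketch below. The helper names are hypothetical, and the regex assumes that job_id is a decimal integer, which the commit message does not state. The constants and the key format are taken from the list above.

```python
import re
import uuid

REDIS_KEY_PREFIX = "vlog:"    # constant named in the commit message
COMPLETION_TOKEN_TTL = 900    # 15 minutes, per the commit message (not used in this sketch)
MAX_TOKEN_LENGTH = 100        # max_length from the commit message

# Assumption: job_id is a decimal integer. The UUID part must be a version 4 UUID.
_TOKEN_RE = re.compile(
    r"\d+-[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}"
)


def make_completion_token(job_id: int) -> str:
    """Worker side: build a token in the {job_id}-{uuid4} format."""
    return f"{job_id}-{uuid.uuid4()}"


def normalize_completion_token(token: str) -> str:
    """Server side: enforce the length limit, lowercase, and check the format."""
    if len(token) > MAX_TOKEN_LENGTH:
        raise ValueError("completion_token is too long")
    token = token.lower()
    if _TOKEN_RE.fullmatch(token) is None:
        raise ValueError("completion_token must look like {job_id}-{uuid4}")
    return token


def completion_token_key(job_id: int, token: str) -> str:
    """Build the job-scoped Redis key: vlog:completion_token:{job_id}:{token}."""
    return f"{REDIS_KEY_PREFIX}completion_token:{job_id}:{token}"
```

Checking the length before running the regex keeps oversized input from reaching the pattern matcher at all.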
Code Review Fixes Applied
Based on the comprehensive specialist reviews, the following issues have been addressed in commit 60ea529:
🔴 Critical Fixes
🟠 High Priority Fixes
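Implementation Details
The atomic token handling now works as follows: the server claims the token with SET NX before the transaction, storing it as "processing", and updates it to "completed" after success.
This eliminates the race window identified by reviewers.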
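The claim-then-complete flow can be illustrated with a small self-contained sketch. It is not the PR's implementation: `InMemoryStore` and `complete_job` are hypothetical, the store only mimics set-if-not-exists semantics (no TTL), and the sketch releases the token on any exception rather than on a specific database error. With the redis-py client, the equivalent claim would typically be a `set(key, value, nx=True, ex=ttl)` call.

```python
from typing import Callable, Optional


class InMemoryStore:
    """Hypothetical stand-in for Redis that mimics SET ... NX (no TTL handling)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str, nx: bool = False) -> bool:
        # With nx=True, refuse to overwrite an existing key.
        if nx and key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def complete_job(store: InMemoryStore, token_key: str,
                 run_transaction: Callable[[], None]) -> str:
    # Atomic claim: only the first request with this token gets past this line.
    if not store.set(token_key, "processing", nx=True):
        return "duplicate"
    try:
        run_transaction()
    except Exception:
        # Release the claim so a retry with the same token can proceed.
        store.delete(token_key)
        raise
    # Mark the token as done once the transaction has succeeded.
    store.set(token_key, "completed")
    return "completed"
```

Since the check and the write happen in one operation, two concurrent requests cannot both observe "no token" and both proceed.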
If the database transaction fails (DatabaseLockedError), the completion token was left in Redis with status "processing" for 15 minutes. This blocked any retry attempts with the same token, even though the job completion never actually succeeded.
Now we delete the token from Redis when the transaction fails, allowing the worker to retry immediately with the same token.
Changes:
- Move token_key to outer scope for access in exception handler
- Add token cleanup in DatabaseLockedError exception handler
- Best-effort cleanup with logging (don't mask original error)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Additional Fix: Token Cleanup on Transaction Failure
Based on Gafton's review feedback, added token cleanup in commit 0bbacf6.
Problem: If the database transaction fails (DatabaseLockedError), the completion token was left in Redis with status "processing" for 15 minutes, blocking any retry with the same token even though the job completion never succeeded.
Fix: Now we delete the token from Redis when the transaction fails:

    except DatabaseLockedError as e:
        # Clean up token so retry can succeed
        if token_acquired and token_key:
            try:
                redis = await get_redis()
                if redis:
                    await redis.delete(token_key)
            except Exception as cleanup_err:
                # Best-effort cleanup: log only, so the original error is not masked
                logger.warning(f"Failed to clean up completion token: {cleanup_err}")
        raise HTTPException(...)

This allows the worker to retry immediately with the same token after a transient database failure.
Summary
This PR addresses two reliability issues identified during code review:
Issue #455: Job completion retry can overwrite metadata
- Metadata fields (duration, source_width, source_height, streaming_format, primary_codec) were unconditionally overwritten on retry
- Add completion_token field to CompleteJobRequest for idempotency
- Worker generates a unique token (job_id-UUID) on each completion attempt
Issue #456: Stale job checker grace period resets on API restart
- Store last stale check time in Redis (key: stale_job_checker:last_run)
Test plan
- ruff check - passes
Files changed
- api/worker_api.py - Add Redis-based idempotency and grace period tracking
- api/worker_schemas.py - Add completion_token field
- worker/http_client.py - Generate completion tokens
Closes #455
Closes #456
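For reference, the "only set if not already set" rule for metadata from Issue #455 could be sketched as below. The function name is hypothetical, and the sketch assumes that "not set" means the field is missing or None. The field names are the ones listed in the summary.

```python
from typing import Any

# Metadata fields named in the PR summary.
METADATA_FIELDS = (
    "duration",
    "source_width",
    "source_height",
    "streaming_format",
    "primary_codec",
)


def merge_metadata_once(existing: dict[str, Any], incoming: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of existing with unset fields filled from incoming.

    Fields that already hold a value are left untouched, so a retried
    completion request cannot overwrite them.
    """
    merged = dict(existing)
    for field in METADATA_FIELDS:
        if merged.get(field) is None and incoming.get(field) is not None:
            merged[field] = incoming[field]
    return merged
```

Applying the same request twice leaves the stored values unchanged, which is the property the retry path relies on.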
🤖 Generated with Claude Code