Reliability: Job completion retry can overwrite metadata #455

@filthyrake

Summary

Job completion is retried up to 3 times, but some metadata fields are unconditionally overwritten on retry.

Location

worker/remote_transcoder.py:798-828 and api/worker_api.py:1343-1361

Issue

Retry scenario (a lost response causes the same request to be processed twice):

  1. Worker completes job, calls complete_job
  2. Server processes request: marks job complete, inserts quality records
  3. Network error before response reaches worker
  4. Worker retries complete_job
  5. Server processes again
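
The sequence above can be reproduced with a small simulation. This is only a sketch: `FakeServer` and `complete_with_retry` are hypothetical names, not the code in worker/remote_transcoder.py, and the retry limit of 3 is taken from the summary.

```python
class FakeServer:
    """Hypothetical stand-in for the completion endpoint; counts how often it runs."""

    def __init__(self, drop_first_response=True):
        self.processed = 0
        self._drop = drop_first_response

    def complete_job(self, job_id):
        self.processed += 1  # steps 2 and 5: the server-side work happens
        if self._drop:
            self._drop = False
            raise ConnectionError("response lost")  # step 3
        return {"status": "ok"}


def complete_with_retry(server, job_id, max_attempts=3):
    """Retry on network errors (step 4); re-raise after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return server.complete_job(job_id)
        except ConnectionError:
            if attempt == max_attempts:
                raise


server = FakeServer()
result = complete_with_retry(server, job_id=42)
print(result, server.processed)
```

The worker sees one successful call, while the server has run the completion logic twice.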

Analysis of idempotency:

  • Quality inserts: ✅ Idempotent (checks for existing records)
  • published_at: ✅ Idempotent (only sets if null)
  • duration, source_width, source_height: ❌ Unconditionally overwritten

This is unlikely to cause problems in practice, since a retry should send the same values, but the operation is not strictly idempotent.
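
To illustrate the difference, the "only set if null" behaviour described for published_at could in principle be applied to the other three fields. The sketch below uses an in-memory SQLite database; the `videos` table and the `apply_metadata` helper are assumptions made for illustration and do not reflect the actual schema or the fix recommended below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE videos (id INTEGER PRIMARY KEY, duration REAL, "
    "source_width INTEGER, source_height INTEGER)"
)
conn.execute("INSERT INTO videos (id) VALUES (1)")


def apply_metadata(conn, video_id, duration, width, height):
    """Set each field only while it is still NULL, so a retry cannot overwrite it."""
    conn.execute(
        "UPDATE videos SET "
        "duration = COALESCE(duration, ?), "
        "source_width = COALESCE(source_width, ?), "
        "source_height = COALESCE(source_height, ?) "
        "WHERE id = ?",
        (duration, width, height, video_id),
    )


apply_metadata(conn, 1, 12.5, 1920, 1080)  # first completion
apply_metadata(conn, 1, 99.0, 640, 480)    # retry carrying different values
row = conn.execute(
    "SELECT duration, source_width, source_height FROM videos WHERE id = 1"
).fetchone()
print(row)
```

With this pattern the values written by the first request survive any number of retries.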

Recommended Fix

Add an idempotency token to the complete_job request:

from uuid import uuid4

# Worker generates the completion token once, before the first attempt,
# and sends the same token on every retry
completion_token = f"{job_id}-{uuid4()}"

# Server tracks completed tokens in cache for 5 minutes
if await redis.get(f"completion:{completion_token}"):
    return {"status": "already_completed"}

# Process completion...
await redis.set(f"completion:{completion_token}", "1", ex=300)
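
A self-contained sketch of the token flow follows, under stated assumptions: `FakeCache` is a hypothetical in-memory stand-in for the Redis client, and this `complete_job` handler is simplified and is not the code in api/worker_api.py. It shows that a retry carrying the same token skips the completion logic.

```python
import asyncio
import time
from uuid import uuid4


class FakeCache:
    """Hypothetical in-memory stand-in for the Redis calls (get / set with ex)."""

    def __init__(self):
        self._data = {}

    async def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    async def set(self, key, value, ex):
        self._data[key] = (value, time.monotonic() + ex)


cache = FakeCache()
processed = []  # records each time the completion logic actually runs


async def complete_job(job_id, completion_token):
    key = f"completion:{completion_token}"
    if await cache.get(key):
        return {"status": "already_completed"}
    processed.append(job_id)  # stands in for "Process completion..."
    await cache.set(key, "1", ex=300)
    return {"status": "completed"}


async def main():
    job_id = 7
    token = f"{job_id}-{uuid4()}"  # generated once, before any retry
    first = await complete_job(job_id, token)
    retry = await complete_job(job_id, token)  # the retry reuses the token
    return first, retry


first, retry = asyncio.run(main())
print(first, retry, processed)
```

Note that the check and the write are separate steps, so two requests that overlap in time could both pass the check. With a real Redis client, a single atomic `SET` with the `NX` option would be one way to close that window.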

Severity

Low - Unlikely to cause issues but not fully idempotent


Identified during reliability review by Margo
