fix(api): handle concurrent RTIF writes to prevent unique constraint violation #63581
YoannAbriel wants to merge 1 commit into apache:main
Hey @YoannAbriel, nice writeup on this one. The race condition between concurrent RTIF writes hitting the unique constraint is a real problem, and catching it at the endpoint level with a retry is a pragmatic approach. A few things I noticed while reading through:

The None guard after re-fetch could be tighter. After [...]

The [...]

On the tests: [...]

Longer term thought: have you considered using [...]

Good catch overall. RTIF writes being killed by a 409 that the task-sdk doesn't retry is a subtle failure mode.
Were you able to reproduce this issue? Could you share before-and-after screenshots, please?
When multiple workers try to write rendered task instance fields for the same task instance simultaneously, a race condition in session.merge() can cause an IntegrityError (unique constraint violation). This happens because both workers SELECT (find no record), then both try to INSERT.

Fix the ti_put_rtif endpoint to catch IntegrityError, roll back the failed transaction, re-fetch the task instance, and retry the write. The retry succeeds because merge() now finds the existing record and performs an UPDATE instead of INSERT.

Closes: apache#61705
Not yet. Will set up a local reproducer with concurrent API calls hitting the RTIF endpoint and share logs. Will update here.
Problem
When multiple workers try to write rendered task instance fields (RTIF) for the same task instance simultaneously, the API server returns a 409 Conflict error due to a unique constraint violation on rendered_task_instance_fields_pkey. This causes the task runner to fail with AirflowRuntimeError, marking the task as failed even though it completed successfully.

This is particularly common with CeleryExecutor when parallel tasks render fields at the same time, or when task retries overlap.
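The conflict can be reproduced in miniature. The sketch below is not Airflow code: it uses the standard-library sqlite3 module, a hypothetical rtif table standing in for rendered_task_instance_fields, and a single connection on which the two workers' statements are interleaved by hand. It only illustrates how two check-then-insert writers can both pass the check and then collide on the primary key.

```python
import sqlite3

# Hypothetical stand-in for the rendered_task_instance_fields table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rtif (ti_id TEXT PRIMARY KEY, fields TEXT)")


def row_exists(ti_id):
    """The SELECT half of a check-then-write."""
    row = conn.execute("SELECT 1 FROM rtif WHERE ti_id = ?", (ti_id,)).fetchone()
    return row is not None


# Both workers run their check before either one writes,
# so both conclude that no row exists yet.
worker_a_saw_row = row_exists("ti-1")
worker_b_saw_row = row_exists("ti-1")

# Worker A inserts first and succeeds.
conn.execute("INSERT INTO rtif (ti_id, fields) VALUES (?, ?)", ("ti-1", "from A"))

# Worker B still acts on its stale check and hits the primary key.
conflict = None
try:
    conn.execute("INSERT INTO rtif (ti_id, fields) VALUES (?, ?)", ("ti-1", "from B"))
except sqlite3.IntegrityError as exc:
    conflict = exc

print(type(conflict).__name__)
```

In a real deployment the two checks run in separate transactions on separate connections, but the ordering of statements that triggers the error is the same.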
Closes: #61705
Root Cause
The update_rtif method uses session.merge(), which performs a SELECT-then-INSERT/UPDATE pattern. When two concurrent requests both SELECT and find no existing record, they both attempt an INSERT, and the second one fails with an IntegrityError.

The global _UniqueConstraintErrorHandler catches this IntegrityError and converts it to a 409 Conflict HTTP response, which the task-sdk treats as a fatal error.
Fix
Handle IntegrityError in the ti_put_rtif endpoint with a retry strategy:
1. Catch the IntegrityError from the first update_rtif call
2. Roll back the failed transaction
3. Re-fetch the task instance
4. Retry update_rtif; this time session.merge() will find the existing record and perform an UPDATE instead of INSERT

This is safe because RTIF writes are idempotent: the last writer wins, which is the correct semantic for rendered template fields.
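The retry strategy can be sketched outside Airflow as follows. This is an illustration under stated assumptions, not the code in this PR: merge_row is a hand-written SELECT-then-INSERT/UPDATE that only roughly imitates session.merge(), put_rtif and the rtif table are hypothetical names, and the on_checked hook exists solely to let a competing writer slip in between the check and the write. The re-fetch step is not modelled, since the sketch has no ORM objects to reload.

```python
import sqlite3


def merge_row(conn, ti_id, fields, on_checked=None):
    """SELECT, then INSERT or UPDATE depending on what the SELECT saw."""
    found = conn.execute(
        "SELECT 1 FROM rtif WHERE ti_id = ?", (ti_id,)
    ).fetchone() is not None
    if on_checked is not None:
        on_checked()  # demo hook: a competing writer may commit here
    if found:
        conn.execute("UPDATE rtif SET fields = ? WHERE ti_id = ?", (fields, ti_id))
    else:
        conn.execute("INSERT INTO rtif (ti_id, fields) VALUES (?, ?)", (ti_id, fields))


def put_rtif(conn, ti_id, fields, on_checked=None):
    """Write once; on a unique-constraint conflict, roll back and retry once."""
    try:
        merge_row(conn, ti_id, fields, on_checked)
    except sqlite3.IntegrityError:
        conn.rollback()                 # discard the failed transaction
        merge_row(conn, ti_id, fields)  # the row exists now, so this UPDATEs
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rtif (ti_id TEXT PRIMARY KEY, fields TEXT)")


def competitor():
    """Simulates another worker committing the same key first."""
    conn.execute("INSERT INTO rtif (ti_id, fields) VALUES (?, ?)", ("ti-1", "other"))
    conn.commit()


put_rtif(conn, "ti-1", "mine", on_checked=competitor)
rows = conn.execute("SELECT ti_id, fields FROM rtif").fetchall()
```

After the call, the table holds a single row for ti-1 carrying the retried writer's value, which is the last-writer-wins outcome described above.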
Testing
- test_ti_put_rtif_concurrent_write: verifies that two sequential writes to the same RTIF succeed (the second updates rather than conflicts)
- test_ti_put_rtif_integrity_error_handled: simulates the race condition by mocking update_rtif to raise IntegrityError on the first call, verifying the retry succeeds

Verified with unit tests. No external service dependencies.
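The mocking technique in the second test can be illustrated with unittest.mock alone. The sketch below is not the PR's test: put_with_retry is a hypothetical stand-in for the endpoint's retry logic, and sqlite3.IntegrityError stands in for the database driver's exception. It shows how a side_effect list makes a mock raise on the first call and return normally on the second.

```python
import sqlite3
from unittest import mock


def put_with_retry(update, rollback):
    """Hypothetical endpoint core: call update(), retrying once on conflict."""
    try:
        update()
    except sqlite3.IntegrityError:
        rollback()
        update()


# A list side_effect is consumed one item per call: exception
# instances are raised, any other value is returned.
update = mock.Mock(side_effect=[sqlite3.IntegrityError("duplicate key"), None])
rollback = mock.Mock()

put_with_retry(update, rollback)

# The first call failed, the rollback ran once, and the retry went through.
print(update.call_count, rollback.call_count)
```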
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4, claude-opus-4-6) following the guidelines