Don't surface duplicate-key Dag-write race between Dag processors as import error#66788
Don't surface duplicate-key Dag-write race between Dag processors as import error#667881fanwang wants to merge 1 commit into
Conversation
When multiple Dag processors race to INSERT the same brand-new Dag row, the loser's transaction raises ``IntegrityError`` (1062 / UNIQUE constraint failed / unique violation). The current handler in ``_serialize_dag_capturing_errors`` records that as an import error and leaves the SQLAlchemy session in a state that causes ``PendingRollbackError`` on the next per-Dag write in the same parsing cycle. The user sees a healthy Dag file flagged as broken in the UI even though the peer processor's INSERT succeeded. Detect unique-constraint violations across SQLite, MySQL, and Postgres using the same message-prefix approach as ``_UniqueConstraintErrorHandler`` in the public API. On a hit, roll the session back, log at INFO, and return an empty import-error list — the row exists, so the write is a no-op and there's no value in retrying. Other IntegrityErrors (e.g. NOT-NULL violations from genuinely malformed Dags) still flow into the generic exception arm and surface as import errors. Closes: apache#66786 Signed-off-by: 1fanwang <1fannnw@gmail.com>
| # avoid PendingRollbackError on subsequent per-Dag work in this parsing cycle. | ||
| # The winning peer already produced the correct row, so this is not an import error | ||
| # and we don't retry. Non-unique IntegrityErrors (e.g. NOT-NULL violations from a | ||
| # genuinely malformed Dag) fall through to the generic Exception arm. |
There was a problem hiding this comment.
Can you share a log of where you saw this?
There was a problem hiding this comment.
The trace that lands in import_errors on each dialect, captured by reverting the new except IntegrityError arm and rerunning the regression test against main:
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: serialized_dag.dag_id
sqlalchemy.exc.IntegrityError: (MySQLdb.IntegrityError) (1062, "Duplicate entry 'my_dag' for key 'serialized_dag.PRIMARY'")
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "serialized_dag_pkey"
The IntegrityError raised by the losing processor's SerializedDagModel.write_dag is caught by the existing generic except Exception arm in _serialize_dag_capturing_errors, fed through traceback.format_exc(...), and recorded as the import-error value for the parsing cycle. The loser's now-invalid transaction also causes PendingRollbackError on the next per-Dag write in the same update_dag_parsing_results_in_db call. Added a before/after pytest snippet to the PR body that surfaces each dialect's exact message.
ephraimbuddy
left a comment
There was a problem hiding this comment.
Thanks for the detailed write-up. Two things this PR needs before it can land.
1. Multi-DFP over a shared bundle is not a supported topology.
production-deployment.rst is explicit: "the Dag processor is not horizontally scaled — even if you have more of them there is usually one Dag processor running at a time per specific folder." Today the CLI default (--bundle-name=None → parse every bundle, see manager.py:361) makes it easy to accidentally run two DFPs over the same bundle, but that is a gap to close, not a topology to silently support. If we want to make multi-DFP a first-class deployment, that needs a maintainer-level decision and an actual coordination layer (bundle lease at DFP startup, sharded file queue, advisory lock per (bundle_name, relative_fileloc), etc.) — not an exception handler in the write path.
2. If we do fix the symptom for now, this is the wrong layer.
For a brand-new Dag, the race is at the INSERT in DagVersion.write_dag / SerializedDagModel.write_dag. The cleaner fix is at the write site (INSERT ... ON CONFLICT DO NOTHING for the new-DAG path, or refetch-and-skip after catching at the model layer), so the loser never enters _serialize_dag_capturing_errors with a poisoned session in the first place. Catching IntegrityError here and string-matching three dialect-specific message prefixes (UNIQUE constraint failed / Duplicate entry / violates unique constraint) is fragile:
- The prefixes change across driver versions (e.g. psycopg vs psycopg2 wording, MySQL message text on different connectors).
_UniqueConstraintErrorHandlergets away with it at the API boundary; an internal write path is a different risk profile. - It silently swallows any future legitimate unique-constraint violation introduced in this path (new migration, new dedup constraint, real data corruption). That class of bug is much harder to debug than a transient import-error banner.
SQLAlchemyError.origmay beNonein some paths;str(None)won't match and we silently fall through, which is fine, but the helper should be defensive about that.
3. The tests don't actually exercise the race.
Both new tests mock SerializedDagModel.write_dag and raise a hand-constructed IntegrityError whose orig is a plain Exception(msg). That confirms the new except arm routes correctly but does not demonstrate the race occurs, that session.rollback() is sufficient to recover the surrounding update_dag_parsing_results_in_db cycle, or that subsequent per-Dag writes in the same call actually proceed. A db_test that spawns two threads/sessions racing on SerializedDagModel.write_dag for a brand-new dag would be much stronger — and would also show whether INSERT ... ON CONFLICT DO NOTHING at the model layer is the better fix.
Asks before re-review:
- Open a discussion / linked issue with maintainers about whether multi-DFP-per-bundle is something we want to support in 3.x, or whether we should instead refuse-to-start when another DFP holds the bundle. Either way, document the chosen position.
- Move the fix to the write site (
DagVersion.write_dag/SerializedDagModel.write_dag), so the loser is a no-op rather than an exception we have to classify. - Replace the mocked tests with an integration test that actually races two sessions on the same brand-new Dag.
Happy to help shape the coordination approach if option (a) goes the "refuse to start" way.
Drafted-by: Claude Code (Opus 4.7); reviewed by @ephraimbuddy before posting
| _UNIQUE_CONSTRAINT_ERROR_PREFIXES = ( | ||
| "UNIQUE constraint failed", # SQLite | ||
| "Duplicate entry", # MySQL | ||
| "violates unique constraint", # Postgres |
There was a problem hiding this comment.
String-matching exc.orig across three drivers is fragile and tightly couples this code to current driver wordings. Prefer fixing at the write site with INSERT ... ON CONFLICT DO NOTHING (or equivalent merge for the brand-new-dag path), so this except arm doesn't need to classify dialects at all. If we keep this approach, please guard against exc.orig is None.
Drafted-by: Claude Code (Opus 4.7); reviewed by @ephraimbuddy before posting
| return [] | ||
| except OperationalError: | ||
| raise | ||
| except IntegrityError as exc: |
There was a problem hiding this comment.
Even if we keep the dialect-prefix approach for now, log.info(...) here will fire on every collision and is going to be noisy on a healthy multi-DFP cluster — and silently spammy if a real unique-constraint regression ever lands. Suggest log.warning at minimum, and including the constraint name in the message if SQLAlchemy exposes it on exc.orig.
Drafted-by: Claude Code (Opus 4.7); reviewed by @ephraimbuddy before posting
| @pytest.mark.parametrize( | ||
| "orig_message", | ||
| [ | ||
| pytest.param( |
There was a problem hiding this comment.
This test asserts that the handler routes a mocked IntegrityError correctly; it does not assert that two concurrent DFPs actually race or that the surrounding update_dag_parsing_results_in_db cycle recovers. Please add a db_test that races two sessions on the same brand-new Dag (e.g. two threads calling SerializedDagModel.write_dag against the same dag_id), so we have a regression test for the actual scenario, not just for the symptom string.
Drafted-by: Claude Code (Opus 4.7); reviewed by @ephraimbuddy before posting
|
Thanks for the context @ephraimbuddy, taking a closer look, will update |
Fix for the parallel-Dag-processor race called out in #66786. We run multiple DFP instances on our LinkedIn DI cluster; occasionally two of them parse the same file at the same moment and race on the first INSERT — the loser gets a duplicate-key
IntegrityErrorand surfaces it to users as an import error (red banner in the UI). It's a benign first-write race, not a parse problem; this PR treats it that way.When more than one Dag processor writes the same brand-new Dag in the same scheduler heartbeat, only one of them wins the INSERT into
serialized_dag/dag/dag_version. The losers raise a unique-constraintIntegrityError(UNIQUE constraint failedon SQLite,Duplicate entry/ 1062 on MySQL,violates unique constrainton Postgres), which_serialize_dag_capturing_errorscurrently routes through the genericexcept Exceptionarm.That has two user-visible effects in the same parsing cycle:
import_errorseven though the winner produced the correct row, so the UI flags a healthy file as broken until the next cycle.update_dag_parsing_results_in_dbcall raisePendingRollbackError.The winner already produced the right row, so the loser's failure is a no-op rather than an import error. Add an
except IntegrityErrorarm before the generic catch that:_UniqueConstraintErrorHandleralready uses in the public API.Non-unique IntegrityErrors (e.g. NOT-NULL violations from a genuinely malformed Dag) still flow into the generic arm and surface as import errors, so users keep seeing real serialization problems.
Reproducer and before/after evidence
TestUpdateDagParsingResults::test_duplicate_key_from_concurrent_processor_is_not_import_errorexercises the three dialect message shapes (SQLite, MySQL, Postgres) as parametrized variants;test_non_unique_integrity_error_is_still_import_errorpins the NOT-NULL fall-through.Before fix (with the new
except IntegrityErrorarm removed), each dialect variant fails with the actual SQLAlchemyIntegrityErrorrecorded as an import-error entry — i.e. exactly the trace a scheduler would log in production:After fix, every variant passes: no
import_errorsentry is recorded for the unique-constraint race,session.rollback()runs so subsequent per-Dag work in the same parsing cycle does not hitPendingRollbackError, and the message-prefix detection covers each dialect. The NOT-NULL fall-through case is unchanged: a genuinely malformed Dag still surfaces as an import error.Closes: #66786