Fix stuck DSRs#8211
Conversation
…or disabled Async callback tasks in awaiting_processing status were permanently stuck when their ConnectionConfig was deleted — the watchdog couldn't see them, the status poller couldn't help, and reprocessing reused the same stuck tasks. Three changes: - GraphTask.__init__ tolerates missing connectors (sets self.connector = None), and skip_if_disabled() checks for None first, raising CollectionDisabled through the existing @Retry skip path - Watchdog iterator now includes awaiting_processing tasks, with a per-task connection check (_task_is_orphaned) to distinguish orphaned tasks from those legitimately waiting for callbacks - Status filter added to _has_async_tasks_awaiting_external_completion so completed/errored async tasks no longer permanently blind the watchdog Also fixes a previously-undetected bug: disabled connections with callback tasks had the same blind spot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
4b86b1c to
030a54f
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8211 +/- ##
==========================================
- Coverage 85.16% 85.09% -0.08%
==========================================
Files 670 670
Lines 43498 43592 +94
Branches 5093 5120 +27
==========================================
+ Hits 37046 37094 +48
- Misses 5346 5393 +47
+ Partials 1106 1105 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add upfront validation in build_erasure_networkx_digraph that checks all erase_after references point to collections that exist in the traversal nodes, end nodes, or artificial nodes. A dangling reference (e.g. from a deleted integration) previously created a phantom node via implicit networkx.add_edge, causing a KeyError at dataset_graph.nodes[node] in base_task_data that killed erasure task creation partway through and left the privacy request in an unrecoverable state. Now raises TraversalError with a clear message identifying the offending collection and the dangling reference before any tasks are persisted.
|
/code-review |
There was a problem hiding this comment.
Code Review — PR #8211: Fix stuck DSRs with orphaned async tasks
This PR addresses a real and impactful bug class (permanently stuck DSRs) with three distinct, well-scoped fixes:
_get_request_task_ids_in_progress— addsawaiting_processingto the in-progress statuses so the watchdog can see async tasks._has_async_tasks_awaiting_external_completion— adds.notin_(EXITED_EXECUTION_LOG_STATUSES)so completed/errored/skipped async tasks no longer permanently blind the watchdog._task_is_orphaned— new helper that detects deleted or disabledConnectionConfigfor an async task, allowing the watchdog to requeue orphaned tasks instead of waiting forever.GraphTask.__init__— gracefully handlesConnectorNotFoundException(hard-deleted config) by settingself.connector = Noneand lettingskip_if_disabled()raiseCollectionDisabledthrough the existing@retrypath.build_erasure_networkx_digraph— validateserase_afterreferences upfront before building the graph.
The logic is clear and the test coverage is thorough. A few things worth looking at:
Issues
-
generate_dry_run_queryis not guarded againstNoneconnector (see inline comment atgraph_task.py:290). The production task-execution path is safe because@retrycallsskip_if_disabled()first, butgenerate_dry_run_queryhas no such guard and would raiseAttributeErrorif called on a task with a deleted connection. -
erase_aftervalidation is a breaking change for stale configs (seecreate_request_tasks.py:195). Previously, danglingerase_afterreferences would silently corrupt the task graph with aKeyErrormid-execution. Now, they fail fast with a clearTraversalError. This is strictly better from a data-integrity standpoint, but customers with active datasets referencing deleted integrations will start seeing new errors on erasure. Worth a note in the changelog or docs. -
_task_is_orphanedreturnsFalsewhen connection key is missing (seerequest_service.py:563). The conservative default is appropriate to avoid false positives, but a comment explaining the intent would help, and the log level may be too low if this represents unexpected state.
Minor
- Test docstring wording is slightly misleading (see
test_requeue_interrupted_tasks.py:618) — describes the pre-fix bug rather than what the test verifies.
Overall this is solid work. The watchdog interaction changes are well-thought-out, and the two-branch logic (orphaned check for awaiting_processing, requeue for everything else) is a clean solution. The main thing to address before merge is the generate_dry_run_query null-safety gap.
🔬 Codegraph: connected (50527 nodes)
💡 Write /code-review in a comment to re-run this review.
…a/fides into ENG-3834/skip-orphaned-async-tasks
eastandwestwind
left a comment
There was a problem hiding this comment.
Small nits / Qs but otherwise good to go!
| # Orphaned task tests (ENG-3834) | ||
| # | ||
| # These tests document the behavior of the watchdog when async tasks have | ||
| # deleted or disabled connections. Tests marked xfail demonstrate the | ||
| # current bug — they will pass once the fix is applied. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
Nice Test-driven approach, thanks for adding these!
| f"Collection {node_name} has an erase_after reference to " | ||
| f"{ref} which does not exist in the dataset graph. This may " | ||
| f"indicate a deleted integration that is still referenced." |
There was a problem hiding this comment.
Could we make this msg more actionable, e.g. "Erasure cannot proceed: collection 'active_api:users' has an 'Erase After' dependency on 'deleted_api:users', which no longer exists in the system. Update the 'Erase After' setting on this collection in the dataset configuration to remove the stale reference."
There was a problem hiding this comment.
Done — updated the error message to be more specific and actionable, naming the collection and stale reference with guidance on how to fix it.
| f"Request task {request_task_id} has no dataset_connection_key " | ||
| f"in traversal_details — possible data integrity issue" | ||
| ) |
There was a problem hiding this comment.
Does this mean the task is actually stuck? Is it worth surfacing an Execution error so it it's evident in Admin-UI?
Make the TraversalError message more actionable by naming the specific collection and stale reference, with guidance on how to fix it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add None guard in generate_dry_run_query to prevent AttributeError when ConnectionConfig is deleted during a dry run - Add comment documenting Redis-only manual-webhook input fragility when requeuing requires_input/pending_external PRs - Add TestDeletedConnectionConfig tests covering connector=None init, saas_version fallback, dry_run_query guard, and skip_if_disabled - Add test for missing connection_key not being treated as orphaned Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Ticket ENG-3834
Description Of Changes
Two related fixes for stuck DSRs caused by deleted or misconfigured integrations.
1. Orphaned async callback tasks
Async callback tasks in
awaiting_processingstatus were permanently stuck when their ConnectionConfig was deleted or disabled. Three safety mechanisms all failed simultaneously:in_processing/pendingtasks —awaiting_processingtasks were invisible_has_async_tasks_awaiting_external_completionhad no status filter, so any PR that ever had an async task was permanently invisible to the watchdog (even after the async task completed)GraphTask.__init__crashed on deleted connections instead of skipping gracefullyThe fix uses the existing skip machinery (
CollectionDisabled→@retry→log_skipped) — no new endpoints, no migrations, no UI changes.2. Dangling
erase_afterreferences crash erasure task creationWhen a collection's
erase_afterreferences a collection belonging to a deleted integration, the unvalidated reference creates a phantom node in the erasure networkx graph via implicitnetworkx.add_edgenode creation. This phantom node then causes aKeyErroratdataset_graph.nodes[node]inbase_task_data, which kills the erasure task creation loop partway through. The result: partial erasure tasks in the DB, no TERMINATE task, and a privacy request that completes without actually performing any erasure — a silent compliance failure.The fix adds upfront validation in
build_erasure_networkx_digraphthat checks allerase_afterreferences point to collections that exist in the traversal before building edges. RaisesTraversalErrorwith a clear message before any tasks are persisted.Code Changes
graph_task.py:GraphTask.__init__now tolerates missing connectors (self.connector = None);skip_if_disabled()checks for None connector first and raisesCollectionDisabled, which the existing@retrydecorator handles identically to a disabled connectionrequest_service.py: Watchdog iterator now includesawaiting_processingtasks; added status filter to_has_async_tasks_awaiting_external_completionso exited async tasks don't blind the watchdog; new_task_is_orphaned()function checks if a task's connection is deleted or disabled; watchdog per-task loop uses this to distinguish orphaned tasks from legitimately waiting onescreate_request_tasks.py:build_erasure_networkx_digraphnow validates allerase_afterreferences against the set of known nodes (traversal nodes + end nodes + artificial nodes) before building the graph. RaisesTraversalErrorwith a message identifying the offending collection and dangling referencetest_requeue_interrupted_tasks.py: 6 new tests covering deleted connection, disabled connection, valid connection (not requeued), and 3 exited-status-blinding casestest_erase_after_dangling_ref.py: 3 new tests — graph builder rejects dangling refs, task creation raises before persisting partial state, validerase_aftercontinues to workSteps to Confirm
Prerequisites
docker compose up)200 {}for all requests (used as the SaaS endpoint target):docker exec fidesplus-slim curl -s http://host.docker.internal:9999/testTOKEN=<your_token>Register connector templates (one-time setup)
Two custom SaaS connector templates are needed: one with async callback, one without.
Async callback template (
async_callback_test):Simple (non-async) template (
simple_test):Scenario 1: Orphaned async callback tasks
Setup: Create a SaaS connection using the
async_callback_testtemplate, pointing at the mock server.Steps:
localhost:3001) and approve it in the Admin UI (localhost:3000).async_callback_test:usershowsawaiting_processing:Expected result (API server logs):
Expected result (worker logs):
Scenario 2: Dangling
erase_afterreferencesSetup: Create two SaaS connections — A (async) and B (simple) — where B's erasure depends on A. Then delete A.
Steps:
Expected result (worker logs):
Key details:
TraversalError, not a crypticKeyErrorerase_afteris set on the SaaS config endpoint, not the datasetfides_metaaccess: "write"to participate in erasureidentityinfides_meta— identities go in SaaS configparam_valuesCleanup
Pre-Merge Checklist
CHANGELOG.mdupdated