fix(#265): stash Data Dictionary before delete, restore on rollback#307
Conversation
Close #265. Before this change, the analysis stage captured the existing datastore resource's per-field info dicts (the "Data Dictionary": labels, descriptions, type_overrides) into ``ProcessingContext.existing_info``, then deleted the existing datastore resource so the database stage could re-create + COPY into it. If any later stage failed, the in-memory snapshot died with the worker and the operator's annotations were gone forever — a hole explicitly called out in ``_rollback_database``'s own docstring: > (The earlier ``existing_info`` branch claimed to "preserve" the > original, but the delete had already destroyed it.) Now the analysis stage writes the dictionary to a small JSON stash file on disk before the delete, and ``_rollback_database`` reads it back on failure to restore the dictionary by re-creating the datastore resource with the stashed fields and zero rows. The data is unrecoverable (it never landed), but the operator's annotations are preserved across the failed run. The flow's ``finally`` block clears the stash on any successful exit (including ``_StageAbort`` complete-with-skip); error paths leave the stash for the rollback hook and a possible subsequent retry. * NEW ``ckanext/datapusher_plus/dictionary_stash.py``: tiny on-disk persistence module (``save`` / ``load`` / ``clear`` / ``stash_path``). Atomic writes via os.replace, treats corrupt stashes as absent so the rollback hook can't crash on its own bad file, rejects path-traversal in resource_id. * NEW ``tests/test_dictionary_stash.py``: 15 unit tests covering round-trip, missing-stash, idempotent clear, corrupt JSON, atomic-rename, failed-replace, traversal guard, lazy mkdir, default tempdir fallback. * ``analysis.py``: stash existing_info before delete (best-effort — a stash failure logs a warning but does not block ingestion). * ``prefect_flow.py``: ``_rollback_database`` restores from stash after dropping the half-written table; ``datapusher_plus_flow``'s ``finally`` clears the stash on success. * ``config_declaration.yaml``: new ``ckanext.datapusher_plus.dictionary_stash_dir`` knob (defaults to ``<tempdir>/dpp_dict_stash``; settable for deployments where /tmp is wiped on container restart). Test count: **186/186 Python unit tests pass** on dpp-test (was 171; +15 from the new stash module). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four findings, all valid: **MEDIUM** — Restore path only covered in-transaction failures. ``_rollback_database`` is registered ``@database_task.on_rollback`` so it only fires for the four tasks inside ``with transaction():``. The delete happens in ``analyze_task`` which runs *before* the transaction block, so failures in ``analyze_task`` (after the delete), in ``ai_suggestions_task``, or in ``_maybe_suspend_for_pii_review`` would leave the stash on disk but no hook would restore it. On retry, ``analyze_task`` saw no existing datastore (it had been deleted), so ``existing_info`` stayed empty, the stash was never re-applied, and the success-path ``finally`` cleared the orphaned stash — losing the dictionary anyway. Fix: at the top of ``AnalysisStage`` where ``existing_info`` is captured, if there's no existing datastore but there IS a stash for this resource, treat that as a retry-after-failure and load the stash as ``existing_info``. The downstream merge then applies it onto the new headers, the new datastore resource gets the dictionary, and the success-finally correctly drops a stash that has now been propagated. **LOW** — ``save`` leaked ``.tmp`` files on ``os.replace`` failure. Wrapped the open/dump/replace in try/except that unlinks the tempfile on any exception, then re-raises. The original file is still untouched. **LOW** — ``_stash_dir`` mkdir'd on every call. ``load`` and ``clear`` do not need the directory to exist; in fact, the flow's success ``finally`` calls ``clear`` unconditionally, so the old behavior created an empty stash directory for every job that never stashed anything (and a misconfigured stash dir would crash ``clear``). Added ``_stash_dir(create=False)`` (default) and only ``save`` passes ``create=True``. **LOW** — Test used ``tk.config.pop`` instead of ``monkeypatch.delitem``, so the pop was not reverted at teardown and silently dropped any pre-existing value for subsequent tests. Replaced with ``monkeypatch.delitem(..., raising=False)``. Tests: 187/187 unit tests pass on dpp-test (was 186; +1 from the new ``test_load_does_not_create_stash_dir``). Tests added/extended: - ``test_save_with_failed_atomic_replace_does_not_leave_corruption`` now also asserts no ``.tmp`` is leaked. - ``test_load_does_not_create_stash_dir`` proves load/clear don't bootstrap the directory; only save does. - ``test_default_stash_dir_falls_back_to_tempfile`` switched to ``monkeypatch.delitem`` for clean teardown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three LOW findings, all valid: **LOW** — Retry-restore branch had no test (testing gap on the central fix for the prior MEDIUM). Added 3 focused tests against ``AnalysisStage._parse_stats`` that exercise the seam directly: * ``test_retry_restore_loads_stash_when_no_live_datastore`` proves the stash IS loaded into ``existing_info`` and propagated onto the rebuilt headers when the datastore is gone but the stash is present. * ``test_retry_restore_is_noop_when_no_stash`` proves the branch doesn't synthesize a non-None empty dict on first-run-no-dictionary. * ``test_retry_restore_does_not_fire_when_live_datastore_exists`` proves the live datastore's info dicts win over a stale stash. **LOW** — A stash file from a long-ago failed run could be silently applied to an unrelated upload sharing the same resource_id. Surface the stash mtime in the "restoring" info log so operators can spot stale-restore vs. genuine retry-after-failure from the logs. Format: ``stash age N s``. (A future change could enforce a TTL; for now, visibility is enough — adding a hard expiry would alter the restore policy, which should be a separate decision.) **LOW** — Comment claimed the retry branch "re-save[s] (overwriting itself)" the stash, but no re-save was actually performed. Dropped the misleading clause and clarified: the existing stash content IS the correct snapshot (the original dictionary captured before the original delete), so reusing it as-is is intentional. If this retry also fails, the same stash is reused on the next attempt. Tests: 190/190 unit tests pass on dpp-test (was 187; +3 from the new retry-restore coverage). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR closes #265 by persisting a resource's Data Dictionary (per-field info dicts: label / description / type_override) to a small on-disk JSON file before the analysis stage deletes the existing datastore resource. On a rollback (any downstream stage failure inside the transactional block), the database rollback hook reads the stash back and re-creates the datastore resource with zero rows so the operator's annotations survive. The stash is cleared on successful flow completion and left in place on failure so a subsequent retry — or operator inspection — can use it.
Changes:
- New
dictionary_stashmodule (atomicos.replacewrites, corrupt-stash treated as absent, path-traversal guard, lazy stash dir), plus a new CKAN config knobckanext.datapusher_plus.dictionary_stash_dir. - Analysis stage stashes
existing_infobefore delete (best-effort) and, on retry-after-failure (no live datastore but stash present), loads the stash back intoexisting_info. _rollback_databaserestores the dictionary from the stash; the flow'sfinallyclears the stash on any successful exit.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| ckanext/datapusher_plus/dictionary_stash.py | New on-disk persistence module (save/load/clear/stash_path) with atomic writes and traversal guard. |
| ckanext/datapusher_plus/jobs/stages/analysis.py | Stashes dictionary before delete and restores from stash on retry when no live datastore exists. |
| ckanext/datapusher_plus/jobs/prefect_flow.py | Rollback hook restores dictionary from stash; success finally clears it. |
| ckanext/datapusher_plus/config_declaration.yaml | Declares the new dictionary_stash_dir config option. |
| tests/test_dictionary_stash.py | 15 unit tests covering round-trip, corruption, atomicity, traversal, dir bootstrap, and analysis-stage retry restore. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Copilot review on PR #307 flagged that ``_rollback_database``'s restore path built field dicts with only ``id`` and ``info`` keys, omitting ``type``. Every other call site to ``send_resource_to_datastore`` in the codebase (``_build_headers_dicts`` in ``analysis.py:435``, ``metadata.py:96``) passes headers with ``id``, ``type``, AND ``info``. CKAN's ``datastore_create`` falls back to ``text`` for fields without a declared type, so a stashed dictionary whose entries carried a non-text ``type_override`` (e.g. ``numeric``, ``timestamp``) would be silently restored as a text column — inconsistent with the operator's original intent. The rollback now mirrors the analysis stage's merge: for each stashed entry, derive ``type`` from ``info["type_override"]`` when it's in ``conf.TYPE_MAPPING.values()`` (text / numeric / timestamp / date), falling back to ``text`` for missing or unknown overrides. Tests: 192/192 unit tests pass on dpp-test (was 190; +2 new regression tests pinning text/numeric/timestamp/no-override + unknown type fallback behavior). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings (1 MEDIUM, 2 LOW) on the memory-update commit; all
addressed.
- MEDIUM @ stale `tests/test_date_type_mapping.py` reference: this
commit was *specifically* supposed to fix that reference but missed
one instance (line 316). Fixed — and the same class of bug existed
for three other filenames the reviewer didn't flag explicitly but
which had the same root cause (me inventing filenames without
verifying against `git log --diff-filter=A`). Corrected:
test_file_hash_algorithms.py -> test_file_hash_algorithm.py
test_metadata_stage_file_hash.py -> test_metadata_hash_persistence.py
test_download_result_rehydrate.py -> test_rehydrate_resource_identity.py
test_date_type_mapping.py -> test_date_without_timestamp.py
Verified all four replacements resolve as real files in the tree.
- LOW @ inconsistent test-count math: the prior "+26 tests" claim
enumerated files summing to 51, not 26. Rewrote the "Known result"
section to spell out per-file gross counts (collected via
`pytest --collect-only -q`: 21/4/10/3/4/5/4 = 51), then state the
honest net delta (+26) and acknowledge the 25-test gap is from
pre-existing tests removed/consolidated during the #307–#314
refactors. The math is now self-auditable and the framing no longer
implies the listed adds sum to the delta.
- LOW @ forward-looking "Recognized in qsv ≥ 20.1" claim: tightened
to "Empirically verified to be recognized as DateTime on qsv 20.1.0
(macOS host check during #173 investigation)" and cross-referenced
the test's skip guard for qsv ≥ 20.1, so the claim is grounded in
a real observation and the next maintainer has a clean handoff
path when the production qsv pin moves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #265.
Before this change, the analysis stage captured the existing datastore resource's per-field
infodicts (the "Data Dictionary" — labels, descriptions, type_overrides) intoProcessingContext.existing_info, then deleted the existing datastore resource so the database stage could re-create + COPY into it. If any later stage failed, the in-memory snapshot died with the worker and the operator's annotations were gone forever — a hole called out in_rollback_database's own docstring:Now the analysis stage writes the dictionary to a small JSON stash file before the delete, and
_rollback_databasereads it back on failure to restore the dictionary by re-creating the datastore resource with the stashed fields and zero rows. The data is unrecoverable (it never landed), but the operator's annotations are preserved across the failed run.Changes
ckanext/datapusher_plus/dictionary_stash.py— tiny on-disk persistence module (save/load/clear/stash_path). Atomic writes viaos.replace, corrupt stashes treated as absent so the rollback hook can't crash on its own bad file, path-traversal guard onresource_id.tests/test_dictionary_stash.py— 15 unit tests covering round-trip, missing-stash, idempotent clear, corrupt JSON, atomic-rename, failed-replace, traversal guard, lazy mkdir, default tempdir fallback.jobs/stages/analysis.py— stashexisting_infobefore delete (best-effort: a stash failure logs a warning but does not block ingestion).jobs/prefect_flow.py—_rollback_databaserestores from stash after dropping the half-written table;datapusher_plus_flow'sfinallyclears the stash on any successful exit (including_StageAbortcomplete-with-skip). Error paths leave the stash for the rollback hook and a possible subsequent retry.config_declaration.yaml— newckanext.datapusher_plus.dictionary_stash_dirknob (default:<tempdir>/dpp_dict_stash; settable for deployments where/tmpis wiped on container restart).Test plan
dpp-test(was 171; +15 from the new stash module).🤖 Generated with Claude Code