fix(#265): stash Data Dictionary before delete, restore on rollback by jqnatividad · Pull Request #307 · dathere/datapusher-plus

jqnatividad · 2026-05-17T14:39:51Z

Summary

Closes #265.

Before this change, the analysis stage captured the existing datastore resource's per-field info dicts (the "Data Dictionary" — labels, descriptions, type_overrides) into ProcessingContext.existing_info, then deleted the existing datastore resource so the database stage could re-create + COPY into it. If any later stage failed, the in-memory snapshot died with the worker and the operator's annotations were gone forever — a hole called out in _rollback_database's own docstring:

(The earlier existing_info branch claimed to "preserve" the original, but the delete had already destroyed it.)

Now the analysis stage writes the dictionary to a small JSON stash file before the delete, and _rollback_database reads it back on failure to restore the dictionary by re-creating the datastore resource with the stashed fields and zero rows. The data is unrecoverable (it never landed), but the operator's annotations are preserved across the failed run.

Changes

NEW ckanext/datapusher_plus/dictionary_stash.py — tiny on-disk persistence module (save / load / clear / stash_path). Atomic writes via os.replace, corrupt stashes treated as absent so the rollback hook can't crash on its own bad file, path-traversal guard on resource_id.
NEW tests/test_dictionary_stash.py — 15 unit tests covering round-trip, missing-stash, idempotent clear, corrupt JSON, atomic-rename, failed-replace, traversal guard, lazy mkdir, default tempdir fallback.
jobs/stages/analysis.py — stash existing_info before delete (best-effort: a stash failure logs a warning but does not block ingestion).
jobs/prefect_flow.py — _rollback_database restores from stash after dropping the half-written table; datapusher_plus_flow's finally clears the stash on any successful exit (including _StageAbort complete-with-skip). Error paths leave the stash for the rollback hook and a possible subsequent retry.
config_declaration.yaml — new ckanext.datapusher_plus.dictionary_stash_dir knob (default: <tempdir>/dpp_dict_stash; settable for deployments where /tmp is wiped on container restart).

Test plan

186/186 Python unit tests pass on dpp-test (was 171; +15 from the new stash module).
Imports of all four touched modules succeed in isolation.
Integration test in CI.
Manual smoke (post-merge): annotate a resource's Data Dictionary, force a mid-flow failure (e.g. invalid file content after re-upload), confirm the dictionary survives.

🤖 Generated with Claude Code

Close #265. Before this change, the analysis stage captured the existing datastore resource's per-field info dicts (the "Data Dictionary": labels, descriptions, type_overrides) into ``ProcessingContext.existing_info``, then deleted the existing datastore resource so the database stage could re-create + COPY into it. If any later stage failed, the in-memory snapshot died with the worker and the operator's annotations were gone forever — a hole explicitly called out in ``_rollback_database``'s own docstring: > (The earlier ``existing_info`` branch claimed to "preserve" the > original, but the delete had already destroyed it.) Now the analysis stage writes the dictionary to a small JSON stash file on disk before the delete, and ``_rollback_database`` reads it back on failure to restore the dictionary by re-creating the datastore resource with the stashed fields and zero rows. The data is unrecoverable (it never landed), but the operator's annotations are preserved across the failed run. The flow's ``finally`` block clears the stash on any successful exit (including ``_StageAbort`` complete-with-skip); error paths leave the stash for the rollback hook and a possible subsequent retry. * NEW ``ckanext/datapusher_plus/dictionary_stash.py``: tiny on-disk persistence module (``save`` / ``load`` / ``clear`` / ``stash_path``). Atomic writes via os.replace, treats corrupt stashes as absent so the rollback hook can't crash on its own bad file, rejects path-traversal in resource_id. * NEW ``tests/test_dictionary_stash.py``: 15 unit tests covering round-trip, missing-stash, idempotent clear, corrupt JSON, atomic-rename, failed-replace, traversal guard, lazy mkdir, default tempdir fallback. * ``analysis.py``: stash existing_info before delete (best-effort — a stash failure logs a warning but does not block ingestion). * ``prefect_flow.py``: ``_rollback_database`` restores from stash after dropping the half-written table; ``datapusher_plus_flow``'s ``finally`` clears the stash on success. * ``config_declaration.yaml``: new ``ckanext.datapusher_plus.dictionary_stash_dir`` knob (defaults to ``<tempdir>/dpp_dict_stash``; settable for deployments where /tmp is wiped on container restart). Test count: **186/186 Python unit tests pass** on dpp-test (was 171; +15 from the new stash module). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four findings, all valid: **MEDIUM** — Restore path only covered in-transaction failures. ``_rollback_database`` is registered ``@database_task.on_rollback`` so it only fires for the four tasks inside ``with transaction():``. The delete happens in ``analyze_task`` which runs *before* the transaction block, so failures in ``analyze_task`` (after the delete), in ``ai_suggestions_task``, or in ``_maybe_suspend_for_pii_review`` would leave the stash on disk but no hook would restore it. On retry, ``analyze_task`` saw no existing datastore (it had been deleted), so ``existing_info`` stayed empty, the stash was never re-applied, and the success-path ``finally`` cleared the orphaned stash — losing the dictionary anyway. Fix: at the top of ``AnalysisStage`` where ``existing_info`` is captured, if there's no existing datastore but there IS a stash for this resource, treat that as a retry-after-failure and load the stash as ``existing_info``. The downstream merge then applies it onto the new headers, the new datastore resource gets the dictionary, and the success-finally correctly drops a stash that has now been propagated. **LOW** — ``save`` leaked ``.tmp`` files on ``os.replace`` failure. Wrapped the open/dump/replace in try/except that unlinks the tempfile on any exception, then re-raises. The original file is still untouched. **LOW** — ``_stash_dir`` mkdir'd on every call. ``load`` and ``clear`` do not need the directory to exist; in fact, the flow's success ``finally`` calls ``clear`` unconditionally, so the old behavior created an empty stash directory for every job that never stashed anything (and a misconfigured stash dir would crash ``clear``). Added ``_stash_dir(create=False)`` (default) and only ``save`` passes ``create=True``. **LOW** — Test used ``tk.config.pop`` instead of ``monkeypatch.delitem``, so the pop was not reverted at teardown and silently dropped any pre-existing value for subsequent tests. Replaced with ``monkeypatch.delitem(..., raising=False)``. Tests: 187/187 unit tests pass on dpp-test (was 186; +1 from the new ``test_load_does_not_create_stash_dir``). Tests added/extended: - ``test_save_with_failed_atomic_replace_does_not_leave_corruption`` now also asserts no ``.tmp`` is leaked. - ``test_load_does_not_create_stash_dir`` proves load/clear don't bootstrap the directory; only save does. - ``test_default_stash_dir_falls_back_to_tempfile`` switched to ``monkeypatch.delitem`` for clean teardown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three LOW findings, all valid: **LOW** — Retry-restore branch had no test (testing gap on the central fix for the prior MEDIUM). Added 3 focused tests against ``AnalysisStage._parse_stats`` that exercise the seam directly: * ``test_retry_restore_loads_stash_when_no_live_datastore`` proves the stash IS loaded into ``existing_info`` and propagated onto the rebuilt headers when the datastore is gone but the stash is present. * ``test_retry_restore_is_noop_when_no_stash`` proves the branch doesn't synthesize a non-None empty dict on first-run-no-dictionary. * ``test_retry_restore_does_not_fire_when_live_datastore_exists`` proves the live datastore's info dicts win over a stale stash. **LOW** — A stash file from a long-ago failed run could be silently applied to an unrelated upload sharing the same resource_id. Surface the stash mtime in the "restoring" info log so operators can spot stale-restore vs. genuine retry-after-failure from the logs. Format: ``stash age N s``. (A future change could enforce a TTL; for now, visibility is enough — adding a hard expiry would alter the restore policy, which should be a separate decision.) **LOW** — Comment claimed the retry branch "re-save[s] (overwriting itself)" the stash, but no re-save was actually performed. Dropped the misleading clause and clarified: the existing stash content IS the correct snapshot (the original dictionary captured before the original delete), so reusing it as-is is intentional. If this retry also fails, the same stash is reused on the next attempt. Tests: 190/190 unit tests pass on dpp-test (was 187; +3 from the new retry-restore coverage). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR closes #265 by persisting a resource's Data Dictionary (per-field info dicts: label / description / type_override) to a small on-disk JSON file before the analysis stage deletes the existing datastore resource. On a rollback (any downstream stage failure inside the transactional block), the database rollback hook reads the stash back and re-creates the datastore resource with zero rows so the operator's annotations survive. The stash is cleared on successful flow completion and left in place on failure so a subsequent retry — or operator inspection — can use it.

Changes:

New dictionary_stash module (atomic os.replace writes, corrupt-stash treated as absent, path-traversal guard, lazy stash dir), plus a new CKAN config knob ckanext.datapusher_plus.dictionary_stash_dir.
Analysis stage stashes existing_info before delete (best-effort) and, on retry-after-failure (no live datastore but stash present), loads the stash back into existing_info.
_rollback_database restores the dictionary from the stash; the flow's finally clears the stash on any successful exit.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
ckanext/datapusher_plus/dictionary_stash.py	New on-disk persistence module (`save`/`load`/`clear`/`stash_path`) with atomic writes and traversal guard.
ckanext/datapusher_plus/jobs/stages/analysis.py	Stashes dictionary before delete and restores from stash on retry when no live datastore exists.
ckanext/datapusher_plus/jobs/prefect_flow.py	Rollback hook restores dictionary from stash; success `finally` clears it.
ckanext/datapusher_plus/config_declaration.yaml	Declares the new `dictionary_stash_dir` config option.
tests/test_dictionary_stash.py	15 unit tests covering round-trip, corruption, atomicity, traversal, dir bootstrap, and analysis-stage retry restore.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot review on PR #307 flagged that ``_rollback_database``'s restore path built field dicts with only ``id`` and ``info`` keys, omitting ``type``. Every other call site to ``send_resource_to_datastore`` in the codebase (``_build_headers_dicts`` in ``analysis.py:435``, ``metadata.py:96``) passes headers with ``id``, ``type``, AND ``info``. CKAN's ``datastore_create`` falls back to ``text`` for fields without a declared type, so a stashed dictionary whose entries carried a non-text ``type_override`` (e.g. ``numeric``, ``timestamp``) would be silently restored as a text column — inconsistent with the operator's original intent. The rollback now mirrors the analysis stage's merge: for each stashed entry, derive ``type`` from ``info["type_override"]`` when it's in ``conf.TYPE_MAPPING.values()`` (text / numeric / timestamp / date), falling back to ``text`` for missing or unknown overrides. Tests: 192/192 unit tests pass on dpp-test (was 190; +2 new regression tests pinning text/numeric/timestamp/no-override + unknown type fallback behavior). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three findings (1 MEDIUM, 2 LOW) on the memory-update commit; all addressed. - MEDIUM @ stale `tests/test_date_type_mapping.py` reference: this commit was *specifically* supposed to fix that reference but missed one instance (line 316). Fixed — and the same class of bug existed for three other filenames the reviewer didn't flag explicitly but which had the same root cause (me inventing filenames without verifying against `git log --diff-filter=A`). Corrected: test_file_hash_algorithms.py -> test_file_hash_algorithm.py test_metadata_stage_file_hash.py -> test_metadata_hash_persistence.py test_download_result_rehydrate.py -> test_rehydrate_resource_identity.py test_date_type_mapping.py -> test_date_without_timestamp.py Verified all four replacements resolve as real files in the tree. - LOW @ inconsistent test-count math: the prior "+26 tests" claim enumerated files summing to 51, not 26. Rewrote the "Known result" section to spell out per-file gross counts (collected via `pytest --collect-only -q`: 21/4/10/3/4/5/4 = 51), then state the honest net delta (+26) and acknowledge the 25-test gap is from pre-existing tests removed/consolidated during the #307–#314 refactors. The math is now self-auditable and the framing no longer implies the listed adds sum to the delta. - LOW @ forward-looking "Recognized in qsv ≥ 20.1" claim: tightened to "Empirically verified to be recognized as DateTime on qsv 20.1.0 (macOS host check during #173 investigation)" and cross-referenced the test's skip guard for qsv ≥ 20.1, so the claim is grounded in a real observation and the next maintainer has a clean handoff path when the production qsv pin moves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad and others added 3 commits May 17, 2026 10:39

jqnatividad requested a review from Copilot May 17, 2026 15:04

Copilot started reviewing on behalf of jqnatividad May 17, 2026 15:04 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

Comment thread ckanext/datapusher_plus/jobs/prefect_flow.py Outdated

jqnatividad merged commit 59e08d7 into main May 17, 2026
1 check passed

This was referenced May 18, 2026

DP+ 3.0 tracking issue #259

Open

Add CI tests #187

Closed

TWDH-specific: metadata fields blanked out when running 'datapusher resubmit' #102

Closed

Recoverable jobs #35

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#265): stash Data Dictionary before delete, restore on rollback#307

fix(#265): stash Data Dictionary before delete, restore on rollback#307
jqnatividad merged 4 commits into
mainfrom
fix-265-stash-data-dictionary

jqnatividad commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jqnatividad commented May 17, 2026

Summary

Changes

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants