Skip to content

fix(#265): stash Data Dictionary before delete, restore on rollback#307

Merged
jqnatividad merged 4 commits into
mainfrom
fix-265-stash-data-dictionary
May 17, 2026
Merged

fix(#265): stash Data Dictionary before delete, restore on rollback#307
jqnatividad merged 4 commits into
mainfrom
fix-265-stash-data-dictionary

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Closes #265.

Before this change, the analysis stage captured the existing datastore resource's per-field info dicts (the "Data Dictionary" — labels, descriptions, type_overrides) into ProcessingContext.existing_info, then deleted the existing datastore resource so the database stage could re-create + COPY into it. If any later stage failed, the in-memory snapshot died with the worker and the operator's annotations were gone forever — a hole called out in _rollback_database's own docstring:

(The earlier existing_info branch claimed to "preserve" the original, but the delete had already destroyed it.)

Now the analysis stage writes the dictionary to a small JSON stash file before the delete, and _rollback_database reads it back on failure to restore the dictionary by re-creating the datastore resource with the stashed fields and zero rows. The data is unrecoverable (it never landed), but the operator's annotations are preserved across the failed run.

Changes

  • NEW ckanext/datapusher_plus/dictionary_stash.py — tiny on-disk persistence module (save / load / clear / stash_path). Atomic writes via os.replace, corrupt stashes treated as absent so the rollback hook can't crash on its own bad file, path-traversal guard on resource_id.
  • NEW tests/test_dictionary_stash.py — 15 unit tests covering round-trip, missing-stash, idempotent clear, corrupt JSON, atomic-rename, failed-replace, traversal guard, lazy mkdir, default tempdir fallback.
  • jobs/stages/analysis.py — stash existing_info before delete (best-effort: a stash failure logs a warning but does not block ingestion).
  • jobs/prefect_flow.py_rollback_database restores from stash after dropping the half-written table; datapusher_plus_flow's finally clears the stash on any successful exit (including _StageAbort complete-with-skip). Error paths leave the stash for the rollback hook and a possible subsequent retry.
  • config_declaration.yaml — new ckanext.datapusher_plus.dictionary_stash_dir knob (default: <tempdir>/dpp_dict_stash; settable for deployments where /tmp is wiped on container restart).

Test plan

  • 186/186 Python unit tests pass on dpp-test (was 171; +15 from the new stash module).
  • Imports of all four touched modules succeed in isolation.
  • Integration test in CI.
  • Manual smoke (post-merge): annotate a resource's Data Dictionary, force a mid-flow failure (e.g. invalid file content after re-upload), confirm the dictionary survives.

🤖 Generated with Claude Code

jqnatividad and others added 3 commits May 17, 2026 10:39
Close #265. Before this change, the analysis stage captured the
existing datastore resource's per-field info dicts (the "Data
Dictionary": labels, descriptions, type_overrides) into
``ProcessingContext.existing_info``, then deleted the existing
datastore resource so the database stage could re-create + COPY into
it. If any later stage failed, the in-memory snapshot died with the
worker and the operator's annotations were gone forever — a hole
explicitly called out in ``_rollback_database``'s own docstring:

> (The earlier ``existing_info`` branch claimed to "preserve" the
>  original, but the delete had already destroyed it.)

Now the analysis stage writes the dictionary to a small JSON stash
file on disk before the delete, and ``_rollback_database`` reads it
back on failure to restore the dictionary by re-creating the
datastore resource with the stashed fields and zero rows. The data
is unrecoverable (it never landed), but the operator's annotations
are preserved across the failed run. The flow's ``finally`` block
clears the stash on any successful exit (including ``_StageAbort``
complete-with-skip); error paths leave the stash for the rollback
hook and a possible subsequent retry.

* NEW ``ckanext/datapusher_plus/dictionary_stash.py``: tiny on-disk
  persistence module (``save`` / ``load`` / ``clear`` / ``stash_path``).
  Atomic writes via os.replace, treats corrupt stashes as absent so
  the rollback hook can't crash on its own bad file, rejects
  path-traversal in resource_id.
* NEW ``tests/test_dictionary_stash.py``: 15 unit tests covering
  round-trip, missing-stash, idempotent clear, corrupt JSON,
  atomic-rename, failed-replace, traversal guard, lazy mkdir, default
  tempdir fallback.
* ``analysis.py``: stash existing_info before delete (best-effort —
  a stash failure logs a warning but does not block ingestion).
* ``prefect_flow.py``: ``_rollback_database`` restores from stash
  after dropping the half-written table; ``datapusher_plus_flow``'s
  ``finally`` clears the stash on success.
* ``config_declaration.yaml``: new
  ``ckanext.datapusher_plus.dictionary_stash_dir`` knob (defaults to
  ``<tempdir>/dpp_dict_stash``; settable for deployments where /tmp
  is wiped on container restart).

Test count: **186/186 Python unit tests pass** on dpp-test
(was 171; +15 from the new stash module).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four findings, all valid:

**MEDIUM** — Restore path only covered in-transaction failures.
``_rollback_database`` is registered ``@database_task.on_rollback`` so
it only fires for the four tasks inside ``with transaction():``. The
delete happens in ``analyze_task`` which runs *before* the transaction
block, so failures in ``analyze_task`` (after the delete), in
``ai_suggestions_task``, or in ``_maybe_suspend_for_pii_review`` would
leave the stash on disk but no hook would restore it. On retry,
``analyze_task`` saw no existing datastore (it had been deleted), so
``existing_info`` stayed empty, the stash was never re-applied, and
the success-path ``finally`` cleared the orphaned stash — losing the
dictionary anyway. Fix: at the top of ``AnalysisStage`` where
``existing_info`` is captured, if there's no existing datastore but
there IS a stash for this resource, treat that as a retry-after-failure
and load the stash as ``existing_info``. The downstream merge then
applies it onto the new headers, the new datastore resource gets the
dictionary, and the success-finally correctly drops a stash that has
now been propagated.

**LOW** — ``save`` leaked ``.tmp`` files on ``os.replace`` failure.
Wrapped the open/dump/replace in try/except that unlinks the tempfile
on any exception, then re-raises. The original file is still untouched.

**LOW** — ``_stash_dir`` mkdir'd on every call. ``load`` and ``clear``
do not need the directory to exist; in fact, the flow's success
``finally`` calls ``clear`` unconditionally, so the old behavior
created an empty stash directory for every job that never stashed
anything (and a misconfigured stash dir would crash ``clear``).
Added ``_stash_dir(create=False)`` (default) and only ``save`` passes
``create=True``.

**LOW** — Test used ``tk.config.pop`` instead of
``monkeypatch.delitem``, so the pop was not reverted at teardown and
silently dropped any pre-existing value for subsequent tests. Replaced
with ``monkeypatch.delitem(..., raising=False)``.

Tests: 187/187 unit tests pass on dpp-test (was 186; +1 from the new
``test_load_does_not_create_stash_dir``). Tests added/extended:
- ``test_save_with_failed_atomic_replace_does_not_leave_corruption``
  now also asserts no ``.tmp`` is leaked.
- ``test_load_does_not_create_stash_dir`` proves load/clear don't
  bootstrap the directory; only save does.
- ``test_default_stash_dir_falls_back_to_tempfile`` switched to
  ``monkeypatch.delitem`` for clean teardown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three LOW findings, all valid:

**LOW** — Retry-restore branch had no test (testing gap on the central
fix for the prior MEDIUM). Added 3 focused tests against
``AnalysisStage._parse_stats`` that exercise the seam directly:
* ``test_retry_restore_loads_stash_when_no_live_datastore`` proves the
  stash IS loaded into ``existing_info`` and propagated onto the
  rebuilt headers when the datastore is gone but the stash is present.
* ``test_retry_restore_is_noop_when_no_stash`` proves the branch
  doesn't synthesize a non-None empty dict on first-run-no-dictionary.
* ``test_retry_restore_does_not_fire_when_live_datastore_exists``
  proves the live datastore's info dicts win over a stale stash.

**LOW** — A stash file from a long-ago failed run could be silently
applied to an unrelated upload sharing the same resource_id. Surface
the stash mtime in the "restoring" info log so operators can spot
stale-restore vs. genuine retry-after-failure from the logs.
Format: ``stash age N s``. (A future change could enforce a TTL; for
now, visibility is enough — adding a hard expiry would alter the
restore policy, which should be a separate decision.)

**LOW** — Comment claimed the retry branch "re-save[s] (overwriting
itself)" the stash, but no re-save was actually performed. Dropped
the misleading clause and clarified: the existing stash content IS
the correct snapshot (the original dictionary captured before the
original delete), so reusing it as-is is intentional. If this retry
also fails, the same stash is reused on the next attempt.

Tests: 190/190 unit tests pass on dpp-test (was 187; +3 from the new
retry-restore coverage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR closes #265 by persisting a resource's Data Dictionary (per-field info dicts: label / description / type_override) to a small on-disk JSON file before the analysis stage deletes the existing datastore resource. On a rollback (any downstream stage failure inside the transactional block), the database rollback hook reads the stash back and re-creates the datastore resource with zero rows so the operator's annotations survive. The stash is cleared on successful flow completion and left in place on failure so a subsequent retry — or operator inspection — can use it.

Changes:

  • New dictionary_stash module (atomic os.replace writes, corrupt-stash treated as absent, path-traversal guard, lazy stash dir), plus a new CKAN config knob ckanext.datapusher_plus.dictionary_stash_dir.
  • Analysis stage stashes existing_info before delete (best-effort) and, on retry-after-failure (no live datastore but stash present), loads the stash back into existing_info.
  • _rollback_database restores the dictionary from the stash; the flow's finally clears the stash on any successful exit.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
ckanext/datapusher_plus/dictionary_stash.py New on-disk persistence module (save/load/clear/stash_path) with atomic writes and traversal guard.
ckanext/datapusher_plus/jobs/stages/analysis.py Stashes dictionary before delete and restores from stash on retry when no live datastore exists.
ckanext/datapusher_plus/jobs/prefect_flow.py Rollback hook restores dictionary from stash; success finally clears it.
ckanext/datapusher_plus/config_declaration.yaml Declares the new dictionary_stash_dir config option.
tests/test_dictionary_stash.py 15 unit tests covering round-trip, corruption, atomicity, traversal, dir bootstrap, and analysis-stage retry restore.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ckanext/datapusher_plus/jobs/prefect_flow.py Outdated
Copilot review on PR #307 flagged that ``_rollback_database``'s restore
path built field dicts with only ``id`` and ``info`` keys, omitting
``type``. Every other call site to ``send_resource_to_datastore`` in
the codebase (``_build_headers_dicts`` in ``analysis.py:435``,
``metadata.py:96``) passes headers with ``id``, ``type``, AND ``info``.
CKAN's ``datastore_create`` falls back to ``text`` for fields without
a declared type, so a stashed dictionary whose entries carried a
non-text ``type_override`` (e.g. ``numeric``, ``timestamp``) would be
silently restored as a text column — inconsistent with the operator's
original intent.

The rollback now mirrors the analysis stage's merge: for each stashed
entry, derive ``type`` from ``info["type_override"]`` when it's in
``conf.TYPE_MAPPING.values()`` (text / numeric / timestamp / date),
falling back to ``text`` for missing or unknown overrides.

Tests: 192/192 unit tests pass on dpp-test (was 190; +2 new
regression tests pinning text/numeric/timestamp/no-override + unknown
type fallback behavior).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 59e08d7 into main May 17, 2026
1 check passed
jqnatividad added a commit that referenced this pull request May 18, 2026
Three findings (1 MEDIUM, 2 LOW) on the memory-update commit; all
addressed.

- MEDIUM @ stale `tests/test_date_type_mapping.py` reference: this
  commit was *specifically* supposed to fix that reference but missed
  one instance (line 316). Fixed — and the same class of bug existed
  for three other filenames the reviewer didn't flag explicitly but
  which had the same root cause (me inventing filenames without
  verifying against `git log --diff-filter=A`). Corrected:
    test_file_hash_algorithms.py     -> test_file_hash_algorithm.py
    test_metadata_stage_file_hash.py -> test_metadata_hash_persistence.py
    test_download_result_rehydrate.py -> test_rehydrate_resource_identity.py
    test_date_type_mapping.py        -> test_date_without_timestamp.py
  Verified all four replacements resolve as real files in the tree.

- LOW @ inconsistent test-count math: the prior "+26 tests" claim
  enumerated files summing to 51, not 26. Rewrote the "Known result"
  section to spell out per-file gross counts (collected via
  `pytest --collect-only -q`: 21/4/10/3/4/5/4 = 51), then state the
  honest net delta (+26) and acknowledge the 25-test gap is from
  pre-existing tests removed/consolidated during the #307#314
  refactors. The math is now self-auditable and the framing no longer
  implies the listed adds sum to the delta.

- LOW @ forward-looking "Recognized in qsv ≥ 20.1" claim: tightened
  to "Empirically verified to be recognized as DateTime on qsv 20.1.0
  (macOS host check during #173 investigation)" and cross-referenced
  the test's skip guard for qsv ≥ 20.1, so the claim is grounded in
  a real observation and the next maintainer has a clean handoff
  path when the production qsv pin moves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stash existing data dictionary before doing a DP+ job. If DP+job fails, the old Data Dictionary should be restored.

2 participants