fix: correct pii_screening config-key typo + documentation-audit findings#324
Merged
Conversation
A documentation audit found that config.py read PII_SCREENING from ``ckanext.datastore_plus.pii_screening`` — a typo: ``datastore_plus`` instead of ``datapusher_plus``. Every other DP+ setting, the ``config_declaration.yaml`` declaration, and the README all use the ``datapusher_plus`` namespace. Practical effect: operators enabling PII screening via the documented key ``ckanext.datapusher_plus.pii_screening`` had no effect — ``PII_SCREENING`` silently stayed at its ``False`` fallback, so the PII-screening feature could not be turned on as documented. Only someone who happened to set the (undocumented, typo'd) key ``ckanext.datastore_plus.pii_screening`` would have enabled it. Fix: read the documented ``ckanext.datapusher_plus.pii_screening`` key. Adds tests/test_pii_screening_config_key.py with 3 regression tests: the documented key now drives PII_SCREENING, the old typo'd key no longer does, and an AST-based drift guard that pins the config.py key string to the config_declaration.yaml declaration so a reintroduced namespace typo fails CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the documentation audit. Addresses the doc-side findings (the pii_screening code bug is fixed separately): * **config.py — preview_rows fallback**: aligned the inline ``tk.config.get`` fallback from ``"1000"`` to ``"0"`` so it agrees with ``config_declaration.yaml`` (default ``0``). Under CKAN 2.10+ declarative config the declared default already wins, so ``0`` was the effective runtime default all along — the ``"1000"`` fallback was stale, never-reached, and contradicted the declaration. No runtime behavior change. * **README.md — stale describeGPT_api_key**: removed the ``ckanext.datapusher_plus.describeGPT_api_key`` example line. No such config key exists; v3.0 moved AI-suggestion credentials out of ckan.ini into qsv's own config (``describegpt_config_path``) / the ``OPENAI_API_KEY`` env var. The README's own v3.0 config reference table already omitted it. * **README.md — formats example**: dropped ``xlsxb`` (a typo — the extension is ``xlsb``) and ``xlsm`` from both ``formats`` example lines so they match ``config.py``'s actual FORMATS default. Note: ``format_converter.py`` *can* convert ``.xlsm`` / ``.xlsb`` (they're in ``SPREADSHEET_EXTENSIONS``), but the ``FORMATS`` gate doesn't list them, so DP+ won't pick those files up today — adding them to FORMATS is a separate feature decision, not a doc fix. * **README.md — illustrative-values note**: both ckan.ini example config blocks silently contradicted the actual defaults for several keys (preview_rows, chunk_size, dedup, auto_index_threshold, ignore_file_hash, auto_alias_unique). Added a note above each block clarifying the values are illustrative examples, not defaults, and pointing to config_declaration.yaml as authoritative. The individual stale values were left as-is because some non-defaults may be deliberate install-guide recommendations — a value-by-value reconciliation is left to a maintainer pass. * **CONFIG.md — DRUF template overrides**: added the two ``resource_form.html`` override entries (plain + scheming) that were missing from the list — DP+ ships 5 DRUF override templates, the doc listed only 3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a mis-namespaced CKAN config key that prevented pii_screening from being enabled as documented, and updates documentation/config fallbacks found during a documentation audit.
Changes:
- Correct
PII_SCREENINGto read fromckanext.datapusher_plus.pii_screeningand alignpreview_rowsfallback with declarative config defaults. - Add regression tests to ensure the documented PII key works and to guard against declaration-vs-code drift.
- Update README/CONFIG documentation examples and template-override listings to match current behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
tests/test_pii_screening_config_key.py |
Adds regression + drift-guard coverage for the PII config-key fix. |
ckanext/datapusher_plus/config.py |
Fixes PII config key typo; aligns PREVIEW_ROWS fallback with declared default. |
README.md |
Clarifies example-vs-default configs; removes stale/invalid config examples; fixes formats list. |
CONFIG.md |
Documents the missing DRUF template overrides. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Copilot flagged that ``_reload_config`` uses ``importlib.reload`` on ``ckanext.datapusher_plus.config``, which mutates the module object in place. The recomputed constants (from a monkeypatched ``tk.config``) persist in ``sys.modules`` after the test — once ``monkeypatch`` restores ``tk.config`` it does NOT re-run ``config.py`` — so a non-default config snapshot could leak into later tests and make the suite order-dependent. Fix: added a ``_reset_config_module`` autouse fixture that reloads ``config`` on teardown. As an autouse fixture it is set up before the test's ``monkeypatch`` fixture, so its finalizer runs *after* ``monkeypatch`` has restored ``tk.config`` — the teardown reload therefore recomputes the module constants from the pristine test-environment config, leaving no order dependence. Also corrected the ``_reload_config`` docstring, which had cited ``test_file_hash_algorithm`` / the #61 tests as a reload precedent — both actually read ``tk.config`` live and do NOT reload. Verified: the config-adjacent suites (``test_file_hash_algorithm``, ``test_issue_61_download_always_whitelist``) run green alongside this file, and the full suite stays at 277 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
jqnatividad
added a commit
that referenced
this pull request
May 20, 2026
…325) A documentation audit found CLAUDE.md materially stale — it still described the v2.0 in-process pipeline. 13 of 38 verified claims were false. Corrections: * Version v2.0.0 → 3.0.0a0 (per pyproject.toml). * Architecture section rewritten for v3.0: the v2 ``DataProcessingPipeline`` loop and the named ``pipeline.py`` / ``jobs_legacy.py`` files do not exist. Orchestration is ``jobs/prefect_flow.py`` with per-stage ``@task`` functions and the ``datapusher_plus_flow`` entry ``@flow``. Documented the full v3.0 jobs-module layout (prefect_flow, runtime_context, subflows, events, caching, blocks, artifacts, quarantine, file_persistence) and the previously-omitted ``ai_suggestions`` stage. * Key Modules: dropped the non-existent ``jobs_legacy.py``; added ``prefect_client.py``, ``dictionary_stash.py``, ``utils.py``. * Formula type ``suggest_formula`` → ``suggestion_formula`` (the actual scheming-YAML key — verified in jinja2_helpers.py / formula.py). * qsv ``v4.0.0+`` → ``v20.1.0+`` (MINIMUM_QSV_VERSION; two BREAKING bumps since 4.0.0). * ``preview_rows`` default 1000 → 0 (matches config_declaration.yaml and config.py after PR #324). * External Dependencies: RQ → Prefect 3.7+ (the v3.0 runtime; requirements.txt pins prefect, not rq). * CLI commands: added ``prefect-deploy`` and ``migrate-from-rq``. * Build & Test: ``pytest tests/test_unit.py`` (no such file) replaced with a real example + the unit/integration split. Verified-correct claims (plugin.py interface list, formula namespaces, models, job_exceptions, logic actions, Python versions) were left intact. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
jqnatividad
added a commit
that referenced
this pull request
May 20, 2026
… config (#326) * ci: rewrite test.yml as a real unit-suite job The old test.yml ("Tests") was dead 2022-era cruft: a Python 2.7-3.9 matrix, an `actions/checkout@v2` / `setup-python@v2` / `codecov-action@v1` stack, `pip install -r requirements-dev-py2.txt` (that file no longer exists), and `pytest --cov=datapusher` against the pre-extension package name. It cannot have passed since the extension refactor. It also masked a real gap surfaced by the CLAUDE.md audit: NO workflow runs the DP+ unit suite. `ci.yml` runs only the qsv contract regression test + the integration suite; `main.yml` is a manual end-to-end run. Replaced with a `unit-tests` job that runs the full unit suite (`pytest tests/ --ignore=tests/integration`) on push/PR. It mirrors the proven `dpp-test` container setup: the `ckan/ckan-dev:2.11` image, geo system libs, `requirements.txt` + `requirements-dev.txt` + editable install, qsv 20.1.0, and the `PYTEST_DISABLE_PLUGIN_AUTOLOAD` / `QSV_BIN` / `CKAN_INI` env the suite needs. Python 3.10 only: the unit suite imports `ckan.plugins.toolkit`, so it needs a full CKAN install, and the CKAN 2.11 dev image ships Python 3.10. A 3.11-3.13 matrix would require building CKAN from source per version — left as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: add xlsm/xlsb to FORMATS + correct spatial-tolerance config-key casing Two config.py correctness fixes surfaced by the documentation audit: * **xlsm/xlsb were ungated.** ``FormatConverterStage`` advertises ``.xlsm`` and ``.xlsb`` support (both in ``SPREADSHEET_EXTENSIONS``), but the ``FORMATS`` default omitted them — so DP+ silently skipped those files before they ever reached the converter. Added both to the ``FORMATS`` default, grouped with the other Excel formats. New ``tests/test_formats_config.py`` (2 tests): xlsm/xlsb are in ``FORMATS``, and a consistency guard that every extension in ``FormatConverterStage.SPREADSHEET_EXTENSIONS`` is gated in by ``FORMATS`` — so a future converter-side addition can't silently leave a format ungated. * **spatial-tolerance config key had inconsistent casing.** ``config.py`` read the relative-tolerance setting from ``ckanext.datapusher_plus.SPATIAL_SIMPLIFICATION_RELATIVE_TOLERANCE`` — UPPERCASE, unlike every other ~50 DP+ config keys. Operators setting the lowercase form (the only form the README ever showed, albeit misspelled) silently had no effect. Lowercased the key string to ``...spatial_simplification_relative_tolerance``. The Python constant name is unchanged; only the ckan.ini key string moved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: reconcile README ckan.ini example config blocks to current defaults Both ckan.ini example blocks were reconciled to DP+'s actual current defaults (the documentation audit found the values had drifted; PR #324 added a stopgap "illustrative, not defaults" note). Each fix verified against config.py and config_declaration.yaml — where a key is declared, the declaration's default is what CKAN 2.10+ applies at runtime, so that is the authoritative value. Per block: * Removed ``use_proxy`` — there is no such config key. DP+ derives proxy use from whether ``download_proxy`` is set (``config.py``: ``USE_PROXY = "...download_proxy" in tk.config``). * ``ssl_verify``: false -> true (the actual default — verifying TLS). * ``preview_rows``: 100 -> 0 (load all rows; matches config_declaration.yaml after PR #324). * ``auto_index_threshold``: 3 -> 10 (issue #142 raised the default). * ``formats``: added ``xlsm xlsb`` (now gated in — see the FORMATS commit in this PR). * ``auto_spatial_simplication`` -> ``auto_spatial_simplification`` and ``spatial_simplication_relative_tolerance`` -> ``spatial_simplification_relative_tolerance`` — the README had always misspelled "simplification"; the corrected lowercase key now matches config.py. * ``longitude_fields``: ``longitude,long,lon`` -> ``longitude,lon`` (the actual default has no ``long``). * ``jinja2_bytecode_cache_dir``: fixed the ``butecode`` typo in the path default. ``chunk_size`` (16384), ``dedup`` (false), ``ignore_file_hash`` (true) and ``auto_alias_unique`` (false) were left as-is — those already match the config_declaration.yaml defaults that win at runtime. (config.py's *inline fallbacks* for those four still drift from the declaration — behavior-preserving dead code, same shape as the preview_rows fallback PR #324 fixed; flagged for a follow-up.) The note above each block was reworded: the block now genuinely lists current defaults, so it is a usable starting-point reference rather than an "illustrative, differs from defaults" disclaimer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address roborev #2306/#2307/#2308 on PR #326 roborev reviewed the three PR #326 commits. Findings actioned: **#2306 (LOW) — test.yml comment.** The "Run unit test suite" step attributed the `--pdbcls` addopt to `setup.cfg`; pytest config moved to `pyproject.toml` (`[tool.pytest.ini_options]`). Comment corrected. **#2307 (LOW) — xlsxb typo in CI workflows.** `ci.yml` and `main.yml` set `formats` with `xlsxb` — a typo for `xlsb` — so an explicitly-configured `.xlsb` resource stayed ungated in CI even though the converter supports it. Corrected `xlsxb` → `xlsb` in all four spots: the `ckanext.datapusher_plus.formats` and legacy `ckan.datapusher.formats` lines in `ci.yml`, the `main.yml` formats line, and `ci.yml`'s test-file `case` branch (whose MIME type, `application/vnd.ms-excel.sheet.binary.macroEnabled.12`, confirms `.xlsb` was the intent). **#2307 (LOW) — missing spatial-key regression test + redundant importorskip.** Added `test_spatial_tolerance_config_key_is_lowercase` (AST drift guard against a revert to the uppercase key typo) and dropped the redundant `pytest.importorskip("ckan")` in `test_formats_covers_every_converter_spreadsheet_extension` (the `formats` fixture already does it). **#2308 (LOW x2) — README config-block notes.** The reworded notes named `config_declaration.yaml` as the sole authoritative source, but `ssl_verify` / `formats` / `unsafe_prefix` / `reserved_colnames` are declared only in `config.py`. Notes now cite both. Also restored the "set only the keys you want to change" guidance and explained why pasting the whole block verbatim is risky (pins every value as an explicit override, defeating future default updates). Not changed (already resolved earlier in the PR stack): * #2307 (MEDIUM) — README documented a misspelled spatial key. The README spelling fix landed in this PR's commit a9e64db; #2307 reviewed the earlier commit 1638bf7 in isolation. Full unit suite: 279 → 280, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Copilot review on PR #326 Six Copilot findings, all applied: * **plugin.py had a second, drifting formats list.** ``_submit_to_datapusher`` fell back to its own hardcoded format list when neither ``ckan.datapusher.formats`` nor ``ckanext.datapusher_plus.formats`` was set — and that list lacked ``xlsm``/``xlsb``, so the FORMATS fix would not reach deployments that don't set ``formats`` explicitly. Centralized the fallback on ``conf.FORMATS`` so there is one canonical default that cannot drift. * **Spatial-tolerance config-key rename could break deployments.** The previous commit lowercased the key string; an operator who had set the legacy UPPERCASE key would silently lose the setting. Added a nested fallback to the legacy ``...SPATIAL_SIMPLIFICATION_RELATIVE_TOLERANCE`` key — the same backward-compat shape ``SSL_VERIFY`` already uses. * **tests/test_formats_config.py was runtime/CKAN-dependent.** Rewrote both FORMATS tests to AST-parse the ``config.py`` default list literal and ``format_converter.py``'s ``SPREADSHEET_EXTENSIONS`` — deterministic, CKAN-free config-drift guards in the same style as the other config tests in the repo. * **The spatial-key AST test could capture a nested fallback key.** With the legacy fallback now nested as a second ``tk.config.get`` arg, walking for "the first ``.get``" is fragile. The test now targets the assignment's ``.value`` directly — the outermost ``tk.config.get`` — so it pins the *primary* key regardless of nesting. * **Dead ``use_proxy`` config key in CI workflows.** Removed ``ckanext.datapusher_plus.use_proxy`` from the ci.yml and main.yml CKAN config blocks — there is no such key (DP+ derives proxy use from whether ``download_proxy`` is set). Full unit suite: 280 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A documentation audit of
datapusher-plus(~70 claims checked across README.md, CONFIG.md, docs/) surfaced one functional bug and several doc-vs-code inconsistencies. This PR fixes them in two focused commits.Commit 1 —
fix:thepii_screeningconfig-key typo (functional bug)config.pyreadPII_SCREENINGfromckanext.datastore_plus.pii_screening— a namespace typo (datastore_plusinstead ofdatapusher_plus). The documented keyckanext.datapusher_plus.pii_screening(correctly declared inconfig_declaration.yaml, documented in the README) never populatedPII_SCREENING— it silently stayed at itsFalsefallback, so PII screening could not be enabled as documented.config.py.tests/test_pii_screening_config_key.py(3 tests): documented key now drivesPII_SCREENING, the old typo'd key no longer does, and an AST-based drift guard pins theconfig.pykey string to theconfig_declaration.yamldeclaration.Commit 2 —
docs:correct documentation-audit findingsconfig.pypreview_rowsfallback"1000"→"0"to agree withconfig_declaration.yaml(default0). Under CKAN 2.10+ declarative config the declared default already wins, so0was the effective default all along — no runtime behavior change, just removes a stale never-reached fallback.describeGPT_api_keyexample line (no such key; v3.0 moved AI credentials to qsv's config /OPENAI_API_KEY).formatsexample: droppedxlsxb(typo forxlsb) andxlsmto matchconfig.py's actualFORMATSdefault. (Note:format_converter.pycan convert.xlsm/.xlsb, but theFORMATSgate omits them — adding them is a separate feature decision.)preview_rows,chunk_size,dedup,auto_index_threshold,ignore_file_hash,auto_alias_unique).resource_form.htmlDRUF template-override entries (DP+ ships 5, the doc listed 3).Deliberately out of scope
ignore_file_hash = true,dedup = false) may be deliberate install-guide recommendations rather than stale drift — left to a maintainer pass. The added note resolves the immediate "silently contradicts defaults" honesty problem.xlsm/xlsbto theFORMATSdefault — a feature decision, not a doc fix.Test plan
tests/test_pii_screening_config_key.py— 3 new tests, all passing.preview_rowschange verified behavior-preserving (existing tests monkeypatchPREVIEW_ROWSexplicitly; declared default already0).🤖 Generated with Claude Code