Skip to content

fix: correct pii_screening config-key typo + documentation-audit findings#324

Merged
jqnatividad merged 3 commits into
mainfrom
fix-docs-audit-config-keys
May 20, 2026
Merged

fix: correct pii_screening config-key typo + documentation-audit findings#324
jqnatividad merged 3 commits into
mainfrom
fix-docs-audit-config-keys

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

A documentation audit of datapusher-plus (~70 claims checked across README.md, CONFIG.md, docs/) surfaced one functional bug and several doc-vs-code inconsistencies. This PR fixes them in two focused commits.

Commit 1 — fix: the pii_screening config-key typo (functional bug)

config.py read PII_SCREENING from ckanext.datastore_plus.pii_screening — a namespace typo (datastore_plus instead of datapusher_plus). The documented key ckanext.datapusher_plus.pii_screening (correctly declared in config_declaration.yaml, documented in the README) never populated PII_SCREENING — it silently stayed at its False fallback, so PII screening could not be enabled as documented.

  • Fixed the key string in config.py.
  • Added tests/test_pii_screening_config_key.py (3 tests): documented key now drives PII_SCREENING, the old typo'd key no longer does, and an AST-based drift guard pins the config.py key string to the config_declaration.yaml declaration.

Commit 2 — docs: correct documentation-audit findings

  • config.py preview_rows fallback "1000""0" to agree with config_declaration.yaml (default 0). Under CKAN 2.10+ declarative config the declared default already wins, so 0 was the effective default all along — no runtime behavior change, just removes a stale never-reached fallback.
  • README — removed the stale describeGPT_api_key example line (no such key; v3.0 moved AI credentials to qsv's config / OPENAI_API_KEY).
  • READMEformats example: dropped xlsxb (typo for xlsb) and xlsm to match config.py's actual FORMATS default. (Note: format_converter.py can convert .xlsm/.xlsb, but the FORMATS gate omits them — adding them is a separate feature decision.)
  • README — added an "illustrative example, not defaults" note above both ckan.ini config blocks, which silently contradicted actual defaults for several keys (preview_rows, chunk_size, dedup, auto_index_threshold, ignore_file_hash, auto_alias_unique).
  • CONFIG.md — added the two missing resource_form.html DRUF template-override entries (DP+ ships 5, the doc listed 3).

Deliberately out of scope

  • Value-by-value reconciliation of the README example config blocks. Some non-default values (e.g. ignore_file_hash = true, dedup = false) may be deliberate install-guide recommendations rather than stale drift — left to a maintainer pass. The added note resolves the immediate "silently contradicts defaults" honesty problem.
  • Adding xlsm/xlsb to the FORMATS default — a feature decision, not a doc fix.

Test plan

  • tests/test_pii_screening_config_key.py — 3 new tests, all passing.
  • Full Python unit suite: 274 → 277 passing.
  • preview_rows change verified behavior-preserving (existing tests monkeypatch PREVIEW_ROWS explicitly; declared default already 0).

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 19, 2026 23:05
A documentation audit found that config.py read PII_SCREENING from
``ckanext.datastore_plus.pii_screening`` — a typo: ``datastore_plus``
instead of ``datapusher_plus``. Every other DP+ setting, the
``config_declaration.yaml`` declaration, and the README all use the
``datapusher_plus`` namespace.

Practical effect: operators enabling PII screening via the
documented key ``ckanext.datapusher_plus.pii_screening`` had no
effect — ``PII_SCREENING`` silently stayed at its ``False`` fallback,
so the PII-screening feature could not be turned on as documented.
Only someone who happened to set the (undocumented, typo'd) key
``ckanext.datastore_plus.pii_screening`` would have enabled it.

Fix: read the documented ``ckanext.datapusher_plus.pii_screening``
key. Adds tests/test_pii_screening_config_key.py with 3 regression
tests: the documented key now drives PII_SCREENING, the old typo'd
key no longer does, and an AST-based drift guard that pins the
config.py key string to the config_declaration.yaml declaration so
a reintroduced namespace typo fails CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the documentation audit. Addresses the doc-side
findings (the pii_screening code bug is fixed separately):

* **config.py — preview_rows fallback**: aligned the inline
  ``tk.config.get`` fallback from ``"1000"`` to ``"0"`` so it
  agrees with ``config_declaration.yaml`` (default ``0``). Under
  CKAN 2.10+ declarative config the declared default already wins,
  so ``0`` was the effective runtime default all along — the
  ``"1000"`` fallback was stale, never-reached, and contradicted
  the declaration. No runtime behavior change.

* **README.md — stale describeGPT_api_key**: removed the
  ``ckanext.datapusher_plus.describeGPT_api_key`` example line.
  No such config key exists; v3.0 moved AI-suggestion credentials
  out of ckan.ini into qsv's own config
  (``describegpt_config_path``) / the ``OPENAI_API_KEY`` env var.
  The README's own v3.0 config reference table already omitted it.

* **README.md — formats example**: dropped ``xlsxb`` (a typo —
  the extension is ``xlsb``) and ``xlsm`` from both ``formats``
  example lines so they match ``config.py``'s actual FORMATS
  default. Note: ``format_converter.py`` *can* convert ``.xlsm`` /
  ``.xlsb`` (they're in ``SPREADSHEET_EXTENSIONS``), but the
  ``FORMATS`` gate doesn't list them, so DP+ won't pick those
  files up today — adding them to FORMATS is a separate feature
  decision, not a doc fix.

* **README.md — illustrative-values note**: both ckan.ini example
  config blocks silently contradicted the actual defaults for
  several keys (preview_rows, chunk_size, dedup,
  auto_index_threshold, ignore_file_hash, auto_alias_unique).
  Added a note above each block clarifying the values are
  illustrative examples, not defaults, and pointing to
  config_declaration.yaml as authoritative. The individual stale
  values were left as-is because some non-defaults may be
  deliberate install-guide recommendations — a value-by-value
  reconciliation is left to a maintainer pass.

* **CONFIG.md — DRUF template overrides**: added the two
  ``resource_form.html`` override entries (plain + scheming) that
  were missing from the list — DP+ ships 5 DRUF override
  templates, the doc listed only 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad requested a review from Copilot May 20, 2026 03:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a mis-namespaced CKAN config key that prevented pii_screening from being enabled as documented, and updates documentation/config fallbacks found during a documentation audit.

Changes:

  • Correct PII_SCREENING to read from ckanext.datapusher_plus.pii_screening and align preview_rows fallback with declarative config defaults.
  • Add regression tests to ensure the documented PII key works and to guard against declaration-vs-code drift.
  • Update README/CONFIG documentation examples and template-override listings to match current behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/test_pii_screening_config_key.py Adds regression + drift-guard coverage for the PII config-key fix.
ckanext/datapusher_plus/config.py Fixes PII config key typo; aligns PREVIEW_ROWS fallback with declared default.
README.md Clarifies example-vs-default configs; removes stale/invalid config examples; fixes formats list.
CONFIG.md Documents the missing DRUF template overrides.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_pii_screening_config_key.py
Copilot flagged that ``_reload_config`` uses ``importlib.reload`` on
``ckanext.datapusher_plus.config``, which mutates the module object
in place. The recomputed constants (from a monkeypatched
``tk.config``) persist in ``sys.modules`` after the test — once
``monkeypatch`` restores ``tk.config`` it does NOT re-run
``config.py`` — so a non-default config snapshot could leak into
later tests and make the suite order-dependent.

Fix: added a ``_reset_config_module`` autouse fixture that reloads
``config`` on teardown. As an autouse fixture it is set up before
the test's ``monkeypatch`` fixture, so its finalizer runs *after*
``monkeypatch`` has restored ``tk.config`` — the teardown reload
therefore recomputes the module constants from the pristine
test-environment config, leaving no order dependence.

Also corrected the ``_reload_config`` docstring, which had cited
``test_file_hash_algorithm`` / the #61 tests as a reload precedent
— both actually read ``tk.config`` live and do NOT reload.

Verified: the config-adjacent suites (``test_file_hash_algorithm``,
``test_issue_61_download_always_whitelist``) run green alongside
this file, and the full suite stays at 277 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 5bac296 into main May 20, 2026
1 check passed
jqnatividad added a commit that referenced this pull request May 20, 2026
…325)

A documentation audit found CLAUDE.md materially stale — it still
described the v2.0 in-process pipeline. 13 of 38 verified claims were
false. Corrections:

* Version v2.0.0 → 3.0.0a0 (per pyproject.toml).
* Architecture section rewritten for v3.0: the v2
  ``DataProcessingPipeline`` loop and the named ``pipeline.py`` /
  ``jobs_legacy.py`` files do not exist. Orchestration is
  ``jobs/prefect_flow.py`` with per-stage ``@task`` functions and the
  ``datapusher_plus_flow`` entry ``@flow``. Documented the full v3.0
  jobs-module layout (prefect_flow, runtime_context, subflows, events,
  caching, blocks, artifacts, quarantine, file_persistence) and the
  previously-omitted ``ai_suggestions`` stage.
* Key Modules: dropped the non-existent ``jobs_legacy.py``; added
  ``prefect_client.py``, ``dictionary_stash.py``, ``utils.py``.
* Formula type ``suggest_formula`` → ``suggestion_formula`` (the
  actual scheming-YAML key — verified in jinja2_helpers.py /
  formula.py).
* qsv ``v4.0.0+`` → ``v20.1.0+`` (MINIMUM_QSV_VERSION; two BREAKING
  bumps since 4.0.0).
* ``preview_rows`` default 1000 → 0 (matches config_declaration.yaml
  and config.py after PR #324).
* External Dependencies: RQ → Prefect 3.7+ (the v3.0 runtime;
  requirements.txt pins prefect, not rq).
* CLI commands: added ``prefect-deploy`` and ``migrate-from-rq``.
* Build & Test: ``pytest tests/test_unit.py`` (no such file) replaced
  with a real example + the unit/integration split.

Verified-correct claims (plugin.py interface list, formula
namespaces, models, job_exceptions, logic actions, Python versions)
were left intact.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 20, 2026
… config (#326)

* ci: rewrite test.yml as a real unit-suite job

The old test.yml ("Tests") was dead 2022-era cruft: a Python
2.7-3.9 matrix, an `actions/checkout@v2` / `setup-python@v2` /
`codecov-action@v1` stack, `pip install -r requirements-dev-py2.txt`
(that file no longer exists), and `pytest --cov=datapusher` against
the pre-extension package name. It cannot have passed since the
extension refactor.

It also masked a real gap surfaced by the CLAUDE.md audit: NO
workflow runs the DP+ unit suite. `ci.yml` runs only the qsv
contract regression test + the integration suite; `main.yml` is a
manual end-to-end run.

Replaced with a `unit-tests` job that runs the full unit suite
(`pytest tests/ --ignore=tests/integration`) on push/PR. It mirrors
the proven `dpp-test` container setup: the `ckan/ckan-dev:2.11`
image, geo system libs, `requirements.txt` + `requirements-dev.txt`
+ editable install, qsv 20.1.0, and the
`PYTEST_DISABLE_PLUGIN_AUTOLOAD` / `QSV_BIN` / `CKAN_INI` env the
suite needs.

Python 3.10 only: the unit suite imports `ckan.plugins.toolkit`, so
it needs a full CKAN install, and the CKAN 2.11 dev image ships
Python 3.10. A 3.11-3.13 matrix would require building CKAN from
source per version — left as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: add xlsm/xlsb to FORMATS + correct spatial-tolerance config-key casing

Two config.py correctness fixes surfaced by the documentation audit:

* **xlsm/xlsb were ungated.** ``FormatConverterStage`` advertises
  ``.xlsm`` and ``.xlsb`` support (both in
  ``SPREADSHEET_EXTENSIONS``), but the ``FORMATS`` default omitted
  them — so DP+ silently skipped those files before they ever
  reached the converter. Added both to the ``FORMATS`` default,
  grouped with the other Excel formats.

  New ``tests/test_formats_config.py`` (2 tests): xlsm/xlsb are in
  ``FORMATS``, and a consistency guard that every extension in
  ``FormatConverterStage.SPREADSHEET_EXTENSIONS`` is gated in by
  ``FORMATS`` — so a future converter-side addition can't silently
  leave a format ungated.

* **spatial-tolerance config key had inconsistent casing.**
  ``config.py`` read the relative-tolerance setting from
  ``ckanext.datapusher_plus.SPATIAL_SIMPLIFICATION_RELATIVE_TOLERANCE``
  — UPPERCASE, unlike every other ~50 DP+ config keys. Operators
  setting the lowercase form (the only form the README ever showed,
  albeit misspelled) silently had no effect. Lowercased the key
  string to ``...spatial_simplification_relative_tolerance``. The
  Python constant name is unchanged; only the ckan.ini key string
  moved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: reconcile README ckan.ini example config blocks to current defaults

Both ckan.ini example blocks were reconciled to DP+'s actual current
defaults (the documentation audit found the values had drifted; PR
#324 added a stopgap "illustrative, not defaults" note). Each fix
verified against config.py and config_declaration.yaml — where a key
is declared, the declaration's default is what CKAN 2.10+ applies at
runtime, so that is the authoritative value.

Per block:
* Removed ``use_proxy`` — there is no such config key. DP+ derives
  proxy use from whether ``download_proxy`` is set
  (``config.py``: ``USE_PROXY = "...download_proxy" in tk.config``).
* ``ssl_verify``: false -> true (the actual default — verifying TLS).
* ``preview_rows``: 100 -> 0 (load all rows; matches
  config_declaration.yaml after PR #324).
* ``auto_index_threshold``: 3 -> 10 (issue #142 raised the default).
* ``formats``: added ``xlsm xlsb`` (now gated in — see the FORMATS
  commit in this PR).
* ``auto_spatial_simplication`` -> ``auto_spatial_simplification``
  and ``spatial_simplication_relative_tolerance`` ->
  ``spatial_simplification_relative_tolerance`` — the README had
  always misspelled "simplification"; the corrected lowercase key
  now matches config.py.
* ``longitude_fields``: ``longitude,long,lon`` -> ``longitude,lon``
  (the actual default has no ``long``).
* ``jinja2_bytecode_cache_dir``: fixed the ``butecode`` typo in the
  path default.

``chunk_size`` (16384), ``dedup`` (false), ``ignore_file_hash``
(true) and ``auto_alias_unique`` (false) were left as-is — those
already match the config_declaration.yaml defaults that win at
runtime. (config.py's *inline fallbacks* for those four still drift
from the declaration — behavior-preserving dead code, same shape as
the preview_rows fallback PR #324 fixed; flagged for a follow-up.)

The note above each block was reworded: the block now genuinely
lists current defaults, so it is a usable starting-point reference
rather than an "illustrative, differs from defaults" disclaimer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address roborev #2306/#2307/#2308 on PR #326

roborev reviewed the three PR #326 commits. Findings actioned:

**#2306 (LOW) — test.yml comment.** The "Run unit test suite" step
attributed the `--pdbcls` addopt to `setup.cfg`; pytest config moved
to `pyproject.toml` (`[tool.pytest.ini_options]`). Comment corrected.

**#2307 (LOW) — xlsxb typo in CI workflows.** `ci.yml` and `main.yml`
set `formats` with `xlsxb` — a typo for `xlsb` — so an
explicitly-configured `.xlsb` resource stayed ungated in CI even
though the converter supports it. Corrected `xlsxb` → `xlsb` in all
four spots: the `ckanext.datapusher_plus.formats` and legacy
`ckan.datapusher.formats` lines in `ci.yml`, the `main.yml` formats
line, and `ci.yml`'s test-file `case` branch (whose MIME type,
`application/vnd.ms-excel.sheet.binary.macroEnabled.12`, confirms
`.xlsb` was the intent).

**#2307 (LOW) — missing spatial-key regression test + redundant
importorskip.** Added `test_spatial_tolerance_config_key_is_lowercase`
(AST drift guard against a revert to the uppercase key typo) and
dropped the redundant `pytest.importorskip("ckan")` in
`test_formats_covers_every_converter_spreadsheet_extension` (the
`formats` fixture already does it).

**#2308 (LOW x2) — README config-block notes.** The reworded notes
named `config_declaration.yaml` as the sole authoritative source,
but `ssl_verify` / `formats` / `unsafe_prefix` / `reserved_colnames`
are declared only in `config.py`. Notes now cite both. Also restored
the "set only the keys you want to change" guidance and explained
why pasting the whole block verbatim is risky (pins every value as
an explicit override, defeating future default updates).

Not changed (already resolved earlier in the PR stack):
* #2307 (MEDIUM) — README documented a misspelled spatial key. The
  README spelling fix landed in this PR's commit a9e64db; #2307
  reviewed the earlier commit 1638bf7 in isolation.

Full unit suite: 279 → 280, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review on PR #326

Six Copilot findings, all applied:

* **plugin.py had a second, drifting formats list.**
  ``_submit_to_datapusher`` fell back to its own hardcoded format
  list when neither ``ckan.datapusher.formats`` nor
  ``ckanext.datapusher_plus.formats`` was set — and that list
  lacked ``xlsm``/``xlsb``, so the FORMATS fix would not reach
  deployments that don't set ``formats`` explicitly. Centralized
  the fallback on ``conf.FORMATS`` so there is one canonical
  default that cannot drift.

* **Spatial-tolerance config-key rename could break deployments.**
  The previous commit lowercased the key string; an operator who
  had set the legacy UPPERCASE key would silently lose the setting.
  Added a nested fallback to the legacy
  ``...SPATIAL_SIMPLIFICATION_RELATIVE_TOLERANCE`` key — the same
  backward-compat shape ``SSL_VERIFY`` already uses.

* **tests/test_formats_config.py was runtime/CKAN-dependent.**
  Rewrote both FORMATS tests to AST-parse the ``config.py`` default
  list literal and ``format_converter.py``'s
  ``SPREADSHEET_EXTENSIONS`` — deterministic, CKAN-free config-drift
  guards in the same style as the other config tests in the repo.

* **The spatial-key AST test could capture a nested fallback key.**
  With the legacy fallback now nested as a second
  ``tk.config.get`` arg, walking for "the first ``.get``" is
  fragile. The test now targets the assignment's ``.value``
  directly — the outermost ``tk.config.get`` — so it pins the
  *primary* key regardless of nesting.

* **Dead ``use_proxy`` config key in CI workflows.** Removed
  ``ckanext.datapusher_plus.use_proxy`` from the ci.yml and
  main.yml CKAN config blocks — there is no such key (DP+ derives
  proxy use from whether ``download_proxy`` is set).

Full unit suite: 280 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants