fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up) by jqnatividad · Pull Request #303 · dathere/datapusher-plus

jqnatividad · 2026-05-17T10:47:39Z

Summary

End-to-end verification against LM Studio (gemma-4-e4b on host.docker.internal:1234) surfaced two real bugs in the AI suggestions feature that landed in #301 + #302. The unit suite passed because subprocess.run was mocked with a synthetic JSON shape I invented — nothing exercised qsv's actual CLI surface or output schema.

Bug 1: `--json` flag doesn't exist in qsv 20.0.0

PR #301's QSVCommand.describegpt wrapper unconditionally emitted --json → qsv rejected every real call with Unknown flag: '--json'. The correct flag is --format JSON (qsv 20.0.0 supports Markdown / TSV / JSON / TOON via a single --format flag).

Fix: wrapper now emits --format JSON when json_output=True. Test asserts both presence of --format JSON and absence of --json so this regression can't re-land silently.

Bug 2: `_reshape_for_ui` assumed wrong output schema

PR #301/#302 assumed qsv emits a flat shape {description, tags, dictionary: [{field, summary}]}. Real qsv 20.0.0 output (validated against LM Studio) is a wrapped envelope with PascalCase top-level keys:

{
  "Dictionary": {
    "response": {"fields": [{"name": "...", "type": "...", "label": "...", "description": "...", ...}], ...},
    "reasoning": "", "token_usage": {...}
  },
  "Description": {
    "response": "<markdown string>",
    "reasoning": "", "token_usage": {...}
  },
  "Tags": {
    "response": {"tags": [...], "attribution": "..."},
    "reasoning": "", "token_usage": {...}, "num_tags": N
  }
}

Even if Bug 1 were fixed, the reshape would always produce an empty dict because none of the assumed top-level keys would match. The UI would never see populated suggestions.

Fix: rewrote _reshape_for_ui to walk the real envelope. Description.response → top-level description; Tags.response.tags → top-level tags (comma-joined); Dictionary.response.fields[i] → per-column entries keyed by name, value = LLM-generated description, label folded into source so the popover surfaces both ("Stock Keeping Unit (SKU)" + long description).

Captured a real qsv envelope as tests/fixtures/qsv_describegpt_sample.json so the happy-path test now exercises actual qsv output, not a synthetic shape. Future qsv schema changes that break the reshape get caught here.

Wrapper: two new optional params for non-localhost LLMs

api_key and base_url. qsv 20.0.0 refuses to start when the resolved base URL doesn't contain localhost AND no API key is set:

Neither QSV_LLM_BASE_URL nor QSV_LLM_APIKEY environment variables are set.
Either set `--base-url` to an address with "localhost" in it, or set `--api-key`.

For DP+ specifically, the worker container reaches LM Studio / Ollama on the host via host.docker.internal — not localhost — so this triggers every time. Documented workaround: set QSV_LLM_APIKEY=NONE in the worker environment or pass api_key="NONE" per call; qsv accepts NONE as the explicit "this is unauthenticated, get on with it" sentinel.

`config_declaration.yaml` updates

Operators now have a clear path for both:

Cloud LLM (OpenAI / OpenRouter / etc.): set QSV_LLM_APIKEY in the worker container's environment.
Local LLM via container-host hostname (Ollama / LM Studio on host.docker.internal): set QSV_LLM_APIKEY=NONE.

Also clarified that format = "JSON" should be set in the prompt-file — qsv emits Markdown by default, which the stage's JSON parse would reject (stage is non-blocking so it skips silently — easy to miss).

Test plan

39 tests/test_ai_suggestions.py tests pass (was 36); +3 for new wrapper params (api_key, base_url, omitted-by-default), +2 for partial-envelope and forward-compat reshape, -2 for the synthetic-shape happy-path that was replaced.
Full unit suite: 166 (was 163). No regressions.
End-to-end verified: qsvdp 20.0.0 inside dpp-test container → host.docker.internal:1234/v1 → LM Studio (gemma-4-e4b) → real JSON envelope → reshape → on-disk ai_suggestions with description, tags, sku, name, price, quantity_in_stock entries. Each has {value, source} with label folded into source. This is the first time the feature has actually been confirmed working end-to-end against a real LLM.

Lesson

Mocking subprocess.run with a synthetic JSON shape is a false confidence pattern. The 36 passing unit tests on main couldn't catch either bug because nothing in the test loop ever talked to a real qsv binary. The captured fixture committed here is a partial mitigation; a true integration test (real qsv + a stubbed LLM) would catch more. Tracked as a follow-up.

🤖 Generated with Claude Code

…302 follow-up) End-to-end verification against LM Studio (gemma-4-e4b on host.docker.internal:1234) surfaced two real bugs in the AI suggestions feature that landed in #301 + #302. The unit suite passed because subprocess.run was mocked with a synthetic JSON shape; nothing exercised qsv's actual CLI surface or output schema. Bug 1: ``--json`` flag doesn't exist in qsv 20.0.0 PR #301's ``QSVCommand.describegpt`` wrapper unconditionally emitted ``--json`` → qsv rejected every real call with ``Unknown flag: '--json'``. The correct flag is ``--format JSON`` (qsv 20.0.0 supports Markdown / TSV / JSON / TOON via a single ``--format`` flag). Fix: wrapper now emits ``--format JSON`` when ``json_output=True``. Test asserts both the presence of ``--format JSON`` and the absence of ``--json`` so this regression can't reland silently. Bug 2: ``_reshape_for_ui`` assumed wrong output schema PR #301/#302 assumed qsv emits a flat shape like ``{description: "...", tags: [...], dictionary: [{field, summary}]}``. Real qsv 20.0.0 output (validated against LM Studio) is a wrapped envelope with PascalCase top-level keys:: {"Dictionary": {"response": {"fields": [...], ...}, "reasoning": "...", "token_usage": {...}}, "Description": {"response": "<markdown>", "reasoning": "...", "token_usage": {...}}, "Tags": {"response": {"tags": [...], "attribution": "..."}, "reasoning": "...", "token_usage": {...}, "num_tags": N}} Even if Bug 1 were fixed, the reshape would always produce an empty dict because none of the assumed top-level keys would match. The UI would never see populated suggestions. Fix: rewrote ``_reshape_for_ui`` to walk the real envelope: ``Description.response`` → top-level ``description``, ``Tags.response.tags`` → top-level ``tags`` (comma-joined), ``Dictionary.response.fields[i]`` → per-column entries keyed by ``name`` with the LLM-generated ``description`` as value and the ``label`` folded into source (so the popover surfaces both "Stock Keeping Unit (SKU)" and the long description). Committed a real qsv JSON envelope as ``tests/fixtures/qsv_describegpt_sample.json`` so the ``test_process_happy_path_persists_suggestions`` test now exercises actual qsv output, not a synthetic shape. Future qsv schema changes that break the reshape get caught here. Wrapper: two new optional params for non-localhost LLMs ``api_key`` and ``base_url``. qsv 20.0.0 refuses to start when the resolved base URL doesn't contain ``localhost`` AND no API key is set (``Neither QSV_LLM_BASE_URL nor QSV_LLM_APIKEY environment variables are set``). For DP+ specifically, the worker container reaches LM Studio / Ollama on the host via ``host.docker.internal`` — not localhost — so this triggers every time. Documented workaround: set ``QSV_LLM_APIKEY=NONE`` in the worker environment OR pass ``api_key="NONE"`` per call; qsv accepts ``NONE`` as the explicit "this is unauthenticated, get on with it" sentinel. config_declaration.yaml: expanded ``describegpt_config_path`` docs Operators now have a clear path for both cloud LLMs (``QSV_LLM_APIKEY`` env) and local LLMs reached via container-host hostnames (``QSV_LLM_APIKEY=NONE``). Also clarified that ``format = "JSON"`` should be set in the prompt-file — qsv emits Markdown by default, which the stage's JSON parse would reject. Test plan * 39 ``tests/test_ai_suggestions.py`` tests pass (was 36); +3 for the new wrapper params, partial-envelope reshape, and forward- compat unknown-key handling. Full unit suite: 166 (was 163). * **End-to-end verified**: qsvdp 20.0.0 inside the dpp-test container → ``host.docker.internal:1234/v1`` → LM Studio (gemma-4-e4b) → real JSON envelope → reshape → on-disk ``ai_suggestions`` with ``description``, ``tags``, ``sku``, ``name``, ``price``, ``quantity_in_stock`` entries, each with ``value`` + ``source`` (label folded into source). This is the first time the feature has actually been confirmed working end-to-end against a real LLM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four findings (1 Medium, 3 Low) from the live review of the PR-#303 qsv schema fixes. All real; all addressed. * MEDIUM — column-name vs dataset-key collision (ai_suggestions.py:340): ``_reshape_for_ui`` was writing dataset-level ``description`` / ``tags`` first, then iterating ``Dictionary.response.fields`` and unconditionally assigning ``out[name] = ...``. A CSV column literally named ``description`` / ``tags`` / ``STATUS`` / ``generated_at`` (the bookkeeping keys) would silently overwrite the dataset-level entry or break polling termination. Fix: skip per-column entries whose ``name`` collides with the new module-level ``_RESERVED_AI_KEYS`` frozenset, with a warning log. Kept the flat-map shape (vs. nesting under ``["fields"]``) because the polling JS does ``Object.keys(aiSuggestions).forEach`` against ``[data-field-name=X]`` selectors — adding a sub-key would break that contract. Added the regression test the prior review requested. * LOW — label-only fallback was silent (ai_suggestions.py:362): When a dictionary entry had ``label`` but no ``description``, the reshape set ``value = label`` with no marker in ``source``. Operators would see the label in the popover with no indication it wasn't a real description. Fix: when falling back to label, set ``source = f"{base_source} (label only)"`` so the popover surfaces the distinction. The description→label fallback chain stays (description preferred; label-only is better than no suggestion). * LOW — tags with commas (ai_suggestions.py:355): ``", ".join(...)`` produces ambiguous output when a tag itself contains a comma (e.g. ``"retail, b2b"``). scheming's tag parser would split that into two tags. Unlikely from typical LLM output but defensible. Fix: filter out comma-containing tags before joining, with a warning log per dropped tag. When ALL tags are filtered, the ``tags`` entry isn't written at all (vs. an empty-string entry rendering as a stray "" in the popover). * LOW — truthy guards on api_key / base_url (qsv_utils.py:747): ``if api_key:`` would silently drop an explicit empty string, contradicting the docstring contract that ``None`` is the "omit" sentinel. Fix: ``if api_key is not None:`` (same for ``base_url``). Defensive only; passing ``""`` is nonsensical for these flags, but the wrapper now matches its documented contract. Tests: 44 in tests/test_ai_suggestions.py (was 39); +5 for the new behaviors (column-name collision skip, label-only marker, two tag-comma filter cases, empty-string api_key passthrough). Full unit suite: 171 (was 166). E2E re-verified against LM Studio (gemma-4-e4b) — the reshape still produces the expected description/tags/sku/name/price/quantity_in_stock entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two LOW findings from the re-review of the prior roborev fix commit. Both cleanup, no behavior change. * LOW — inline ``import logging`` in two places (ai_suggestions.py:372,403): The previous commit added two local-scope ``import logging`` statements inside ``_reshape_for_ui``. Function-local imports are cheap after the first call thanks to ``sys.modules`` caching but linters flag them and they split the import block. Fix: moved ``import logging`` to the module-level imports; defined ``_LOG = logging.getLogger(__name__)`` at module scope; replaced both inline ``logging.getLogger(__name__).warning(...)`` calls with ``_LOG.warning(...)``. * LOW — undocumented case-sensitivity of ``_RESERVED_AI_KEYS`` (ai_suggestions.py:41): The set is case-sensitive on purpose: qsv preserves CSV header casing in ``Dictionary.response.fields[i].name``, and the polling JS uses case-sensitive ``[data-field-name="X"]`` selectors. A column ``Description`` (capital D) creates ``ai_suggestions["Description"]`` which the lowercase ``description`` data-field-name doesn't pick up — so the exact-case match is exactly the right granularity. Without a comment a future maintainer could "fix" this by lowercasing both sides and reintroduce the very collision the guard prevents. Fix: added a multi-line NOTE comment near ``_RESERVED_AI_KEYS`` explaining why case-sensitivity is deliberate. Tests: 44 ai-suggestions tests still pass (no test changes — pure refactor + comment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR fixes two real bugs in the AI suggestions feature (introduced in #301/#302) that were uncovered during end-to-end testing against LM Studio. The unit tests previously used a synthetic mocked JSON shape that didn't match qsv 20.0.0's actual CLI surface or output schema, so the bugs slipped through.

Changes:

Fix QSVCommand.describegpt to emit --format JSON instead of the non-existent --json flag, and add new optional api_key / base_url params (with NONE sentinel support for non-localhost local LLMs like host.docker.internal).
Rewrite _reshape_for_ui in AISuggestionsStage to walk qsv's actual wrapped PascalCase envelope (Dictionary/Description/Tags with response sub-objects), filter comma-containing tags, and guard against per-column entries colliding with reserved dataset-level/bookkeeping keys.
Capture a real qsv envelope as a test fixture and expand tests (partial envelope, forward-compat, label-only, comma-tag filtering, reserved-key collisions); update config_declaration.yaml with cloud-vs-local LLM auth guidance and format = "JSON" requirement.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`ckanext/datapusher_plus/qsv_utils.py`	Switch `--json` → `--format JSON`; add `api_key`/`base_url` params with `None`-omit semantics; expand docstring with the real envelope shape.
`ckanext/datapusher_plus/jobs/stages/ai_suggestions.py`	Rewrite `_reshape_for_ui` to consume the wrapped envelope; add reserved-key guard, comma-tag filter, label-only fallback.
`ckanext/datapusher_plus/config_declaration.yaml`	Document `QSV_LLM_APIKEY` (cloud) vs `QSV_LLM_APIKEY=NONE` (local non-localhost) and the required `format = "JSON"` prompt-file setting.
`tests/test_ai_suggestions.py`	Replace synthetic happy-path fixture with real captured envelope; add tests for new params, reserved keys, comma tags, label-only, forward-compat.
`tests/fixtures/qsv_describegpt_sample.json`	Real qsv 20.0.0 describegpt output captured against LM Studio for use as the happy-path fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jqnatividad and others added 3 commits May 17, 2026 06:47

jqnatividad requested a review from Copilot May 17, 2026 11:43

Copilot started reviewing on behalf of jqnatividad May 17, 2026 11:43 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

jqnatividad merged commit 4b2eb75 into main May 17, 2026
5 checks passed

jqnatividad deleted the fix-describegpt-format-flag branch May 17, 2026 11:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up)#303

fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up)#303
jqnatividad merged 3 commits into
mainfrom
fix-describegpt-format-flag

jqnatividad commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jqnatividad commented May 17, 2026

Summary

Bug 1: --json flag doesn't exist in qsv 20.0.0

Bug 2: _reshape_for_ui assumed wrong output schema

Wrapper: two new optional params for non-localhost LLMs

config_declaration.yaml updates

Test plan

Lesson

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bug 1: `--json` flag doesn't exist in qsv 20.0.0

Bug 2: `_reshape_for_ui` assumed wrong output schema

`config_declaration.yaml` updates