feat(ai-suggestions): optional qsv describegpt stage for LLM-derived metadata by jqnatividad · Pull Request #301 · dathere/datapusher-plus

jqnatividad · 2026-05-17T04:08:54Z

Summary

Re-implements the AI suggestions slice of PR #253 against the current Prefect pipeline. Last of the three replacement PRs for #253 (after #298 spatial extent and #299 resubmit-CLI robustness).

Stays narrowly backend-scoped: qsv subprocess wrapper, new pipeline stage, Prefect task wiring, a template helper, and unit tests. The UI port from #253 (~420 lines of JS + form-snippet plumbing) is a deliberate follow-up.

What's added

File	Change
`qsv_utils.py`	New `QSVCommand.describegpt()` wrapper, patterned after `stats()` / `frequency()`. qsv owns the LLM round-trip (endpoint, model, prompt, API key all in qsv's own config / env, NOT in ckan.ini).
`jobs/stages/ai_suggestions.py` (NEW)	`AISuggestionsStage`. Opt-in via `ENABLE_AI_SUGGESTIONS` (default False). Non-blocking by design: every failure mode (subprocess, qsv timeout, malformed JSON, package-patch error) logs a warning and returns context unchanged.
`jobs/prefect_flow.py`	New `ai_suggestions_task` chained between `analyze_task` and the database `transaction()` block. Outside the transaction so a (rare) failure can't poison `@on_rollback` hooks. Returns input `AnalyzeResult` unchanged.
`config.py` + `config_declaration.yaml`	Three new keys: `enable_ai_suggestions` (default False), `describegpt_config_path` (qsv prompt/config file), `describegpt_timeout_seconds` (default 120 — distinct from `QSV_COMMAND_TIMEOUT` so a hung LLM doesn't pin a worker for 30 min).
`helpers.py` + `plugin.py`	`scheming_get_ai_suggestion` helper parallel to `scheming_get_suggestion_value`. Direct top-level lookup (`description`/`tags`) → walks `dictionary` for `{"field": ..., "summary": ...}` match (with `description` as alt of `summary`).
`tests/test_ai_suggestions.py` (NEW)	22 unit tests covering wrapper arg assembly, stage gate, happy + failure paths, helper lookup shapes.

Output shape

When the stage succeeds, the package picks up:

{
  "dpp_suggestions": {
    "ai_suggestions": {
      "description": "A dataset of widgets.",
      "tags": ["widgets", "production"],
      "dictionary": [
        {"field": "sku", "summary": "Stock keeping unit"}
      ],
      "generated_at": "2026-05-17T03:55:00Z"
    }
  }
}

Sits alongside (NOT inside) dpp_suggestions["package"] (formula-derived suggestions) — they co-exist by design.

What we deliberately skipped

Per the #253 re-land plan:

Bespoke Python LLM client (+449 lines in Added dpp spatial extent #253) — qsv describegpt already speaks any OpenAI-compatible endpoint (Ollama, OpenRouter, OpenAI, vLLM).
The 10 PR-Added dpp spatial extent #253 config keys (OPENROUTER_API_KEY, AI_TEMPERATURE, AI_MAX_TOKENS, …) — those live in qsv's describegpt config file.
The PR-Added dpp spatial extent #253 fallback heuristic — stage is opt-in + non-blocking, so a failed describegpt IS the heuristic.
The 420-line JS port + form snippets from Added dpp spatial extent #253 — follow-up PR. The backend exposes everything needed (helper, populated ai_suggestions shape, opt-in flag); the JS is just UX plumbing.

Test plan

22 new unit tests pass in dpp-test container (mocked qsv + dsu calls; no real subprocess / CKAN required).
Full unit suite: 149 passed (was 127, +22 new). No regressions.
ENABLE_AI_SUGGESTIONS=False (default): should_skip short-circuits; pipeline unchanged.
Stage syntax / Prefect task imports parse cleanly.
End-to-end verification against Ollama / OpenRouter — needs an operator with a live LLM endpoint configured. Stage is opt-in so unverified-but-merged is safe.

Tracking

Closes the AI-suggestions sub-task on #259 ("DP+ 3.0 tracking issue"). #253 stays open for the UI-port follow-up.

Co-Authored-By: Minhajuddin Mohammed minhajuddin2510@gmail.com

🤖 Generated with Claude Code

…metadata Re-implements the AI suggestions slice of PR #253 against the current Prefect pipeline. Stays narrowly backend-scoped: the qsv subprocess wrapper, the new pipeline stage, the Prefect task wiring, a template helper, and unit tests. The UI port from PR #253 (~420 lines of JS + form snippet plumbing) is a deliberate follow-up. What * ``qsv_utils.QSVCommand.describegpt()`` — thin wrapper around ``qsv describegpt --description --dictionary --tags --json``, patterned after the existing ``stats()`` / ``frequency()`` methods. qsv owns the LLM round-trip (endpoint, model, prompt template, API key all live in qsv's own config or env, deliberately NOT in ckan.ini). * ``jobs/stages/ai_suggestions.py`` — new ``AISuggestionsStage`` (``BaseStage``). Opt-in via ``ENABLE_AI_SUGGESTIONS`` (default False). Non-blocking by design: every internal failure mode (subprocess error, qsv timeout, malformed JSON, package-patch error) drops a warning and returns the context unchanged so the rest of the pipeline keeps going. Writes results under ``package["dpp_suggestions"]["ai_suggestions"]`` — a separate sub-key from the formula-derived ``["package"]`` namespace so they can co-exist. * ``jobs/prefect_flow.py`` — new ``ai_suggestions_task`` chained between ``analyze_task`` and the database ``transaction()`` block. Placement outside the transaction prevents a (rare) failure from poisoning the per-task ``@on_rollback`` hooks; the task returns its input ``AnalyzeResult`` unchanged so downstream tasks see the same shape they would without it. * Three new config keys (``config.py`` + ``config_declaration.yaml``): ``enable_ai_suggestions`` (default False), ``describegpt_config_path`` (path to qsv's prompt/config file, default empty → qsv's own discovery), and ``describegpt_timeout_seconds`` (default 120, distinct from the 1800s ``QSV_COMMAND_TIMEOUT`` so a hung LLM doesn't pin a worker). * ``helpers.scheming_get_ai_suggestion`` + ``plugin.py`` wiring — scheming-form helper parallel to ``scheming_get_suggestion_value``. Looks up direct top-level keys (``description`` / ``tags``) first, then walks ``dictionary`` for a per-field ``{"field": ..., "summary": ...}`` match (with ``description`` as an alternate of ``summary``). What we deliberately skipped (out of scope per #253 re-land plan) * The bespoke Python LLM client from #253 — qsv describegpt already speaks any OpenAI-compatible endpoint (Ollama, OpenRouter, OpenAI, vLLM). No DP+ HTTP code needed. * The 10 PR-#253 config keys (``OPENROUTER_API_KEY``, ``AI_TEMPERATURE``, ``AI_MAX_TOKENS``, …). Those live in qsv's describegpt config file, not in ours. * The PR-#253 fallback heuristic — stage is opt-in and non-blocking already, so a failed describegpt is the heuristic. * The 420-line JS port from PR #253 + form-snippet templates — that's the follow-up PR. The backend exposes everything needed (helper, populated ``ai_suggestions`` shape, opt-in flag); the JS is just UX plumbing. Test plan * 22 new unit tests in ``tests/test_ai_suggestions.py`` (mocked qsv + ``dsu`` calls, no real subprocess / CKAN required). Full unit suite now 149 passed (was 127). * With ``enable_ai_suggestions=False`` (default): pipeline runs unchanged — ``should_skip`` short-circuits. * With it enabled + working describegpt config: stage writes ``package["dpp_suggestions"]["ai_suggestions"]``; downstream ``MetadataStage`` writes it back via the normal package-update path. * With it enabled but qsv errors: stage logs a warning, pipeline completes; ``ai_suggestions`` key simply absent. Tracking issue #259 ("DP+ 3.0 tracking issue" → AI-suggestions sub-task) can be checked off once this lands. Co-Authored-By: Minhajuddin Mohammed <minhajuddin2510@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds an opt-in “AI suggestions” step to the DataPusher Plus Prefect pipeline by wrapping qsv describegpt, persisting the returned JSON onto the CKAN package under dpp_suggestions.ai_suggestions, and exposing a scheming helper to read those suggestions in templates.

Changes:

Introduces QSVCommand.describegpt() and a new AISuggestionsStage that shells out to qsv describegpt and patches the CKAN package with the results.
Wires a new Prefect task (ai_suggestions_task) between analysis and the transactional database stages; adds config keys to gate/parameterize the feature.
Adds scheming_get_ai_suggestion helper + helper registration, plus a new unit-test suite for the feature.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`tests/test_ai_suggestions.py`	New unit tests for qsv wrapper args, stage gating/failure behavior, and helper lookup logic.
`ckanext/datapusher_plus/qsv_utils.py`	Adds `QSVCommand.describegpt()` wrapper with config-default timeout support.
`ckanext/datapusher_plus/plugin.py`	Registers the new `scheming_get_ai_suggestion` template helper.
`ckanext/datapusher_plus/jobs/stages/ai_suggestions.py`	New optional pipeline stage that runs describegpt and persists results to the package.
`ckanext/datapusher_plus/jobs/prefect_flow.py`	Adds a Prefect task and flow wiring to run the AI suggestions stage after analysis.
`ckanext/datapusher_plus/helpers.py`	Adds `scheming_get_ai_suggestion` helper to read AI suggestions from package metadata.
`ckanext/datapusher_plus/config.py`	Adds config keys for enabling AI suggestions, config path, and a dedicated describegpt timeout.
`ckanext/datapusher_plus/config_declaration.yaml`	Documents the new configuration keys for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Three follow-ups from Copilot's inline review: * ai_suggestions.py: reuse ``context.qsv`` (the per-flow-run ``QSVCommand`` constructed in ``_build_runtime_context``) instead of building a fresh one. Saves the binary-existence check + ``qsv --version`` call on every flow. ``getattr(context, "qsv", None) or QSVCommand(logger=context.logger)`` keeps unit-test stand-ins (which pass a ``SimpleNamespace`` without ``qsv``) working. * ai_suggestions.py: when ``package["dpp_suggestions"]`` is present but not a dict (legacy / corrupted state), log a warning and skip the write rather than resetting to ``{}``. The old "reset" behaviour was destructive — it would silently nuke any pre-existing data (formula-derived suggestions, custom-plugin scalars, …) for what is supposed to be a bonus, non-blocking stage. ``None`` still gets initialised to ``{}`` (happy path when scheming/FormulaStage hasn't run); only the "present but wrong type" case bails out. * config_declaration.yaml: aligned the ``describegpt_config_path`` description to match the wrapper docstring (and reality) — qsv's ``--prompt-file`` flag expects TOML, not JSON. Dropped the misleading ``~/.qsv/describegpt.json`` claim; replaced with the actual qsv discovery path (``dp.prompt`` / ``dp_prompt.toml`` next to the binary, fallback to ``OPENAI_API_KEY``). Tests: the old reset-and-write test (``test_process_resets_corrupted_dpp_suggestions``) is replaced with two more accurate tests — ``test_process_skips_when_dpp_suggestions_is_non_dict`` (asserts the original value is preserved and ``patch_package`` is NOT called) and ``test_process_initializes_missing_dpp_suggestions`` (covers the ``None`` → init path). Unit suite still 23/23 on this file, 150/150 overall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…302 follow-up) End-to-end verification against LM Studio (gemma-4-e4b on host.docker.internal:1234) surfaced two real bugs in the AI suggestions feature that landed in #301 + #302. The unit suite passed because subprocess.run was mocked with a synthetic JSON shape; nothing exercised qsv's actual CLI surface or output schema. Bug 1: ``--json`` flag doesn't exist in qsv 20.0.0 PR #301's ``QSVCommand.describegpt`` wrapper unconditionally emitted ``--json`` → qsv rejected every real call with ``Unknown flag: '--json'``. The correct flag is ``--format JSON`` (qsv 20.0.0 supports Markdown / TSV / JSON / TOON via a single ``--format`` flag). Fix: wrapper now emits ``--format JSON`` when ``json_output=True``. Test asserts both the presence of ``--format JSON`` and the absence of ``--json`` so this regression can't reland silently. Bug 2: ``_reshape_for_ui`` assumed wrong output schema PR #301/#302 assumed qsv emits a flat shape like ``{description: "...", tags: [...], dictionary: [{field, summary}]}``. Real qsv 20.0.0 output (validated against LM Studio) is a wrapped envelope with PascalCase top-level keys:: {"Dictionary": {"response": {"fields": [...], ...}, "reasoning": "...", "token_usage": {...}}, "Description": {"response": "<markdown>", "reasoning": "...", "token_usage": {...}}, "Tags": {"response": {"tags": [...], "attribution": "..."}, "reasoning": "...", "token_usage": {...}, "num_tags": N}} Even if Bug 1 were fixed, the reshape would always produce an empty dict because none of the assumed top-level keys would match. The UI would never see populated suggestions. Fix: rewrote ``_reshape_for_ui`` to walk the real envelope: ``Description.response`` → top-level ``description``, ``Tags.response.tags`` → top-level ``tags`` (comma-joined), ``Dictionary.response.fields[i]`` → per-column entries keyed by ``name`` with the LLM-generated ``description`` as value and the ``label`` folded into source (so the popover surfaces both "Stock Keeping Unit (SKU)" and the long description). Committed a real qsv JSON envelope as ``tests/fixtures/qsv_describegpt_sample.json`` so the ``test_process_happy_path_persists_suggestions`` test now exercises actual qsv output, not a synthetic shape. Future qsv schema changes that break the reshape get caught here. Wrapper: two new optional params for non-localhost LLMs ``api_key`` and ``base_url``. qsv 20.0.0 refuses to start when the resolved base URL doesn't contain ``localhost`` AND no API key is set (``Neither QSV_LLM_BASE_URL nor QSV_LLM_APIKEY environment variables are set``). For DP+ specifically, the worker container reaches LM Studio / Ollama on the host via ``host.docker.internal`` — not localhost — so this triggers every time. Documented workaround: set ``QSV_LLM_APIKEY=NONE`` in the worker environment OR pass ``api_key="NONE"`` per call; qsv accepts ``NONE`` as the explicit "this is unauthenticated, get on with it" sentinel. config_declaration.yaml: expanded ``describegpt_config_path`` docs Operators now have a clear path for both cloud LLMs (``QSV_LLM_APIKEY`` env) and local LLMs reached via container-host hostnames (``QSV_LLM_APIKEY=NONE``). Also clarified that ``format = "JSON"`` should be set in the prompt-file — qsv emits Markdown by default, which the stage's JSON parse would reject. Test plan * 39 ``tests/test_ai_suggestions.py`` tests pass (was 36); +3 for the new wrapper params, partial-envelope reshape, and forward- compat unknown-key handling. Full unit suite: 166 (was 163). * **End-to-end verified**: qsvdp 20.0.0 inside the dpp-test container → ``host.docker.internal:1234/v1`` → LM Studio (gemma-4-e4b) → real JSON envelope → reshape → on-disk ``ai_suggestions`` with ``description``, ``tags``, ``sku``, ``name``, ``price``, ``quantity_in_stock`` entries, each with ``value`` + ``source`` (label folded into source). This is the first time the feature has actually been confirmed working end-to-end against a real LLM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up)

jqnatividad requested a review from Copilot May 17, 2026 04:20

Copilot started reviewing on behalf of jqnatividad May 17, 2026 04:21 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

Comment thread ckanext/datapusher_plus/jobs/stages/ai_suggestions.py Outdated

Comment thread ckanext/datapusher_plus/jobs/stages/ai_suggestions.py Outdated

Comment thread ckanext/datapusher_plus/config_declaration.yaml Outdated

jqnatividad merged commit fdc3714 into main May 17, 2026
1 check passed

jqnatividad deleted the ai-suggestions-describegpt branch May 17, 2026 04:38

This was referenced May 17, 2026

Added dpp spatial extent #253

Closed

DP+ 3.0 tracking issue #259

Open

feat(ai-suggestions): scheming UI for AI-derived metadata (PR #253 follow-up) #302

Merged

fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up) #303

Merged

jqnatividad added a commit that referenced this pull request May 17, 2026

Merge pull request #303 from dathere/fix-describegpt-format-flag

4b2eb75

fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ai-suggestions): optional qsv describegpt stage for LLM-derived metadata#301

feat(ai-suggestions): optional qsv describegpt stage for LLM-derived metadata#301
jqnatividad merged 2 commits into
mainfrom
ai-suggestions-describegpt

jqnatividad commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jqnatividad commented May 17, 2026

Summary

What's added

Output shape

What we deliberately skipped

Test plan

Tracking

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants