feat(ai-suggestions): optional qsv describegpt stage for LLM-derived metadata#301
Merged
Conversation
…metadata Re-implements the AI suggestions slice of PR #253 against the current Prefect pipeline. Stays narrowly backend-scoped: the qsv subprocess wrapper, the new pipeline stage, the Prefect task wiring, a template helper, and unit tests. The UI port from PR #253 (~420 lines of JS + form snippet plumbing) is a deliberate follow-up. What * ``qsv_utils.QSVCommand.describegpt()`` — thin wrapper around ``qsv describegpt --description --dictionary --tags --json``, patterned after the existing ``stats()`` / ``frequency()`` methods. qsv owns the LLM round-trip (endpoint, model, prompt template, API key all live in qsv's own config or env, deliberately NOT in ckan.ini). * ``jobs/stages/ai_suggestions.py`` — new ``AISuggestionsStage`` (``BaseStage``). Opt-in via ``ENABLE_AI_SUGGESTIONS`` (default False). Non-blocking by design: every internal failure mode (subprocess error, qsv timeout, malformed JSON, package-patch error) drops a warning and returns the context unchanged so the rest of the pipeline keeps going. Writes results under ``package["dpp_suggestions"]["ai_suggestions"]`` — a separate sub-key from the formula-derived ``["package"]`` namespace so they can co-exist. * ``jobs/prefect_flow.py`` — new ``ai_suggestions_task`` chained between ``analyze_task`` and the database ``transaction()`` block. Placement outside the transaction prevents a (rare) failure from poisoning the per-task ``@on_rollback`` hooks; the task returns its input ``AnalyzeResult`` unchanged so downstream tasks see the same shape they would without it. * Three new config keys (``config.py`` + ``config_declaration.yaml``): ``enable_ai_suggestions`` (default False), ``describegpt_config_path`` (path to qsv's prompt/config file, default empty → qsv's own discovery), and ``describegpt_timeout_seconds`` (default 120, distinct from the 1800s ``QSV_COMMAND_TIMEOUT`` so a hung LLM doesn't pin a worker). * ``helpers.scheming_get_ai_suggestion`` + ``plugin.py`` wiring — scheming-form helper parallel to ``scheming_get_suggestion_value``. Looks up direct top-level keys (``description`` / ``tags``) first, then walks ``dictionary`` for a per-field ``{"field": ..., "summary": ...}`` match (with ``description`` as an alternate of ``summary``). What we deliberately skipped (out of scope per #253 re-land plan) * The bespoke Python LLM client from #253 — qsv describegpt already speaks any OpenAI-compatible endpoint (Ollama, OpenRouter, OpenAI, vLLM). No DP+ HTTP code needed. * The 10 PR-#253 config keys (``OPENROUTER_API_KEY``, ``AI_TEMPERATURE``, ``AI_MAX_TOKENS``, …). Those live in qsv's describegpt config file, not in ours. * The PR-#253 fallback heuristic — stage is opt-in and non-blocking already, so a failed describegpt is the heuristic. * The 420-line JS port from PR #253 + form-snippet templates — that's the follow-up PR. The backend exposes everything needed (helper, populated ``ai_suggestions`` shape, opt-in flag); the JS is just UX plumbing. Test plan * 22 new unit tests in ``tests/test_ai_suggestions.py`` (mocked qsv + ``dsu`` calls, no real subprocess / CKAN required). Full unit suite now 149 passed (was 127). * With ``enable_ai_suggestions=False`` (default): pipeline runs unchanged — ``should_skip`` short-circuits. * With it enabled + working describegpt config: stage writes ``package["dpp_suggestions"]["ai_suggestions"]``; downstream ``MetadataStage`` writes it back via the normal package-update path. * With it enabled but qsv errors: stage logs a warning, pipeline completes; ``ai_suggestions`` key simply absent. Tracking issue #259 ("DP+ 3.0 tracking issue" → AI-suggestions sub-task) can be checked off once this lands. Co-Authored-By: Minhajuddin Mohammed <minhajuddin2510@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds an opt-in “AI suggestions” step to the DataPusher Plus Prefect pipeline by wrapping qsv describegpt, persisting the returned JSON onto the CKAN package under dpp_suggestions.ai_suggestions, and exposing a scheming helper to read those suggestions in templates.
Changes:
- Introduces
QSVCommand.describegpt()and a newAISuggestionsStagethat shells out toqsv describegptand patches the CKAN package with the results. - Wires a new Prefect task (
ai_suggestions_task) between analysis and the transactional database stages; adds config keys to gate/parameterize the feature. - Adds
scheming_get_ai_suggestionhelper + helper registration, plus a new unit-test suite for the feature.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_ai_suggestions.py |
New unit tests for qsv wrapper args, stage gating/failure behavior, and helper lookup logic. |
ckanext/datapusher_plus/qsv_utils.py |
Adds QSVCommand.describegpt() wrapper with config-default timeout support. |
ckanext/datapusher_plus/plugin.py |
Registers the new scheming_get_ai_suggestion template helper. |
ckanext/datapusher_plus/jobs/stages/ai_suggestions.py |
New optional pipeline stage that runs describegpt and persists results to the package. |
ckanext/datapusher_plus/jobs/prefect_flow.py |
Adds a Prefect task and flow wiring to run the AI suggestions stage after analysis. |
ckanext/datapusher_plus/helpers.py |
Adds scheming_get_ai_suggestion helper to read AI suggestions from package metadata. |
ckanext/datapusher_plus/config.py |
Adds config keys for enabling AI suggestions, config path, and a dedicated describegpt timeout. |
ckanext/datapusher_plus/config_declaration.yaml |
Documents the new configuration keys for operators. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Three follow-ups from Copilot's inline review:
* ai_suggestions.py: reuse ``context.qsv`` (the per-flow-run
``QSVCommand`` constructed in ``_build_runtime_context``) instead
of building a fresh one. Saves the binary-existence check + ``qsv
--version`` call on every flow. ``getattr(context, "qsv", None) or
QSVCommand(logger=context.logger)`` keeps unit-test stand-ins (which
pass a ``SimpleNamespace`` without ``qsv``) working.
* ai_suggestions.py: when ``package["dpp_suggestions"]`` is present
but not a dict (legacy / corrupted state), log a warning and skip
the write rather than resetting to ``{}``. The old "reset" behaviour
was destructive — it would silently nuke any pre-existing data
(formula-derived suggestions, custom-plugin scalars, …) for what is
supposed to be a bonus, non-blocking stage. ``None`` still gets
initialised to ``{}`` (happy path when scheming/FormulaStage hasn't
run); only the "present but wrong type" case bails out.
* config_declaration.yaml: aligned the
``describegpt_config_path`` description to match the wrapper
docstring (and reality) — qsv's ``--prompt-file`` flag expects
TOML, not JSON. Dropped the misleading
``~/.qsv/describegpt.json`` claim; replaced with the actual qsv
discovery path (``dp.prompt`` / ``dp_prompt.toml`` next to the
binary, fallback to ``OPENAI_API_KEY``).
Tests: the old reset-and-write test
(``test_process_resets_corrupted_dpp_suggestions``) is replaced
with two more accurate tests — ``test_process_skips_when_dpp_suggestions_is_non_dict``
(asserts the original value is preserved and ``patch_package`` is
NOT called) and ``test_process_initializes_missing_dpp_suggestions``
(covers the ``None`` → init path). Unit suite still 23/23 on this
file, 150/150 overall.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 17, 2026
jqnatividad
added a commit
that referenced
this pull request
May 17, 2026
…302 follow-up) End-to-end verification against LM Studio (gemma-4-e4b on host.docker.internal:1234) surfaced two real bugs in the AI suggestions feature that landed in #301 + #302. The unit suite passed because subprocess.run was mocked with a synthetic JSON shape; nothing exercised qsv's actual CLI surface or output schema. Bug 1: ``--json`` flag doesn't exist in qsv 20.0.0 PR #301's ``QSVCommand.describegpt`` wrapper unconditionally emitted ``--json`` → qsv rejected every real call with ``Unknown flag: '--json'``. The correct flag is ``--format JSON`` (qsv 20.0.0 supports Markdown / TSV / JSON / TOON via a single ``--format`` flag). Fix: wrapper now emits ``--format JSON`` when ``json_output=True``. Test asserts both the presence of ``--format JSON`` and the absence of ``--json`` so this regression can't reland silently. Bug 2: ``_reshape_for_ui`` assumed wrong output schema PR #301/#302 assumed qsv emits a flat shape like ``{description: "...", tags: [...], dictionary: [{field, summary}]}``. Real qsv 20.0.0 output (validated against LM Studio) is a wrapped envelope with PascalCase top-level keys:: {"Dictionary": {"response": {"fields": [...], ...}, "reasoning": "...", "token_usage": {...}}, "Description": {"response": "<markdown>", "reasoning": "...", "token_usage": {...}}, "Tags": {"response": {"tags": [...], "attribution": "..."}, "reasoning": "...", "token_usage": {...}, "num_tags": N}} Even if Bug 1 were fixed, the reshape would always produce an empty dict because none of the assumed top-level keys would match. The UI would never see populated suggestions. Fix: rewrote ``_reshape_for_ui`` to walk the real envelope: ``Description.response`` → top-level ``description``, ``Tags.response.tags`` → top-level ``tags`` (comma-joined), ``Dictionary.response.fields[i]`` → per-column entries keyed by ``name`` with the LLM-generated ``description`` as value and the ``label`` folded into source (so the popover surfaces both "Stock Keeping Unit (SKU)" and the long description). Committed a real qsv JSON envelope as ``tests/fixtures/qsv_describegpt_sample.json`` so the ``test_process_happy_path_persists_suggestions`` test now exercises actual qsv output, not a synthetic shape. Future qsv schema changes that break the reshape get caught here. Wrapper: two new optional params for non-localhost LLMs ``api_key`` and ``base_url``. qsv 20.0.0 refuses to start when the resolved base URL doesn't contain ``localhost`` AND no API key is set (``Neither QSV_LLM_BASE_URL nor QSV_LLM_APIKEY environment variables are set``). For DP+ specifically, the worker container reaches LM Studio / Ollama on the host via ``host.docker.internal`` — not localhost — so this triggers every time. Documented workaround: set ``QSV_LLM_APIKEY=NONE`` in the worker environment OR pass ``api_key="NONE"`` per call; qsv accepts ``NONE`` as the explicit "this is unauthenticated, get on with it" sentinel. config_declaration.yaml: expanded ``describegpt_config_path`` docs Operators now have a clear path for both cloud LLMs (``QSV_LLM_APIKEY`` env) and local LLMs reached via container-host hostnames (``QSV_LLM_APIKEY=NONE``). Also clarified that ``format = "JSON"`` should be set in the prompt-file — qsv emits Markdown by default, which the stage's JSON parse would reject. Test plan * 39 ``tests/test_ai_suggestions.py`` tests pass (was 36); +3 for the new wrapper params, partial-envelope reshape, and forward- compat unknown-key handling. Full unit suite: 166 (was 163). * **End-to-end verified**: qsvdp 20.0.0 inside the dpp-test container → ``host.docker.internal:1234/v1`` → LM Studio (gemma-4-e4b) → real JSON envelope → reshape → on-disk ``ai_suggestions`` with ``description``, ``tags``, ``sku``, ``name``, ``price``, ``quantity_in_stock`` entries, each with ``value`` + ``source`` (label folded into source). This is the first time the feature has actually been confirmed working end-to-end against a real LLM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Re-implements the AI suggestions slice of PR #253 against the current Prefect pipeline. Last of the three replacement PRs for #253 (after #298 spatial extent and #299 resubmit-CLI robustness).
Stays narrowly backend-scoped: qsv subprocess wrapper, new pipeline stage, Prefect task wiring, a template helper, and unit tests. The UI port from #253 (~420 lines of JS + form-snippet plumbing) is a deliberate follow-up.
What's added
qsv_utils.pyQSVCommand.describegpt()wrapper, patterned afterstats()/frequency(). qsv owns the LLM round-trip (endpoint, model, prompt, API key all in qsv's own config / env, NOT in ckan.ini).jobs/stages/ai_suggestions.py(NEW)AISuggestionsStage. Opt-in viaENABLE_AI_SUGGESTIONS(default False). Non-blocking by design: every failure mode (subprocess, qsv timeout, malformed JSON, package-patch error) logs a warning and returns context unchanged.jobs/prefect_flow.pyai_suggestions_taskchained betweenanalyze_taskand the databasetransaction()block. Outside the transaction so a (rare) failure can't poison@on_rollbackhooks. Returns inputAnalyzeResultunchanged.config.py+config_declaration.yamlenable_ai_suggestions(default False),describegpt_config_path(qsv prompt/config file),describegpt_timeout_seconds(default 120 — distinct fromQSV_COMMAND_TIMEOUTso a hung LLM doesn't pin a worker for 30 min).helpers.py+plugin.pyscheming_get_ai_suggestionhelper parallel toscheming_get_suggestion_value. Direct top-level lookup (description/tags) → walksdictionaryfor{"field": ..., "summary": ...}match (withdescriptionas alt ofsummary).tests/test_ai_suggestions.py(NEW)Output shape
When the stage succeeds, the package picks up:
{ "dpp_suggestions": { "ai_suggestions": { "description": "A dataset of widgets.", "tags": ["widgets", "production"], "dictionary": [ {"field": "sku", "summary": "Stock keeping unit"} ], "generated_at": "2026-05-17T03:55:00Z" } } }Sits alongside (NOT inside)
dpp_suggestions["package"](formula-derived suggestions) — they co-exist by design.What we deliberately skipped
Per the #253 re-land plan:
OPENROUTER_API_KEY,AI_TEMPERATURE,AI_MAX_TOKENS, …) — those live in qsv's describegpt config file.ai_suggestionsshape, opt-in flag); the JS is just UX plumbing.Test plan
dsucalls; no real subprocess / CKAN required).ENABLE_AI_SUGGESTIONS=False(default):should_skipshort-circuits; pipeline unchanged.Tracking
Closes the AI-suggestions sub-task on #259 ("DP+ 3.0 tracking issue"). #253 stays open for the UI-port follow-up.
Co-Authored-By: Minhajuddin Mohammed minhajuddin2510@gmail.com
🤖 Generated with Claude Code