Skip to content

feat(ai-suggestions): optional qsv describegpt stage for LLM-derived metadata#301

Merged
jqnatividad merged 2 commits into
mainfrom
ai-suggestions-describegpt
May 17, 2026
Merged

feat(ai-suggestions): optional qsv describegpt stage for LLM-derived metadata#301
jqnatividad merged 2 commits into
mainfrom
ai-suggestions-describegpt

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Re-implements the AI suggestions slice of PR #253 against the current Prefect pipeline. Last of the three replacement PRs for #253 (after #298 spatial extent and #299 resubmit-CLI robustness).

Stays narrowly backend-scoped: qsv subprocess wrapper, new pipeline stage, Prefect task wiring, a template helper, and unit tests. The UI port from #253 (~420 lines of JS + form-snippet plumbing) is a deliberate follow-up.

What's added

File Change
qsv_utils.py New QSVCommand.describegpt() wrapper, patterned after stats() / frequency(). qsv owns the LLM round-trip (endpoint, model, prompt, API key all in qsv's own config / env, NOT in ckan.ini).
jobs/stages/ai_suggestions.py (NEW) AISuggestionsStage. Opt-in via ENABLE_AI_SUGGESTIONS (default False). Non-blocking by design: every failure mode (subprocess, qsv timeout, malformed JSON, package-patch error) logs a warning and returns context unchanged.
jobs/prefect_flow.py New ai_suggestions_task chained between analyze_task and the database transaction() block. Outside the transaction so a (rare) failure can't poison @on_rollback hooks. Returns input AnalyzeResult unchanged.
config.py + config_declaration.yaml Three new keys: enable_ai_suggestions (default False), describegpt_config_path (qsv prompt/config file), describegpt_timeout_seconds (default 120 — distinct from QSV_COMMAND_TIMEOUT so a hung LLM doesn't pin a worker for 30 min).
helpers.py + plugin.py scheming_get_ai_suggestion helper parallel to scheming_get_suggestion_value. Direct top-level lookup (description/tags) → walks dictionary for {"field": ..., "summary": ...} match (with description as alt of summary).
tests/test_ai_suggestions.py (NEW) 22 unit tests covering wrapper arg assembly, stage gate, happy + failure paths, helper lookup shapes.

Output shape

When the stage succeeds, the package picks up:

{
  "dpp_suggestions": {
    "ai_suggestions": {
      "description": "A dataset of widgets.",
      "tags": ["widgets", "production"],
      "dictionary": [
        {"field": "sku", "summary": "Stock keeping unit"}
      ],
      "generated_at": "2026-05-17T03:55:00Z"
    }
  }
}

Sits alongside (NOT inside) dpp_suggestions["package"] (formula-derived suggestions) — they co-exist by design.

What we deliberately skipped

Per the #253 re-land plan:

  • Bespoke Python LLM client (+449 lines in Added dpp spatial extent #253) — qsv describegpt already speaks any OpenAI-compatible endpoint (Ollama, OpenRouter, OpenAI, vLLM).
  • The 10 PR-Added dpp spatial extent #253 config keys (OPENROUTER_API_KEY, AI_TEMPERATURE, AI_MAX_TOKENS, …) — those live in qsv's describegpt config file.
  • The PR-Added dpp spatial extent #253 fallback heuristic — stage is opt-in + non-blocking, so a failed describegpt IS the heuristic.
  • The 420-line JS port + form snippets from Added dpp spatial extent #253 — follow-up PR. The backend exposes everything needed (helper, populated ai_suggestions shape, opt-in flag); the JS is just UX plumbing.

Test plan

  • 22 new unit tests pass in dpp-test container (mocked qsv + dsu calls; no real subprocess / CKAN required).
  • Full unit suite: 149 passed (was 127, +22 new). No regressions.
  • ENABLE_AI_SUGGESTIONS=False (default): should_skip short-circuits; pipeline unchanged.
  • Stage syntax / Prefect task imports parse cleanly.
  • End-to-end verification against Ollama / OpenRouter — needs an operator with a live LLM endpoint configured. Stage is opt-in so unverified-but-merged is safe.

Tracking

Closes the AI-suggestions sub-task on #259 ("DP+ 3.0 tracking issue"). #253 stays open for the UI-port follow-up.

Co-Authored-By: Minhajuddin Mohammed minhajuddin2510@gmail.com

🤖 Generated with Claude Code

…metadata

Re-implements the AI suggestions slice of PR #253 against the current
Prefect pipeline. Stays narrowly backend-scoped: the qsv subprocess
wrapper, the new pipeline stage, the Prefect task wiring, a template
helper, and unit tests. The UI port from PR #253 (~420 lines of JS +
form snippet plumbing) is a deliberate follow-up.

What

* ``qsv_utils.QSVCommand.describegpt()`` — thin wrapper around ``qsv
  describegpt --description --dictionary --tags --json``, patterned
  after the existing ``stats()`` / ``frequency()`` methods. qsv owns
  the LLM round-trip (endpoint, model, prompt template, API key all
  live in qsv's own config or env, deliberately NOT in ckan.ini).

* ``jobs/stages/ai_suggestions.py`` — new ``AISuggestionsStage``
  (``BaseStage``). Opt-in via ``ENABLE_AI_SUGGESTIONS`` (default
  False). Non-blocking by design: every internal failure mode
  (subprocess error, qsv timeout, malformed JSON, package-patch
  error) drops a warning and returns the context unchanged so the
  rest of the pipeline keeps going. Writes results under
  ``package["dpp_suggestions"]["ai_suggestions"]`` — a separate
  sub-key from the formula-derived ``["package"]`` namespace so
  they can co-exist.

* ``jobs/prefect_flow.py`` — new ``ai_suggestions_task`` chained
  between ``analyze_task`` and the database ``transaction()`` block.
  Placement outside the transaction prevents a (rare) failure from
  poisoning the per-task ``@on_rollback`` hooks; the task returns
  its input ``AnalyzeResult`` unchanged so downstream tasks see the
  same shape they would without it.

* Three new config keys (``config.py`` + ``config_declaration.yaml``):
  ``enable_ai_suggestions`` (default False),
  ``describegpt_config_path`` (path to qsv's prompt/config file,
  default empty → qsv's own discovery), and
  ``describegpt_timeout_seconds`` (default 120, distinct from the
  1800s ``QSV_COMMAND_TIMEOUT`` so a hung LLM doesn't pin a worker).

* ``helpers.scheming_get_ai_suggestion`` + ``plugin.py`` wiring —
  scheming-form helper parallel to ``scheming_get_suggestion_value``.
  Looks up direct top-level keys (``description`` / ``tags``) first,
  then walks ``dictionary`` for a per-field ``{"field": ..., "summary":
  ...}`` match (with ``description`` as an alternate of ``summary``).

What we deliberately skipped (out of scope per #253 re-land plan)

* The bespoke Python LLM client from #253 — qsv describegpt already
  speaks any OpenAI-compatible endpoint (Ollama, OpenRouter, OpenAI,
  vLLM). No DP+ HTTP code needed.

* The 10 PR-#253 config keys (``OPENROUTER_API_KEY``, ``AI_TEMPERATURE``,
  ``AI_MAX_TOKENS``, …). Those live in qsv's describegpt config file,
  not in ours.

* The PR-#253 fallback heuristic — stage is opt-in and non-blocking
  already, so a failed describegpt is the heuristic.

* The 420-line JS port from PR #253 + form-snippet templates — that's
  the follow-up PR. The backend exposes everything needed (helper,
  populated ``ai_suggestions`` shape, opt-in flag); the JS is just
  UX plumbing.

Test plan

* 22 new unit tests in ``tests/test_ai_suggestions.py`` (mocked qsv
  + ``dsu`` calls, no real subprocess / CKAN required). Full unit
  suite now 149 passed (was 127).
* With ``enable_ai_suggestions=False`` (default): pipeline runs
  unchanged — ``should_skip`` short-circuits.
* With it enabled + working describegpt config: stage writes
  ``package["dpp_suggestions"]["ai_suggestions"]``; downstream
  ``MetadataStage`` writes it back via the normal package-update path.
* With it enabled but qsv errors: stage logs a warning, pipeline
  completes; ``ai_suggestions`` key simply absent.

Tracking issue #259 ("DP+ 3.0 tracking issue" → AI-suggestions
sub-task) can be checked off once this lands.

Co-Authored-By: Minhajuddin Mohammed <minhajuddin2510@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in “AI suggestions” step to the DataPusher Plus Prefect pipeline by wrapping qsv describegpt, persisting the returned JSON onto the CKAN package under dpp_suggestions.ai_suggestions, and exposing a scheming helper to read those suggestions in templates.

Changes:

  • Introduces QSVCommand.describegpt() and a new AISuggestionsStage that shells out to qsv describegpt and patches the CKAN package with the results.
  • Wires a new Prefect task (ai_suggestions_task) between analysis and the transactional database stages; adds config keys to gate/parameterize the feature.
  • Adds scheming_get_ai_suggestion helper + helper registration, plus a new unit-test suite for the feature.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_ai_suggestions.py New unit tests for qsv wrapper args, stage gating/failure behavior, and helper lookup logic.
ckanext/datapusher_plus/qsv_utils.py Adds QSVCommand.describegpt() wrapper with config-default timeout support.
ckanext/datapusher_plus/plugin.py Registers the new scheming_get_ai_suggestion template helper.
ckanext/datapusher_plus/jobs/stages/ai_suggestions.py New optional pipeline stage that runs describegpt and persists results to the package.
ckanext/datapusher_plus/jobs/prefect_flow.py Adds a Prefect task and flow wiring to run the AI suggestions stage after analysis.
ckanext/datapusher_plus/helpers.py Adds scheming_get_ai_suggestion helper to read AI suggestions from package metadata.
ckanext/datapusher_plus/config.py Adds config keys for enabling AI suggestions, config path, and a dedicated describegpt timeout.
ckanext/datapusher_plus/config_declaration.yaml Documents the new configuration keys for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ckanext/datapusher_plus/jobs/stages/ai_suggestions.py Outdated
Comment thread ckanext/datapusher_plus/jobs/stages/ai_suggestions.py Outdated
Comment thread ckanext/datapusher_plus/config_declaration.yaml Outdated
Three follow-ups from Copilot's inline review:

* ai_suggestions.py: reuse ``context.qsv`` (the per-flow-run
  ``QSVCommand`` constructed in ``_build_runtime_context``) instead
  of building a fresh one. Saves the binary-existence check + ``qsv
  --version`` call on every flow. ``getattr(context, "qsv", None) or
  QSVCommand(logger=context.logger)`` keeps unit-test stand-ins (which
  pass a ``SimpleNamespace`` without ``qsv``) working.

* ai_suggestions.py: when ``package["dpp_suggestions"]`` is present
  but not a dict (legacy / corrupted state), log a warning and skip
  the write rather than resetting to ``{}``. The old "reset" behaviour
  was destructive — it would silently nuke any pre-existing data
  (formula-derived suggestions, custom-plugin scalars, …) for what is
  supposed to be a bonus, non-blocking stage. ``None`` still gets
  initialised to ``{}`` (happy path when scheming/FormulaStage hasn't
  run); only the "present but wrong type" case bails out.

* config_declaration.yaml: aligned the
  ``describegpt_config_path`` description to match the wrapper
  docstring (and reality) — qsv's ``--prompt-file`` flag expects
  TOML, not JSON. Dropped the misleading
  ``~/.qsv/describegpt.json`` claim; replaced with the actual qsv
  discovery path (``dp.prompt`` / ``dp_prompt.toml`` next to the
  binary, fallback to ``OPENAI_API_KEY``).

Tests: the old reset-and-write test
(``test_process_resets_corrupted_dpp_suggestions``) is replaced
with two more accurate tests — ``test_process_skips_when_dpp_suggestions_is_non_dict``
(asserts the original value is preserved and ``patch_package`` is
NOT called) and ``test_process_initializes_missing_dpp_suggestions``
(covers the ``None`` → init path). Unit suite still 23/23 on this
file, 150/150 overall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit fdc3714 into main May 17, 2026
1 check passed
@jqnatividad jqnatividad deleted the ai-suggestions-describegpt branch May 17, 2026 04:38
jqnatividad added a commit that referenced this pull request May 17, 2026
…302 follow-up)

End-to-end verification against LM Studio (gemma-4-e4b on
host.docker.internal:1234) surfaced two real bugs in the AI
suggestions feature that landed in #301 + #302. The unit suite
passed because subprocess.run was mocked with a synthetic JSON
shape; nothing exercised qsv's actual CLI surface or output
schema.

Bug 1: ``--json`` flag doesn't exist in qsv 20.0.0

  PR #301's ``QSVCommand.describegpt`` wrapper unconditionally
  emitted ``--json`` → qsv rejected every real call with
  ``Unknown flag: '--json'``. The correct flag is ``--format JSON``
  (qsv 20.0.0 supports Markdown / TSV / JSON / TOON via a single
  ``--format`` flag).

  Fix: wrapper now emits ``--format JSON`` when ``json_output=True``.
  Test asserts both the presence of ``--format JSON`` and the
  absence of ``--json`` so this regression can't reland silently.

Bug 2: ``_reshape_for_ui`` assumed wrong output schema

  PR #301/#302 assumed qsv emits a flat shape like
  ``{description: "...", tags: [...], dictionary: [{field, summary}]}``.
  Real qsv 20.0.0 output (validated against LM Studio) is a wrapped
  envelope with PascalCase top-level keys::

    {"Dictionary": {"response": {"fields": [...], ...},
                    "reasoning": "...", "token_usage": {...}},
     "Description": {"response": "<markdown>",
                     "reasoning": "...", "token_usage": {...}},
     "Tags": {"response": {"tags": [...], "attribution": "..."},
              "reasoning": "...", "token_usage": {...}, "num_tags": N}}

  Even if Bug 1 were fixed, the reshape would always produce an
  empty dict because none of the assumed top-level keys would
  match. The UI would never see populated suggestions.

  Fix: rewrote ``_reshape_for_ui`` to walk the real envelope:
  ``Description.response`` → top-level ``description``,
  ``Tags.response.tags`` → top-level ``tags`` (comma-joined),
  ``Dictionary.response.fields[i]`` → per-column entries keyed by
  ``name`` with the LLM-generated ``description`` as value and the
  ``label`` folded into source (so the popover surfaces both
  "Stock Keeping Unit (SKU)" and the long description).

  Committed a real qsv JSON envelope as
  ``tests/fixtures/qsv_describegpt_sample.json`` so the
  ``test_process_happy_path_persists_suggestions`` test now
  exercises actual qsv output, not a synthetic shape. Future qsv
  schema changes that break the reshape get caught here.

Wrapper: two new optional params for non-localhost LLMs

  ``api_key`` and ``base_url``. qsv 20.0.0 refuses to start when
  the resolved base URL doesn't contain ``localhost`` AND no API
  key is set (``Neither QSV_LLM_BASE_URL nor QSV_LLM_APIKEY
  environment variables are set``). For DP+ specifically, the
  worker container reaches LM Studio / Ollama on the host via
  ``host.docker.internal`` — not localhost — so this triggers
  every time. Documented workaround: set ``QSV_LLM_APIKEY=NONE``
  in the worker environment OR pass ``api_key="NONE"`` per call;
  qsv accepts ``NONE`` as the explicit
  "this is unauthenticated, get on with it" sentinel.

config_declaration.yaml: expanded ``describegpt_config_path`` docs

  Operators now have a clear path for both cloud LLMs
  (``QSV_LLM_APIKEY`` env) and local LLMs reached via container-host
  hostnames (``QSV_LLM_APIKEY=NONE``). Also clarified that
  ``format = "JSON"`` should be set in the prompt-file — qsv emits
  Markdown by default, which the stage's JSON parse would reject.

Test plan

* 39 ``tests/test_ai_suggestions.py`` tests pass (was 36); +3 for
  the new wrapper params, partial-envelope reshape, and forward-
  compat unknown-key handling. Full unit suite: 166 (was 163).
* **End-to-end verified**: qsvdp 20.0.0 inside the dpp-test
  container → ``host.docker.internal:1234/v1`` → LM Studio
  (gemma-4-e4b) → real JSON envelope → reshape → on-disk
  ``ai_suggestions`` with ``description``, ``tags``, ``sku``,
  ``name``, ``price``, ``quantity_in_stock`` entries, each with
  ``value`` + ``source`` (label folded into source). This is the
  first time the feature has actually been confirmed working
  end-to-end against a real LLM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 17, 2026
fix(ai-suggestions): real-qsv schema + CLI flag corrections (PR #301/#302 follow-up)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants