
[KYUUBI #7379][2b/4] Data Agent Engine: agent runtime, middleware stack, and OpenAI provider#7417

Open
wangzhigang1999 wants to merge 2 commits into apache:master from wangzhigang1999:pr2b/data-agent-runtime

Conversation

Contributor

@wangzhigang1999 wangzhigang1999 commented Apr 22, 2026

Why are the changes needed?

Part 2b of 4 for the Data Agent Engine (umbrella, KPIP-7373).

This PR adds the ReAct agent runtime that drives the LLM <-> tool loop, a composable middleware stack around it, and a production OpenAiProvider. It sits on top of the tool system and data source abstraction introduced in PR 2a, and is consumed by the REST layer in PR 3.

Changes include:

  • ReactAgent — ReAct loop with streaming, tool-call dispatch, turn budget, malformed-tool-call recovery
  • ConversationMemory — message history with cumulative prompt-token tracking
  • AgentRunContext / AgentInvocation / ApprovalMode — per-run state plumbing
  • ToolOutputStore — size-gated tool-output offload, keyed by session+call-id, with ReadToolOutputTool / GrepToolOutputTool for LLM-driven retrieval
  • AgentMiddleware interface with onRegister hook for tool wiring, plus four middlewares:
    • LoggingMiddleware — structured request/response logging
    • ApprovalMiddleware — risk-level-based approval gate
    • CompactionMiddleware — token-threshold-driven history summarization keyed by session
    • ToolResultOffloadMiddleware — transparently owns the ToolOutputStore and registers retrieval tools
  • OpenAiProvider — OpenAI-compatible chat completions with streaming and tool calls
  • ExecuteStatement.scala — SSE encoding extended to emit Compaction events
  • Dialects moved under datasource.dialect package for organization
  • New kyuubi.engine.data.agent.compaction.trigger.tokens configuration entry
  • MockLlmProvider — deterministic mock for middleware and runtime tests
  • mysql-connector-j moved to test scope (GPL-licensed; cannot be bundled in an Apache binary release; addresses review feedback on #7417)
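To make the ToolOutputStore bullet concrete, here is a minimal sketch of a size-gated offload store keyed by session + call-id. All names here (`ToolOutputStoreSketch`, `offloadIfLarge`, `read`) are illustrative stand-ins, not the actual Kyuubi classes; the real store spills to disk and exposes `ReadToolOutputTool` / `GrepToolOutputTool` rather than plain methods.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: outputs above a byte threshold are stashed under a
// (session, callId) key and replaced by a truncated preview that tells the
// LLM how to retrieve the full text.
class ToolOutputStoreSketch {
    private final int thresholdBytes;
    private final Map<String, String> store = new ConcurrentHashMap<>();

    ToolOutputStoreSketch(int thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /** Returns what the LLM sees: the full output if small, else a preview. */
    String offloadIfLarge(String sessionId, String callId, String output) {
        byte[] bytes = output.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= thresholdBytes) {
            return output;                      // small enough: pass through
        }
        String key = sessionId + "/" + callId;  // keyed by session + call-id
        store.put(key, output);                 // full output kept for retrieval
        return output.substring(0, Math.min(output.length(), 200))
            + "\n[truncated; retrieve via read_tool_output key=" + key + "]";
    }

    /** Backing lookup for a read_tool_output-style retrieval tool. */
    String read(String sessionId, String callId) {
        return store.getOrDefault(sessionId + "/" + callId, "");
    }
}
```

A grep-style companion tool would run a regex over `read(...)` output instead of returning it wholesale.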

How was this patch tested?

  • Unit tests (Java): ConversationMemoryTest, ToolOutputStoreTest, ApprovalMiddlewareTest, CompactionMiddlewareTest, ToolResultOffloadMiddlewareTest, event/EventTest, plus updates to ToolRegistryThreadSafetyTest / ToolTest / RunSelectQueryToolTest / RunMutationQueryToolTest / JdbcDialectTest / MySQL DialectTest
  • Live LLM tests (opt-in, require DATA_AGENT_LLM_API_KEY / DATA_AGENT_LLM_API_URL / DATA_AGENT_LLM_MODEL): ReactAgentLiveTest, CompactionMiddlewareLiveTest — exercise the full loop against a real OpenAI-compatible endpoint
  • E2E (Scala): DataAgentE2ESuite extended with OpenAI-provider paths; new DataAgentCompactionE2ESuite observes compaction via JDBC
  • Existing unit + MySQL Testcontainers tests from PR 2a remain green
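The opt-in gating for the live tests can be sketched as follows. The helper takes the environment as a map so the check is testable; the real tests would pass `System.getenv()` (and likely sit behind JUnit assumptions rather than a bare boolean). The class and method names are hypothetical.

```java
import java.util.Map;

// Illustrative sketch: live-LLM tests run only when all three env vars
// named above are present and non-empty.
class LiveTestGate {
    static final String[] REQUIRED = {
        "DATA_AGENT_LLM_API_KEY",
        "DATA_AGENT_LLM_API_URL",
        "DATA_AGENT_LLM_MODEL"
    };

    /** True only when every required live-LLM variable is set and non-blank. */
    static boolean liveLlmConfigured(Map<String, String> env) {
        for (String name : REQUIRED) {
            String v = env.get(name);
            if (v == null || v.isEmpty()) {
                return false;   // missing or blank: skip the live tests
            }
        }
        return true;
    }
}
```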

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for test generation, code review, and PR formatting. Core design and implementation are human-authored.

Commit message (review thread on externals/kyuubi-data-agent-engine/pom.xml):
…re stack, OpenAI provider, and live E2E tests

This PR delivers the runtime layer of the Data Agent Engine on top of the tool
system and data source plumbing from 2a/4:

- ReactAgent: ReAct-style loop with streaming LLM responses, per-step tool
  dispatch, and AgentRunContext tracking token usage, iterations, and session.
- Middleware stack (AgentMiddleware + ReactAgent.Builder):
  * LoggingMiddleware -- structured per-step/LLM/tool/finish logs with MDC.
  * ApprovalMiddleware -- CompletableFuture-based resolve for DESTRUCTIVE
    tools; modes NORMAL / STRICT / AUTO_APPROVE.
  * CompactionMiddleware -- token-threshold-triggered history summarization
    with KEEP_RECENT_TURNS=4, emits a Compaction AgentEvent so clients can
    observe the mechanism firing.
  * ToolResultOffloadMiddleware -- spills large tool outputs to disk and
    surfaces `read_tool_output` / `grep_tool_output` companion tools for the
    LLM to re-query truncated previews.
- OpenAiProvider: single shared ReactAgent, per-session ConversationMemory,
  streaming chat completions, Hikari-pooled JDBC data source; reads model and
  thresholds from KyuubiConf.
- ExecuteStatement (Scala): encodes all AgentEvents (including compaction and
  approval_request) as SSE JSON rows streamed through the JDBC reply column.
- KyuubiConf: new keys for LLM provider/api-url/model/api-key, approval mode,
  compaction trigger tokens, offload root/thresholds, max iterations, etc.
- Tests:
  * Unit tests for runtime, middlewares, offload store, and event shapes.
  * Live tests gated on DATA_AGENT_LLM_API_KEY covering full LLM round-trips:
    ReactAgentLiveTest (offload+grep, approval approve/deny), DataAgentE2ESuite
    and DataAgentApprovalE2ESuite (JDBC layer), DataAgentCompactionE2ESuite
    (JDBC-observable compaction event + post-compaction recovery),
    CompactionMiddlewareLiveTest.
  * Compatibility verified against qwen3.6-plus, glm-5, and kimi-k2.5 via
    per-call `model=` logging in ReactAgent.
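The middleware contract described above can be condensed into a sketch. Hook names follow the ones this PR mentions (`onRegister`, `beforeLlmCall`, `afterLlmCall`, `afterToolCall`), but the signatures here are simplified stand-ins (plain strings instead of message/event classes), not the actual Kyuubi interface.

```java
import java.util.List;

// Illustrative middleware interface: every hook has a pass-through default,
// so a middleware overrides only what it needs.
interface AgentMiddlewareSketch {
    /** Called once when attached; may register companion tools (offload case). */
    default void onRegister(List<String> toolNames) {}

    /** May rewrite (e.g. compact) the history before each LLM call. */
    default List<String> beforeLlmCall(List<String> messages) { return messages; }

    /** Observes the assistant reply after each LLM call. */
    default void afterLlmCall(String assistantMessage) {}

    /** May rewrite a tool result, e.g. replace a large output with a preview. */
    default String afterToolCall(String toolName, String result) { return result; }
}

// Example in the spirit of CompactionMiddleware: keep only the most recent
// messages once the history grows past a budget (the real middleware
// summarizes via the LLM instead of dropping).
class TrimMiddleware implements AgentMiddlewareSketch {
    private final int keepRecent;
    TrimMiddleware(int keepRecent) { this.keepRecent = keepRecent; }

    @Override
    public List<String> beforeLlmCall(List<String> messages) {
        if (messages.size() <= keepRecent) return messages;
        return messages.subList(messages.size() - keepRecent, messages.size());
    }
}
```

Because each hook defaults to a no-op, middlewares compose by simple chaining in registration order.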
@wangzhigang1999 force-pushed the pr2b/data-agent-runtime branch from 3011909 to ce4eecc on April 23, 2026 at 02:45
MySQL Connector/J is GPL-licensed and cannot be bundled in an Apache
binary release. Users who need the MySQL/StarRocks datasource at runtime
should provide the driver jar themselves on the engine classpath.

Addresses review feedback on apache#7417.
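For reference, the test-scope change amounts to a dependency declaration along these lines. This is an illustrative fragment, assuming a version property; the exact coordinates and property name in the real pom.xml may differ.

```xml
<!-- GPL-licensed driver: test-scoped so it is never bundled in the
     Apache binary release; runtime users supply the jar themselves. -->
<dependency>
  <groupId>com.mysql</groupId>
  <artifactId>mysql-connector-j</artifactId>
  <version>${mysql.connector.version}</version>
  <scope>test</scope>
</dependency>
```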
Contributor Author

wangzhigang1999 commented Apr 23, 2026

Evidence: runtime under real workload

Supplementary evidence for this PR. The benchmark harness itself is kept on a local branch (pr2c/...) and is not part of this PR — the numbers below are cited only to show that the runtime shipped here behaves correctly under a non-trivial text-to-SQL workload.

TL;DR

  • End-to-end robustness. 500/500 BIRD mini-dev questions completed, gen_ok=100%, zero framework-level failures (no crashes, no stuck sessions, no rate-limit fallout at concurrency=8).
  • Accuracy (EX=64.80%, F1=68.72% with kimi-k2.5) is included as context — this PR is not a model or accuracy deliverable.

Setup

  • Runtime under test: ReactAgent + ToolResultOffloadMiddleware + CompactionMiddleware (this PR)
  • Tools: submit_sql, run_select_query; max_iterations=30
  • Model: kimi-k2.5 via dashscope OpenAI-compatible endpoint
  • Workload: BIRD mini-dev, 500 questions / 11 SQLite databases
  • Harness: internal, on local branch pr2c/... (not submitted)
  • Concurrency: 8
  • Run: benchmark-out/20260423-012636/

Results

Overall: 500 questions, gen_ok=100%, EX=64.80%, F1=68.72%, avg 8.9 steps / 43.6k tokens / 42s per question.

By difficulty:

| Difficulty  | N   | EX     | F1     | avg steps | avg tokens |
|-------------|-----|--------|--------|-----------|------------|
| simple      | 148 | 75.00% | 78.33% | 7.6       | 33k        |
| moderate    | 250 | 62.80% | 66.99% | 9.1       | 46k        |
| challenging | 102 | 54.90% | 59.01% | 10.2      | 53k        |

By database (sorted by EX):

| DB                      | N  | EX     | F1     |
|-------------------------|----|--------|--------|
| superhero               | 52 | 84.62% | 89.35% |
| european_football_2     | 51 | 76.47% | 79.48% |
| student_club            | 48 | 72.92% | 74.44% |
| toxicology              | 40 | 72.50% | 73.68% |
| codebase_community      | 49 | 69.39% | 74.17% |
| card_games              | 52 | 63.46% | 58.27% |
| debit_card_specializing | 30 | 60.00% | 62.78% |
| formula_1               | 66 | 54.55% | 62.13% |
| thrombosis_prediction   | 50 | 54.00% | 59.60% |
| financial               | 32 | 50.00% | 58.79% |
| california_schools      | 30 | 43.33% | 54.29% |

Cost: ~45 min wall time at concurrency=8, ~21M tokens total.

What this evidence supports

  • The runtime completes 500 non-trivial ReAct loops end-to-end without framework-level errors, at concurrency=8, against a live OpenAI-compatible provider. That's the load-bearing claim for this PR.
  • Difficulty scaling is monotonic (75% → 63% → 55%) — no pathological inversion that would suggest the runtime mishandles longer loops.

Scope disclaimers

  • Not a model benchmark. Numbers are bound to kimi-k2.5; other models will move them.
  • Not a leaderboard submission. Vanilla ReAct with the stock prompt; no schema-ranking, few-shot, or self-consistency. This is a framework viability floor, not SOTA.
  • Not a regression gate. LLM non-determinism is ±1pp run-to-run.
  • Harness not shipping with this PR. If you want to reproduce, ping me — the branch can be shared on request.

Follow-up: Spark backend run

Same harness, same 500 BIRD questions, but the agent targets a real EMR Kyuubi + Spark 3.5.3 cluster via jdbc:hive2://...:10009 instead of local SQLite. BIRD data was loaded from the SQLite files into Spark managed tables (Parquet on OSS-DLS) under 11 databases named bird_<db_id>.

| Metric              | SQLite (v2, middleware on) | Spark (v3) |
|---------------------|----------------------------|------------|
| Overall EX          | 64.80%                     | 60.80%     |
| F1                  | 68.72%                     | 59.70%     |
| simple              | 75.00%                     | 71.62%     |
| moderate            | 62.80%                     | 60.00%     |
| challenging         | 54.90%                     | 47.06%     |
| gen_ok              | 100%                       | 100%       |
| BadRequestException | 0                          | 0          |
| avg steps / q       | 8.9                        | 11.9       |
| avg tokens / q      | 44k                        | 62k        |
| wall-clock          | ~45 min                    | ~82 min    |

This setup is materially stricter than the official BIRD evaluation. BIRD pins the target db_id per example and only measures SQL generation quality against a known schema. Here the agent is not told which database to target — and, importantly, the Hive metastore on the cluster holds 23 databases (11 BIRD-loaded plus 12 unrelated production schemas: tpcds_*, parquet_db_*, orc_db_*, kyuubi_test, doctor_test, etc.). The agent has to disambiguate the right bird_<db_id> out of that mixed list every question, which BIRD explicitly does not measure. The 4pp EX drop (64.80% → 60.80%) is largely attributable to this multi-database disambiguation task. One observed failure mode: for a question about EUR/CZK currencies, the agent picked bird_financial instead of the correct bird_debit_card_specializing because the former name "looked more relevant" to the question's domain.

What this run confirms for the PR:

  • Runtime is datasource-agnostic. No changes to ReactAgent, no changes to any middleware, no changes to the provider — only the harness swapped JDBC URLs and plumbed a dialectName through the prompt builder.
  • Middleware scales to heavier prompts. Spark's richer schema-discovery round-trips push avg per-question prompt size from 44k to 62k tokens, yet zero BadRequest / zero framework errors, same as the SQLite run. The 128k compaction threshold still provides the load-bearing guard even though it happens not to trigger on any single question at this size.
  • Difficulty scaling stays monotonic (72% → 60% → 47%), same qualitative shape as SQLite — the runtime does not degrade non-linearly on harder questions when swapping backends.

@wangzhigang1999
Contributor Author

ReactAgent Execution Flow

ReactAgent.run(request, memory, eventConsumer)
│
├─ memory.addUserMessage(userInput)
├─ dispatchAgentStart  /  emit(AgentStart)
│
├─ for step in 1..maxIterations:
│    ├─ emit(StepStart)
│    │
│    ├─ messages = memory.buildLlmMessages()
│    ├─ messages = middleware.beforeLlmCall(messages)      ← may rewrite or abort the call
│    │
│    ├─ streamLlmResponse(messages)                        ← streaming + chunk accumulation
│    │     └─ emit(ContentDelta)*   (one per token)
│    │
│    ├─ emit(ContentComplete)
│    ├─ memory.addAssistantMessage(...)
│    ├─ middleware.afterLlmCall(...)
│    │
│    ├─ if no toolCalls → emit(StepEnd) + emit(AgentFinish) + return   ✅ normal termination
│    │
│    └─ executeToolCalls (3-phase pipeline):
│         ├─ Phase 1 (serial)   : parse args → beforeToolCall → approval gate
│         ├─ Phase 2 (parallel) : toolRegistry.submitTool(...) → futures
│         └─ Phase 3 (serial)   : future.join()
│                                 → afterToolCall (may rewrite result, e.g. offload)
│                                 → memory.addToolResult(...)
│                                 → emit(ToolResult)
│
├─ exceeded maxIterations → emit(AgentError) + emit(AgentFinish)
├─ exception thrown       → emit(AgentError) + emit(AgentFinish)
└─ finally: dispatchAgentFinish   ← guarantees middleware cleanup

@wangzhigang1999 wangzhigang1999 marked this pull request as ready for review April 23, 2026 06:21
@wangzhigang1999
Contributor Author

Hi @pan3793, when you have time, could I ask for a review on this one? 🙏

Third PR of the Data Agent Engine series (umbrella #7379, labeled 2b/4) — adds the ReactAgent runtime, middleware stack (logging / approval / compaction / tool-output offload), and OpenAiProvider. Sits on top of 2a (#7400) and is consumed by the final REST-layer PR.

It's on the larger side (~5.3k lines, almost all under externals/kyuubi-data-agent-engine/), so a high-level pass on the agent/middleware shape and session lifecycle is more than enough — happy to iterate on line-level feedback after. ReactAgentLiveTest exercises the full loop end-to-end if easier to poke at locally (needs DATA_AGENT_LLM_API_KEY / DATA_AGENT_LLM_API_URL / DATA_AGENT_LLM_MODEL).

No rush — thanks!
