
[KYUUBI #7379][2b/4] Data Agent Engine: agent runtime, middleware stack, and OpenAI provider#7417

Open
wangzhigang1999 wants to merge 2 commits into apache:master from wangzhigang1999:pr2b/data-agent-runtime

Conversation

Contributor

@wangzhigang1999 wangzhigang1999 commented Apr 22, 2026

Why are the changes needed?

Part 2b of 4 for the Data Agent Engine (umbrella, KPIP-7373).

This PR adds the ReAct agent runtime that drives the LLM <-> tool loop, a composable middleware stack around it, and a production OpenAiProvider. It sits on top of the tool system and data source abstraction introduced in PR 2a, and is consumed by the REST layer in PR 3.

Changes include:

  • ReactAgent — ReAct loop with streaming, tool-call dispatch, turn budget, malformed-tool-call recovery
  • ConversationMemory — message history with cumulative prompt-token tracking
  • AgentRunContext / AgentInvocation / ApprovalMode — per-run state plumbing
  • ToolOutputStore — size-gated tool-output offload, keyed by session+call-id, with ReadToolOutputTool / GrepToolOutputTool for LLM-driven retrieval
  • AgentMiddleware interface with onRegister hook for tool wiring, plus four middlewares:
    • LoggingMiddleware — structured request/response logging
    • ApprovalMiddleware — risk-level-based approval gate
    • CompactionMiddleware — token-threshold-driven history summarization keyed by session
    • ToolResultOffloadMiddleware — transparently owns the ToolOutputStore and registers retrieval tools
  • OpenAiProvider — OpenAI-compatible chat completions with streaming and tool calls
  • ExecuteStatement.scala — SSE encoding extended to emit Compaction events
  • Dialects moved under datasource.dialect package for organization
  • New kyuubi.engine.data.agent.compaction.trigger.tokens configuration entry
  • MockLlmProvider — deterministic mock for middleware and runtime tests
  • mysql-connector-j moved to test scope (GPL-licensed; cannot be bundled in an Apache binary release; addresses review feedback on #7417)
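To make the ToolOutputStore bullet concrete, here is a minimal sketch of a size-gated offload store keyed by session + call-id. All names here (`ToolOutputStoreSketch`, `offloadIfLarge`, `read`) are illustrative stand-ins, not the actual Kyuubi classes; the real store spills to disk and exposes `ReadToolOutputTool` / `GrepToolOutputTool` rather than plain methods.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: outputs above a byte threshold are stashed under a
// (session, callId) key and replaced by a truncated preview that tells the
// LLM how to retrieve the full text.
class ToolOutputStoreSketch {
    private final int thresholdBytes;
    private final Map<String, String> store = new ConcurrentHashMap<>();

    ToolOutputStoreSketch(int thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /** Returns what the LLM sees: the full output if small, else a preview. */
    String offloadIfLarge(String sessionId, String callId, String output) {
        byte[] bytes = output.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= thresholdBytes) {
            return output;                      // small enough: pass through
        }
        String key = sessionId + "/" + callId;  // keyed by session + call-id
        store.put(key, output);                 // full output kept for retrieval
        return output.substring(0, Math.min(output.length(), 200))
            + "\n[truncated; retrieve via read_tool_output key=" + key + "]";
    }

    /** Backing lookup for a read_tool_output-style retrieval tool. */
    String read(String sessionId, String callId) {
        return store.getOrDefault(sessionId + "/" + callId, "");
    }
}
```

A grep-style companion tool would run a regex over `read(...)` output instead of returning it wholesale.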

How was this patch tested?

  • Unit tests (Java): ConversationMemoryTest, ToolOutputStoreTest, ApprovalMiddlewareTest, CompactionMiddlewareTest, ToolResultOffloadMiddlewareTest, event/EventTest, plus updates to ToolRegistryThreadSafetyTest / ToolTest / RunSelectQueryToolTest / RunMutationQueryToolTest / JdbcDialectTest / MySQL DialectTest
  • Live LLM tests (opt-in, require DATA_AGENT_LLM_API_KEY / DATA_AGENT_LLM_API_URL / DATA_AGENT_LLM_MODEL): ReactAgentLiveTest, CompactionMiddlewareLiveTest — exercise the full loop against a real OpenAI-compatible endpoint
  • E2E (Scala): DataAgentE2ESuite extended with OpenAI-provider paths; new DataAgentCompactionE2ESuite observes compaction via JDBC
  • Existing unit + MySQL Testcontainers tests from PR 2a remain green
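The opt-in gating for the live tests can be sketched as follows. The helper takes the environment as a map so the check is testable; the real tests would pass `System.getenv()` (and likely sit behind JUnit assumptions rather than a bare boolean). The class and method names are hypothetical.

```java
import java.util.Map;

// Illustrative sketch: live-LLM tests run only when all three env vars
// named above are present and non-empty.
class LiveTestGate {
    static final String[] REQUIRED = {
        "DATA_AGENT_LLM_API_KEY",
        "DATA_AGENT_LLM_API_URL",
        "DATA_AGENT_LLM_MODEL"
    };

    /** True only when every required live-LLM variable is set and non-blank. */
    static boolean liveLlmConfigured(Map<String, String> env) {
        for (String name : REQUIRED) {
            String v = env.get(name);
            if (v == null || v.isEmpty()) {
                return false;   // missing or blank: skip the live tests
            }
        }
        return true;
    }
}
```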

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for test generation, code review, and PR formatting. Core design and implementation are human-authored.

Commit message (review thread on externals/kyuubi-data-agent-engine/pom.xml):
…re stack, OpenAI provider, and live E2E tests

This PR delivers the runtime layer of the Data Agent Engine on top of the tool
system and data source plumbing from 2a/4:

- ReactAgent: ReAct-style loop with streaming LLM responses, per-step tool
  dispatch, and AgentRunContext tracking token usage, iterations, and session.
- Middleware stack (AgentMiddleware + ReactAgent.Builder):
  * LoggingMiddleware -- structured per-step/LLM/tool/finish logs with MDC.
  * ApprovalMiddleware -- CompletableFuture-based resolve for DESTRUCTIVE
    tools; modes NORMAL / STRICT / AUTO_APPROVE.
  * CompactionMiddleware -- token-threshold-triggered history summarization
    with KEEP_RECENT_TURNS=4, emits a Compaction AgentEvent so clients can
    observe the mechanism firing.
  * ToolResultOffloadMiddleware -- spills large tool outputs to disk and
    surfaces `read_tool_output` / `grep_tool_output` companion tools for the
    LLM to re-query truncated previews.
- OpenAiProvider: single shared ReactAgent, per-session ConversationMemory,
  streaming chat completions, Hikari-pooled JDBC data source; reads model and
  thresholds from KyuubiConf.
- ExecuteStatement (Scala): encodes all AgentEvents (including compaction and
  approval_request) as SSE JSON rows streamed through the JDBC reply column.
- KyuubiConf: new keys for LLM provider/api-url/model/api-key, approval mode,
  compaction trigger tokens, offload root/thresholds, max iterations, etc.
- Tests:
  * Unit tests for runtime, middlewares, offload store, and event shapes.
  * Live tests gated on DATA_AGENT_LLM_API_KEY covering full LLM round-trips:
    ReactAgentLiveTest (offload+grep, approval approve/deny), DataAgentE2ESuite
    and DataAgentApprovalE2ESuite (JDBC layer), DataAgentCompactionE2ESuite
    (JDBC-observable compaction event + post-compaction recovery),
    CompactionMiddlewareLiveTest.
  * Compatibility verified against qwen3.6-plus, glm-5, and kimi-k2.5 via
    per-call `model=` logging in ReactAgent.
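The middleware contract described above can be condensed into a sketch. Hook names follow the ones this PR mentions (`onRegister`, `beforeLlmCall`, `afterLlmCall`, `afterToolCall`), but the signatures here are simplified stand-ins (plain strings instead of message/event classes), not the actual Kyuubi interface.

```java
import java.util.List;

// Illustrative middleware interface: every hook has a pass-through default,
// so a middleware overrides only what it needs.
interface AgentMiddlewareSketch {
    /** Called once when attached; may register companion tools (offload case). */
    default void onRegister(List<String> toolNames) {}

    /** May rewrite (e.g. compact) the history before each LLM call. */
    default List<String> beforeLlmCall(List<String> messages) { return messages; }

    /** Observes the assistant reply after each LLM call. */
    default void afterLlmCall(String assistantMessage) {}

    /** May rewrite a tool result, e.g. replace a large output with a preview. */
    default String afterToolCall(String toolName, String result) { return result; }
}

// Example in the spirit of CompactionMiddleware: keep only the most recent
// messages once the history grows past a budget (the real middleware
// summarizes via the LLM instead of dropping).
class TrimMiddleware implements AgentMiddlewareSketch {
    private final int keepRecent;
    TrimMiddleware(int keepRecent) { this.keepRecent = keepRecent; }

    @Override
    public List<String> beforeLlmCall(List<String> messages) {
        if (messages.size() <= keepRecent) return messages;
        return messages.subList(messages.size() - keepRecent, messages.size());
    }
}
```

Because each hook defaults to a no-op, middlewares compose by simple chaining in registration order.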
@wangzhigang1999 force-pushed the pr2b/data-agent-runtime branch from 3011909 to ce4eecc on April 23, 2026 at 02:45
MySQL Connector/J is GPL-licensed and cannot be bundled in an Apache
binary release. Users who need the MySQL/StarRocks datasource at runtime
should provide the driver jar themselves on the engine classpath.

Addresses review feedback on apache#7417.
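For reference, the test-scope change amounts to a dependency declaration along these lines. This is an illustrative fragment, assuming a version property; the exact coordinates and property name in the real pom.xml may differ.

```xml
<!-- GPL-licensed driver: test-scoped so it is never bundled in the
     Apache binary release; runtime users supply the jar themselves. -->
<dependency>
  <groupId>com.mysql</groupId>
  <artifactId>mysql-connector-j</artifactId>
  <version>${mysql.connector.version}</version>
  <scope>test</scope>
</dependency>
```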
Contributor Author

wangzhigang1999 commented Apr 23, 2026

Evidence: runtime under real workload

Supplementary evidence for this PR. The benchmark harness itself is kept on a local branch (pr2c/...) and is not part of this PR — the numbers below are cited only to show that the runtime shipped here behaves correctly under a non-trivial text-to-SQL workload.

TL;DR

  • End-to-end robustness. 500/500 BIRD mini-dev questions completed, gen_ok=100%, zero framework-level failures (no crashes, no stuck sessions, no rate-limit fallout at concurrency=8).
  • Accuracy (EX=64.80%, F1=68.72% with kimi-k2.5) is included as context — this PR is not a model or accuracy deliverable.

Setup

  • Runtime under test: ReactAgent + ToolResultOffloadMiddleware + CompactionMiddleware (this PR)
  • Tools: submit_sql, run_select_query; max_iterations=30
  • Model: kimi-k2.5 via dashscope OpenAI-compatible endpoint
  • Workload: BIRD mini-dev, 500 questions / 11 SQLite databases
  • Harness: internal, on local branch pr2c/... (not submitted)
  • Concurrency: 8
  • Run: benchmark-out/20260423-012636/

Results

Overall: 500 questions, gen_ok=100%, EX=64.80%, F1=68.72%, avg 8.9 steps / 43.6k tokens / 42s per question.

By difficulty:

| Difficulty  | N   | EX     | F1     | avg steps | avg tokens |
|-------------|-----|--------|--------|-----------|------------|
| simple      | 148 | 75.00% | 78.33% | 7.6       | 33k        |
| moderate    | 250 | 62.80% | 66.99% | 9.1       | 46k        |
| challenging | 102 | 54.90% | 59.01% | 10.2      | 53k        |

By database (sorted by EX):

| DB                      | N  | EX     | F1     |
|-------------------------|----|--------|--------|
| superhero               | 52 | 84.62% | 89.35% |
| european_football_2     | 51 | 76.47% | 79.48% |
| student_club            | 48 | 72.92% | 74.44% |
| toxicology              | 40 | 72.50% | 73.68% |
| codebase_community      | 49 | 69.39% | 74.17% |
| card_games              | 52 | 63.46% | 58.27% |
| debit_card_specializing | 30 | 60.00% | 62.78% |
| formula_1               | 66 | 54.55% | 62.13% |
| thrombosis_prediction   | 50 | 54.00% | 59.60% |
| financial               | 32 | 50.00% | 58.79% |
| california_schools      | 30 | 43.33% | 54.29% |

Cost: ~45 min wall time at concurrency=8, ~21M tokens total.

What this evidence supports

  • The runtime completes 500 non-trivial ReAct loops end-to-end without framework-level errors, at concurrency=8, against a live OpenAI-compatible provider. That's the load-bearing claim for this PR.
  • Difficulty scaling is monotonic (75% → 63% → 55%) — no pathological inversion that would suggest the runtime mishandles longer loops.

Scope disclaimers

  • Not a model benchmark. Numbers are bound to kimi-k2.5; other models will move them.
  • Not a leaderboard submission. Vanilla ReAct with the stock prompt; no schema-ranking, few-shot, or self-consistency. This is a framework viability floor, not SOTA.
  • Not a regression gate. LLM non-determinism is ±1pp run-to-run.
  • Harness not shipping with this PR. If you want to reproduce, ping me — the branch can be shared on request.

Follow-up: Spark backend run

Same harness, same 500 BIRD questions, but the agent targets a real EMR Kyuubi + Spark 3.5.3 cluster via jdbc:hive2://...:10009 instead of local SQLite. BIRD data was loaded from the SQLite files into Spark managed tables (Parquet on OSS-DLS) under 11 databases named bird_<db_id>.

| Metric              | SQLite (v2, middleware on) | Spark (v3) |
|---------------------|----------------------------|------------|
| Overall EX          | 64.80%                     | 60.80%     |
| F1                  | 68.72%                     | 59.70%     |
| simple              | 75.00%                     | 71.62%     |
| moderate            | 62.80%                     | 60.00%     |
| challenging         | 54.90%                     | 47.06%     |
| gen_ok              | 100%                       | 100%       |
| BadRequestException | 0                          | 0          |
| avg steps / q       | 8.9                        | 11.9       |
| avg tokens / q      | 44k                        | 62k        |
| wall-clock          | ~45 min                    | ~82 min    |

This setup is materially stricter than the official BIRD evaluation. BIRD pins the target db_id per example and only measures SQL generation quality against a known schema. Here the agent is not told which database to target — and, importantly, the Hive metastore on the cluster holds 23 databases (11 BIRD-loaded plus 12 unrelated production schemas: tpcds_*, parquet_db_*, orc_db_*, kyuubi_test, doctor_test, etc.). The agent has to disambiguate the right bird_<db_id> out of that mixed list every question, which BIRD explicitly does not measure. The 4pp EX drop (64.80% → 60.80%) is largely attributable to this multi-database disambiguation task. One observed failure mode: for a question about EUR/CZK currencies, the agent picked bird_financial instead of the correct bird_debit_card_specializing because the former name "looked more relevant" to the question's domain.

What this run confirms for the PR:

  • Runtime is datasource-agnostic. No changes to ReactAgent, no changes to any middleware, no changes to the provider — only the harness swapped JDBC URLs and plumbed a dialectName through the prompt builder.
  • Middleware scales to heavier prompts. Spark's richer schema-discovery round-trips push avg per-question prompt size from 44k to 62k tokens, yet zero BadRequest / zero framework errors, same as the SQLite run. The 128k compaction threshold still provides the load-bearing guard even though it happens not to trigger on any single question at this size.
  • Difficulty scaling stays monotonic (72% → 60% → 47%), same qualitative shape as SQLite — the runtime does not degrade non-linearly on harder questions when swapping backends.

@wangzhigang1999
Contributor Author

ReactAgent Execution Flow

ReactAgent.run(request, memory, eventConsumer)
│
├─ memory.addUserMessage(userInput)
├─ dispatchAgentStart  /  emit(AgentStart)
│
├─ for step in 1..maxIterations:
│    ├─ emit(StepStart)
│    │
│    ├─ messages = memory.buildLlmMessages()
│    ├─ messages = middleware.beforeLlmCall(messages)      ← may rewrite or abort the call
│    │
│    ├─ streamLlmResponse(messages)                        ← streaming + chunk accumulation
│    │     └─ emit(ContentDelta)*   (one per token)
│    │
│    ├─ emit(ContentComplete)
│    ├─ memory.addAssistantMessage(...)
│    ├─ middleware.afterLlmCall(...)
│    │
│    ├─ if no toolCalls → emit(StepEnd) + emit(AgentFinish) + return   ✅ normal termination
│    │
│    └─ executeToolCalls (3-phase pipeline):
│         ├─ Phase 1 (serial)   : parse args → beforeToolCall → approval gate
│         ├─ Phase 2 (parallel) : toolRegistry.submitTool(...) → futures
│         └─ Phase 3 (serial)   : future.join()
│                                 → afterToolCall (may rewrite result, e.g. offload)
│                                 → memory.addToolResult(...)
│                                 → emit(ToolResult)
│
├─ exceeded maxIterations → emit(AgentError) + emit(AgentFinish)
├─ exception thrown       → emit(AgentError) + emit(AgentFinish)
└─ finally: dispatchAgentFinish   ← guarantees middleware cleanup

@wangzhigang1999 wangzhigang1999 marked this pull request as ready for review April 23, 2026 06:21
@wangzhigang1999
Contributor Author

Hi @pan3793, when you have time, could I ask for a review on this one? 🙏

Third PR of the Data Agent Engine series (umbrella #7379, labeled 2b/4) — adds the ReactAgent runtime, middleware stack (logging / approval / compaction / tool-output offload), and OpenAiProvider. Sits on top of 2a (#7400) and is consumed by the final REST-layer PR.

It's on the larger side (~5.3k lines, almost all under externals/kyuubi-data-agent-engine/), so a high-level pass on the agent/middleware shape and session lifecycle is more than enough — happy to iterate on line-level feedback after. ReactAgentLiveTest exercises the full loop end-to-end if easier to poke at locally (needs DATA_AGENT_LLM_API_KEY / DATA_AGENT_LLM_API_URL / DATA_AGENT_LLM_MODEL).

No rush — thanks!
