Skip to content

[KYUUBI #7379][2a/4] Data Agent Engine: tool system, data source, and prompt templates#7400

Merged
yaooqinn merged 5 commits intoapache:masterfrom
wangzhigang1999:pr2a/data-agent-tools
Apr 21, 2026
Merged

[KYUUBI #7379][2a/4] Data Agent Engine: tool system, data source, and prompt templates#7400
yaooqinn merged 5 commits intoapache:masterfrom
wangzhigang1999:pr2a/data-agent-tools

Conversation

@wangzhigang1999
Copy link
Copy Markdown
Contributor

@wangzhigang1999 wangzhigang1999 commented Apr 13, 2026

Why are the changes needed?

Part 2a of 4 for the Data Agent Engine (umbrella, KPIP-7373).

This PR adds the tool system, data source abstraction, and composable prompt builder — the infrastructure that the agent runtime (PR 2b) will use to execute SQL and interact with the LLM.

Changes include:

  • AgentTool interface with ToolRiskLevel, JSON schema generation for LLM function calling
  • ToolRegistry — thread-safe tool registration, dispatch, timeout enforcement, and OpenAI-compatible tool definition export
  • RunSelectQueryTool / RunMutationQueryTool — read-only vs. mutation SQL execution, output truncation, and SqlReadOnlyChecker
  • SqlExecutor — shared JDBC execution logic with statement timeout and result formatting
  • DataSourceFactory — HikariCP connection pool creation with optional user/password
  • JdbcDialect — auto-detection from JDBC URL with dialect-specific identifier quoting (Spark, MySQL, SQLite, Trino, generic fallback)
  • TableRef — catalog/schema/table reference with JSON deserialization support
  • SystemPromptBuilder — composable Markdown prompt assembly with date injection, per-dialect datasource sections, and free-form text sections
  • Prompt templates: base.md, datasource-{mysql,spark,sqlite,trino}.md
  • New kyuubi.engine.data.agent.tool.* and kyuubi.engine.data.agent.datasource.* configuration entries

How was this patch tested?

  • Unit tests (Java): JdbcDialectTest, TableRefTest, DataSourceFactoryAuthTest, SqlReadOnlyCheckerTest, RunSelectQueryToolTest, RunMutationQueryToolTest, ToolTest, ToolRegistryThreadSafetyTest, ToolSchemaGeneratorTest, SystemPromptBuilderTest
  • MySQL integration tests (Testcontainers): DataSourceFactoryTest, DialectTest, RunSelectQueryTest, RunMutationQueryTest, ToolExecutionTest — all run against a real MySQL container
  • Total: 164 JUnit tests + 21 ScalaTest tests, all passing

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.6) for test generation, code review, and PR formatting. Core design and implementation are human-authored.

wangzhigang1999 and others added 2 commits April 13, 2026 16:52
…urce, and prompt templates

Tool system with risk-based separation (RunSelectQueryTool / RunMutationQueryTool),
ToolRegistry with JSON schema generation, and SqlReadOnlyChecker keyword whitelist.

Data source abstraction with JdbcDialect auto-detection (Spark/Trino/MySQL/SQLite),
GenericDialect fallback for unknown JDBC subprotocols, TableRef value object for
structured table references with Jackson deserialization support, and HikariCP-backed
DataSourceFactory with credential isolation.

Composable SystemPromptBuilder with per-dialect prompt templates (base.md +
datasource-{name}.md), SQL workflow guidance, and query risk classification.
… remove jdbcUrl shortcut

- SystemPromptBuilder.datasource() now replaces the previous datasource section
  instead of appending, matching the single-datasource-per-session model
- Remove jdbcUrl() convenience method; callers use JdbcDialect.fromUrl() directly
- Remove redundant tests (pool config defaults, toString, tool metadata, fast timeout)
- Clean up todo comment in JdbcDialect

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wangzhigang1999 wangzhigang1999 marked this pull request as ready for review April 13, 2026 11:01
@wangzhigang1999
Copy link
Copy Markdown
Contributor Author

Hi @pan3793, could you help review this PR when you have time? This is Part 2a of the Data Agent Engine series — it adds the tool system, data source abstraction, and prompt builder. Thanks! 🙏

Register the three Apache-2.0 libraries bundled by the data-agent
engine (openai-java, victools jsonschema-generator and
jsonschema-module-jackson) so the binary distribution's LICENSE
file reflects the dependencies it ships.
@github-actions github-actions Bot added the kind:infra license, community building, project builds, asf infra related, etc. label Apr 17, 2026
@yaooqinn yaooqinn requested a review from Copilot April 20, 2026 03:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (KYUUBI #7379, part 2a/4) introduces the Data Agent Engine’s foundational “tool + datasource + prompt” infrastructure that later PRs will use to run SQL via JDBC and interact with an LLM using OpenAI-style function calling.

Changes:

  • Add tool SPI (AgentTool), registry (ToolRegistry) with JSON-schema generation, and SQL tools (run_select_query, run_mutation_query).
  • Add datasource abstraction (JdbcDialect, TableRef, DataSourceFactory) and dialect-specific prompt templates.
  • Expand configuration/docs (timeouts) and add extensive unit + integration tests (SQLite + MySQL via Testcontainers).

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pom.xml Adds dependency/version management for OpenAI SDK, victools JSON schema generator, and MySQL Testcontainers.
kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala Refines SQL timeout config doc and adds tool call timeout config.
docs/configuration/settings.md Updates Data Agent timeout settings documentation and adds tool call timeout setting.
LICENSE-binary Adds third-party notices for new dependencies.
externals/kyuubi-data-agent-engine/pom.xml Adds module dependencies for OpenAI SDK + schema generator + MySQL integration testing.
externals/kyuubi-data-agent-engine/src/main/scala/org/apache/kyuubi/engine/dataagent/operation/ExecuteStatement.scala Emits tool args as structured JSON in SSE events.
externals/kyuubi-data-agent-engine/src/main/resources/prompts/base.md Adds base system prompt template and tool usage rules.
externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-mysql.md Adds MySQL-specific prompt guidance.
externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-spark.md Adds Spark SQL prompt guidance.
externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-sqlite.md Adds SQLite-specific prompt guidance.
externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-trino.md Adds Trino-specific prompt guidance.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/AgentTool.java Defines the tool interface used by the agent runtime.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/ToolRegistry.java Implements tool registration, OpenAI-tool export, JSON args deserialization, and timeout enforcement.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/ToolSchemaGenerator.java Generates JSON Schema for tool args via victools + Jackson annotations.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlQueryArgs.java Defines shared SQL tool args schema (with JSON annotations).
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlExecutor.java Shared JDBC execution + markdown formatting for SQL tool outputs.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlReadOnlyChecker.java Adds a keyword-based “read-only” gate for run_select_query.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunSelectQueryTool.java Implements read-only SQL tool with read-only guard.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunMutationQueryTool.java Implements destructive SQL tool for DML/DDL, marked destructive.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/prompt/SystemPromptBuilder.java Composes system prompts from markdown templates + optional sections.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/DataSourceFactory.java Creates HikariCP pooled DataSources with optional credentials.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/JdbcDialect.java Detects dialect from JDBC URL and provides identifier quoting/qualification.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/GenericDialect.java Fallback dialect carrying subprotocol name without quoting support.
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/SparkDialect.java Spark dialect implementation (backticks).
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/MysqlDialect.java MySQL dialect implementation (backticks).
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/SqliteDialect.java SQLite dialect implementation (double quotes).
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/TrinoDialect.java Trino dialect implementation (double quotes).
externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/TableRef.java Immutable catalog/schema/table reference with JSON support.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlReadOnlyCheckerTest.java Unit tests for read-only SQL detection.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunSelectQueryToolTest.java SQLite-backed tests for select tool behavior/formatting/timeouts.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunMutationQueryToolTest.java SQLite-backed tests for mutation tool behavior/risk level.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/ToolTest.java Tests registry tool export, dispatch, timeouts, and error handling.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/ToolSchemaGeneratorTest.java Tests schema generation and type/annotation mapping.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/ToolRegistryThreadSafetyTest.java Concurrency tests for registry register/access/execute.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/prompt/SystemPromptBuilderTest.java Tests prompt composition and datasource template selection.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/WithMySQLContainer.java Shared MySQL Testcontainers harness for integration tests.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/ToolExecutionTest.java Registry-level MySQL integration tests (dispatch/timeout/errors/workflows).
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/RunSelectQueryTest.java MySQL integration tests for select tool.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/RunMutationQueryTest.java MySQL integration tests for mutation tool.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/DialectTest.java MySQL integration tests for dialect detection/quoting + prompt integration.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/DataSourceFactoryTest.java MySQL integration tests for DataSourceFactory pooling/auth failures.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/datasource/TableRefTest.java Unit tests for TableRef construction/equality/JSON.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/datasource/JdbcDialectTest.java Unit tests for dialect detection/quoting/qualification.
externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/datasource/DataSourceFactoryAuthTest.java SQLite-backed tests for DataSourceFactory credential handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…labels, config validation

- ToolRegistry: replace unbounded cached thread pool with bounded
  ThreadPoolExecutor (MAX=8, SynchronousQueue, reject-fast); implement
  AutoCloseable so the engine can shut the pool down cleanly
- SqlExecutor: use ResultSetMetaData.getColumnLabel so SQL AS aliases
  render in the markdown header the LLM sees
- KyuubiConf: add checkValue(>= 1s) on the two data-agent timeout
  entries; units stay consistent with existing timeConf conventions
- Tests: new MySQL integration test covering the column-alias fix;
  MySQL harness now uses ObjectMapper for tool args and closes the
  registry in teardown
…ail rationale

The previous design note argued against SQL parsers by claiming they fail
on dialect-specific queries — which isn't accurate for Calcite Babel with a
matching SqlDialect. Replace it with the real rationale: the checker is a
guardrail against LLM misrouting, not a security boundary, because the
approval gate for writes is the separate run_mutation_query tool. Also drop
the now-redundant Postgres EXPLAIN ANALYZE caveat from the keyword list.
@wangzhigang1999
Copy link
Copy Markdown
Contributor Author

Thanks @yaooqinn for triggering the Copilot review. Addressed the feedback in ab5b658 + 9245098:

Fixed

  • SqlExecutor: switched to ResultSetMetaData.getColumnLabel so SQL AS aliases show up correctly in the markdown header the LLM sees.
  • ToolRegistry: replaced unbounded newCachedThreadPool with a bounded ThreadPoolExecutor (max 8, SynchronousQueue, reject-fast) and implemented AutoCloseable for clean shutdown.
  • KyuubiConf: added checkValue(>= 1s) on both data-agent timeout entries; units stay consistent with existing timeConf conventions.
  • MySQL test harness: tool-args JSON now built via ObjectMapper instead of manual string concat.

Kept as-is (by design)

  • SqlReadOnlyChecker first-token whitelist. This is a guardrail against LLM misrouting, not a security boundary. Writes go through the separate run_mutation_query tool, and an LLM with DML intent calls that directly. The first-token check catches the realistic failure mode (LLM hallucinates a mutation into the read-only tool). A real SQL parser (e.g. Calcite Babel) would be stronger and worth considering as a follow-up — not urgent for 2a.
  • SqlExecutor returns full result set; no client-side row cap. Output-size control belongs in the context layer above the tool, not inside the tool itself. PR description updated to reflect this.
  • SystemPromptBuilderTest uses LocalDate.now() directly. The two now() calls sit sub-millisecond apart; the midnight-UTC race is theoretical. Not worth widening SystemPromptBuilder's public API with a Clock seam just to close it.

PTAL when you have time — thanks!

cc @pan3793

@yaooqinn yaooqinn merged commit cca9581 into apache:master Apr 21, 2026
37 of 39 checks passed
*
* <ul>
* <li>{@code SELECT} — standard query
* <li>{@code WITH} — CTE-prefixed query
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only checking the firstToken may not be enough, like:

FROM t1
INSERT INTO t2
SELECT c1, c2;


WITH tmp AS (SELECT * FROM t1)
INSERT INTO t2 SELECT * FROM tmp;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thank you @wForget — both of these are legitimate Spark/Hive bypasses. The whitelist has FROM (for the FROM t SELECT ... read variant) and WITH (for SELECT CTEs), and your examples exploit exactly those entries.

I will open a dedicated follow-up issue to track this and commit to closing the gap there. Our practical observation is that the realistic risk is moderate: modern LLMs rarely produce FROM t INSERT ... or WITH ... INSERT on their own — those patterns are thin in training data, and the tool descriptions + run_mutation_query split already steer the model toward the straightforward INSERT ... SELECT shape. The exposure is mostly when a user pastes such a statement into the conversation verbatim. Still, a tool named "read-only" should actually be read-only for the configured dialects, and we intend to fix this properly. For now, we'd like to keep the data-agent series moving and handle the hardening in the tracked follow-up.

@pan3793
Copy link
Copy Markdown
Member

pan3793 commented Apr 21, 2026

@yaooqinn, use the merge script instead of the merge button, update Assignees, Milestone, when merging PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:build kind:documentation Documentation is a feature! kind:infra license, community building, project builds, asf infra related, etc. module:common

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants