[Feature][CLI] Add SeaTunnel CLI for natural language config generation #10789

Open
SEZ9 wants to merge 1 commit into apache:dev from SEZ9:feature/seatunnel-cli

@SEZ9
Collaborator

SEZ9 commented Apr 19, 2026

Purpose of this pull request

Add seatunnel-cli, a Python CLI tool that generates Apache SeaTunnel HOCON pipeline
configurations from natural language descriptions (English and Chinese).

Key capabilities:

  • Multi-agent pipeline: Planner → Config Generator → Validator → Auto-fix (up to 3 rounds)
  • 100 connectors: Auto-generated metadata catalog from Java source (*Factory.java, *Options.java) with 1200+ option definitions and
    inheritance chain resolution
  • Multi-provider LLM support: AWS Bedrock, Anthropic API, OpenAI (and compatible APIs)
  • Three-tier knowledge base: Runtime REST API → auto-generated catalog → keyword routing
  • Dry-run validation: Local HOCON syntax check + engine --check + REST API validation
  • Auto-save: Generated configs automatically saved to ~/.seatunnel/last_job.conf
  • Auto-fix on failure: /check and /run failures trigger LLM-powered diagnosis and config repair
  • Session & memory: Persistent conversation sessions and connection detail memory across sessions
  • Interactive & single-shot modes: REPL for exploration, one-liner for scripting
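
The generate, validate, and auto-fix stages listed above can be pictured as a bounded retry loop. The following is only a minimal sketch under stated assumptions: `run_pipeline`, `PipelineResult`, and the three `fake_*` callables are hypothetical names, the Planner stage is omitted, and the real `Orchestrator` in `agents.py` may be structured differently. Only the limit of 3 fix rounds comes from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_FIX_ROUNDS = 3  # the PR description says "up to 3 rounds"


@dataclass
class PipelineResult:
    config: str
    errors: List[str] = field(default_factory=list)
    fix_rounds: int = 0


def run_pipeline(
    request: str,
    generate: Callable[[str], str],
    validate: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    max_rounds: int = MAX_FIX_ROUNDS,
) -> PipelineResult:
    """Generate a config, then validate and repair it at most max_rounds times."""
    config = generate(request)
    errors = validate(config)
    rounds = 0
    while errors and rounds < max_rounds:
        config = fix(config, errors)
        errors = validate(config)
        rounds += 1
    # Any errors still present here were not repaired within the budget.
    return PipelineResult(config=config, errors=errors, fix_rounds=rounds)


# Hypothetical stand-ins so the sketch runs without an LLM.
def fake_generate(request: str) -> str:
    return "source { FakeSource { } }"  # deliberately has no sink block


def fake_validate(config: str) -> List[str]:
    return [] if "sink {" in config else ["missing sink block"]


def fake_fix(config: str, errors: List[str]) -> str:
    return config + "\nsink { Console { } }"


result = run_pipeline("FakeSource to Console", fake_generate, fake_validate, fake_fix)
print(result.fix_rounds, result.errors)  # prints: 1 []
```

Returning the remaining errors instead of raising lets the caller decide whether to save a config that still fails validation after the last round.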

Usage:

cd seatunnel-cli
pip install -e ".[bedrock]"
seatunnel  # interactive mode
seatunnel "Sync MySQL users table to S3 Parquet" -o job.conf  # single-shot

Does this PR introduce any user-facing change?

Yes. This adds a new seatunnel-cli module (Python) as a standalone tool alongside the existing Java codebase. It introduces:

- seatunnel CLI command for natural language config generation
- Interactive REPL with /save, /check, /run, /connectors, /memory commands
- Built-in connector catalog (connector_catalog.json) with 100 connectors
- --sync-catalog option to regenerate catalog from SeaTunnel Java source

No changes to existing Java modules. The CLI is fully self-contained under seatunnel-cli/.
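
As a rough illustration of what regenerating the catalog from Java source involves, the sketch below pulls the constant names out of a `.required(...)` call with a regular expression. It is an assumption about the approach, not the PR's actual generator: `required_constants` is a hypothetical helper, the sample input mirrors the `JdbcSinkFactory.optionRule()` excerpt quoted later in this thread, and a real generator would still have to resolve each constant to its option key through the matching `*Options.java` class.

```python
import re
from typing import List

# Non-greedy match up to the first ")" after ".required(".
# This assumes the argument list contains no nested parentheses.
REQUIRED_CALL = re.compile(r"\.required\(\s*(.*?)\)", re.DOTALL)


def required_constants(java_source: str) -> List[str]:
    """Return constant names (without the class prefix) from .required(...) calls."""
    names: List[str] = []
    for args in REQUIRED_CALL.findall(java_source):
        for arg in args.split(","):
            arg = arg.strip()
            if arg:
                # "JdbcSinkOptions.URL" -> "URL"
                names.append(arg.rsplit(".", 1)[-1])
    return names


SAMPLE = """
.required(
    JdbcSinkOptions.URL,
    JdbcSinkOptions.DRIVER,
    JdbcSinkOptions.SCHEMA_SAVE_MODE,
    JdbcSinkOptions.DATA_SAVE_MODE)
"""

print(required_constants(SAMPLE))
# prints: ['URL', 'DRIVER', 'SCHEMA_SAVE_MODE', 'DATA_SAVE_MODE']
```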

How was this patch tested?

- Manual testing of interactive mode and single-shot mode with multiple LLM providers (Bedrock, OpenAI)
- Connector catalog generation tested against current dev branch: 100 connectors, 1273 options, 99.5% option resolution rate (7 unresolved out of
1273)
- Dry-run validation tested with local HOCON parsing and engine --check mode
- Auto-fix loop tested with intentionally broken configs (missing required fields, wrong option names)
- Session persistence and memory store tested across multiple CLI sessions
- Tested with both English and Chinese natural language inputs

Check list

- If any new Jar binary package is added in your PR, please add a License Notice according to
https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md - N/A (Python module, no Jar packages)
- If necessary, please update the documentation to describe the new feature. - README.md included in seatunnel-cli/
- If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR. - N/A (new module, no breaking changes)
- If you are contributing connector code, please check that the following files are updated: N/A (not a connector)

---

Add SeaTunnel CLI for natural language config generation
@davidzollo
Contributor

The repository CI uses apache/skywalking-eyes@v0.5.0 to perform license-header checks; .licenserc.yaml ignores *.md, *.json, and .gitignore, but does not ignore *.toml and *.sh. The setup.sh file in this PR contains an ASF header, while pyproject.toml and env.example.sh do not.

Suggestion: Add an ASF header in TOML comment format to pyproject.toml; keep the shebang on the first line of env.example.sh, and add the ASF header starting from the second line, following the style of setup.sh.

@nzw921rx
Collaborator

nzw921rx commented Apr 20, 2026

@SEZ9 please enable CI by following the instructions at https://github.com/apache/seatunnel/pull/10789/checks?check_run_id=72041491227

@davidzollo
Contributor

davidzollo commented Apr 20, 2026

Amazing! This is a very innovative feature.

Thanks for the contribution. I tested this PR locally with a real OpenAI-compatible provider, using DeepSeek (OPENAI_BASE_URL=https://api.deepseek.com, OPENAI_MODEL=deepseek-chat).

The good news is that the basic CLI flow can work: provider initialization, planner/config/validator flow, static catalog loading, config rendering, auto-save, and a simple FakeSource -> Console generation all completed successfully. The generated FakeSource -> Console config also passed validate_hocon.

However, I found several issues that should be fixed before merge.

1. Source/sink option rules are incorrectly merged for connectors with the same name

This is the most important functional issue.

For Jdbc, the generated catalog currently merges source and sink rules into one connector entry:

types = [source, sink]
required = [url, driver, schema_save_mode, data_save_mode]

But according to the actual Java source, schema_save_mode and data_save_mode are sink-only options.

JdbcSourceFactory.optionRule() only requires:

.required(JdbcSourceOptions.URL, JdbcSourceOptions.DRIVER)

while JdbcSinkFactory.optionRule() requires:

.required(
    JdbcSinkOptions.URL,
    JdbcSinkOptions.DRIVER,
    JdbcSinkOptions.SCHEMA_SAVE_MODE,
    JdbcSinkOptions.DATA_SAVE_MODE)

Because the catalog merges both factories by connector name, a normal Jdbc source job is incorrectly treated as missing sink-only options. In my real DeepSeek test, the CLI repeatedly entered the fix loop and finally generated this invalid source config:

source {
  Jdbc {
    url = "jdbc:mysql://localhost:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    username = "${MYSQL_USER}"
    password = "${MYSQL_PASSWORD}"
    query = "SELECT id, name FROM users"

    schema_save_mode = "DISABLED"
    data_save_mode = "DISABLED"
  }
}

This is not a valid semantic fix. The CLI only produced it because its own catalog validation was wrong.

Suggested fix:

  • Store catalog details by (plugin_type, connector_name), not only by connector_name.
  • Validate source.Jdbc against source rules only.
  • Validate sink.Jdbc against sink rules only.
  • Keep the compact display index if desired, but do not use merged required options for validation or LLM tool details.
  • Add a regression test for source { Jdbc { url, driver, query } } to ensure it does not require schema_save_mode / data_save_mode.
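
The suggested fix can be sketched as follows. This is a minimal illustration, not the PR's catalog code: `CATALOG` and `missing_required` are hypothetical names, and the required sets are copied from the two `optionRule()` excerpts quoted above.

```python
from typing import Dict, List, Set, Tuple

CatalogKey = Tuple[str, str]  # (plugin_type, connector_name)

# Required options per factory, taken from the JdbcSourceFactory and
# JdbcSinkFactory excerpts in this comment. Everything else is illustrative.
CATALOG: Dict[CatalogKey, Set[str]] = {
    ("source", "Jdbc"): {"url", "driver"},
    ("sink", "Jdbc"): {"url", "driver", "schema_save_mode", "data_save_mode"},
}


def missing_required(plugin_type: str, connector: str, options: Dict[str, object]) -> List[str]:
    """Return required options absent from `options`, using only the rules for this plugin type."""
    key = (plugin_type, connector)
    if key not in CATALOG:
        raise ValueError(f"no catalog entry for {plugin_type}.{connector}")
    return sorted(CATALOG[key].difference(options))


# Regression case: a plain Jdbc source must not be asked for sink-only options.
source_options = {
    "url": "jdbc:mysql://localhost:3306/test",
    "driver": "com.mysql.cj.jdbc.Driver",
    "query": "SELECT id, name FROM users",
}
print(missing_required("source", "Jdbc", source_options))  # prints: []

# The same options used as a sink are still incomplete.
print(missing_required("sink", "Jdbc", source_options))
# prints: ['data_save_mode', 'schema_save_mode']
```

Keying on the pair keeps a merged display index possible (group by connector name for listing) while validation always looks up exactly one factory's rules.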

2. Basic installation path does not start the CLI successfully

pip install ./seatunnel-cli succeeds, but running seatunnel fails because the default provider is bedrock and boto3 is not part of the base dependencies.

Users must install one of the extras, for example:

pip install -e ".[bedrock]"
pip install -e ".[openai]"
pip install -e ".[anthropic]"

Suggested fix:

  • Either include the default provider SDK in base dependencies, or
  • make the CLI fail gracefully with a clear message instead of a stack trace, or
  • require --provider and document that base install alone is not enough.
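
The second option could look roughly like the sketch below, which checks for the provider SDK before anything imports it. The `PROVIDER_SDKS` mapping and both function names are assumptions for illustration; only the extras names (`bedrock`, `openai`, `anthropic`) and the `boto3` dependency come from this thread.

```python
import importlib.util
import sys
from typing import Optional

# provider -> (importable module, pip extra). Hypothetical mapping; the
# extras match the install commands shown above.
PROVIDER_SDKS = {
    "bedrock": ("boto3", "bedrock"),
    "openai": ("openai", "openai"),
    "anthropic": ("anthropic", "anthropic"),
}


def missing_sdk_message(provider: str) -> Optional[str]:
    """Return a user-facing message if the provider cannot be used, else None."""
    if provider not in PROVIDER_SDKS:
        choices = ", ".join(sorted(PROVIDER_SDKS))
        return f"Unknown provider '{provider}'. Choose one of: {choices}."
    module, extra = PROVIDER_SDKS[provider]
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module) is None:
        return (
            f"Provider '{provider}' needs the '{module}' package, which is not installed.\n"
            f"Install it with: pip install 'seatunnel-cli[{extra}]'"
        )
    return None


def ensure_provider_available(provider: str) -> None:
    """Exit with a readable message (status 1) instead of an ImportError traceback."""
    message = missing_sdk_message(provider)
    if message is not None:
        sys.exit(message)
```

Calling `ensure_provider_available` at the top of the CLI entry point would turn the current failure into a one-line hint. The distribution name `seatunnel-cli` in the hint is assumed from the directory name and should be checked against `pyproject.toml`.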

3. ASF license headers are missing in new files

seatunnel-cli/pyproject.toml and seatunnel-cli/env.example.sh do not have ASF license headers. The repository SkyWalking Eyes config does not ignore .toml or .sh, so the license-header check is likely to fail.

Suggested fix:

  • Add ASF header to pyproject.toml.
  • Keep shebang as the first line in env.example.sh, then add ASF header below it.
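
Before pushing, a quick local pre-check along these lines can catch the two files. It is only a sketch and does not replace SkyWalking Eyes: `has_asf_header` is a hypothetical helper that looks for the opening line of the standard ASF header near the top of a file, skipping a shebang if one is present. The sample file contents are made up.

```python
# Opening words of the standard ASF source header.
ASF_MARKER = "Licensed to the Apache Software Foundation (ASF)"


def has_asf_header(text: str, scan_lines: int = 5) -> bool:
    """True if the ASF marker appears in the first few lines after an optional shebang."""
    lines = text.splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]  # the shebang must stay first, so the header starts on line 2
    return any(ASF_MARKER in line for line in lines[:scan_lines])


# Illustrative contents, not the real files from the PR.
shell_with_header = (
    "#!/usr/bin/env bash\n"
    "# Licensed to the Apache Software Foundation (ASF) under one or more\n"
    "# contributor license agreements.\n"
    "export EXAMPLE=1\n"
)
toml_without_header = '[project]\nname = "example"\n'

print(has_asf_header(shell_with_header))    # prints: True
print(has_asf_header(toml_without_header))  # prints: False
```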

@DanielLeens

Hi @SEZ9, thanks for putting together such an ambitious CLI prototype. I pulled the branch locally as seatunnel-review-10789 and reviewed it against the current dev baseline.

Runtime path I checked:

seatunnel console command
  -> seatunnel_cli.cli.main()
      -> Orchestrator in agents.py
          -> LLMProvider / connector catalog / memory store
          -> generated HOCON config
      -> /check or /run
          -> local seatunnel.sh --check / --config, or REST validation through SEATUNNEL_API_BASE

The idea is valuable, but I do not think this PR is ready to merge yet. A few items need to be closed first:

  1. The GitHub Build check is ACTION_REQUIRED, so the PR has not gone through the required project validation yet.
  2. This adds a new Python package and LLM runtime under seatunnel-cli/, with dependencies declared in seatunnel-cli/pyproject.toml (rich, prompt-toolkit, pyhocon, and optional boto3, anthropic, openai). Since this becomes part of the Apache source/release surface, we need the dependency/license/release integration to be explicit rather than living only as an isolated subdirectory.
  3. Some new non-code package files still need Apache source-release treatment. For example seatunnel-cli/pyproject.toml and seatunnel-cli/env.example.sh start without the ASF header pattern used by the new Python sources. The current CI action-required state may be related, but either way the release checks must be clean.
  4. seatunnel-cli/seatunnel_cli/connector_catalog.json is a large generated catalog checked into the repo. Please add a reproducible generation/update story in CI or tests, otherwise it can silently drift from the Java connector Option definitions and generate configs that no longer match the engine.
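
For item 4, one possible shape of a drift check is sketched below: regenerate the catalog in CI and compare it with the checked-in file. The layout of `connector_catalog.json` is not shown in this thread, so the sketch assumes a top-level mapping from connector name to its definition; `catalog_drift` and the sample data are hypothetical.

```python
import json
from typing import Dict, List


def catalog_drift(checked_in: Dict[str, dict], regenerated: Dict[str, dict]) -> List[str]:
    """List the connectors whose checked-in entry no longer matches a fresh generation."""
    problems: List[str] = []
    for name in sorted(set(checked_in) | set(regenerated)):
        if name not in regenerated:
            problems.append(f"{name}: in checked-in catalog but not regenerated")
        elif name not in checked_in:
            problems.append(f"{name}: regenerated but missing from checked-in catalog")
        elif checked_in[name] != regenerated[name]:
            problems.append(f"{name}: definitions differ")
    return problems


# Made-up data standing in for the two JSON files.
checked_in = json.loads('{"Jdbc": {"required": ["url", "driver"]}, "Console": {"required": []}}')
regenerated = json.loads('{"Jdbc": {"required": ["url"]}, "Console": {"required": []}, "Kafka": {"required": []}}')

for line in catalog_drift(checked_in, regenerated):
    print(line)
# prints:
# Jdbc: definitions differ
# Kafka: regenerated but missing from checked-in catalog
```

A CI job or unit test could fail whenever the returned list is non-empty, which would force the catalog to be refreshed in the same change that alters a connector's options.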

Conclusion: can merge after fixes

This is a promising direction, but before merge I would like to see CI enabled and green, source-release/dependency handling made explicit, and the static catalog generation path made reproducible. Happy to review the next revision; the feature idea itself is exciting.

Contributor

chl-wxp left a comment

This PR introduces a lot of functionality at once (CLI, multi-agent pipeline, catalog generation, validation, execution, memory, etc.), which makes it quite heavy to review and reason about.

It might be more practical to start with a minimal viable workflow:

  • interactive or one-shot CLI
  • natural language → HOCON config generation

without validation, engine integration, or metadata catalog for now.

This helps establish the core value (“NL → config”) first, and keeps the initial scope small and easier to review.

Then we can layer additional capabilities (validation, catalog, execution, auto-fix, etc.) in separate PRs.

This incremental approach would reduce risk and improve maintainability.

Contributor

This can rely on the existing inspection capabilities: https://github.com/apache/seatunnel/pull/10763/changes

@DanielLeens

Hi @SEZ9, I rechecked the current PR head locally as seatunnel-review-10789 at 072da0ccf65f. I reviewed the full diff against upstream/dev and did not run local Maven/tests in this batch; this is a source-level review.

The new CLI path is outside the Java engine runtime but becomes part of the Apache source/release surface:

seatunnel console command
  -> seatunnel_cli.cli.main()
  -> Orchestrator / LLMProvider / connector catalog / memory store
  -> generated HOCON config
  -> local seatunnel.sh --check/--config or REST validation

The idea is useful, but the PR still needs release-quality integration work. Fetched metadata reports Build: ACTION_REQUIRED. The new Python package, optional LLM dependencies, source-release headers, and generated connector catalog need explicit handling so this does not become a drifting sidecar tool.

Conclusion: can merge after fixes

Blocking items:

  1. Get the required Build validation enabled and green.
  2. Make dependency/license/source-release treatment explicit for the new Python package files.
  3. Add a reproducible generation/update path for connector_catalog.json so it stays aligned with Java connector options.
