Adding hudi-cli-mcp to assist AI assistants to help and manage hudi tables #18554

nsivabalan wants to merge 1 commit into apache:master
Conversation
Sample commands that I was able to try out:

❯ connect to /tmp/trips_table
⏺ Here's what's at /tmp/trips_table: [table output omitted]
  Partitions: chennai, san_francisco, sao_paulo
  Timeline: 5 completed commits + 1 inflight, plus 2 clean actions.
  What would you like to do with this table? I can help with inspecting commits, reading data, analyzing the timeline, running CLI commands, etc.

❯ analyze the timeline
⏺ Timeline Analysis for /tmp/trips_table
  Schema fields: ts (long), uuid (string), rider (string), driver (string), fare (double), city (string)
  Commit History, Clean Actions: [tables omitted]
  File Lineage: original file group (commits 1-5, one file per partition); after clustering, commit 6 replaced all original files with new ones. [tables omitted]
  Key Observations: [omitted]
  Would you like me to dig deeper into the inflight commit, inspect the actual parquet data, or check the archived timeline?

❯ timeline show incomplete
⏺ There's one incomplete instant on the timeline: [table omitted]
  Details from the inflight metadata: [omitted]
  This looks like a failed or interrupted upsert that was left dangling. If you want to recover the table, this instant would need to be rolled back (commit rollback).

❯ metadata timeline show incomplete
⏺ The metadata table timeline has no incomplete instants. All deltacommits are fully completed (each has .requested, .inflight, and .deltacommit). [table omitted]
  Notable: the inflight data table commit (20260422001401481) has no corresponding entry in the metadata table timeline, meaning the metadata table was never updated for that failed commit.

❯ show fsview latest
⏺ Bash(for partition in chennai san_francisco sao_paulo; do echo "=== $partition ==="; ls -la /tmp/trips_table/$partition/*.parquet 2>/dev/null; done)
⏺ Latest File System View
  The latest valid file slice per file group is determined by the last completed commit. Since 20260422001401481 is inflight (not committed), those files are not valid for reads.
  chennai (1 record): [table omitted]; stale slices still on disk (not yet cleaned); inflight (invalid): eec5aa74-..._20260422001401481.parquet, 0 bytes (empty file)
  san_francisco (3 records): [table omitted]; stale slices still on disk; inflight (invalid): 6014f1fa-..._20260422001401481.parquet (436 KB, data written but not committed)
  sao_paulo (1 record): [table omitted]; stale slices still on disk; inflight (invalid): bc88c4f3-..._20260422001401481.parquet (436 KB)
  Summary: [omitted]

❯ Can you not trim or shorten the content while printing. I could not see the full file path or file group id.
⏺ Understood. Let me redo that with full values.
  Latest File System View
  chennai (1 record): active file group eec5aa74-7eab-42e9-a67a-37c354dd9288-0; replaced file group 30ae9920-ee63-40cb-9527-64661b63dcbb-0 (replaced by clustering at 20260422001330433)
  san_francisco (3 records): active file group 6014f1fa-25f1-41cf-b91f-91737c25691a-0; replaced file group 415dbece-a4c1-42e3-9eae-c9b871634675-0 (replaced by clustering at 20260422001330433)
  sao_paulo (1 record): active file group bc88c4f3-b6af-4928-8b9a-8f23bd776c14-0; replaced file group b0b31443-70a7-4123-9c06-4233b57997ac-0 (replaced by clustering at 20260422001330433)
  Summary: [omitted]

❯ can you show lineage for file group 30ae9920-ee63-40cb-9527-64661b63dcbb-0
⏺ File Group Lineage: 30ae9920-ee63-40cb-9527-64661b63dcbb-0 (partition: chennai) [table omitted]
  Successor file group: eec5aa74-7eab-42e9-a67a-37c354dd9288-0 (created by replacecommit/CLUSTER at 20260422001330433)
  Cleaned file slices, and slices still on disk (not yet cleaned): [omitted]
  Lifecycle: created at commit 20260422001020750 with 1 insert, received 4 successive upserts (1 record updated each time), then replaced by clustering at 20260422001330433.

❯ stats filesizes
⏺ Bash(for f in $(find /tmp/trips_table -name "*.parquet" -not -path "*/.hoodie/*" | sort); do stat -f "%z" "$f" | tr '\n' ' '; echo "$f"; done)
⏺ File Sizes, Summary Statistics, Per-Partition Breakdown: [tables omitted]
  Note: excluding the 0-byte inflight file in chennai, all files are tightly clustered around 436 KB, which is almost entirely Parquet metadata/overhead given there are only 1-3 records per file.

❯ commit showpartitions
⏺ Which commit instant? Here are the available ones: [list omitted]

❯ 20260422001326843
⏺ Commit 20260422001326843: Partition Details, Runtime Stats [tables omitted]

❯ commit show_write_stats
⏺ Which commit instant?

❯ 20260422001330433
⏺ Write Stats for Commit 20260422001330433 (replacecommit / CLUSTER): Per-File Write Stats, Replaced File Groups, Runtime Stats, Totals [tables omitted]
  All records were written as inserts into new file groups; this is expected for clustering, which rewrites data from old file groups into new ones.
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the docs update! The README clearly explains the MCP server's purpose, tool inventory, safety tiers, and client configuration. A few specifics may need verification: the example JAR versions pair hudi-spark3.5-bundle with 0.14.1 (Spark 3.5 support did not land until a later Hudi release), pytest isn't listed in requirements.txt even though the Running Tests section uses it, the supported Java versions may need to be broadened, and the Claude Code config path/location could be clarified. Please have a Hudi committer or PMC member verify the version examples and finalize the review.
> | Variable | Description | Example |
> |----------|-------------|---------|
> | `SPARK_HOME` | Path to your Spark installation | `/opt/spark-3.5.3-bin-hadoop3` |
🤖 The example pairs hudi-spark3.5-bundle_2.12-0.14.1.jar with Hudi 0.14.1, but Spark 3.5 support wasn't added until a later Hudi release (0.15.x/1.x) — this artifact coordinate likely doesn't exist on Maven Central. Could we point to a version combination that actually ships a Spark 3.5 bundle (or show a Spark 3.4 bundle with 0.14.1)? This appears in multiple places (the env-variable table, the Running the Server section, and the Claude Desktop / Claude Code client configs).
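To make the pairing concrete, a small helper along these lines could accompany the docs. The Spark-line-to-bundle mapping below is an assumption to be verified against Maven Central; the function name is invented for illustration.

```shell
# Hypothetical helper mapping a Spark line to its Hudi bundle artifact id.
# Artifact names follow the hudi-sparkX.Y-bundle_2.12 convention; verify the
# exact coordinates on Maven Central before documenting them.
hudi_bundle_for_spark() {
  case "$1" in
    3.3) echo "hudi-spark3.3-bundle_2.12" ;;
    3.4) echo "hudi-spark3.4-bundle_2.12" ;;  # available for Hudi 0.14.x
    3.5) echo "hudi-spark3.5-bundle_2.12" ;;  # needs a later Hudi release (0.15.x/1.x)
    *)   echo "unsupported Spark line: $1" >&2; return 1 ;;
  esac
}

hudi_bundle_for_spark 3.5   # prints hudi-spark3.5-bundle_2.12
```

A README table generated from one source of truth like this would keep the env-variable table, the Running the Server section, and the client configs consistent.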
> ### Environment Variables
>
> The following environment variables must be set before starting the server:
🤖 Is Java 8 or 11 still the intended supported range? Recent Hudi releases also support Java 17, and since this tool will commonly be used against current Hudi installations, it may help to list Java 17 as well (or explicitly state which Hudi versions this is tested against).
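If the supported range is widened, the launcher could guard it up front. This sketch is illustrative only: the helper names and the 8/11/17 allow-list are assumptions, and it parses a version string rather than invoking `java -version` itself.

```shell
# Extract the Java feature version from a version string:
# "1.8.0_392" -> 8 (legacy 1.x scheme), "17.0.9" -> 17.
java_feature_version() {
  v="$1"
  case "$v" in
    1.*) echo "${v#1.}" | cut -d. -f1 ;;
    *)   echo "$v" | cut -d. -f1 ;;
  esac
}

# Assumed supported set; adjust once the tested Hudi/Java matrix is confirmed.
is_supported_java() {
  f=$(java_feature_version "$1")
  case "$f" in 8|11|17) return 0 ;; *) return 1 ;; esac
}
```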
> ```bash
> cd scripts/hudi-cli-mcp
> python -m venv venv
> ```
🤖 This section states the only dependency is fastmcp>=2.0.0, but the Running Tests section below uses python -m pytest. It would help to either add pytest to requirements.txt (or a requirements-dev.txt) and mention it here, or call out that contributors need to pip install pytest separately before running tests.
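One hedged way to close the gap, assuming a separate requirements-dev.txt (a filename this sketch invents; the PR does not ship one):

```shell
# Sketch only: write test-only dependencies to a dev requirements file.
# Using mktemp here so the sketch is self-contained; in the repo this would
# be scripts/hudi-cli-mcp/requirements-dev.txt.
req_dev=$(mktemp)
printf 'pytest>=7.0\n' > "$req_dev"
cat "$req_dev"   # prints pytest>=7.0

# Inside the venv, contributors would then run (not executed here):
#   pip install -r requirements.txt -r requirements-dev.txt
#   python -m pytest
```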
> "command": "/path/to/hudi/scripts/hudi-cli-mcp/venv/bin/python",
> "args": ["/path/to/hudi/scripts/hudi-cli-mcp/server.py"],
> "env": {
>   "SPARK_HOME": "/path/to/spark",
🤖 Could we confirm that ~/.claude/mcp.json is the correct configuration location for Claude Code? Claude Code typically manages MCP servers via the claude mcp add CLI (writing to ~/.claude.json or a project-scoped .mcp.json), so a user who creates ~/.claude/mcp.json manually may find it isn't picked up. Linking to the official Claude Code MCP docs, or showing the claude mcp add command, would be more reliable.
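If hand-editing ~/.claude/mcp.json turns out not to be picked up, CLI registration is the alternative. This sketch only prints the candidate command rather than running it; the paths are the README's placeholders, and the `claude mcp add` flags shown should be checked against the official Claude Code docs.

```shell
# Build (but do not execute) a registration command for the MCP server.
VENV_PY=/path/to/hudi/scripts/hudi-cli-mcp/venv/bin/python
SERVER=/path/to/hudi/scripts/hudi-cli-mcp/server.py
cmd="claude mcp add hudi-cli -e SPARK_HOME=/path/to/spark -- $VENV_PY $SERVER"
echo "$cmd"
```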
> ## Overview
>
> The server wraps the Hudi CLI (`hudi-cli`) in a structured MCP interface with:
🤖 Minor: the README refers to the Hudi CLI as hudi-cli, but the launcher in the Hudi repo is hudi-cli.sh (under hudi-cli/). A one-line note on how the server locates/invokes the CLI binary (env var? PATH lookup? bundled bundles?) would help operators who don't already know the layout.
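For illustration, a resolution order such a note could document might look like this. The HUDI_CLI_PATH variable and the fallback to a PATH lookup of hudi-cli.sh are assumptions, not the PR's actual mechanism.

```shell
# Hedged sketch: locate the Hudi CLI launcher, preferring an explicit env
# var, then falling back to whatever hudi-cli.sh is on PATH.
find_hudi_cli() {
  if [ -n "${HUDI_CLI_PATH:-}" ] && [ -x "$HUDI_CLI_PATH" ]; then
    echo "$HUDI_CLI_PATH"
  elif command -v hudi-cli.sh >/dev/null 2>&1; then
    command -v hudi-cli.sh
  else
    echo "hudi-cli launcher not found" >&2
    return 1
  fi
}
```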
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #18554 +/- ##
============================================
- Coverage 68.85% 66.97% -1.88%
+ Complexity 28473 27765 -708
============================================
Files 2475 2475
Lines 136672 136672
Branches 16631 16631
============================================
- Hits 94110 91542 -2568
- Misses 34999 37673 +2674
+ Partials 7563 7457 -106
Describe the issue this Pull Request addresses
Adds an MCP (Model Context Protocol) server for the Hudi CLI, enabling AI assistants (Claude, ChatGPT, etc.) to inspect, diagnose, and operate on Hudi tables through natural language.
This bridges the gap between Hudi's powerful CLI tooling and the growing ecosystem of AI-powered developer tools.
Summary and Changelog
Introduces scripts/hudi-cli-mcp, a Python-based MCP server that wraps the Hudi CLI in a structured interface with 40+ tools.
No existing code was modified. All new code is under scripts/hudi-cli-mcp/.
Impact
No impact on existing Hudi code, APIs, or storage format. This is a standalone Python tool under scripts/ that invokes the Hudi CLI as a subprocess.
Risk Level
none
This is a new, additive script with no changes to existing Hudi modules. It shells out to the existing hudi-cli binary and does not modify any table state without explicit user
confirmation through the safety protocol.
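The safety protocol mentioned above can be pictured as a simple gate on mutating commands; the classification below is illustrative only, not the PR's actual tier list.

```shell
# Illustrative safety gate: return 0 (needs confirmation) for commands that
# mutate table state, 1 for read-only inspection. Which commands fall in
# each tier here is an assumption for the sketch.
requires_confirmation() {
  case "$1" in
    "commit rollback"*|"savepoint delete"*|"clean run"*) return 0 ;;  # mutating
    *) return 1 ;;  # read-only commands pass straight through to the CLI
  esac
}
```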
Documentation Update
A detailed README.md is included in scripts/hudi-cli-mcp/ covering:
A follow-up update to the Hudi website can be done to add this to the CLI/tooling documentation.
Contributor's checklist