docs: Update documentation for new features in Hudi 1.2.0 by yihua · Pull Request #18867 · apache/hudi

yihua · 2026-05-28T00:07:25Z

Describe the issue this Pull Request addresses

Updates the website documentation under website/docs/ to cover features new in Hudi 1.2.0. Release notes and the auto-generated config reference (configurations.md / basic_configurations.md) are tracked separately and are not in scope here.

Summary and Changelog

New documentation for major 1.2.0 capabilities: Lance file format, VARIANT / VECTOR / BLOB data types, vector search, Flink Source V2, RLI-based Flink bucket indexing, Azure storage-based lock provider, pre-write validators, the JSON Kinesis source, show_timeline procedure, new CLI commands, and additional metrics.
Corrections and tightening across the existing AI / Lance / BLOB pages (config keys, defaults, engine-support claims).
Updates to Flink, cleaning, clustering, compaction, metadata, indexing, sync, validator, and reading/writing pages so each new capability lives in its natural place with a light inline marker rather than a dedicated "new in 1.2.0" section.

Impact

Documentation-only. No code or build changes.

Risk Level

low

Documentation Update

This PR is the documentation update for 1.2.0 features. The auto-generated config reference and the 1.2.0 release notes are tracked separately.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable — N/A (documentation-only)

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the docs update! This PR refreshes the documentation to cover a broad set of Hudi 1.2.0 features — Spark session-level configs, incremental column pruning, the vector search TVF, metastore sync improvements, VARIANT/VECTOR engine support matrices, rolling metadata, and various advanced storage and concurrency knobs — and corrects a number of existing AI-feature pages. The main concerns are a handful of factual inconsistencies across pages (engine/version support claims, config renames, default-value flips, cross-page description mismatches) plus a few invalid example snippets and a likely typo worth resolving before merge. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

…cy_control.md The HoodiePreWriteCleanerPolicy enum's CLEAN value is documented in source as: 'Force a CLEAN table service call before starting the write (also performs rollback of failed writes).' BaseHoodieTableServiceClient #clean also invokes rollbackFailedWrites before the clean itself. Update the cleaning.md description so it matches the existing wording in concurrency_control.md: CLEAN also rolls back failed writes; ROLLBACK_FAILED_WRITES is the rollback-only variant. Addresses review comment on PR apache#18867.

yihua · 2026-05-28T04:38:03Z

@danny0405 and @cshuo could you take another pass on the Flink docs update, particularly in website/docs/flink-quick-start-guide.md, website/docs/flink_tuning.md, website/docs/ingestion_flink.md, website/docs/indexes.md, website/docs/precommit_validator.md, website/docs/reading_tables_batch_reads.md, website/docs/reading_tables_streaming_reads.md, website/docs/sql_ddl.md, website/docs/variant_type.md, website/docs/writing_data.md?

The Required Dependencies table listed the bare connector artifact (org.lance:lance-spark-<spark>_<scala>) while the shell example below referenced the shaded bundle JAR (lance-spark-bundle-3.5_2.12-0.4.0.jar). Lance publishes both — the bare connector is what Hudi declares as a Maven dependency internally, while the -bundle- variant is the shaded uber-JAR meant for spark-shell/spark-submit --jars. Align the table to the bundle artifacts (matching the shell example and what users actually need on a Spark classpath), and add a one-paragraph note explaining the connector vs. bundle distinction. Addresses review comment on PR apache#18867.

yihua · 2026-05-28T05:37:53Z

+- **`VECTOR(dim[, elementType])`** — stores fixed-dimension embedding vectors (e.g. `VECTOR(768)`,
+  `VECTOR(768, FLOAT)`, `VECTOR(768, DOUBLE)`). Enables approximate nearest-neighbor search via
+  the `hudi_vector_search` TVF. See [Vector Search](vector_search.md) for full details.
+
+- **`BLOB`** — stores arbitrary binary objects (images, audio, documents) either inline within the
+  base file or as external references. See [BLOB / Unstructured Data](blob_unstructured_data.md)
+  for the storage modes, DDL syntax, and read APIs.


We'll have a separate PR to move the new type support docs here.

syncing_metastore.md: Address open questions about the HMS 4.x JDBC fallback by stating the detection scope (per HoodieHiveSyncClient instance, flag is monotonic), the no-fallback path (mode=hms / hiveql hitting HMS 4.x logs an error and surfaces the original exception), and how JDBC connection failures surface (eagerly at sync-client construction time, wrapped as 'Failed to create HiveMetaStoreClient' with the JDBC exception as cause).

variant_type.md: Drop 'fixed in 1.2.0' from the Spark 4.1 row and restate both Spark 4.0 and 4.1 rows in terms of what is supported today (full read/write/query for COW and MOR) rather than the historical fix narrative.

Addresses review comment on PR apache#18867.

nsivabalan

LGTM.

- Fix wrong config key: hoodie.datasource.write.base.file.format does not exist; correct key is hoodie.table.base.file.format
- Correct indexing claim: column stats and partition stats are auto-disabled on Lance; only bloom filters are supported
- Add Engine Support / Limitations: Spark-only (Flink/Hive throw), VARIANT columns rejected at write, files are non-splittable
- Add File Sizing and Memory configs (max.file.size, allocator.size.bytes, flush.byte.watermark)
- Add Schema Evolution section (add-column, type promotion; FLOAT->DOUBLE and FLOAT->STRING not supported)
- Add MOR (Lance base + Avro log) example
- Add Spark 4.0 and 4.1 dependency rows; clarify Lance JAR is not bundled
- Note VECTOR columns round-trip as native Lance FixedSizeList enabling IVF-PQ; BLOB inline columns default to DESCRIPTOR on Lance

- Fix wrong default for hoodie.read.blob.inline.mode (DESCRIPTOR, not CONTENT) and rewrite the (previously backwards) description
- Add 'managed: boolean' to all BLOB reference struct examples (INLINE CAST, OUT_OF_LINE named_struct, DataFrame schema), since it is a required field
- Add caution that read_blob() on INLINE columns under the default DESCRIPTOR mode returns a descriptor; set inline.mode=CONTENT to materialize bytes
- Add Metastore Sync section: BLOB maps to STRUCT in Hive and BigQuery

Remove 'New in 1.2.0' and '(1.2.0)' section wrappers. Each feature now lives in the thematically appropriate section of its page, with a light inline marker (e.g. 'Available since Hudi 1.2.0' or 'Hudi 1.2.0 introduced ...'). This keeps the docs accurate as future minor releases ship.

cleaning.md: Empty clean, capping commits-per-run, driver-side planning, MDT cleaner derivation, pre-write cleaner policy, and full-clean partition filtering merged into the existing retention-policy, configs, and 'Ways to trigger Cleaning' sections.

clustering.md: CommitBasedClusteringPlanStrategy, SparkStreamCopyClusteringPlanStrategy, single-group control, file-slice sort order, and driver-side plan generation merged into the existing Plan Strategy section. Failed-plan expiration merged into the HoodieClusteringJob section.

compaction.md: MDT compaction trigger strategy and external-platform delegation merged into the Strategies in Compaction Scheduling section. Log compaction blocks threshold placed alongside the existing Flink Offline Compaction options.

metadata.md: Auto-delete of disabled MDT partitions merged into Metadata Tracking on Writers. New 'MDT Cleaner and Compaction' section after Concurrency Control. New 'Timeline Archival Controls' section before Related Resources.

metadata_indexing.md: RLI config-key renames and additional configs (defer.init, max.filegroup.size) merged into the existing Configurations section. Auto-delete cross-reference inlined into Drop Index.

metrics.md: Archival, rollback / post-commit / duration, per-table registry isolation, and log-block compaction sections promoted to top-level peers of the existing metrics sections.

precommit_validator.md: '(1.2.0)' stripped from three section headings; inline version markers added.

concurrency_control.md: '(1.2.0)' stripped from Exclusive Rollbacks section heading; opening sentence rephrased with inline marker.

writing_data.md: Rephrase 'available in 1.2.0' to 'were added in Hudi 1.2.0'.

Remove the top-level Hudi 1.2.0 prerequisites callout (which would go stale as future releases ship). Replace with a per-Spark-version bundle picker note next to the Hudi Streamer Spark bundle reference, listing the hudi-spark<X>-bundle artifacts and the Java runtime requirement that follows from the Spark version.

Remove the version-labeled '1.2.0 New Features' bullet from the 'Where To Go From Here?' section so it doesn't go stale. Each feature now lives in its natural topical bullet:
- Append write buffer modes and RLI bucket indexing -> Writing Data
- Lookup join (with RocksDB cache) -> Reading Data
- Managed-Memory Write Buffer -> Tuning

Flink Source V2 was already linked from the Reading Data bullet.

Rename the section heading to '## Flink Source V2'. The RFC-95 reference now lives as an inline link to the RFC doc in the opening sentence of the section. Update all cross-page anchors from #flink-source-v2-rfc-95 to #flink-source-v2 in:
- flink-quick-start-guide.md
- reading_tables_batch_reads.md
- reading_tables_streaming_reads.md (2 anchors)
- flink_tuning.md
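For anyone repeating this kind of anchor rename, the substitution can be sketched in a few lines of Python. This is only an illustration, not the tooling used in this PR: the helper name `rewrite_anchor` and the sample link target `some_page.md` are hypothetical, and only the two anchor strings come from the commit message above.

```python
import re

# The two anchors come from the commit message; everything else is illustrative.
OLD_ANCHOR = "#flink-source-v2-rfc-95"
NEW_ANCHOR = "#flink-source-v2"

# Match the old anchor only when it is not followed by further slug characters,
# so a longer, unrelated anchor that merely starts with it is left alone.
_PATTERN = re.compile(re.escape(OLD_ANCHOR) + r"(?![\w-])")


def rewrite_anchor(markdown: str) -> tuple[str, int]:
    """Return the rewritten text and the number of anchors that were replaced."""
    return _PATTERN.subn(NEW_ANCHOR, markdown)


if __name__ == "__main__":
    # "some_page.md" is a made-up link target used only for this demo.
    sample = "See [Flink Source V2](some_page.md#flink-source-v2-rfc-95)."
    updated, count = rewrite_anchor(sample)
    print(count, updated)
```

Returning the replacement count makes it easy to confirm that a page with two anchors, such as reading_tables_streaming_reads.md, was fully updated.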

- Open with what the source does and what users get out of the box (parallel per-shard reads, auto handling of shard splits and merges, KPL de-aggregation) instead of an implementation-detail intro.
- Trim the configuration table to the six keys most users actually set (stream.name, region, starting.position, max.events, append.offsets, partitions). Point to the configurations reference for credential, endpoint, retry, and rate-limit tuning.
- Drop hoodie.streamer.source.kinesis.persist.fetch.rdd from the config table and the example, and drop the caution callout.
- Add a Checkpoint format section documenting the streamName,shardId:value,shardId:value,... encoding and the four value variants (lastSeq, lastSeq@arrivalTime, lastSeq|endSeq, lastSeq@arrivalTime|endSeq), which is useful for debugging and manual checkpoint resets.
- Rename the section heading from 'Amazon Kinesis Source (JsonKinesisSource)' to 'Amazon Kinesis Source'; the class name stays inline in the opening sentence.

The Kinesis source belongs alongside the other Hudi Streamer source types (DFS, Kafka, Pulsar, S3/GCS events, JDBC, SQL). Move it from a standalone top-level section to a #### Amazon Kinesis subsection under ### Sources, placed between Pulsar and the cloud storage event sources. Its inner headings demote from ### to ##### accordingly.

Kinesis sequence numbers are 56-digit decimal strings (128-bit integers) in practice, which made the inline example wrap awkwardly and obscured the checkpoint structure. Abbreviate with ellipsis, move the example into a fenced code block, and call out the actual sequence-number length so users still know what to expect.
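The checkpoint encoding described in these commits (streamName,shardId:value,shardId:value,... with the four value variants) can be decoded with a small parser. The following is a minimal sketch for debugging purposes, not Hudi's own parsing code: the names `ShardCheckpoint` and `parse_checkpoint` are hypothetical, and it assumes that stream names contain no commas and shard IDs contain no colons. Sequence numbers and arrival times are kept as strings, since their exact types are not stated here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ShardCheckpoint:
    """One shard's value: lastSeq[@arrivalTime][|endSeq] (hypothetical model)."""
    last_seq: str
    arrival_time: Optional[str] = None
    end_seq: Optional[str] = None


def parse_checkpoint(checkpoint: str) -> Tuple[str, Dict[str, ShardCheckpoint]]:
    """Split 'streamName,shardId:value,...' into the stream name and per-shard values."""
    stream_name, *entries = checkpoint.split(",")
    shards: Dict[str, ShardCheckpoint] = {}
    for entry in entries:
        shard_id, _, value = entry.partition(":")
        if not shard_id or not value:
            raise ValueError(f"malformed shard entry: {entry!r}")
        # An optional '|endSeq' suffix comes last; an optional '@arrivalTime' precedes it.
        head, _, end_seq = value.partition("|")
        last_seq, _, arrival_time = head.partition("@")
        shards[shard_id] = ShardCheckpoint(last_seq, arrival_time or None, end_seq or None)
    return stream_name, shards


if __name__ == "__main__":
    # Shortened, made-up values; real sequence numbers are far longer.
    name, shards = parse_checkpoint("my-stream,shardId-000000000000:111@1700000000000|222")
    print(name, shards)
```

Keeping sequence numbers as strings avoids any precision concerns with the very long decimal values mentioned above.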

Replace the historical narrative (renamed keys, alias table, 'not previously documented' note) with a single table listing the current canonical configs for both global and partitioned RLI. Users configuring a new table only need to know what to set today; the rename trivia was a source of confusion.

Collapse the four separate metric tables (List of metrics, Archival Metrics, Rollback/Post-commit/Table Service Duration Metrics, Log-Block Compaction Gauges) into a single 'List of metrics' table so readers can find every emitted metric in one place. Folded the Per-Table MetricRegistry Isolation note into a single paragraph after the table.

Revert website/docs/quick-start-guide.md to its asf-site state; the Spark version matrix and Java-requirements note belong in pages closer to deployment / install rather than on the Spark quick-start guide.

Drop the ReverseOrderHoodieRecordPayload subsection and the DefaultHoodieRecordPayload sentinel no-op note; these payload classes should not be documented here.

The section opened with 'Flink 1.2.0 supports declaring Hudi metadata fields...' — but Flink itself does not have a 1.2.0 in this version range; the feature is a Hudi capability. Rephrase to state the capability directly without a version qualifier, matching the docs-style convention used elsewhere in this PR (state what's supported, not when it shipped).

…okup join, LIMIT push-down
- Flink support matrix and download/Maven snippets: default build for Hudi 1.2.x is Flink 1.20, not 2.1 (verified pom.xml on release-1.2.0).
- RLI bootstrap: clarify that setting index.bootstrap.enabled=false after bootstrap is optional; keeping the bootstrap operators is harmless.
- VARIANT on Flink: native VARIANT is not supported on any Flink version in 1.2.0 (Flink 2.1.x adapter still throws 'Variant is not supported yet'). Replace the Flink 2.1+ row with the accurate statement that VARIANT surfaces as ROW<metadata BYTES, value BYTES> on Flink.
- Lookup join example: use a processing-time attribute (PROCTIME()) and FOR SYSTEM_TIME AS OF o.proc_time, matching the Hudi lookup join tests.
- Reading tables batch reads: drop the LIMIT push-down subsection. Legacy source also supports limit push-down, and the implementation enforces the limit in the source reader via RecordLimiter, not in split enumeration as the previous wording claimed.

…IPTOR mode
- managed field: clarify it only applies to OUT_OF_LINE blobs and that no cleaner consumes it yet (it's an intent flag for a future managed-blob cleaner). Reflect this in the field reference and the inline example comment.
- read_blob() under DESCRIPTOR mode: replace 'returns a descriptor reference' with the actual behavior: calling read_blob() on an INLINE column throws, since the bytes are not materialized in the scan.

Hudi's Lance integration is at the file-format level only; it does not leverage Lance's table-format-level vector index. The hudi_vector_search TVF runs a brute-force scan; it does not consume any Lance index. Scrub all claims to the contrary so the docs match what is actually shipped.

lance_file_format.md:
- Drop 'ANN indexing' from the page summary.
- Remove 'IVF-PQ vector index' from the architecture diagram.
- Rewrite the VECTOR Storage on Lance section: keep the true claim (VECTOR round-trips as Lance FixedSizeList, no conversion overhead at the file-format layer); drop the 'unlocks Lance's built-in IVF-PQ ANN index' sentence and the claim that the TVF uses any Lance index.
- Rewrite the Vector Search with Lance section to describe what the TVF does (returns the top_k closest rows) without mentioning a specific algorithm or Lance index.

vector_search.md:
- Drop 'ANN' from keywords and summary; describe the TVF as 'vector similarity search' instead of 'approximate nearest neighbor search'.
- Drop the 'enables Lance's built-in IVF-PQ ANN index' claim from the Storage Format section.
- Rephrase the TVF description to state what it returns (top_k closest rows under a distance metric) rather than naming an algorithm.

ai_overview.md and overview.mdx (pre-existing pages but with the same overstatements on this topic):
- Replace 'approximate nearest neighbor (ANN) search' with 'vector similarity search'.
- Replace 'Vector-optimized columnar format with built-in ANN indexing' with 'Vector-friendly columnar format for AI/ML workloads'.
- Replace the 'Efficient vector indexing and ANN search' Lance bullet with the accurate file-format-level benefit (native vector column encoding, no conversion overhead).
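To make the distinction concrete, here is a minimal sketch of what a brute-force top_k similarity scan means in general: every candidate vector is scored against the query and the k closest rows are kept, with no index involved. This is not Hudi's implementation of the TVF. The function names are hypothetical, and L2 distance is an assumed choice of metric for the illustration.

```python
import heapq
import math
from typing import Hashable, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top_k(
    rows: Sequence[Tuple[Hashable, Sequence[float]]],
    query: Sequence[float],
    top_k: int,
) -> List[Tuple[Hashable, float]]:
    """Score every row against the query and return the top_k closest (row_id, distance)."""
    scored = ((row_id, l2_distance(vector, query)) for row_id, vector in rows)
    # nsmallest keeps only top_k candidates in memory while scanning all rows.
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    table = [("a", [0.0, 0.0]), ("b", [3.0, 4.0]), ("c", [1.0, 0.0])]
    print(brute_force_top_k(table, [0.0, 0.0], top_k=2))
```

The cost grows linearly with the number of rows scanned, which is the practical difference from an index-backed approximate search.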

github-actions bot added the docs and size:XL (PR with lines of changes > 1000) labels May 28, 2026

hudi-agent reviewed May 28, 2026

rahil-c requested review from rahil-c and voonhous May 28, 2026 00:32

yihua commented May 28, 2026

Comment thread website/docs/blob_unstructured_data.md Outdated

yihua commented May 28, 2026

Comment thread website/docs/blob_unstructured_data.md

Comment thread website/docs/blob_unstructured_data.md

Comment thread website/docs/blob_unstructured_data.md

yihua commented May 28, 2026

Comment thread website/docs/cleaning.md Outdated

yihua commented May 28, 2026

Comment thread website/docs/cleaning.md

yihua force-pushed the docs-1.2.0-feature-updates branch from 1e05663 to 71c096f Compare May 28, 2026 04:06

danny0405 reviewed May 28, 2026

Comment thread website/docs/flink-quick-start-guide.md Outdated

cshuo reviewed May 28, 2026

Comment thread website/docs/flink-quick-start-guide.md Outdated

danny0405 reviewed May 28, 2026

Comment thread website/docs/ingestion_flink.md Outdated

danny0405 reviewed May 28, 2026

Comment thread website/docs/variant_type.md Outdated

cshuo reviewed May 28, 2026

Comment thread website/docs/ingestion_flink.md Outdated

cshuo reviewed May 28, 2026

Comment thread website/docs/reading_tables_batch_reads.md Outdated

yihua commented May 28, 2026

yihua mentioned this pull request May 28, 2026

docs: add flink RLI related configurations #18869

Merged

3 tasks

nsivabalan reviewed May 28, 2026

Comment thread website/docs/clustering.md Outdated

Comment thread website/docs/ingestion_flink.md

voonhous reviewed May 28, 2026

Comment thread website/docs/blob_unstructured_data.md Outdated

Comment thread website/docs/blob_unstructured_data.md Outdated

Comment thread website/docs/blob_unstructured_data.md

linliu-code reviewed May 28, 2026

Comment thread website/docs/hoodie_streaming_ingestion.md Outdated

linliu-code reviewed May 28, 2026

Comment thread website/docs/hoodie_streaming_ingestion.md

rahil-c reviewed May 28, 2026

Comment thread website/docs/lance_file_format.md Outdated

rahil-c reviewed May 28, 2026

Comment thread website/docs/lance_file_format.md Outdated

yihua added 2 commits May 28, 2026 09:17

yihua added 27 commits May 28, 2026 09:19

Update cleaning docs

5fb34dc

Update CLI page

b18c20b

Remove redundant docs in compaction page

50d8b7e

Update concurrency control page

e74c950

Update concurrency control docs

cf6aa07

Remove redundant config

375d5a1

Remove unnecessary configs

f672b76

Remove redundant docs

9dfb455

Revert record_merger.md changes

ae20f1e

Remove redundant docs

a387aed

Remove redundant docs

5562c40

yihua force-pushed the docs-1.2.0-feature-updates branch from 5ca9ca1 to c2b54db Compare May 28, 2026 16:19

Fix rebasing

80c8fa8

yihua merged commit e3a28fb into apache:asf-site May 28, 2026
1 check passed

8 participants
