docs: Update documentation for new features in Hudi 1.2.0 #18867
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the docs update! This PR refreshes the documentation to cover a broad set of Hudi 1.2.0 features — Spark session-level configs, incremental column pruning, the vector search TVF, metastore sync improvements, VARIANT/VECTOR engine support matrices, rolling metadata, and various advanced storage and concurrency knobs — and corrects a number of existing AI-feature pages. The main concerns are a handful of factual inconsistencies across pages (engine/version support claims, config renames, default-value flips, cross-page description mismatches) plus a few invalid example snippets and a likely typo worth resolving before merge. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
1e05663 to 71c096f
…cy_control.md The HoodiePreWriteCleanerPolicy enum's CLEAN value is documented in source as: 'Force a CLEAN table service call before starting the write (also performs rollback of failed writes).' BaseHoodieTableServiceClient #clean also invokes rollbackFailedWrites before the clean itself. Update the cleaning.md description so it matches the existing wording in concurrency_control.md: CLEAN also rolls back failed writes; ROLLBACK_FAILED_WRITES is the rollback-only variant. Addresses review comment on PR apache#18867.
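The distinction between the two policy values can be sketched as a toy model. This is not Hudi code: the enum below lists only the two values named in this commit message, and `pre_write_steps` and its step labels are hypothetical names used for illustration.

```python
from enum import Enum
from typing import List


class PreWriteCleanerPolicy(Enum):
    # Only the two values discussed in the commit message are modeled here.
    CLEAN = "CLEAN"
    ROLLBACK_FAILED_WRITES = "ROLLBACK_FAILED_WRITES"


def pre_write_steps(policy: PreWriteCleanerPolicy) -> List[str]:
    """Return the (illustrative) steps that run before the write starts."""
    # Both policies roll back failed writes first.
    steps = ["rollback_failed_writes"]
    # CLEAN additionally runs the clean table service after the rollback.
    if policy is PreWriteCleanerPolicy.CLEAN:
        steps.append("clean")
    return steps
```

Under this model, CLEAN is a superset of ROLLBACK_FAILED_WRITES, which is the wording both pages are meant to share.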
@danny0405 and @cshuo could you take another pass on the Flink docs update, particularly in |
The Required Dependencies table listed the bare connector artifact (org.lance:lance-spark-<spark>_<scala>) while the shell example below referenced the shaded bundle JAR (lance-spark-bundle-3.5_2.12-0.4.0.jar). Lance publishes both — the bare connector is what Hudi declares as a Maven dependency internally, while the -bundle- variant is the shaded uber-JAR meant for spark-shell/spark-submit --jars. Align the table to the bundle artifacts (matching the shell example and what users actually need on a Spark classpath), and add a one-paragraph note explaining the connector vs. bundle distinction. Addresses review comment on PR apache#18867.
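To make the connector vs. bundle naming concrete, here is a small sketch that builds both names from their parts. The patterns are inferred only from the two strings quoted in this commit message; whether every Spark/Scala/Lance combination is actually published under these names is an assumption, and both helper functions are hypothetical.

```python
def lance_connector_coordinate(spark: str, scala: str) -> str:
    # Bare connector, per the pattern org.lance:lance-spark-<spark>_<scala>.
    return f"org.lance:lance-spark-{spark}_{scala}"


def lance_bundle_jar(spark: str, scala: str, version: str) -> str:
    # Shaded bundle JAR name, following the example
    # lance-spark-bundle-3.5_2.12-0.4.0.jar from the shell snippet.
    return f"lance-spark-bundle-{spark}_{scala}-{version}.jar"
```

For example, `lance_bundle_jar("3.5", "2.12", "0.4.0")` reproduces the file name used in the shell example, which is the artifact passed to `--jars`.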
- **`VECTOR(dim[, elementType])`**: stores fixed-dimension embedding vectors (e.g. `VECTOR(768)`,
  `VECTOR(768, FLOAT)`, `VECTOR(768, DOUBLE)`). Enables approximate nearest-neighbor search via
  the `hudi_vector_search` TVF. See [Vector Search](vector_search.md) for full details.

- **`BLOB`**: stores arbitrary binary objects (images, audio, documents) either inline within the
  base file or as external references. See [BLOB / Unstructured Data](blob_unstructured_data.md)
  for the storage modes, DDL syntax, and read APIs.
We'll have a separate PR to move the new type support docs here.
syncing_metastore.md: Address open questions about the HMS 4.x JDBC fallback by stating the detection scope (per HoodieHiveSyncClient instance, flag is monotonic), the no-fallback path (mode=hms / hiveql hitting HMS 4.x logs an error and surfaces the original exception), and how JDBC connection failures surface (eagerly at sync-client construction time, wrapped as 'Failed to create HiveMetaStoreClient' with the JDBC exception as cause). variant_type.md: Drop 'fixed in 1.2.0' from the Spark 4.1 row and restate both Spark 4.0 and 4.1 rows in terms of what is supported today — full read/write/query for COW and MOR — rather than the historical fix narrative. Addresses review comment on PR apache#18867.
- Fix wrong config key: hoodie.datasource.write.base.file.format does not exist; correct key is hoodie.table.base.file.format
- Correct indexing claim: column stats and partition stats are auto-disabled on Lance; only bloom filters are supported
- Add Engine Support / Limitations: Spark-only (Flink/Hive throw), VARIANT columns rejected at write, files are non-splittable
- Add File Sizing and Memory configs (max.file.size, allocator.size.bytes, flush.byte.watermark)
- Add Schema Evolution section (add-column, type promotion; FLOAT->DOUBLE and FLOAT->STRING not supported)
- Add MOR (Lance base + Avro log) example
- Add Spark 4.0 and 4.1 dependency rows; clarify Lance JAR is not bundled
- Note VECTOR columns round-trip as native Lance FixedSizeList enabling IVF-PQ; BLOB inline columns default to DESCRIPTOR on Lance
- Fix wrong default for hoodie.read.blob.inline.mode (DESCRIPTOR, not CONTENT) and rewrite the (previously backwards) description
- Add 'managed: boolean' to all BLOB reference struct examples (INLINE CAST, OUT_OF_LINE named_struct, DataFrame schema); it is a required field
- Add caution that read_blob() on INLINE columns under the default DESCRIPTOR mode returns a descriptor; set inline.mode=CONTENT to materialize bytes
- Add Metastore Sync section: BLOB maps to STRUCT in Hive and BigQuery
Remove 'New in 1.2.0' and '(1.2.0)' section wrappers. Each feature now lives in the thematically appropriate section of its page, with a light inline marker (e.g. 'Available since Hudi 1.2.0' or 'Hudi 1.2.0 introduced ...'). This keeps the docs accurate as future minor releases ship.
cleaning.md: Empty clean, capping commits-per-run, driver-side planning, MDT cleaner derivation, pre-write cleaner policy, and full-clean partition filtering merged into the existing retention-policy, configs, and 'Ways to trigger Cleaning' sections.
clustering.md: CommitBasedClusteringPlanStrategy, SparkStreamCopyClusteringPlanStrategy, single-group control, file-slice sort order, and driver-side plan generation merged into the existing Plan Strategy section. Failed-plan expiration merged into the HoodieClusteringJob section.
compaction.md: MDT compaction trigger strategy and external-platform delegation merged into the Strategies in Compaction Scheduling section. Log compaction blocks threshold placed alongside the existing Flink Offline Compaction options.
metadata.md: Auto-delete of disabled MDT partitions merged into Metadata Tracking on Writers. New 'MDT Cleaner and Compaction' section after Concurrency Control. New 'Timeline Archival Controls' section before Related Resources.
metadata_indexing.md: RLI config-key renames and additional configs (defer.init, max.filegroup.size) merged into the existing Configurations section. Auto-delete cross-reference inlined into Drop Index.
metrics.md: Archival, rollback / post-commit / duration, per-table registry isolation, and log-block compaction sections promoted to top-level peers of the existing metrics sections.
precommit_validator.md: '(1.2.0)' stripped from three section headings; inline version markers added.
concurrency_control.md: '(1.2.0)' stripped from Exclusive Rollbacks section heading; opening sentence rephrased with inline marker.
writing_data.md: Rephrase 'available in 1.2.0' to 'were added in Hudi 1.2.0'.
Remove the top-level Hudi 1.2.0 prerequisites callout (which would go stale as future releases ship). Replace with a per-Spark-version bundle picker note next to the Hudi Streamer Spark bundle reference, listing the hudi-spark<X>-bundle artifacts and the Java runtime requirement that follows from the Spark version.
Remove the version-labeled '1.2.0 New Features' bullet from the 'Where To Go From Here?' section so it doesn't go stale. Each feature now lives in its natural topical bullet:
- Append write buffer modes and RLI bucket indexing -> Writing Data
- Lookup join (with RocksDB cache) -> Reading Data
- Managed-Memory Write Buffer -> Tuning
Flink Source V2 was already linked from the Reading Data bullet.
Rename the section heading to '## Flink Source V2'. The RFC-95 reference now lives as an inline link to the RFC doc in the opening sentence of the section. Update all cross-page anchors from #flink-source-v2-rfc-95 to #flink-source-v2 in:
- flink-quick-start-guide.md
- reading_tables_batch_reads.md
- reading_tables_streaming_reads.md (2 anchors)
- flink_tuning.md
- Open with what the source does and what users get out of the box (parallel per-shard reads, auto handling of shard splits and merges, KPL de-aggregation) instead of an implementation-detail intro.
- Trim the configuration table to the six keys most users actually set (stream.name, region, starting.position, max.events, append.offsets, partitions). Point to the configurations reference for credential, endpoint, retry, and rate-limit tuning.
- Drop hoodie.streamer.source.kinesis.persist.fetch.rdd from the config table and the example, and drop the caution callout.
- Add a Checkpoint format section documenting the streamName,shardId:value,shardId:value,... encoding and the four value variants (lastSeq, lastSeq@arrivalTime, lastSeq|endSeq, lastSeq@arrivalTime|endSeq), which is useful for debugging and manual checkpoint resets.
- Rename the section heading from 'Amazon Kinesis Source (JsonKinesisSource)' to 'Amazon Kinesis Source'; the class name stays inline in the opening sentence.
The Kinesis source belongs alongside the other Hudi Streamer source types (DFS, Kafka, Pulsar, S3/GCS events, JDBC, SQL). Move it from a standalone top-level section to a #### Amazon Kinesis subsection under ### Sources, placed between Pulsar and the cloud storage event sources. Its inner headings demote from ### to ##### accordingly.
Kinesis sequence numbers are 56-digit decimal strings (128-bit integers) in practice, which made the inline example wrap awkwardly and obscured the checkpoint structure. Abbreviate with ellipsis, move the example into a fenced code block, and call out the actual sequence-number length so users still know what to expect.
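For readers debugging a checkpoint by hand, the encoding described in the Kinesis commits above (streamName,shardId:value,... with the four value variants) can be decoded with a short sketch like the one below. It is not the Hudi implementation: `ShardPosition`, `parse_shard_value`, and `parse_checkpoint` are hypothetical names, and the sketch assumes that stream names contain no commas, that shard IDs contain no colons, and that the arrival time contains neither `@` nor `|`.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ShardPosition:
    """One shard's checkpoint value. Sequence numbers stay strings (they are very long)."""
    last_seq: str
    arrival_time: Optional[str] = None  # set only for the "@arrivalTime" variants
    end_seq: Optional[str] = None       # set only for the "|endSeq" variants


def parse_shard_value(value: str) -> ShardPosition:
    # Peel off the optional "|endSeq" suffix first, then the optional "@arrivalTime".
    head, has_end, end_seq = value.partition("|")
    last_seq, has_time, arrival_time = head.partition("@")
    return ShardPosition(
        last_seq=last_seq,
        arrival_time=arrival_time if has_time else None,
        end_seq=end_seq if has_end else None,
    )


def parse_checkpoint(checkpoint: str) -> Tuple[str, Dict[str, ShardPosition]]:
    # Layout: streamName,shardId:value,shardId:value,...
    stream_name, *entries = checkpoint.split(",")
    shards: Dict[str, ShardPosition] = {}
    for entry in entries:
        # Split on the first colon only, so the value is kept intact.
        shard_id, _, value = entry.partition(":")
        shards[shard_id] = parse_shard_value(value)
    return stream_name, shards
```

Keeping sequence numbers as strings avoids any formatting loss when a value is pasted back into a manual checkpoint reset.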
Replace the historical narrative (renamed keys, alias table, 'not previously documented' note) with a single table listing the current canonical configs for both global and partitioned RLI. Users configuring a new table only need to know what to set today; the rename trivia was a source of confusion.
Collapse the four separate metric tables (List of metrics, Archival Metrics, Rollback/Post-commit/Table Service Duration Metrics, Log-Block Compaction Gauges) into a single 'List of metrics' table so readers can find every emitted metric in one place. Folded the Per-Table MetricRegistry Isolation note into a single paragraph after the table. Revert website/docs/quick-start-guide.md to its asf-site state — the Spark version matrix and Java-requirements note belong in pages closer to deployment / install rather than on the Spark quick-start guide.
Drop the ReverseOrderHoodieRecordPayload subsection and the DefaultHoodieRecordPayload sentinel no-op note; these payload classes should not be documented here.
The section opened with 'Flink 1.2.0 supports declaring Hudi metadata fields...' — but Flink itself does not have a 1.2.0 in this version range; the feature is a Hudi capability. Rephrase to state the capability directly without a version qualifier, matching the docs-style convention used elsewhere in this PR (state what's supported, not when it shipped).
…okup join, LIMIT push-down
- Flink support matrix and download/Maven snippets: default build for Hudi 1.2.x is Flink 1.20, not 2.1 (verified pom.xml on release-1.2.0).
- RLI bootstrap: clarify that setting index.bootstrap.enabled=false after bootstrap is optional; keeping the bootstrap operators is harmless.
- VARIANT on Flink: native VARIANT is not supported on any Flink version in 1.2.0 (Flink 2.1.x adapter still throws 'Variant is not supported yet'). Replace the Flink 2.1+ row with the accurate statement that VARIANT surfaces as ROW<metadata BYTES, value BYTES> on Flink.
- Lookup join example: use a processing-time attribute (PROCTIME()) and FOR SYSTEM_TIME AS OF o.proc_time, matching the Hudi lookup join tests.
- Reading tables batch reads: drop the LIMIT push-down subsection. Legacy source also supports limit push-down, and the implementation enforces the limit in the source reader via RecordLimiter, not in split enumeration as the previous wording claimed.
…IPTOR mode
- managed field: clarify it only applies to OUT_OF_LINE blobs and that no cleaner consumes it yet (it's an intent flag for a future managed-blob cleaner). Reflect this in the field reference and the inline example comment.
- read_blob() under DESCRIPTOR mode: replace 'returns a descriptor reference' with the actual behavior: calling read_blob() on an INLINE column throws, since the bytes are not materialized in the scan.
Hudi's Lance integration is at the file-format level only; it does not leverage Lance's table-format-level vector index. The hudi_vector_search TVF runs a brute-force scan; it does not consume any Lance index. Scrub all claims to the contrary so the docs match what is actually shipped.
lance_file_format.md:
- Drop 'ANN indexing' from the page summary.
- Remove 'IVF-PQ vector index' from the architecture diagram.
- Rewrite the VECTOR Storage on Lance section: keep the true claim (VECTOR round-trips as Lance FixedSizeList, no conversion overhead at the file-format layer); drop the 'unlocks Lance's built-in IVF-PQ ANN index' sentence and the claim that the TVF uses any Lance index.
- Rewrite the Vector Search with Lance section to describe what the TVF does (returns the top_k closest rows) without mentioning a specific algorithm or Lance index.
vector_search.md:
- Drop 'ANN' from keywords and summary; describe the TVF as 'vector similarity search' instead of 'approximate nearest neighbor search'.
- Drop the 'enables Lance's built-in IVF-PQ ANN index' claim from the Storage Format section.
- Rephrase the TVF description to state what it returns (top_k closest rows under a distance metric) rather than naming an algorithm.
ai_overview.md and overview.mdx (pre-existing pages but with the same overstatements on this topic):
- Replace 'approximate nearest neighbor (ANN) search' with 'vector similarity search'.
- Replace 'Vector-optimized columnar format with built-in ANN indexing' with 'Vector-friendly columnar format for AI/ML workloads'.
- Replace the 'Efficient vector indexing and ANN search' Lance bullet with the accurate file-format-level benefit (native vector column encoding, no conversion overhead).
5ca9ca1 to c2b54db
Describe the issue this Pull Request addresses
Updates the website documentation under website/docs/ to cover features new in Hudi 1.2.0. Release notes and the auto-generated config reference (configurations.md/basic_configurations.md) are tracked separately and are not in scope here.
Summary and Changelog
show_timeline procedure, new CLI commands, and additional metrics.
Impact
Documentation-only. No code or build changes.
Risk Level
low
Documentation Update
This PR is the documentation update for 1.2.0 features. The auto-generated config reference and the 1.2.0 release notes are tracked separately.
Contributor's checklist