docs: Update documentation for new features in Hudi 1.2.0 #18867

Merged: yihua merged 44 commits into apache:asf-site from yihua:docs-1.2.0-feature-updates on May 28, 2026
@yihua yihua commented May 28, 2026

Describe the issue this Pull Request addresses

Updates the website documentation under website/docs/ to cover features new in Hudi 1.2.0. Release notes and the auto-generated config reference (configurations.md / basic_configurations.md) are tracked separately and are not in scope here.

Summary and Changelog

  • New documentation for major 1.2.0 capabilities: Lance file format, VARIANT / VECTOR / BLOB data types, vector search, Flink Source V2, RLI-based Flink bucket indexing, Azure storage-based lock provider, pre-write validators, the JSON Kinesis source, show_timeline procedure, new CLI commands, and additional metrics.
  • Corrections and tightening across the existing AI / Lance / BLOB pages (config keys, defaults, engine-support claims).
  • Updates to Flink, cleaning, clustering, compaction, metadata, indexing, sync, validator, and reading/writing pages so each new capability lives in its natural place with a light inline marker rather than a dedicated "new in 1.2.0" section.

Impact

Documentation-only. No code or build changes.

Risk Level

low

Documentation Update

This PR is the documentation update for 1.2.0 features. The auto-generated config reference and the 1.2.0 release notes are tracked separately.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable — N/A (documentation-only)

github-actions bot added the docs and size:XL (PR with lines of changes > 1000) labels on May 28, 2026

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the docs update! This PR refreshes the documentation to cover a broad set of Hudi 1.2.0 features — Spark session-level configs, incremental column pruning, the vector search TVF, metastore sync improvements, VARIANT/VECTOR engine support matrices, rolling metadata, and various advanced storage and concurrency knobs — and corrects a number of existing AI-feature pages. The main concerns are a handful of factual inconsistencies across pages (engine/version support claims, config renames, default-value flips, cross-page description mismatches) plus a few invalid example snippets and a likely typo worth resolving before merge. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

Comment thread website/docs/sql_queries.md
Comment thread website/docs/sql_queries.md Outdated
Comment thread website/docs/variant_type.md
Comment thread website/docs/syncing_metastore.md Outdated
Comment thread website/docs/writing_data.md
Comment thread website/docs/record_merger.md Outdated
Comment thread website/docs/indexes.md Outdated
Comment thread website/docs/ingestion_flink.md
Comment thread website/docs/metadata.md
Comment thread website/docs/blob_unstructured_data.md
@rahil-c rahil-c requested review from rahil-c and voonhous May 28, 2026 00:32
Comment thread website/docs/blob_unstructured_data.md Outdated
Comment thread website/docs/blob_unstructured_data.md
Comment thread website/docs/blob_unstructured_data.md
Comment thread website/docs/blob_unstructured_data.md
Comment thread website/docs/cleaning.md Outdated
Comment thread website/docs/cleaning.md
@yihua yihua force-pushed the docs-1.2.0-feature-updates branch from 1e05663 to 71c096f Compare May 28, 2026 04:06
yihua added a commit to yihua/hudi that referenced this pull request May 28, 2026
…cy_control.md

The HoodiePreWriteCleanerPolicy enum's CLEAN value is documented in
source as: 'Force a CLEAN table service call before starting the write
(also performs rollback of failed writes).' BaseHoodieTableServiceClient
#clean also invokes rollbackFailedWrites before the clean itself.

Update the cleaning.md description so it matches the existing wording in
concurrency_control.md: CLEAN also rolls back failed writes;
ROLLBACK_FAILED_WRITES is the rollback-only variant.

Addresses review comment on PR apache#18867.
yihua commented May 28, 2026

@danny0405 and @cshuo could you take another pass on the Flink docs update, particularly in website/docs/flink-quick-start-guide.md, website/docs/flink_tuning.md, website/docs/ingestion_flink.md, website/docs/indexes.md, website/docs/precommit_validator.md, website/docs/reading_tables_batch_reads.md, website/docs/reading_tables_streaming_reads.md, website/docs/sql_ddl.md, website/docs/variant_type.md, website/docs/writing_data.md?

Comment thread website/docs/flink-quick-start-guide.md Outdated
Comment thread website/docs/flink-quick-start-guide.md Outdated
Comment thread website/docs/ingestion_flink.md Outdated
Comment thread website/docs/variant_type.md Outdated
yihua added a commit to yihua/hudi that referenced this pull request May 28, 2026
The Required Dependencies table listed the bare connector artifact
(org.lance:lance-spark-<spark>_<scala>) while the shell example below
referenced the shaded bundle JAR (lance-spark-bundle-3.5_2.12-0.4.0.jar).
Lance publishes both — the bare connector is what Hudi declares as a
Maven dependency internally, while the -bundle- variant is the shaded
uber-JAR meant for spark-shell/spark-submit --jars.

Align the table to the bundle artifacts (matching the shell example and
what users actually need on a Spark classpath), and add a one-paragraph
note explaining the connector vs. bundle distinction.

Addresses review comment on PR apache#18867.
Comment thread website/docs/ingestion_flink.md Outdated
Comment thread website/docs/reading_tables_batch_reads.md Outdated
Comment thread website/docs/sql_ddl.md
Comment on lines +1016 to +1022
- **`VECTOR(dim[, elementType])`** — stores fixed-dimension embedding vectors (e.g. `VECTOR(768)`,
`VECTOR(768, FLOAT)`, `VECTOR(768, DOUBLE)`). Enables approximate nearest-neighbor search via
the `hudi_vector_search` TVF. See [Vector Search](vector_search.md) for full details.

- **`BLOB`** — stores arbitrary binary objects (images, audio, documents) either inline within the
base file or as external references. See [BLOB / Unstructured Data](blob_unstructured_data.md)
for the storage modes, DDL syntax, and read APIs.
yihua (Contributor, Author) replied:

We'll have a separate PR to move the new type support docs here.

yihua added a commit to yihua/hudi that referenced this pull request May 28, 2026
syncing_metastore.md: Address open questions about the HMS 4.x JDBC
fallback by stating the detection scope (per HoodieHiveSyncClient
instance, flag is monotonic), the no-fallback path (mode=hms / hiveql
hitting HMS 4.x logs an error and surfaces the original exception),
and how JDBC connection failures surface (eagerly at sync-client
construction time, wrapped as 'Failed to create HiveMetaStoreClient'
with the JDBC exception as cause).

variant_type.md: Drop 'fixed in 1.2.0' from the Spark 4.1 row and
restate both Spark 4.0 and 4.1 rows in terms of what is supported
today — full read/write/query for COW and MOR — rather than the
historical fix narrative.

Addresses review comment on PR apache#18867.
@nsivabalan nsivabalan left a comment

LGTM.

Comment thread website/docs/clustering.md Outdated
Comment thread website/docs/ingestion_flink.md
Comment thread website/docs/blob_unstructured_data.md Outdated
Comment thread website/docs/blob_unstructured_data.md Outdated
Comment thread website/docs/blob_unstructured_data.md
Comment thread website/docs/hoodie_streaming_ingestion.md Outdated
Comment thread website/docs/hoodie_streaming_ingestion.md
Comment thread website/docs/lance_file_format.md Outdated
Comment thread website/docs/lance_file_format.md Outdated
yihua added 2 commits May 28, 2026 09:17
- Fix wrong config key: hoodie.datasource.write.base.file.format does
  not exist; correct key is hoodie.table.base.file.format
- Correct indexing claim: column stats and partition stats are
  auto-disabled on Lance; only bloom filters are supported
- Add Engine Support / Limitations: Spark-only (Flink/Hive throw), VARIANT
  columns rejected at write, files are non-splittable
- Add File Sizing and Memory configs (max.file.size, allocator.size.bytes,
  flush.byte.watermark)
- Add Schema Evolution section (add-column, type promotion; FLOAT->DOUBLE
  and FLOAT->STRING not supported)
- Add MOR (Lance base + Avro log) example
- Add Spark 4.0 and 4.1 dependency rows; clarify Lance JAR is not bundled
- Note VECTOR columns round-trip as native Lance FixedSizeList enabling
  IVF-PQ; BLOB inline columns default to DESCRIPTOR on Lance
- Fix wrong default for hoodie.read.blob.inline.mode (DESCRIPTOR, not
  CONTENT) and rewrite the (previously backwards) description
- Add 'managed: boolean' to all BLOB reference struct examples (INLINE
  CAST, OUT_OF_LINE named_struct, DataFrame schema) — required field
- Add caution that read_blob() on INLINE columns under the default
  DESCRIPTOR mode returns a descriptor; set inline.mode=CONTENT to
  materialize bytes
- Add Metastore Sync section: BLOB maps to STRUCT in Hive and BigQuery
yihua added 27 commits May 28, 2026 09:19
Remove 'New in 1.2.0' and '(1.2.0)' section wrappers. Each feature now
lives in the thematically appropriate section of its page, with a light
inline marker (e.g. 'Available since Hudi 1.2.0' or 'Hudi 1.2.0
introduced ...'). This keeps the docs accurate as future minor releases
ship.

cleaning.md: Empty clean, capping commits-per-run, driver-side planning,
  MDT cleaner derivation, pre-write cleaner policy, and full-clean
  partition filtering merged into the existing retention-policy, configs,
  and 'Ways to trigger Cleaning' sections.

clustering.md: CommitBasedClusteringPlanStrategy,
  SparkStreamCopyClusteringPlanStrategy, single-group control, file-slice
  sort order, and driver-side plan generation merged into the existing
  Plan Strategy section. Failed-plan expiration merged into the
  HoodieClusteringJob section.

compaction.md: MDT compaction trigger strategy and external-platform
  delegation merged into the Strategies in Compaction Scheduling section.
  Log compaction blocks threshold placed alongside the existing Flink
  Offline Compaction options.

metadata.md: Auto-delete of disabled MDT partitions merged into Metadata
  Tracking on Writers. New 'MDT Cleaner and Compaction' section after
  Concurrency Control. New 'Timeline Archival Controls' section before
  Related Resources.

metadata_indexing.md: RLI config-key renames and additional configs
  (defer.init, max.filegroup.size) merged into the existing
  Configurations section. Auto-delete cross-reference inlined into Drop
  Index.

metrics.md: Archival, rollback / post-commit / duration, per-table
  registry isolation, and log-block compaction sections promoted to
  top-level peers of the existing metrics sections.

precommit_validator.md: '(1.2.0)' stripped from three section headings;
  inline version markers added.

concurrency_control.md: '(1.2.0)' stripped from Exclusive Rollbacks
  section heading; opening sentence rephrased with inline marker.

writing_data.md: Rephrase 'available in 1.2.0' to 'were added in Hudi
  1.2.0'.

Remove the top-level Hudi 1.2.0 prerequisites callout (which would go
stale as future releases ship). Replace with a per-Spark-version bundle
picker note next to the Hudi Streamer Spark bundle reference, listing
the hudi-spark<X>-bundle artifacts and the Java runtime requirement
that follows from the Spark version.

Remove the version-labeled '1.2.0 New Features' bullet from the
'Where To Go From Here?' section so it doesn't go stale. Each feature
now lives in its natural topical bullet:

- Append write buffer modes and RLI bucket indexing -> Writing Data
- Lookup join (with RocksDB cache) -> Reading Data
- Managed-Memory Write Buffer -> Tuning

Flink Source V2 was already linked from the Reading Data bullet.

Rename the section heading to '## Flink Source V2'. The RFC-95
reference now lives as an inline link to the RFC doc in the opening
sentence of the section.

Update all cross-page anchors from #flink-source-v2-rfc-95 to
#flink-source-v2 in:
- flink-quick-start-guide.md
- reading_tables_batch_reads.md
- reading_tables_streaming_reads.md (2 anchors)
- flink_tuning.md
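
The anchor update above amounts to a mechanical find-and-replace across the listed pages. A minimal Python sketch of that step follows; the helper names `rewrite_anchor` and `rewrite_files` are hypothetical and are not part of the Hudi repository, and the docs directory is whatever path the caller passes in.

```python
from pathlib import Path
from typing import Iterable, List

# Anchor strings taken from the commit message above.
OLD_ANCHOR = "#flink-source-v2-rfc-95"
NEW_ANCHOR = "#flink-source-v2"


def rewrite_anchor(text: str) -> str:
    """Replace every occurrence of the old anchor with the new one."""
    return text.replace(OLD_ANCHOR, NEW_ANCHOR)


def rewrite_files(docs_dir: Path, names: Iterable[str]) -> List[str]:
    """Rewrite the anchor in each named file; return the names that changed."""
    changed = []
    for name in names:
        path = docs_dir / name
        original = path.read_text(encoding="utf-8")
        updated = rewrite_anchor(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(name)
    return changed
```

Running the rewrite a second time is a no-op, because the new anchor no longer contains the `-rfc-95` suffix that the search string requires.
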
- Open with what the source does and what users get out of the box
  (parallel per-shard reads, auto handling of shard splits and merges,
  KPL de-aggregation) instead of an implementation-detail intro.
- Trim the configuration table to the six keys most users actually set
  (stream.name, region, starting.position, max.events, append.offsets,
  partitions). Point to the configurations reference for credential,
  endpoint, retry, and rate-limit tuning.
- Drop hoodie.streamer.source.kinesis.persist.fetch.rdd from the config
  table and the example, and drop the caution callout.
- Add a Checkpoint format section documenting the
  streamName,shardId:value,shardId:value,... encoding and the four
  value variants (lastSeq, lastSeq@arrivalTime, lastSeq|endSeq,
  lastSeq@arrivalTime|endSeq) — useful for debugging and manual
  checkpoint resets.
- Rename the section heading from 'Amazon Kinesis Source
  (JsonKinesisSource)' to 'Amazon Kinesis Source'; the class name
  stays inline in the opening sentence.
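
For manual checkpoint debugging, the encoding described above (`streamName,shardId:value,shardId:value,...` with the four value variants) can be split apart with a few string operations. The following is a minimal Python sketch based only on that description, not on the Hudi source: the `ShardCheckpoint` class and both function names are hypothetical, and `arrivalTime` is kept as an opaque string because its exact type is not stated here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ShardCheckpoint:
    """One shard's checkpoint value (hypothetical container for illustration)."""
    last_seq: str
    arrival_time: Optional[str] = None  # present only in the '@' variants
    end_seq: Optional[str] = None       # present only in the '|' variants


def parse_shard_value(value: str) -> ShardCheckpoint:
    """Parse lastSeq, lastSeq@arrivalTime, lastSeq|endSeq, or lastSeq@arrivalTime|endSeq."""
    head, has_end, end_seq = value.partition("|")
    last_seq, has_arrival, arrival_time = head.partition("@")
    return ShardCheckpoint(
        last_seq=last_seq,
        arrival_time=arrival_time if has_arrival else None,
        end_seq=end_seq if has_end else None,
    )


def parse_checkpoint(checkpoint: str) -> Tuple[str, Dict[str, ShardCheckpoint]]:
    """Split 'streamName,shardId:value,...' into the stream name and per-shard values."""
    stream_name, *entries = checkpoint.split(",")
    shards = {}
    for entry in entries:
        shard_id, _, value = entry.partition(":")
        shards[shard_id] = parse_shard_value(value)
    return stream_name, shards
```

Sequence numbers are kept as strings rather than converted to integers, which avoids any assumption about their width and preserves them exactly for a manual reset.
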
The Kinesis source belongs alongside the other Hudi Streamer source
types (DFS, Kafka, Pulsar, S3/GCS events, JDBC, SQL). Move it from a
standalone top-level section to a #### Amazon Kinesis subsection under
### Sources, placed between Pulsar and the cloud storage event sources.
Its inner headings demote from ### to ##### accordingly.

Kinesis sequence numbers are 56-digit decimal strings (128-bit integers)
in practice, which made the inline example wrap awkwardly and obscured
the checkpoint structure. Abbreviate with ellipsis, move the example
into a fenced code block, and call out the actual sequence-number
length so users still know what to expect.

Replace the historical narrative (renamed keys, alias table, 'not
previously documented' note) with a single table listing the current
canonical configs for both global and partitioned RLI. Users
configuring a new table only need to know what to set today; the rename
trivia was a source of confusion.

Collapse the four separate metric tables (List of metrics, Archival
Metrics, Rollback/Post-commit/Table Service Duration Metrics,
Log-Block Compaction Gauges) into a single 'List of metrics' table so
readers can find every emitted metric in one place. Folded the
Per-Table MetricRegistry Isolation note into a single paragraph after
the table.

Revert website/docs/quick-start-guide.md to its asf-site state — the
Spark version matrix and Java-requirements note belong in pages closer
to deployment / install rather than on the Spark quick-start guide.

Drop the ReverseOrderHoodieRecordPayload subsection and the
DefaultHoodieRecordPayload sentinel no-op note; these payload classes
should not be documented here.

The section opened with 'Flink 1.2.0 supports declaring Hudi metadata
fields...' — but Flink itself does not have a 1.2.0 in this version
range; the feature is a Hudi capability. Rephrase to state the
capability directly without a version qualifier, matching the
docs-style convention used elsewhere in this PR (state what's
supported, not when it shipped).

…okup join, LIMIT push-down

- Flink support matrix and download/Maven snippets: default build for
  Hudi 1.2.x is Flink 1.20, not 2.1 (verified pom.xml on release-1.2.0).
- RLI bootstrap: clarify that setting index.bootstrap.enabled=false
  after bootstrap is optional; keeping the bootstrap operators is
  harmless.
- VARIANT on Flink: native VARIANT is not supported on any Flink
  version in 1.2.0 (Flink 2.1.x adapter still throws
  'Variant is not supported yet'). Replace the Flink 2.1+ row with the
  accurate statement that VARIANT surfaces as
  ROW<metadata BYTES, value BYTES> on Flink.
- Lookup join example: use a processing-time attribute (PROCTIME()) and
  FOR SYSTEM_TIME AS OF o.proc_time, matching the Hudi lookup join
  tests.
- Reading tables batch reads: drop the LIMIT push-down subsection.
  Legacy source also supports limit push-down, and the implementation
  enforces the limit in the source reader via RecordLimiter, not in
  split enumeration as the previous wording claimed.

…IPTOR mode

- managed field: clarify it only applies to OUT_OF_LINE blobs and that
  no cleaner consumes it yet (it's an intent flag for a future
  managed-blob cleaner). Reflect this in the field reference and the
  inline example comment.
- read_blob() under DESCRIPTOR mode: replace 'returns a descriptor
  reference' with the actual behavior — calling read_blob() on an
  INLINE column throws, since the bytes are not materialized in the
  scan.

Hudi's Lance integration is at the file-format level only — it does not
leverage Lance's table-format-level vector index. The hudi_vector_search
TVF runs a brute-force scan; it does not consume any Lance index. Scrub
all claims to the contrary so the docs match what is actually shipped.
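
To make "brute-force scan" concrete: every row's vector is compared against the query and the `top_k` closest rows are kept, with no index consulted. The sketch below illustrates that idea in plain Python. It is not Hudi's implementation; the function names, the in-memory `(row_id, vector)` representation, and the choice of L2 distance are all assumptions made for the example.

```python
import heapq
import math
from typing import Iterable, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top_k(
    rows: Iterable[Tuple[str, Sequence[float]]],
    query: Sequence[float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every row against the query and return the top_k closest, nearest first."""
    scored = ((row_id, l2_distance(vec, query)) for row_id, vec in rows)
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[1])
```

The cost grows linearly with the number of rows scanned, which is the trade-off the corrected wording is meant to convey: results are exact under the chosen metric, but no index shortcut is involved.
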

lance_file_format.md:
- Drop 'ANN indexing' from the page summary.
- Remove 'IVF-PQ vector index' from the architecture diagram.
- Rewrite the VECTOR Storage on Lance section: keep the true claim
  (VECTOR round-trips as Lance FixedSizeList, no conversion overhead at
  the file-format layer); drop the 'unlocks Lance's built-in IVF-PQ ANN
  index' sentence and the claim that the TVF uses any Lance index.
- Rewrite the Vector Search with Lance section to describe what the TVF
  does (returns the top_k closest rows) without mentioning a specific
  algorithm or Lance index.

vector_search.md:
- Drop 'ANN' from keywords and summary; describe the TVF as 'vector
  similarity search' instead of 'approximate nearest neighbor search'.
- Drop the 'enables Lance's built-in IVF-PQ ANN index' claim from the
  Storage Format section.
- Rephrase the TVF description to state what it returns (top_k closest
  rows under a distance metric) rather than naming an algorithm.

ai_overview.md and overview.mdx (pre-existing pages but with the same
overstatements on this topic):
- Replace 'approximate nearest neighbor (ANN) search' with 'vector
  similarity search'.
- Replace 'Vector-optimized columnar format with built-in ANN indexing'
  with 'Vector-friendly columnar format for AI/ML workloads'.
- Replace the 'Efficient vector indexing and ANN search' Lance bullet
  with the accurate file-format-level benefit (native vector column
  encoding, no conversion overhead).
@yihua yihua force-pushed the docs-1.2.0-feature-updates branch from 5ca9ca1 to c2b54db Compare May 28, 2026 16:19
@yihua yihua merged commit e3a28fb into apache:asf-site May 28, 2026
1 check passed
8 participants