docs: Add release 1.2.0 documentation #18880
Snapshot the current docs into versioned_docs/version-1.2.0 and make 1.2.0 the default in docusaurus.config.js. Bump release-1.1 redirect targets to release-1.2 and refresh hardcoded Hudi versions (1.0.x/1.1.1) in code examples to 1.2.0.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR snapshots the current docs into versioned_docs/version-1.2.0 and bumps hardcoded versions, registering 1.2.0 as the default, while also introducing new 1.2.0 content around Lance, vector search, BLOB/VARIANT types, HMS 4.x JDBC fallback, and new Flink/Spark configs. The version bumps are mostly consistent, but a few stale version strings appear to have been missed (e.g., hudi-gcp-bundle-0.13.0.jar, hudi-flink-bundle_2.11-0.9.0.jar, the CLI table-version mapping), and there are some typos, a leftover author TODO, and minor inconsistencies in the new feature descriptions worth a cleanup pass. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
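To make the cleanup pass for stale version strings repeatable, a small scan along these lines could help. This is a minimal sketch, not part of the PR: the function name `find_stale_jar_versions`, the regex, and the sample jar name `hudi-spark3.5-bundle_2.12-1.2.0.jar` are assumptions for illustration. Some older versions may be intentional, so each hit still needs a manual look.

```python
import re

# Assumed pattern: a Hudi jar name that ends in "-<major>.<minor>.<patch>.jar",
# e.g. hudi-gcp-bundle-0.13.0.jar or hudi-flink-bundle_2.11-0.9.0.jar.
JAR_PATTERN = re.compile(r"hudi-[A-Za-z0-9_.\-]*?-(\d+\.\d+\.\d+)\.jar")


def find_stale_jar_versions(text, expected="1.2.0"):
    """Return (line_number, jar_name, version) for jars not at `expected`."""
    stale = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in JAR_PATTERN.finditer(line):
            if match.group(1) != expected:
                stale.append((lineno, match.group(0), match.group(1)))
    return stale


# Illustrative input only; the third jar name is a made-up up-to-date example.
sample = (
    "cp hudi-gcp-bundle-0.13.0.jar /opt/jars\n"
    "use hudi-flink-bundle_2.11-0.9.0.jar\n"
    "use hudi-spark3.5-bundle_2.12-1.2.0.jar\n"
)
for hit in find_stale_jar_versions(sample):
    print(hit)
```

Running the same function over each file under versioned_docs/version-1.2.0 would give a checklist of candidate lines to review.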
| --- | --- | --- |
| Flink global RLI with native MDT access | `index.type=GLOBAL_RECORD_LEVEL_INDEX`<br />`metadata.enabled=true`<br />`index.global.enabled=true`<br />`index.bootstrap.enabled=false`<br />`hoodie.metadata.record.level.index.defer.init=false` | Flink reads global record locations directly from the metadata table and uses the in-task RLI cache for recently accessed keys. Use this when a shared metadata-table index is preferred over task-local RocksDB state. |
| Flink global RLI with local RocksDB cache | `index.type=GLOBAL_RECORD_LEVEL_INDEX`<br />`metadata.enabled=true`<br />`index.global.enabled=true`<br />`index.bootstrap.enabled=true`<br />`index.bootstrap.rocksdb.path=<local-path>`<br />`hoodie.metadata.record.level.index.defer.init=false` | Flink bootstraps the global RLI into a local RocksDB backend. This can reduce repeated metadata-table lookups at the cost of local disk usage and bootstrap time. |
| Dynamic bucket scaling with partitioned RLI | `index.type=RECORD_LEVEL_INDEX`<br />`metadata.enabled=true`<br />`index.global.enabled=false`<br />`hoodie.metadata.record.level.index.min.filegroup.count=<initial-file-groups-per-partition>`<br />`hoodie.metadata.record.level.index.max.filegroup.count=<max-file-groups-per-partition>`<br />Optionally tune `index.rli.cache.size` and `index.rli.cache.concurrent.partitions.num` for the partition cache. | Flink uses partition-scoped RLI to route existing keys to their recorded file group and assigns new keys through dynamic bucket assignment. This supports streaming upsert and insert overwrite workloads. |
🤖 The default column for index.bootstrap.enabled says false (Optional), but the description says "If not explicitly set for global RLI, Flink enables bootstrap by default." These appear contradictory. The companion table in ingestion_flink.md simply lists the default as false. Could you clarify whether the actual runtime default flips to true for GLOBAL_RECORD_LEVEL_INDEX, or whether the description is overstating it? Otherwise users wiring this up for the first time may set it explicitly only because they're unsure.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
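Until the default is clarified, setting the flag explicitly sidesteps the ambiguity. The sketch below renders one option set as a Flink SQL `WITH` clause. The keys and values are copied from the "Flink global RLI with native MDT access" row in the diff. The helper `render_with_clause` is hypothetical, and nothing here is a statement about the actual runtime default.

```python
# Option set copied from the "Flink global RLI with native MDT access" row.
GLOBAL_RLI_NATIVE_MDT = {
    "index.type": "GLOBAL_RECORD_LEVEL_INDEX",
    "metadata.enabled": "true",
    "index.global.enabled": "true",
    # Set explicitly so the result does not depend on the disputed default.
    "index.bootstrap.enabled": "false",
    "hoodie.metadata.record.level.index.defer.init": "false",
}


def render_with_clause(options):
    """Render a dict of table options as a Flink SQL WITH (...) clause."""
    lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return "WITH (\n" + ",\n".join(lines) + "\n)"


print(render_with_clause(GLOBAL_RLI_NATIVE_MDT))
```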
| Option Name | Description | Default | Remarks |
| ----------- | ------- | ------- | ------- |
| `write.buffer.disruptor.ring.size` | Size of the Disruptor ring buffer (must be a power of 2) | `16384` | Larger values absorb write bursts but consume more heap memory |
🤖 The wording "As of Hudi 1.2.0, Flink writers support TIMELINE_SERVER_BASED marker type" reads as if this is a brand-new capability in 1.2.0. TIMELINE_SERVER_BASED markers themselves have been around since 0.9.0. Is the intent to call out that Flink specifically gained support in 1.2.0, or is this restating an existing feature? @danny0405 could you confirm when Flink writers gained TIMELINE_SERVER_BASED marker support so this can be worded accurately?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
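Separately, the `write.buffer.disruptor.ring.size` row in the hunk above requires a power of 2. A short check like this one could help readers pick a valid value. It is a generic sketch: the helper names are hypothetical and this is not how Hudi validates the option.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("ring size must be at least 1")
    return 1 << (n - 1).bit_length()


print(is_power_of_two(16384))    # the documented default
print(next_power_of_two(10000))  # rounds an arbitrary size up
```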
a query vector under the chosen distance metric.

```sql
SELECT *
```
🤖 Minor wording cleanup: "secondary keys stores in the secondary index partition" → "secondary keys stored in the secondary index partition". Also, two lines later, "All the different index types can be queries by specifying..." → "queried".
- AI-generated; verify before applying. React 👍/👎 to flag quality.
-- COW
CREATE TABLE my_ai_table (
  id STRING,
  embedding VECTOR(768),
🤖 The Lance CREATE TABLE example sets hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger'. This config typically accepts a comma-separated list of merger class names. Is DefaultSparkRecordMerger the correct FQCN required here, and is it a required property for Lance-backed tables, or only recommended? A short sentence clarifying why this needs to be set explicitly (and that it's a single value here, not the full list) would help users not familiar with the merger API. @yihua could you confirm the recommended value here?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
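To illustrate the single-value versus list point for readers new to the merger API, here is a minimal sketch of how a comma-separated class list could be split. The parser `parse_merger_impls` and the class name `com.example.CustomMerger` are hypothetical, and this is not Hudi's own parsing code.

```python
def parse_merger_impls(value: str) -> list[str]:
    """Split a comma-separated list of class names, dropping blank entries."""
    return [name.strip() for name in value.split(",") if name.strip()]


# Single value, as in the Lance CREATE TABLE example under review.
print(parse_merger_impls("org.apache.hudi.DefaultSparkRecordMerger"))

# A list with a second, made-up class name to show the multi-value form.
print(parse_merger_impls(
    "org.apache.hudi.DefaultSparkRecordMerger, com.example.CustomMerger"
))
```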
```
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000
hoodie.datasource.hive_sync.username=<username>
hoodie.datasource.hive_sync.password=<password>
```
🤖 The HMS 4.x JDBC fallback section says fallback only happens for callers that go through Thrift first, and the flag never resets within a HoodieHiveSyncClient instance. Should we explicitly note here whether long-running writers (e.g., Hudi Streamer continuous mode) will re-probe Thrift on each sync run, or whether the same sync-client instance is reused across runs and therefore stays in JDBC-fallback mode for the lifetime of the process? Users tuning continuous pipelines on HMS 4.x will want to know.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
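The difference between the two cases can be shown with a toy model. This is a sketch of the behavior as described in the docs section (Thrift first, a fallback flag that never resets per client instance). The class `StickyFallbackClient` is hypothetical and is not `HoodieHiveSyncClient`. Whether Hudi Streamer reuses one client across runs is exactly the open question.

```python
class StickyFallbackClient:
    """Toy model: try Thrift first, then stay on JDBC after the first failure."""

    def __init__(self, thrift_call, jdbc_call):
        self._thrift_call = thrift_call
        self._jdbc_call = jdbc_call
        self.use_jdbc = False      # never reset once set
        self.thrift_attempts = 0   # counts Thrift probes for illustration

    def sync(self):
        if not self.use_jdbc:
            self.thrift_attempts += 1
            try:
                return self._thrift_call()
            except RuntimeError:
                self.use_jdbc = True
        return self._jdbc_call()


def failing_thrift():
    # Stand-in for a Thrift call that an HMS 4.x server rejects.
    raise RuntimeError("thrift call not supported")


def jdbc_ok():
    return "synced via jdbc"


# Case 1: one client reused across three sync runs probes Thrift once.
reused = StickyFallbackClient(failing_thrift, jdbc_ok)
reused_results = [reused.sync() for _ in range(3)]

# Case 2: a fresh client per run probes Thrift on every run.
fresh_attempts = 0
for _ in range(3):
    client = StickyFallbackClient(failing_thrift, jdbc_ok)
    client.sync()
    fresh_attempts += client.thrift_attempts

print(reused.thrift_attempts, fresh_attempts)
```

Stating which of the two cases applies to continuous mode would answer the question for users tuning pipelines on HMS 4.x.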
Describe the issue this Pull Request addresses
Cuts the 1.2.0 documentation by snapshotting the current docs into versioned_docs/version-1.2.0 and making 1.2.0 the default version in docusaurus.config.js. Also refreshes hardcoded Hudi versions in code examples (1.0.x / 1.1.1 → 1.2.0) within the 1.2.0 docs.
Release-page redirects continue to point at release-1.1; they will be flipped to release-1.2 in the follow-up PR that adds releases/release-1.2.md.
Summary and Changelog
versioned_docs/version-1.2.0 + sidebar, register 1.2.0 in versions.json
docusaurus.config.js
Impact
As above
Risk Level
none
Documentation Update
Release 1.2.0 documentation
Contributor's checklist