feat(lance): fix lance writer/reader regarding arrow memory limit issue #18613
Conversation
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR makes the Lance reader/writer Arrow allocator sizing configurable, adds a byte-aware flush watermark, and releases per-batch Arrow buffers so the doubling growth never escalates past the allocator cap. One question on getDataSize() semantics and one small consistency nit are noted inline; otherwise the code is clean and the Arrow memory-management rationale is well-documented throughout. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.
```diff
-      currentBufferSize += vector.getBufferSize();
-    }
-  }
+  long currentBufferSize = currentBatchSize > 0 ? allocator.getAllocatedMemory() : 0;
```
🤖 Switching getDataSize() from summing FieldVector.getBufferSize() to allocator.getAllocatedMemory() is a broader change than just fixing the mid-batch zero issue: the same allocator is also passed to LanceFileWriter.open(...) in initializeWriter(), so any buffers Lance holds internally now count toward getDataSize() and toward the flush watermark. Have you measured what overhead the Lance writer carries here? If it's non-trivial it could cause earlier file rolls and earlier batch flushes than the prior per-vector accounting. If that's a concern, one option is to keep the per-vector sum at flush boundaries (where valueCount is set) and only fall back to allocator.getAllocatedMemory() for the live mid-batch reading.
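For reference, the mid-batch zero behavior referred to above can be sketched in a self-contained toy. `VecStub` is a hand-written stand-in that mimics the cited `getBufferSize()` short-circuit; it is not the real Arrow `FieldVector` API:

```java
// Self-contained toy; VecStub is a stand-in for Arrow's variable-width
// vectors, NOT the real Arrow API. It mirrors the behavior cited in the PR:
// getBufferSize() returns 0 until valueCount is set at finishBatch().
public class MidBatchZeroDemo {
  static class VecStub {
    final long capacityBytes;   // bytes actually held by the vector's buffers
    int valueCount = 0;         // only set when a batch is finished

    VecStub(long capacityBytes) { this.capacityBytes = capacityBytes; }

    long getBufferSize() {
      // Mirrors the short-circuit: mid-batch (valueCount == 0) reports 0.
      return valueCount == 0 ? 0 : capacityBytes;
    }
  }

  public static void main(String[] args) {
    VecStub vec = new VecStub(64L * 1024 * 1024); // ~64 MB written mid-batch
    System.out.println("per-vector sum mid-batch: " + vec.getBufferSize()); // 0
    vec.valueCount = 1000;                        // what finishBatch() would do
    System.out.println("per-vector sum at flush:  " + vec.getBufferSize());
  }
}
```

A watermark driven by the per-vector sum would see 0 until the batch is finished, which is why the live reading has to come from the allocator instead.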
- AI-generated; verify before applying. React 👍/👎 to flag quality.
```java
 * in-progress batch. The in-progress portion uses {@link BufferAllocator#getAllocatedMemory()}
 * — see {@link #currentBufferBytes()} for why per-vector {@code getBufferSize()} is unreliable
 * mid-batch. This may slightly overestimate due to Arrow's pre-allocation overhead.
 */
```
🤖 nit: could you call currentBufferBytes() here instead of inlining allocator.getAllocatedMemory() directly? The method's own Javadoc says "see currentBufferBytes() for why per-vector getBufferSize() is unreliable mid-batch", so it would be cleaner to have getDataSize() actually route through that method — currentBatchSize > 0 ? currentBufferBytes() : 0 — so both callsites stay in sync if the implementation ever changes.
Codecov Report ❌

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18613      +/-   ##
============================================
- Coverage     68.90%   68.10%     -0.80%
- Complexity    28581    28862       +281
============================================
  Files          2482     2513        +31
  Lines        137053   139965      +2912
  Branches      16713    17322       +609
============================================
+ Hits          94436    95323       +887
- Misses        35009    36804      +1795
- Partials       7608     7838       +230
```
```java
.key("hoodie.lance.write.allocator.size.bytes")
.defaultValue(String.valueOf(256 * 1024 * 1024))
.markAdvanced()
.withDocumentation("Maximum size in bytes of the Arrow child allocator used by the Lance "
```
Consider mentioning that this is experimental and subject to change.
To be addressed in a follow-up PR.
```java
.key("hoodie.lance.write.allocator.size.bytes")
.defaultValue(String.valueOf(256 * 1024 * 1024))
.markAdvanced()
.withDocumentation("Maximum size in bytes of the Arrow child allocator used by the Lance "
```
nit: add `.sinceVersion("1.2.0")` to all the new configs
Describe the issue this Pull Request addresses
The Lance writer hardcodes its Arrow child allocator at 120 MB. Arrow's `BaseLargeVariableWidthVector` grows by power-of-2 doubling (32 → 64 → 128 MB), so a batch holding ~64 MB of payload (e.g. PNG blobs) hits a 128 MB reallocation request that exceeds the cap and OOMs. The reader has the same hardcoded constants (120 MB data, 8 MB metadata).

Repro:
`HUDI_BASE_FILE_FORMAT=lance HUDI_BLOB_MODE=inline HUDI_INLINE_READ_MODE=descriptor` on the blob demo with ≥256 rows.

Summary and Changelog
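The overflow in the issue description comes down to simple arithmetic. A self-contained sketch (the 120 MB cap, 32 MB starting capacity, and power-of-2 growth are from the description above; the loop itself is illustrative, not Arrow code):

```java
// Illustrative arithmetic only; mirrors Arrow's power-of-2 buffer growth
// against the old hardcoded 120 MB allocator cap described above.
public class DoublingOverflowDemo {
  public static void main(String[] args) {
    final long capBytes = 120L * 1024 * 1024;          // old hardcoded cap
    final long payloadBytes = 64L * 1024 * 1024 + 1;   // just over 64 MB of blobs
    long request = 32L * 1024 * 1024;                  // starting buffer capacity

    // Variable-width vectors double until the payload fits: 32 -> 64 -> 128 MB.
    while (request < payloadBytes) {
      request *= 2;
    }
    System.out.println("reallocation request: " + (request >> 20) + " MB");
    System.out.println("exceeds cap (OOM):    " + (request > capBytes));
  }
}
```

Raising the cap alone only moves the threshold; hence the byte-aware flush below, which bounds in-flight bytes before the doubling request is issued.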
Make the Lance Arrow allocator sizes configurable and add a byte-aware flush so in-flight buffers can't escalate past the cap.
New `HoodieStorageConfig` keys (all `markAdvanced`):

- `hoodie.lance.write.allocator.size.bytes` — default 256 MB
- `hoodie.lance.write.flush.byte.watermark` — default 96 MB
- `hoodie.lance.read.allocator.size.bytes` — default 256 MB
- `hoodie.lance.read.metadata.allocator.size.bytes` — default 8 MB

Writer (`HoodieBaseLanceWriter`):

- Takes `allocatorSize` and `flushByteWatermark`; the hardcoded constant is removed.
- Flush condition: `currentBatchSize >= batchSize || allocator.getAllocatedMemory() >= flushByteWatermark`. We use the allocator's tracked memory rather than `FieldVector.getBufferSize()` because the latter short-circuits to 0 for variable-width vectors when `valueCount == 0`, and `valueCount` is only set at `finishBatch()` — so a watermark driven by it never fires mid-batch.
- `flushBatch()` now closes the `VectorSchemaRoot` after writing. Arrow's variable-width vectors grow by doubling and never shrink on `clear()`/`reset()` — without releasing, capacity from one batch is still held when the next starts doubling, so the cap is enforced against accumulated capacity.
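A minimal stand-alone sketch of the two-trigger flush check. The class and method names here are illustrative, not the actual `HoodieBaseLanceWriter` code, and in the PR the byte figure comes from `allocator.getAllocatedMemory()` rather than a parameter:

```java
// Illustrative sketch of the two-trigger flush condition; names are
// hypothetical, defaults match the documented config values.
public class FlushTrigger {
  static final int BATCH_SIZE = 1000;                   // existing row threshold
  static final long BYTE_WATERMARK = 96L * 1024 * 1024; // flush.byte.watermark default

  static boolean shouldFlush(int currentBatchSize, long allocatedBytes) {
    // Either trigger flushes, so a few huge rows cannot accumulate
    // unbounded bytes while staying under the row threshold.
    return currentBatchSize >= BATCH_SIZE || allocatedBytes >= BYTE_WATERMARK;
  }

  public static void main(String[] args) {
    System.out.println(shouldFlush(10, 100L * 1024 * 1024)); // true (bytes)
    System.out.println(shouldFlush(1000, 0));                // true (rows)
    System.out.println(shouldFlush(10, 1L * 1024 * 1024));   // false
  }
}
```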
Reader (`HoodieSparkLanceReader`, `SparkLanceReaderBase`):

- Allocator sizes now come from `storageConf`.
- The `HoodieFileReaderFactory.newLanceFileReader` signature now takes `HoodieConfig`.

Plumbing: the `HoodieSparkLanceWriter` builder, `HoodieSparkFileWriterFactory`, `HoodieInternalRowFileWriterFactory`, and `HoodieSparkFileReaderFactory` thread the new keys through. Test references to the removed `public` constant are updated.

Impact
Behavioral: default writer/reader allocator caps go from 120 MB → 256 MB; the writer also flushes on the byte watermark in addition to the 1000-row threshold. Tunable per workload via the four new configs. No public API removed; `HoodieFileReaderFactory.newLanceFileReader(StoragePath)` is replaced by `(HoodieConfig, StoragePath)`.

Risk Level
Low — Lance is opt-in (`HUDI_BASE_FILE_FORMAT=lance`) and the Parquet path is untouched. The `newLanceFileReader` signature change is internal (`protected`). Verified the failing demo passes at 256 / 500 / 5000 rows across `inline+descriptor`, `inline+content`, and `out_of_line` blob modes; the Parquet baseline is unchanged.

Documentation Update
The four new `HoodieStorageConfig` keys carry config descriptions and `markAdvanced()`; they will surface in the auto-generated config reference. No website change required.

Contributor's checklist