feat(lance): fix lance writer/reader regarding arrow memory limit issue #18613
Conversation
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR makes the Lance reader/writer Arrow allocator sizing configurable, adds a byte-aware flush watermark, and releases per-batch Arrow buffers so the doubling growth never escalates past the allocator cap. One question on getDataSize() semantics and one small consistency nit are noted inline; otherwise the code is clean and the Arrow memory-management rationale is well-documented throughout. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.
```diff
-      currentBufferSize += vector.getBufferSize();
-    }
-  }
+  long currentBufferSize = currentBatchSize > 0 ? allocator.getAllocatedMemory() : 0;
```
🤖 Switching getDataSize() from summing FieldVector.getBufferSize() to allocator.getAllocatedMemory() is a broader change than just fixing the mid-batch zero issue: the same allocator is also passed to LanceFileWriter.open(...) in initializeWriter(), so any buffers Lance holds internally now count toward getDataSize() and toward the flush watermark. Have you measured what overhead the Lance writer carries here? If it's non-trivial it could cause earlier file rolls and earlier batch flushes than the prior per-vector accounting. If that's a concern, one option is to keep the per-vector sum at flush boundaries (where valueCount is set) and only fall back to allocator.getAllocatedMemory() for the live mid-batch reading.
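For reference, the mid-batch zero behavior referred to above can be sketched in a self-contained toy. `VecStub` is a hand-written stand-in that mimics the cited `getBufferSize()` short-circuit; it is not the real Arrow `FieldVector` API:

```java
// Self-contained toy; VecStub is a stand-in for Arrow's variable-width
// vectors, NOT the real Arrow API. It mirrors the behavior cited in the PR:
// getBufferSize() returns 0 until valueCount is set at finishBatch().
public class MidBatchZeroDemo {
  static class VecStub {
    final long capacityBytes;   // bytes actually held by the vector's buffers
    int valueCount = 0;         // only set when a batch is finished

    VecStub(long capacityBytes) { this.capacityBytes = capacityBytes; }

    long getBufferSize() {
      // Mirrors the short-circuit: mid-batch (valueCount == 0) reports 0.
      return valueCount == 0 ? 0 : capacityBytes;
    }
  }

  public static void main(String[] args) {
    VecStub vec = new VecStub(64L * 1024 * 1024); // ~64 MB written mid-batch
    System.out.println("per-vector sum mid-batch: " + vec.getBufferSize()); // 0
    vec.valueCount = 1000;                        // what finishBatch() would do
    System.out.println("per-vector sum at flush:  " + vec.getBufferSize());
  }
}
```

A watermark driven by the per-vector sum would see 0 until the batch is finished, which is why the live reading has to come from the allocator instead.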
- AI-generated; verify before applying. React 👍/👎 to flag quality.
```java
 * in-progress batch. The in-progress portion uses {@link BufferAllocator#getAllocatedMemory()}
 * — see {@link #currentBufferBytes()} for why per-vector {@code getBufferSize()} is unreliable
 * mid-batch. This may slightly overestimate due to Arrow's pre-allocation overhead.
 */
```
🤖 nit: could you call currentBufferBytes() here instead of inlining allocator.getAllocatedMemory() directly? The method's own Javadoc says "see currentBufferBytes() for why per-vector getBufferSize() is unreliable mid-batch", so it would be cleaner to have getDataSize() actually route through that method — currentBatchSize > 0 ? currentBufferBytes() : 0 — so both callsites stay in sync if the implementation ever changes.
Codecov Report ❌

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18613      +/-   ##
============================================
- Coverage     68.90%   68.10%     -0.80%
- Complexity    28581    28862       +281
============================================
  Files          2482     2513        +31
  Lines        137053   139965      +2912
  Branches      16713    17322       +609
============================================
+ Hits          94436    95323       +887
- Misses        35009    36804      +1795
- Partials       7608     7838       +230
```
```java
.key("hoodie.lance.write.allocator.size.bytes")
.defaultValue(String.valueOf(256 * 1024 * 1024))
.markAdvanced()
.withDocumentation("Maximum size in bytes of the Arrow child allocator used by the Lance "
```
Consider mentioning that this is experimental and subject to change.
To be addressed in a follow-up PR.
```java
.key("hoodie.lance.write.allocator.size.bytes")
.defaultValue(String.valueOf(256 * 1024 * 1024))
.markAdvanced()
.withDocumentation("Maximum size in bytes of the Arrow child allocator used by the Lance "
```
nit: add `.sinceVersion("1.2.0")` to all the new configs
Describe the issue this Pull Request addresses
The Lance writer hardcodes its Arrow child allocator at 120 MB. Arrow's `BaseLargeVariableWidthVector` grows by power-of-2 doubling (32 → 64 → 128 MB), so a batch holding ~64 MB of payload (e.g. PNG blobs) hits a 128 MB reallocation request that exceeds the cap and OOMs. The reader has the same hardcoded constants (120 MB data, 8 MB metadata).

Repro:
`HUDI_BASE_FILE_FORMAT=lance HUDI_BLOB_MODE=inline HUDI_INLINE_READ_MODE=descriptor` on the blob demo with ≥256 rows.

Summary and Changelog
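The overflow in the issue description comes down to simple arithmetic. A self-contained sketch (the 120 MB cap, 32 MB starting capacity, and power-of-2 growth are from the description above; the loop itself is illustrative, not Arrow code):

```java
// Illustrative arithmetic only; mirrors Arrow's power-of-2 buffer growth
// against the old hardcoded 120 MB allocator cap described above.
public class DoublingOverflowDemo {
  public static void main(String[] args) {
    final long capBytes = 120L * 1024 * 1024;          // old hardcoded cap
    final long payloadBytes = 64L * 1024 * 1024 + 1;   // just over 64 MB of blobs
    long request = 32L * 1024 * 1024;                  // starting buffer capacity

    // Variable-width vectors double until the payload fits: 32 -> 64 -> 128 MB.
    while (request < payloadBytes) {
      request *= 2;
    }
    System.out.println("reallocation request: " + (request >> 20) + " MB");
    System.out.println("exceeds cap (OOM):    " + (request > capBytes));
  }
}
```

Raising the cap alone only moves the threshold; hence the byte-aware flush below, which bounds in-flight bytes before the doubling request is issued.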
Make the Lance Arrow allocator sizes configurable and add a byte-aware flush so in-flight buffers can't escalate past the cap.
New `HoodieStorageConfig` keys (all `markAdvanced`):

- `hoodie.lance.write.allocator.size.bytes` — default 256 MB
- `hoodie.lance.write.flush.byte.watermark` — default 96 MB
- `hoodie.lance.read.allocator.size.bytes` — default 256 MB
- `hoodie.lance.read.metadata.allocator.size.bytes` — default 8 MB

Writer (`HoodieBaseLanceWriter`):

- Takes `allocatorSize` and `flushByteWatermark`; the hardcoded constant is removed.
- Flush condition: `currentBatchSize >= batchSize || allocator.getAllocatedMemory() >= flushByteWatermark`. We use the allocator's tracked memory rather than `FieldVector.getBufferSize()` because the latter short-circuits to 0 for variable-width vectors when `valueCount == 0`, and `valueCount` is only set at `finishBatch()` — so a watermark driven by it never fires mid-batch.
- `flushBatch()` now closes the `VectorSchemaRoot` after writing. Arrow's variable-width vectors grow by doubling and never shrink on `clear()`/`reset()` — without releasing, capacity from one batch is still held when the next starts doubling, so the cap is enforced against accumulated capacity.
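A minimal stand-alone sketch of the two-trigger flush check. The class and method names here are illustrative, not the actual `HoodieBaseLanceWriter` code, and in the PR the byte figure comes from `allocator.getAllocatedMemory()` rather than a parameter:

```java
// Illustrative sketch of the two-trigger flush condition; names are
// hypothetical, defaults match the documented config values.
public class FlushTrigger {
  static final int BATCH_SIZE = 1000;                   // existing row threshold
  static final long BYTE_WATERMARK = 96L * 1024 * 1024; // flush.byte.watermark default

  static boolean shouldFlush(int currentBatchSize, long allocatedBytes) {
    // Either trigger flushes, so a few huge rows cannot accumulate
    // unbounded bytes while staying under the row threshold.
    return currentBatchSize >= BATCH_SIZE || allocatedBytes >= BYTE_WATERMARK;
  }

  public static void main(String[] args) {
    System.out.println(shouldFlush(10, 100L * 1024 * 1024)); // true (bytes)
    System.out.println(shouldFlush(1000, 0));                // true (rows)
    System.out.println(shouldFlush(10, 1L * 1024 * 1024));   // false
  }
}
```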
Reader (`HoodieSparkLanceReader`, `SparkLanceReaderBase`):

- Allocator sizes now come from `storageConf`.
- The `HoodieFileReaderFactory.newLanceFileReader` signature now takes `HoodieConfig`.

Plumbing: the `HoodieSparkLanceWriter` builder, `HoodieSparkFileWriterFactory`, `HoodieInternalRowFileWriterFactory`, and `HoodieSparkFileReaderFactory` thread the new keys through. Test references to the removed `public` constant are updated.

Impact
Behavioral: default writer/reader allocator caps go from 120 MB → 256 MB; the writer also flushes on the byte watermark in addition to the 1000-row threshold. Tunable per workload via the four new configs. No public API removed; `HoodieFileReaderFactory.newLanceFileReader(StoragePath)` is replaced by `(HoodieConfig, StoragePath)`.

Risk Level
Low — Lance is opt-in (`HUDI_BASE_FILE_FORMAT=lance`) and the Parquet path is untouched. The `newLanceFileReader` signature change is internal (`protected`). Verified the failing demo passes at 256 / 500 / 5000 rows across `inline+descriptor`, `inline+content`, and `out_of_line` blob modes; the Parquet baseline is unchanged.

Documentation Update
The four new `HoodieStorageConfig` keys carry config descriptions and `markAdvanced()`; they will surface in the auto-generated config reference. No website change required.

Contributor's checklist