feat(lance): support simplified path for lance blob inline reading #18575
Covers steps 2, 3, and 4 of VC's 1.2 BLOB release plan:
2. Wire INLINE writes - testEndToEndInline drives SparkRDDWriteClient
insert+upsert with INLINE Avro records (data=bytes, reference=null)
against a Parquet-backed Hudi table.
3. read_blob() reads inline values - testInlineBlobRoundTrip runs
SELECT read_blob(col) over an in-memory INLINE DataFrame and
verifies each payload round-trips byte-for-byte.
4. Mixed datasets - testMixedInlineAndOutOfLine builds 10 rows
alternating INLINE and OUT_OF_LINE, points the range rows at
one shared file, and asserts the returned sequence matches input
order (stronger than TestBatchedBlobReader.testMixedBlobTypes,
which orders by record_id before asserting).
testInlineOnHudiBackedTable mirrors the cherry-picked
testReadBlobOnHudiBackedTable (OUT_OF_LINE) but writes INLINE rows
via spark.write.format("hudi") + bulk_insert, reads back through
spark.read.format("hudi"), and resolves via read_blob() - exercises
the full write -> HoodieFileIndex-backed read -> SQL path that the
cherry-picked BatchedBlobReadExec serialization fix unblocks.
No production code changes. BatchedBlobReader already dispatches
INLINE rows into its inline branch (collectBatch field-0 check)
and preserves row order via sortBy(index) in processNextBatch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds a simplified inline-read path for Lance blobs: a user-facing hoodie.read.blob.inline.mode config (today limited to CONTENT), writer-side schema annotation so the nested data bytes route through Lance's dedicated blob writer, and reader-side nullability widening to tolerate Lance's inline-vs-outline materialization shape. The refactor of LanceRecordIterator to skip empty batches in a loop is a small but real correctness improvement. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. Two small suggestions below — a redundant .toSeq pattern in BlobLanceSchemaSupport and a hard-coded metadata key string in the test that should use the shared constant.
cc @yihua
| forceNullableRecursively(child) | ||
| } | ||
| } | ||
| new Field(arrowField.getName, arrowField.getFieldType, newChildren.toSeq.asJava) |
🤖 nit: newChildren is already a Seq[Field] (produced by .asScala.toSeq.map on line 130), so the second .toSeq before .asJava is a no-op — same redundancy in forceNullableRecursively at line 154. Could you drop the extra .toSeq in both spots?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the updates! This round is mostly polish: the BlobLanceSchemaSupport class moved out of org.apache.spark.sql.avro into org.apache.hudi.io.storage (a cleaner home, since it doesn't rely on Spark Avro internals), LanceRecordIterator was refactored to drive batch iteration through ColumnarBatch.rowIterator() instead of a manual rowIdInBatch counter (logic is equivalent — empty-batch skip and columnVector caching are preserved), and several comments were tightened. No new issues flagged from this automated pass, and the two prior nits (redundant .toSeq calls in BlobLanceSchemaSupport, hard-coded "lance-encoding:blob" string in the test) remain open but are low-priority — a Hudi committer or PMC member can take it from here for a final review.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
CodeRabbit Walkthrough: This pull request introduces comprehensive support for Apache Hudi across multiple layers: adds Hadoop/Hive/Spark development environments via Docker Compose, implements Lance file format read/write capabilities with vector metadata, introduces Blob and Variant logical types with schema conversion, enhances multi-writer rollback semantics with heartbeat ownership, extends timeline archival with clean-based retention boundaries, migrates map utilities to CollectionUtils, and adds Flink continuous-sort buffering with dictionary-encoded partition paths.
Sequence Diagram (CodeRabbit):
sequenceDiagram
participant Writer1 as Writer 1
participant LM as Lock Manager
participant TM as Timeline
participant Writer2 as Writer 2
participant HB as Heartbeat
Writer1->>LM: Acquire scheduling lock (skipLocking=false)
LM-->>Writer1: Lock acquired
Writer1->>TM: Read timeline, check for pending rollback
alt No pending rollback exists
Writer1->>TM: Schedule rollback plan
TM-->>Writer1: Plan scheduled
else Pending rollback exists
Writer1->>Writer1: Return false (avoid duplicate)
end
Writer1->>LM: Release scheduling lock
Writer2->>LM: Acquire execution lock (multi-writer heartbeat)
LM-->>Writer2: Lock acquired
Writer2->>HB: Acquire rollback heartbeat ownership
alt Heartbeat doesn't exist or inactive
HB-->>Writer2: Ownership granted
Writer2->>TM: Execute rollback (call table.rollback)
TM-->>Writer2: Rollback completed
Writer2->>HB: Stop heartbeat (finally block)
else Heartbeat active from other writer
HB-->>Writer2: Ownership denied
Writer2->>Writer2: Return false (skip execution)
end
Writer2->>LM: Release execution lock
CodeRabbit: hudi-agent#19 (review)
| valueType = forceTypeNullable(m.valueType), | ||
| valueContainsNull = true) | ||
| case other => other | ||
| } |
Recurse into non-BLOB containers when widening nullability.
This only rewrites top-level fields. A column such as media STRUCT<title: STRING, content: BLOB> keeps content unchanged because media is not itself a BLOB field, so nested inline/null BLOB materialization can still hit the same projection NPE you're trying to avoid.
Suggested fix
private def widenBlobSubtreeNullability(schema: StructType): StructType = {
- StructType(schema.fields.map { f =>
- if (BlobLanceSchemaSupport.isBlobField(f)) {
- f.copy(nullable = true, dataType = forceTypeNullable(f.dataType))
- } else {
- f
- }
- })
+ StructType(schema.fields.map(rewriteField))
}
+private def rewriteField(field: StructField): StructField =
+ if (BlobLanceSchemaSupport.isBlobField(field)) {
+ field.copy(nullable = true, dataType = forceTypeNullable(field.dataType))
+ } else {
+ field.copy(dataType = rewriteNested(field.dataType))
+ }
+
+private def rewriteNested(dt: DataType): DataType = dt match {
+ case s: StructType => StructType(s.fields.map(rewriteField))
+ case a: ArrayType => a.copy(elementType = rewriteNested(a.elementType))
+ case m: MapType => m.copy(
+ keyType = rewriteNested(m.keyType),
+ valueType = rewriteNested(m.valueType))
+ case other => other
+}
+
private def forceFieldNullable(field: StructField): StructField =
field.copy(nullable = true, dataType = forceTypeNullable(field.dataType))
📝 Committable suggestion
| } | |
| private def widenBlobSubtreeNullability(schema: StructType): StructType = { | |
| StructType(schema.fields.map(rewriteField)) | |
| } | |
| private def rewriteField(field: StructField): StructField = | |
| if (BlobLanceSchemaSupport.isBlobField(field)) { | |
| field.copy(nullable = true, dataType = forceTypeNullable(field.dataType)) | |
| } else { | |
| field.copy(dataType = rewriteNested(field.dataType)) | |
| } | |
| private def rewriteNested(dt: DataType): DataType = dt match { | |
| case s: StructType => StructType(s.fields.map(rewriteField)) | |
| case a: ArrayType => a.copy(elementType = rewriteNested(a.elementType)) | |
| case m: MapType => m.copy( | |
| keyType = rewriteNested(m.keyType), | |
| valueType = rewriteNested(m.valueType)) | |
| case other => other | |
| } | |
| private def forceFieldNullable(field: StructField): StructField = | |
| field.copy(nullable = true, dataType = forceTypeNullable(field.dataType)) | |
| private def forceTypeNullable(dt: DataType): DataType = dt match { | |
| case s: StructType => StructType(s.fields.map(forceFieldNullable)) | |
| case a: ArrayType => a.copy(elementType = forceTypeNullable(a.elementType), containsNull = true) | |
| case m: MapType => m.copy( | |
| keyType = forceTypeNullable(m.keyType), | |
| valueType = forceTypeNullable(m.valueType), | |
| valueContainsNull = true) | |
| case other => other | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala`
around lines 223 - 244, widenBlobSubtreeNullability only marks top-level fields
that are directly blob fields, so nested BLOBs inside structs/arrays/maps are
missed; add a helper (e.g., containsBlob(dt: DataType): Boolean) that
recursively inspects DataType (StructType/ArrayType/MapType) and returns true if
any nested StructField satisfies BlobLanceSchemaSupport.isBlobField, then change
widenBlobSubtreeNullability to mark a field nullable (using
forceFieldNullable/forceTypeNullable) when either
BlobLanceSchemaSupport.isBlobField(field) OR containsBlob(field.dataType) is
true so non-BLOB containers that contain BLOBs are widened transitively. Ensure
containsBlob recurses into StructType.fields, ArrayType.elementType, and
MapType.key/value types.
— CodeRabbit (original) (source:comment#3139418945)
@rahil-c do we only support top-level BLOB fields as I remember?
| val lanceFile = Files.walk(Paths.get(tablePath)) | ||
| .filter(p => p.toString.endsWith(".lance")) | ||
| .findFirst() | ||
| .orElseThrow(() => new AssertionError(s"No .lance files found under $tablePath")) |
Close the Files.walk stream.
This leaves the directory stream open for the rest of the helper, which can make temp-dir cleanup flaky.
♻️ Suggested fix
- val lanceFile = Files.walk(Paths.get(tablePath))
- .filter(p => p.toString.endsWith(".lance"))
- .findFirst()
- .orElseThrow(() => new AssertionError(s"No .lance files found under $tablePath"))
+ val walk = Files.walk(Paths.get(tablePath))
+ val lanceFile = try {
+ walk
+ .filter(p => p.toString.endsWith(".lance"))
+ .findFirst()
+ .orElseThrow(() => new AssertionError(s"No .lance files found under $tablePath"))
+ } finally {
+ walk.close()
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala`
around lines 969 - 972, The Files.walk(...) stream isn't closed; capture the
stream into a local val (e.g., val stream = Files.walk(Paths.get(tablePath)))
and use a try/finally (or try-with-resources equivalent) to perform the
filter/findFirst on that stream and then call stream.close() in the finally
block so the directory stream is always closed after computing lanceFile; update
the code that currently uses the inline Files.walk(...) call (the lanceFile
binding) to use this scoped stream variable.
— CodeRabbit (original) (source:comment#3139418962)
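For the Java sources touched by this PR, the same leak is usually avoided with try-with-resources rather than an explicit finally block. The following is a minimal sketch, not code from the PR: the class name LanceFileFinder and the method findFirstWithSuffix are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Hudi or of this PR.
final class LanceFileFinder {
    private LanceFileFinder() {}

    // Returns the first regular file under root whose name ends with suffix.
    // try-with-resources closes the directory stream on every exit path,
    // including when the traversal throws.
    static Optional<Path> findFirstWithSuffix(Path root, String suffix) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(suffix))
                .findFirst();
        }
    }
}
```

A test helper could call LanceFileFinder.findFirstWithSuffix(tableRoot, ".lance") and fail the assertion itself when the Optional is empty, which keeps resource handling and test logic separate.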
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the iteration on this! The round-3 changes (loop on empty batches in LanceRecordIterator, schema annotation helper, nullability widening on read) look reasonable, and the prior rounds have already covered the main concerns. No new critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A few minor naming and simplification suggestions below.
cc @yihua
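The empty-batch behavior mentioned in this review can be illustrated with a small stand-in. This is a sketch under stated assumptions: BatchSkippingIterator is a hypothetical class that models batches as in-memory lists, and it is not Hudi's LanceRecordIterator. It only shows why hasNext() needs a loop rather than a single conditional.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a batch-backed record iterator (illustration only).
final class BatchSkippingIterator<T> implements Iterator<T> {
    private final Iterator<List<T>> batches;
    private Iterator<T> current = Collections.emptyIterator();

    BatchSkippingIterator(Iterator<List<T>> batches) {
        this.batches = batches;
    }

    @Override
    public boolean hasNext() {
        // A loop, not a single `if`: a zero-row batch must not end iteration,
        // because later batches may still contain rows.
        while (!current.hasNext()) {
            if (!batches.hasNext()) {
                return false;
            }
            current = batches.next().iterator();
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

With batches [1, 2], [], [], [3], [], a single-conditional version would stop after the first empty batch and lose 3, while the loop above yields all three rows.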
| super(file, DEFAULT_BATCH_SIZE, bloomFilterOpt.map(HoodieBloomFilterRowWriteSupport::new)); | ||
| this.sparkSchema = enrichSparkSchemaForLanceVectors(sparkSchema); | ||
| this.arrowSchema = LanceArrowUtils.toArrowSchema(this.sparkSchema, DEFAULT_TIMEZONE, true); | ||
| Schema baseArrow = LanceArrowUtils.toArrowSchema(this.sparkSchema, DEFAULT_TIMEZONE, true); |
nit: fold this line into BlobLanceSchemaSupport.annotateBlobFieldsForLance?
Should LanceArrowUtils.toArrowSchema in the Lance Java API use LargeBinary when lance-encoding:blob=true is set?
After looking at LanceArrowUtils.toArrowSchema, I see it already has this logic, so enriching the Spark schema with ENCODING_BLOB should enable the LargeBinary conversion automatically?
if (metadata != null) {
  if (metadata.contains(ENCODING_BLOB)
    && metadata.getString(ENCODING_BLOB).equalsIgnoreCase("true")) {
    large = true
  }
  if (metadata.contains(ARROW_LARGE_VAR_CHAR_KEY)
    && metadata.getString(ARROW_LARGE_VAR_CHAR_KEY).equalsIgnoreCase("true")) {
    large = true
  }
  implicit val formats: Formats = DefaultFormats
  meta = metadata.jsonValue.extract[Map[String, Object]].map { case (k, v) =>
    (k, String.valueOf(v))
  }
}
| Schema baseArrow = LanceArrowUtils.toArrowSchema(this.sparkSchema, DEFAULT_TIMEZONE, true); | ||
| // annotate Hudi BLOB fields so the nested `data` bytes column uses Lance's blob writer (the | ||
| // metadata key `lance-encoding:blob=true` on a LargeBinary column). | ||
| this.arrowSchema = BlobLanceSchemaSupport.annotateBlobFieldsForLance(sparkSchema, baseArrow); |
Should this be sparkSchema or this.sparkSchema? They are different.
Let me make the variable names clearer.
| Schema baseArrow = LanceArrowUtils.toArrowSchema(this.sparkSchema, DEFAULT_TIMEZONE, true); | ||
| // annotate Hudi BLOB fields so the nested `data` bytes column uses Lance's blob writer (the | ||
| // metadata key `lance-encoding:blob=true` on a LargeBinary column). | ||
| this.arrowSchema = BlobLanceSchemaSupport.annotateBlobFieldsForLance(sparkSchema, baseArrow); |
Can annotateBlobFieldsForLance happen along with enrichSparkSchemaForLanceVectors? enrichSparkSchemaForLanceVectors already traverses the fields once for vectors, so it could handle BLOB fields in the same pass and avoid a second loop.
| /** | ||
| * Recursively rebuild an Arrow field tree so every field is marked nullable. | ||
| * Lance validates child non-nullability even when the parent struct value is | ||
| * null; for BLOB structs, INLINE rows have a null `reference` and OUT_OF_LINE | ||
| * rows have a null `data`, so all BLOB descendants must tolerate nulls. | ||
| */ | ||
| private def forceNullableRecursively(arrowField: Field): Field = { | ||
| val oldType = arrowField.getFieldType | ||
| val newType = new FieldType(true, oldType.getType, oldType.getDictionary, oldType.getMetadata) | ||
| val children = arrowField.getChildren.asScala.toSeq.map(forceNullableRecursively) | ||
| new Field(arrowField.getName, newType, children.toSeq.asJava) | ||
| } |
This looks like overly strict validation in Lance, arguably a bug: Lance checks child-field nullability even when the parent struct value is null. If the parent is null, the children are irrelevant. A null reference struct should be valid regardless of whether external_path is declared nullable.
The forceNullableRecursively workaround is papering over this Lance behavior by lying about the schema, declaring external_path and managed as nullable when they conceptually aren't. This works but loosens the schema contract: Lance would now accept a non-null reference struct with a null external_path, which is invalid in Hudi's BLOB model.
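The effect of the workaround can be seen on a toy model. The sketch below assumes a simplified field tree, SchemaNode, which is hypothetical and stands in for Arrow's Field and FieldType; only the field names reference, external_path, and managed come from this thread.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified field model (illustration only; real code uses Arrow Field/FieldType).
final class SchemaNode {
    final String name;
    final boolean nullable;
    final List<SchemaNode> children;

    SchemaNode(String name, boolean nullable, List<SchemaNode> children) {
        this.name = name;
        this.nullable = nullable;
        this.children = children;
    }

    // Rebuilds the subtree with every node marked nullable; the input tree is not modified.
    static SchemaNode forceNullable(SchemaNode node) {
        List<SchemaNode> widened = node.children.stream()
            .map(SchemaNode::forceNullable)
            .collect(Collectors.toList());
        return new SchemaNode(node.name, true, widened);
    }

    // True only if this node and all of its descendants are nullable.
    static boolean allNullable(SchemaNode node) {
        return node.nullable && node.children.stream().allMatch(SchemaNode::allNullable);
    }
}
```

After widening, the schema can no longer express that external_path is required whenever reference is non-null, which is the loosened contract described above.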
| private def forceNullableRecursively(arrowField: Field): Field = { | ||
| val oldType = arrowField.getFieldType | ||
| val newType = new FieldType(true, oldType.getType, oldType.getDictionary, oldType.getMetadata) | ||
| val children = arrowField.getChildren.asScala.toSeq.map(forceNullableRecursively) | ||
| new Field(arrowField.getName, newType, children.toSeq.asJava) | ||
| } |
Instead of tweaking the Arrow schema, could we change the Spark schema to make the relevant fields nullable? That would be much easier, and then we can get rid of this class.
| * on if the Hudi schema declares those leaves non-nullable. Non-blob fields keep their | ||
| * original nullability so their contracts aren't silently loosened. | ||
| */ | ||
| private def widenBlobSubtreeNullability(schema: StructType): StructType = { |
Could this be reused on the writer side as well, so that only the Spark StructType schema is modified, for consistency?
| */ | ||
| @ParameterizedTest | ||
| @EnumSource(value = classOf[HoodieTableType]) | ||
| def testBlobInlineRoundTrip(tableType: HoodieTableType): Unit = { |
nit: I think we should add a single BLOB type test class so the testing logic is consistent across Parquet and Lance and easier to maintain. We can do the refactoring later; @rahil-c to add a tracking issue.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18575     +/-   ##
============================================
+ Coverage     68.87%   68.89%   +0.01%
- Complexity    28512    28561     +49
============================================
  Files          2478     2480      +2
  Lines        136801   136995    +194
  Branches      16659    16697     +38
============================================
+ Hits          94225    94382    +157
- Misses        34990    35009     +19
- Partials       7586     7604     +18
| newFields[i] = enrichVectorField(field, vec); | ||
| } else if (isBlobField(field)) { | ||
| newFields[i] = enrichBlobField(field); |
nit: we could simplify this by extracting the type from the schema metadata once and then checking it, instead of parsing the field multiple times.
| * Arrow-side in {@link #annotateBlobDataChildren}, since lance-spark's | ||
| * {@code toArrowSchema} doesn't propagate Spark field metadata to struct children.</li> |
Looks like the issue should be fixed upstream in LanceArrowUtils#toArrowField where the struct field's meta is not passed down:
case StructType(fields) =>
  val fieldType = new FieldType(nullable, ArrowType.Struct.INSTANCE, null, meta.asJava)
  new Field(
    name,
    fieldType,
    fields.map { field =>
      toArrowField(field.name, field.dataType, field.nullable, timeZoneId)
    }.toSeq.asJava)

Describe the issue this Pull Request addresses
Core problem: Hudi's BLOB model and Lance's blob encoding don't line up out of the box, and without a bridge Hudi BLOB tables can't use hoodie.base.file.format=lance.
[…] lance-encoding:blob=true on a LargeBinary column, routing bytes into a dedicated blob stream) or it isn't.
Before this PR, nothing connected the two:
- HoodieSparkLanceWriter handed Lance a plain Binary for the BLOB data child, so INLINE bytes went through Lance's default column path. The dedicated blob-stream optimization that's the whole point of picking Lance was never engaged.
- […] data (OUT_OF_LINE) or null reference (INLINE), and their nested children follow suit. Lance's child-nullability check rejected them.
- […] {position, size} instead of materializing bytes), but there was no config surface to express that intent.
Summary and Changelog
Design (simplified)
Per-row example:
| type | data | reference | |
|---|---|---|---|
| INLINE | 0xDEADBEEF… | null | data references those bytes |
| OUT_OF_LINE | null | {s3://a.jpg, 0, 2048, false} | data null; reference populated as a regular struct |
Write path
Read path
Key invariant: the iterator has no BLOB-aware branches. Rows come off Lance already in the Hudi BLOB shape, so the existing read_blob() SQL function resolves them exactly the same way it does for Parquet-backed BLOB tables.
Changelog
- BlobLanceSchemaSupport.scala (new, in org.apache.hudi.io.storage) - write-path Arrow schema rewriter. For every Hudi BLOB column in the Spark schema, rebuilds the nested data child as LargeBinary + {lance-encoding:blob=true} and recursively widens nullability within the BLOB subtree. Structural no-op for schemas without BLOB columns.
- HoodieSparkLanceWriter.java - routes the base Arrow schema through BlobLanceSchemaSupport.annotateBlobFieldsForLance(...) at writer-open time.
- HoodieSparkLanceReader.java (internal compaction/merge reader) - opens Lance in BlobReadMode.CONTENT because compaction/merge/log-replay paths need actual bytes to rewrite. Pinned to CONTENT regardless of user config; the datasource reader is where the user knob is honored.
- HoodieReaderConfig.java - adds hoodie.read.blob.inline.mode (default CONTENT, valid values {CONTENT} today). This is the forward seam for a future DESCRIPTOR mode where INLINE bytes are surfaced as {position, size} pointers for deferred reads; landing the config now makes that change a one-line enum expansion instead of a schema-plumbing refactor.
- SparkLanceReaderBase.scala (datasource read path) - reads the new config to pick the Lance read mode, widens nullability inside BLOB subtrees (see read-path diagram above), and hands the iterator a schema Lance's batch layout can populate without tripping the codegen projection.
- LanceRecordIterator.java - single unrelated fix: hasNext() now loops through empty Arrow batches instead of terminating on the first zero-row batch. The original if (loadNextBatch()) would silently drop subsequent non-empty batches after any empty one (e.g. after filter pushdown).
- TestLanceDataSource.scala - two new parameterized tests (COW + MOR):
  - testBlobInlineRoundTrip - writes INLINE rows, verifies the Lance file carries lance-encoding:blob=true on the data child, and byte-asserts the payloads round-trip through both the raw data column and read_blob(payload).
  - testBlobOutline - writes OUT_OF_LINE rows pointing at external .bin files with varying offsets/lengths; byte-compares read_blob(payload) against the expected slice of each external file.
Impact
- […] hoodie.base.file.format=lance. Existing Parquet-backed BLOB tables and non-BLOB Lance tables are unchanged.
- […] hoodie.read.blob.inline.mode - advanced, default CONTENT, single valid value today. Matches pre-PR behavior out of the box; no user action needed.
- […] lance-encoding:blob=true on the BLOB data child. No impact on Lance files without BLOB columns.
Risk Level
Medium. Integration between two relatively young subsystems (Hudi BLOB + Lance file format). Mitigated by:
- […] data and read_blob(payload); OUT_OF_LINE byte round-trip through read_blob(payload) against external .bin files with varying offsets/lengths.
- […] BlobLanceSchemaSupport.isBlobField (metadata marker on the Spark field). Schemas without BLOB columns see the pre-PR code path exactly.
- […] read_blob() SQL resolution - only the Lance↔Hudi translation seam.
Documentation Update
None beyond the in-config documentation on hoodie.read.blob.inline.mode. No public API changes. Hudi BLOB is already documented; this PR extends coverage to hoodie.base.file.format=lance.
Contributor's checklist