BLOB column should be rejected as primaryKey / recordKey #18819

@rahil-c

What happened:
CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>') and the INSERT that follows both succeed silently when the keyed column is of BLOB type. The resulting _hoodie_record_key is the JSON-stringified BLOB struct, e.g. {"type":"INLINE","data":"hello-0","reference":null}.

BLOB is raw binary bytes (images, video, embeddings, or EXTERNAL references to such payloads). It is not a valid record-key type semantically:

  • For INLINE BLOBs, the key is the entire byte payload. For real-world blobs (MB-sized images, video, or embeddings) the key grows in proportion to the payload, inflating shuffle bytes and the storage used by the metadata indexes (record index, secondary index, bloom).
  • For EXTERNAL BLOBs, the key is derived from the storage path, so record identity tracks the path rather than the content: moving or re-uploading the same blob yields a different key.
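
To make the two problems above concrete, here is a minimal Python sketch that imitates the key shape shown in the report. It is not Hudi code: the helper names are hypothetical, the INLINE layout is copied from the example key above, and the EXTERNAL layout (a plain path string in `reference`) is an assumption made only for illustration.

```python
import json


def inline_blob_key(payload: str) -> str:
    """Imitate the reported key for an INLINE BLOB (shape taken from the report)."""
    struct = {"type": "INLINE", "data": payload, "reference": None}
    # Compact separators reproduce the spacing of the key shown in the report.
    return json.dumps(struct, separators=(",", ":"))


def external_blob_key(path: str) -> str:
    """Imitate a key for an EXTERNAL BLOB.

    Assumption: the reference is modeled as a bare path string; the real
    struct layout is not shown in the report.
    """
    struct = {"type": "EXTERNAL", "data": None, "reference": path}
    return json.dumps(struct, separators=(",", ":"))


# The key length grows one-for-one with the payload length.
small = inline_blob_key("hello-0")
large = inline_blob_key("x" * 1_000_000)
print(len(small), len(large))

# The same content stored under two paths produces two different keys.
print(external_blob_key("s3://bucket/a/img.png") == external_blob_key("s3://bucket/b/img.png"))
```

Under these assumptions, a 1 MB payload produces a key of roughly 1 MB, and relocating an external blob changes the record identity even though the bytes are unchanged.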

What you expected:
Hudi should reject BLOB-typed columns as the record key, the same way other unsupported key types are rejected. This applies to both write paths:

  • Spark DDL: CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>')
  • Spark DataSource writes: .option("hoodie.datasource.write.recordkey.field", "<blob_col>")

Both should fail fast with a clear error message identifying the BLOB column and the unsupported-type reason.
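
The requested behavior could look roughly like the following Python sketch. It is only an illustration of a fail-fast check, not Hudi's implementation: `Column`, `UnsupportedRecordKeyTypeError`, `UNSUPPORTED_KEY_TYPES`, and `validate_record_key_fields` are all hypothetical names, and the schema is modeled as a simple list of name/type pairs. Nested (dot-notation) key fields are not handled here.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical set of type names that may not back a record key.
UNSUPPORTED_KEY_TYPES = {"BLOB"}


@dataclass(frozen=True)
class Column:
    name: str
    type_name: str


class UnsupportedRecordKeyTypeError(ValueError):
    """Raised when a configured record key column has an unsupported type."""


def validate_record_key_fields(schema: Iterable[Column], record_key_fields: str) -> None:
    """Reject unsupported key column types before any data is written.

    `record_key_fields` is treated as a comma-separated list of column names.
    """
    by_name = {col.name: col for col in schema}
    for field in (f.strip() for f in record_key_fields.split(",")):
        if not field:
            continue
        col = by_name.get(field)
        if col is None:
            raise ValueError(f"Record key field '{field}' not found in schema")
        if col.type_name.upper() in UNSUPPORTED_KEY_TYPES:
            raise UnsupportedRecordKeyTypeError(
                f"Column '{field}' of type {col.type_name} cannot be used as a "
                "record key: BLOB columns hold raw binary payloads or external "
                "references and are not a supported key type"
            )


schema = [Column("id", "BLOB"), Column("label", "STRING")]
validate_record_key_fields(schema, "label")  # passes silently
try:
    validate_record_key_fields(schema, "id")
except UnsupportedRecordKeyTypeError as err:
    print(err)
```

Running the same check from both the DDL path and the DataSource path would give users one consistent error that names the offending column and the reason.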

Steps to reproduce:

  1. Use the Hudi 1.2.0 Spark bundle (1.2.0-rc2, see Environment).
  2. Either:
    a. DDL path: CREATE TABLE t (id BLOB, label STRING) USING hudi TBLPROPERTIES (primaryKey = 'id')
    b. DataSource path: df.write.format("hudi").option("hoodie.datasource.write.recordkey.field", "id").save(...) with id of BLOB type.
  3. INSERT / write a row with an INLINE BLOB value.
  4. SELECT _hoodie_record_key FROM t → key is the JSON-serialized struct.

Environment:

  • Hudi version: 1.2.0-rc2
  • Query engine: Spark 3.5

Labels: type:bug (Bug reports and fixes)
