What happened:
CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>') followed by INSERT both succeed silently when the keyed column is of BLOB type. The resulting _hoodie_record_key is the JSON-stringified BLOB struct, e.g. {"type":"INLINE","data":"hello-0","reference":null}.
BLOB is raw binary bytes (images, video, embeddings, or EXTERNAL references to such payloads). It is not a valid record-key type semantically:
- For INLINE BLOBs, the key is the entire byte payload — for real-world blobs (MB-sized images/video/embeddings) the key balloons proportionally, blowing up shuffle bytes and metadata index (record index, secondary index, bloom) storage.
- For EXTERNAL BLOBs, the key is derived from the storage path, so record identity tracks path rather than content — moving or re-uploading the same blob yields a different key.
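To make the size concern for INLINE BLOBs concrete, here is a rough sketch (not Hudi code) that mimics the JSON-stringified key shown above and measures how it grows with the payload. The `blob_record_key` helper is hypothetical, and it assumes the payload is rendered as text in the `data` field, as in the `hello-0` example.

```python
import json


def blob_record_key(data: str) -> str:
    """Hypothetical helper, not part of Hudi: mimics the JSON-stringified
    INLINE BLOB struct observed in _hoodie_record_key."""
    struct = {"type": "INLINE", "data": data, "reference": None}
    # Compact separators reproduce the whitespace-free form from the report.
    return json.dumps(struct, separators=(",", ":"))


small_key = blob_record_key("hello-0")
# Stand-in for a ~1 MB payload (assumption: payload rendered as plain text).
large_key = blob_record_key("x" * (1024 * 1024))

print(small_key)
print(len(small_key), len(large_key))
```

Under these assumptions the key length is the payload length plus a fixed JSON overhead, so every record carries a key at least as large as its blob.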
What you expected:
Hudi should reject BLOB-typed columns as the record key, the same way other unsupported key types are rejected.
- Spark DDL:
CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>')
- Spark DataSource writes:
.option("hoodie.datasource.write.recordkey.field", "<blob_col>")
Both should fail fast with a clear error message identifying the BLOB column and the unsupported-type reason.
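The requested fail-fast behavior could look roughly like the check below. This is a standalone sketch, not Hudi's actual validation code: `validate_record_key_fields`, `UnsupportedRecordKeyTypeError`, and the column-name-to-type-name schema mapping are all hypothetical names introduced for illustration.

```python
from typing import Mapping


class UnsupportedRecordKeyTypeError(ValueError):
    """Hypothetical error type used only in this sketch."""


# Assumption: types that must never be used as a record key.
UNSUPPORTED_KEY_TYPES = {"BLOB"}


def validate_record_key_fields(schema: Mapping[str, str], record_key_fields: str) -> None:
    """Raise if any configured record-key column has an unsupported type.

    `schema` maps column name to type name; `record_key_fields` is the
    comma-separated value of the record-key config.
    """
    for field in (f.strip() for f in record_key_fields.split(",")):
        if not field:
            continue
        if field not in schema:
            raise KeyError(f"Record key field '{field}' not found in schema")
        type_name = schema[field].upper()
        if type_name in UNSUPPORTED_KEY_TYPES:
            raise UnsupportedRecordKeyTypeError(
                f"Column '{field}' of type {type_name} cannot be used as a record key"
            )


# Mirrors the table from the reproduction steps: id BLOB, label STRING.
table_schema = {"id": "BLOB", "label": "STRING"}
try:
    validate_record_key_fields(table_schema, "id")
except UnsupportedRecordKeyTypeError as err:
    print(err)
```

Running the same check from both the DDL path and the DataSource write path would give the two entry points a consistent error that names the column and the reason.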
Steps to reproduce:
- Use the Hudi 1.2.0 Spark bundle.
- Either:
  a. DDL path: CREATE TABLE t (id BLOB, label STRING) USING hudi TBLPROPERTIES (primaryKey = 'id')
  b. DataSource path: df.write.format("hudi").option("hoodie.datasource.write.recordkey.field", "id").save(...) with id of BLOB type.
- INSERT / write a row with an INLINE BLOB value.
- Run SELECT _hoodie_record_key FROM t → key is the JSON-serialized struct.
Environment:
- Hudi version: 1.2.0-rc2
- Query engine: Spark 3.5