What happened:
Using read_blob() inside a WHERE predicate fails with:
[INTERNAL_ERROR] Cannot generate code for expression: read_blob(...)
Example query:
SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;
read_blob() works correctly in the SELECT list — only filter predicates trigger the codegen failure.
What you expected:
Two things:
- The codegen restriction should surface as an analyzer-level rejection with a clear "read_blob() is not supported in filter predicates" message, not an INTERNAL_ERROR with a Spark codegen stack trace.
- Docs (AI quick start) should call out the recommended workaround: for length-based filtering, filter on the BLOB struct's .length subfield from the meta columns (e.g. WHERE image_bytes.length = 11) rather than wrapping read_blob() in length(...). Typical usage is vector search or filtering on structured columns; pulling raw bytes through codegen in a predicate is not a supported path.
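A minimal sketch of the workaround from the second bullet, reusing the table t and column image_bytes from the example query. It assumes the BLOB struct exposes a length subfield as described above; the raw_bytes alias is only illustrative.

```sql
-- Fails on 1.2.0-rc2: read_blob() inside a filter predicate triggers the codegen error.
SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;

-- Workaround sketch: filter on the struct's length subfield (assumed available,
-- per the description above) and keep read_blob() in the SELECT list,
-- where it is reported to work.
SELECT id, read_blob(image_bytes) AS raw_bytes
FROM t
WHERE image_bytes.length = 11;
```

The second query never passes raw bytes through the predicate, so it should avoid the code path that raises the INTERNAL_ERROR.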
Steps to reproduce:
- Use 1.2.0-rc2 Spark bundle.
- Create a table with a BLOB column image_bytes and insert rows.
- Run:
SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;
- Observe INTERNAL_ERROR.
Environment:
- Hudi version: 1.2.0-rc2
- Query engine: Spark 3.5
- Found during: 1.2.0-rc2