read_blob() in filter predicate fails with INTERNAL_ERROR instead of clean analyzer rejection #18820

@rahil-c

Description

What happened:
Using read_blob() inside a WHERE predicate fails with:

[INTERNAL_ERROR] Cannot generate code for expression: read_blob(...)

Example query:

SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;

read_blob() works correctly in the SELECT list — only filter predicates trigger the codegen failure.

What you expected:
Two things:

  1. The codegen restriction should surface as an analyzer-level rejection with a clear "read_blob() is not supported in filter predicates" message, not an INTERNAL_ERROR with a Spark codegen stack trace.
  2. Docs (AI quick start) should call out the recommended workaround: for length-based filtering, filter on the BLOB struct's .length subfield from the meta columns (e.g. WHERE image_bytes.length = 11) rather than wrapping read_blob() in length(...). Typical usage is vector search or filtering on structured columns; pulling raw bytes through codegen in a predicate is not a supported path.
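
For reference, here are the two forms side by side. This is only a sketch: it reuses the table t and BLOB column image_bytes from the example query, and it assumes the BLOB struct exposes a length subfield as described in item 2. Neither query has been verified beyond what is reported in this issue.

```sql
-- Fails today: read_blob() inside the WHERE predicate hits the codegen
-- restriction and surfaces as INTERNAL_ERROR (see "What happened").
SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;

-- Proposed workaround for length-based filtering: compare the struct's
-- length subfield directly, so no raw bytes go through the predicate.
-- (Assumes the subfield is named `length`, as stated in item 2.)
SELECT id FROM t WHERE image_bytes.length = 11;
```

The second form keeps read_blob() out of the predicate entirely, so it can still be used in the SELECT list of the same query if the bytes themselves are needed.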

Steps to reproduce:

  1. Use 1.2.0-rc2 Spark bundle.
  2. Create a table with a BLOB column image_bytes and insert rows.
  3. Run: SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;
  4. Observe INTERNAL_ERROR.

Environment:

  • Hudi version: 1.2.0-rc2
  • Query engine: Spark 3.5
  • Found during: 1.2.0-rc2

Metadata

Labels

type:bug (Bug reports and fixes)
