Skip to content

Inconsistent Null Handling in JsonExtractScalar and JsonExtractIndex.  #18568

@siddharthteotia

Description

@siddharthteotia

Summary

With SET enableNullHandling = true, queries that use jsonExtractScalar(...) (no default value argument) over JSON columns where some rows have no value at the queried path fail with IllegalArgumentException("Cannot resolve JSON path on some records. Consider setting a default value."). The same shape with jsonExtractIndex(...) likewise throws (RuntimeException("Cannot resolve JSON path...")).

Both JsonExtractScalarTransformFunction and JsonExtractIndexTransformFunction have _nullHandlingEnabled plumbed through init() but the six typed SV methods (transformTo{Int,Long,Float,Double,BigDecimal,String}ValuesSV) ignore it and throw unconditionally.

Meanwhile, JsonIndexDistinctOperator (which SELECT DISTINCT jsonExtractIndex(...) routes through when a JSON index is present) does honor null handling — it folds unresolved rows into a single SQL NULL group via handleMissingDocs. This creates a real divergence: same logical query, different outcomes based on whether the DISTINCT operator route is eligible.

Details

A clicks table with a single JSON column:

row 0: {"country": "US", "clicks": 5}
row 1: {"country": "CA", "clicks": 3}
row 2: {"country": null, "clicks": null}
row 3: {}

JSON index is present on json.

Today on master — enableNullHandling = true

# Query shape Default value jsonExtractScalar(json, '\$.country', 'STRING'[, …]) jsonExtractIndex(json, '\$.country', 'STRING'[, …]) Consistent?
1 Projection
SELECT <expr> FROM clicks
none throws IllegalArgumentException("Cannot resolve JSON path on some records…")
scalar transform fn
throws RuntimeException("Cannot resolve JSON path…")
index transform fn
✓ both throw
2 DISTINCT
SELECT DISTINCT <expr> FROM clicks
none throws (same scalar transform path as row 1) returns "US", "CA", NULL
via JsonIndexDistinctOperator.handleMissingDocsaddNull()
inconsistent
3 GROUP BY + COUNT
SELECT <expr> AS c, COUNT(*) FROM clicks GROUP BY c
none throws (same scalar transform path as row 1) throws (same index transform path as row 1) ✓ both throw
4 Projection 'foobar' returns "US", "CA", "foobar", "foobar" returns "US", "CA", "foobar", "foobar"
5 DISTINCT 'foobar' returns "US", "CA", "foobar" returns "US", "CA", "foobar"
via JsonIndexDistinctOperator.handleMissingDocsaddValueToDistinctTable("foobar", …)
6 GROUP BY 'foobar' returns ("CA",1), ("US",1), ("foobar",2) returns ("CA",1), ("US",1), ("foobar",2)

After fix — enableNullHandling = true

Both JsonExtractScalarTransformFunction and JsonExtractIndexTransformFunction patched to honor _nullHandlingEnabled when no default literal is supplied.

# Query shape Default value jsonExtractScalar(json, '\$.country', 'STRING'[, …]) jsonExtractIndex(json, '\$.country', 'STRING'[, …]) Consistent?
1 Projection none returns "US", "CA", NULL, NULL
scalar transform fn — new null-emit branch
returns "US", "CA", NULL, NULL
index transform fn — new null-emit branch
2 DISTINCT none returns "US", "CA", NULL
scalar SV result (with null bitmap) deduped by generic distinct operator
returns "US", "CA", NULL
via JsonIndexDistinctOperator.handleMissingDocsaddNull() (unchanged)
3 GROUP BY + COUNT none returns ("CA",1), ("US",1), (NULL,2)
scalar SV result (with null bitmap), then standard GROUP BY
returns ("CA",1), ("US",1), (NULL,2)
index transform fn SV result (with null bitmap), then standard GROUP BY
4 Projection 'foobar' returns "US", "CA", "foobar", "foobar"
unchanged — default-value branch
returns "US", "CA", "foobar", "foobar"
unchanged — default-value branch
5 DISTINCT 'foobar' returns "US", "CA", "foobar"
unchanged
returns "US", "CA", "foobar"
unchanged
6 GROUP BY 'foobar' returns ("CA",1), ("US",1), ("foobar",2)
unchanged
returns ("CA",1), ("US",1), ("foobar",2)
unchanged

Change summary (NH = ON): rows 1–3 in both function-name columns flip from throws to a SQL NULL (or null group / null row). Rows 4–6 are unchanged. Every cell is consistent across both function names post-fix.

Today on master — enableNullHandling = false

# Query shape Default value jsonExtractScalar(...) jsonExtractIndex(...) Consistent?
1 Projection none throws IllegalArgumentException throws RuntimeException (index transform fn) ✓ both throw
2 DISTINCT none throws (scalar transform) throws RuntimeException("Illegal Json Path: [...], for some docIds in segment [...]")
via JsonIndexDistinctOperator.handleMissingDocs — NH-off, no-default branch
✓ both throw
3 GROUP BY none throws (scalar transform) throws (index transform) ✓ both throw
4 Projection 'foobar' returns "US", "CA", "foobar", "foobar" returns "US", "CA", "foobar", "foobar"
5 DISTINCT 'foobar' returns "US", "CA", "foobar" returns "US", "CA", "foobar"
6 GROUP BY 'foobar' returns ("CA",1), ("US",1), ("foobar",2) returns ("CA",1), ("US",1), ("foobar",2)

With NH off, all cells are already consistent on master — the fix preserves the legacy throw for the no-default case (gated on _nullHandlingEnabled).

Proposed fix

In both JsonExtractScalarTransformFunction and JsonExtractIndexTransformFunction:

  1. When _nullHandlingEnabled = true and no default value was supplied: emit SQL NULL (write the type's null placeholder + mark the row in getNullBitmap()) for unresolved rows, instead of throwing.
  2. When _nullHandlingEnabled = false: preserve the legacy throw.
  3. MV transforms unchanged.
  4. JsonIndexDistinctOperator already handles this correctly; no change needed.

Out of scope (separate follow-up)

  • 4-arg with SQL NULL literal as default: LiteralContext.getStringValue() returns "" for SQL NULL, causing JsonIndexDistinctOperator to insert an empty string instead of a null group. Separate, pre-existing inconsistency.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions