Spec: Add spec for expressions #16652

Open
rdblue wants to merge 1 commit into apache:main from rdblue:add-expression-spec

Contributor

@rdblue rdblue commented Jun 2, 2026

This adds a spec for expressions, which was proposed in the Extending Iceberg Expressions design doc.

@github-actions github-actions bot added the Specification label (Issues that may introduce spec changes.) Jun 2, 2026
@singhpk234 singhpk234 self-requested a review June 2, 2026 16:22
Contributor

@singhpk234 singhpk234 left a comment

This is awesome! Thanks @rdblue!


This section defines the functions in the `iceberg_functions` reserved catalog name.

* `if_else(condition: predicate, when_true: T, when_false: T) -> T`: returns the value of `when_true` when `condition` is true and `when_false` otherwise
Contributor

Are we calling `if_else` a function?

  • Can they be nested? For example, `if_else(condition_1, if_else(condition_2, when_true, when_false), when_true_outer)`. Would that return `when_true` when `condition_1 && condition_2` is true?

+1 on keeping the data types the same!

Contributor

Yes, I think this is how you would model `CASE WHEN` statements.
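
To make the nesting question concrete, here is a minimal Python sketch. The `if_else` helper is a hypothetical stand-in for the function in the quoted spec text, not an Iceberg API, and `size_label` is an invented example. It shows one way a multi-branch `CASE WHEN` could be expressed by nesting calls in the `when_false` position, and what nesting in the `when_true` position returns.

```python
from typing import TypeVar

T = TypeVar("T")


def if_else(condition: bool, when_true: T, when_false: T) -> T:
    """Hypothetical stand-in: returns when_true if condition holds, else when_false."""
    return when_true if condition else when_false


def size_label(n: int) -> str:
    # Models: CASE WHEN n < 10 THEN 'small' WHEN n < 100 THEN 'medium' ELSE 'large' END
    return if_else(n < 10, "small", if_else(n < 100, "medium", "large"))


def nested_in_true(c1: bool, c2: bool) -> str:
    # The inner when_true is returned only when both conditions hold.
    return if_else(c1, if_else(c2, "inner_true", "inner_false"), "outer_false")
```

Under these assumptions, `nested_in_true(True, True)` yields the inner `when_true` value, which matches the reading in the question above.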

Comment on lines +147 to +148
| `IS NULL` | any | | true iff the value is null |
| `IS NOT NULL` | any | | true iff the value is not null |
Contributor

minor: maybe we should link the supported Iceberg data types in the spec


Both server-side scan planning and the report endpoint can continue to accept filters from older clients without issues by parsing term-based expressions (see [Appendix B: JSON serialization](#appendix-b-json-serialization)).

Residuals passed from services back to clients that do not use the new syntax would cause clients to fail, but services are allowed to omit the residual so that it is calculated on the client side (intended to avoid duplicating large IN filters). For compatibility, REST services should detect client versions and produce deprecated predicates, or omit residuals from tasks.
Contributor

> REST services should detect client versions and produce deprecated predicates

Unfortunately there is no reliable way to do this, but the scenario here might be a bit narrow. For example, we know the client is older when it sends older expressions (which referenced things by name?), and the service can produce matching output. For example, in scan planning we expect the filter ... and the metrics report is one-way, where the client is just trying to persist the report to the server?

Do we want to be a bit more specific?

Contributor

Another case where we may want to version the endpoint to explicitly break which form of expression we're using.


* `if_else(condition: predicate, when_true: T, when_false: T) -> T`: returns the value of `when_true` when `condition` is true and `when_false` otherwise

### Partition transforms
Contributor

[doubt] Thoughts on extracting this to a separate functions page? We can add actions there too... I know we did discuss functions.md too; wondering if we want to do this here, or we can always do it separately.

Contributor Author

I think it makes sense to have them here. This spec isn't too long so I'd just keep them co-located so they are easy to find.


The goal of this specification is to define a simple expression structure and avoid complexity.

To remain simple, the expressions that can be represented are deliberately constrained. Value expressions are constants, field references, or function calls with value expression arguments. Predicates are comparisons of value expressions that produce true or false.
Contributor

@danielcweeks danielcweeks Jun 3, 2026

Suggested change
To remain simple, the expressions that can be represented are deliberately constrained. Value expressions are constants, field references, or function calls with value expression arguments. Predicates are comparisons of value expressions that produce true or false.
To remain simple, the expressions that can be represented are deliberately constrained.
- Value expressions: constants, field references, or function calls with value expression arguments.
- Predicates: comparisons of value expressions that produce true or false.

I think bulleting here would help emphasize the two types/categories

Contributor

(Looks like you did this just below, so maybe we can simplify this to just the value expressions and predicates and let the definition below stand in for the more complete version.)


To remain simple, the expressions that can be represented are deliberately constrained. Value expressions are constants, field references, or function calls with value expression arguments. Predicates are comparisons of value expressions that produce true or false.

This approach is intended to keep focus on the logical structure of expressions. Complexity is pushed to the functions that are called, which can be a limited set of well-defined and portable functions (like Iceberg partition transforms) or could be user-defined functions that can use the full range of SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs that are specific to an engine, rather than importing and duplicating dialects in Iceberg expressions.
Contributor

Suggested change
This approach is intended to keep focus on the logical structure of expressions. Complexity is pushed to the functions that are called, which can be a limited set of well-defined and portable functions (like Iceberg partition transforms) or could be user-defined functions that can use the full range of SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs that are specific to an engine, rather than importing and duplicating dialects in Iceberg expressions.
This approach is intended to keep focus on the logical structure of expressions. Complexity is pushed to the functions that are called, which are a limited set of well-defined and portable functions (like Iceberg partition transforms) or user-defined functions that can use the full range of SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs that are specific to an engine, rather than importing and duplicating dialects in Iceberg expressions.

Suggestion: make the language a little less uncertain

If we have a UDF reference this might be a good place to include it.


### Value expressions

A value expression is an expression that produces a typed value
Contributor

Suggested change
A value expression is an expression that produces a typed value
A value expression is an expression that produces a typed value.


Field references may be named references (unbound) or ID references (bound). ID references identify a field by field ID from a schema. Named references identify a field by name that must be resolved to an ID (bound to a schema) to access the field.

ID references are used for stored expressions, where the identity of the column is determined when the stored expression is created. For example, column constraints are tied to field ID so that renaming a column does not drop its stored constraint.
Contributor

Suggested change
ID references are used for stored expressions, where the identity of the column is determined when the stored expression is created. For example, column constraints are tied to field ID so that renaming a column does not drop its stored constraint.
ID references are used for stored expressions, where the identity of the column is determined when the stored expression is created. For example, column constraints are tied to field ID so that renaming a column does not incorrectly reference its stored constraint.

Not sure this is the right wording, but 'drop' feels confusing in this context.

Contributor Author

I think it should probably be "invalidate the reference in"


The type produced by a value expression may change. For example, an ID reference may produce a widened type after the underlying column's type is promoted.

Function calls may produce different types when function definitions change, and type changes may change the definition that is resolved for a function name. For example, `identity(int) -> int` will change to `identity(long) -> long` when an input field is promoted from `int` to `long`.
Contributor

@danielcweeks danielcweeks Jun 3, 2026

I'm a little confused by the wording here. Are we saying that functions will automatically widen? (I don't think that's the intent, but it feels like that's what this is saying.)

Isn't it more that we're parameter matching when binding the function based on input types and the corresponding return type can change?

Contributor

The next line seems to indicate that I got the intent right, but the wording feels a little confusing in terms of how we're describing it.
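
One way to read the `identity` example is as parameter matching at bind time: the function name stays the same, and the definition that is selected depends on the bound argument types. The Python sketch below illustrates that reading only. The `OVERLOADS` registry, the `resolve` helper, and the schema dictionaries are hypothetical and are not part of the proposed spec.

```python
# Hypothetical registry: (function name, argument types) -> return type.
OVERLOADS = {
    ("identity", ("int",)): "int",
    ("identity", ("long",)): "long",
}


def resolve(name: str, arg_types: tuple) -> str:
    """Return the result type of the definition matching the bound argument types."""
    try:
        return OVERLOADS[(name, arg_types)]
    except KeyError:
        raise TypeError(f"no definition of {name} for argument types {arg_types}") from None


schema_v1 = {"id": "int"}
schema_v2 = {"id": "long"}  # after promoting the column from int to long

# The same call, identity(id), resolves to a different definition per schema.
type_v1 = resolve("identity", (schema_v1["id"],))
type_v2 = resolve("identity", (schema_v2["id"],))
```

In this sketch no function "widens" on its own; the result type changes because a different definition is matched after the input field is promoted.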


Function calls may produce different types when function definitions change, and type changes may change the definition that is resolved for a function name. For example, `identity(int) -> int` will change to `identity(long) -> long` when an input field is promoted from `int` to `long`.

A value expression's type is determined when it is bound to a specific input schema.
Contributor

Suggested change
A value expression's type is determined when it is bound to a specific input schema.
A value expression's return type is determined when it is bound to a specific input schema.

Contributor Author

Note to myself: I also want to note that there are some cases where you want to track an expected return type. For example, expressions that produce stats need an expected type to produce a content_stats schema.

| `>` | Greater than |
| `>=` | Greater than or equal |

Primitive types are compared using natural order, except for the following types:
Contributor

Do you have a reference for "natural order"? This is somewhat confusing outside the context of Java, since some definitions of natural comparison would give: `"i2" > "i10" -> false`.

Contributor Author

Yeah, I was wondering how to put this. Looks like Parquet uses "signed comparison" (thrift), I'll update to use that instead.
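
The ambiguity raised above can be reproduced in a few lines of Python. This sketch contrasts code-point (lexicographic) string comparison with a "natural sort" comparison; `natural_key` is a hypothetical helper written for this illustration, and neither ordering is claimed here to be what the spec will adopt.

```python
import re


def natural_key(s: str):
    """Split into text and digit runs so digit runs compare numerically ("natural sort")."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", s)]


# Code-point order compares "2" with "1" at the second character.
lexicographic = "i2" > "i10"

# Natural sort compares the numeric runs 2 and 10.
natural = natural_key("i2") > natural_key("i10")
```

The two results differ, which is why the phrase "natural order" needs a precise definition or reference.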

A constant or literal is the simplest type of value expression that represents a specific typed value.


#### Field reference
Contributor

In looking around, other systems have two other reference types that I want to call out (I don't necessarily think we want to include them, but we should consider them):

  1. Positional References (for row-like references)
  2. Subscripted References (indexing into arrays)


A value expression's type is determined when it is bound to a specific input schema.

If types are incompatible at runtime, implementations binding or evaluating expressions may apply type promotion to align types for predicates and to resolve functions. Implementations may choose when to promote values to accommodate engines that differ in casting behavior. However, implementations must fail rather than insert "unsafe" casts.
Contributor

Is this always implicitly handled by the engine? Do we consider `CAST(...)` something that would fall under a `sql_functions.cast`, or should we leave it out entirely?

Contributor Author

I think that sql_functions.cast can be used for explicit casts, but I want to avoid adding a cast definition in this spec for the expression structure.
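
As a rough illustration of "promote or fail" without an explicit cast, the Python sketch below aligns two operand types using a small promotion table and raises an error when no safe promotion exists. The table contents and the `align` helper are assumptions made for this example; the promotions listed (`int` to `long`, `float` to `double`) are taken from the Iceberg table spec's schema evolution rules and are not the full rule set for expressions.

```python
# Illustrative promotion table (an assumption for this sketch, not the spec's full rule set).
SAFE_PROMOTIONS = {
    ("int", "long"),
    ("float", "double"),
}


def align(left: str, right: str) -> str:
    """Return a common type for two operands, or fail rather than cast unsafely."""
    if left == right:
        return left
    if (left, right) in SAFE_PROMOTIONS:
        return right
    if (right, left) in SAFE_PROMOTIONS:
        return left
    raise TypeError(f"cannot align {left} and {right} without an unsafe cast")
```

For example, `align("int", "long")` returns `"long"`, while `align("string", "int")` raises `TypeError`. An explicit conversion would have to come from a function call instead.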


If value expression types in a predicate are incompatible, implementations should align types using type promotion. For instance, `int_col > 5.0` should promote int values to float. If the types cannot be aligned according to type promotion rules, the predicate must evaluate to false. For instance, `"goats" > -Infinity` should always be `false`.

Value expressions are not valid predicates, even when the expression is expected to return a boolean value. Value expressions must be compared or tested to produce a predicate. For example, `is_empty("")` is not a valid predicate, but `is_empty("") = true` is a valid predicate.
Contributor

I feel like we want to say the distinction is that predicates are two-value boolean logic and a value expression that returns boolean is three-value boolean logic, which is why you can't use value expressions for predicates. Right?

Contributor Author

Mainly, yes. We also don't know that the function will always produce a boolean since types are determined after functions are resolved, so we want to have the type alignment rules to fall back on.
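
The two-valued versus three-valued distinction can be sketched in Python as follows. `is_empty` is modeled as a hypothetical value-expression function that may return `None`, and `equals_true` models the predicate `<value> = true`. The handling of null and of non-boolean values here (both make the predicate false) is an assumption for this sketch, based on the fallback to type alignment rules described above.

```python
from typing import Any, Optional


def is_empty(s: Optional[str]) -> Optional[bool]:
    """Hypothetical value-expression function; three-valued because it may return None."""
    return None if s is None else len(s) == 0


def equals_true(value: Any) -> bool:
    """Evaluate the predicate `value = true` with two-valued logic.

    Assumption for this sketch: a null (None) or non-boolean value makes the
    predicate false instead of producing null or raising an error.
    """
    return value is True
```

With these definitions, `equals_true(is_empty(""))` is `True`, while `equals_true(is_empty(None))` and `equals_true("goats")` are both `False`, so the predicate never yields a third value.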

* `null <= null` is `true`
* `34 < null` is `false`

Comparisons must handle null values when value expressions evaluate to null. However, value expressions used to define a predicate should not directly contain null constants and may reject them. For example, `x = get_item(map, "key")` is valid although `get_item` may return a null value, but `x = null` should be rejected because `x IS NULL` is the recommended unambiguous predicate.
Contributor

> However, value expressions used to define a predicate

The context here is a little confusing, since at line 115 we say: Value expressions are not valid predicates.

Contributor Author

This is referring to the value expressions on the left or right side of a predicate. I'll update it.
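
A minimal sketch of the optional check described in the quoted text, where a null constant directly on the left or right side of a comparison is rejected but a function call that may evaluate to null is accepted. The `Constant` and `Call` classes and the `check_comparison` helper are hypothetical structures invented for this example; the quoted text says implementations may reject null constants, so this check is one possible choice, not a requirement.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Constant:
    value: Any


@dataclass(frozen=True)
class Call:
    name: str
    args: tuple


def check_comparison(op: str, left: Any, right: Any) -> None:
    """Reject comparisons whose left or right side is directly a null constant."""
    for side in (left, right):
        if isinstance(side, Constant) and side.value is None:
            raise ValueError(f"null constant in '{op}' comparison; use IS NULL or IS NOT NULL")


# x = get_item(map, "key") passes: the call may return null, but it is not a null constant.
check_comparison("=", "x", Call("get_item", ("map", "key")))
```

Calling `check_comparison("=", "x", Constant(None))` raises `ValueError`, mirroring the recommendation to write `x IS NULL` instead of `x = null`.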


Note that `year`, `month`, and `hour` transforms produce ordinal values and not human-readable values. For example, `year(2018-05-13)` produces `48`, not `2018`.

Parameterized functions are called as 2-argument functions. The first argument is an `int` parameter (`N` or `W` from the table spec) and the second argument is the value to transform. For example, `bucket(256, id)` calls `bucket[256]`.
Contributor

@danielcweeks danielcweeks Jun 3, 2026

The way we reference 'Parameterized functions' is a little awkward, since a parameterized function is a function with one or more parameters. So it feels strange to read that these are 2-arg functions.

Wording just makes it feel wrong.
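
For the ordinal values mentioned in the quoted spec text (`year(2018-05-13)` produces `48`, not `2018`), the Python sketch below computes year, month, and hour ordinals as offsets from the 1970 epoch, following the transform definitions in the Iceberg table spec. The function names are hypothetical, and the hour computation assumes a timezone-aware UTC timestamp.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_ordinal(d: date) -> int:
    """Years from 1970."""
    return d.year - 1970


def month_ordinal(d: date) -> int:
    """Months from 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def hour_ordinal(ts: datetime) -> int:
    """Whole hours from 1970-01-01T00:00:00 UTC; floor division handles pre-epoch values."""
    return (ts - EPOCH) // timedelta(hours=1)
```

Here `year_ordinal(date(2018, 5, 13))` evaluates to `48`, matching the example in the quoted text.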


| Parameterized function name | Description | Source types | Result type |
|-----------------------------|-----------------------------------------------|----------------------------------------------------------------------------------------------|-------------|
| `bucket(N, value)` | Hash of value, mod `N` (see table spec) | Any primitive except for `geometry`, `geography`, `variant`, `boolean`, `float`, or `double` | `int` |
Contributor

Do cross doc references work with how we publish? Something like [Link Text](relative/path/to/file.md#heading-anchor) is supposed to work in markdown. It would be nice to have pointers.


Iceberg expressions are serialized as JSON objects in table, view, and UDF metadata, and in the REST protocol for catalogs.

### Value expressions
Contributor

@danielcweeks danielcweeks Jun 3, 2026

It would be nice to have some actual JSON examples.

Any specifics on how we handle types in JSON? For example, JSON has no concept of an int vs. a long value. The specific type is erased, and JSON has some weird behavior around the 53-bit boundary for numeric values. I believe they even recommend storing large values as strings to avoid these issues.
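
The 53-bit concern can be shown with a short Python sketch. It simulates a parser that maps JSON numbers to IEEE 754 doubles (as JavaScript does) and compares that with carrying the value as a JSON string. The `"value"` field name is hypothetical and says nothing about the JSON layout the spec will define; Python's own `json` module keeps integers exact, so the lossy step is simulated with `float`.

```python
import json

big = 2**53 + 1  # 9007199254740993, not exactly representable as an IEEE 754 double

# A parser that maps JSON numbers to doubles would lose the last bit.
as_double = float(big)
lossy = int(as_double)

# Encoding the value as a JSON string sidesteps the numeric precision limit.
encoded = json.dumps({"value": str(big)})
decoded = int(json.loads(encoded)["value"])
```

After this runs, `lossy` differs from `big` by one, while `decoded` round-trips exactly.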
