Spec: Add spec for expressions #16652
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,284 @@ | ||||||||||
| --- | ||||||||||
| title: "Expressions Spec" | ||||||||||
| --- | ||||||||||
| <!-- | ||||||||||
| - Licensed to the Apache Software Foundation (ASF) under one or more | ||||||||||
| - contributor license agreements. See the NOTICE file distributed with | ||||||||||
| - this work for additional information regarding copyright ownership. | ||||||||||
| - The ASF licenses this file to You under the Apache License, Version 2.0 | ||||||||||
| - (the "License"); you may not use this file except in compliance with | ||||||||||
| - the License. You may obtain a copy of the License at | ||||||||||
| - | ||||||||||
| - http://www.apache.org/licenses/LICENSE-2.0 | ||||||||||
| - | ||||||||||
| - Unless required by applicable law or agreed to in writing, software | ||||||||||
| - distributed under the License is distributed on an "AS IS" BASIS, | ||||||||||
| - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||||||
| - See the License for the specific language governing permissions and | ||||||||||
| - limitations under the License. | ||||||||||
| --> | ||||||||||
|
|
||||||||||
| # Iceberg Expressions | ||||||||||
|
|
||||||||||
| This document defines the structure and behavior of expressions for use in Iceberg specifications. The purpose is to define a common structure that enables simple expressions to be stored and exchanged. | ||||||||||
|
|
||||||||||
| Stored expressions are needed for use cases like data validations (`CHECK` constraints) and default values (for instance, `current_timestamp()`). Expressions are exchanged in use cases like server-side scan planning in the catalog protocol. | ||||||||||
|
|
||||||||||
|
|
||||||||||
| ## Overview | ||||||||||
|
|
||||||||||
| The goal of this specification is to define a simple expression structure and avoid complexity. | ||||||||||
|
|
||||||||||
| To remain simple, the expressions that can be represented are deliberately constrained. Value expressions are constants, field references, or function calls with value expression arguments. Predicates are comparisons of value expressions that produce true or false. | ||||||||||
|
|
||||||||||
| This approach is intended to keep focus on the logical structure of expressions. Complexity is pushed to the functions that are called, which can be a limited set of well-defined and portable functions (like Iceberg partition transforms) or could be user-defined functions that can use the full range of SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs that are specific to an engine, rather than importing and duplicating dialects in Iceberg expressions. | ||||||||||
|
Contributor
Suggestion: make the language a little less uncertain. If we have a UDF reference, this might be a good place to include it. |
||||||||||
|
|
||||||||||
| This is consistent with Iceberg's conservative approach in other specs. Expressions and predicates are an important part of Iceberg implementation APIs, but have been deliberately limited in specifications. For example, sort orders and partition fields are strictly limited to a small set of transforms over well-defined inputs (source field IDs). This spec is widening what can be expressed, but depends on function calls for complex tasks. | ||||||||||
|
Member
More of a "why this is written in the way it is" than a "what this spec is about". Just wondering if we need this paragraph in the text. |
||||||||||
|
|
||||||||||
| This specification covers the structure of Iceberg expressions and includes appendices that specify serialization as JSON and a set of portable functions defined by Iceberg specifications. | ||||||||||
|
|
||||||||||
|
|
||||||||||
| ## Structure | ||||||||||
|
|
||||||||||
| Iceberg expressions have two types: | ||||||||||
|
|
||||||||||
| * **Value expressions** represent data values and transformations of values (function calls) that produce any Iceberg type | ||||||||||
| * **Predicates** represent comparisons of value expressions and boolean logic that produce `true` or `false` | ||||||||||
|
|
||||||||||
|
|
||||||||||
| ### Value expressions | ||||||||||
|
|
||||||||||
| A value expression is an expression that produces a typed value. | ||||||||||
|
|
||||||||||
|
|
||||||||||
| Value expressions can be one of three types: a constant value, a field reference, or a function applied to zero or more value expressions. | ||||||||||
|
|
||||||||||
|
|
||||||||||
| #### Constant values | ||||||||||
|
|
||||||||||
| A constant or literal is the simplest type of value expression that represents a specific typed value. | ||||||||||
|
|
||||||||||
|
|
||||||||||
| #### Field reference | ||||||||||
|
Contributor
In looking around, other systems have two other reference types that I want to call out (I don't necessarily think we want to include them, but we should consider them):
|
||||||||||
|
|
||||||||||
| A field reference represents the value of a specific field in a row. When an expression is evaluated on a row, it returns the value of the field. | ||||||||||
|
|
||||||||||
| Field references may be named references (unbound) or ID references (bound). ID references identify a field by field ID from a schema. Named references identify a field by name that must be resolved to an ID (bound to a schema) to access the field. | ||||||||||
|
|
||||||||||
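The difference between named and ID references can be illustrated with a minimal binding sketch. It assumes a flat schema given as a list of fields; `Field` and `bind_reference` are hypothetical names for illustration and are not part of any Iceberg API. The reference objects follow the JSON forms in Appendix B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical schema field: a stable field ID plus the current name."""
    field_id: int
    name: str


def bind_reference(ref: dict, schema: list[Field]) -> dict:
    """Resolve a named (unbound) reference to an ID (bound) reference."""
    if "id" in ref:
        return ref  # already bound, nothing to resolve
    for field in schema:
        if field.name == ref["name"]:
            return {"type": "reference", "id": field.field_id}
    raise ValueError(f"cannot bind reference: no field named {ref['name']!r}")


schema_v1 = [Field(1, "id"), Field(2, "ts")]
bound = bind_reference({"type": "reference", "name": "ts"}, schema_v1)

# After a rename, the bound reference still identifies field 2,
# while the original named reference no longer resolves.
schema_v2 = [Field(1, "id"), Field(2, "event_ts")]
```

Binding once and storing the ID is what lets a stored expression survive a column rename; a named reference has to be resolved again against whatever schema is current.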
| ID references are used for stored expressions, where the identity of the column is determined when the stored expression is created. For example, column constraints are tied to field ID so that renaming a column does not drop its stored constraint. | ||||||||||
|
Contributor
Not sure this is the right wording, but 'drop' feels confusing in this context.
Contributor
Author
I think it should probably be "invalidate the reference in"
Member
or "invalidate the reference to its stored constraint"
|
||||||||||
|
|
||||||||||
| Named references are used when the identity of the column is determined when the expression is evaluated. For example, query filters are resolved each time a query runs, so server-side planning uses unbound named references. | ||||||||||
|
|
||||||||||
| The context in which an expression is used determines the type of references that are valid. Iceberg specifications should document whether ID references, named references, or both are allowed. | ||||||||||
|
|
||||||||||
|
|
||||||||||
| #### Apply function | ||||||||||
|
|
||||||||||
| An apply expression represents the result of a function applied to (or called on) zero or more values produced by child value expressions. | ||||||||||
|
|
||||||||||
| Functions are identified by catalog, namespace, and name. | ||||||||||
|
|
||||||||||
| * Function name is always required | ||||||||||
| * Namespace is optional and is assumed to be empty ([]) if it is not present or is null | ||||||||||
|
Member
What does the parenthetical for empty mean here? Are we defining [] as an empty set? |
||||||||||
| * Catalog is optional and is assumed to be the catalog in which the referencing object is stored if it is not present or is null | ||||||||||
|
|
||||||||||
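The defaulting rules in the list above can be sketched as a small normalization step. This is an illustration under assumptions, not a reference implementation: `FunctionId`, `normalize_func_id`, and `resolve_catalog` are hypothetical names, and the input shapes follow the `FUNC_ID` grammar in Appendix B.

```python
from typing import NamedTuple, Optional, Union


class FunctionId(NamedTuple):
    catalog: Optional[str]  # None means "use the referencing object's catalog"
    namespace: tuple        # empty tuple stands for the empty namespace []
    name: str


def normalize_func_id(func_id: Union[str, dict]) -> FunctionId:
    """Apply the defaults for a missing or null catalog and namespace."""
    if isinstance(func_id, str):
        # A bare string is the function name with an empty namespace.
        return FunctionId(None, (), func_id)
    name = func_id.get("name")
    if not name:
        raise ValueError("function name is always required")
    namespace = tuple(func_id.get("namespace") or ())
    return FunctionId(func_id.get("catalog"), namespace, name)


def resolve_catalog(func_id: FunctionId, referencing_catalog: str) -> str:
    """Fall back to the catalog that stores the referencing object."""
    if func_id.catalog is not None:
        return func_id.catalog
    return referencing_catalog
```

Keeping the catalog as `None` until load time, rather than filling it in eagerly, matches the recommendation to omit catalog names so that stored identifiers do not depend on one environment's configuration.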
| The catalog name is used to identify the catalog where the function definition can be loaded or it identifies a reserved function set. As in the view and UDF specs, catalog names represent connection configurations that may differ across environments. Omitting catalog names is recommended to avoid depending on consistent environments. For example, if a table has a CHECK constraint that references a UDF without a catalog name (missing or null), the UDF should be loaded from the table’s catalog. | ||||||||||
|
Member
I'm a little confused about this paragraph. We have the concept of a Catalog and Function Set. It feels like a "catalog" is a superset of the concept of "function set", but this feels like it is hard to disambiguate from a "namespace", which is also a "function set". Maybe if on line 86 it just said "reserved catalog names are" |
||||||||||
|
|
||||||||||
| Reserved function set names are: | ||||||||||
|
|
||||||||||
| * `sql_functions` is used for functions defined by the SQL standard | ||||||||||
| * `iceberg_functions` is used for functions defined in this specification | ||||||||||
|
|
||||||||||
| Engines may document and use a catalog name to identify their built-in functions that are not part of the SQL spec, like `spark_builtin_functions.to_utc_timestamp`. | ||||||||||
|
|
||||||||||
| Producers are responsible for resolving catalog, namespace, and name if the environment is relevant. For example, if a SQL engine uses its current catalog and namespace to find a function, the resolved catalog and namespace must be used to produce an unambiguous function identifier. | ||||||||||
|
Member
I'm not sure I understand this statement either. Is this just saying that an engine is allowed to resolve an identifier however it likes as long as it would be doing so unambiguously? |
||||||||||
|
|
||||||||||
|
|
||||||||||
| #### Value expression types | ||||||||||
|
|
||||||||||
| The type produced by a value expression may change. For example, an ID reference may produce a widened type after the underlying column's type is promoted. | ||||||||||
|
|
||||||||||
| Function calls may produce different types when function definitions change, and type changes may change the definition that is resolved for a function name. For example, `identity(int) -> int` will change to `identity(long) -> long` when an input field is promoted from `int` to `long`. | ||||||||||
|
Contributor
I'm a little confused on the wording here. Are we saying that functions will automatically widen? (I don't think that's the intent, but it feels like that's what this is saying.) Isn't it more that we're parameter matching when binding the function based on input types, and the corresponding return type can change?
Contributor
The next line seems to indicate that I got the intent right, but the wording feels a little confusing in terms of how we're describing it. |
||||||||||
|
|
||||||||||
| A value expression's type is determined when it is bound to a specific input schema. | ||||||||||
|
Contributor
Author
Note to myself: I also want to note that there are some cases where you want to track an expected return type. For example, expressions that produce stats need an expected type to produce a |
||||||||||
|
|
||||||||||
| If types are incompatible at runtime, implementations binding or evaluating expressions may apply type promotion to align types for predicates and to resolve functions. Implementations may choose when to promote values to accommodate engines that differ in casting behavior. However, implementations must fail rather than insert "unsafe" casts. | ||||||||||
|
Contributor
Is this always implicitly handled by the engine? Do we consider
Contributor
Author
I think that
Member
I understand the intent here but adding "unsafe" feels ambiguous here. |
||||||||||
|
|
||||||||||
|
|
||||||||||
| ### Predicates | ||||||||||
|
|
||||||||||
| A predicate is a boolean expression that produces true or false. | ||||||||||
|
|
||||||||||
| Predicates can be constants (true or false), comparisons or tests of value expressions, or logical combinations of predicates (AND, OR, NOT). | ||||||||||
|
|
||||||||||
| If value expression types in a predicate are incompatible, implementations should align types using type promotion. For instance, `int_col > 5.0` should promote int values to float. If the types cannot be aligned according to type promotion rules, the predicate must evaluate to false. For instance, `"goats" > -Infinity` should always be `false`. | ||||||||||
|
Member
I feel like we are again dancing around what a "safe" type promotion is and what is allowed. The above examples make sense to me, but for me the dangerous ones are always things like
|
||||||||||
|
|
||||||||||
| Value expressions are not valid predicates, even when the expression is expected to return a boolean value. Value expressions must be compared or tested to produce a predicate. For example, `is_empty("")` is not a valid predicate, but `is_empty("") = true` is a valid predicate. | ||||||||||
|
Contributor
I feel like we want to say the distinction is that predicates are two-value boolean logic and a value expression that returns boolean is three-value boolean logic, which is why you can't use value expressions for predicates. Right?
Contributor
Author
Mainly, yes. We also don't know that the function will always produce a boolean since types are determined after functions are resolved, so we want to have the type alignment rules to fall back on.
Member
This also threw me a bit, so I would appreciate a note in the spec. |
||||||||||
|
|
||||||||||
|
|
||||||||||
| #### Comparisons | ||||||||||
|
|
||||||||||
| Comparisons are predicates that compare two value expressions with the same primitive type. Comparisons are: | ||||||||||
|
|
||||||||||
| | Comparison | Description | | ||||||||||
| |-------------|-------------| | ||||||||||
| | `=` | Is equal | | ||||||||||
| | `!=` | Is not equal | | ||||||||||
| | `<` | Less than | | ||||||||||
| | `<=` | Less than or equal | | ||||||||||
| | `>` | Greater than | | ||||||||||
| | `>=` | Greater than or equal | | ||||||||||
|
|
||||||||||
| Primitive types are compared using natural order, except for the following types: | ||||||||||
|
Contributor
Do you have a reference for "natural order"? This is somewhat confusing outside the context of Java since some definitions of natural comparison would be:
Contributor
Author
Yeah, I was wondering how to put this. Looks like Parquet uses "signed comparison" (thrift), I'll update to use that instead. |
||||||||||
|
|
||||||||||
| * `false` is less than `true` for `boolean` | ||||||||||
| * `fixed` and `binary` use unsigned byte-wise comparison | ||||||||||
| * `string` uses unsigned byte-wise comparison of the UTF-8 representation | ||||||||||
| * `uuid` uses unsigned byte-wise comparison of the UUID bytes | ||||||||||
| * `float` and `double` use IEEE 754 total order after normalizing NaN to the canonical NaN (sign bit 0, exponent bits all 1, mantissa MSB 1 followed by all 0) | ||||||||||
|
Member
Just wondering why we don't want to follow the IEEE standard here and do -NaN < NaN? |
||||||||||
| * `NaN = NaN` is true for any two NaN values | ||||||||||
| * `val < NaN` is true for all non-NaN values | ||||||||||
|
|
||||||||||
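One way to realize the `double` ordering described above is to map each value to an integer sort key. The sketch below is an assumption-laden illustration for `double` only: `double_sort_key` is a hypothetical helper, and the bit manipulation is a common technique for ordering IEEE 754 bit patterns, not something this spec prescribes.

```python
import math
import struct

# Canonical NaN: sign bit 0, exponent bits all 1, mantissa MSB 1 followed by all 0.
CANONICAL_NAN_BITS = 0x7FF8000000000000
SIGN_BIT = 1 << 63


def double_sort_key(value: float) -> int:
    """Return an unsigned integer whose order matches the total order of doubles."""
    if math.isnan(value):
        bits = CANONICAL_NAN_BITS  # every NaN collapses to the canonical NaN
    else:
        (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & SIGN_BIT:
        # Negative values: flip all bits so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative values: set the sign bit so they sort after all negatives.
    return bits | SIGN_BIT
```

With this key, all NaN values compare equal and sort after every non-NaN value, and `-0.0` sorts before `0.0`, as in IEEE 754 total order.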
| Note that type alignment produces `decimal` values with the same scale so that comparison is equivalent to the natural order of the unscaled numeric value. | ||||||||||
|
|
||||||||||
| Tests are predicates that test a single value expression, optionally using a constant or set of constants. Constants must have the same type and must be non-null. Tests are: | ||||||||||
|
|
||||||||||
| | Test | Allowed types | Constant type | Description | | ||||||||||
| |-------------------------|---------------|---------------|-------------| | ||||||||||
| | `IS NULL` | any | | true iff the value is null | | ||||||||||
| | `IS NOT NULL` | any | | true iff the value is not null | | ||||||||||
|
Comment on lines +147 to +148
Contributor
minor: maybe we should link the supported Iceberg data types in the spec |
||||||||||
| | `IS NaN` | float, double | | true iff the value is an IEEE 754 NaN | | ||||||||||
| | `IS NOT NaN` | float, double | | true iff the value is not an IEEE 754 NaN | | ||||||||||
| | `STARTS WITH const` | string | string | true iff the constant is a prefix of the value | | ||||||||||
| | `NOT STARTS WITH const` | string | string | true iff the constant is not a prefix of the value | | ||||||||||
| | `IN (constant set)` | any | same as value | true iff the value is equal to any constant | | ||||||||||
| | `NOT IN (constant set)` | any | same as value | true iff the value is not equal to any constant | | ||||||||||
|
|
||||||||||
|
|
||||||||||
| #### Boolean logic | ||||||||||
|
|
||||||||||
| Predicates must use 2-valued boolean logic. Evaluation of all predicates must produce `true` or `false`. | ||||||||||
|
|
||||||||||
| Engines that implement SQL 3-valued boolean logic must add `IS NULL` and `IS NOT NULL` to produce the 2-valued equivalent. This avoids bugs in engines and languages that do not natively implement 3-valued logic. For example, the SQL predicate `x < 10` should be passed as `x < 10 AND x IS NOT NULL` for a SQL `WHERE` condition (or `x < 10`; see null-safe comparisons below). For a `CHECK` constraint, the expression is passed as `x < 10 OR x IS NULL`. This ensures that implementations will make the correct determination, rather than depending on context to interpret a null result (`WHERE` vs `CHECK`). | ||||||||||
|
|
||||||||||
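The two rewrites in the example above can be sketched with the predicate JSON forms from Appendix B. The helper names (`for_where`, `for_check`, `is_null`, `not_null`) are hypothetical, and the sketch assumes the producer already knows which operand may be null.

```python
def is_null(expr: dict) -> dict:
    return {"type": "is-null", "child": expr}


def not_null(expr: dict) -> dict:
    return {"type": "not-null", "child": expr}


def for_where(comparison: dict, nullable_expr: dict) -> dict:
    """SQL WHERE keeps a row only when the condition is true, so null must fail."""
    return {"type": "and", "left": comparison, "right": not_null(nullable_expr)}


def for_check(comparison: dict, nullable_expr: dict) -> dict:
    """SQL CHECK rejects a row only when the condition is false, so null must pass."""
    return {"type": "or", "left": comparison, "right": is_null(nullable_expr)}


x = {"type": "reference", "name": "x"}
x_lt_10 = {"type": "lt", "left": x, "right": {"type": "literal", "value": 10}}

where_predicate = for_where(x_lt_10, x)  # x < 10 AND x IS NOT NULL
check_predicate = for_check(x_lt_10, x)  # x < 10 OR x IS NULL
```

The same comparison yields two different 2-valued predicates, which is why the producer, not the consumer, has to make the null handling explicit.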
| Logical combinations are boolean operators applied to predicates. `AND` and `OR` are binary operations and `NOT` is a unary operation. | ||||||||||
|
|
||||||||||
| Comparisons must be null-safe. For example: | ||||||||||
|
|
||||||||||
| * `null = null` is `true` | ||||||||||
| * `34 = null` is `false` | ||||||||||
| * `null != null` is `false` | ||||||||||
| * `34 != null` is `true` | ||||||||||
| * `null < null` is `false` | ||||||||||
| * `null <= null` is `true` | ||||||||||
| * `34 < null` is `false` | ||||||||||
|
|
||||||||||
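The equality cases in the list above can be sketched as follows, using Python's `None` to stand in for null. The function names are hypothetical. Ordering comparisons are left out on purpose: the list only covers some combinations of null and non-null operands, so a full ordering would be a guess.

```python
def null_safe_eq(left, right) -> bool:
    """Equality that always returns True or False, never null."""
    if left is None or right is None:
        return left is None and right is None
    return left == right


def null_safe_not_eq(left, right) -> bool:
    """Negation of null-safe equality; also always True or False."""
    return not null_safe_eq(left, right)
```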
| Comparisons must handle null values when value expressions evaluate to null. However, value expressions used to define a predicate should not directly contain null constants and may reject them. For example, `x = get_item(map, "key")` is valid although `get_item` may return a null value, but `x = null` should be rejected because `x IS NULL` is the recommended unambiguous predicate. | ||||||||||
|
Contributor
Context here is a little confusing since at line 115 we say: "Value expressions are not valid predicates."
Contributor
Author
This is referring to the value expressions on the left or right side of a predicate. I'll update it. |
||||||||||
|
|
||||||||||
|
|
||||||||||
| ### Compatibility with REST catalog expressions | ||||||||||
|
|
||||||||||
| Older clients use more restrictive forms of predicates and references that use a "term" for specific transforms and named references. These expressions should be supported for backward compatibility to allow older clients to interact with newer REST catalog services. | ||||||||||
|
|
||||||||||
| Prior to this spec, deprecated expressions were passed in the REST API in 3 places: | ||||||||||
|
|
||||||||||
| * As `filter` passed to server-side scan planning | ||||||||||
| * As `filter` passed to the service in `ScanReport` | ||||||||||
| * As `residual` passed to the client with a scan task | ||||||||||
|
|
||||||||||
| Both server-side scan planning and the report endpoint can continue to accept filters from older clients without issues by parsing term-based expressions (see [Appendix B: JSON serialization](#appendix-b-json-serialization)). | ||||||||||
|
|
||||||||||
| Residuals passed from services back to clients that do not use the new syntax would cause clients to fail, but services are allowed to omit the residual so that it is calculated on the client side (intended to avoid duplicating large IN filters). For compatibility, REST services should detect client versions and produce deprecated predicates, or omit residuals from tasks. | ||||||||||
|
Contributor
Unfortunately there is no reliable way to do this, but the scenario here might be a bit narrow. For example, we know the client is older when it sends older expressions here (which referenced things by name?), and the service can produce the matching output. For scan planning we expect the filter ..., and the metrics report is one-way, where the client is just trying to persist the report to the server? Do we want to be a bit more specific?
Contributor
Another case where we may want to version the endpoint to explicitly break which form of expression we're using. |
||||||||||
|
|
||||||||||
|
|
||||||||||
| ## Appendix A: Iceberg functions | ||||||||||
|
|
||||||||||
| This section defines the functions in the `iceberg_functions` reserved catalog name. | ||||||||||
|
|
||||||||||
| * `if_else(condition: predicate, when_true: T, when_false: T) -> T`: returns the value of `when_true` when `condition` is true and `when_false` otherwise | ||||||||||
|
Contributor
are we calling
+1 on keeping the data types same !
Contributor
Yes, I think this is how you would model |
||||||||||
|
|
||||||||||
| ### Partition transforms | ||||||||||
|
Contributor
[doubt] Thoughts on extracting this to a separate functions page? We can add actions there too. I know we did discuss functions.md too; wondering if we want to do this here, or we can always do it separately.
Contributor
Author
I think it makes sense to have them here. This spec isn't too long so I'd just keep them co-located so they are easy to find. |
||||||||||
|
|
||||||||||
| Iceberg partition transforms are also defined as functions (other than `void`). | ||||||||||
|
|
||||||||||
| All partition transforms produce `null` for a `null` input value. | ||||||||||
|
|
||||||||||
| | Function name | Description | Source types | Result type | | ||||||||||
| |-------------------|--------------------------------------------------------------|----------------------------------------------------------------------|-------------| | ||||||||||
| | `identity(value)` | Source value, unmodified | Any primitive except for `geometry`, `geography`, and `variant` | Source type | | ||||||||||
| | `year(value)` | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | | ||||||||||
| | `month(value)` | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | | ||||||||||
| | `day(value)` | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `date` | | ||||||||||
| | `hour(value)` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | | ||||||||||
|
|
||||||||||
| Note that `year`, `month`, and `hour` transforms produce ordinal values and not human-readable values. For example, `year(2018-05-13)` produces `48`, not `2018`. | ||||||||||
|
|
||||||||||
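The ordinal results noted above can be checked with a short sketch for `date` inputs. The function names are hypothetical, and timestamp inputs and the `day` and `hour` transforms are omitted to keep the example small.

```python
from datetime import date
from typing import Optional


def year_ordinal(value: Optional[date]) -> Optional[int]:
    """Years from 1970; null input produces null."""
    if value is None:
        return None
    return value.year - 1970


def month_ordinal(value: Optional[date]) -> Optional[int]:
    """Months from 1970-01; null input produces null."""
    if value is None:
        return None
    return (value.year - 1970) * 12 + (value.month - 1)


print(year_ordinal(date(2018, 5, 13)))   # 48
print(month_ordinal(date(2018, 5, 13)))  # 580
```

Dates before 1970 produce negative ordinals under this arithmetic; for example, December 1969 is month `-1`.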
| Parameterized functions are called as 2-argument functions. The first argument is an `int` parameter (`N` or `W` from the table spec) and the second argument is the value to transform. For example, `bucket(256, id)` calls `bucket[256]`. | ||||||||||
|
Contributor
The way we reference 'Parameterized functions' is a little awkward since a parameterized function is a function with one or more parameters. So it feels strange to read as these are 2-arg functions? Wording just makes it feel wrong. |
||||||||||
|
|
||||||||||
| | Parameterized function name | Description | Source types | Result type | | ||||||||||
| |-----------------------------|-----------------------------------------------|----------------------------------------------------------------------------------------------|-------------| | ||||||||||
| | `bucket(N, value)` | Hash of value, mod `N` (see table spec) | Any primitive except for `geometry`, `geography`, `variant`, `boolean`, `float`, or `double` | `int` | | ||||||||||
|
Contributor
Do cross doc references work with how we publish? Something like |
||||||||||
| | `truncate(W, value)` | Value truncated to width `W` (see table spec) | `int`, `long`, `decimal`, `string`, `binary` | Source type | | ||||||||||
|
|
||||||||||
|
|
||||||||||
| ## Appendix B: JSON serialization | ||||||||||
|
|
||||||||||
| Iceberg expressions are serialized as JSON objects in table, view, and UDF metadata, and in the REST protocol for catalogs. | ||||||||||
|
|
||||||||||
| ### Value expressions | ||||||||||
|
Contributor
It would be nice to have some actual JSON examples. Any specifics on how we handle types in JSON? For example, JSON has no concept of int vs long value. The specific type is erased and JSON has some weird behavior around the 53-bit boundary for numeric values. I believe they even recommend storing as a string for large values to avoid these issues. |
||||||||||
|
|
||||||||||
| ``` | ||||||||||
| EXPR: LITERAL | REFERENCE | APPLY | ||||||||||
|
|
||||||||||
| LITERAL: VALUE | ||||||||||
| | { "type": "literal", "value": VALUE } | ||||||||||
| | { "type": "literal", "value": VALUE, "data-type": DATA_TYPE } | ||||||||||
| LITERALS: [ LITERAL* ] | ||||||||||
|
|
||||||||||
| REFERENCE: BOUND_REF | UNBOUND_REF | ||||||||||
| BOUND_REF: ID | { "type": "reference", "id": ID } | ||||||||||
| UNBOUND_REF: NAME | { "type": "reference", "name": NAME } | ||||||||||
|
|
||||||||||
| APPLY: { "type": "apply", "func-name": FUNC_ID, "arguments": [ EXPR* ] } | ||||||||||
| FUNC_ID: NAME | ||||||||||
| | { "catalog": NAME, "namespace": [ NAME* ], "name": NAME } | ||||||||||
|
|
||||||||||
| ID: integer | ||||||||||
| NAME: string | ||||||||||
|
|
||||||||||
| VALUE: non-null single value JSON from the table spec | ||||||||||
| DATA_TYPE: Iceberg type from the spec | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| If a function identifier is a string, that string is the function name, the namespace is empty ([]), and the catalog is missing/null. | ||||||||||
|
|
||||||||||
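As an illustration of the grammar above, the sketch below builds the long (object) forms of `LITERAL`, `REFERENCE`, and `APPLY` and serializes `bucket(256, id)`. The builder functions are hypothetical helpers, and the choice to spell out `"data-type": "int"` and the `iceberg_functions` catalog is an assumption made for the example, not a requirement stated by the grammar.

```python
import json


def literal(value, data_type=None) -> dict:
    """Object form of LITERAL, with an optional explicit data type."""
    node = {"type": "literal", "value": value}
    if data_type is not None:
        node["data-type"] = data_type
    return node


def reference(id_or_name) -> dict:
    """Integers produce a bound (ID) reference, strings an unbound (named) one."""
    key = "id" if isinstance(id_or_name, int) else "name"
    return {"type": "reference", key: id_or_name}


def apply(func_id, *arguments) -> dict:
    """APPLY node; func_id is either a bare name or a catalog/namespace/name object."""
    return {"type": "apply", "func-name": func_id, "arguments": list(arguments)}


bucket_expr = apply(
    {"catalog": "iceberg_functions", "namespace": [], "name": "bucket"},
    literal(256, "int"),
    reference("id"),
)
print(json.dumps(bucket_expr, indent=2))
```

The short forms in the grammar would let the same expression drop the wrapper objects, for example passing `256` directly as the literal and `"bucket"` as the function identifier.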
| ### Predicates | ||||||||||
|
|
||||||||||
| ``` | ||||||||||
| PREDICATE: true | false | ||||||||||
| | { "type": "not", "child": PREDICATE } | ||||||||||
| | { "type": BINARY_OP, "left": PREDICATE, "right": PREDICATE } | ||||||||||
| | { "type": UNARY_OP, "child": EXPR } | ||||||||||
| | { "type": CMP_OP, "left": EXPR, "right": EXPR } | ||||||||||
| | { "type": SET_OP, "child": EXPR, "values": LITERALS } | ||||||||||
| | DEPRECATED_PREDICATE | ||||||||||
|
|
||||||||||
| BINARY_OP: "and" | "or" | ||||||||||
| UNARY_OP: "is-null" | "not-null" | "is-nan" | "not-nan" | ||||||||||
| CMP_OP: "lt" | "lt-eq" | "gt" | "gt-eq" | "eq" | "not-eq" | ||||||||||
| | "starts-with" | "not-starts-with" | ||||||||||
| SET_OP: "in" | "not-in" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| ### Backward compatibility | ||||||||||
|
|
||||||||||
| ``` | ||||||||||
| DEPRECATED_PREDICATE: | ||||||||||
| | { "type": UNARY_OP, "term": TERM } | ||||||||||
| | { "type": CMP_OP, "term": TERM, "value": LITERAL } | ||||||||||
| | { "type": SET_OP, "term": TERM, "values": LITERALS } | ||||||||||
|
|
||||||||||
| DEPRECATED_REF: { "type": "reference", "term": NAME } | ||||||||||
|
|
||||||||||
| TERM: NAME | DEPRECATED_REF | TRANSFORM | ||||||||||
| TRANSFORM: { "type": "transform", "transform": NAME, "term": TERM } | ||||||||||
| ``` | ||||||||||
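A service that accepts term-based predicates could translate them into the current form before evaluation. The sketch below is one possible mapping and rests on assumptions that the grammar does not state: a `TRANSFORM` is mapped to an `APPLY` with an unqualified function name, and a transform name such as `bucket[256]` is split into the 2-argument call `bucket(256, term)`. The names `term_to_expr` and `upgrade_predicate` are hypothetical.

```python
import re

# Assumed spelling of parameterized transforms, e.g. "bucket[256]".
PARAM_TRANSFORM = re.compile(r"^(\w+)\[(\d+)\]$")


def term_to_expr(term) -> dict:
    """Convert a deprecated TERM into a value expression."""
    if isinstance(term, str):
        return {"type": "reference", "name": term}
    if term["type"] == "reference":
        return {"type": "reference", "name": term["term"]}
    if term["type"] == "transform":
        child = term_to_expr(term["term"])
        match = PARAM_TRANSFORM.match(term["transform"])
        if match:
            name = match.group(1)
            args = [{"type": "literal", "value": int(match.group(2))}, child]
        else:
            name, args = term["transform"], [child]
        return {"type": "apply", "func-name": name, "arguments": args}
    raise ValueError(f"unknown term type: {term['type']!r}")


def upgrade_predicate(pred):
    """Rewrite deprecated predicates, recursing through and/or/not."""
    if isinstance(pred, bool):
        return pred
    if pred["type"] == "not":
        return {"type": "not", "child": upgrade_predicate(pred["child"])}
    if pred["type"] in ("and", "or"):
        return {
            "type": pred["type"],
            "left": upgrade_predicate(pred["left"]),
            "right": upgrade_predicate(pred["right"]),
        }
    if "term" not in pred:
        return pred  # already in the current form
    expr = term_to_expr(pred["term"])
    if "values" in pred:
        return {"type": pred["type"], "child": expr, "values": pred["values"]}
    if "value" in pred:
        return {"type": pred["type"], "left": expr, "right": pred["value"]}
    return {"type": pred["type"], "child": expr}
```

Translating at the boundary keeps the evaluator working on a single predicate shape, at the cost of not being able to echo the deprecated form back to older clients without a reverse mapping.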
I think bulleting here would help emphasize the two types/categories
(looks like you did this just below, so maybe we can simplify this to just the value expressions and predicates and let the definition below stand-in for the more complete version).