-
Notifications
You must be signed in to change notification settings - Fork 3k
SPEC: Add SQL UDF spec #14117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPEC: Add SQL UDF spec #14117
Changes from all commits
d118b64
33ad408
6990aea
35cda2a
8a75909
7432d52
e779099
786f82b
96a5880
071e97f
78223c1
c72e5c5
265d20c
9396821
66da5d2
2608a58
f5baeed
46036fb
eb45369
c08994a
82abbb8
ea39f5c
ed6526d
de8b95b
7ed5721
3a7c708
a2ad6f0
d455d92
87241aa
a7bd85b
9bcd703
7297804
5941544
b8c541f
95dda37
4cd2917
e34bb46
bfef3f7
bf1074f
a1d0fd4
84959e1
c09abf0
64ba2f7
3ec5092
678621e
e6e3bcb
9b02a15
eb97175
af0e694
c1c0a96
c2a3c2e
b7efe21
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,352 @@ | ||
| --- | ||
| title: "SQL UDF Spec" | ||
| --- | ||
| <!-- | ||
| - Licensed to the Apache Software Foundation (ASF) under one or more | ||
| - contributor license agreements. See the NOTICE file distributed with | ||
| - this work for additional information regarding copyright ownership. | ||
| - The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| - (the "License"); you may not use this file except in compliance with | ||
| - the License. You may obtain a copy of the License at | ||
| - | ||
| - http://www.apache.org/licenses/LICENSE-2.0 | ||
| - | ||
| - Unless required by applicable law or agreed to in writing, software | ||
| - distributed under the License is distributed on an "AS IS" BASIS, | ||
| - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| - See the License for the specific language governing permissions and | ||
| - limitations under the License. | ||
| --> | ||
|
|
||
| # Iceberg UDF Spec | ||
|
|
||
| ## Background and Motivation | ||
|
|
||
| A SQL user-defined function (UDF or UDTF) is a callable routine that accepts input parameters and executes a function body. | ||
| Depending on the function type, the result can be: | ||
|
|
||
| - **Scalar function (UDF)** – returns a single value, which may be a primitive type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`). | ||
| - **Table function (UDTF)** – returns a table with zero or more rows of columns with a uniform schema. | ||
|
|
||
| This specification introduces a standardized metadata format for UDFs in Iceberg. | ||
|
|
||
| ## Goals | ||
|
|
||
| * Define a portable metadata format for both scalar and table SQL UDFs. The metadata is self-contained and can be moved across catalogs. | ||
| * Support function evolution through versioning and rollback. | ||
| * Provide consistent semantics for representing UDFs across engines. | ||
|
|
||
| ## Overview | ||
|
|
||
| UDF metadata follows the same design principles as Iceberg table and view metadata: each function is represented by a | ||
| **self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security, | ||
| properties, and engine-specific representations. | ||
|
|
||
| * UDF metadata files are immutable. Any modification (new definition, updated representation, changed properties, etc.) | ||
| creates a new metadata file. Catalogs use an atomic swap to update the metadata file linked to a catalog identifier. | ||
| * Each metadata file includes recent definition versions, enabling rollbacks without external state. | ||
|
|
||
| ## Specification | ||
|
|
||
| ### UDF Metadata | ||
| The UDF metadata file has the following fields: | ||
stevenzwu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| | Requirement | Field name | Type | Description | | ||
| |-------------|------------------|------------------------|---------------------------------------------------------------------------| | ||
| | *required* | `function-uuid` | `string` | A UUID that identifies this UDF, generated once at creation. | | ||
| | *required* | `format-version` | `int` | UDF specification format version (must be `1`). | | ||
| | *required* | `definitions` | `list<definition>` | List of function [definition](#definition) entities. | | ||
| | *required* | `definition-log` | `list<definition-log>` | History of [versions within the function's definitions](#definition-log). | | ||
| | *optional* | `location` | `string` | The function's base location; used to create metadata file locations. | | ||
| | *optional* | `properties` | `map<string,string>` | A string-to-string map of properties. | | ||
| | *optional* | `secure` | `boolean` | Whether it is a secure function. Default: `false`. | | ||
| | *optional* | `doc` | `string` | Documentation string. | | ||
|
|
||
| Notes: | ||
| 1. When `secure` is set to `true`, engines should prevent leakage of sensitive information to end users. Each engine | ||
| may have its own security definition and mechanisms. It is the administrator's responsibility to ensure that | ||
| UDFs marked as secure are properly configured and protected in their environment. | ||
| 2. Entries in `properties` are treated as hints, not strict rules. | ||
| 3. UDF names are not stored in metadata. It's the catalog's responsibility to map UDF names to metadata file locations. | ||
|
|
||
| ### Definition | ||
|
|
||
| Each `definition` represents one function signature (e.g., `add_one(int)` vs `add_one(float)`). A definition is uniquely | ||
| identified by its signature (the ordered list of parameter types). There can be only one definition for a given signature. | ||
| All versions within a definition must accept the same signature as specified in the definition's `parameters` field and | ||
| must produce values of the declared `return-type`. | ||
|
|
||
| | Requirement | Field name | Type | Description | | ||
| |-------------|----------------------|--------------------------------|---------------------------------------------------------------------------------------------------| | ||
| | *required* | `definition-id` | `string` | An identifier derived from canonical parameter-type tuple (see [Definition ID](#definition-id)). | | ||
| | *required* | `parameters` | `list<parameter>` | Ordered list of [function parameters](#parameter). Invocation order **must** match this list. | | ||
| | *required* | `return-type` | `string` or `object` | Declared return type using [Types](#types). | | ||
| | *optional* | `return-nullable` | `boolean` | A hint to indicate whether the return value is nullable or not. Default: `true`. | | ||
| | *required* | `versions` | `list<definition-version>` | [Versioned implementations](#definition-version) of this definition. | | ||
| | *required* | `current-version-id` | `int` | Identifier of the current version for this definition. | | ||
| | *required* | `function-type` | `string` (`"udf"` or `"udtf"`) | If `"udtf"`, `return-type` must be a `struct` (see [Types](#types)) describing the output schema. | | ||
| | *optional* | `doc` | `string` | Documentation string. | | ||
|
|
||
| ### Parameter | ||
flyrain marked this conversation as resolved.
Show resolved
Hide resolved
flyrain marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| | Requirement | Field | Type | Description | | ||
| |-------------|--------|----------|--------------------------------------------| | ||
| | *required* | `type` | `string` | Parameter data type (see [Types](#types)). | | ||
| | *required* | `name` | `string` | Parameter name. | | ||
| | *optional* | `doc` | `string` | Parameter documentation. | | ||
|
|
||
| Notes: | ||
| 1. Variadic (vararg) parameters are not supported. Each definition must declare a fixed number of parameters. | ||
flyrain marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| 2. Parameterized signatures are not supported (e.g., `<E> array_agg(E)`). | ||
| 3. All types, including parameters and return types, are considered nullable. | ||
|
|
||
| #### Types | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we state that all types are considered nullable/optional?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. added it as the point 3 in the above section. |
||
| [iceberg-type-json]: https://iceberg.apache.org/spec/#appendix-c-json-serialization | ||
| Types are based on the [Iceberg Type](https://iceberg.apache.org/spec/#schemas-and-data-types). | ||
|
|
||
| Primitive and semi-structured type strings are encoded based on [Iceberg Type JSON Representation][iceberg-type-json] | ||
| (e.g., `int`, `string`, `timestamp`, `decimal(9,2)`, `variant`). Type strings must contain no spaces or quote characters. | ||
|
|
||
| Nested types (`struct`, `list`, `map`) use the [Iceberg Type JSON Representation][iceberg-type-json] with the | ||
| following fields required. Any other fields must be ignored. | ||
| * `list` requires `type` and `element`, e.g., `{ "type": "list", "element": "string" }` | ||
| * `map` requires `type`, `key`, and `value`, e.g., `{ "type": "map", "key": "string", "value": "int" }` | ||
| * `struct` requires `type` and `fields`, where each field requires `name` and `type`, | ||
| e.g., `{ "type": "struct", "fields": [ { "name": "id", "type": "int" }, { "name": "name", "type": "string" } ] }` | ||
|
|
||
| #### Definition ID | ||
| The `definition-id` is a canonical string derived from the parameter types, formatted as a comma-separated list with no | ||
| spaces. Each type uses the following string representation: | ||
|
|
||
| * Primitives and semi-structured: the type name (e.g., `int`, `variant`) | ||
| * List: `list<element-type>` (e.g., `list<int>`) | ||
| * Map: `map<key-type,value-type>` (e.g., `map<string,int>`) | ||
| * Struct: `struct<name1:type1,name2:type2,...>` with field names and types (e.g., `struct<id:int,name:string>`) | ||
|
|
||
| Examples of complete definition-id signatures: | ||
| * `int` – single int parameter | ||
| * `int,string` – two parameters: int and string | ||
| * `int,list<int>,struct<id:int,name:string>` – three parameters: an int, a list and a struct | ||
|
|
||
| ### Definition Version | ||
|
|
||
| Each definition can evolve over time by introducing new versions. | ||
| A `definition version` represents a specific implementation of that definition at a given point in time. | ||
|
|
||
| | Requirement | Field name | Type | Description | | ||
| |-------------|-------------------|----------------------------------------------------------|----------------------------------------------------------------| | ||
| | *required* | `version-id` | `int` | Monotonically increasing identifier of the definition version. | | ||
| | *required* | `representations` | `list<representation>` | [UDF implementations](#representation). | | ||
| | *optional* | `deterministic` | `boolean` (default `false`) | Whether the function is deterministic. | | ||
| | *optional* | `on-null-input` | `string` (`"return-null"` or `"call"`, default `"call"`) | Defines how the UDF behaves when any input parameter is NULL. | | ||
| | *required* | `timestamp-ms` | `long` (unix epoch millis) | Creation timestamp of this version. | | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since we are discussing this the @stevenzwu for tables, do we want to state that this is monotonically increasing or just go with it as-is?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Discussed with @stevenzwu a bit, some points to share:
With that, I think it's fine to not enforce monotonically increasing.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The version id and timestamp are update per overload. Since the monotonic timestamp at per overload scope, it is not meaningful for rollback (even if we want to). It makes sense to increment the version id at per overload scope. We won't want to increment version id at per overload scope and increment timestamp at global scope. Hence, I am favor of not adding the monotonic requirement to the timestamp value here. |
||
|
|
||
|
|
||
| #### Null Input Handling | ||
| `on-null-input` provides an optimization hint for query engines: | ||
| 1. If set to `return-null`, the function always returns `NULL` if any input argument is `NULL`. This allows engines to | ||
| apply predicate pushdown or skip function evaluation for rows with `NULL` inputs. For a function `f(x, y) = x + y`, | ||
| the engine can safely rewrite `WHERE f(a,b) > 0` as `WHERE a IS NOT NULL AND b IS NOT NULL AND f(a,b) > 0`. | ||
| 2. If set to `call`, the function may handle `NULL`s internally (e.g., `COALESCE`, `NVL`, `IFNULL`), so the engine must | ||
| execute the function even if some inputs are `NULL`. | ||
|
|
||
| ### Representation | ||
| Each representation is an object with at least one common field, `type`, that is one of the following: | ||
| * `sql`: a SQL expression that defines the function body | ||
|
|
||
| Representations further define metadata for each type. | ||
|
|
||
| #### SQL Representation | ||
|
|
||
| A definition version can have multiple SQL representations of different dialects, but only one SQL representation per dialect. | ||
| The SQL representation stores the function body as a SQL expression, with metadata such as the SQL dialect. | ||
|
|
||
| | Requirement | Field name | Type | Description | | ||
| |-------------|------------|----------|------------------------------------------------------| | ||
| | *required* | `type` | `string` | Must be `"sql"` | | ||
| | *required* | `dialect` | `string` | SQL dialect identifier (e.g., `"spark"`, `"trino"`). | | ||
| | *required* | `sql` | `string` | SQL expression text. | | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. minor : do we wanna say its client responsibility to make sure the each dialect produces same output before storing ?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Discussed offline with @singhpk234. We chose to follow the view spec and intentionally avoid specifying this further. There is a fine balance between providing enough clarity in the specification and over specifying behavior that is better left to implementations. |
||
|
|
||
| Notes: | ||
| 1. The `sql` must reference parameters using the names declared in the definition's `parameters` field. | ||
|
|
||
| ### Definition Log | ||
| | Requirement | Field name | Type | Description | | ||
| |-------------|-----------------------|-----------------------------------------------------|-------------------------------------------------------------------------| | ||
| | *required* | `timestamp-ms` | `long` (unix epoch millis) | Timestamp when the function was updated to use the definition versions. | | ||
| | *required* | `definition-versions` | `list<struct<definition-id:string,version-id:int>>` | Mapping of each definition to its selected version at this time. | | ||
|
|
||
| ## Function Call Convention and Resolution in Engines | ||
| Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should: | ||
|
|
||
| 1. Prefer exact parameter matches over safe (widening) or unsafe casts. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Minor: Point 3 says engines should require explicit casts for unsafe conversions, which might make this less clear because an explicit cast changes the input being matched. For example, when matching
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might be missing some of the subtle conflicts between points 1 and 3. Overall, I slightly prefer the current version, since the three points are largely consistent with each other, and points 2 and 3 are symmetric in structure and intent. |
||
| 2. Safely widen types as needed to avoid failing to find a matching definition. | ||
| 3. Require explicit casts for unsafe or non-obvious conversions. | ||
| 4. Use definitions with the same number of arguments as the input. | ||
| 5. Pass positional arguments in the same position as the input. | ||
| 6. Use definitions with the same set of field names as named input arguments. | ||
|
|
||
| ## Appendix A: Example – Overloaded Scalar Function | ||
|
|
||
| SQL statement: | ||
| ```sql | ||
| -- Trino SQL | ||
| CREATE FUNCTION add_one(x INT COMMENT 'Input integer') | ||
| COMMENT 'Add one to the input integer' | ||
| RETURNS INT | ||
| RETURN x + 1; | ||
|
|
||
| -- Trino SQL | ||
| CREATE FUNCTION add_one(x FLOAT COMMENT 'Input float') | ||
| COMMENT 'Add one to the input float' | ||
| RETURNS FLOAT | ||
| RETURN x + 1.0; | ||
|
|
||
| -- Spark SQL | ||
| CREATE OR REPLACE FUNCTION add_one(x FLOAT) | ||
| RETURNS FLOAT | ||
| RETURN x + 1.0; | ||
| ``` | ||
|
|
||
| ```json | ||
| { | ||
| "function-uuid": "42fd3f91-bc10-41c1-8a52-92b57dd0a9b2", | ||
| "format-version": 1, | ||
| "definitions": [ | ||
| { | ||
| "definition-id": "int", | ||
| "parameters": [ | ||
| { | ||
| "name": "x", "type": "int", "doc": "Input integer" | ||
| } | ||
| ], | ||
| "return-type": "int", | ||
| "function-type": "udf", | ||
| "doc": "Add one to the input integer", | ||
| "versions": [ | ||
| { | ||
| "version-id": 1, | ||
| "deterministic": true, | ||
| "representations": [ | ||
| { "type": "sql", "dialect": "trino", "sql": "x + 2" } | ||
| ], | ||
| "timestamp-ms": 1734507000123 | ||
| }, | ||
| { | ||
| "version-id": 2, | ||
| "deterministic": true, | ||
| "representations": [ | ||
| { "type": "sql", "dialect": "trino", "sql": "x + 1" }, | ||
| { "type": "sql", "dialect": "spark", "sql": "x + 1" } | ||
| ], | ||
| "timestamp-ms": 1735507000124 | ||
| } | ||
| ], | ||
| "current-version-id": 2 | ||
| }, | ||
| { | ||
| "definition-id": "float", | ||
| "parameters": [ | ||
| { | ||
| "name": "x", "type": "float", "doc": "Input float" | ||
| } | ||
| ], | ||
| "return-type": "float", | ||
| "function-type": "udf", | ||
| "doc": "Add one to the input float", | ||
| "versions": [ | ||
| { | ||
| "version-id": 1, | ||
| "deterministic": true, | ||
| "representations": [ | ||
| { "type": "sql", "dialect": "trino", "sql": "x + 1.0" } | ||
| ], | ||
| "timestamp-ms": 1734507001123 | ||
| } | ||
| ], | ||
| "current-version-id": 1 | ||
| } | ||
| ], | ||
| "definition-log": [ | ||
flyrain marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| { | ||
| "timestamp-ms": 1734507000123, | ||
| "definition-versions": [ | ||
| { "definition-id": "int", "version-id": 1 } | ||
| ] | ||
| }, | ||
| { | ||
| "timestamp-ms": 1734507001123, | ||
| "definition-versions": [ | ||
| { "definition-id": "int", "version-id": 1 }, | ||
| { "definition-id": "float", "version-id": 1 } | ||
| ] | ||
| }, | ||
| { | ||
| "timestamp-ms": 1735507000124, | ||
| "definition-versions": [ | ||
| { "definition-id": "int", "version-id": 2 }, | ||
| { "definition-id": "float", "version-id": 1 } | ||
| ] | ||
| } | ||
| ], | ||
| "doc": "Overloaded scalar UDF for integer and float inputs", | ||
| "secure": false | ||
| } | ||
| ``` | ||
|
|
||
| ## Appendix B: UDTF Example | ||
| SQL statement: | ||
|
|
||
| ```sql | ||
| CREATE FUNCTION fruits_by_color(c VARCHAR COMMENT 'Color of fruits') | ||
| COMMENT 'Return fruits of specific color from fruits table' | ||
| RETURNS TABLE (name VARCHAR, color VARCHAR) | ||
| RETURN SELECT name, color FROM fruits WHERE color = c; | ||
| ``` | ||
|
|
||
| ```json | ||
| { | ||
| "function-uuid": "8a7fa39a-6d8f-4a2f-9d8d-3f3a8f3c2a10", | ||
| "format-version": 1, | ||
| "definitions": [ | ||
| { | ||
| "definition-id": "string", | ||
| "parameters": [ | ||
| { | ||
| "name": "c", "type": "string", "doc": "Color of fruits" | ||
| } | ||
| ], | ||
| "return-type": { | ||
| "type": "struct", | ||
| "fields": [ | ||
| { "name": "name", "type": "string" }, | ||
| { "name": "color", "type": "string" } | ||
| ] | ||
| }, | ||
| "function-type": "udtf", | ||
| "doc": "Return fruits of a specific color from the fruits table", | ||
| "versions": [ | ||
| { | ||
| "version-id": 1, | ||
| "deterministic": true, | ||
| "representations": [ | ||
| { "type": "sql", "dialect": "trino", "sql": "SELECT name, color FROM fruits WHERE color = c" }, | ||
| { "type": "sql", "dialect": "spark", "sql": "SELECT name, color FROM fruits WHERE color = c" } | ||
| ], | ||
| "timestamp-ms": 1734508000123 | ||
| } | ||
| ], | ||
| "current-version-id": 1 | ||
| } | ||
| ], | ||
| "definition-log": [ | ||
| { | ||
| "timestamp-ms": 1734508000123, | ||
| "definition-versions": [ | ||
| { "definition-id": "string", "version-id": 1 } | ||
| ] | ||
| } | ||
| ], | ||
| "doc": "UDTF returning (name, color) rows filtered by the given color", | ||
| "secure": false | ||
| } | ||
| ``` | ||
Uh oh!
There was an error while loading. Please reload this page.