Spec: add multi-arg transform support #8579

advancedxy · 2023-09-18T06:26:12Z

As discussed in #8258 and its google doc: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit?usp=sharing.

This PR adds multi-arg transforms to the spec.

Co-authored-by: Szehon Ho szehon.apache@gmail.com

advancedxy · 2023-09-18T06:37:36Z

My vision for complete support of multi-arg transforms would be:

1. The current PR
2. A PR for the core and api modules
3. A PR for Spark, including 3.3, 3.4, and 3.5
4. A PR for Flink, including 1.15, 1.16, and 1.17
5. A PR for the Python binding, if needed
6. Other query engines, such as Trino/Presto, and other language bindings, such as Rust and Go
7. Multi-arg transforms other than bucket, such as zorder or other geo partitioning transforms

...

I commit to finishing the API, Core, Spark (the PoC PR #8259 includes this support but needs some refinement), and Flink support, plus Python if necessary.
@RussellSpitzer @rdblue @aokolnychyi @szehon-ho, would you take a look at this? I would appreciate your input.

advancedxy · 2023-09-26T02:06:26Z

Gentle ping @rdblue @RussellSpitzer @aokolnychyi @szehon-ho

szehon-ho · 2024-01-09T06:44:41Z

Hi @advancedxy, thanks for the work. Sorry for the delay, I am just returning from paternity leave. I would love to see this get in so that work on zorder and geo-transforms can proceed. I left some comments.

advancedxy · 2024-01-10T03:02:09Z

> Hi @advancedxy, thanks for the work. Sorry for the delay, I am just returning from paternity leave. I would love to see this get in so that work on zorder and geo-transforms can proceed. I left some comments.

WOW, big congrats on the arrival of your newborn. I will resume this work once I have finished my internal project, in which I'm leveraging bucketing and sorting to support efficient upserts. It will depend on the multi-arg bucket transform at some point.

szehon-ho · 2024-01-10T10:32:48Z

> WOW, big congrats on the arrival of your newborn.

Thank you so much!

> I will resume this work once I have finished my internal project, in which I'm leveraging bucketing and sorting to support efficient upserts.

Sure, understood. Another possibility, if it will take a while, is that I can help with this PR to move it forward, and we can be co-authors.

advancedxy · 2024-01-10T13:00:43Z

> Another possibility, if it will take a while, is that I can help with this PR to move it forward, and we can be co-authors.

Of course, thanks for offering. As listed in #8579 (comment), this feature has multiple parts. For this spec-change PR, I can address your comments this week, and hopefully you can help get more eyes on it. I think we can be co-authors of this spec change and the whole feature.

For the other parts, it might take me a little more time to refactor, refine, decouple, and implement. But it would be great if we can work together to move things forward.

advancedxy · 2024-01-11T10:01:27Z

format/spec.md

@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | Transform name    | Description                                                  | Source types                                                                                              | Result type |
 |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
 | **`identity`**    | Source value, unmodified                                     | Any                                                                                                       | Source type |
-| **`bucket[N]`**   | Hash of value, mod `N` (see below)                           | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int`       |
+| **`bucket[N]`**   | Hash of value(s), mod `N` (see below)                        | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int`       |


I'm wondering whether we should simply add a new bucketV2 partition transform to distinguish it from the single-arg one.

I am OK with adding bucketV2 here. Let's see what @aokolnychyi @rdblue think.

Ref: I think it was the decision as per this discussion? https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit?disco=AAAA3dkHA5A

I'd be inclined to keep just bucket as long as we would return UnknownTransform in old readers/writers. If keeping it as bucket leads to exceptions, then we have to consider another name.

szehon-ho

Sure, thanks for the very quick update, I left some more comments

format/spec.md

Co-authored-by: Szehon Ho <szehon.apache@gmail.com>

szehon-ho

Thanks, left some more comments. I think it's getting close.

I also pinged @aokolnychyi about it, he said he will take a look this week.

szehon-ho · 2024-01-16T03:19:29Z

I am OK with adding bucketV2 here. Let's see what @aokolnychyi @rdblue think.

szehon-ho · 2024-01-16T06:43:13Z

format/spec.md

+
+| Field                                 | JSON representation                                                                                                                                                                                                                                                      | Example                                                                                                                                                                                                                                |
+|---------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`Sort Field(multi-arg transform)`** | `JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}` | `{`<br />&nbsp;&nbsp;`  "transform": "bucketV2[4]",`<br />&nbsp;&nbsp;`  "source-id": -1,`<br />&nbsp;&nbsp;`  "source-ids": [1,2],`<br />&nbsp;&nbsp;`  "direction": "desc",`<br />&nbsp;&nbsp;`  "null-order": "nulls-last"`<br />`}` |


Should we add a Notes section and also add the note from the partition field here (about when to emit and omit source-id and source-ids)?

Thanks for your suggestion. I added the notes, and after reviewing this part, I think the table of Sort Fields could be more consistent with Partition Fields, so I changed that part. WDYT?

aokolnychyi · 2024-01-17T00:18:50Z

format/spec.md

-|Field|JSON representation|Example|
-|--- |--- |--- |
-|**`Sort Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br />&nbsp;&nbsp;`  "transform": "bucket[4]",`<br />&nbsp;&nbsp;`  "source-id": 3,`<br />&nbsp;&nbsp;`  "direction": "desc",`<br />&nbsp;&nbsp;`  "null-order": "nulls-last"`<br />`}`|
+| Field                                         | JSON representation                                                                                                                                                                                                                                                      | Example                                                                                                                                                                                                                                |


aokolnychyi · 2024-01-17T00:20:27Z

format/spec.md

+
+Notes:
+1. For sort fields with a transform with a single argument, the id of the source field is set on `source-id`, and `source-ids` is omitted.
+2. For sort fields with a transform of multiple arguments, the ids of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.


Is writing an explicit source-id required to avoid exceptions compared to not writing that at all?

Unfortunately, I believe an explicit source-id is required to avoid exceptions in old versions.

See

iceberg/core/src/main/java/org/apache/iceberg/PartitionSpecParser.java

Line 135 in a1f4642

int sourceId = JsonUtil.getInt(SOURCE_ID, element);

and

iceberg/core/src/main/java/org/apache/iceberg/SortOrderParser.java

Line 154 in ab398a0

int sourceId = JsonUtil.getInt(SOURCE_ID, element);
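
To illustrate why the placeholder matters, here is a minimal Python sketch of a parser that, like the two lines referenced above, reads `source-id` unconditionally. The function name, the error message, and the transform name `example_multi_arg[4]` are hypothetical and chosen only for this illustration; this is not the Iceberg parser itself.

```python
import json


def parse_field_old_reader(field_json: str) -> dict:
    """Hypothetical stand-in for an older parser that always reads `source-id`."""
    element = json.loads(field_json)
    if "source-id" not in element:
        # An unconditional read of a required key fails when the key is absent.
        raise ValueError("Cannot parse missing int: source-id")
    return {"source_id": element["source-id"], "transform": element["transform"]}


# A multi-arg field written with the -1 placeholder, as in note 2.
with_placeholder = json.dumps(
    {"transform": "example_multi_arg[4]", "source-id": -1, "source-ids": [1, 2]}
)

# The same field written without `source-id`.
without_placeholder = json.dumps(
    {"transform": "example_multi_arg[4]", "source-ids": [1, 2]}
)

parsed = parse_field_old_reader(with_placeholder)  # parses; source_id is -1
```

Under these assumptions, the field with the placeholder parses, while the field without `source-id` raises before the reader ever gets to inspect the transform name.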

aokolnychyi · 2024-01-17T00:22:41Z

This seems to be in pretty good shape. I guess the open question is about bucket vs bucketV2 naming. I'll also check the math behind bucketing on multiple values with fresh eyes on Thursday.

cc @RussellSpitzer @danielcweeks @nastra @Fokko @rdblue @jackye1995 @amogh-jahagirdar

amogh-jahagirdar · 2024-01-18T04:43:05Z

I'll also take a look at this tomorrow morning, thanks @advancedxy!

aokolnychyi · 2024-01-19T00:35:41Z

@rdblue recently pointed me to the Bloom filter spec in Parquet. I think it contains a few interesting ideas that may be applicable to us. First of all, we should evaluate other hash functions apart from Murmur3. Parquet, for instance, uses xxHash that is supposed to be much faster. Second, Parquet avoids the modulo operator for performance reasons. Given all this information, I suggest we make this PR about multi-arg transforms in general (like how they are stored, how they are serialized, what happens during schema evolution, compatibility etc) and submit another one with bucketV2 that will not only support multiple input elements but also be faster. If we merge a general change about multi-arg transforms, we can start working on changes to the expression API while figuring out the details about bucketV2.

@advancedxy @szehon-ho, how does this sound?
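
As a rough illustration of the modulo-free idea, the sketch below contrasts a "hash, mod N" bucket with a multiply-and-shift range reduction of the kind the Parquet Bloom filter spec uses for block selection. Both function names are hypothetical, the hash is taken as a given 32-bit integer rather than computed with Murmur3 or xxHash, and nothing here is a proposal for how bucketV2 should be defined.

```python
MASK32 = 0xFFFFFFFF


def bucket_by_modulo(hash32: int, num_buckets: int) -> int:
    """Clear the sign bit, then take the remainder (the 'hash, mod N' style)."""
    return (hash32 & 0x7FFFFFFF) % num_buckets


def bucket_by_multiply_shift(hash32: int, num_buckets: int) -> int:
    """Map an unsigned 32-bit hash into [0, num_buckets) without a modulo.

    The hash is treated as a fraction of 2**32 and scaled by num_buckets,
    which costs one multiplication and one shift.
    """
    return ((hash32 & MASK32) * num_buckets) >> 32


# Both reductions stay inside the bucket range, but they assign
# different buckets to the same hash value.
low = bucket_by_multiply_shift(0x00000000, 16)
mid = bucket_by_multiply_shift(0x80000000, 16)
high = bucket_by_multiply_shift(0xFFFFFFFF, 16)
```

Note that the two reductions are not interchangeable: switching from one to the other changes which bucket a value lands in, which is one reason such a change would need a new transform rather than a silent change to `bucket[N]`.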

advancedxy · 2024-01-19T03:45:50Z

> First of all, we should evaluate other hash functions apart from Murmur3. Parquet, for instance, uses xxHash that is supposed to be much faster.
> Second, Parquet avoids the modulo operator for performance reasons.

Both sound like great improvements to me. Apart from a faster hash, I'd like to add another option to explore while we are working on bucketV2: a user-defined hash function for the bucket transform. From time to time, users ask me whether it is possible to customize Iceberg's bucket partitioning strategy so that it has exactly the same distribution as downstream systems.

> If we merge a general change about multi-arg transforms, we can start working on changes to the expression API while figuring out the details about bucketV2.

I'm OK with merging multi-arg transforms first. However, I'm not sure how to provide examples for single-arg vs. multi-arg transforms, as there will be no bucketV2 transform for now. I am referring to this part:

|**`Partition Field`** [2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
|**`Partition Field with multi-arg transform`** [3]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": [1,2],`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_type_bucket",`<br />&nbsp;&nbsp;`"transform": "bucketV2[16]"`<br />`}`|

@szehon-ho @aokolnychyi do you have any suggestions?

szehon-ho · 2024-01-22T17:09:04Z

Hi @advancedxy, I'm OK with leaving that for the next PR.

How about we just keep the notes for PartitionField and SortOrder, like this?

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.

And just omit the example until the bucketV2 PR?
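
For reference, the two notes can be sketched as a small writer function. This is only an illustration in Python under the rules stated in the notes; the function name `partition_field_to_json` and the transform name `example_multi_arg[16]` are hypothetical, and the real serialization lives in the Java reference implementation.

```python
from typing import Any, Dict, List


def partition_field_to_json(
    name: str, field_id: int, transform: str, source_ids: List[int]
) -> Dict[str, Any]:
    """Build the JSON object for a partition field following notes 1 and 2."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source field")
    field: Dict[str, Any] = {}
    if len(source_ids) == 1:
        # Note 1: single argument, so only `source-id` is written.
        field["source-id"] = source_ids[0]
    else:
        # Note 2: multiple arguments, so `source-ids` carries the real IDs
        # and `source-id` is the -1 placeholder for older readers.
        field["source-id"] = -1
        field["source-ids"] = list(source_ids)
    field["field-id"] = field_id
    field["name"] = name
    field["transform"] = transform
    return field


single = partition_field_to_json("id_bucket", 1000, "bucket[16]", [1])
multi = partition_field_to_json("id_type", 1001, "example_multi_arg[16]", [1, 2])
```

Here `single` contains no `source-ids` key at all, while `multi` contains both keys, with `source-id` set to -1.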

advancedxy · 2024-01-24T02:25:16Z

@szehon-ho @aokolnychyi the bucketV2 part has been removed from this PR. Let me know if you have any more comments.

szehon-ho

Looks good to me!

szehon-ho · 2024-01-25T18:34:01Z

Merged, thanks @advancedxy ! Feel free to work on bucketv2 spec, and we can make any other follow ups as well

aokolnychyi · 2024-01-26T21:03:11Z

This change seems reasonable to me. @advancedxy, could you also post to the dev list that this was merged to get any input from folks who did not review before we release 1.5? I feel that would be important as it is a spec change.

advancedxy · 2024-01-27T00:39:08Z

> This change seems reasonable to me. @advancedxy, could you also post to the dev list that this was merged to get any input from folks who did not review before we release 1.5? I feel that would be important as it is a spec change.

Of course, nice suggestion.

emkornfield · 2024-01-27T23:14:30Z

format/spec.md

+Notes:
+
+1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
+2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.


Small nitpick. It seems that it would be better to choose a field ID from the existing range of reserved field IDs (e.g. MAX_INT-200) than to use -1, which as far as I can tell is still potentially a valid field ID according to the spec (I might have missed it, but field IDs simply seem to be defined as integers).

I'd prefer using the first column from the source ID list instead of a fake ID. That way older readers at least see that the transform is associated with one of the correct columns.

I think using a valid source ID here would lead to incorrect results for old clients if a predicate is specified on the column. IIUC, an invalid ID here ensures that reads are either correct or fail, which seems like better semantics if the aim is forward compatibility.

> It seems that it would be better to choose a field ID from the existing range of reserved field IDs (e.g. MAX_INT-200) than to use -1,

Per my understanding, multi-arg transforms will mostly get new transform names rather than reuse existing ones. Older readers will treat such a multi-arg transform as an UnknownTransform; the persisted source-id is just there to keep old code happy (see this reply as well: #8579 (comment)). So the value of source-id is just a placeholder and doesn't carry much meaning. It could be a field ID from the reserved range or a negative one, since the current reference implementation wouldn't produce a negative field ID.
I simply chose -1 because it seems more natural and doesn't require putting a somewhat odd reserved field in MetadataColumns.java, but I think we can always make a follow-up PR if there are valid concerns or solutions.

If it is a different transform (I wasn't clear on the final status there), I think that makes it less important, so at this point it is bikeshedding, but I think having a clear signal that this field is meaningless might be useful. I think for V3 it might be worthwhile to consider dropping the backwards compatibility.

Yeah, I also think older readers will not be able to make any use of the new multi-arg transforms. So they would only be able to read new tables (though without any partition pushdown), and would fail to write. So I agree, it is moot what to put for source-id here, though I think choosing a reserved one is a good idea. Is it just so the Java reference implementation can properly deserialize it as Unknown and give a better exception message?

Is the idea in v1/v2 to write source-id column as -1/reserved, and in v3, we will write source-ids for everything and drop source-id column?

I guess this is a more general discussion and can wait for the new spec PR clarifying v1/v2 vs. v3 behaviors.

advancedxy changed the title ~~spec: add multi-arg transform support~~ Spec: add multi-arg transform support Sep 28, 2023

szehon-ho reviewed Jan 9, 2024

spec: multi-arg transform

097f1ab

advancedxy force-pushed the multi-arg-transform-spec branch from 4868bcf to 097f1ab Compare January 11, 2024 09:25

github-actions bot added the Specification label Jan 11, 2024

refine wording

6499552

advancedxy commented Jan 11, 2024

szehon-ho reviewed Jan 12, 2024

chore: address comments and refine wording

888c90b

Co-authored-by: Szehon Ho <szehon.apache@gmail.com>

szehon-ho reviewed Jan 16, 2024

address comments

a2f7b7a