Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: add multi-arg transform support #8579

Merged
merged 6 commits into from
Jan 25, 2024

Conversation

advancedxy
Copy link
Contributor

@advancedxy advancedxy commented Sep 18, 2023

As discussed in #8258 and its google doc: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit?usp=sharing.

This PR adds multi-arg transform into the spec.

Co-authored-by: Szehon Ho szehon.apache@gmail.com

@advancedxy
Copy link
Contributor Author

My vision for the complete support of multi-arg transform would be:

  • current PR
  • PR for core and api module
  • PR for Spark, including 3.3, 3.4, and 3.5
  • PR for Flink, including 1.15, 1.16 and 1.17
  • PR for the Python binding if needed
  • other query engines: such as trino/presto, other language bindings: such as rust and go.
  • other multi-arg transform except bucket, such as zorder or other geo partitioning transforms.
  • ...

I would commit to finish the API, Core, Spark(the poc PR: #8259 include these supports, but needs some refinements) and Flink support, Python if necessary.
@RussellSpitzer @rdblue @aokolnychyi @szehon-ho would you guys to have a look at this and appreciate your input.

@advancedxy
Copy link
Contributor Author

Gently ping @rdblue @RussellSpitzer @aokolnychyi @szehon-ho

@advancedxy advancedxy changed the title spec: add multi-arg transform support Spec: add multi-arg transform support Sep 28, 2023
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
@szehon-ho
Copy link
Collaborator

Hi, @advancedxy , thanks for the work. Sorry for the delay, I am just returning from paternity leave. Will love to see this get in to get work on zorder and geo-transforms. I left some comments

@advancedxy
Copy link
Contributor Author

Hi, @advancedxy , thanks for the work. Sorry for the delay, I am just returning from paternity leave. Will love to see this get in to get work on zorder and geo-transforms. I left some comments

WOW, big congrats on the arrival of your newborn. I will resume this work support once I finished my internal project, which I'm leveraging bucketing and sorting to support efficient upsert. It will depends on multi-arg bucket transform at some point.

@szehon-ho
Copy link
Collaborator

WOW, big congrats on the arrival of your newborn.

Thank you so much!

I will resume this work support once I finished my internal project, which I'm leveraging bucketing and sorting to support efficient upsert.

Sure, understood. Another possibility, if it will take awhile, is that I can also help with this pr to move it forward and we can be the co-authors

@advancedxy
Copy link
Contributor Author

Another possibility, if it will take awhile, is that I can also help with this pr to move it forward and we can be the co-authors

Of course, thanks for offering. As listed in #8579 (comment), there are multiple parts about this feature. For this pr of spec change, I can address your comments in this week and hopefully you could help get more eyes from others on this spec change. I think we can be co-authors about this spec change and the whole feature.

For other parts, It might take me a little more time to refactor/refine/decouple/impl. But it would be great if we can work together to move thing forward.

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jan 11, 2024
format/spec.md Outdated
@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| **`identity`** | Source value, unmodified | Any | Source type |
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
| **`bucket[N]`** | Hash of value(s), mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering, weather we should simply add a new bucketV2 partition transform to distinguish the single-arg one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with adding bucketv2 here. Let see what @aokolnychyi @rdblue think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to keep just bucket as long as we would return UnknownTransform in old readers/writers. If keeping it as bucket leads to exceptions, then we have to consider another name.

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks for the very quick update, I left some more comments

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Co-authored-by: Szehon Ho <szehon.apache@gmail.com>
Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left some more comments. I think its getting close

I also pinged @aokolnychyi about it, he said he will take a look this week.

format/spec.md Outdated
@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| **`identity`** | Source value, unmodified | Any | Source type |
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
| **`bucket[N]`** | Hash of value(s), mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with adding bucketv2 here. Let see what @aokolnychyi @rdblue think

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated

| Field | JSON representation | Example |
|---------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **`Sort Field(multi-arg transform)`** | `JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}` | `{`<br />&nbsp;&nbsp;` "transform": "bucketV2[4]",`<br />&nbsp;&nbsp;` "source-id": -1,`<br />&nbsp;&nbsp;` "source-id": [1,2],`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add Notes section and also add the note from partition field here (about when to emit and omit source-id and source-ids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion. I added the notes, and after reviewing this part, I think the table of Sort Fields could be more consistent with Partition Fields and therefore changes that part. WDYT?

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated
|Field|JSON representation|Example|
|--- |--- |--- |
|**`Sort Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br />&nbsp;&nbsp;` "transform": "bucket[4]",`<br />&nbsp;&nbsp;` "source-id": 3,`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}`|
| Field | JSON representation | Example |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated

Notes:
1. For sort fields with a transform with a single argument, the id of the source field is set on `source-id`, and `source-ids` is omitted.
2. For sort fields with a transform of multiple arguments, the ids of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is writing an explicit source-id required to avoid exceptions compared to not writing that at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I believe an explicit source-id is required to avoid exception in old versions.

See

int sourceId = JsonUtil.getInt(SOURCE_ID, element);

and
int sourceId = JsonUtil.getInt(SOURCE_ID, element);

@aokolnychyi
Copy link
Contributor

aokolnychyi commented Jan 17, 2024

This seems in a pretty good shape. I guess the open question is about bucket vs bucketV2 naming. I'll also check the math behind bucketing on multiple values with fresh eyes on Thursday.

cc @RussellSpitzer @danielcweeks @nastra @Fokko @rdblue @jackye1995 @amogh-jahagirdar

@amogh-jahagirdar
Copy link
Contributor

I'll also take a look at this tomorrow morning as well, thanks @advancedxy !

@aokolnychyi
Copy link
Contributor

aokolnychyi commented Jan 19, 2024

@rdblue recently pointed me to the Bloom filter spec in Parquet. I think it contains a few interesting ideas that may be applicable to us. First of all, we should evaluate other hash functions apart from Murmur3. Parquet, for instance, uses xxHash that is supposed to be much faster. Second, Parquet avoids the modulo operator for performance reasons. Given all this information, I suggest we make this PR about multi-arg transforms in general (like how they are stored, how they are serialized, what happens during schema evolution, compatibility etc) and submit another one with bucketV2 that will not only support multiple input elements but also be faster. If we merge a general change about multi-arg transforms, we can start working on changes to the expression API while figuring out the details about bucketV2.

@advancedxy @szehon-ho, how does this sound?

@advancedxy
Copy link
Contributor Author

First of all, we should evaluate other hash functions apart from Murmur3. Parquet, for instance, uses xxHash that is supposed to be much faster
Second, Parquet avoids the modulo operator for performance reasons.

Both sounds great improvement to me. Apart from faster hash, I'd like to add another possible option to explore: user defined hash function for bucket transform while we are working bucketV2. From time to time, I got request from users that is it possible to custom Iceberg's bucket partitioning strategy, so that it has exactly the same distribution of downstream systems.

If we merge a general change about multi-arg transforms, we can start working on changes to the expression API while figuring out the details about bucketV2.

I'm ok to merge multi-arg transform first. However I'm not sure how to provide examples for single-arg transform v.s. multi-arg transform as there will be no bucketV2 transform for now. I am referring this part:

|**`Partition Field`** [2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
|**`Partition Field with multi-arg transform`** [3]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": [1,2],`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_type_bucket",`<br />&nbsp;&nbsp;`"transform": "bucketV2[16]"`<br />`}`|

@szehon-ho @aokolnychyi do you have any suggestions?

@szehon-ho
Copy link
Collaborator

Hi @advancedxy , I'm ok to leave that for the next pr.

How about we just keep the notes for PartitionField and SortOrder like?

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1. 

And just omit the example until we the bucketv2 pr?

@advancedxy
Copy link
Contributor Author

@szehon-ho @aokolnychyi the bucketV2 part is removed from this PR. Let me know if you have any more comments.

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@szehon-ho szehon-ho merged commit 200b9c1 into apache:main Jan 25, 2024
2 checks passed
@szehon-ho
Copy link
Collaborator

szehon-ho commented Jan 25, 2024

Merged, thanks @advancedxy ! Feel free to work on bucketv2 spec, and we can make any other follow ups as well

geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024
@aokolnychyi
Copy link
Contributor

This change seems reasonable to me. @advancedxy, could you also post to the dev list that this was merged to get any input from folks who did not review before we release 1.5? I feel that would be important as it is a spec change.

@advancedxy
Copy link
Contributor Author

This change seems reasonable to me. @advancedxy, could you also post to the dev list that this was merged to get any input from folks who did not review before we release 1.5? I feel that would be important as it is a spec change.

of course, nice suggestion.

Notes:

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
Copy link
Contributor

@emkornfield emkornfield Jan 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick. It seems that it would be better to choose a field ID from the existing range for reserved field IDs (e.g. MAX_INT-200) then to use -1, which as far as I can tell is still potentially a valid field according to the spec (I might have missed it but field IDs simply seem to be defined as integers).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer using the first column from the source ID list instead of a fake ID. That way older readers at least see that the transform is associated with one of the correct columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using a valid source ID here would lead to incorrect results for old clients if a predicate is specified on the column. IIUC invalid ID here makes sure reads should always be correct or fail which seems like better semantics if the aim is forwards compatibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that it would be better to choose a field ID from the existing range for reserved field IDs (e.g. MAX_INT-200) then to use -1,

Per my understanding, multi-arg transforms will mostly get a new transform name rather than the existing ones. Older readers will treat this multi-arg transform as an UnknownTransform, the persisted source-id is just to make old code happy, see this reply as well: #8579 (comment). So the value of source-id is just a place holder and doesn't make too much sense. It could be a field ID from the reserved range or a negative one since the current reference implementation wouldn't produce a negative field id.
I simply choose -1 as it seems more nature and doesn't need to put a somehow weird reserved field in the MetadataColumns.java , but I think we make always make follow-up pr if there's valid concerns/solutions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is a different transform (I wasn't clear on the final status there) I think it makes it less important so at this point it is bike shedding but I think having a clear signal that this field is meaningless might be useful. I think for V3 it might be worthwhile to consider dropping the backwards compatibility.

Copy link
Collaborator

@szehon-ho szehon-ho Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I also think older readers will not be able to make any use of the new mulit-arg transforms. So they would only be able to read new tables (though without any partition pushdown), and would fail to write. So I agree , it is moot what to even put for source-id here, though I think choosing a reserved one is a good idea. Is it just so the java reference implementation can properly de-serialize as Unknown and have a better exception message?

Is the idea in v1/v2 to write source-id column as -1/reserved, and in v3, we will write source-ids for everything and drop source-id column?

I guess this is a more general discussion and can wait the new spec pr clarifying v1/2 vs v3 behaviors.

adnanhemani pushed a commit to adnanhemani/iceberg that referenced this pull request Jan 30, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants