
Conversation

kevinzwang (Contributor)

DayTransform.result_type() incorrectly returns DateType(). This change removes that override so it has the same behavior as the other time transforms, returning an IntegerType(). The actual return values of DayTransform.transform and DayTransform.pyarrow_transform have been correct all along.
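For context, a minimal sketch of the spec behavior being described (this is an illustration, not pyiceberg's actual implementation): the day transform produces an int ordinal counting days since the Unix epoch.

```python
from datetime import date

# Per the Iceberg spec, the day transform's value is an int:
# the number of days since the Unix epoch (1970-01-01).
EPOCH = date(1970, 1, 1)

def day_transform(value: date) -> int:
    return (value - EPOCH).days

print(day_transform(date(1970, 1, 1)))  # 0
print(day_transform(date(1970, 1, 2)))  # 1
```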

@kevinjqliu (Contributor)

is this the source of truth?
https://iceberg.apache.org/spec/#partition-transforms

@kevinzwang (Contributor, Author)

is this the source of truth? https://iceberg.apache.org/spec/#partition-transforms

Yup, precisely

Comment on lines -520 to -522
    def result_type(self, source: IcebergType) -> IcebergType:
        return DateType()

Contributor

nit: can we be explicit here and do

    def result_type(self, source: IcebergType) -> IntegerType:
        return IntegerType()

Contributor Author

I'm matching the behavior of the other transforms that extend TimeTransform, which all just implicitly use TimeTransform.result_type instead of overriding it. Should we change this for all of them?

Contributor

gotcha, thanks for the context. This is fine!

Example:

    >>> transform = MonthTransform()
    >>> transform = DayTransform()
Contributor

ty!

@kevinzwang (Contributor, Author)

Ok so interesting... Spark actually does store day-transformed partitions as date type in the metadata, which is why the integration test is failing. This is probably why this library had that behavior before. So Spark itself does not follow the Iceberg spec. Thoughts on deviating from Spark's behavior?

@kevinjqliu (Contributor)

I'm not 100% sure; perhaps the metadata table does the transformation.
https://iceberg.apache.org/docs/latest/spark-queries/#partitions

@kevinzwang (Contributor, Author)

Im not 100% sure, perhaps the metadata table does the transformation.

https://iceberg.apache.org/docs/latest/spark-queries/#partitions

I think you are correct.

Physically, the integer value stored is days since epoch. But Spark encodes the type of the day-transformed partition field in the metadata file as date type. On the other hand, for other transforms Spark conforms to the Iceberg spec. Strange.
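To illustrate the point above, a small stdlib sketch: the int "days since epoch" and the date are two views of the same physical value, so only the logical type annotation differs.

```python
from datetime import date, timedelta

# The physical value is identical either way: an int counting days
# since the epoch. The date is just a friendlier logical view of it.
EPOCH = date(1970, 1, 1)

days = (date(1970, 3, 1) - EPOCH).days   # spec view: plain int
as_date = EPOCH + timedelta(days=days)   # metadata-file view: date
print(days)     # 59
print(as_date)  # 1970-03-01
```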

@kevinjqliu (Contributor)

I like that it's converted, it's more readable! Do you know where the transform happens? Is it only for the metadata table?

@kevinzwang (Contributor, Author)

It's more readable, sure, but it just does not conform to the spec. I'm not entirely sure what you mean by your question @kevinjqliu; I believe partition values are only stored in metadata.

@kevinjqliu (Contributor)

The partition value is stored in the metadata as int, but the "partition metadata table" (https://iceberg.apache.org/docs/latest/spark-queries/#partitions) shows the partition data as a "timestamp" instead.
When querying the "partition metadata table", the int partition value is transformed to its "timestamp" representation.

This is the difference between Spark and pyiceberg, as seen in the failed test, test_inspect_partitions_partitioned.

So perhaps the "partition metadata table" is implemented wrong.

To verify, we need to find where the transformation from int to timestamp happens on the Spark side.

@kevinzwang (Contributor, Author)

kevinzwang commented Sep 26, 2024

I played around with Spark and inspected the generated metadata. I believe the partition value is actually stored in the metadata as date type for day-transformed partitions.

The reason why we are getting the right value, just as an integer, in the test is that the physical type of date type is integer, and pyiceberg coerces the metadata table to the types from Transform.result_type upon reading them.

Basically, Spark stores day-transformed partition values incorrectly in the metadata. However, since the underlying data of integer "days since epoch" and date are the same, pyiceberg and Spark should still be compatible. It's just that the metadata tables will not be equal when read due to the type difference. I would say the solution is just to coerce the types in this test so that they match.
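A hedged sketch of the coercion idea proposed above (the helper name and column shapes are hypothetical, not from the actual test): normalize a date-typed day-partition column to its int "days since epoch" form before comparing the two metadata tables.

```python
from datetime import date

# Hypothetical test helper: coerce date values in a day-partition column
# to their int "days since epoch" form so Spark-written (date) and
# pyiceberg-read (int) metadata rows compare equal.
EPOCH = date(1970, 1, 1)

def coerce_day_partition(values):
    return [(v - EPOCH).days if isinstance(v, date) else v for v in values]

spark_col = [date(1970, 1, 2), date(1970, 1, 3)]  # as Spark metadata shows it
pyiceberg_col = [1, 2]                            # as pyiceberg shows it
print(coerce_day_partition(spark_col) == pyiceberg_col)  # True
```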

@kevinjqliu (Contributor)

Basically, Spark stores day transformed partition values incorrectly in the metadata

That's an interesting find... The core Iceberg library uses DateType as the result type for DayTransform:
https://github.com/apache/iceberg/blame/main/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47

In fact, it's the only "time-based" transform which uses this:
https://grep.app/search?q=getResultType&filter[repo][0]=apache/iceberg&filter[path][0]=api/src/main/java/org/apache/iceberg/

Let me bring this up in the devlist. In the meantime, I think we can either:

  1. Put back DateType as the result type for DayTransform to match the behavior
  2. Fix it in the Partitions metadata table as a workaround and link an issue to fix forward

I don't want to add a workaround in the tests, as it will become difficult to maintain. WDYT?


@kevinzwang (Contributor, Author)

Perhaps let's just wait on the response on the devlist first.

@kevinjqliu (Contributor) left a comment

Turns out, removing result_type from DayTransform is the cause of all our woes in this PR.
The function name result_type is an overloaded term and doesn't correspond 1-to-1 with the "Result type" described in the spec.

The result_type for DayTransform should be DateType.
This conforms to the spec, since DateType is converted to int here.

The DateType result_type is needed since it is mapped to the pyarrow date type here, which corresponds with the Java library behavior.

Let's add a comment to result_type to explain this dynamic for future reference.
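To make the suggestion concrete, a toy sketch (stub classes, not pyiceberg's real ones) of the dynamic such a comment would explain: result_type stays DateType so engines display day partitions as dates, while the transformed value itself is still the spec-mandated int.

```python
from datetime import date

class DateType:
    """Logical date; physically stored as an int of days since 1970-01-01."""

class DayTransform:
    def result_type(self, source) -> "DateType":
        # DateType rather than IntegerType: the physical value is the
        # same int either way, but engines render it as a readable date.
        return DateType()

    def transform(self, value: date) -> int:
        # The actual transformed value is still the spec's int ordinal.
        return (value - date(1970, 1, 1)).days

t = DayTransform()
print(type(t.result_type(None)).__name__)  # DateType
print(t.transform(date(1970, 1, 2)))       # 1
```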

@kevinzwang (Contributor, Author)

Ah ok, are we choosing to conform to the Java library behavior? In that case, I will close this PR, and we could make another one that improves the documentation for result_type

@kevinjqliu (Contributor)

Ah ok, are we choosing to conform to the Java library behavior?

yep! Both the Java and Python libraries behave according to the spec. As Ryan mentioned on the devlist, the DateType result_type for DayTransform is a nice-to-have feature so that the engine displays the day partition as a date instead of an int.

In that case, I will close this PR, and we could make another one that improves the documentation for result_type

WDYT of repurposing this PR for documentation? The comments on this PR are helpful for future context.

@kevinzwang (Contributor, Author)

Personally I would prefer to close this and then create a different PR that refers to this one, so that I don't have to make a commit to undo the work here and change the name.

@kevinzwang (Contributor, Author)

Made a new PR here: #1211

@kevinzwang closed this Sep 30, 2024
@kevinjqliu (Contributor)

ty!
