
Conversation

kevinzwang (Contributor)

DayTransform.result_type() incorrectly returns DateType(). This change removes that override so it has the same behavior as the other time transforms, returning an IntegerType(). The actual return values of DayTransform.transform and DayTransform.pyarrow_transform have been correct all along.
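For context, a minimal sketch of the spec behavior being described (this is an illustration, not pyiceberg's actual implementation): the day transform produces an int ordinal counting days since the Unix epoch.

```python
from datetime import date

# Per the Iceberg spec, the day transform's value is an int:
# the number of days since the Unix epoch (1970-01-01).
EPOCH = date(1970, 1, 1)

def day_transform(value: date) -> int:
    return (value - EPOCH).days

print(day_transform(date(1970, 1, 1)))  # 0
print(day_transform(date(1970, 1, 2)))  # 1
```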

@kevinjqliu (Contributor)

is this the source of truth?
https://iceberg.apache.org/spec/#partition-transforms

@kevinzwang (Contributor, Author)

is this the source of truth? https://iceberg.apache.org/spec/#partition-transforms

Yup, precisely

Comment on lines -520 to -522
    def result_type(self, source: IcebergType) -> IcebergType:
        return DateType()

Contributor

nit: can we be explicit here and do

    def result_type(self, source: IcebergType) -> IntegerType:
        return IntegerType()

Contributor Author

I'm matching the behavior of the other transforms that extend TimeTransform, which all just implicitly use TimeTransform.result_type instead of overriding it. Should we change this for all of them?

Contributor

gotcha, thanks for the context. This is fine!

Example:

    >>> transform = MonthTransform()
    >>> transform = DayTransform()
Contributor

ty!

@kevinzwang (Contributor, Author)

Ok so interesting... Spark actually does store day-transformed partitions as date type in the metadata, which is why the integration test is failing. This is probably why this library had that behavior before. So Spark itself does not follow the Iceberg spec. Thoughts on deviating from Spark's behavior?

@kevinjqliu (Contributor)

I'm not 100% sure; perhaps the metadata table does the transformation.
https://iceberg.apache.org/docs/latest/spark-queries/#partitions

@kevinzwang (Contributor, Author)

Im not 100% sure, perhaps the metadata table does the transformation.

https://iceberg.apache.org/docs/latest/spark-queries/#partitions

I think you are correct.

Physically, the integer value stored is days since epoch. But Spark encodes the type of the day-transformed partition field in the metadata file as date type. On the other hand, for other transforms Spark conforms to the Iceberg spec. Strange.
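To illustrate the point above, a small stdlib sketch: the int "days since epoch" and the date are two views of the same physical value, so only the logical type annotation differs.

```python
from datetime import date, timedelta

# The physical value is identical either way: an int counting days
# since the epoch. The date is just a friendlier logical view of it.
EPOCH = date(1970, 1, 1)

days = (date(1970, 3, 1) - EPOCH).days   # spec view: plain int
as_date = EPOCH + timedelta(days=days)   # metadata-file view: date
print(days)     # 59
print(as_date)  # 1970-03-01
```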

@kevinjqliu (Contributor)

I like that it's converted, it's more readable! Do you know where the transform happens? Is it only for the metadata table?

@kevinzwang (Contributor, Author)

It's more readable, sure, but it just does not conform to the spec. I'm not entirely sure what you mean by your question @kevinjqliu; I believe partition values are only stored in metadata.

@kevinjqliu (Contributor)

The partition value is stored in the metadata as int, but the "partition metadata table" (https://iceberg.apache.org/docs/latest/spark-queries/#partitions) shows the partition data as a "timestamp" instead.
When querying the "partition metadata table", the int partition value is transformed to its "timestamp" representation.

This is the difference between Spark and pyiceberg, as seen in the failed test, test_inspect_partitions_partitioned.

So perhaps the "partition metadata table" is implemented wrong.

To verify, we need to find where the transformation from int to timestamp happens on the Spark side.

@kevinzwang (Contributor, Author)

kevinzwang commented Sep 26, 2024

I played around with Spark and inspected the generated metadata. I believe the partition value is actually stored in the metadata as date type for day-transformed partitions.

The reason why we are getting the right value, just as an integer, in the test is that the physical type of date type is integer, and pyiceberg coerces the metadata table to the types from Transform.result_type upon reading them.

Basically, Spark stores day-transformed partition values incorrectly in the metadata. However, since the underlying data of integer "days since epoch" and date are the same, pyiceberg and Spark should still be compatible. It's just that the metadata tables will not be equal when read due to the type difference. I would say the solution is just to coerce the types in this test so that they match.
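A hedged sketch of the coercion idea proposed above (the helper name and column shapes are hypothetical, not from the actual test): normalize a date-typed day-partition column to its int "days since epoch" form before comparing the two metadata tables.

```python
from datetime import date

# Hypothetical test helper: coerce date values in a day-partition column
# to their int "days since epoch" form so Spark-written (date) and
# pyiceberg-read (int) metadata rows compare equal.
EPOCH = date(1970, 1, 1)

def coerce_day_partition(values):
    return [(v - EPOCH).days if isinstance(v, date) else v for v in values]

spark_col = [date(1970, 1, 2), date(1970, 1, 3)]  # as Spark metadata shows it
pyiceberg_col = [1, 2]                            # as pyiceberg shows it
print(coerce_day_partition(spark_col) == pyiceberg_col)  # True
```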

@kevinjqliu (Contributor)

Basically, Spark stores day transformed partition values incorrectly in the metadata

That's an interesting find... The core Iceberg library uses DateType as the result type for DayTransform:
https://github.com/apache/iceberg/blame/main/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47

In fact, it's the only "time-based" transform which uses this:
https://grep.app/search?q=getResultType&filter[repo][0]=apache/iceberg&filter[path][0]=api/src/main/java/org/apache/iceberg/

Let me bring this up in the devlist. In the meantime, I think we can either:

  1. Put back DateType as the result type for DayTransform to match the behavior
  2. Fix it in the Partitions metadata table as a workaround and link an issue to fix forward

I don't want to add a workaround in the tests, as it will become difficult to maintain. WDYT?


@kevinzwang (Contributor, Author)

Perhaps let's just wait on the response on the devlist first.

@kevinjqliu (Contributor) left a comment

Turns out, removing result_type from DayTransform is the cause of all our woes in this PR.
The function name result_type is an overloaded term and doesn't correspond 1-to-1 with the "Result type" described in the spec.

The result_type for DayTransform should be DateType.
This conforms to the spec, since DateType is converted to int here.

The DateType result_type is needed since it is mapped to the pyarrow date type here, which corresponds with the Java library behavior.

Let's add a comment to result_type to explain this dynamic for future reference.
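To make the suggestion concrete, a toy sketch (stub classes, not pyiceberg's real ones) of the dynamic such a comment would explain: result_type stays DateType so engines display day partitions as dates, while the transformed value itself is still the spec-mandated int.

```python
from datetime import date

class DateType:
    """Logical date; physically stored as an int of days since 1970-01-01."""

class DayTransform:
    def result_type(self, source) -> "DateType":
        # DateType rather than IntegerType: the physical value is the
        # same int either way, but engines render it as a readable date.
        return DateType()

    def transform(self, value: date) -> int:
        # The actual transformed value is still the spec's int ordinal.
        return (value - date(1970, 1, 1)).days

t = DayTransform()
print(type(t.result_type(None)).__name__)  # DateType
print(t.transform(date(1970, 1, 2)))       # 1
```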

@kevinzwang (Contributor, Author)

Ah ok, are we choosing to conform to the Java library behavior? In that case, I will close this PR, and we could make another one that improves the documentation for result_type

@kevinjqliu (Contributor)

Ah ok, are we choosing to conform to the Java library behavior?

yep! Both the Java and Python libraries behave according to the spec. As Ryan mentioned on the devlist, the DateType result_type for DayTransform is a nice-to-have feature so that the engine displays the day partition as a date instead of an int.

In that case, I will close this PR, and we could make another one that improves the documentation for result_type

WDYT of repurposing this PR for documentation? The comments on this PR are helpful for future context.

@kevinzwang (Contributor, Author)

Personally I would prefer to close this and then create a different PR that refers to this one, so that I don't have to make a commit to undo the work here and change the name.

@kevinzwang (Contributor, Author)

Made a new PR here: #1211

@kevinzwang closed this Sep 30, 2024
@kevinjqliu (Contributor)

ty!
