fix: DayTransform result type override and docs #1208
Conversation
is this the source of truth?

Yup, precisely
```python
def result_type(self, source: IcebergType) -> IcebergType:
    return DateType()
```
nit: can we be explicit here and do

```python
def result_type(self, source: IcebergType) -> IntegerType:
    return IntegerType()
```
I'm matching the behavior of the other transforms that extend `TimeTransform`, which all just implicitly use `TimeTransform.result_type` instead of overriding it. Should we change this for all of them?
gotcha, thanks for the context. This is fine!
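For context, the inheritance pattern being discussed can be sketched roughly like this. Class names follow pyiceberg, but the bodies are simplified stand-ins, not the actual implementation:

```python
from abc import ABC


# Simplified stand-ins for pyiceberg's type classes (assumption: the real
# classes carry more state and behavior)
class IcebergType: ...
class IntegerType(IcebergType): ...
class DateType(IcebergType): ...


class TimeTransform(ABC):
    """Base class for the time-based transforms (year/month/day/hour)."""

    def result_type(self, source: IcebergType) -> IcebergType:
        # Shared default used by subclasses that don't override it
        return IntegerType()


class MonthTransform(TimeTransform):
    # Implicitly uses TimeTransform.result_type -> IntegerType
    pass


class DayTransform(TimeTransform):
    def result_type(self, source: IcebergType) -> IcebergType:
        # The override under discussion: day transform results are DateType
        return DateType()
```

The question in the thread is whether the override on `DayTransform` should stay, or whether all subclasses should be explicit.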
```diff
 Example:
-    >>> transform = MonthTransform()
+    >>> transform = DayTransform()
```
ty!
Ok so interesting... Spark actually does store day transforms as date type in the metadata, which is why the integration test is failing. This is probably why this library had that behavior before. So Spark itself does not follow the Iceberg spec. Thoughts on deviating from Spark behavior?

I'm not 100% sure, perhaps the metadata table does the transformation.

I think you are correct. Physically, the integer value stored is days since epoch. But Spark encodes the type of the day transformed partition field in the metadata file as date type. On the other hand, for other transforms Spark conforms to the Iceberg spec. Strange.
I like that it's converted, it's more readable! Do you know where the transform happens? Is it only for the metadata table?

It's more readable, sure, but it just does not conform to the spec. I'm not entirely sure what you mean by your question @kevinjqliu, I believe partition values are only stored in metadata.

The partition value is stored in the metadata as int, but the "partition metadata table" (https://iceberg.apache.org/docs/latest/spark-queries/#partitions) shows the partition data as a "timestamp" instead. This is the difference between Spark and pyiceberg, as seen in the failed test. So perhaps the "partition metadata table" is implemented wrong. To verify, we need to find where the transformation, from int to timestamp, happens on the Spark side.
I played around with Spark and inspected the generated metadata. I believe the partition value is actually stored in the metadata as date type for day transformed partitions. The reason we are getting the right value, just as an integer, in the test is that the physical type of date type is integer, and pyiceberg coerces the metadata table to its expected types. Basically, Spark stores day transformed partition values incorrectly in the metadata. However, since the underlying data of integer "days since epoch" and date are the same, pyiceberg and Spark should still be compatible. It's just that the metadata tables will not be equal when read due to the type difference. I would say the solution is just to coerce the types in this test so that they match.
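The point that the two encodings carry identical values can be illustrated with the standard library alone (this is not pyiceberg code, just a sketch of the spec's day transform):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def day_transform(value: date) -> int:
    """Day transform per the Iceberg spec: days since the Unix epoch."""
    return (value - EPOCH).days


def as_date(days: int) -> date:
    """The logical date Spark effectively encodes for the same partition value."""
    return EPOCH + timedelta(days=days)


d = date(2024, 3, 1)
days = day_transform(d)
assert as_date(days) == d  # same physical value, different logical type
```

Round-tripping always succeeds, which is why the two libraries stay compatible even while disagreeing on the declared type.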
That's an interesting find... The core Iceberg library is using `DateType`; in fact, it's the only "time-based" transform which uses this. Let me bring this up on the devlist. In the meantime, I think we can either…

I don't want to add a workaround in the tests as it will become difficult to maintain. WYDT?
Perhaps let's just wait on the response on the devlist first.
Turns out, removing `result_type` from `DayTransform` is the cause of all our woes in this PR.

The function name `result_type` is an overloaded term and doesn't correspond one-to-one with the "Result type" described in the spec.

The `result_type` for `DayTransform` should be `DateType`. This conforms to the spec since `DateType` is converted to `int` here.

The `DateType` result_type is needed since it is mapped to the pyarrow `DateType` here, which corresponds with the Java library behavior.

Let's add a comment to `result_type` to explain this dynamic for future reference.
Ah ok, are we choosing to conform to the Java library behavior? In that case, I will close this PR, and we could make another one that improves the documentation for `result_type`.
yep! Both Java and Python libraries behave according to the spec. As Ryan mentioned on the devlist, the…

WDYT of repurposing this PR for documentation? The comments on this PR are helpful for future context.
Personally I would prefer to close this and then create a different PR that refers to this one, so that I don't have to make a commit to undo the work here and change the name.

Made a new PR here: #1211

ty!
`DayTransform.result_type()` incorrectly returns `DateType()`. This just removes that so it has the same behavior as other time transforms, returning an `IntegerType()`. The actual return values of `DayTransform.transform` and `DayTransform.pyarrow_transform` have been correct.