Incorrect value returned for overflow timestamps in micros format for V1 footers #872
Comments
Thanks for the report, I'll get back to you.
Actually, it was not the change in parquet metadata per se; rather, pandas started to support non-ns time units at around the same time this change was made. What pandas version do you have? V2.0 definitely had non-ns units, and they seem to have first come in with pandas 1.5.0.
I hit this with pandas version 2.0.3. The full pip list, in case you need it, is
That PR doesn't work yet, I'll fix it when I can.
It appears that, at least for version 2023.7.0, reading a timestamp in MICROS that is outside the range that nanoseconds can represent in an int64 returns a wrong value if the footer is still in parquet V1 format. It is a very specific corner case. The V1 footer format is used by some older versions of Spark, such as Spark 3.1.1, and currently also by CUDF.
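For context, here is a minimal sketch of the arithmetic behind "out of range for nanos in an int64". It uses only the Python standard library and does not touch fastparquet. The helper names (`to_micros`, `fits_as_int64_nanos`, `wrap_int64`) are made up for illustration, and the wraparound is only an assumption about how a naive micros-to-nanos conversion could go wrong, not a statement of what fastparquet does internally.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1)


def to_micros(dt):
    """Microseconds since the Unix epoch for a naive UTC datetime."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def fits_as_int64_nanos(micros):
    """True if micros * 1000 can be stored in a signed 64-bit integer."""
    return INT64_MIN <= micros * 1000 <= INT64_MAX


def wrap_int64(value):
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


in_range = to_micros(datetime(2000, 1, 1))
out_of_range = to_micros(datetime(3000, 1, 1))

print(fits_as_int64_nanos(in_range))      # True
print(fits_as_int64_nanos(out_of_range))  # False
# A naive int64 conversion of the second value wraps to an unrelated number.
print(wrap_int64(out_of_range * 1000) == out_of_range * 1000)  # False
```

A timestamp such as year 3000 is perfectly valid as int64 microseconds, but it only survives a read if the reader keeps the microsecond unit instead of converting to nanoseconds.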
To reproduce this you can use pyspark to write a file with the following code.
If you do this on Spark 3.1.1, you get a file that fastparquet cannot read correctly. But if you use Spark 3.3.0, fastparquet works just fine.
If I use the parquet command line tool to dump the data, they all come out correctly.
Here are the files for reference.
As a side note NVIDIA/spark-rapids#8778 was the original issue for this.