Reading a Parquet file produced by pyarrow results to corrupted data read #849
Comments
Hi Martin. I just tested with
Also @martindurant I do not have a button to re-open the issue, so I kindly ask you to reopen this.
ran just now with the linked file and fastparquet at main (8e9d419, 16 commits since last release 2023.8.0).
@martindurant Thank you for verifying this for me. I will clarify the tests to repeat and also check on another computer whether this is somehow related to my local development environment.
Thank you, it was indeed a local error (I was using an old Git checkout instead of the latest version).
Describe the issue:
I have a Parquet file written with pyarrow that fastparquet cannot load properly. The read returns the correct number of rows, but the data seems to be corrupted.
I have uploaded the 54 MB file to Google Drive so it can be downloaded for debugging. The data is public.
In the verifiable example we compare one column of the DataFrame as read through fastparquet and through pyarrow, and see that the fastparquet read is missing a lot of data.
I would expect pyarrow and fastparquet to return the same data for the same Parquet file, unless I am missing something that I could not spot in the documentation.
Minimal Complete Verifiable Example:
Anything else we need to know?:
This Parquet file uses row groups.
If the DataFrame rows are iterated manually, the first mismatch between fastparquet and pyarrow seems to be at row 256, a nice round number.
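For anyone who wants to repeat the row-by-row comparison, here is a minimal sketch of how the first mismatching row could be located. It is not the original example from this issue: the path `example.parquet`, the column name `some_column`, and the helpers `first_mismatch` and `load_column` are hypothetical names used only for illustration, and the sketch assumes the column holds scalar values.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking "one side ran out of rows"


def _same(a, b):
    """Treat two NaN values as equal; otherwise use ordinary equality."""
    if a != a and b != b:  # NaN is not equal to itself
        return True
    return a == b


def first_mismatch(left, right):
    """Return the index of the first differing element, or None if all match."""
    for i, (a, b) in enumerate(zip_longest(left, right, fillvalue=_MISSING)):
        if a is _MISSING or b is _MISSING or not _same(a, b):
            return i
    return None


def load_column(path, column, engine):
    """Read one column with the given pandas engine ('pyarrow' or 'fastparquet')."""
    import pandas as pd  # imported lazily; first_mismatch works without pandas

    return pd.read_parquet(path, engine=engine, columns=[column])[column].tolist()


# Hypothetical usage (path and column name are placeholders):
# idx = first_mismatch(
#     load_column("example.parquet", "some_column", "pyarrow"),
#     load_column("example.parquet", "some_column", "fastparquet"),
# )
# print("first mismatch at row:", idx)
```

If both engines read the file identically, `first_mismatch` returns `None`; otherwise the returned index can be compared against the row group boundaries of the file.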
Environment: