Reading a Parquet file produced by pyarrow results to corrupted data read #849

Closed
miohtama opened this issue Jan 31, 2023 · 5 comments · Fixed by #850

Comments

@miohtama

Describe the issue:

I have a Parquet file written with pyarrow that fastparquet cannot load properly. The read returns the correct number of rows, but the data appears to be corrupted.

I have uploaded the 54 MB file to Google Drive so it can be downloaded for debugging. The data is public.

In the verifiable example below, we compare one column of the DataFrame as read through fastparquet and through pyarrow, and see that the fastparquet read is missing a lot of data.

I would expect pyarrow and fastparquet to return the same data for the same Parquet file, unless I am missing something that I could not spot in the documentation.

Minimal Complete Verifiable Example:

pip install pyarrow
pip install fastparquet
"""Compare pyarrow and fastparquet files.

- Use the sample Parquet file (54MB): https://drive.google.com/file/d/1LtiL-n50cNx1JMBC0QE-Ztvf8ENSb-OE/view?usp=sharing
  - save as "/tmp/candles-30d.parquet"

- We found out that the contents of the pair_id column are corrupted
  (there is other corruption if you do a row-by-row comparison, but this is easy to spot)

- One potential cause is usage of row groups in the file

- The file has been written with pyarrow / Python


"""
from pyarrow import parquet as pq  # pyarrow 10
from fastparquet import ParquetFile  # fastparquet  2023.1.0

path = "/tmp/candles-30d.parquet"

pf1 = ParquetFile(path)
pf2 = pq.read_table(path)

df1 = pf1.to_pandas()
df2 = pf2.to_pandas()

print("Rows ", len(df2))
assert len(df1) == len(df2)  # Passes, looks like row count matches

# fastparquet only sees 1280 pair_ids out of 150903
print("Unique pairs fastparquet", len(df1.pair_id.unique()))
print("Unique pairs pyarrow", len(df2.pair_id.unique()))
assert len(df1.pair_id.unique()) == len(df2.pair_id.unique())

Anything else we need to know?:

This Parquet file uses row groups.

If the DataFrame rows are iterated manually, the first mismatch between fastparquet and pyarrow seems to be at row 256, a suspiciously round number.
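
The manual iteration can be replaced with a vectorised comparison. This is a rough sketch, not the code originally used: `first_mismatch` is a hypothetical helper, and the two small frames stand in for `df1` and `df2` from the example above.

```python
import pandas as pd


def first_mismatch(left, right, column):
    """Return the position of the first row where `column` differs, or None.

    Hypothetical helper for illustration. Rows are compared by position,
    and a missing value on both sides counts as a match.
    """
    a = left[column].reset_index(drop=True)
    b = right[column].reset_index(drop=True)
    if len(a) != len(b):
        raise ValueError("frames have different lengths")
    differs = ~((a == b) | (a.isna() & b.isna()))
    if not differs.any():
        return None
    # argmax on a boolean array gives the position of the first True.
    return int(differs.to_numpy().argmax())


# Toy data standing in for the pyarrow and fastparquet reads.
good = pd.DataFrame({"pair_id": [1, 2, 3, 4]})
bad = pd.DataFrame({"pair_id": [1, 2, 9, 9]})
print(first_mismatch(good, bad, "pair_id"))
```

With the real data this would be called as `first_mismatch(df1, df2, "pair_id")`.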

Environment:

  • Python version: 3.10
  • Operating System: macOS M1
  • Install method: pip
@miohtama
Author

miohtama commented Oct 16, 2023

Hi Martin. I just tested with fastparquet 2023.8.0. It looks like this issue still persists when following the steps above to reproduce.

@miohtama
Author

Also, @martindurant, I do not have a button to reopen the issue, so I kindly ask you to reopen it.

@martindurant
Member

In [4]: from pyarrow import parquet as pq  # pyarrow 10
   ...: from fastparquet import ParquetFile  # fastparquet  2023.1.0
   ...:
   ...: path = "/Users/mdurant/Downloads/candles-30d.parquet"
   ...:
   ...: pf1 = ParquetFile(path)
   ...: pf2 = pq.read_table(path)
   ...:
   ...: df1 = pf1.to_pandas()
   ...: df2 = pf2.to_pandas()
   ...:
   ...: print("Rows ", len(df2))
   ...: assert len(df1) == len(df2)  # Passes, looks like row count matches
   ...:
   ...: # fastparquet only sees 1280 pair_ids out of 150903
   ...: print("Unique pairs fastparquet", len(df1.pair_id.unique()))
   ...: print("Unique pairs pyarrow", len(df2.pair_id.unique()))
   ...: assert len(df1.pair_id.unique()) == len(df2.pair_id.unique())
Rows  812183
Unique pairs fastparquet 150903
Unique pairs pyarrow 150903

Ran just now with the linked file and fastparquet at main (8e9d419, 16 commits since the last release, 2023.8.0).

@miohtama
Author

@martindurant Thank you kindly for verifying this for me. I will clarify the steps to reproduce and also check on another computer whether it is somehow related to the local development environment.

@miohtama
Author

Thank you, it was indeed a local error (I was using an old Git checkout instead of the latest version).
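
For anyone hitting something similar, a quick way to rule out a stale local install is to print the version and import location that Python actually resolves. This is a small sketch using only the standard library; `installed_version` and `import_origin` are hypothetical helper names, and it assumes the stale code is reachable as an importable package.

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(package):
    """Return the installed distribution version, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def import_origin(module):
    """Return the file a top-level module would be imported from, or None."""
    spec = find_spec(module)
    return spec.origin if spec is not None else None


for name in ("fastparquet", "pyarrow"):
    # A path inside a Git checkout instead of site-packages hints at a stale copy.
    print(name, installed_version(name), import_origin(name))
```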
