Reading a Parquet file produced by pyarrow results to corrupted data read #849

miohtama · 2023-01-31T00:04:48Z

Describe the issue:

I have a Parquet file written with pyarrow that fastparquet cannot load properly. The read returns the correct number of rows, but the data appears to be corrupted.

I have uploaded the 54 MB file to Google Drive so that it can be downloaded for debugging. The data is public.

In the verifiable example below, we compare one column of the DataFrame as read through fastparquet and through pyarrow, and see that the fastparquet read is missing a lot of data.

I would expect pyarrow and fastparquet to return the same data for the same Parquet file, unless I am missing something that I could not spot in the documentation.

Minimal Complete Verifiable Example:

pip install pyarrow
pip install fastparquet

"""Compare pyarrow and fastparquet files.

- Use the sample Parquet file (54MB): https://drive.google.com/file/d/1LtiL-n50cNx1JMBC0QE-Ztvf8ENSb-OE/view?usp=sharing
  - save as "/tmp/candles-30d.parquet"

- We found that the rows and the contents of the pair_id column are corrupted
  (among other corruption if you do a row-by-row comparison, but this one is easy to spot)

- One potential cause is usage of row groups in the file

- The file has been written with pyarrow / Python


"""
from pyarrow import parquet as pq  # pyarrow 10
from fastparquet import ParquetFile  # fastparquet  2023.1.0

path = "/tmp/candles-30d.parquet"

pf1 = ParquetFile(path)
pf2 = pq.read_table(path)

df1 = pf1.to_pandas()
df2 = pf2.to_pandas()

print("Rows ", len(df2))
assert len(df1) == len(df2)  # Passes, looks like row count matches

# fastparquet only sees 1280 pair_ids out of 150903
print("Unique pairs fastparquet", len(df1.pair_id.unique()))
print("Unique pairs pyarrow", len(df2.pair_id.unique()))
assert len(df1.pair_id.unique()) == len(df2.pair_id.unique())

Anything else we need to know?:

This Parquet file uses row groups.

If the DataFrame rows are iterated manually, the first mismatch between fastparquet and pyarrow seems to be at row 256, a suspiciously round number.
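The manual row-by-row comparison mentioned above can be sketched roughly as follows. This is only a minimal illustration: `first_mismatch` and `_same` are hypothetical helper names, not part of fastparquet or pyarrow, and the two lists are a tiny synthetic stand-in for the real columns.

```python
import math


def _same(a, b):
    """Treat two NaN floats as equal; otherwise use ordinary equality."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def first_mismatch(left, right):
    """Return the position of the first differing element, or None if all match.

    Only the overlapping prefix is compared; lengths are checked separately.
    """
    for pos, (a, b) in enumerate(zip(left, right)):
        if not _same(a, b):
            return pos
    return None


# Synthetic stand-in data (an assumption, not the real file contents).
expected = list(range(300))
observed = [v % 256 for v in expected]
print(first_mismatch(expected, observed))
```

With the script above, the same helper could presumably be called as `first_mismatch(df1.pair_id, df2.pair_id)`, since iterating a pandas Series yields its values in row order.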

Environment:

Python version: 3.10
Operating System: macOS M1
Install method: pip

miohtama · 2023-10-16T11:00:30Z

Hi Martin. I just tested with fastparquet 2023.8.0. It looks like the issue still persists when following the reproduction steps above.

miohtama · 2023-10-16T11:04:38Z

Also @martindurant I do not have a button to re-open the issue, so I kindly ask you to reopen this.

martindurant · 2023-10-16T13:24:49Z

In [4]: from pyarrow import parquet as pq  # pyarrow 10
   ...: from fastparquet import ParquetFile  # fastparquet  2023.1.0
   ...:
   ...: path = "/Users/mdurant/Downloads/candles-30d.parquet"
   ...:
   ...: pf1 = ParquetFile(path)
   ...: pf2 = pq.read_table(path)
   ...:
   ...: df1 = pf1.to_pandas()
   ...: df2 = pf2.to_pandas()
   ...:
   ...: print("Rows ", len(df2))
   ...: assert len(df1) == len(df2)  # Passes, looks like row count matches
   ...:
   ...: # fastparquet only sees 1280 pair_ids out of 150903
   ...: print("Unique pairs fastparquet", len(df1.pair_id.unique()))
   ...: print("Unique pairs pyarrow", len(df2.pair_id.unique()))
   ...: assert len(df1.pair_id.unique()) == len(df2.pair_id.unique())
Rows  812183
Unique pairs fastparquet 150903
Unique pairs pyarrow 150903

I ran this just now with the linked file and fastparquet at main (8e9d419, 16 commits since the last release, 2023.8.0).

miohtama · 2023-10-16T19:03:36Z

@martindurant Thank you kindly for verifying this for me. I will clarify the reproduction steps and also check on another computer whether this is somehow related to my local development environment.

miohtama · 2023-10-16T21:13:17Z

Thank you, it was indeed a local error (I was using an old Git checkout instead of the latest version).

martindurant mentioned this issue Jan 31, 2023

Use bigger int type for dereferencing dicts in V2 #850

Merged

martindurant closed this as completed in #850 Jan 31, 2023
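The title of the linked fix mentions using a bigger integer type when dereferencing dictionaries. As a toy illustration only, and not taken from fastparquet's code, the sketch below shows how truncating dictionary indices to 8 bits could cap the number of distinct decoded values and put the first wrong row at position 256. The `decode` function, the `pair-N` values, and the bit widths are all assumptions made for this example.

```python
def decode(dictionary, indices, index_bits):
    """Look up each index in the dictionary after truncating it to index_bits bits."""
    mask = (1 << index_bits) - 1
    return [dictionary[i & mask] for i in indices]


# Hypothetical column: 1000 distinct values, one per row.
dictionary = [f"pair-{n}" for n in range(1000)]
indices = list(range(1000))

wide = decode(dictionary, indices, 32)   # indices survive intact
narrow = decode(dictionary, indices, 8)  # indices wrap around at 256

# Position of the first row where the two decodings disagree.
first_bad = next(i for i, (a, b) in enumerate(zip(wide, narrow)) if a != b)
print(len(set(wide)), len(set(narrow)), first_bad)
```

In this toy case the row count is unchanged while most distinct values disappear, which resembles the symptom reported in this issue. Whether this matches the actual bug is for the linked pull request to confirm.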

miohtama mentioned this issue Oct 16, 2023

Switch to FastParquet as the default Parquet file reading backend tradingstrategy-ai/trading-strategy#123

Open
