Incorrect value returned for overflow timestamps in micros format for V1 footers #872

Closed
revans2 opened this issue Jul 26, 2023 · 4 comments · Fixed by #874

@revans2

revans2 commented Jul 26, 2023

At least in version 2023.7.0, fastparquet returns a wrong value when reading a timestamp stored in MICROS that falls outside the range nanoseconds can represent in an int64, if the file's footer is still in parquet V1 format. It is a very specific corner case. The V1 footer format is used by some older versions of Spark, such as Spark 3.1.1, and currently also by CUDF.

To reproduce this, you can use PySpark to write a file with the following code.
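
For context on "the range for what nanos can hold in an int64", here is a minimal sketch using only the Python standard library. The helper name `fits_in_int64_nanos` is hypothetical and is not part of fastparquet or pandas; the timestamp is the UTC value shown by the parquet-tools dump in this report.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def fits_in_int64_nanos(ts):
    """Return True if ts, as nanoseconds since the epoch, fits in an int64.

    Hypothetical helper for illustration only.
    """
    micros = (ts - EPOCH) // datetime.timedelta(microseconds=1)
    return INT64_MIN <= micros * 1000 <= INT64_MAX


# Latest instant (truncated to microseconds) that int64 nanoseconds can hold.
latest = EPOCH + datetime.timedelta(microseconds=INT64_MAX // 1000)
print(latest)

# The timestamp from this report lies far beyond that limit.
print(fits_in_int64_nanos(datetime.datetime(3023, 7, 14, 12, 38, 45, 418688)))
```

A year-3023 timestamp is representable in microseconds but not in nanoseconds, which is why any reader that converts MICROS to nanoseconds has to handle this case explicitly.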

import datetime
my_path = "..."
df = spark.createDataFrame([(datetime.datetime(3023, 7, 14, 7, 38, 45, 418688),)], 'ts timestamp')
df.write.mode("overwrite").parquet(my_path)
spark.read.parquet(my_path).show(truncate = False)
+--------------------------+
|ts                        |
+--------------------------+
|3023-07-14 07:38:45.418688|
+--------------------------+

If you do this on Spark 3.1.1, you get a file that fastparquet cannot read correctly. But if you use Spark 3.3.0, fastparquet works just fine.

import fastparquet
cpu_file = fastparquet.ParquetFile("CPU_311/part-00000-9fcbe985-36aa-4765-86a0-47a2c6cc4926-c000.snappy.parquet")
cpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
gpu_file = fastparquet.ParquetFile("CUDF/part-00000-98db0b25-66e1-48c2-91bd-ac78f2ac30ee-c000.snappy.parquet")
gpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
newer_cpu_file = fastparquet.ParquetFile("CPU_330/part-00000-7aaa467a-aa1b-43db-8102-c604b9c04862-c000.snappy.parquet")
newer_cpu_file.head(1)
                          ts
0 3023-07-14 12:38:45.418688
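
The wrong value above is arithmetically consistent with the microsecond count being scaled to nanoseconds and wrapping around in a signed 64-bit integer. The sketch below reproduces that arithmetic with the standard library only. It is an assumption about the cause, not a statement of what fastparquet's code does, and the function name `wrap_micros_to_int64_nanos` is hypothetical.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def wrap_micros_to_int64_nanos(ts):
    """Scale ts to nanoseconds, wrap to int64, and format the result.

    Hypothetical helper: emulates two's-complement wraparound with
    Python's arbitrary-precision integers.
    """
    micros = (ts - EPOCH) // datetime.timedelta(microseconds=1)
    nanos = micros * 1000
    # Reduce modulo 2**64 into the signed range [-2**63, 2**63).
    wrapped = (nanos + 2**63) % 2**64 - 2**63
    # Floor division keeps the fractional part non-negative for dates before 1970.
    secs, frac_ns = divmod(wrapped, 10**9)
    wrapped_dt = EPOCH + datetime.timedelta(seconds=secs)
    return f"{wrapped_dt:%Y-%m-%d %H:%M:%S}.{frac_ns:09d}"


# UTC value stored in the file, as shown by the parquet-tools dump.
stored = datetime.datetime(3023, 7, 14, 12, 38, 45, 418688)
print(wrap_micros_to_int64_nanos(stored))  # 1854-06-04 13:29:37.999584768
```

The printed value matches what fastparquet returns for the Spark 3.1.1 and CUDF files, which suggests the data pages are decoded correctly and the error comes from the unit conversion.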

If I use the parquet command line tool to dump the data, they all come out correctly.

$ java -jar ./target/parquet-tools-deprecated-1.12.2.jar dump CPU_311/*.parquet
row group 0 
--------------------------------------------------------------------------------
ts:  INT64 SNAPPY DO:0 FPO:4 SZ:77/75/0.97 VC:1 ENC:PLAIN,BIT_PACKED,RLE [more]...

    ts TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: 3023-07-14T12:3 [more]... SZ:14

INT64 ts 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 *** 
value 1: R:0 D:1 V:3023-07-14T12:38:45.418688+0000
$ java -jar ./target/parquet-tools-deprecated-1.12.2.jar dump CUDF/*.parquet
row group 0 
--------------------------------------------------------------------------------
ts:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:31/31/1.00 VC:1 ENC:PLAIN,RLE ST [more]...

    ts TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] CRC:[none] [more]...

INT64 ts 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 *** 
value 1: R:0 D:1 V:3023-07-14T12:38:45.418688+0000
$ java -jar ./target/parquet-tools-deprecated-1.12.2.jar dump CPU_330/*.parquet
row group 0 
--------------------------------------------------------------------------------
ts:  INT64 SNAPPY DO:0 FPO:4 SZ:39/37/0.95 VC:1 ENC:BIT_PACKED,PLAIN,RLE [more]...

    ts TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] [more]... VC:1

INT64 ts 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 *** 
value 1: R:0 D:1 V:3023-07-14T12:38:45.418688+0000

Here are the files for reference.

As a side note, NVIDIA/spark-rapids#8778 was the original issue for this.

@martindurant
Member

Thanks for the report, I'll get back to you.

@martindurant
Member

Actually, it was not the change in parquet metadata per se; rather, pandas started to support non-ns time units at around the same time this change was made. What pandas version do you have? V2.0 definitely had non-ns units, and they seem to have first come in with pandas 1.5.0.

@revans2
Author

revans2 commented Jul 31, 2023

I hit this with pandas version 2.0.3.

The full pip list, in case you need it, is:

Package           Version
----------------- --------
asttokens         2.2.1
backcall          0.2.0
cfgv              3.3.1
comm              0.1.3
cramjam           2.6.2
debugpy           1.6.7
decorator         5.1.1
distlib           0.3.7
exceptiongroup    1.1.2
execnet           2.0.2
executing         1.2.0
fastparquet       2023.7.0
filelock          3.12.2
findspark         2.0.1
fsspec            2023.6.0
identify          2.5.25
iniconfig         2.0.0
ipykernel         6.24.0
ipython           8.14.0
jedi              0.18.2
jupyter_client    8.3.0
jupyter_core      5.3.1
matplotlib-inline 0.1.6
nest-asyncio      1.5.6
nodeenv           1.8.0
numpy             1.25.1
packaging         23.1
pandas            2.0.3
parso             0.8.3
pexpect           4.8.0
pickleshare       0.7.5
pip               22.0.2
platformdirs      3.9.1
pluggy            1.2.0
pre-commit        3.3.3
prompt-toolkit    3.0.39
psutil            5.9.5
ptyprocess        0.7.0
pure-eval         0.2.2
pyarrow           12.0.1
Pygments          2.15.1
pytest            7.4.0
pytest-xdist      3.3.1
python-dateutil   2.8.2
pytz              2023.3
PyYAML            6.0.1
pyzmq             25.1.0
setuptools        59.6.0
six               1.16.0
sre-yield         1.2
stack-data        0.6.2
tomli             2.0.1
tornado           6.3.2
traitlets         5.9.0
tzdata            2023.3
virtualenv        20.24.1
wcwidth           0.2.6

@martindurant
Member

That PR doesn't work yet, I'll fix it when I can.
