Incorrect value returned for overflow timestamps in micros format for V1 footers #872

revans2 · 2023-07-26T14:45:36Z

At least in version 2023.7.0, reading a TIMESTAMP_MICROS value that falls outside the range an int64 count of nanoseconds can hold returns a wrong value if the file's footer is still in the Parquet V1 format. This is a very specific corner case, but the V1 footer format is written by some older versions of Spark, such as Spark 3.1.1, and currently also by CUDF.

To reproduce this you can use pyspark to write a file with the following code.

import datetime
my_path = "..."
df = spark.createDataFrame([(datetime.datetime(3023, 7, 14, 7, 38, 45, 418688),)], 'ts timestamp')
df.write.mode("overwrite").parquet(my_path)
spark.read.parquet(my_path).show(truncate = False)
+--------------------------+
|ts                        |
+--------------------------+
|3023-07-14 07:38:45.418688|
+--------------------------+

If you do this on Spark 3.1.1, you get a file that fastparquet cannot read correctly. If you use Spark 3.3.0, fastparquet reads the file just fine.

import fastparquet
cpu_file = fastparquet.ParquetFile("CPU_311/part-00000-9fcbe985-36aa-4765-86a0-47a2c6cc4926-c000.snappy.parquet")
cpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
gpu_file = fastparquet.ParquetFile("CUDF/part-00000-98db0b25-66e1-48c2-91bd-ac78f2ac30ee-c000.snappy.parquet")
gpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
newer_cpu_file = fastparquet.ParquetFile("CPU_330/part-00000-7aaa467a-aa1b-43db-8102-c604b9c04862-c000.snappy.parquet")
newer_cpu_file.head(1)
                          ts
0 3023-07-14 12:38:45.418688

If I use the parquet command line tool to dump the data, they all come out correctly.

$ java -jar ./target/parquet-tools-deprecated-1.12.2.jar dump CPU_311/*.parquet
row group 0 
--------------------------------------------------------------------------------
ts:  INT64 SNAPPY DO:0 FPO:4 SZ:77/75/0.97 VC:1 ENC:PLAIN,BIT_PACKED,RLE [more]...

    ts TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: 3023-07-14T12:3 [more]... SZ:14

INT64 ts 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 *** 
value 1: R:0 D:1 V:3023-07-14T12:38:45.418688+0000
$ java -jar ./target/parquet-tools-deprecated-1.12.2.jar dump CUDF/*.parquet
row group 0 
--------------------------------------------------------------------------------
ts:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:31/31/1.00 VC:1 ENC:PLAIN,RLE ST [more]...

    ts TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] CRC:[none] [more]...

INT64 ts 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 *** 
value 1: R:0 D:1 V:3023-07-14T12:38:45.418688+0000
$ java -jar ./target/parquet-tools-deprecated-1.12.2.jar dump CPU_330/*.parquet
row group 0 
--------------------------------------------------------------------------------
ts:  INT64 SNAPPY DO:0 FPO:4 SZ:39/37/0.95 VC:1 ENC:BIT_PACKED,PLAIN,RLE [more]...

    ts TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] [more]... VC:1

INT64 ts 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 *** 
value 1: R:0 D:1 V:3023-07-14T12:38:45.418688+0000
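
The 1854 value that fastparquet returns is consistent with the stored microsecond count being multiplied by 1000 and wrapping around in a signed 64-bit integer. The following is only a sketch of that arithmetic using the standard library, not fastparquet's actual code path; `wrap_int64` is a hypothetical helper introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def wrap_int64(value):
    """Reduce a Python int to what a signed 64-bit integer would hold (hypothetical helper)."""
    return (value + 2**63) % 2**64 - 2**63


# The UTC value reported by the parquet-tools dump above.
true_ts = datetime(3023, 7, 14, 12, 38, 45, 418688, tzinfo=timezone.utc)

# Microseconds since the epoch, which is what TIMESTAMP_MICROS stores.
micros = (true_ts - EPOCH) // timedelta(microseconds=1)

# Converting to nanoseconds needs more than 63 bits, so an int64 wraps around.
nanos_wrapped = wrap_int64(micros * 1000)

# datetime only has microsecond resolution, so floor the wrapped nanoseconds.
wrapped_ts = EPOCH + timedelta(microseconds=nanos_wrapped // 1000)

print(micros)
print(nanos_wrapped)
print(wrapped_ts)
```

The last line prints a timestamp of 1854-06-04 13:29:37.999584 UTC, and the remaining 768 nanoseconds that `datetime` cannot hold complete the `.999584768` seen in the fastparquet output.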

Here are the files for reference.

As a side note, NVIDIA/spark-rapids#8778 was the original issue for this.

martindurant · 2023-07-31T13:48:31Z

Thanks for the report, I'll get back to you.

martindurant · 2023-07-31T14:34:49Z

Actually, it was not the change in parquet metadata per se; rather, pandas started to support non-ns time units at around the same time this change was made. What pandas version do you have? V2.0 definitely had non-ns units, and support seems to have first come in with pandas 1.5.0.
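
To make the unit question concrete, here is a rough standard-library sketch (not pandas or fastparquet code) comparing how far an int64 reaches when it counts nanoseconds versus microseconds. The year length used for the microsecond estimate is an averaged Gregorian year, so that figure is approximate.

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Latest instant an int64 count of nanoseconds can represent
# (truncated to microseconds, the resolution of datetime).
ns_limit = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

# Approximate number of years after 1970 reachable with int64 microseconds.
SECONDS_PER_YEAR = 365.2425 * 86400  # averaged Gregorian year (approximation)
us_limit_years = INT64_MAX / 1_000_000 / SECONDS_PER_YEAR

# The timestamp from the report, in UTC.
target = datetime(3023, 7, 14, 12, 38, 45, 418688, tzinfo=timezone.utc)

print("ns limit:", ns_limit)
print("target beyond ns limit:", target > ns_limit)
print("approx. years covered by int64 microseconds:", round(us_limit_years))
```

The nanosecond limit falls in the year 2262, well before 3023, while a microsecond count covers hundreds of thousands of years, which is why keeping the column in microsecond units avoids the overflow.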

revans2 · 2023-07-31T14:41:35Z

I hit this with pandas 2.0.3.

The full pip list, in case you need it:

Package           Version
----------------- --------
asttokens         2.2.1
backcall          0.2.0
cfgv              3.3.1
comm              0.1.3
cramjam           2.6.2
debugpy           1.6.7
decorator         5.1.1
distlib           0.3.7
exceptiongroup    1.1.2
execnet           2.0.2
executing         1.2.0
fastparquet       2023.7.0
filelock          3.12.2
findspark         2.0.1
fsspec            2023.6.0
identify          2.5.25
iniconfig         2.0.0
ipykernel         6.24.0
ipython           8.14.0
jedi              0.18.2
jupyter_client    8.3.0
jupyter_core      5.3.1
matplotlib-inline 0.1.6
nest-asyncio      1.5.6
nodeenv           1.8.0
numpy             1.25.1
packaging         23.1
pandas            2.0.3
parso             0.8.3
pexpect           4.8.0
pickleshare       0.7.5
pip               22.0.2
platformdirs      3.9.1
pluggy            1.2.0
pre-commit        3.3.3
prompt-toolkit    3.0.39
psutil            5.9.5
ptyprocess        0.7.0
pure-eval         0.2.2
pyarrow           12.0.1
Pygments          2.15.1
pytest            7.4.0
pytest-xdist      3.3.1
python-dateutil   2.8.2
pytz              2023.3
PyYAML            6.0.1
pyzmq             25.1.0
setuptools        59.6.0
six               1.16.0
sre-yield         1.2
stack-data        0.6.2
tomli             2.0.1
tornado           6.3.2
traitlets         5.9.0
tzdata            2023.3
virtualenv        20.24.1
wcwidth           0.2.6

martindurant · 2023-08-01T15:08:43Z

That PR doesn't work yet, I'll fix it when I can.

revans2 mentioned this issue Jul 26, 2023

[BUG] GPU Parquet output for TIMESTAMP_MICROS is misinteterpreted by fastparquet as nanos NVIDIA/spark-rapids#8778

Closed

martindurant mentioned this issue Jul 31, 2023

Use non-ns units on timestamps declared the old way #874

Merged

martindurant closed this as completed in #874 Aug 17, 2023
