Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
Occasionally a Parquet file is written without its final byte, so the file ends in PAR instead of the 4-byte magic PAR1. Attempting to read such a file fails with an EOFException.
$ hexdump -C good.snappy.parquet| tail -n 10
004fff70 6b 2e 6c 65 67 61 63 79 44 61 74 65 54 69 6d 65 |k.legacyDateTime|
004fff80 18 00 00 18 4a 70 61 72 71 75 65 74 2d 6d 72 20 |....Jparquet-mr |
004fff90 76 65 72 73 69 6f 6e 20 31 2e 31 32 2e 33 20 28 |version 1.12.3 (|
004fffa0 62 75 69 6c 64 20 66 38 64 63 65 64 31 38 32 63 |build f8dced182c|
004fffb0 34 63 31 66 62 64 65 63 36 63 63 62 33 31 38 35 |4c1fbdec6ccb3185|
004fffc0 35 33 37 62 35 61 30 31 65 36 65 64 36 62 29 19 |537b5a01e6ed6b).|
004fffd0 dc 1c 00 00 1c 00 00 1c 00 00 1c 00 00 1c 00 00 |................|
004fffe0 1c 00 00 1c 00 00 1c 00 00 1c 00 00 1c 00 00 1c |................|
004ffff0 00 00 1c 00 00 1c 00 00 00 e7 0f 00 00 50 41 52 |.............PAR|
00500000
Possibly related: we see this issue only on GCP, not on AWS. On GCP we do random disk seeks, and on AWS we do sequential disk seeks.
Rerunning a job that wrote a corrupt Parquet file succeeds the second time, so the failure appears to be nondeterministic.
This is on version 1.14.3.
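For anyone trying to spot affected files before a reader hits the EOFException, the trailing magic can be checked directly. The following is a minimal sketch, not part of parquet-java: the helper name `check_parquet_tail` and its return strings are hypothetical, and it assumes an unencrypted file whose footer ends with the PAR1 magic.

```python
import os

# 4-byte magic that an unencrypted Parquet file carries at both its start and its end.
PARQUET_MAGIC = b"PAR1"


def check_parquet_tail(path):
    """Return a short diagnosis string based on the last four bytes of the file.

    Hypothetical helper for illustration only.
    """
    size = os.path.getsize(path)
    # Smallest plausible file: header magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        return "too short"
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    if tail == PARQUET_MAGIC:
        return "ok"
    # The case in this report: the final '1' is missing, so the file ends in "PAR".
    if tail[1:] == PARQUET_MAGIC[:3]:
        return "truncated: last byte missing"
    return "bad magic"
```

Running a check like this over a job's output right after the write might help correlate the truncated files with the storage backend, though it only detects the symptom and says nothing about the cause.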
Component(s)
No response