Long overflow when Iceberg reads an INT96 timestamp column from a Spark parquet table #8949
@yabola @aokolnychyi @nastra please advise.
It sounds like there is currently no support for INT96 in the non-vectorized code path; #6962 only added support for vectorized reads. According to the description in #6962, vectorized reads of INT96 columns should work, so this may need to be properly tested to see what exactly is missing.
As per the Parquet format encoding spec, I've verified the timestamp column has both encodings.
Hi @manuzhang, please correct me if I understand wrong: your error case is vectorized reading, and the timestamp INT96 column is not using dictionary encodings?
@yabola The timestamp INT96 column is using a dictionary, with encodings
If you check Spark's VectorizedColumnReader, the encoding is checked for each page.
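The per-page check matters because Parquet allows a writer to fall back from dictionary to plain encoding mid-column (for example, when the dictionary grows past its size limit). A minimal, hypothetical sketch of why a single column-level flag can disagree with individual pages (the class, enum, and method names are illustrative, not Spark's or Iceberg's actual code):

```java
import java.util.List;

// Hypothetical sketch, not Spark's or Iceberg's actual code.
class PageEncodings {
    enum Encoding { PLAIN, PLAIN_DICTIONARY }

    // A decision made once per column chunk, which is roughly what a flag
    // like isVectorDictEncoded amounts to: it yields only one answer.
    static boolean columnLevelDict(List<Encoding> pageEncodings) {
        return pageEncodings.get(0) == Encoding.PLAIN_DICTIONARY;
    }

    // A per-page decision, which is what checking the encoding on every
    // page (as VectorizedColumnReader does) enables.
    static boolean pageIsDict(Encoding encoding) {
        return encoding == Encoding.PLAIN_DICTIONARY;
    }

    public static void main(String[] args) {
        // A writer may start with a dictionary and fall back to PLAIN
        // once the dictionary exceeds its size limit.
        List<Encoding> pages = List.of(Encoding.PLAIN_DICTIONARY, Encoding.PLAIN);
        System.out.println(columnLevelDict(pages));   // prints true
        System.out.println(pageIsDict(pages.get(1))); // prints false
    }
}
```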
@manuzhang Can you provide a way for me to reproduce it locally?
@yabola It's hard to reproduce with simple data samples. I may send you a parquet file via Slack or another channel if possible. Feel free to ping me on Iceberg's Slack.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Apache Iceberg version
1.4.1 (latest release)
Query engine
Spark 3.1.1
Please describe the bug 🐞
When a Spark parquet table is imported into Iceberg via the `add_files` procedure, reading the timestamp column fails with `Error while decoding: java.lang.ArithmeticException: long overflow`. All the overflowing records have negative long values. The underlying parquet file is written by Spark in legacy mode, and the timestamp column is of type INT96.
After digging into it, I find `isVectorDictEncoded` is false in the following code path, and thus `TimestampMicroTzAccessor` is used instead of the expected `DictionaryTimestampInt96Accessor`.
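For background, a Parquet INT96 timestamp is 12 bytes: an 8-byte little-endian nanoseconds-of-day value followed by a 4-byte little-endian Julian day number. The sketch below (the `Int96Decode` class is hypothetical, not Iceberg's accessor code) shows the conversion to epoch microseconds that an INT96-aware accessor must perform; an accessor that instead treats the values as if they were already microseconds produces garbage, which would be consistent with the overflow seen here:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not Iceberg's actual accessor code.
class Int96Decode {
    static final long JULIAN_DAY_OF_EPOCH = 2_440_588L; // Julian day of 1970-01-01
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Decode a 12-byte Parquet INT96 timestamp into microseconds since the epoch.
    static long int96ToMicros(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();             // first 8 bytes
        long julianDay = buf.getInt() & 0xFFFFFFFFL; // last 4 bytes, unsigned
        long daysSinceEpoch = julianDay - JULIAN_DAY_OF_EPOCH;
        // multiplyExact/addExact throw java.lang.ArithmeticException: long overflow
        // when fed out-of-range values, e.g. raw bytes misread by the wrong accessor.
        return Math.addExact(Math.multiplyExact(daysSinceEpoch, MICROS_PER_DAY),
                             nanosOfDay / 1_000L);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(1_000L);   // 1000 ns into the day -> 1 microsecond
        buf.putInt(2_440_589); // one day after the Unix epoch
        System.out.println(int96ToMicros(buf.array())); // prints 86400000001
    }
}
```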