
Long overflow when Iceberg reads an INT96 timestamp column from a Spark parquet table #8949

Closed
manuzhang opened this issue Oct 30, 2023 · 10 comments

manuzhang (Contributor) commented Oct 30, 2023

Apache Iceberg version

1.4.1 (latest release)

Query engine

Spark 3.1.1

Please describe the bug 🐞

When a Spark parquet table is imported into Iceberg via the add_files procedure, reading the timestamp column fails with Error while decoding: java.lang.ArithmeticException: long overflow. All the overflowing records have negative long values.

SELECT $timestamp_column FROM $spark_table limit 1000;

The underlying parquet file is written by Spark in legacy mode, and the timestamp column is of type INT96.
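
For context, a minimal sketch of the setup described above, assuming a Spark session already configured with an Iceberg catalog (the catalog and table names below are illustrative). Note that, as the comments below explain, small samples like this may not trigger the dictionary-to-plain fallback that exposes the overflow.

import org.apache.spark.sql.SparkSession;

public class Int96AddFilesRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("int96-add-files-repro")
        .getOrCreate();

    // Write timestamps in Spark's legacy INT96 parquet representation.
    spark.conf().set("spark.sql.parquet.outputTimestampType", "INT96");
    spark.sql("CREATE TABLE spark_ts_table (ts TIMESTAMP) USING parquet");
    spark.sql("INSERT INTO spark_ts_table SELECT current_timestamp()");

    // The target Iceberg table must exist before add_files imports the files.
    spark.sql("CREATE TABLE my_catalog.db.iceberg_ts_table (ts TIMESTAMP) USING iceberg");
    spark.sql("CALL my_catalog.system.add_files("
        + "table => 'db.iceberg_ts_table', "
        + "source_table => 'spark_ts_table')");

    // Reading the imported timestamp column is what fails with the long overflow.
    spark.sql("SELECT ts FROM my_catalog.db.iceberg_ts_table LIMIT 1000").show();
  }
}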

After digging into it, I found that isVectorDictEncoded is false in the following code path, so TimestampMicroTzAccessor is used instead of the expected DictionaryTimestampInt96Accessor.

  public ArrowVectorAccessor<DecimalT, Utf8StringT, ArrayT, ChildVectorT> getVectorAccessor(
      VectorHolder holder) {
    Dictionary dictionary = holder.dictionary();
    // false for the problematic file, even though parts of the column chunk
    // are dictionary-encoded (see the encoding discussion below)
    boolean isVectorDictEncoded = holder.isDictionaryEncoded();
    FieldVector vector = holder.vector();
    if (isVectorDictEncoded) {
      ColumnDescriptor desc = holder.descriptor();
      PrimitiveType primitive = desc.getPrimitiveType();
      // the expected path, which would pick DictionaryTimestampInt96Accessor
      return getDictionaryVectorAccessor(dictionary, desc, vector, primitive);
    } else {
      // the path actually taken, which picks TimestampMicroTzAccessor
      return getPlainVectorAccessor(vector);
    }
  }
manuzhang (Contributor, Author) commented:

@yabola @aokolnychyi @nastra please kindly advise.

nastra (Contributor) commented Oct 30, 2023

That sounds like there's currently no support for INT96 in the non-vectorized code path; #6962 only added support for vectorized reads.

According to the description in #6962, vectorized reads for INT96 columns should already work, so this may need to be properly tested to see what exactly is missing.

manuzhang (Contributor, Author) commented:

As per the parquet format's encoding specification:

If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding.

I've verified the timestamp column has both encodings, PLAIN_DICTIONARY and PLAIN. It looks like this case is not currently handled.
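
For reference, the column-chunk encodings can be checked from the file footer with parquet-mr; a sketch (the file path argument is whatever suspect file you have):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintChunkEncodings {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // A chunk whose pages fell back from dictionary to plain encoding
          // reports both a dictionary encoding and PLAIN here.
          System.out.println(column.getPath() + " -> " + column.getEncodings());
        }
      }
    }
  }
}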

yabola (Contributor) commented Oct 31, 2023

@manuzhang Hi, please correct me if I've misunderstood: your failing case is a vectorized read where the INT96 timestamp column is not dictionary-encoded (encoding PLAIN). I have actually handled that case.
Please look at this test case: the code path you mentioned can be reached (useDict = false, useVectorization = true). Can you reproduce the issue in that test case, or provide a parquet file that reproduces the problem? Sorry if there are other points I haven't considered.
https://github.com/apache/iceberg/blob/86bb1c09f5ffd2b6a7c72683cb86bb95f4c2b72f/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java#L2121C2-L2161

manuzhang (Contributor, Author) commented:

@yabola The INT96 timestamp column is dictionary-encoded, with both PLAIN_DICTIONARY and PLAIN encodings, which is not covered by your test case.

manuzhang (Contributor, Author) commented Oct 31, 2023

If you check Spark's VectorizedColumnReader, the encoding is checked for each page.
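
A sketch of the same check at page granularity, using parquet-mr's low-level page API: it shows how a single column chunk can mix dictionary-encoded and plain pages, which is why a per-chunk dictionary flag is not enough.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DataPageV1;
import org.apache.parquet.column.page.DataPageV2;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.column.page.PageReader;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintPageEncodings {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        for (ColumnDescriptor column :
            reader.getFooter().getFileMetaData().getSchema().getColumns()) {
          PageReader pages = rowGroup.getPageReader(column);
          DataPage page;
          while ((page = pages.readPage()) != null) {
            // Each data page carries its own value encoding; dictionary
            // fallback shows up as PLAIN pages after PLAIN_DICTIONARY ones.
            Encoding encoding = page.accept(new DataPage.Visitor<Encoding>() {
              @Override
              public Encoding visit(DataPageV1 v1) {
                return v1.getValueEncoding();
              }

              @Override
              public Encoding visit(DataPageV2 v2) {
                return v2.getDataEncoding();
              }
            });
            System.out.println(column + " page encoding: " + encoding);
          }
        }
      }
    }
  }
}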

yabola (Contributor) commented Nov 1, 2023

@manuzhang Can you provide a way for me to reproduce it locally?

manuzhang (Contributor, Author) commented Nov 2, 2023

@yabola It's hard to reproduce with simple data samples. I can send you a parquet file via Slack or another channel if possible. Feel free to ping me on Iceberg's Slack.


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale'; commenting on the issue is preferred when possible.

github-actions bot added the stale label Sep 28, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 15, 2024