
Long overflow when Iceberg reads an INT96 timestamp column from a Spark parquet table #8949

Closed
manuzhang opened this issue Oct 30, 2023 · 10 comments

manuzhang (Contributor) commented Oct 30, 2023

Apache Iceberg version

1.4.1 (latest release)

Query engine

Spark 3.1.1

Please describe the bug 🐞

When a Spark parquet table is imported into Iceberg via the add_files procedure, reading the timestamp column fails with Error while decoding: java.lang.ArithmeticException: long overflow. All the overflowing records have negative long values.

SELECT $timestamp_column FROM $spark_table limit 1000;

The underlying parquet file is written by Spark in legacy mode, and the timestamp column is of type INT96.
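
For context, a minimal sketch of the setup described above, assuming a Spark session already configured with an Iceberg catalog (the catalog and table names below are illustrative). Note that, as the comments below explain, small samples like this may not trigger the dictionary-to-plain fallback that exposes the overflow.

import org.apache.spark.sql.SparkSession;

public class Int96AddFilesRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("int96-add-files-repro")
        .getOrCreate();

    // Write timestamps in Spark's legacy INT96 parquet representation.
    spark.conf().set("spark.sql.parquet.outputTimestampType", "INT96");
    spark.sql("CREATE TABLE spark_ts_table (ts TIMESTAMP) USING parquet");
    spark.sql("INSERT INTO spark_ts_table SELECT current_timestamp()");

    // The target Iceberg table must exist before add_files imports the files.
    spark.sql("CREATE TABLE my_catalog.db.iceberg_ts_table (ts TIMESTAMP) USING iceberg");
    spark.sql("CALL my_catalog.system.add_files("
        + "table => 'db.iceberg_ts_table', "
        + "source_table => 'spark_ts_table')");

    // Reading the imported timestamp column is what fails with the long overflow.
    spark.sql("SELECT ts FROM my_catalog.db.iceberg_ts_table LIMIT 1000").show();
  }
}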

After digging into it, I found that isVectorDictEncoded is false in the following code path, so TimestampMicroTzAccessor is used instead of the expected DictionaryTimestampInt96Accessor.

  public ArrowVectorAccessor<DecimalT, Utf8StringT, ArrayT, ChildVectorT> getVectorAccessor(
      VectorHolder holder) {
    Dictionary dictionary = holder.dictionary();
    // false for the problematic file, even though parts of the column chunk
    // are dictionary-encoded (see the encoding discussion below)
    boolean isVectorDictEncoded = holder.isDictionaryEncoded();
    FieldVector vector = holder.vector();
    if (isVectorDictEncoded) {
      ColumnDescriptor desc = holder.descriptor();
      PrimitiveType primitive = desc.getPrimitiveType();
      // the expected path, which would pick DictionaryTimestampInt96Accessor
      return getDictionaryVectorAccessor(dictionary, desc, vector, primitive);
    } else {
      // the path actually taken, which picks TimestampMicroTzAccessor
      return getPlainVectorAccessor(vector);
    }
  }
manuzhang (Contributor, Author) commented:

@yabola @aokolnychyi @nastra please kindly advise.

nastra (Contributor) commented Oct 30, 2023

That sounds like there's currently no support for INT96 in the non-vectorized code path; #6962 only added support for vectorized reads.

According to the description in #6962, vectorized reads for INT96 columns should already work, so this may need to be properly tested to see what exactly is missing.

manuzhang (Contributor, Author) commented:

As per the parquet format's encoding specification:

If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding.

I've verified the timestamp column has both encodings, PLAIN_DICTIONARY and PLAIN. It looks like this case is not currently handled.
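
For reference, the column-chunk encodings can be checked from the file footer with parquet-mr; a sketch (the file path argument is whatever suspect file you have):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintChunkEncodings {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // A chunk whose pages fell back from dictionary to plain encoding
          // reports both a dictionary encoding and PLAIN here.
          System.out.println(column.getPath() + " -> " + column.getEncodings());
        }
      }
    }
  }
}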

yabola (Contributor) commented Oct 31, 2023

@manuzhang Hi, please correct me if I've misunderstood: your failing case is a vectorized read where the INT96 timestamp column is not dictionary-encoded (encoding PLAIN). I have actually handled that case.
Please look at this test case: the code path you mentioned can be reached (useDict = false, useVectorization = true). Can you reproduce the issue in that test case, or provide a parquet file that reproduces the problem? Sorry if there are other points I haven't considered.
https://github.com/apache/iceberg/blob/86bb1c09f5ffd2b6a7c72683cb86bb95f4c2b72f/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java#L2121C2-L2161

manuzhang (Contributor, Author) commented:

@yabola The INT96 timestamp column is dictionary-encoded, with both PLAIN_DICTIONARY and PLAIN encodings, which is not covered by your test case.

manuzhang (Contributor, Author) commented Oct 31, 2023

If you check Spark's VectorizedColumnReader, the encoding is checked for each page.
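
A sketch of the same check at page granularity, using parquet-mr's low-level page API: it shows how a single column chunk can mix dictionary-encoded and plain pages, which is why a per-chunk dictionary flag is not enough.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DataPageV1;
import org.apache.parquet.column.page.DataPageV2;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.column.page.PageReader;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintPageEncodings {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        for (ColumnDescriptor column :
            reader.getFooter().getFileMetaData().getSchema().getColumns()) {
          PageReader pages = rowGroup.getPageReader(column);
          DataPage page;
          while ((page = pages.readPage()) != null) {
            // Each data page carries its own value encoding; dictionary
            // fallback shows up as PLAIN pages after PLAIN_DICTIONARY ones.
            Encoding encoding = page.accept(new DataPage.Visitor<Encoding>() {
              @Override
              public Encoding visit(DataPageV1 v1) {
                return v1.getValueEncoding();
              }

              @Override
              public Encoding visit(DataPageV2 v2) {
                return v2.getDataEncoding();
              }
            });
            System.out.println(column + " page encoding: " + encoding);
          }
        }
      }
    }
  }
}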

yabola (Contributor) commented Nov 1, 2023

@manuzhang Can you provide a way for me to reproduce it locally?

manuzhang (Contributor, Author) commented Nov 2, 2023

@yabola It's hard to reproduce with simple data samples. I can send you a parquet file via Slack or another channel if possible. Feel free to ping me on Iceberg's Slack.


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale'; commenting on the issue is preferred when possible.

github-actions bot added the stale label Sep 28, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 15, 2024