
Access by column position in parquet data files #10582

Closed
AndreiZhoraven opened this issue Jun 28, 2024 · 0 comments
Labels
question Further information is requested

Comments

@AndreiZhoraven

Query engine

Spark, Trino

Question

Hello everyone,

I ran into an issue with an Iceberg table after adding some Parquet files using the Iceberg Java API. The files contain all the fields defined in the table schema, but their order differs from the schema's. When I attempt to read the table with Spark SQL or Trino, I receive a ClassCastException.

Here's the error message:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 149.0 failed 4 times, most recent failure: Lost task 0.3 in stage 149.0 (TID 271) (10.104.94.54 executor 1): java.lang.ClassCastException: class java.lang.Long cannot be cast to class org.apache.spark.unsafe.types.UTF8String (java.lang.Long is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)

This error suggests that Spark SQL is resolving the table's columns by position rather than by name. Because the field order in the new files does not match the table schema, Spark tries to cast a Long to a UTF8String, producing the ClassCastException.
In other words, the read path appears to assume the original schema's column order, which misinterprets the data when that order is not preserved in newly added files.
Has anyone encountered a similar issue, or can you share best practices for managing schema evolution in Iceberg tables, particularly when adding new files with reordered fields?
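To make the failure mode concrete, here is a minimal Python sketch (not Iceberg code; the names are illustrative) of the difference between positional and name-based column resolution when a file's column order diverges from the table schema:

```python
# Table schema expects columns in this order: (id: long, name: string).
table_schema = ["id", "name"]

# A data file written with the same fields, but in a different order.
file_columns = {"name": ["alice", "bob"], "id": [1, 2]}
file_order = list(file_columns)  # ["name", "id"]

def read_by_position(row_index):
    # Positional resolution: the file's first column is assumed to be
    # the table's first column ("id") -- but here it is "name", so a
    # string lands where the engine expects a long (the analogue of
    # the ClassCastException above).
    return [file_columns[col][row_index] for col in file_order]

def read_by_name(row_index):
    # Name-based resolution matches columns to the table schema
    # regardless of their physical order in the file.
    return [file_columns[col][row_index] for col in table_schema]

print(read_by_position(0))  # ['alice', 1] -- types swapped
print(read_by_name(0))      # [1, 'alice'] -- correct
```

Iceberg normally resolves columns by the field IDs it writes into Parquet metadata, so the question is presumably why that name/ID-based path is not being taken for these externally written files.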

Best regards,
Andrei

@AndreiZhoraven AndreiZhoraven added the question Further information is requested label Jun 28, 2024
@AndreiZhoraven AndreiZhoraven changed the title Order fields in the parquet data files Access by column position in parquet data files Jun 28, 2024