
Access by column position in parquet data files #10582

Closed
AndreiZhoraven opened this issue Jun 28, 2024 · 0 comments
Labels
question Further information is requested

Comments

@AndreiZhoraven

Query engine

Spark, Trino

Question

Hello everyone,

I ran into an issue with an Iceberg table after adding some Parquet files using the Iceberg Java API. The files contain all the fields defined in the table schema, but their order differs from the schema's. When I attempt to read the table with Spark SQL or Trino, I receive a ClassCastException.

Here's the error message:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 149.0 failed 4 times, most recent failure: Lost task 0.3 in stage 149.0 (TID 271) (10.104.94.54 executor 1): java.lang.ClassCastException: class java.lang.Long cannot be cast to class org.apache.spark.unsafe.types.UTF8String (java.lang.Long is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)

This error suggests that Spark SQL is resolving the table's columns by position rather than by name. Because the field order in the new files does not match the table schema, Spark tries to cast a Long to a UTF8String, producing the ClassCastException.
In other words, the read path appears to assume the original schema's column order, which misinterprets the data when that order is not preserved in newly added files.
Has anyone encountered a similar issue, or can you share best practices for managing schema evolution in Iceberg tables, particularly when adding new files with reordered fields?
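To make the failure mode concrete, here is a minimal Python sketch (not Iceberg code; the names are illustrative) of the difference between positional and name-based column resolution when a file's column order diverges from the table schema:

```python
# Table schema expects columns in this order: (id: long, name: string).
table_schema = ["id", "name"]

# A data file written with the same fields, but in a different order.
file_columns = {"name": ["alice", "bob"], "id": [1, 2]}
file_order = list(file_columns)  # ["name", "id"]

def read_by_position(row_index):
    # Positional resolution: the file's first column is assumed to be
    # the table's first column ("id") -- but here it is "name", so a
    # string lands where the engine expects a long (the analogue of
    # the ClassCastException above).
    return [file_columns[col][row_index] for col in file_order]

def read_by_name(row_index):
    # Name-based resolution matches columns to the table schema
    # regardless of their physical order in the file.
    return [file_columns[col][row_index] for col in table_schema]

print(read_by_position(0))  # ['alice', 1] -- types swapped
print(read_by_name(0))      # [1, 'alice'] -- correct
```

Iceberg normally resolves columns by the field IDs it writes into Parquet metadata, so the question is presumably why that name/ID-based path is not being taken for these externally written files.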

Best regards,
Andrei

@AndreiZhoraven AndreiZhoraven added the question Further information is requested label Jun 28, 2024
@AndreiZhoraven AndreiZhoraven changed the title Order fields in the parquet data files Access by column position in parquet data files Jun 28, 2024