Spark dataframe count on entries metadata table throws an IllegalArgumentException for partitioned tables #1378

@shardulm94

Description

I have created a test case to demonstrate this issue at shardulm94@6e0d0c2

The failing test can be run using ./gradlew :iceberg-spark2:test --tests="*TestIcebergSourceHiveTables24*"
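For reference, a minimal sketch of the kind of read that triggers the failure. This assumes a PySpark session and an existing partitioned Iceberg table; the table name is illustrative, not taken from the linked test case. Counting the table's `entries` metadata table forces a pruned projection which, for partitioned tables, ends up missing the required `data_file` struct.

```python
def count_entries(spark, table="db.sample"):
    # Hypothetical repro sketch (table name is illustrative).
    # Loading "<table>.entries" reads the entries metadata table;
    # count() prunes columns, and on a partitioned table the projection
    # built for the manifest Avro files fails with:
    #   java.lang.IllegalArgumentException: Missing required field: data_file
    return spark.read.format("iceberg").load(table + ".entries").count()
```

Selecting the full `data_file` struct explicitly (instead of counting) does not hit the same projection path.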

> Task :iceberg-spark2:test

org.apache.iceberg.spark.source.TestIcebergSourceHiveTables24 > testCountEntriesPartitionedTable FAILED
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 447, localhost, executor driver): java.lang.IllegalArgumentException: Missing required field: data_file
        at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:217)
        at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:98)
        at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:42)
        at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
        at org.apache.iceberg.avro.AvroSchemaUtil.buildAvroProjection(AvroSchemaUtil.java:105)
        at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:68)
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:132)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:106)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:98)
        at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:66)
        at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
        at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
        at org.apache.iceberg.io.CloseableIterable$3$1.<init>(CloseableIterable.java:95)
        at org.apache.iceberg.io.CloseableIterable$3.iterator(CloseableIterable.java:94)
        at org.apache.iceberg.io.CloseableIterable$3$1.<init>(CloseableIterable.java:95)
        at org.apache.iceberg.io.CloseableIterable$3.iterator(CloseableIterable.java:94)
        at org.apache.iceberg.io.CloseableIterable$3$1.<init>(CloseableIterable.java:95)
        at org.apache.iceberg.io.CloseableIterable$3.iterator(CloseableIterable.java:94)
        at org.apache.iceberg.io.CloseableIterable$3$1.<init>(CloseableIterable.java:95)
        at org.apache.iceberg.io.CloseableIterable$3.iterator(CloseableIterable.java:94)
        at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:100)
        at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:81)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_1$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)