Required field 'num_values' was not found in serialized data! #3084

@wardlican

Description

Describe the bug, including details regarding any error messages, version, and platform.

While using Iceberg, we found that a Parquet file we had written could not be read back. Reading it fails with the error below. Judging from the exception, the file appears to be damaged or was not written completely, so it cannot be parsed. We tried several parsing tools, and none of them could read it either. The footer of the file is intact and the schema can be obtained from it, but reading the data fails because the DataPageHeader cannot be parsed. The Parquet version is 1.13.1. Is there any tool that can recover damaged files?

 org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
	... 23 more
Caused by: org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TProtocolException: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme.read(DataPageHeader.java:781)
	at org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme.read(DataPageHeader.java:719)
	at org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader.read(DataPageHeader.java:624)
	at org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1072)
	at org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
	at org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:363)
	... 31 more
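
For anyone triaging a similar file, the observation that the footer is intact can be confirmed without decoding any pages. The sketch below is not a repair tool and is not part of parquet-java or Iceberg. It relies only on the documented file layout: a `PAR1` magic at the start, and a 4-byte little-endian footer length followed by `PAR1` at the end. It assumes an unencrypted file. The function name `check_parquet_framing` and the result keys are made up for illustration.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def check_parquet_framing(path):
    """Report basic framing facts for a Parquet file without decoding pages.

    Hypothetical helper: it only inspects the leading magic and the trailing
    8 bytes (footer length + magic). It says nothing about page contents.
    """
    size = os.path.getsize(path)
    result = {
        "size": size,
        "head_magic_ok": False,
        "tail_magic_ok": False,
        "footer_length": None,
        "footer_start": None,
        "footer_in_bounds": False,
    }
    # Smallest possible framing: 4 (magic) + 4 (footer length) + 4 (magic).
    if size < 12:
        return result

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(size - 8)
        tail = f.read(8)

    # Footer length is an unsigned 32-bit little-endian integer.
    footer_length = struct.unpack("<I", tail[:4])[0]
    footer_start = size - 8 - footer_length

    result["head_magic_ok"] = head == MAGIC
    result["tail_magic_ok"] = tail[4:] == MAGIC
    result["footer_length"] = footer_length
    result["footer_start"] = footer_start
    # The footer must begin after the leading magic to be plausible.
    result["footer_in_bounds"] = footer_start >= 4
    return result
```

Calling `check_parquet_framing("data.parquet")` (a placeholder path) on the failing file should show both magic checks passing and the footer in bounds if the description above is accurate. That would point the damage at the page data inside a column chunk rather than at the file framing.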

Component(s)

Thrift
