
java.lang.IllegalArgumentException: requirement failed while reading migrated parquet table #8863

Open
camper42 opened this issue Oct 18, 2023 · 7 comments

Comments

camper42 commented Oct 18, 2023

Apache Iceberg version

1.4.0 (latest release)

Query engine

Spark 3.4.1

Please describe the bug 🐞

We have an irrationally partitioned Parquet table with a three-level partition layout and tons of small files. It looks like table/log_date=2023-10-16/date=2023-10-17/attr=attr1.

After a successful call to spark_catalog.system.migrate("db.table"), Spark reads fail on some partitions that contain about 1k+ Parquet (zstd) files (roughly 30 KiB per file), but we can still read from table_backup_.
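
For reference, a minimal sketch of the migrate call and the follow-up read of the backup table (assuming spark_catalog is configured as an Iceberg SparkSessionCatalog; the db.table names follow this report):

```scala
// Minimal sketch: convert the Hive/Parquet table in place to an Iceberg table.
// Assumes spark_catalog is an Iceberg SparkSessionCatalog.
spark.sql("CALL spark_catalog.system.migrate('db.table')")

// The pre-migration table is kept as a backup (table_backup_ above);
// reading it still works.
spark.table("db.table_backup_").count()
```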

Error:

23/10/18 01:08:56 WARN TaskSetManager: Lost task 70.0 in stage 2.0 (TID 1029) (10.113.6.67 executor 9): java.lang.IllegalArgumentException: requirement failed: length (-254841238) cannot be smaller than -1
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:79)
	at org.apache.spark.rdd.InputFileBlockHolder.set(InputFileBlockHolder.scala)
	at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:89)
	at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:41)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:141)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
	at scala.Option.exists(Option.scala:376)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
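
For context, the requirement that fails lives in Spark's InputFileBlockHolder; a paraphrase (not the verbatim Spark source) of the relevant check:

```scala
// Paraphrase of org.apache.spark.rdd.InputFileBlockHolder (Spark 3.4.x), not
// the verbatim source: Iceberg's BatchDataReader.open passes the scan task's
// start offset and length here, and a negative length fails the second require.
object InputFileBlockHolderSketch {
  def set(filePath: String, startOffset: Long, length: Long): Unit = {
    require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
    require(length >= -1, s"length ($length) cannot be smaller than -1")
    // ... Spark then records the block so input_file_name() and related
    // functions can report the current file
  }
}
```

So the negative value in the message is the split length handed to Spark for the file being opened.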


@camper42
Author

Can reproduce with snapshot:
[screenshot: reproduction using the snapshot procedure]

Previous failure on the migrated table:
[screenshots: earlier read failures on the migrated table]

@camper42
Author

I'm lost... it seems both snapshot and migrate use SparkTableUtil.importSparkTable.
If reads fail on the migrated table, they should also fail on the snapshot table, yet there is no error when reading from the snapshot table.

I also tried the following (a sketch of the calls for steps 3 and 4 follows the list):

  1. cp the failed partition to a new table (new_test)
  2. REPAIR TABLE new_test
  3. snapshot + rewrite -> no error
  4. migrate + rewrite -> no error
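
A minimal sketch of steps 3 and 4 (these are the standard Iceberg Spark procedures; "rewrite" is shown here as rewrite_data_files, and the table names are placeholders):

```scala
// Step 3: snapshot the repaired table into a new Iceberg table, then rewrite it.
// db.new_test and db.new_test_iceberg are placeholder names.
spark.sql("CALL spark_catalog.system.snapshot('db.new_test', 'db.new_test_iceberg')")
spark.sql("CALL spark_catalog.system.rewrite_data_files('db.new_test_iceberg')")

// Step 4: or migrate the table in place and rewrite it.
spark.sql("CALL spark_catalog.system.migrate('db.new_test')")
spark.sql("CALL spark_catalog.system.rewrite_data_files('db.new_test')")
```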

@Omega359
Contributor

I've just encountered this exception as well, but the circumstances are somewhat different. One process writes out an Iceberg table, primarily via appends with the occasional delete. Upon completion, a new cluster is spun up that reads that table. The issue presented itself during that process:

23/10/24 22:25:51 ERROR BaseReader: Error reading file(s): s3://vdcint-transaction-dev-txn/AR_IDw/dev/transaction_2023_10_ndleq8xl/source_alias=MC3/transaction_date_year=2016/00081-10437993-40ecbe40-af70-41a4-9fde-61bc6a6abeb2-00001.parquet
java.lang.IllegalArgumentException: requirement failed: length (-135733377) cannot be smaller than -1
	at scala.Predef$.require(Predef.scala:281) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:79) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.rdd.InputFileBlockHolder.set(InputFileBlockHolder.scala) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:89) ~[app.jar:?]
	at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:41) ~[app.jar:?]
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:141) ~[app.jar:?]
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.hashAgg_doAggregateWithoutKey_0$(Unknown Source) ~[?:?]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source) ~[?:?]
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.hasNext(Unknown Source) ~[?:?]
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:959) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

Iceberg 1.4.0, Spark 3.4.1 (EMR 6.14.0), JDK 17.

@camper42
Author

Could this be caused by the critical bug described in #8834? @bryanck

If so, this issue could be closed.


camper42 commented Nov 8, 2023

After upgrading to 1.4.2, the IllegalArgumentException no longer occurs.


Hathoute commented Nov 9, 2023

Upgrading to 1.4.1 or later AND rewriting manifests seems to resolve this.
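
A minimal sketch of that manifest rewrite, run after upgrading the Iceberg runtime (db.table is a placeholder name):

```scala
// Rewrite all manifests so they are produced by the fixed (>= 1.4.1) writer.
spark.sql("CALL spark_catalog.system.rewrite_manifests('db.table')")
```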

@valerio-leap

Any idea how to overcome this issue if upgrading is not an option?
