"Manifest is missing" ValidationException when there have Concurrent applications to rewrite manifests #3466

Open
fennuzhichui opened this issue Nov 4, 2021 · 15 comments · May be fixed by #4612
Labels
beginner Issues for apache iceberg beginners, enjoy to contribute ! good first issue Good for newcomers

Comments

@fennuzhichui

We concurrently run many SQL shell applications that overwrite different days' data in an Iceberg table, and every application ends with "CALL spark_catalog.system.rewrite_manifests(table => 'dwm.tableA', use_caching => false)". As a result, the applications rewrite tableA's manifests concurrently and throw ValidationException("Manifest is missing: xxx") in the validateDeletedManifests method.
diagnostics: User class threw exception: org.apache.spark.SparkException: org.apache.iceberg.exceptions.ValidationException: Manifest is missing: oss://xgimi-data/apps/spark/warehouse/ods.db/screen_event_log_hi/metadata/52c1e98f-02a5-4ce3-ae05-d7382a6a68c2-m2.avro
at org.apache.iceberg.BaseRewriteManifests.lambda$validateDeletedManifests$7(BaseRewriteManifests.java:261)
at java.util.Optional.ifPresent(Optional.java:159)
at org.apache.iceberg.BaseRewriteManifests.validateDeletedManifests(BaseRewriteManifests.java:260)
at org.apache.iceberg.BaseRewriteManifests.apply(BaseRewriteManifests.java:169)
at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:163)
at org.apache.iceberg.BaseRewriteManifests.apply(BaseRewriteManifests.java:53)
at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:276)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:213)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:197)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:189)
at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:275)
at org.apache.iceberg.BaseRewriteManifests.commit(BaseRewriteManifests.java:53)
at org.apache.iceberg.actions.BaseSnapshotUpdateAction.commit(BaseSnapshotUpdateAction.java:40)
at org.apache.iceberg.actions.RewriteManifestsAction.replaceManifests(RewriteManifestsAction.java:309)
at org.apache.iceberg.actions.RewriteManifestsAction.execute(RewriteManifestsAction.java:196)

@GaruGaru

GaruGaru commented Jan 3, 2022

Same issue with Iceberg 0.11 and Spark 3.1.1.

@flyrain
Contributor

flyrain commented Apr 20, 2022

cc @RussellSpitzer

@RussellSpitzer
Member

We were just discussing this today. The message is a bit confusing, but failing here is the correct response. I think we should probably change the message to something like

"Cannot apply RewriteManifests result since manifests being replaced have already been removed in the current snapshot"

because what is happening is actually the same as another validation error we raise during the commit phase. For example, say we run two optimize-metadata operations simultaneously:

Optimize 1: Rewrites Manifests A and B into C'  : Manifests Deleted [A, B], Manifests Added [C']
Optimize 2: Rewrites Manifests A and B into C'' : Manifests Deleted [A, B], Manifests Added [C'']

If 1 finishes first the state of our table is

Manifests [C']

If we then try to apply Optimize 2, we see that A and B are already deleted and no longer present. If we added C'' anyway, our table would look like

Manifests [C', C'']

and we would end up with duplicate records! So the correct response is to cancel the second optimize operation.
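The race described above can be reproduced with a toy model. This is only an illustrative sketch, not Iceberg's implementation: `ValidationError` and `apply_rewrite` are hypothetical names, and the table is modelled as a plain set of manifest names.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for Iceberg's ValidationException (sketch only)."""


def apply_rewrite(current, deleted, added):
    """Return the new manifest set, or raise if a manifest to delete is gone."""
    missing = sorted(set(deleted) - set(current))
    if missing:
        # The rewrite was planned against a table state that no longer exists.
        raise ValidationError(f"Manifest is missing: {missing[0]}")
    return (set(current) - set(deleted)) | set(added)


table = {"A", "B"}

# Both optimizes were planned against the same starting state [A, B].
optimize_1 = ({"A", "B"}, {"C'"})
optimize_2 = ({"A", "B"}, {"C''"})

# Optimize 1 commits first and replaces A and B with C'.
table = apply_rewrite(table, *optimize_1)
print(sorted(table))  # ["C'"]

# Optimize 2 is now stale: A and B are gone, so validation rejects it.
try:
    table = apply_rewrite(table, *optimize_2)
except ValidationError as err:
    print(err)  # Manifest is missing: A
```

Without the check, the second commit would leave the table with both C' and C'', each listing the same data files, which is the duplicate-records outcome the validation is guarding against.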

@RussellSpitzer RussellSpitzer added the beginner Issues for apache iceberg beginners, enjoy to contribute ! label Apr 20, 2022
@RussellSpitzer
Member

Adding the label Beginner if anyone wants to take a crack at improving the error message

@kyle-cx91
Contributor

I think putting the current snapshot ID into the error message would make it easier for users to figure out what happened.
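As a rough illustration of that suggestion, a message that names both the missing manifest and the snapshot it was checked against might be built like this. The function name, wording, path, and snapshot ID below are all hypothetical and are not taken from Iceberg.

```python
def missing_manifest_message(manifest_path, current_snapshot_id):
    """Build a hypothetical error message that includes the snapshot ID."""
    return (
        "Cannot apply RewriteManifests result: manifest "
        f"{manifest_path} has already been removed in the current snapshot "
        f"({current_snapshot_id})"
    )


# Placeholder values, for illustration only.
print(missing_manifest_message("metadata/example-m0.avro", 1234567890))
```

With the snapshot ID in hand, a user could look up which commit removed the manifest and see that a concurrent operation got there first.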

@lintingbin
Contributor

@RussellSpitzer We have a Flink streaming job writing data to an Iceberg table (the table is set to 'commit.manifest-merge.enabled' = 'false'). When I run a rewrite-manifests task in Spark, this error comes up quite often, causing the task to fail. Does that mean 'commit.manifest-merge.enabled' is not taking effect?

@372242283

I have the same problem, which happens occasionally. Could the experts here help me analyze it? Thank you.
Versions: Iceberg 1.1.0, Spark 3.1.3
CALL hive_catalog.system.rewrite_manifests('xx.tableA', false)
Error message:
23/02/05 21:12:04 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
23/02/05 21:12:04 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
23/02/05 21:12:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/02/05 21:12:05 INFO MemoryStore: MemoryStore cleared
23/02/05 21:12:05 INFO BlockManager: BlockManager stopped
23/02/05 21:12:05 INFO BlockManagerMaster: BlockManagerMaster stopped
23/02/05 21:12:05 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/02/05 21:12:05 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.iceberg.exceptions.ValidationException: Manifest is missing: hdfs://xxx:8020/user/hive/warehouse/xx.db/tableA/metadata/d9c19289-fbf1-48a8-90be-a8eecdf088b4-m0.avro
at org.apache.iceberg.BaseRewriteManifests.lambda$validateDeletedManifests$7(BaseRewriteManifests.java:299)
at java.base/java.util.Optional.ifPresent(Unknown Source)
at org.apache.iceberg.BaseRewriteManifests.validateDeletedManifests(BaseRewriteManifests.java:297)
at org.apache.iceberg.BaseRewriteManifests.apply(BaseRewriteManifests.java:195)
at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:223)
at org.apache.iceberg.BaseRewriteManifests.apply(BaseRewriteManifests.java:50)
at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:369)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:402)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:212)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:189)
at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:367)
at org.apache.iceberg.BaseRewriteManifests.commit(BaseRewriteManifests.java:50)
at org.apache.iceberg.spark.actions.BaseSnapshotUpdateSparkAction.commit(BaseSnapshotUpdateSparkAction.java:43)
at org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction.replaceManifests(BaseRewriteManifestsSparkAction.java:312)
at org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction.doExecute(BaseRewriteManifestsSparkAction.java:185)
at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:97)
at org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction.execute(BaseRewriteManifestsSparkAction.java:150)
at org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction.execute(BaseRewriteManifestsSparkAction.java:80)
at org.apache.iceberg.spark.procedures.RewriteManifestsProcedure.lambda$call$0(RewriteManifestsProcedure.java:97)
at org.apache.iceberg.spark.procedures.BaseProcedure.execute(BaseProcedure.java:86)
at org.apache.iceberg.spark.procedures.BaseProcedure.modifyIcebergTable(BaseProcedure.java:74)
at org.apache.iceberg.spark.procedures.RewriteManifestsProcedure.call(RewriteManifestsProcedure.java:88)
at org.apache.spark.sql.execution.datasources.v2.CallExec.run(CallExec.scala:33)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)

@RussellSpitzer
Member

As I wrote above, the issue is that the rewrite command becomes out of date while it is running, so it fails. At least, that is my hypothesis.

@372242283

> As I wrote above, the issue is that the rewrite command becomes out of date while it is running, so it fails. At least, that is my hypothesis.

This problem occurs intermittently. I think other people have run into it too. How should we fix or work around it? Thank you.

@obogobo

obogobo commented Apr 10, 2023

Would using the DynamoDB lock table help at all with this issue? We have a Spark Streaming job that commits every 3 minutes and a compaction job that runs hourly (rewrite manifests, expire snapshots, rewrite data files) due to an extremely high volume of data ingest. Past a certain point, compaction began failing every time due to the streaming job updating the latest manifest before compaction could complete.

I think I'm going to try the commit.manifest-merge.enabled setting and fiddle with commit retries to see if that helps at all. Thanks for everything you do maintaining Iceberg!

edit: ^ disabling merge made matters a bit worse, would not recommend. We're going to try calling rewriteManifests less often and wrap it in some application level retries for now.
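For anyone trying the same workaround, an application-level retry could look roughly like the sketch below. It assumes the rewrite is wrapped in a callable that plans again from the table's current state on every call; `StaleRewriteError` and `retry_rewrite` are hypothetical names, not Iceberg or Spark APIs.

```python
import time


class StaleRewriteError(Exception):
    """Hypothetical stand-in for the 'Manifest is missing' ValidationException."""


def retry_rewrite(rewrite, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call rewrite() until it succeeds or the attempts run out.

    rewrite must re-plan from the table's current state on every call;
    re-committing a stale plan would fail the same validation again.
    """
    for attempt in range(1, attempts + 1):
        try:
            return rewrite()
        except StaleRewriteError:
            if attempt == attempts:
                raise  # give up and surface the last failure
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Usage with a fake rewrite that loses the race twice, then succeeds.
calls = {"n": 0}


def fake_rewrite():
    calls["n"] += 1
    if calls["n"] < 3:
        raise StaleRewriteError("Manifest is missing: ...")
    return "committed"


delays = []  # record the backoff delays instead of actually sleeping
result = retry_rewrite(fake_rewrite, sleep=delays.append)
print(result, delays)  # committed [1.0, 2.0]
```

In a real job the callable would presumably issue the rewrite_manifests call again rather than reuse the earlier result. Retries only help if the rewrite can finish between two commits from the writer, so running it less often or during quieter periods may matter more than the retry count.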

@okayhooni

I got the same issue..

#4161 (comment)

@wfxxh

wfxxh commented Jan 8, 2024

[image: iceberg_error]

I got the same issue. My Iceberg maintenance order is expireSnapshots -> rewriteDataFiles -> rewriteManifests -> deleteOrphanFiles (older than 20 minutes). The image shows the rewriteManifests error, even though after rewriteDataFiles and before rewriteManifests I fetched the currentSnapshot and printed its manifest files.

@amitgilad3

Is this issue still relevant? I saw that a PR is open, but I'm not sure anybody is working on it. If not, I would like to give it a try. @RussellSpitzer

@372242283

> Is this issue still relevant? I saw that a PR is open, but I'm not sure anybody is working on it. If not, I would like to give it a try. @RussellSpitzer

This error still occurs occasionally. Thanks.

@amitgilad3

Thanks @372242283. I know this is still an issue, but I saw a PR and was wondering whether it was abandoned. If so, I would like to work on fixing this issue.
