
[Api][Spark] Add cleanExpiredFiles to Actions/ExpireSnapshots #3772

Closed
zhongyujiang wants to merge 1 commit into apache:master from zhongyujiang:spark-clean-expired-files-optionally

Conversation

@zhongyujiang
Contributor

I found that expire_snapshots and remove_orphan_files may conflict when the remove_orphan_files procedure happens to start right after expire_snapshots' commit and deletes files before expire_snapshots performs its own deletes, which causes expire_snapshots to fail.
This PR adds cleanExpiredFiles to Actions/ExpireSnapshots, which allows expiring snapshots without any cleanup of files, just like Iceberg's ExpireSnapshots. The expired files are left for remove_orphan_files to clean up.
This PR only implements this in Spark 3.2, so I have converted it to a draft.
@aokolnychyi Do you think this is a reasonable way to avoid such a conflict? Can you help review this PR? Thanks!
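
For illustration, here is a minimal sketch of how the new method could be used (the cleanExpiredFiles call is the one added by this PR; the Spark session, table handle, and retention window are placeholders):

```java
// Sketch only: expire snapshots without deleting any files, leaving physical
// cleanup to a later remove_orphan_files run. cleanExpiredFiles(boolean) is
// the method proposed in this PR.
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class ExpireWithoutCleanupExample {
  static void expireMetadataOnly(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // placeholder retention
        .cleanExpiredFiles(false) // proposed here: skip deleting expired manifests and data files
        .execute();
  }
}
```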

…pshots without any cleanup of files when using spark
* @param cleanEnabled setting this to false will skip deleting expired manifests and files
* @return this for method chaining
*/
ExpireSnapshots cleanExpiredFiles(boolean cleanEnabled);
Member
@ajantha-bhat Dec 19, 2021

This action already has a deleteWith(Consumer<String> deleteFunc) method.
We can plug in a dummy consumer to avoid deleting the files.

The default implementation is to delete the files:
https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L84-L89

So, I think this PR is not required.
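
A minimal sketch of that dummy-consumer approach (assuming the existing Spark 3.2 action API; the session, table handle, and retention window are placeholders):

```java
// Sketch only: use the existing deleteWith(Consumer<String>) hook with a no-op
// consumer so the action expires snapshot metadata but deletes nothing.
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class NoopDeleteExample {
  static void expireWithoutDeleting(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // placeholder retention
        .deleteWith(path -> { /* intentionally do nothing */ }) // dummy consumer
        .execute();
  }
}
```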

Member

But I guess we could make changes to the call procedure. We could expose a skipDelete argument and have the procedure plug the dummy consumer into the Spark action when skipDelete = true.
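
Purely hypothetical sketch of what that could look like from the procedure side (the skip_delete argument does not exist today; it is only the suggestion above, and the catalog/table names are placeholders):

```java
// Hypothetical only: invoking the expire_snapshots procedure with the suggested
// skip-delete option, if it were added. The argument name is assumed.
spark.sql(
    "CALL my_catalog.system.expire_snapshots("
        + "table => 'db.sample', "
        + "older_than => TIMESTAMP '2021-12-01 00:00:00.000', "
        + "skip_delete => true)");
```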

@rdblue
Contributor

rdblue commented Dec 19, 2021

I agree with @ajantha-bhat that there's already a way to do this. I don't think we should expose this to users, who probably would not understand the situations in which you'd want to not clean up data files.

The situation you described seems to be a different bug to me. I don't think that deleting data files should ever cause the job to fail. We should catch errors and log them, but continue trying to clean up.

@rdblue
Contributor

rdblue commented Dec 19, 2021

@zhongyujiang can you give more detail on what failed? Looking at the code that deletes files, I think that it should already log a warning and skip the file delete, rather than fail. https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L245-L271

What failure did you see?

@zhongyujiang
Contributor Author

@ajantha-bhat @rdblue Thanks for your review! Maybe I didn't make it clear: the expire_snapshots failure I hit is not caused by deleting files, but by building the expired-files dataset, because some files needed to build expiredFiles are deleted by remove_orphan_files, which happens to start right after expire_snapshots' commit. So plugging in a dummy consumer to avoid deleting the files has no effect in this situation.
This situation is quite rare, but it truly happens. I think a simple way to avoid it is to execute remove_orphan_files only after the expire_snapshots job is done, but our remove_orphan_files procedure is scheduled automatically by a program, so the timing is a bit difficult to control.

@@ -172,11 +179,14 @@ public Dataset<Row> expire() {

expireSnapshots.commit();
Contributor Author
@zhongyujiang Dec 20, 2021

Dataset<Row> originalFiles = buildValidFileDF(ops.current());

The originalFiles dataset is calculated lazily, so the situation is like this:
expireSnapshots.commit() -> originalFiles.except(validFiles) (still being computed) -> remove_orphan_files starts and deletes some files needed to build originalFiles -> expire_snapshots fails because of a FileNotFoundException.

Member
@ajantha-bhat Dec 20, 2021

Interesting.
Let me reword it for my own understanding...

expireSnapshots.commit() has removed the snapshot entries from the metadata.
But the files (manifest lists or manifest files) referenced by those snapshots were still present in the originalFiles dataframe.
So, when remove_orphan_files ran during that window, since metadata entries no longer existed for those files, it deleted them.
Then, when originalFiles.except(validFiles) was evaluated and originalFiles was actually computed, the manifest lists or manifest files that remove_orphan_files had deleted were read, leading to a FileNotFoundException.

In your PR, expireSnapshots removes the expired snapshot entries from the metadata but doesn't compute the list of removed files (which is what can cause the FileNotFoundException in the scenario above), and it also doesn't delete any files, since the output list is empty.
These files will later be cleaned by running remove_orphan_files again. (Also note that cleaning expired files via remove_orphan_files is costlier (slower) than cleaning them in expireSnapshots.)

This PR can solve the issue, but in my opinion it looks hacky, and users would need to know when to set the skipClean configuration and when not to. I suggest we instead block remove_orphan_files while expireSnapshots is running, and vice versa (using an event listener or some kind of lock?). We can wait for others' opinions.
cc: @rdblue, @RussellSpitzer

Member

I agree, we already have APIs for expiring without deleting and IMHO we should be pushing users towards deleting with expire snapshots. So anything that makes it seem like this is not the right way to clean old files should be avoided.

Contributor Author

Your understanding is quite right. But I don't think an event listener or lock is a good idea, because that seems to require introducing external dependencies.

Member

Agree. Your new solution of "trigger calculation for Dataset originalFiles before commit and cache it to reduce the probability of such failure" sounds good to me 👍🏻

@RussellSpitzer
Member

> @ajantha-bhat @rdblue Thanks for your review! Maybe I didn't make it clear: the expire_snapshots failure I hit is not caused by deleting files, but by building the expired-files dataset, because some files needed to build expiredFiles are deleted by remove_orphan_files, which happens to start right after expire_snapshots' commit. So plugging in a dummy consumer to avoid deleting the files has no effect in this situation. This situation is quite rare, but it truly happens. I think a simple way to avoid it is to execute remove_orphan_files only after the expire_snapshots job is done, but our remove_orphan_files procedure is scheduled automatically by a program, so the timing is a bit difficult to control.

I agree with Ryan, the thing to do here is to make it so that "expire snapshots" won't fail in this sort of situation. Basically, just put in a bunch of try blocks so that if we can't determine the correct set of files to remove, we just log warnings for those files and continue as normal. This would result in not cleaning the total set of unused files, since we wouldn't be able to determine the full set, but since RemoveOrphans is running concurrently in this example, that really doesn't matter.

That said, the ExpireSnapshots method is far more efficient than RemoveOrphans, and we should always be encouraging users towards using Expire to delete rather than Remove. For this particular use case I would probably just have the same program that runs expire snapshots also run remove orphans, so that they never run concurrently.
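
To illustrate the pattern (illustration only, not Iceberg's actual internals; readReferencedFiles is a hypothetical helper standing in for the real file listing):

```java
// "Log and continue" idea: if a metadata file was already removed by a
// concurrent cleanup, warn and skip it instead of failing the whole action.
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class BestEffortExpireExample {
  private static final Logger LOG = LoggerFactory.getLogger(BestEffortExpireExample.class);

  static List<String> collectExpiredFiles(List<String> candidateManifests) {
    List<String> expiredFiles = new ArrayList<>();
    for (String manifestPath : candidateManifests) {
      try {
        // Hypothetical helper that reads the files referenced by a manifest.
        expiredFiles.addAll(readReferencedFiles(manifestPath));
      } catch (FileNotFoundException e) {
        LOG.warn("Manifest {} is missing (possibly removed concurrently); skipping", manifestPath, e);
      }
    }
    return expiredFiles;
  }

  static List<String> readReferencedFiles(String manifestPath) throws FileNotFoundException {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```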

@zhongyujiang
Contributor Author

zhongyujiang commented Dec 20, 2021

> That said, the ExpireSnapshots method is far more efficient than RemoveOrphans, and we should always be encouraging users towards using Expire to delete rather than Remove.

This makes sense. I think we can trigger calculation for Dataset originalFiles before commit and cache it to reduce the probability of such failure. @RussellSpitzer What do you think? cc @ajantha-bhat
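
A rough sketch of that idea (assumed to sit inside the action's expire() method, where buildValidFileDF, ops, and expireSnapshots are already in scope; this is not the PR's actual diff):

```java
// Sketch only: materialize the pre-commit file listing before committing the
// snapshot expiration, so a concurrent remove_orphan_files run cannot delete
// the metadata files the lazy Dataset would still need to read.
Dataset<Row> originalFiles = buildValidFileDF(ops.current()).cache();
originalFiles.count(); // force evaluation while the referenced files still exist

expireSnapshots.commit();
ops.refresh();

Dataset<Row> validFiles = buildValidFileDF(ops.current());
Dataset<Row> expiredFiles = originalFiles.except(validFiles);
```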

@rdblue
Contributor

rdblue commented Dec 20, 2021

@zhongyujiang, caching is an optimization and we should not rely on it for correctness. There is no guarantee that cached data will still be available. Plus, it is expensive and causes scaling issues in Spark clusters.

Instead, I think that if a metadata file is missing, we should ignore it and move on. This should warn in that case but otherwise be a best effort delete.

@zhongyujiang zhongyujiang deleted the spark-clean-expired-files-optionally branch April 26, 2022 02:02