
[Api][Spark] Add cleanExpiredFiles to Actions/ExpireSnapshots #3772

Closed
zhongyujiang wants to merge 1 commit into apache:master from zhongyujiang:spark-clean-expired-files-optionally

Conversation

@zhongyujiang
Contributor

I found that expire_snapshots and remove_orphan_files may conflict when the remove_orphan_files procedure happens to start right after expire_snapshots' commit and deletes files before expire_snapshots performs its own deletes, which causes expire_snapshots to fail.
This PR adds cleanExpiredFiles to Actions/ExpireSnapshots, which allows expiring snapshots without any cleanup of files, just like Iceberg's ExpireSnapshots. The expired files are left for remove_orphan_files to clean up.
This PR only implements this in Spark 3.2, so I have converted it to a draft.
@aokolnychyi Do you think this is a reasonable way to avoid such a conflict? Can you help review this PR? Thanks!
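
For illustration, here is a minimal sketch of how the new method could be used (the cleanExpiredFiles call is the one added by this PR; the Spark session, table handle, and retention window are placeholders):

```java
// Sketch only: expire snapshots without deleting any files, leaving physical
// cleanup to a later remove_orphan_files run. cleanExpiredFiles(boolean) is
// the method proposed in this PR.
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class ExpireWithoutCleanupExample {
  static void expireMetadataOnly(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // placeholder retention
        .cleanExpiredFiles(false) // proposed here: skip deleting expired manifests and data files
        .execute();
  }
}
```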

…pshots without any cleanup of files when using spark
* @param cleanEnabled setting this to false will skip deleting expired manifests and files
* @return this for method chaining
*/
ExpireSnapshots cleanExpiredFiles(boolean cleanEnabled);
Member
@ajantha-bhat Dec 19, 2021

This action already has a deleteWith(Consumer<String> deleteFunc) method.
We can plug in a dummy consumer to avoid deleting the files.

The default implementation is to delete the files:
https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L84-L89

So, I think this PR is not required.
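
A minimal sketch of that dummy-consumer approach (assuming the existing Spark 3.2 action API; the session, table handle, and retention window are placeholders):

```java
// Sketch only: use the existing deleteWith(Consumer<String>) hook with a no-op
// consumer so the action expires snapshot metadata but deletes nothing.
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class NoopDeleteExample {
  static void expireWithoutDeleting(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // placeholder retention
        .deleteWith(path -> { /* intentionally do nothing */ }) // dummy consumer
        .execute();
  }
}
```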

Member

But I guess we could make changes to the call procedure. We could expose a skipDelete argument and have the procedure plug the dummy consumer into the Spark action when skipDelete = true.
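
Purely hypothetical sketch of what that could look like from the procedure side (the skip_delete argument does not exist today; it is only the suggestion above, and the catalog/table names are placeholders):

```java
// Hypothetical only: invoking the expire_snapshots procedure with the suggested
// skip-delete option, if it were added. The argument name is assumed.
spark.sql(
    "CALL my_catalog.system.expire_snapshots("
        + "table => 'db.sample', "
        + "older_than => TIMESTAMP '2021-12-01 00:00:00.000', "
        + "skip_delete => true)");
```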

@rdblue
Contributor

rdblue commented Dec 19, 2021

I agree with @ajantha-bhat that there's already a way to do this. I don't think we should expose this to users, who probably would not understand the situations in which you'd want to not clean up data files.

The situation you described seems to be a different bug to me. I don't think that deleting data files should ever cause the job to fail. We should catch errors and log them, but continue trying to clean up.

@rdblue
Contributor

rdblue commented Dec 19, 2021

@zhongyujiang can you give more detail on what failed? Looking at the code that deletes files, I think that it should already log a warning and skip the file delete, rather than fail. https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L245-L271

What failure did you see?

@zhongyujiang
Contributor Author

@ajantha-bhat @rdblue Thanks for your review! Maybe I didn't make it clear: the expire_snapshots failure I hit is not caused by deleting files, but by building the expired-files dataset, because some files needed to build expiredFiles are deleted by remove_orphan_files, which happens to start right after expire_snapshots' commit. So plugging in a dummy consumer to avoid deleting the files has no effect in this situation.
This situation is quite rare, but it truly happens. I think a simple way to avoid it is to execute remove_orphan_files only after the expire_snapshots job is done, but our remove_orphan_files procedure is scheduled automatically by a program, so the timing is a bit difficult to control.

@@ -172,11 +179,14 @@ public Dataset<Row> expire() {

expireSnapshots.commit();
Contributor Author
@zhongyujiang Dec 20, 2021

Dataset<Row> originalFiles = buildValidFileDF(ops.current());

The originalFiles dataset is calculated lazily, so the situation is like this:
expireSnapshots.commit() -> originalFiles.except(validFiles) (still being computed) -> remove_orphan_files starts and deletes some files needed to build originalFiles -> expire_snapshots fails because of a FileNotFoundException.

Member
@ajantha-bhat Dec 20, 2021

Interesting.
Let me reword it for my own understanding...

expireSnapshots.commit() has removed the snapshot entries from the metadata.
But the files (manifest lists or manifest files) referenced by those snapshots were still present in the originalFiles dataframe.
So, when remove_orphan_files ran during that window, since metadata entries no longer existed for those files, it deleted them.
Then, when originalFiles.except(validFiles) was evaluated and originalFiles was actually computed, the manifest lists or manifest files that remove_orphan_files had deleted were read, leading to a FileNotFoundException.

In your PR, expireSnapshots removes the expired snapshot entries from the metadata but doesn't compute the list of removed files (which is what can cause the FileNotFoundException in the scenario above), and it also doesn't delete any files, since the output list is empty.
These files will later be cleaned by running remove_orphan_files again. (Also note that cleaning expired files via remove_orphan_files is costlier (slower) than cleaning them in expireSnapshots.)

This PR can solve the issue, but in my opinion it looks hacky, and users would need to know when to set the skipClean configuration and when not to. I suggest we instead block remove_orphan_files while expireSnapshots is running, and vice versa (using an event listener or some kind of lock?). We can wait for others' opinions.
cc: @rdblue, @RussellSpitzer

Member

I agree, we already have APIs for expiring without deleting and IMHO we should be pushing users towards deleting with expire snapshots. So anything that makes it seem like this is not the right way to clean old files should be avoided.

Contributor Author

Your understanding is quite right. But I don't think an event listener or lock is a good idea, because that seems to require introducing external dependencies.

Member

Agree. Your new solution of "trigger calculation for Dataset originalFiles before commit and cache it to reduce the probability of such failure" sounds good to me 👍🏻

@RussellSpitzer
Member

> @ajantha-bhat @rdblue Thanks for your review! Maybe I didn't make it clear: the expire_snapshots failure I hit is not caused by deleting files, but by building the expired-files dataset, because some files needed to build expiredFiles are deleted by remove_orphan_files, which happens to start right after expire_snapshots' commit. So plugging in a dummy consumer to avoid deleting the files has no effect in this situation. This situation is quite rare, but it truly happens. I think a simple way to avoid it is to execute remove_orphan_files only after the expire_snapshots job is done, but our remove_orphan_files procedure is scheduled automatically by a program, so the timing is a bit difficult to control.

I agree with Ryan, the thing to do here is to make it so that "expire snapshots" won't fail in this sort of situation. Basically, just put in a bunch of try blocks so that if we can't determine the correct set of files to remove, we just log warnings for those files and continue as normal. This would result in not cleaning the total set of unused files, since we wouldn't be able to determine the full set, but since RemoveOrphans is running concurrently in this example, that really doesn't matter.

That said, the ExpireSnapshots method is far more efficient than RemoveOrphans, and we should always be encouraging users towards using Expire to delete rather than Remove. For this particular use case I would probably just have the same program that runs expire snapshots also run remove orphans, so that they never run concurrently.
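
To illustrate the pattern (illustration only, not Iceberg's actual internals; readReferencedFiles is a hypothetical helper standing in for the real file listing):

```java
// "Log and continue" idea: if a metadata file was already removed by a
// concurrent cleanup, warn and skip it instead of failing the whole action.
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class BestEffortExpireExample {
  private static final Logger LOG = LoggerFactory.getLogger(BestEffortExpireExample.class);

  static List<String> collectExpiredFiles(List<String> candidateManifests) {
    List<String> expiredFiles = new ArrayList<>();
    for (String manifestPath : candidateManifests) {
      try {
        // Hypothetical helper that reads the files referenced by a manifest.
        expiredFiles.addAll(readReferencedFiles(manifestPath));
      } catch (FileNotFoundException e) {
        LOG.warn("Manifest {} is missing (possibly removed concurrently); skipping", manifestPath, e);
      }
    }
    return expiredFiles;
  }

  static List<String> readReferencedFiles(String manifestPath) throws FileNotFoundException {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```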

@zhongyujiang
Contributor Author

zhongyujiang commented Dec 20, 2021

> That said, the ExpireSnapshots method is far more efficient than RemoveOrphans, and we should always be encouraging users towards using Expire to delete rather than Remove.

This makes sense. I think we can trigger calculation for Dataset originalFiles before commit and cache it to reduce the probability of such failure. @RussellSpitzer What do you think? cc @ajantha-bhat
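
A rough sketch of that idea (assumed to sit inside the action's expire() method, where buildValidFileDF, ops, and expireSnapshots are already in scope; this is not the PR's actual diff):

```java
// Sketch only: materialize the pre-commit file listing before committing the
// snapshot expiration, so a concurrent remove_orphan_files run cannot delete
// the metadata files the lazy Dataset would still need to read.
Dataset<Row> originalFiles = buildValidFileDF(ops.current()).cache();
originalFiles.count(); // force evaluation while the referenced files still exist

expireSnapshots.commit();
ops.refresh();

Dataset<Row> validFiles = buildValidFileDF(ops.current());
Dataset<Row> expiredFiles = originalFiles.except(validFiles);
```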

@rdblue
Contributor

rdblue commented Dec 20, 2021

@zhongyujiang, caching is an optimization and we should not rely on it for correctness. There is no guarantee that cached data will still be available. Plus, it is expensive and causes scaling issues in Spark clusters.

Instead, I think that if a metadata file is missing, we should ignore it and move on. This should warn in that case but otherwise be a best effort delete.

@zhongyujiang zhongyujiang deleted the spark-clean-expired-files-optionally branch April 26, 2022 02:02