Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-5300] Add more gentle file deletion procedure #2970

Closed

Conversation

tillrohrmann
Copy link
Contributor

Before deleting a parent directory always check the directory whether it contains some
files. If not, then try to delete the parent directory.

This will give a more gentle behaviour wrt storage systems which are not instructed to
delete a non-empty directory.

cc: @StefanRRichter

@StephanEwen
Copy link
Contributor

I like the idea.
I am wondering how expensive getting the array of FileStatus for all files in the directory is. HDFS in Hadoop 2 has the option to get a ContentSummary that has the number of files in a directory. I assume that this is more lightweight.

We could extend Flink's FileSystem class to also offer something like that and then use that method.

If we decide to not do that, it would be good to put the repeated logic for "delete if empty" into a utility function.

@tillrohrmann
Copy link
Contributor Author

I like the idea of not listing the status for all contained files. However, I've looked at the implementation of Hadoop's FileSystem#getContentSummary and FileSystem#listLocatedStatus and both implementations call FileSystem#listStatus. Thus, unless this changes in the future, we wouldn't win a lot by calling the getContentSummary instead (actually we would have the overhead of aggregating the different FileStatus objects).

Therefore, I'll refactor the code and add a FileUtils#deleteDirectoryIfEmpty method which will encapsulate the logic.

@tillrohrmann
Copy link
Contributor Author

I've update this PR @StephanEwen. Unfortunately, I couldn't use Hadoop's FileSystem#getContentSummary because it will first request the status for the given path, then list all files and directories if the path is a directory. For each file it will aggregate the FileStatus and then recursively descend into each directory. Thus, I think that this method is not faster.

I've refactored the code to contain a method FileUtils#deletePathIfEmpty to delete the path if it does not contain any files/directories.

@tillrohrmann
Copy link
Contributor Author

Rebasing on the latest master. @StephanEwen since I couldn't find a more efficient way to list the directory contents (wrt Hadoop FS) than listStatus, I think we can merge this PR.

@StephanEwen
Copy link
Contributor

Looks good to me. I would actually suggest to add two tests, one in flink-core based on the local file system, and one in flink-fs-tests, based on HDFS.
That way we make sure that there are no "unexpected behaviors", like some default file status always included (. or .. or whatever).

@tillrohrmann
Copy link
Contributor Author

True. Will add the tests and then merge the PR.

@tillrohrmann tillrohrmann force-pushed the moreGentleFileDeletion branch 2 times, most recently from c8524a6 to 613a636 Compare December 13, 2016 09:04
Before deleting a parent directory always check the directory whether it contains some
files. If not, then try to delete the parent directory.

This will give a more gentle behaviour wrt storage systems which are not instructed to
delete a non-empty directory.

Add test case for more gentle file deletion
@asfgit asfgit closed this in 8a7288e Dec 13, 2016
joseprupi pushed a commit to joseprupi/flink that referenced this pull request Feb 12, 2017
Before deleting a parent directory always check the directory whether it contains some
files. If not, then try to delete the parent directory.

This will give a more gentle behaviour wrt storage systems which are not instructed to
delete a non-empty directory.

Add test case for more gentle file deletion

This closes apache#2970.
@tillrohrmann tillrohrmann deleted the moreGentleFileDeletion branch March 6, 2017 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants