New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python][C++] S3FileSystem delete_dir() regression in PyArrow 14 #38618
Comments
Could you provide a minimal script to reproduce this problem? If you can build PyArrow on local, could you try |
I have created a script to reproduce the problem, the script and outputs for both version 13 and version 14 are below which show the differences. It appears to be an issue when the parent dir contains explicit sub dir objects. In version 14, an empty dir is deleted ok and so is a dir containing blobs, but a dir containing other dirs is not. It only seems to be an issue when the sub dirs are created explicitly - if you put a blob in the dir with extra slashes in the blob name, the dir and content is deleted fine. Also if you explicitly delete the problematic sub dirs, then you can delete the parent dir. I also had a look into the code. It does seem the S3 implementation is listing dir contents and then deleting batches of objects in DoDeleteDirContentsAsync(). So I guess either the listing part of the deleting part isn't working right when the dir contains explicit sub dir objects. That's as far as I got before breakfast! Script and results are below. Is this enough for you to find and fix the issue or do you still need the bisection? Realistically I won't have time to set up all the tooling etc for that until the weekend.
Results with PyArrow 13:
Results with PyArrow 14:
|
Also just to mention this does work as expected for the local and GCS implementations, only the S3 implementation is affected. |
I'm seeing the same problem using PyArrow 14. 😢 |
This was the only significant PR merged between 13.0 and 14.0 and it sounds very plausible that it's the cause of this behavior change: #35440 |
I am trying to reproduce it using our test infrastructure with minio. However, I haven't been able to do so, and so it might be specific to actual AWS S3, and not reproducible with minio. I ran your reproducer script above using pyarrow 14.0 and a minio server running, and I get:
And I also tried to mimic the reproducer into a test function that can be dropped into def test_delete_dir_with_explicit_subdir(fs, pathfn):
skip_fsspec_s3fs(fs)
d = pathfn('directory/')
nd = pathfn('directory/nested/')
# deleting dir with explicit subdir
fs.create_dir(d)
fs.create_dir(nd)
fs.delete_dir(d)
dir_info = fs.get_file_info(d)
assert dir_info.type == FileType.NotFound
# deleting dir with blob in explicit subdir
d = pathfn('directory2')
nd = pathfn('directory2/nested')
f = pathfn('directory2/nested/target-file')
fs.create_dir(d)
fs.create_dir(nd)
with fs.open_output_stream(f) as s:
s.write(b'data')
fs.delete_dir(d)
dir_info = fs.get_file_info(d)
assert dir_info.type == FileType.NotFound But this passes for all tested filesystems. |
I can reproduce the issue with an actual AWS S3 bucket. Enabling some debug logging, and comparing the output using pyarrow 13.0 vs 14.0 using the example of "Dir with subdir deleted" ( With pyarrow 13.0:
With pyarrow 14.0:
(and there is no indication that this last DELETE failed, but in practice it did not delete the test_dir, and neither the sub_dir was deleted) In case this is useful for someone more familiar with S3 / the AWS SDK / our bindings. |
I suspect it's because
Yes, this would be the issue because an empty object with key |
So before we had a custom |
I don't know exactly what introduced the issue, but from your trace, it's clear that the |
If I manually ensure that the directory paths get a --- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2408,7 +2408,11 @@ class S3FileSystem::Impl : public std::enable_shared_from_this<S3FileSystem::Imp
std::vector<std::string> file_paths;
for (const auto& file_info : file_infos) {
DCHECK_GT(file_info.path().size(), bucket.size());
- file_paths.push_back(file_info.path().substr(bucket.size() + 1));
+ auto file_path = file_info.path().substr(bucket.size() + 1);
+ if (file_info.IsDirectory()) {
+ file_path = file_path + kSep;
+ }
+ file_paths.push_back(file_path);
}
scheduler->AddSimpleTask(
[=, file_paths = std::move(file_paths)] { Then all the reproducer cases from above now work (all print It is the FileSelector that returns the paths without a
So assuming for a moment that |
…itly created sub-directories
Opened a PR with the patch I showed above -> #38845 |
@jorisvandenbossche I think this is the right fix. |
…reated sub-directories (#38845) ### Rationale for this change See #38618 (comment) and below for the analysis. When deleting the dir contents, we use a GetFileInfo with recursive FileSelector to list all objects to delete, but when doing that the file paths for directories don't end in a trailing `/`, so for deleting explicitly created directories we need to add the `kSep` here as well to properly delete the object. ### Are these changes tested? I tested them manually with an actual S3 bucket. The problem is that MinIO doesn't have the same problem, and so it's not actually tested with the test I added using our MinIO testing setup. ### Are there any user-facing changes? Fixes the regression * Closes: #38618 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…reated sub-directories (#38845) ### Rationale for this change See #38618 (comment) and below for the analysis. When deleting the dir contents, we use a GetFileInfo with recursive FileSelector to list all objects to delete, but when doing that the file paths for directories don't end in a trailing `/`, so for deleting explicitly created directories we need to add the `kSep` here as well to properly delete the object. ### Are these changes tested? I tested them manually with an actual S3 bucket. The problem is that MinIO doesn't have the same problem, and so it's not actually tested with the test I added using our MinIO testing setup. ### Are there any user-facing changes? Fixes the regression * Closes: #38618 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This fix worked for us, thank you. |
…itly created sub-directories (apache#38845) ### Rationale for this change See apache#38618 (comment) and below for the analysis. When deleting the dir contents, we use a GetFileInfo with recursive FileSelector to list all objects to delete, but when doing that the file paths for directories don't end in a trailing `/`, so for deleting explicitly created directories we need to add the `kSep` here as well to properly delete the object. ### Are these changes tested? I tested them manually with an actual S3 bucket. The problem is that MinIO doesn't have the same problem, and so it's not actually tested with the test I added using our MinIO testing setup. ### Are there any user-facing changes? Fixes the regression * Closes: apache#38618 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.
I think I have found a regression for S3FileSystem in PyArrow 14. Using the delete_dir() method with PyArrow 13 will remove a "directory" object and all its children, i.e. it will delete everything with the directory prefix. However when I switch to PyArrow 14 this no longer works, after calling delete_dir() the object still exists (I am checking with get_file_info() != FileType.NotFound). The issue also seems to affect delete_dir_contents(). The delete_dir() method still works on empty directories, and if I remove all the objects inside a directory and then call delete_dir() it will also work.
Could this be something to do with the AWS call for removing objects by prefix? In the native APIs for GCP / Azure this is not possible and objects need to be listed and deleted in batches, but for S3 delete by prefix is available so I'm guessing the S3 FileSystem implementation uses this feature? These low level details are hidden by the Arrow FS abstraction though, so while delete_dir() is not working there is no effective way to work around without writing an implementation in the native AWS SDK - I'd prefer not to do that for the sake of one call!
I have seen this issue on macOS with Python 3.12 and 3.10, also on Ubuntu with Python 3.11. In all cases the Arrow 13 version worked and the Arrow 14 version didn't. I haven't tried Arrow 14 on Windows but I do know Arrow 13 used to work on Windows. Also I'm assuming this is in the C implementation so it could affect more than just the Python component, but I have only tested with Python.
Please let me know if there's anything I can do to help diagnose this issue. If it is a regression and it's possible to fix with a 14.0.1 that would be amazing but I don't know how possible that is.
Component(s)
Python
The text was updated successfully, but these errors were encountered: