
[Python][C++] S3FileSystem delete_dir() regression in PyArrow 14 #38618

Closed
martin-traverse opened this issue Nov 6, 2023 · 14 comments · Fixed by #38845

@martin-traverse
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

I think I have found a regression for S3FileSystem in PyArrow 14. Using the delete_dir() method with PyArrow 13 will remove a "directory" object and all its children, i.e. it deletes everything with the directory prefix. However, when I switch to PyArrow 14 this no longer works: after calling delete_dir() the object still exists (I am checking with get_file_info() != FileType.NotFound). The issue also seems to affect delete_dir_contents(). The delete_dir() method still works on empty directories, and if I remove all the objects inside a directory and then call delete_dir() it also works.

Could this be something to do with the AWS call for removing objects by prefix? In the native APIs for GCP / Azure this is not possible and objects need to be listed and deleted in batches, but for S3 delete-by-prefix is available, so I'm guessing the S3 FileSystem implementation uses this feature? These low-level details are hidden by the Arrow FS abstraction, though, so while delete_dir() is not working there is no effective way to work around it without writing an implementation against the native AWS SDK - I'd prefer not to do that for the sake of one call!

I have seen this issue on macOS with Python 3.12 and 3.10, and on Ubuntu with Python 3.11. In all cases the Arrow 13 version worked and the Arrow 14 version didn't. I haven't tried Arrow 14 on Windows, but I do know Arrow 13 used to work on Windows. Also, I'm assuming this is in the C++ implementation, so it could affect more than just the Python component, but I have only tested with Python.

Please let me know if there's anything I can do to help diagnose this issue. If it is a regression and it's possible to fix it in a 14.0.1 release that would be amazing, but I don't know how feasible that is.
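
In the meantime, a rough workaround sketch that stays within the pyarrow API - it relies only on the behaviour described above, namely that delete_dir() still works once a directory's contents have been removed; the helper below is only illustrative, not part of pyarrow:

import pyarrow.fs as pa_fs

def delete_dir_workaround(fs: pa_fs.FileSystem, path: str) -> None:
    # Illustrative helper: list everything under `path`, delete the files first,
    # then remove directories deepest-first so every delete_dir() call only ever
    # sees an already-empty directory.
    infos = fs.get_file_info(pa_fs.FileSelector(path, recursive=True))
    dirs = []
    for info in infos:
        if info.type == pa_fs.FileType.File:
            fs.delete_file(info.path)
        elif info.type == pa_fs.FileType.Directory:
            dirs.append(info.path)
    for d in sorted(dirs, key=len, reverse=True):  # children have longer paths than parents
        fs.delete_dir(d)
    fs.delete_dir(path)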

Component(s)

Python

@kou
Member

kou commented Nov 7, 2023

Could you provide a minimal script to reproduce this problem?

If you can build PyArrow locally, could you try git bisect to find the commit that introduced this problem?
See also: https://arrow.apache.org/docs/developers/python.html

@martin-traverse
Contributor Author

martin-traverse commented Nov 7, 2023

I have created a script to reproduce the problem; the script and its output for both version 13 and version 14 are below, showing the difference. It appears to be an issue when the parent dir contains explicit sub-dir objects.

In version 14, an empty dir is deleted ok and so is a dir containing blobs, but a dir containing other dirs is not. It only seems to be an issue when the sub dirs are created explicitly - if you put a blob in the dir with extra slashes in the blob name, the dir and its contents are deleted fine. Also, if you explicitly delete the problematic sub dirs first, then you can delete the parent dir.

I also had a look into the code. It does seem the S3 implementation is listing dir contents and then deleting batches of objects in DoDeleteDirContentsAsync(). So I guess either the listing part or the deleting part isn't working right when the dir contains explicit sub-dir objects.

That's as far as I got before breakfast! Script and results are below. Is this enough for you to find and fix the issue, or do you still need the bisection? Realistically I won't have time to set up all the tooling for that until the weekend.

import pyarrow.fs as pa_fs
import uuid

s3_args = {
    "region": "<region>",
    "access_key": "<access_key_id>",
    "secret_key": "<secret_access_key>"
}

s3_bucket = "<bucket>"

s3fs = pa_fs.S3FileSystem(**s3_args)

test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Single dir deleted: [{dir_info.type == pa_fs.FileType.NotFound}]")

test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
with s3fs.open_output_stream(s3_bucket + "/" + test_dir + "/some_blob.dat") as stream:
  stream.write(b"Some data")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir with content deleted: [{dir_info.type == pa_fs.FileType.NotFound}]")

test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.create_dir(s3_bucket + "/" + test_dir + "/sub_dir")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir with sub dir deleted: [{dir_info.type == pa_fs.FileType.NotFound}]")

test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.create_dir(s3_bucket + "/" + test_dir + "/sub_dir")
s3fs.delete_dir(s3_bucket + "/" + test_dir + "/sub_dir")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir deleted after subdir removed explicitly: [{dir_info.type == pa_fs.FileType.NotFound}]")

test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
with s3fs.open_output_stream(s3_bucket + "/" + test_dir + "/sub_dir/some_blob.dat") as stream:
  stream.write(b"Some data")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir deleted with blob in implicit subdir: [{dir_info.type == pa_fs.FileType.NotFound}]")

test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.create_dir(s3_bucket + "/" + test_dir + "/sub_dir/")
with s3fs.open_output_stream(s3_bucket + "/" + test_dir + "/sub_dir/some_blob.dat") as stream:
  stream.write(b"Some data")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir deleted with blob in explicit subdir: [{dir_info.type == pa_fs.FileType.NotFound}]")

Results with PyArrow 13:

Single dir deleted: [True]
Dir with content deleted: [True]
Dir with sub dir deleted: [True]
Dir deleted after subdir removed explicitly: [True]
Dir deleted with blob in implicit subdir: [True]
Dir deleted with blob in explicit subdir: [True]

Results with PyArrow 14:

Single dir deleted: [True]
Dir with content deleted: [True]
Dir with sub dir deleted: [False]
Dir deleted after subdir removed explicitly: [True]
Dir deleted with blob in implicit subdir: [True]
Dir deleted with blob in explicit subdir: [False]

@martin-traverse
Contributor Author

Also, just to mention, this works as expected for the local and GCS implementations; only the S3 implementation is affected.

@grumBit

grumBit commented Nov 17, 2023

I'm seeing the same problem using PyArrow 14. 😢

@felipecrv
Contributor

felipecrv commented Nov 21, 2023

This was the only significant PR merged between 13.0 and 14.0 and it sounds very plausible that it's the cause of this behavior change: #35440

@jorisvandenbossche
Member

I am trying to reproduce it using our test infrastructure with MinIO. However, I haven't been able to do so, so it might be specific to actual AWS S3 and not reproducible with MinIO.

I ran your reproducer script above using pyarrow 14.0 with a MinIO server running, and I get:

Single dir deleted: [True]
Dir with content deleted: [True]
Dir with sub dir deleted: [True]
Dir deleted after subdir removed explicitly: [True]
Dir deleted with blob in implicit subdir: [True]
Dir deleted with blob in explicit subdir: [True]

And I also tried to translate the reproducer into a test function that can be dropped into test_fs.py using our test infrastructure (covering the two failure cases):

def test_delete_dir_with_explicit_subdir(fs, pathfn):
    skip_fsspec_s3fs(fs)

    d = pathfn('directory/')
    nd = pathfn('directory/nested/')

    # deleting dir with explicit subdir
    fs.create_dir(d)
    fs.create_dir(nd)
    fs.delete_dir(d)
    dir_info = fs.get_file_info(d)
    assert dir_info.type == FileType.NotFound

    # deleting dir with blob in explicit subdir
    d = pathfn('directory2')
    nd = pathfn('directory2/nested')
    f = pathfn('directory2/nested/target-file')

    fs.create_dir(d)
    fs.create_dir(nd)
    with fs.open_output_stream(f) as s:
        s.write(b'data')

    fs.delete_dir(d)
    dir_info = fs.get_file_info(d)
    assert dir_info.type == FileType.NotFound

But this passes for all tested filesystems.

@jorisvandenbossche jorisvandenbossche changed the title S3FileSystem delete_dir() regression in PyArrow 14 [Python][C++] S3FileSystem delete_dir() regression in PyArrow 14 Nov 21, 2023
@jorisvandenbossche
Member

I can reproduce the issue with an actual AWS S3 bucket. Enabling some debug logging and comparing the output of pyarrow 13.0 vs 14.0 for the "Dir with sub dir deleted" example (s3fs.delete_dir(s3_bucket + "/" + test_dir)), I see the following.

With pyarrow 13.0:

  • Listing with test_dir/ as prefix (GET with list-type), which returns a ListBucketResult
  • Listing with sub_dir/ as prefix
  • POST /?delete (passing both test_dir/ and test_dir/sub_dir/ as objects to delete in a Delete XML object)
  • POST /?delete with only the test_dir/sub_dir/ (no idea why this is done a second time)
  • DELETE /test_dir/

With pyarrow 14.0:

  • HEAD /test_dir
  • Listing with test_dir/ as prefix
  • POST /?delete specifying test_dir/sub_dir as the single object to delete (without a trailing /; is that the difference?)
  • DELETE /test_dir/

(and there is no indication that this last DELETE failed, but in practice it did not delete test_dir, nor was sub_dir deleted)

In case this is useful for someone more familiar with S3 / the AWS SDK / our bindings.
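
For reference, this kind of request-level trace can be obtained by raising the AWS SDK log level before the first S3FileSystem is created; a minimal sketch, assuming the standard pyarrow.fs initialization API (where the SDK writes its log output depends on the SDK configuration):

import pyarrow.fs as pa_fs

# Raise the AWS SDK log level so the individual HTTP requests (HEAD, the
# list-type GETs, POST /?delete, DELETE) become visible. This has to happen
# before the first S3FileSystem is created in the process.
pa_fs.initialize_s3(log_level=pa_fs.S3LogLevel.Debug)

s3fs = pa_fs.S3FileSystem(region="<region>", access_key="<access_key_id>",
                          secret_key="<secret_access_key>")
# ... run the reproducer calls from above and compare the logged requests ...

pa_fs.finalize_s3()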

@felipecrv
Contributor

POST /?delete (passing both test_dir/ and test_dir/sub_dir/ as objects to delete in a Delete XML object)
POST /?delete with only the test_dir/sub_dir/ (no idea why this is done a second time)

I suspect it's because using test_dir/sub_dir/ as the prefix leads to the prefix itself appearing in the listing response.

POST /?delete with specifying test_dir/sub_dir as single object to delete (without trailing /, is that the difference?)

Yes, this would be the issue, because an empty object with key test_dir/sub_dir/ is created to indicate that the test_dir/sub_dir directory exists (something that CreateEmptyDir does and DeleteDir has to undo).

@jorisvandenbossche
Member

So before, we had a custom WalkForDeleteDirAsync; after #35440, DoDeleteDirContentsAsync uses a FileSelector / GetFileInfo, and that returns the paths for directories without a trailing slash?

@felipecrv
Contributor

I don't know exactly what introduced the issue, but from your trace it's clear that the DELETE for test_dir/sub_dir/ is missing. test_dir/sub_dir and test_dir/sub_dir/ are different objects from S3's point of view. It could be that test_dir/sub_dir/ is present in the listing, but when the request for deletion is sent, the test_dir/sub_dir key is used instead.
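
To illustrate the distinction outside of Arrow, a minimal sketch using boto3 (not used anywhere in this issue, purely illustrative): the test_dir/sub_dir/ marker object survives a delete of the key without the trailing slash, which is consistent with the trace above.

import boto3

s3 = boto3.client("s3", region_name="<region>")

# An empty object whose key ends in "/" is the marker left behind for an
# explicitly created directory.
s3.put_object(Bucket="<bucket>", Key="test_dir/sub_dir/", Body=b"")

# Deleting the key *without* the trailing slash targets a different (and here
# non-existent) object; S3 reports success, but the marker object survives.
s3.delete_object(Bucket="<bucket>", Key="test_dir/sub_dir")

resp = s3.list_objects_v2(Bucket="<bucket>", Prefix="test_dir/")
print([obj["Key"] for obj in resp.get("Contents", [])])  # ['test_dir/sub_dir/']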

@jorisvandenbossche
Member

If I manually ensure that the directory paths get a / appended when iterating over the file paths in DoDeleteDirContentsAsync, with the following quick patch:

--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2408,7 +2408,11 @@ class S3FileSystem::Impl : public std::enable_shared_from_this<S3FileSystem::Imp
                   std::vector<std::string> file_paths;
                   for (const auto& file_info : file_infos) {
                     DCHECK_GT(file_info.path().size(), bucket.size());
-                    file_paths.push_back(file_info.path().substr(bucket.size() + 1));
+                    auto file_path = file_info.path().substr(bucket.size() + 1);
+                    if (file_info.IsDirectory()) {
+                      file_path = file_path + kSep;
+                    }
+                    file_paths.push_back(file_path);
                   }
                   scheduler->AddSimpleTask(
                       [=, file_paths = std::move(file_paths)] {

Then all the reproducer cases from above work (all print True). However, I'm not fully sure that is the correct fix (are we sure the path never already has a trailing slash?).

It is the FileSelector that returns the paths without a / for directories:

In [18]: s3fs.get_file_info(pa_fs.FileSelector(s3_bucket + "/test_dir/", recursive=True))
Out[18]: 
[<FileInfo for 'test-bucket-joris/test_dir/sub_dir': type=FileType.Directory>,
 <FileInfo for 'test-bucket-joris/test_dir/sub_dir/some_blob.dat': type=FileType.File, size=9>]

So assuming for a moment that FileSelector always does that (and we use FileSelector within DeleteDirContents right now), we can maybe assume it is always safe to add a / to the path for directories (and we also must do that, to keep to the contract that "folders" in S3 are objects with a trailing /).

@jorisvandenbossche
Member

Opened a PR with the patch I showed above -> #38845

@felipecrv
Contributor

@jorisvandenbossche I think this is the right fix. FileInfo is probably normalizing the paths of "files" and "directories" to never have a trailing /, like a real filesystem API would do. But to ask S3 to delete a directory (a concept it doesn't have), the client has to request the deletion of the empty object with a key ending in /, as that is the convention used to represent the existence of an empty directory on an object store like S3.

@raulcd raulcd modified the milestones: 14.0.2, 15.0.0 Nov 28, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Dec 5, 2023
jorisvandenbossche added a commit that referenced this issue Dec 5, 2023
…reated sub-directories (#38845)

### Rationale for this change

See #38618 (comment) and below for the analysis. When deleting the dir contents, we use GetFileInfo with a recursive FileSelector to list all objects to delete, but the resulting file paths for directories don't end in a trailing `/`, so for explicitly created directories we need to append `kSep` as well to properly delete the object.

### Are these changes tested?

I tested them manually with an actual S3 bucket. The problem is that MinIO doesn't exhibit the same issue, so the test I added to our MinIO testing setup doesn't actually exercise it.

### Are there any user-facing changes?

Fixes the regression
* Closes: #38618

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche modified the milestones: 15.0.0, 14.0.2 Dec 5, 2023
raulcd pushed a commit that referenced this issue Dec 6, 2023
@martin-traverse
Contributor Author

This fix worked for us, thank you.

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024