Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

difficult to perform delete_object request instead of delete_objects using S3FileSystem #850

Closed
tboddyspargo opened this issue Feb 14, 2024 · 7 comments

Comments

@tboddyspargo
Copy link

It appears that, while the implementation of _rm in AsyncFileSystem invokes the delete_object API, the S3FileSystem class overrides it with a version that uses the delete_objects API. This is usually fine, but since GCS doesn't support delete_objects, S3FileSystem.rm(path) will fail on GCS.

It would be great if S3FileSystem provided an option to delete with the delete_object API instead of forcing the use of delete_objects.

In the meantime, I've had to resort to doing this:

sync(fs.loop, super(S3FileSystem, fs)._rm, path=final_path, recursive=recursive)

Which allows me to force the use of the _rm method on the AsyncFileSystem (parent class), but doesn't have a blocking/synchronous entrypoint that I could find.

@martindurant
Copy link
Member

Quick question before going further: if you use GCS, why not use gcsfs?

Secondly, rm_file (or async version) does I think exactly what you want, but only on one file at a time, with no recursion/expansion.

@tboddyspargo
Copy link
Author

tboddyspargo commented Feb 14, 2024

Thanks for your quick response, @martindurant !

Quick question before going further: if you use GCS, why not use gcsfs?

That's something we can consider. My initial thought is that we simultaneously want to support AWS S3, GCS, seaweedfs, and localstack with a single common filesystem abstraction. So far s3fs has been able to do that for us pretty well, with the exception of not having convenient access to call the delete_object API (although there may be more inconsistencies we haven't run into yet).

Secondly, rm_file (or async version) does I think exactly what you want, but only on one file at a time, with no recursion/expansion.

I might be missing something, but from what I can tell, rm_file will invoke _rm, which on an instance of S3FileSystem will force the use of the delete_objects API. The implementation of _rm on the parent class (AsyncFileSystem) has exactly what I'm looking for (access to the delete_object endpoint that even supports recursive), but I haven't found a blocking/synchronous entrypoint for that logic on the S3FileSystem class...

@martindurant
Copy link
Member

a single common filesystem abstraction

The various implementations of fsspec are designed to be as close to each other as possible. Unfortunately, there can be many optional features, and S3 is particularly guilty of this, since it's not quite a standard, and the permissioning model is complex.

rm_file will invoke _rm

This is the current implementation of _rm_file:

    async def _rm_file(self, path, **kwargs):
        bucket, key, _ = self.split_path(path)
        self.invalidate_cache(path)

        try:
            await self._call_s3("delete_object", Bucket=bucket, Key=key)
        except ClientError as e:
            raise translate_boto_error(e)

and you can see the call to "delete_object". The AsyncFileSystem's _rm calls _rm_file, which is why calling the superclass works for you. The "bulk delete" offered by S3FileSystem's _rm is more efficient for most people.

@tboddyspargo
Copy link
Author

rm_file will invoke _rm

This is the current implementation of _rm_file:

Thanks for pointing that out. Indeed, the private method S3FileSystem._rm_file would also work for me, although I'd lose out on the convenience of recursive that AsyncFileSystem._rm has.

Unfortunately, there also doesn't appear to be a public/blocking/synchronous entrypoint for calling _rm_file. The public rm_file API doesn't actually invoke _rm_file (copied below from fsspec).

    def rm_file(self, path):
        """Delete a file"""
        self._rm(path)

I think my solution suggestion for s3fs would be to implement an override for the AbstractFileSystem's rm_file method that calls _rm_file instead of _rm (ideally in strong imitation of AsyncFileSystem._rm). However, I'm not sure how taboo it would be to override a method from the spec class...

@tboddyspargo
Copy link
Author

(ideally in strong imitation of AsyncFileSystem._rm). However, I'm not sure how taboo it would be to override a method from the spec class...

Actually, I can see that imitating AsyncFileSystem._rm would not be appropriate since it would diverge from the method signature of AbstractFileSystem.rm_file, so just overriding it to call _rm_file instead of _rm would seem more appropriate.

@martindurant
Copy link
Member

The public rm_file API doesn't actually invoke _rm_file

It does! Sync variants of all the async methods listed in AsyncFileSystem are auto-generated.

In [4]: fs = fsspec.filesystem("s3")

In [6]: fs.rm_file??
Signature: fs.rm_file(path, **kwargs)
Docstring: Delete a file
Source:
    async def _rm_file(self, path, **kwargs):
        bucket, key, _ = self.split_path(path)
        self.invalidate_cache(path)

        try:
            await self._call_s3("delete_object", Bucket=bucket, Key=key)
        except ClientError as e:
            raise translate_boto_error(e)

@tboddyspargo
Copy link
Author

The public rm_file API doesn't actually invoke _rm_file

It does! Sync variants of all the async methods listed in AsyncFileSystem are auto-generated.

Fantastic! I didn't realize that it would be sort of auto-generated. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants