Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the ListRequests for a recursive put for targets ending in "/" #821

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

MatthijsPiek
Copy link
Contributor

Recently we have found that s3.put("/var/tmp/", "s3://test/wibble") and s3.put("/var/tmp/", "s3://test/wibble/") result in one ListObjectsV2 per directory which is written.

Investigation indicates it is the _exists check to see if the bucket you are writing to really exists. It does this for every directory. In practice we have seen 12 ListObjectsV2 calls for 1 PutObject.

This pull request resolves this for targets with do not end in "/" because it leverages the dircache state from the isdir() call to see if this bucket already exists. Targets which do end in "/" skip the isdir() call and still exhibit the old behaviour. Best would be if there would be a way to cache the _exists result or maybe disable the bucket creation entirely.
I didn't know how to cache the _exists result in the current DirCache setup.

To test this properly I created a fixture which allows you to count the requests made to S3. That may be useful for other similar problems as well.

@MatthijsPiek
Copy link
Contributor Author

Sorry, I broke some tests. I don't think I can justify spending more time on this, so not sure I'll be able to remedy that. Maybe some parts of this pull request are valuable to you.

@martindurant
Copy link
Member

To test this properly I created a fixture which allows you to count the requests made to S3. That may be useful for other similar problems as well.

This certainly sounds useful.

I'll look through the rest of your work in the near future.

@martindurant
Copy link
Member

I am contemplating fsspec/filesystem_spec#1592 as an explicit override point, so that s3fs can control how it creates a "directory tree" and so enable not making the exists() calls without having to mess about with the directory cache. The s3fs (and gcsfs, adlfs) versions would select the paths to attempt to create as follows:

set(p.lsplit("/", 1)[0] for p in paths)

@MatthijsPiek
Copy link
Contributor Author

Sounds good. Using the dircache for this isn't straightforward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants