how to disable implicit directory listing? #285

Closed
rabernat opened this issue Feb 14, 2020 · 10 comments · Fixed by #286

@rabernat

I do not like how s3fs implicitly lists directories. This can lead to severe performance degradation for directories with large numbers of objects. Here is a simple example:

import fsspec

# get a public object from s3 via regular http
url_base = 'https://mur-sst.s3.us-west-2.amazonaws.com/zarr/'
mapper_http = fsspec.get_mapper(url_base)
%time _ = mapper_http['analysed_sst/.zarray']
# > Wall time: 280 ms

# now do this with s3fs
file_location = 's3://mur-sst/zarr'
mapper = fsspec.get_mapper(file_location, anon=True)
%time _ = mapper['analysed_sst/.zarray']
# > Wall time: 41.4 s

The reason the second version is so slow is that fsspec is listing the directory contents, even though I know exactly which object I would like. Note that this behavior is unchanged if I add check=False.

"Explicit is better than implicit" is a key part of the Zen of Python. I would much prefer that this directory listing / caching had to be enabled explicitly, with the default behavior simply passing through the objects with as little overhead as possible.

This issue is also raised in #279.

cc @cgentemann

@rabernat rabernat changed the title how to disable implicit directory listing how to disable implicit directory listing? Feb 14, 2020
@martindurant
Member

It would be simple enough to have s3fs get the metadata for only the one file of interest for info and open actions. However, this is still a question of preference: there will be cases where you want to access all the files in some directory, and it would be much faster to do the one list operation and thereafter read from the cache. How do you suppose the API for expressing this preference might look?

@rabernat
Author

> How do you suppose the API for expressing this preference might look?

mapper = fsspec.get_mapper(file_location, anon=True, cache_directory_listing=True)

where the default (cache_directory_listing=False) would have the same behavior and same performance as HTTPFileSystem.

@rabernat
Author

Or even keep caching as is, but just don't list directories unless ls or similar operations are explicitly called.

@martindurant
Member

Right, there are two preferences to choose on:

  • whether info calls ls if the file isn't already cached
  • how we cache listings, whether they expire, etc.

If we don't cache at all, obviously info->ls is a bad idea; if we do, then it may or may not make sense, depending on the situation.

It's worth pointing out that the cost to list a directory is the same as to get details on a single key for small listings (<1000s of entries), so there would be no good reason not to, unless you expect the bucket contents to be volatile.
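The trade-off above can be sketched with a toy model. This is not the actual s3fs implementation; `ToyFS`, `_head_object`, and `_list_dir` are hypothetical stand-ins for the real S3 calls, used only to count how many requests each strategy issues.

```python
class ToyFS:
    """Toy model of the two strategies: answer `info` from a cached
    directory listing, or issue a per-key HEAD request each time."""

    def __init__(self, use_listing_cache=True):
        self.use_listing_cache = use_listing_cache
        self.dircache = {}   # parent directory -> list of entry dicts
        self.head_calls = 0  # number of per-key HEAD requests issued
        self.list_calls = 0  # number of LIST requests issued

    def _head_object(self, path):
        # stand-in for one HEAD request: cheap, touches a single key
        self.head_calls += 1
        return {"name": path, "size": 123}

    def _list_dir(self, parent):
        # stand-in for a LIST request: may page through 1000s of keys
        self.list_calls += 1
        return [{"name": f"{parent}/key{i}", "size": i} for i in range(3)]

    def info(self, path):
        parent = path.rsplit("/", 1)[0]
        if self.use_listing_cache:
            # list once, then answer every later `info` from the cache
            if parent not in self.dircache:
                self.dircache[parent] = self._list_dir(parent)
            for entry in self.dircache[parent]:
                if entry["name"] == path:
                    return entry
            raise FileNotFoundError(path)
        # per-key mode: never list, just HEAD the one object
        return self._head_object(path)
```

With the cache enabled, many `info` calls within one directory cost a single LIST; with it disabled, each call costs one HEAD but never pages through the listing, which is the behavior being asked for here.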

@martindurant
Member

Ref fsspec/filesystem_spec#216
(@TomAugspurger , you may have thoughts on this)

@martindurant
Member

The simplest change looks like this:

--- a/s3fs/core.py
+++ b/s3fs/core.py
@@ -451,7 +451,7 @@ class S3FileSystem(AbstractFileSystem):
                 raise ValueError("version_id cannot be specified if the "
                                  "filesystem is not version aware")
             kwargs['VersionId'] = version_id
-        if self.version_aware:
+        if self.version_aware or self._ls_from_cache(path) is None:
             try:
                 bucket, key = self.split_path(path)
                 out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,

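To illustrate the idea behind the patch: the real `_ls_from_cache` helper lives in fsspec's `AbstractFileSystem`, but a toy version (assuming a `dircache` mapping parent path to a list of entry dicts, which is how fsspec structures its listing cache) might look like this. When it returns `None`, the patched code falls back to a single `head_object` call for just that key instead of listing the whole directory.

```python
def ls_from_cache(dircache, path):
    """Return cached entries matching `path`, or None if the parent
    directory has never been listed (so the caller should HEAD the key)."""
    parent = path.rsplit("/", 1)[0] if "/" in path else ""
    entries = dircache.get(parent)
    if entries is None:
        return None  # no listing cached: fall through to head_object
    # a listing exists: answer from it without touching S3 again
    return [e for e in entries if e["name"] == path]
```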
@rabernat
Author

Hi @martindurant -- could you give some guidance on the best way forward to address this issue? I'm keen to fix this, since it has major performance impacts on datasets stored in S3.

@martindurant
Member

Can you try with the small change above? I am prepared to include it in the codebase, but we still do need a more rigorous way to deal with this.

@rabernat
Author

Hi Martin... sorry I did not get a chance to test this before leaving for vacation last week. Thanks for your quick fix.

@martindurant
Member

pleasure :)
