Remove the 'dirs' attribute from GCSFileSystem when serializing #49
Conversation
This object can grow quite large and is only for caching performance.
Does
I didn't check its exact contents. They were quite large though. I suspect it was more-or-less equivalent to each of the directories that correspond to the blocks of the array.
So when there are tens of thousands of blocks, I expect this could become a performance bottleneck.
Yes, especially since we serialize this mapping in each of the tens of thousands of tasks that load a single block.
Correct, it's a list of all entries, with metadata, for any buckets that have been listed so far.
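For illustration only, here is a minimal sketch (not the actual gcsfs code) of how a large cache attribute like this could be dropped when pickling, by overriding `__getstate__` on a toy class:

```python
import pickle


class CachedFileSystem:
    """Toy stand-in for a filesystem object carrying a large listing cache."""

    def __init__(self):
        # bucket name -> list of entry dicts with metadata; can grow very large
        self.dirs = {}

    def __getstate__(self):
        # Copy the instance dict and drop the cache before serialization,
        # so pickled copies stay small and rebuild the cache lazily.
        state = self.__dict__.copy()
        state['dirs'] = {}
        return state


fs = CachedFileSystem()
fs.dirs['my-bucket'] = [{'name': 'block-%d' % i, 'size': 1024} for i in range(100000)]
fs2 = pickle.loads(pickle.dumps(fs))
assert fs2.dirs == {}          # the cache did not travel with the pickle
```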
However, if
Even then, though, that listing will occur in a distributed fashion, and won't be bottlenecked on the client.
I have cleaned up #22 for merging, but it still has two failures: they show up only with VCR, not in reality, and life may be far too short to figure out how to fix VCR.
gcsfs/tests/test_core.py
with gcs_maker() as gcs:
    import pickle
    gcs2 = pickle.loads(pickle.dumps(gcs))
    gcs['abcdefg'] = b'1234567'
gcs is a GCSFileSystem here, not a mapping. This should have been a new pickle test in test_mapping.py?
Oh, I see you have that below - then this should not have been changed, I think.
Right. My mistake. I've altered this test to use the traditional open interface.
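For reference, a sketch of what that altered test might look like; `gcs_maker` comes from the existing test fixtures, while `TEST_BUCKET` and the object name are placeholders, not taken from the actual test suite:

```python
import pickle


def test_gcsfilesystem_pickle():
    # Sketch only: gcs_maker is the existing fixture; TEST_BUCKET is a placeholder.
    with gcs_maker() as gcs:
        gcs2 = pickle.loads(pickle.dumps(gcs))
        path = TEST_BUCKET + '/pickle-test'
        with gcs2.open(path, 'wb') as f:
            f.write(b'1234567')
        with gcs2.open(path, 'rb') as f:
            assert f.read() == b'1234567'
```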
@martindurant Is there a specific use case in which the exhaustive dirs cache is extremely useful? Even generating this cache is causing major perf issues in my use case, and I'm considering benchmarking an implementation that forgoes this cache entirely in favor of on-demand calls using the standard
@asford: yes, when reading many files in the same bucket, which is pretty common. Every open file needs to know its size, so it's better to get all of them in one shot, rather than having to make HEAD calls for each one of them.
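To illustrate the trade-off with a rough, fsspec-style sketch (the bucket name is a placeholder, and the exact method signatures here are assumptions that may not match the gcsfs of this era):

```python
import gcsfs

gcs = gcsfs.GCSFileSystem()

# One listing request returns name and size for every object in the bucket,
# so later opens can look sizes up locally instead of asking the server:
entries = gcs.ls('my-bucket', detail=True)
sizes = {entry['name']: entry['size'] for entry in entries}

# Without that cache, every open needs its own metadata request first:
for name in sizes:
    gcs.info(name)   # one extra round trip per object
```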
To be sure, I'd be pleased to see prefix/delimiter usage in ls and related functions. s3fs may be a useful model here.
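For context, the GCS JSON API supports this directly through the prefix and delimiter parameters of objects.list; a rough sketch (pagination and auth handling omitted, names illustrative only):

```python
import requests


def list_dir(bucket, prefix, token):
    """List only the immediate children of `prefix` server-side, instead of
    pulling every object in the bucket into a local cache."""
    resp = requests.get(
        'https://www.googleapis.com/storage/v1/b/%s/o' % bucket,
        params={'prefix': prefix, 'delimiter': '/'},
        headers={'Authorization': 'Bearer %s' % token},
    )
    resp.raise_for_status()
    out = resp.json()
    files = out.get('items', [])       # objects directly under the prefix
    subdirs = out.get('prefixes', [])  # "directories" one level down
    return files, subdirs
```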
Yes, I agree that this case can/should be optimized. Just to clarify, this would be a case in which you're reading many blobs that have a shared, delimited prefix? (I.e. many blobs under the same directory.) In this context, moving to a per-directory listing, and maintaining a per-directory cache, would be ideal. Would a layout with 100 directories, each with 10,000 files, be a reasonable benchmark case? I'd then like to measure
I am not certain what the best benchmark scenario is. The extremes would run from no prefixes to a deeply nested structure, and from one file access to all-file access. 100*10,000 files seems like an awful lot; the current implementation may never finish.
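As a starting point, a small harness one could use to compare scenarios (bucket layout and names are placeholders; timing a real listing would of course require populating the bucket first):

```python
import pickle
import time

import gcsfs


def benchmark(bucket):
    fs = gcsfs.GCSFileSystem()
    t0 = time.time()
    fs.ls(bucket)   # triggers the bucket listing / cache build
    print('listing took %.1f s' % (time.time() - t0))
    print('pickled filesystem is %d bytes' % len(pickle.dumps(fs)))
```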
I haven't yet figured out the testing bits with VCR.
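For anyone picking this up, a bare-bones vcrpy setup looks roughly like the following (the cassette path, record mode, and test body are placeholders, not gcsfs's actual test configuration):

```python
import vcr

# Record real HTTP interactions once, then replay them from the cassette in CI.
my_vcr = vcr.VCR(
    cassette_library_dir='gcsfs/tests/recordings',
    record_mode='once',
    filter_headers=['Authorization'],  # keep credentials out of the cassette
)

with my_vcr.use_cassette('pickle_test.yaml'):
    run_pickle_test()  # placeholder for the actual test body
```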