
Remove the 'dirs' attribute from GCSFileSystem when serializing #49

Merged · 7 commits · Jan 4, 2018

Conversation

@mrocklin (Contributor)

This object can grow quite large and is only for caching performance.

I haven't yet figured out the testing bits with VCR
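
For context, the usual pattern for dropping a cache attribute during pickling looks roughly like this (a minimal sketch, not necessarily the exact diff in this PR):

```python
class GCSFileSystem:
    def __getstate__(self):
        # Drop the (potentially huge) listing cache before pickling;
        # it is only a performance optimization and can be rebuilt lazily.
        state = self.__dict__.copy()
        state.pop('dirs', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.dirs = {}  # start empty after unpickling; refilled on demand
```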

@rabernat (Contributor)

Does dirs contain the full list of targets in the store?

@mrocklin (Contributor, author)

I didn't check its exact contents. They were quite large though. I suspect it was more-or-less equivalent to each of the directories that correspond to the blocks of the array.

@rabernat (Contributor)

So when there are tens of thousands of blocks, I expect this could become a performance bottleneck.

@mrocklin (Contributor, author)

Yes, especially since we serialize this mapping in each of the tens of thousands of tasks that load a single block.
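
A quick way to see the cost (a hypothetical check; that dirs is a plain dict attribute is an assumption):

```python
import pickle

before = len(pickle.dumps(gcs))  # with the populated listing cache
gcs.dirs = {}                    # drop the cache (assumed dict attribute)
after = len(pickle.dumps(gcs))   # roughly what each task actually needs
print(before, after)
```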

@martindurant (Member)

Correct, it's a list of all entries, with metadata, for any buckets that have been listed so far.
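
Illustratively, the cache might be shaped like this (an assumption about its structure, not the actual code):

```python
gcs.dirs = {
    'my-bucket': [
        {'name': 'my-bucket/data/blob-0', 'size': 1048576},
        {'name': 'my-bucket/data/blob-1', 'size': 1048576},
        # ... one entry per object listed so far, with full metadata
    ],
}
```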

@martindurant (Member)

However, if dirs is not passed, we must ensure that accessing a set of blocks in a mapping does not itself trigger a file listing, which would be even slower, since it would download all the same listing data from GCS again. That will only be true once #22 is merged.

@mrocklin (Contributor, author)

> However, if dirs is not passed, we must ensure that accessing a set of blocks in a mapping does not itself trigger a file listing, which would be even slower, since it would download all the same listing data from GCS again. That will only be true once #22 is merged.

Even then, though, that listing will occur in a distributed fashion and won't be bottlenecked on the client.

@martindurant (Member)

I have cleaned up #22 for merging, but it still has two failures: they occur only with VCR, not in reality, and life may be far too short to figure out how to fix VCR.

```python
with gcs_maker() as gcs:
    import pickle
    gcs2 = pickle.loads(pickle.dumps(gcs))
    gcs['abcdefg'] = b'1234567'
```
@martindurant (Member), on the diff above:

gcs is a GCSFileSystem, here, not a mapping. This should have been a new pickle test in test_mapping.py?

@martindurant (Member)

Oh, I see you have that below - then this should not have been changed, I think.

@mrocklin (Contributor, author)

Right. My mistake. I've altered this test to use the traditional open interface.
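
For reference, the reworked test would look roughly like this (a sketch; `TEST_BUCKET` is an assumed constant from the test suite):

```python
with gcs_maker() as gcs:
    import pickle
    gcs2 = pickle.loads(pickle.dumps(gcs))
    # write via the file-open interface, not mapping-style assignment
    with gcs2.open(TEST_BUCKET + '/abcdefg', 'wb') as f:
        f.write(b'1234567')
```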

@asford (Collaborator) commented Jan 3, 2018

@martindurant Is there a specific use case in which the exhaustive dirs cache is extremely useful?

Even generating this cache is causing major perf issues in my use case, and I'm considering benchmarking an implementation that forgoes this cache entirely in favor of on-demand calls using the standard prefix/delimiter API. Would you be willing to benchmark a pull with that functionality if I provide an implementation?

@martindurant (Member)

@asford: yes, when reading many files in the same bucket, which is pretty common. Every open file needs to know its size, so it's better to get all of them in one shot rather than having to make HEAD calls for each one.
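
The trade-off in concrete terms (a sketch; `detail=True` on ls is assumed from the s3fs-like interface):

```python
# One round trip per file: N metadata requests for N files
sizes = {path: gcs.info(path)['size'] for path in paths}

# One (paged) listing populates sizes for everything under the prefix
sizes = {f['name']: f['size'] for f in gcs.ls('bucket/prefix', detail=True)}
```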

@martindurant (Member)

To be sure, I'd be pleased to see prefix/delimiter usage in ls and related functions. s3fs may be a useful model here.
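
For concreteness, a directory-scoped listing against the GCS JSON API could look like this (a sketch using requests directly, with paging omitted; gcsfs's own session and auth handling would differ):

```python
import requests

def list_dir(bucket, prefix, token):
    """List one 'directory' via prefix/delimiter instead of walking
    the whole bucket."""
    r = requests.get(
        'https://www.googleapis.com/storage/v1/b/%s/o' % bucket,
        params={'prefix': prefix, 'delimiter': '/'},
        headers={'Authorization': 'Bearer %s' % token})
    r.raise_for_status()
    out = r.json()
    # 'items': objects directly under the prefix;
    # 'prefixes': common prefixes up to the next '/' (sub-directories)
    return out.get('items', []), out.get('prefixes', [])
```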

@asford (Collaborator) commented Jan 3, 2018

Yes, I agree that this case can/should be optimized. Just to clarify, this would be a case in which you're reading many blobs that have a shared, delimited prefix? (I.e., many blobs under the same directory.) In this context, moving to a per-directory listing and maintaining a per-directory cache would be ideal.

Would a layout with 100 directories, each with 10,000 files, be a reasonable benchmark case? I'd then like to measure time-to-ls of an individual file, time-to-glob a collection of files from a repo, and time-to-first-byte of an individual file read.

```
test-bucket/
    test-dir-0/[blob-0, blob-1, ..., blob-9999]
    test-dir-1/[blob-0, blob-1, ..., blob-9999]
    ...
    test-dir-99/[blob-0, blob-1, ..., blob-9999]
```
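
A rough timing harness for that layout might be (method names follow the GCSFileSystem interface; the paths are assumptions):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    t0 = time.time()
    yield
    print('%s: %.3fs' % (label, time.time() - t0))

def run_benchmark(gcs, bucket='test-bucket'):
    with timed('ls single file'):
        gcs.ls(bucket + '/test-dir-0/blob-0')
    with timed('glob one directory'):
        gcs.glob(bucket + '/test-dir-0/*')
    with timed('time to first byte'):
        with gcs.open(bucket + '/test-dir-0/blob-0', 'rb') as f:
            f.read(1)
```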

@asford mentioned this pull request on Jan 4, 2018
@martindurant (Member)

I am not certain what the best benchmark scenario is. The extremes would range from no prefixes to a deeply nested structure, and from accessing a single file to accessing all files. 100 × 10,000 files seems like an awful lot; the current implementation may never finish.
