
Remove the 'dirs' attribute from GCSFileSystem when serializing #49

Merged · 7 commits · Jan 4, 2018

Conversation

@mrocklin (Contributor)

This object can grow quite large and is only for caching performance.

I haven't yet figured out the testing bits with VCR
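
For context, the usual pattern for dropping a cache attribute during pickling looks roughly like this (a minimal sketch, not necessarily the exact diff in this PR):

```python
class GCSFileSystem:
    def __getstate__(self):
        # Drop the (potentially huge) listing cache before pickling;
        # it is only a performance optimization and can be rebuilt lazily.
        state = self.__dict__.copy()
        state.pop('dirs', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.dirs = {}  # start empty after unpickling; refilled on demand
```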

@rabernat (Contributor)

Does dirs contain the full list of targets in the store?

@mrocklin (Contributor, author)

I didn't check its exact contents. They were quite large though. I suspect it was more-or-less equivalent to each of the directories that correspond to the blocks of the array.

@rabernat (Contributor)

So when there are tens of thousands of blocks, I expect this could become a performance bottleneck.

@mrocklin (Contributor, author)

Yes, especially since we serialize this mapping in each of the tens of thousands of tasks that load a single block.
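
A quick way to see the cost (a hypothetical check; that dirs is a plain dict attribute is an assumption):

```python
import pickle

before = len(pickle.dumps(gcs))  # with the populated listing cache
gcs.dirs = {}                    # drop the cache (assumed dict attribute)
after = len(pickle.dumps(gcs))   # roughly what each task actually needs
print(before, after)
```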

@martindurant (Member)

Correct, it's a list of all entries, with metadata, for any buckets that have been listed so far.
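
Illustratively, the cache might be shaped like this (an assumption about its structure, not the actual code):

```python
gcs.dirs = {
    'my-bucket': [
        {'name': 'my-bucket/data/blob-0', 'size': 1048576},
        {'name': 'my-bucket/data/blob-1', 'size': 1048576},
        # ... one entry per object listed so far, with full metadata
    ],
}
```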

@martindurant (Member)

However, if dirs is not passed, we must ensure that accessing a set of blocks in a mapping does not itself trigger a file listing, which would be even slower, since it would download all the same listing data from GCS again. That will only be true once #22 is merged.

@mrocklin (Contributor, author)

> However, if dirs is not passed, we must ensure that accessing a set of blocks in a mapping does not itself trigger a file listing, which would be even slower, since it would download all the same listing data from GCS again. That will only be true once #22 is merged.

Even then, though, that listing will occur in a distributed fashion and won't be bottlenecked on the client.

@martindurant (Member)

I have cleaned up #22 for merging, but it still has two failures: they occur only with VCR, not in reality, and life may be far too short to figure out how to fix VCR.

```python
with gcs_maker() as gcs:
    import pickle
    gcs2 = pickle.loads(pickle.dumps(gcs))
    gcs['abcdefg'] = b'1234567'
```
@martindurant (Member), on the diff above:

gcs is a GCSFileSystem, here, not a mapping. This should have been a new pickle test in test_mapping.py?

@martindurant (Member)

Oh, I see you have that below - then this should not have been changed, I think.

@mrocklin (Contributor, author)

Right. My mistake. I've altered this test to use the traditional open interface.
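
For reference, the reworked test would look roughly like this (a sketch; `TEST_BUCKET` is an assumed constant from the test suite):

```python
with gcs_maker() as gcs:
    import pickle
    gcs2 = pickle.loads(pickle.dumps(gcs))
    # write via the file-open interface, not mapping-style assignment
    with gcs2.open(TEST_BUCKET + '/abcdefg', 'wb') as f:
        f.write(b'1234567')
```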

@asford (Collaborator) commented Jan 3, 2018

@martindurant Is there a specific use case in which the exhaustive dirs cache is extremely useful?

Even generating this cache is causing major perf issues in my use case, and I'm considering benchmarking an implementation that forgoes this cache entirely in favor of on-demand calls using the standard prefix/delimiter API. Would you be willing to benchmark a pull with that functionality if I provide an implementation?

@martindurant (Member)

@asford: yes, when reading many files in the same bucket, which is pretty common. Every open file needs to know its size, so it's better to get all of them in one shot rather than having to make HEAD calls for each one.
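
The trade-off in concrete terms (a sketch; `detail=True` on ls is assumed from the s3fs-like interface):

```python
# One round trip per file: N metadata requests for N files
sizes = {path: gcs.info(path)['size'] for path in paths}

# One (paged) listing populates sizes for everything under the prefix
sizes = {f['name']: f['size'] for f in gcs.ls('bucket/prefix', detail=True)}
```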

@martindurant (Member)

To be sure, I'd be pleased to see prefix/delimiter usage in ls and related functions. s3fs may be a useful model here.
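
For concreteness, a directory-scoped listing against the GCS JSON API could look like this (a sketch using requests directly, with paging omitted; gcsfs's own session and auth handling would differ):

```python
import requests

def list_dir(bucket, prefix, token):
    """List one 'directory' via prefix/delimiter instead of walking
    the whole bucket."""
    r = requests.get(
        'https://www.googleapis.com/storage/v1/b/%s/o' % bucket,
        params={'prefix': prefix, 'delimiter': '/'},
        headers={'Authorization': 'Bearer %s' % token})
    r.raise_for_status()
    out = r.json()
    # 'items': objects directly under the prefix;
    # 'prefixes': common prefixes up to the next '/' (sub-directories)
    return out.get('items', []), out.get('prefixes', [])
```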

@asford (Collaborator) commented Jan 3, 2018

Yes, I agree that this case can/should be optimized. Just to clarify, this would be a case in which you're reading many blobs that have a shared, delimited prefix? (I.e., many blobs under the same directory.) In this context, moving to a per-directory listing and maintaining a per-directory cache would be ideal.

Would a layout with 100 directories, each with 10,000 files, be a reasonable benchmark case? I'd then like to measure time-to-ls of an individual file, time-to-glob a collection of files from a repo, and time-to-first-byte of an individual file read.

```
test-bucket/
    test-dir-0/[blob-0, blob-1, ..., blob-9999]
    test-dir-1/[blob-0, blob-1, ..., blob-9999]
    ...
    test-dir-99/[blob-0, blob-1, ..., blob-9999]
```
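
A rough timing harness for that layout might be (method names follow the GCSFileSystem interface; the paths are assumptions):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    t0 = time.time()
    yield
    print('%s: %.3fs' % (label, time.time() - t0))

def run_benchmark(gcs, bucket='test-bucket'):
    with timed('ls single file'):
        gcs.ls(bucket + '/test-dir-0/blob-0')
    with timed('glob one directory'):
        gcs.glob(bucket + '/test-dir-0/*')
    with timed('time to first byte'):
        with gcs.open(bucket + '/test-dir-0/blob-0', 'rb') as f:
            f.read(1)
```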

@asford mentioned this pull request on Jan 4, 2018
@martindurant (Member)

I am not certain what the best benchmark scenario is. The extremes would range from no prefixes to a deeply nested structure, and from accessing a single file to accessing all files. 100 × 10,000 files seems like an awful lot; the current implementation may never finish.
