Per-directory metadata cache. #57

Merged
merged 27 commits into fsspec:master on Feb 20, 2018

Conversation

asford
Collaborator

@asford commented Jan 4, 2018

  • Test case updates to reflect new caching logic.
  • VCR-test update.
  • Docstring updates to clarify file interface.
  • Documentation updates for caching logic.
  • Profile vs previous cache implementation.
  • Integrate cache-deletion logic and tests from #49 (Remove the 'dirs' attribute from GCSFileSystem when serializing).
  • Update glob logic to restrict search to subdirs and prefixes.

@martindurant Work-in-progress solution for #24 and #21; likely supersedes #22. Would you mind taking a quick look at this pull for a sanity check? I've performed some initial integration and manual testing in my deployment, and this implementation appears to resolve the primary performance issues I've encountered.

Refactors the GCSFileSystem to operate on a per-directory object metadata cache, rather than a full-bucket cache, to support file reads in buckets with multiple directory structures. This resolves performance issues due to full-bucket listing when reading a subset of keys from a bucket or when globbing within a subdirectory of the bucket.
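
For illustration, here is a minimal sketch of the kind of per-directory listing cache this describes, with a cache_timeout expiry; the class and attribute names (DirListingCache, _listings) are hypothetical and this is not the PR's actual code.

import time


class DirListingCache:
    """Cache object listings per directory rather than per bucket (sketch only)."""

    def __init__(self, cache_timeout=None):
        self.cache_timeout = cache_timeout  # seconds; None means never expire
        self._listings = {}  # {normalized directory path: (timestamp, listing)}

    def get(self, directory):
        entry = self._listings.get(directory)
        if entry is None:
            return None
        timestamp, listing = entry
        if (self.cache_timeout is not None
                and time.time() - timestamp > self.cache_timeout):
            # Entry is stale; drop it so the caller re-lists the directory.
            del self._listings[directory]
            return None
        return listing

    def put(self, directory, listing):
        self._listings[directory] = (time.time(), listing)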

Member

@martindurant left a comment

Looks like a good start to me. There is quite a lot of code here, so the proof will have to be in the testing - but I generally agree with the approach.

I feel like there are more private functions and dict creation/extracting than necessary, but this is probably a style question, and maybe your way makes testing easier.

gcsfs/core.py Outdated
@@ -64,19 +65,33 @@ def quote_plus(s):
return s


def norm_path(path):
"""Canonicalize path by split and rejoining."""
# TODO Should canonical path include protocol?
Member

Generally speaking, we should strip the protocol as early as possible within this library.

@@ -159,6 +174,9 @@ class GCSFileSystem(object):
(see description of authentication methods, above)
consistency: 'none', 'size', 'md5'
Check method when writing files. Can be overridden in open().
cache_timeout: float, seconds
Member

I like this idea.

items.extend(page.get("items", []))
next_page_token = page.get('nextPageToken', None)

result = {
Member

Why the dict? As far as I can see, the only place that this is used, we immediately pick out the 'items' key.

Collaborator Author

This is repacking the result as a de-paginated view of the standard GCS object listing. The prefixes list is used later to generate pseudo-directory listings for ls and info calls.
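
For context, a hedged sketch of that de-paginated shape; do_list_page is a hypothetical stand-in for one call to the GCS objects.list endpoint, and the dictionary keys mirror the GCS JSON API response.

def list_objects_depaginated(do_list_page, prefix):
    """Accumulate all pages of a delimited GCS listing into one dict.

    `do_list_page(prefix, page_token)` is a hypothetical callable returning one
    page of a GCS objects.list response as a dict.
    """
    items, prefixes = [], []
    page_token = None
    while True:
        page = do_list_page(prefix, page_token)
        items.extend(page.get('items', []))        # full object metadata records
        prefixes.extend(page.get('prefixes', []))  # pseudo-directory prefixes
        page_token = page.get('nextPageToken')
        if page_token is None:
            break
    # 'items' feeds info()/ls() details; 'prefixes' becomes the pseudo-directories.
    return {'items': items, 'prefixes': prefixes}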

gcsfs/core.py Outdated
@@ -389,7 +554,6 @@ def mkdir(self, bucket, acl='projectPrivate',
predefinedDefaultObjectAcl=default_acl,
json={"name": bucket})
self.invalidate_cache(bucket)
self.invalidate_cache('')

def rmdir(self, bucket):
"""Delete an empty bucket"""
Member

I wonder, if you delete the last key within a given prefix, which calls invalidate cache on the parent, do we expect the apparent directory to disappear?
e.g.,

gcs.ls('bucket/')
['bucket/thing/']
gcs.ls('bucket/thing/')
['bucket/thing/key']
gcs.rm('bucket/thing/key')
gcs.ls('bucket/')
[] # directory should be gone
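
A hypothetical test sketch for this case; the gcs fixture and TEST_BUCKET constant follow the test suite's usual conventions but are assumptions here.

def test_pseudo_dir_disappears_when_emptied(gcs):
    # Assumes the suite's `gcs` fixture and TEST_BUCKET constant.
    path = TEST_BUCKET + '/thing/key'
    with gcs.open(path, 'wb') as f:
        f.write(b'data')
    assert TEST_BUCKET + '/thing/' in gcs.ls(TEST_BUCKET + '/')
    gcs.rm(path)
    # Deleting the last key under the prefix should also remove the
    # apparent directory from the parent listing.
    assert TEST_BUCKET + '/thing/' not in gcs.ls(TEST_BUCKET + '/')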

gcsfs/core.py Outdated
@@ -398,65 +562,77 @@ def rmdir(self, bucket):
for v in self.dirs[''][:]:
if v['name'] == bucket:
self.dirs[''].remove(v)
self.dirs.pop(bucket, None)
Member

We do want to remove the entry from a cached bucket listing, no? Also, the references to dirs should be renamed anyway.

Member

Oh, I see - you expect invalidate, below, to do this. Still true about dirs.

gcsfs/core.py Outdated
if not bucket:
raise ValueError('Cannot walk all of GCS')
raise ValueError(
"walk path must include target bucket: %s" % path)
Member

path is always empty here, so it is not very useful to report it. "Path must include at least a bucket" ?

Collaborator Author

I've included it in the logging in case there is some kind of malformed input string.

gcsfs/core.py Outdated
path = '/'.join([bucket, prefix])
files = self._list_bucket(bucket)

if path.endswith('/'):
Member

So by convention directories end with '/' and files do not? The user may expect walk('bucket/path') to get files below 'bucket/path/' too; also, actual keys may end with '/', although I am not sure how that gets listed with the delimiter. There should be a test for this.

Collaborator Author

This is the current semantics. Walks targeting bucket/key will walk all objects below bucket/key/.
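
A hypothetical test sketch of that semantic; again the gcs fixture and TEST_BUCKET constant are assumed, not taken from the PR.

def test_walk_restricted_to_prefix(gcs):
    # Hypothetical sketch; assumes the suite's `gcs` fixture and TEST_BUCKET.
    for name in ('nested/file1', 'nested/file2', 'other/file3'):
        with gcs.open(TEST_BUCKET + '/' + name, 'wb') as f:
            f.write(b'data')
    # Walking 'bucket/key' should return only objects below 'bucket/key/'.
    assert set(gcs.walk(TEST_BUCKET + '/nested')) == {
        TEST_BUCKET + '/nested/file1',
        TEST_BUCKET + '/nested/file2',
    }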

self._list_bucket(bucket)
# Bucket may be present & viewable, but not owned by
# the current project. Attempt to list.
self._list_objects(path)
Member

exists will be True whether path points to a directory or a file?

gcsfs/core.py Outdated
# Return a pseudo dir for the bucket root
return {
'bucket': bucket,
'name': "/",
Member

This may be what a directory entry looks like, but in user-facing methods the name should be expanded, in this case to 'bucket/'.

@@ -586,9 +788,9 @@ def rm(self, path, recursive=False):
for p in self.walk(path):
self.rm(p)
Member

I wonder if you happen to know if there is a bulk-delete option in GCS?

Contributor

It would be very nice if this were the case. Some of @jhamman's benchmarks with Zarr spend a lot of time removing tiny files.

Collaborator Author

Unfortunately the GCS API doesn't have a bulk-delete operation. There are a number of possibilities to speed up object deletion. The easiest, and what's implemented in gsutil, would be to issue a number of concurrent delete requests. You probably have a better sense of if/how this should be integrated into the existing async run loop, but requests-futures would be an easy, standalone solution.
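
As a rough illustration of the concurrent-delete idea (not gcsfs code), here is a standalone sketch using the standard library's ThreadPoolExecutor with requests; authentication headers are omitted and the function name is made up.

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote

import requests


def concurrent_delete(bucket, keys, session=None, max_workers=16):
    """Issue DELETE requests for many keys in parallel (auth omitted)."""
    session = session or requests.Session()
    urls = ['https://www.googleapis.com/storage/v1/b/%s/o/%s'
            % (bucket, quote(key, safe='')) for key in keys]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(session.delete, urls))
    for response in responses:
        response.raise_for_status()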

Refactor gcsfs to list file contents via prefixed bucket listing,
rather than cached exhaustive bucket listing. In progress, but provides
basic interface compatibility for walk, glob, ls, info. Intended to
support re-addition of metadata caching via the _list_objects interface
to provide prefix-specific listing caches.

Update `info` to retrieve object info via object get.

Add a per-directory listing cache to GCSFS, caching object metadata under
the given directory. Resolves listing requests via the cache, supporting
walk/ls/glob/etc. Resolves `info` requests via the cache if the parent
directory has been listed, otherwise directly requests object data.
Updates cache invalidation logic to operate on path prefixes, allowing
object writes to invalidate their parent/sibling caches rather than the
entire listing cache.
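
A minimal sketch of the prefix-based invalidation described in that last commit message; listings stands in for a hypothetical per-directory cache dict keyed by normalized path.

def invalidate_prefix(listings, path):
    """Drop cached listings affected by a write or delete at `path`.

    `listings` is a hypothetical dict keyed by normalized directory path.
    """
    parent = path.rsplit('/', 1)[0]
    for cached in list(listings):
        # Invalidate the parent listing (which names this object) and any
        # listing at or below the written path, leaving unrelated prefixes alone.
        if cached == parent or cached.startswith(path):
            del listings[cached]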
@martindurant
Member

Total time: 9.03903 s
File: /Users/mdurant/code/gcsfs/gcsfs/gcsfuse.py
Function: getattr at line 67

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    67                                               def getattr(self, path, fh=None):
    68       207          292      1.4      0.0          try:
    69       207      8959980  43284.9     99.1              info = self.gcs.info(''.join([self.root, path]))
    70       103          195      1.9      0.0          except FileNotFoundError:
    71       103         1122     10.9      0.0              raise FuseOSError(ENOENT)
    72       104          210      2.0      0.0          data = {'st_uid': 1000, 'st_gid': 1000}
    73       104           87      0.8      0.0          perm = 0o777
    74
    75       104          166      1.6      0.0          if info['storageClass'] == 'DIRECTORY' or 'bucket' in info['kind']:
    76         3            4      1.3      0.0              data['st_atime'] = 0
    77         3            3      1.0      0.0              data['st_ctime'] = 0
    78         3            2      0.7      0.0              data['st_mtime'] = 0
    79         3           16      5.3      0.0              data['st_mode'] = (stat.S_IFDIR | perm)
    80         3            2      0.7      0.0              data['st_size'] = 0
    81         3            3      1.0      0.0              data['st_blksize'] = 0
    82                                                   else:
    83       101        34912    345.7      0.4              data['st_atime'] = str_to_time(info['timeStorageClassUpdated'])
    84       101        21242    210.3      0.2              data['st_ctime'] = str_to_time(info['timeCreated'])
    85       101        20089    198.9      0.2              data['st_mtime'] = str_to_time(info['updated'])
    86       101          326      3.2      0.0              data['st_mode'] = (stat.S_IFREG | perm)
    87       101          129      1.3      0.0              data['st_size'] = info['size']
    88       101           90      0.9      0.0              data['st_blksize'] = 5 * 2**20
    89       101           85      0.8      0.0              data['st_nlink'] = 1
    90
    91       104           76      0.7      0.0          return data

Total time: 0.121938 s
File: /Users/mdurant/code/gcsfs/gcsfs/gcsfuse.py
Function: readdir at line 93

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                               def readdir(self, path, fh):
    94         1            4      4.0      0.0          path = ''.join([self.root, path])
    95         1           34     34.0      0.0          print("List", path, fh, flush=True)
    96         1       121654 121654.0     99.8          files = self.gcs.ls(path)
    97         1          244    244.0      0.2          files = [os.path.basename(f.rstrip('/')) for f in files]
    98         1            2      2.0      0.0          return ['.', '..'] + files

Merged this branch into my more-fuse branch for this, and added the following to GCSFS's init:

        import line_profiler
        self.prof = line_profiler.LineProfiler(self.getattr, self.gcs.ls,
                                               self.readdir)
        self.prof.enable()
        import atexit, sys
        atexit.register(lambda: self.prof.disable() or
                        self.prof.print_stats(sys.stdout))

and called as

> gcsfuse pangeo-data ~/gcs
# other terminal
> time CLICOLOR=0 /bin/ls ~/gcs/newmann-met-ensemble-netcdf
real	0m8.660s

(the message saying ls happened in the terminal running fuse comes quickly)

@martindurant
Member

Reposted from gitter:
slow directory listing is due to the fact that, although readdir() is returning a totally reasonable list of files and caching the results correctly, ls then does getattr on those files and on others with filenames prefixed with ._. Since these are not in the parent's directory listing (because they don't exist!) a HEAD is called on each of them.

This may be an osx-specific behaviour.

I think, if the parent directory is in the cache, trying to do _get_object() on a file not listed should be NotFound immediately, without trying the HEAD route (use that only where the parent directory isn’t listed).

Your comment `# Should error on missing cache or reprobe?` - yes, it should raise.
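
A sketch of the cache-aware lookup proposed here; the listings dict, the get_object fallback, and the 'path' key are illustrative assumptions rather than the PR's exact internals.

def cached_info(path, listings, get_object):
    """Resolve object info via the parent-directory cache when possible.

    `listings` is the hypothetical per-directory cache; `get_object` performs a
    direct metadata GET for an uncached path.
    """
    parent = path.rsplit('/', 1)[0]
    listing = listings.get(parent)
    if listing is not None:
        for obj in listing['items']:
            if obj['path'] == path:
                return obj
        # Parent directory is cached and the key is absent: fail fast rather
        # than issuing a per-file HEAD/GET (this is what made `ls` slow under
        # FUSE when the OS probed nonexistent '._*' files).
        raise FileNotFoundError(path)
    # Parent not cached: fall back to a direct object metadata request.
    return get_object(path)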

@martindurant
Member

On my branch

$ gcsfuse pangeo-data ~/gcs
# in other terminal
$ time ls ~/gcs/newmann-met-ensemble-netcdf
conus_ens_001.nc  ...

real	0m0.437s

@martindurant
Member

@asford , are you meaning to add more here, or is it only the testing that remains outstanding?

@asford
Collaborator Author

asford commented Feb 9, 2018

@martindurant

Sorry for the slow updates on this; the dissertation has sunk its claws into me...

I'm not intending to push any more logic into this pull; it's mostly just testing updates that remain. I'll rebase 1526f51 into a separate pull with an associated issue.

@martindurant
Member

No worries.

This is an example of the currently failing case (raises exception on current master)

def test_small_flush(gcs):
    with gcs.open(fn, 'wb') as f:
        f.write(b'data')
        f.flush(force=False)

Add decorator-based method tracing to `gcsfuse.GCSFS` and
`core.GCSFileSystem` interface methods. Add `--verbose` command-line
option to `gcsfuse` to support debug logging control.
Prototype `per_dir_cache` integration for gcsfuse. Minimal fixup to
gcsfuse to support directory listing.
Fix error in GCSFS::read() cache key resolution.
Resolve error when writing small partitions via dask.bag.to_textfiles.
Error occurs when partition size is below minimum GCS multipart upload
size.

Close logic in dask.bytes.core calls flush(force=False), followed by
flush(force=True) on GCSFile. Current logic initializes multipart upload
on non-force flush and attempts to write a non-final block below the
minimum GCS upload block size.

Fixup logic to skip flush if buffer size is below minimum upload size on
non-forced flush. This, incidentally, avoids initialization of multipart
upload in cases where final file size will be below the minimum block
size, which was resulting in duplicate uploads for small output
partitions.

Add tracing logic to GCSFile file operations for debugging. Update
`_tracemethod` to perform optional traceback logging at the `DEBUG-1` log
level.
Updates `ls` to return a non-prefix-separated prefix search (needs to be
verified - should this be glob-like?). Fix error from dask.bytes when a
read-only file is flushed. Fixup returning listing with "path"
attribute.
Retry on requests failing due to `google.auth.exceptions.RefreshError`,
partial resolution of fsspec#71.
Resolve error when writing small partitions via dask.bag.to_textfiles
when partition size is below minimum GCS multipart upload
size.

Close logic in dask.bytes.core calls flush(force=False), followed by
flush(force=True) on GCSFile. Current logic initializes multipart upload
on non-force flush and then attempts to write a non-final block
below the minimum GCS upload block size.

Fixup logic to skip flush if buffer size is below minimum upload size on
non-forced flush and instead issue a warning. This, incidentally, avoids
initialization of multipart upload in cases where final file size will
be below the minimum block size, which was resulting in duplicate
uploads for small output partitions.

Update core.py to lift GCS block size limits into module level
constants. Replace use of constants in core.py with symbolic names.
From fsspec#73 review. Defer multipart upload if a simple upload may be at the
specified block size on non-forced flush. Minor reorganization of
`flush` logic to group error handling vs deferral.

Relax block size restrictions on fetch, no longer aligning `range`-ed
fetch requests to block boundaries.

Fix minor logging error in `_fetch`.
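
A hedged sketch of the flush-deferral behaviour these commit messages describe; the constant value, class, and helper names are illustrative placeholders, not gcsfs's actual GCSFile implementation.

import io
import warnings

GCS_MIN_BLOCK_SIZE = 2 ** 18  # illustrative stand-in for the module-level constant


class BufferedGCSWriter:
    """Toy model of the flush deferral; not gcsfs's GCSFile."""

    def __init__(self):
        self.buffer = io.BytesIO()
        self.upload_started = False

    def flush(self, force=False):
        if not force and self.buffer.tell() < GCS_MIN_BLOCK_SIZE:
            # Too little data for a non-final multipart block: defer. If the
            # file stays below the minimum size, a single simple upload will do,
            # avoiding the duplicate uploads seen with tiny dask partitions.
            warnings.warn('flush() below minimum upload size; deferring')
            return
        if not self.upload_started:
            self._initiate_upload()        # begin the multipart/resumable upload
            self.upload_started = True
        self._upload_chunk(final=force)    # final chunk only on forced flush

    def _initiate_upload(self):
        pass  # placeholder: would POST to start a resumable upload session

    def _upload_chunk(self, final=False):
        pass  # placeholder: would PUT the buffered bytes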
@asford
Collaborator Author

asford commented Feb 16, 2018

Noting issues from pangeo-data/pangeo#112:

  • Default directory cache lifetime should be extended, perhaps made indefinite in the default case?
  • Forced read-through of the cache when a file is missing in an exists/info call may not be appropriate. Should it respect the cache? Use a different timeout?

Adds explicit flag to control stacktrace debugging for traced methods.
Reduces log size on test failures.
@asford
Collaborator Author

asford commented Feb 16, 2018

Agreed and updated.

@asford
Collaborator Author

asford commented Feb 16, 2018

@martindurant This is now in a tests-passing state. I've expanded the GCSFileSystem docstring to include the updated object details semantics.

@martindurant
Member

@asford, at some point we floated the idea of restricting the fields that are pulled down with list_objects; we can do that in a future PR (e.g., ls gives just names or names plus simple details, but info() gives either names plus simple details or does a full call for all information). I want to mention it here in case you think any of your work inhibits doing that. I suspect it should be fine.

@asford
Collaborator Author

asford commented Feb 16, 2018

It should be doable. I think we should maintain a cache of the raw results from GCS, which is what this pull implements, and then process the cached results as needed to produce the limited listings.
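
A minimal sketch of how a limited listing could be derived from the cached raw results; listing is the de-paginated 'items'/'prefixes' dict and the selected fields are illustrative.

def limited_ls(listing, detail=False):
    """Derive a names-only or names-plus-simple-details view from the raw cache."""
    if not detail:
        return [obj['name'] for obj in listing['items']] + list(listing['prefixes'])
    return [{'name': obj['name'],
             'size': int(obj.get('size', 0)),
             'updated': obj.get('updated')}
            for obj in listing['items']]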

From review, cleanup `walk` implementation. Fix pseudodir creation on
bucket-level `info` call. Remove `norm_path` todo.
@asford changed the title from "[WIP] Per-directory metadata cache." to "Per-directory metadata cache." on Feb 16, 2018
@martindurant
Member

Sorry, you still seem to end up with a conflict - I expect it's small.

@asford
Collaborator Author

asford commented Feb 16, 2018

I believe this is essentially ready for final review. I'm going to run some more integration-style testing in my environment, and I think there's some testing to be done on pangeo-data/pangeo#112.

@asford
Collaborator Author

asford commented Feb 16, 2018

@martindurant Would you prefer to have this rebased to clean up the commit history, or merged as-is?

@martindurant
Member

I am happy to leave the commit history as is, whichever you prefer.

@asford
Collaborator Author

asford commented Feb 17, 2018

Great! I'm then +1 to merge.

@martindurant
Member

@asford , having made this big effort, would you like to become a committer on this repo?

@asford
Collaborator Author

asford commented Feb 18, 2018

Sounds great! I'd be glad to lend a hand in keeping this feature working; I suspect we'll find a few more bugs in the future.

@mrocklin
Contributor

@asford , having made this big effort, would you like to become a committer on this repo?

+1 !

@martindurant
Member

Sorry, just coming back to this now after a weekend away. @mrocklin, do you have the rights to add @asford? I don't think I do.

@martindurant merged commit 191d4cc into fsspec:master on Feb 20, 2018
@mrocklin
Contributor

I've just sent @asford an invitation to join. Welcome @asford , we're lucky to have you!

@martindurant I've also just set it so that you have admin rights.
