Per-directory metadata cache. #57

Merged
merged 27 commits into fsspec:master on Feb 20, 2018

Conversation

asford
Collaborator

@asford commented Jan 4, 2018

  • Test case updates to reflect new caching logic.
  • VCR-test update.
  • Docstring updates to clarify file interface.
  • Documentation updates for caching logic.
  • Profile vs previous cache implementation.
  • Integrate cache-deletion logic and tests from #49 (Remove the 'dirs' attribute from GCSFileSystem when serializing).
  • Update glob logic to restrict search to subdirs and prefixes.

@martindurant Work-in-progress solution for #24 and #21; likely supersedes #22. Would you mind taking a quick look at this pull for a sanity check? I've performed some initial integration and manual testing in my deployment, and this implementation appears to resolve the primary performance issues I've encountered.

Refactors the GCSFileSystem to operate on a per-directory object metadata cache, rather than a full-bucket cache, to support file reads in buckets with multiple directory structures. This resolves performance issues due to full-bucket listing when reading a subset of keys from a bucket or when globbing within a subdirectory of the bucket.
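
For illustration, here is a minimal sketch of the kind of per-directory listing cache this describes, with a cache_timeout expiry; the class and attribute names (DirListingCache, _listings) are hypothetical and this is not the PR's actual code.

import time


class DirListingCache:
    """Cache object listings per directory rather than per bucket (sketch only)."""

    def __init__(self, cache_timeout=None):
        self.cache_timeout = cache_timeout  # seconds; None means never expire
        self._listings = {}  # {normalized directory path: (timestamp, listing)}

    def get(self, directory):
        entry = self._listings.get(directory)
        if entry is None:
            return None
        timestamp, listing = entry
        if (self.cache_timeout is not None
                and time.time() - timestamp > self.cache_timeout):
            # Entry is stale; drop it so the caller re-lists the directory.
            del self._listings[directory]
            return None
        return listing

    def put(self, directory, listing):
        self._listings[directory] = (time.time(), listing)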

Member

@martindurant left a comment

Looks like a good start to me. There is quite a lot of code here, so the proof will have to be in the testing - but I generally agree with the approach.

I feel like there are more private functions and dict creation/extracting than necessary, but this is probably a style question, and maybe your way makes testing easier.

gcsfs/core.py Outdated
@@ -64,19 +65,33 @@ def quote_plus(s):
return s


def norm_path(path):
"""Canonicalize path by split and rejoining."""
# TODO Should canonical path include protocol?
Member

Generally speaking, we should strip the protocol as early as possible within this library.

@@ -159,6 +174,9 @@ class GCSFileSystem(object):
(see description of authentication methods, above)
consistency: 'none', 'size', 'md5'
Check method when writing files. Can be overridden in open().
cache_timeout: float, seconds
Member

I like this idea.

items.extend(page.get("items", []))
next_page_token = page.get('nextPageToken', None)

result = {
Member

Why the dict? As far as I can see, the only place that this is used, we immediately pick out the 'items' key.

Collaborator Author

This is repacking the result as a de-paginated view of the standard GCS object listing. The prefixes list is used later to generate pseudo-directory listings for ls and info calls.
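
For context, a hedged sketch of that de-paginated shape; do_list_page is a hypothetical stand-in for one call to the GCS objects.list endpoint, and the dictionary keys mirror the GCS JSON API response.

def list_objects_depaginated(do_list_page, prefix):
    """Accumulate all pages of a delimited GCS listing into one dict.

    `do_list_page(prefix, page_token)` is a hypothetical callable returning one
    page of a GCS objects.list response as a dict.
    """
    items, prefixes = [], []
    page_token = None
    while True:
        page = do_list_page(prefix, page_token)
        items.extend(page.get('items', []))        # full object metadata records
        prefixes.extend(page.get('prefixes', []))  # pseudo-directory prefixes
        page_token = page.get('nextPageToken')
        if page_token is None:
            break
    # 'items' feeds info()/ls() details; 'prefixes' becomes the pseudo-directories.
    return {'items': items, 'prefixes': prefixes}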

gcsfs/core.py Outdated
@@ -389,7 +554,6 @@ def mkdir(self, bucket, acl='projectPrivate',
predefinedDefaultObjectAcl=default_acl,
json={"name": bucket})
self.invalidate_cache(bucket)
self.invalidate_cache('')

def rmdir(self, bucket):
"""Delete an empty bucket"""
Member

I wonder, if you delete the last key within a given prefix, which calls invalidate cache on the parent, do we expect the apparent directory to disappear?
e.g.,

gcs.ls('bucket/')
['bucket/thing/']
gcs.ls('bucket/thing/')
['bucket/thing/key']
gcs.rm('bucket/thing/key')
gcs.ls('bucket/')
[] # directory should be gone
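
A hypothetical test sketch for this case; the gcs fixture and TEST_BUCKET constant follow the test suite's usual conventions but are assumptions here.

def test_pseudo_dir_disappears_when_emptied(gcs):
    # Assumes the suite's `gcs` fixture and TEST_BUCKET constant.
    path = TEST_BUCKET + '/thing/key'
    with gcs.open(path, 'wb') as f:
        f.write(b'data')
    assert TEST_BUCKET + '/thing/' in gcs.ls(TEST_BUCKET + '/')
    gcs.rm(path)
    # Deleting the last key under the prefix should also remove the
    # apparent directory from the parent listing.
    assert TEST_BUCKET + '/thing/' not in gcs.ls(TEST_BUCKET + '/')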

gcsfs/core.py Outdated
@@ -398,65 +562,77 @@ def rmdir(self, bucket):
for v in self.dirs[''][:]:
if v['name'] == bucket:
self.dirs[''].remove(v)
self.dirs.pop(bucket, None)
Member

We do want to remove the entry from a cached bucket listing, no? Also, the references to dirs should be renamed anyway.

Member

Oh, I see - you expect invalidate, below, to do this. Still true about dirs.

gcsfs/core.py Outdated
if not bucket:
raise ValueError('Cannot walk all of GCS')
raise ValueError(
"walk path must include target bucket: %s" % path)
Member

path is always empty here, so it is not very useful to report it. "Path must include at least a bucket" ?

Collaborator Author

I've included it in the logging in case there is some kind of malformed input string.

gcsfs/core.py Outdated
path = '/'.join([bucket, prefix])
files = self._list_bucket(bucket)

if path.endswith('/'):
Member

So by convention directories end with '/' and files do not? The user may expect walk('bucket/path') to get files below 'bucket/path/' too; also, actual keys may end with '/', although I am not sure how that gets listed with the delimiter. There should be a test for this.

Collaborator Author

This is the current semantics. Walks targeting bucket/key will walk all objects below bucket/key/.
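
A hypothetical test sketch of that semantic; again the gcs fixture and TEST_BUCKET constant are assumed, not taken from the PR.

def test_walk_restricted_to_prefix(gcs):
    # Hypothetical sketch; assumes the suite's `gcs` fixture and TEST_BUCKET.
    for name in ('nested/file1', 'nested/file2', 'other/file3'):
        with gcs.open(TEST_BUCKET + '/' + name, 'wb') as f:
            f.write(b'data')
    # Walking 'bucket/key' should return only objects below 'bucket/key/'.
    assert set(gcs.walk(TEST_BUCKET + '/nested')) == {
        TEST_BUCKET + '/nested/file1',
        TEST_BUCKET + '/nested/file2',
    }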

self._list_bucket(bucket)
# Bucket may be present & viewable, but not owned by
# the current project. Attempt to list.
self._list_objects(path)
Member

exists will be True whether path points to a directory or a file?

gcsfs/core.py Outdated
# Return a pseudo dir for the bucket root
return {
'bucket': bucket,
'name': "/",
Member

This may be what a directory entry looks like, but in user-facing methods the name should be expanded, in this case to 'bucket/'.

@@ -586,9 +788,9 @@ def rm(self, path, recursive=False):
for p in self.walk(path):
self.rm(p)
Member

I wonder if you happen to know if there is a bulk-delete option in GCS?

Contributor

It would be very nice if this were the case. Some of @jhamman's benchmarks with Zarr spend a lot of time removing tiny files.

Collaborator Author

Unfortunately the GCS API doesn't have a bulk-delete operation. There are a number of possibilities to speed up object deletion. The easiest, and what's implemented in gsutil, would be to issue a number of concurrent delete requests. You probably have a better sense of if/how this should be integrated into the existing async run loop, but requests-futures would be an easy, standalone solution.
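
As a rough illustration of the concurrent-delete idea (not gcsfs code), here is a standalone sketch using the standard library's ThreadPoolExecutor with requests; authentication headers are omitted and the function name is made up.

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote

import requests


def concurrent_delete(bucket, keys, session=None, max_workers=16):
    """Issue DELETE requests for many keys in parallel (auth omitted)."""
    session = session or requests.Session()
    urls = ['https://www.googleapis.com/storage/v1/b/%s/o/%s'
            % (bucket, quote(key, safe='')) for key in keys]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(session.delete, urls))
    for response in responses:
        response.raise_for_status()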

Refactor gcsfs to list file contents via prefixed bucket listing,
rather than cached exhaustive bucket listing. In progress, but provides
basic interface compatibility for walk, glob, ls, info. Intended to
support re-addition of metadata caching via the _list_objects interface
to provide prefix-specific listing caches.

Update `info` to retrieve object info via object get.

Add a per-directory listing cache to GCSFS, caching object metadata under
the given directory. Resolves listing requests via the cache, supporting
walk/ls/glob/etc. Resolves `info` requests via the cache if the parent
directory has been listed, otherwise directly requests object data.
Updates cache invalidation logic to operate on path prefixes, allowing
object writes to invalidate their parent/sibling caches rather than the
entire listing cache.
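
A minimal sketch of the prefix-based invalidation described in that last commit message; listings stands in for a hypothetical per-directory cache dict keyed by normalized path.

def invalidate_prefix(listings, path):
    """Drop cached listings affected by a write or delete at `path`.

    `listings` is a hypothetical dict keyed by normalized directory path.
    """
    parent = path.rsplit('/', 1)[0]
    for cached in list(listings):
        # Invalidate the parent listing (which names this object) and any
        # listing at or below the written path, leaving unrelated prefixes alone.
        if cached == parent or cached.startswith(path):
            del listings[cached]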
@martindurant
Member

Total time: 9.03903 s
File: /Users/mdurant/code/gcsfs/gcsfs/gcsfuse.py
Function: getattr at line 67

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    67                                               def getattr(self, path, fh=None):
    68       207          292      1.4      0.0          try:
    69       207      8959980  43284.9     99.1              info = self.gcs.info(''.join([self.root, path]))
    70       103          195      1.9      0.0          except FileNotFoundError:
    71       103         1122     10.9      0.0              raise FuseOSError(ENOENT)
    72       104          210      2.0      0.0          data = {'st_uid': 1000, 'st_gid': 1000}
    73       104           87      0.8      0.0          perm = 0o777
    74
    75       104          166      1.6      0.0          if info['storageClass'] == 'DIRECTORY' or 'bucket' in info['kind']:
    76         3            4      1.3      0.0              data['st_atime'] = 0
    77         3            3      1.0      0.0              data['st_ctime'] = 0
    78         3            2      0.7      0.0              data['st_mtime'] = 0
    79         3           16      5.3      0.0              data['st_mode'] = (stat.S_IFDIR | perm)
    80         3            2      0.7      0.0              data['st_size'] = 0
    81         3            3      1.0      0.0              data['st_blksize'] = 0
    82                                                   else:
    83       101        34912    345.7      0.4              data['st_atime'] = str_to_time(info['timeStorageClassUpdated'])
    84       101        21242    210.3      0.2              data['st_ctime'] = str_to_time(info['timeCreated'])
    85       101        20089    198.9      0.2              data['st_mtime'] = str_to_time(info['updated'])
    86       101          326      3.2      0.0              data['st_mode'] = (stat.S_IFREG | perm)
    87       101          129      1.3      0.0              data['st_size'] = info['size']
    88       101           90      0.9      0.0              data['st_blksize'] = 5 * 2**20
    89       101           85      0.8      0.0              data['st_nlink'] = 1
    90
    91       104           76      0.7      0.0          return data

Total time: 0.121938 s
File: /Users/mdurant/code/gcsfs/gcsfs/gcsfuse.py
Function: readdir at line 93

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                               def readdir(self, path, fh):
    94         1            4      4.0      0.0          path = ''.join([self.root, path])
    95         1           34     34.0      0.0          print("List", path, fh, flush=True)
    96         1       121654 121654.0     99.8          files = self.gcs.ls(path)
    97         1          244    244.0      0.2          files = [os.path.basename(f.rstrip('/')) for f in files]
    98         1            2      2.0      0.0          return ['.', '..'] + files

Merged this branch into my more-fuse branch for this, and added the following to GCSFS's init:

        import line_profiler
        self.prof = line_profiler.LineProfiler(self.getattr, self.gcs.ls,
                                               self.readdir)
        self.prof.enable()
        import atexit, sys
        atexit.register(lambda: self.prof.disable() or
                        self.prof.print_stats(sys.stdout))

and called as

> gcsfuse pangeo-data ~/gcs
# other terminal
> time CLICOLOR=0 /bin/ls ~/gcs/newmann-met-ensemble-netcdf
real	0m8.660s

(the message saying ls happened in the terminal running fuse comes quickly)

@martindurant
Member

Reposted from gitter:
slow directory listing is due to the fact that, although readdir() is returning a totally reasonable list of files and caching the results correctly, ls then does getattr on those files and on others with filenames prefixed with ._. Since these are not in the parent's directory listing (because they don't exist!) a HEAD is called on each of them.

This may be an osx-specific behaviour.

I think, if the parent directory is in the cache, trying to do _get_object() on a file not listed should be NotFound immediately, without trying the HEAD route (use that only where the parent directory isn’t listed).

Your comment `# Should error on missing cache or reprobe?` - yes, it should raise.
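
A sketch of the cache-aware lookup proposed here; the listings dict, the get_object fallback, and the 'path' key are illustrative assumptions rather than the PR's exact internals.

def cached_info(path, listings, get_object):
    """Resolve object info via the parent-directory cache when possible.

    `listings` is the hypothetical per-directory cache; `get_object` performs a
    direct metadata GET for an uncached path.
    """
    parent = path.rsplit('/', 1)[0]
    listing = listings.get(parent)
    if listing is not None:
        for obj in listing['items']:
            if obj['path'] == path:
                return obj
        # Parent directory is cached and the key is absent: fail fast rather
        # than issuing a per-file HEAD/GET (this is what made `ls` slow under
        # FUSE when the OS probed nonexistent '._*' files).
        raise FileNotFoundError(path)
    # Parent not cached: fall back to a direct object metadata request.
    return get_object(path)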

@martindurant
Member

On my branch

$ gcsfuse pangeo-data ~/gcs
# in other terminal
$ time ls ~/gcs/newmann-met-ensemble-netcdf
conus_ens_001.nc  ...

real	0m0.437s

@martindurant
Member

@asford , are you meaning to add more here, or is it only the testing that remains outstanding?

@asford
Collaborator Author

asford commented Feb 9, 2018

@martindurant

Sorry for the slow updates on this; the dissertation has sunk its claws into me...

I'm not intending to push any more logic into this pull; it's mostly just testing updates that remain. I'll rebase 1526f51 into a separate pull with an associated issue.

@martindurant
Member

No worries.

This is an example of the currently failing case (raises exception on current master)

def test_small_flush(gcs):
    with gcs.open(fn, 'wb') as f:
        f.write(b'data')
        f.flush(force=False)

Add decorator-based method tracing to `gcsfuse.GCSFS` and
`core.GCSFileSystem` interface methods. Add `--verbose` command-line
option to `gcsfuse` to support debug logging control.
Prototype `per_dir_cache` integration for gcsfuse. Minimal fixup to
gcsfuse to support directory listing.
Fix error in GCSFS::read() cache key resolution.
Resolve error when writing small partitions via dask.bag.to_textfiles.
Error occurs when partition size is below minimum GCS multipart upload
size.

Close logic in dask.bytes.core calls flush(force=False), followed by
flush(force=True) on GCSFile. Current logic initializes multipart upload
on non-force flush and attempts to write a non-final block below the
minimum GCS upload block size.

Fixup logic to skip flush if buffer size is below minimum upload size on
non-forced flush. This, incidentally, avoids initialization of multipart
upload in cases where final file size will be below the minimum block
size, which was resulting in duplicate uploads for small output
partitions.

Add tracing logic to GCSFile file operations for debugging. Update
`_tracemethod` to perform optional traceback logging at the `DEBUG-1` log
level.
Updates `ls` to return a non-prefix-separated prefix search (needs to be
verified - should this be glob-like?). Fix error from dask.bytes when a
read-only file is flushed. Fixup returning listing with "path"
attribute.
Retry on requests failing due to `google.auth.exceptions.RefreshError`,
partial resolution of fsspec#71.
Resolve error when writing small partitions via dask.bag.to_textfiles
when partition size is below minimum GCS multipart upload
size.

Close logic in dask.bytes.core calls flush(force=False), followed by
flush(force=True) on GCSFile. Current logic initializes multipart upload
on non-force flush and then attempts to write a non-final block
below the minimum GCS upload block size.

Fixup logic to skip flush if buffer size is below minimum upload size on
non-forced flush and instead issue a warning. This, incidentally, avoids
initialization of multipart upload in cases where final file size will
be below the minimum block size, which was resulting in duplicate
uploads for small output partitions.

Update core.py to lift GCS block size limits into module level
constants. Replace use of constants in core.py with symbolic names.
From fsspec#73 review. Defer multipart upload if a simple upload may be at the
specified block size on non-forced flush. Minor reorganization of
`flush` logic to group error handling vs deferral.

Relax block size restrictions on fetch, no longer aligning `range`-ed
fetch requests to block boundaries.

Fix minor logging error in `_fetch`.
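
A hedged sketch of the flush-deferral behaviour these commit messages describe; the constant value, class, and helper names are illustrative placeholders, not gcsfs's actual GCSFile implementation.

import io
import warnings

GCS_MIN_BLOCK_SIZE = 2 ** 18  # illustrative stand-in for the module-level constant


class BufferedGCSWriter:
    """Toy model of the flush deferral; not gcsfs's GCSFile."""

    def __init__(self):
        self.buffer = io.BytesIO()
        self.upload_started = False

    def flush(self, force=False):
        if not force and self.buffer.tell() < GCS_MIN_BLOCK_SIZE:
            # Too little data for a non-final multipart block: defer. If the
            # file stays below the minimum size, a single simple upload will do,
            # avoiding the duplicate uploads seen with tiny dask partitions.
            warnings.warn('flush() below minimum upload size; deferring')
            return
        if not self.upload_started:
            self._initiate_upload()        # begin the multipart/resumable upload
            self.upload_started = True
        self._upload_chunk(final=force)    # final chunk only on forced flush

    def _initiate_upload(self):
        pass  # placeholder: would POST to start a resumable upload session

    def _upload_chunk(self, final=False):
        pass  # placeholder: would PUT the buffered bytes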
@asford
Collaborator Author

asford commented Feb 16, 2018

Noting issues from pangeo-data/pangeo#112:

  • Default directory cache lifetime should be extended, perhaps made indefinite in the default case?
  • Forced read-through of the cache when a file is missing in an exists/info call may not be appropriate. Should it respect the cache? Use a different timeout?

Adds explicit flag to control stacktrace debugging for traced methods.
Reduces log size on test failures.
@asford
Collaborator Author

asford commented Feb 16, 2018

Agreed and updated.

@asford
Collaborator Author

asford commented Feb 16, 2018

@martindurant This is now in a tests-passing state. I've expanded the GCSFileSystem docstring to include the updated object details semantics.

@martindurant
Member

@asford, at some point we floated the idea of restricting the fields that are pulled down with list_objects; we can do that in a future PR (e.g., ls gives just names or names plus simple details, but info() gives either names plus simple details or does a full call for all information). I want to mention it here in case you think any of your work inhibits doing that. I suspect it should be fine.

@asford
Collaborator Author

asford commented Feb 16, 2018

It should be doable. I think we should maintain a cache of the raw results from GCS, which is what this pull implements, and then process the cached results as needed to produce the limited listings.
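
A minimal sketch of how a limited listing could be derived from the cached raw results; listing is the de-paginated 'items'/'prefixes' dict and the selected fields are illustrative.

def limited_ls(listing, detail=False):
    """Derive a names-only or names-plus-simple-details view from the raw cache."""
    if not detail:
        return [obj['name'] for obj in listing['items']] + list(listing['prefixes'])
    return [{'name': obj['name'],
             'size': int(obj.get('size', 0)),
             'updated': obj.get('updated')}
            for obj in listing['items']]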

From review, cleanup `walk` implementation. Fix pseudodir creation on
bucket-level `info` call. Remove `norm_path` todo.
@asford changed the title from "[WIP] Per-directory metadata cache." to "Per-directory metadata cache." on Feb 16, 2018
@martindurant
Member

Sorry, you still seem to end up with a conflict - I expect it's small.

@asford
Collaborator Author

asford commented Feb 16, 2018

I believe this is essentially ready for final review. I'm going to run some more integration-style testing in my environment, and I think there's some testing to be done on pangeo-data/pangeo#112.

@asford
Collaborator Author

asford commented Feb 16, 2018

@martindurant Would you prefer to have this rebased to clean up the commit history, or merged as-is?

@martindurant
Member

I am happy to leave the commit history as is, whichever you prefer.

@asford
Collaborator Author

asford commented Feb 17, 2018

Great! I'm then +1 to merge.

@martindurant
Member

@asford , having made this big effort, would you like to become a committer on this repo?

@asford
Collaborator Author

asford commented Feb 18, 2018

Sounds great! I'd be glad to lend a hand in keeping this feature working; I suspect we'll find a few more bugs in the future.

@mrocklin
Contributor

@asford , having made this big effort, would you like to become a committer on this repo?

+1 !

@martindurant
Member

Sorry, just coming back to this now after a weekend away. @mrocklin, do you have the rights to add @asford? I don't think I do.

@martindurant merged commit 191d4cc into fsspec:master on Feb 20, 2018
@mrocklin
Contributor

I've just sent @asford an invitation to join. Welcome @asford , we're lucky to have you!

@martindurant I've also just set it so that you have admin rights.
