Per-directory metadata cache. #57

Merged
merged 27 commits on Feb 20, 2018

Commits on Jan 26, 2018

  1. Replace bucket listing with per-directory listing cache.

    Refactor gcsfs to list file contents via prefixed bucket listing,
    rather than cached exhaustive bucket listing. In progress, but provides
    basic interface compatibility for walk, glob, ls, info. Intended to
    support re-addition of metadata caching via the _list_objects interface
    to provide prefix-specific listing caches.
    
    Update `info` to retrieve object info via object get.
    
    Add a per-directory listing cache to GCSFS that caches object metadata
    under a given directory. Resolve listing requests via the cache,
    supporting walk/ls/glob/etc. Resolve `info` requests via the cache if the
    parent directory has been listed; otherwise request the object data
    directly. Update cache invalidation logic to operate on path prefixes,
    so that object writes invalidate their parent/sibling caches rather than
    the entire listing cache.
    asford committed Jan 26, 2018
    590cdae
  2. 8449a38
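
The per-directory cache described in the first commit above can be sketched roughly as follows. This is a minimal illustration, not the gcsfs implementation: the class name `DirListingCache` and the two callables `list_prefix` and `get_object` are hypothetical stand-ins for the remote listing and object-get requests, and raising `FileNotFoundError` for an object missing from a cached listing is an assumed behaviour.

```python
import posixpath


class DirListingCache:
    """Toy per-directory listing cache keyed by directory path."""

    def __init__(self, list_prefix, get_object):
        # Both callables are hypothetical stand-ins for remote requests.
        self._list_prefix = list_prefix  # directory -> iterable of object dicts
        self._get_object = get_object    # path -> object dict
        self._listings = {}              # directory -> cached listing

    def ls(self, directory):
        """Return the listing for a directory, fetching it at most once."""
        directory = directory.rstrip("/")
        if directory not in self._listings:
            self._listings[directory] = list(self._list_prefix(directory))
        return self._listings[directory]

    def info(self, path):
        """Answer from the parent listing if cached, else fetch the object."""
        parent = posixpath.dirname(path.rstrip("/"))
        listing = self._listings.get(parent)
        if listing is None:
            return self._get_object(path)
        for entry in listing:
            if entry["name"] == path:
                return entry
        # Assumption: a cached parent listing is treated as authoritative.
        raise FileNotFoundError(path)

    def invalidate(self, path):
        """Drop listings for the parent of path and for anything under path."""
        path = path.rstrip("/")
        parent = posixpath.dirname(path)
        for key in list(self._listings):
            if key in (parent, path) or key.startswith(path + "/"):
                del self._listings[key]
```

In this sketch a write to `bucket/a/w` would call `invalidate("bucket/a/w")`, which discards only the cached listing for `bucket/a` and leaves listings for unrelated directories such as `bucket/b` in place.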

Commits on Feb 14, 2018

  1. Add per-method debug tracing.

    Add decorator-based method tracing to `gcsfuse.GCSFS` and
    `core.GCSFileSystem` interface methods. Add `--verbose` command-line
    option to `gcsfuse` to support debug logging control.
    asford committed Feb 14, 2018
    2aa6755
  2. Bugfix prototype gcsfuse/per_dir_cache integration.

    Prototype `per_dir_cache` integration for gcsfuse. Minimal fixup to
    gcsfuse to support directory listing.
    asford committed Feb 14, 2018
    b296fee
  3. Fix GCSFS::read cache access.

    Fix error in GCSFS::read() cache key resolution.
    asford committed Feb 14, 2018
    360219d
  4. Fix flush-on-small-block-size errors.

    Resolve error when writing small partitions via dask.bag.to_textfiles.
    Error occurs when partition size is below minimum GCS multipart upload
    size.
    
    Close logic in dask.bytes.core calls flush(force=False), followed by
    flush(force=True) on GCSFile. Current logic initializes multipart upload
    on non-force flush and attempts to write a non-final block below the
    minimum GCS upload block size.
    
    Fixup logic to skip flush if buffer size is below minimum upload size on
    non-forced flush. This, incidentally, avoids initialization of multipart
    upload in cases where final file size will be below the minimum block
    size, which was resulting in duplicate uploads for small output
    partitions.
    
    Add tracing logic to GCSFile file operations for debugging. Update
    `_tracemethod` to perform optional traceback logging at the `DEBUG-1`
    log level.
    asford committed Feb 14, 2018
    30be23c
  5. Fixup core logging dispatch.

    asford committed Feb 14, 2018
    542c7dc
  6. Add gcsfuse install extra.

    asford committed Feb 14, 2018
    a55d274
  7. First-stab attempt at fixing test/interface errors.

    Updates `ls` to return a non-prefix-separated prefix search; this needs
    to be verified. Should this be glob-like? Fix error from dask.bytes when
    a read-only file is flushed. Fixup returning listing with "path"
    attribute.
    asford committed Feb 14, 2018
    7ad092c
  8. Retry on auth refresh error.

    Retry requests that fail due to `google.auth.exceptions.RefreshError`;
    partial resolution of fsspec#71.
    asford committed Feb 14, 2018
    2ccf856
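
The retry-on-refresh-error behaviour from the last commit above could look something like the sketch below. It is illustrative only: `retry_on` is a hypothetical helper, not a gcsfs function, and the local `RefreshError` class is a stand-in so the example runs without `google-auth` installed. In real code the exception would be `google.auth.exceptions.RefreshError`.

```python
import functools
import time


class RefreshError(Exception):
    """Local stand-in for google.auth.exceptions.RefreshError."""


def retry_on(exc_types, retries=3, delay=0.0):
    """Retry the wrapped call when it raises one of exc_types.

    Hypothetical helper: the final failure is re-raised unchanged.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except exc_types:
                    if attempt == retries - 1:
                        raise  # out of attempts: surface the error
                    time.sleep(delay)
        return wrapper
    return decorator
```

A request method decorated with `@retry_on((RefreshError,), retries=3)` would be attempted up to three times before the refresh error propagates to the caller.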

Commits on Feb 16, 2018

  1. Fix flush-on-small-block-size errors and lift block size constants.

    Resolve error when writing small partitions via dask.bag.to_textfiles
    when partition size is below minimum GCS multipart upload
    size.
    
    Close logic in dask.bytes.core calls flush(force=False), followed by
    flush(force=True) on GCSFile. Current logic initializes multipart upload
    on non-force flush and then attempts to write a non-final block
    below the minimum GCS upload block size.
    
    Fixup logic to skip flush if buffer size is below minimum upload size on
    non-forced flush and instead issue a warning. This, incidentally, avoids
    initialization of multipart upload in cases where final file size will
    be below the minimum block size, which was resulting in duplicate
    uploads for small output partitions.
    
    Update core.py to lift GCS block size limits into module level
    constants. Replace use of constants in core.py with symbolic names.
    asford committed Feb 16, 2018
    9a8537f
  2. Defer multipart if simple upload possible, relax read chunk size.

    From fsspec#73 review. Defer multipart upload if a simple upload may be
    possible at the specified block size on non-forced flush. Minor
    reorganization of `flush` logic to group error handling vs deferral.
    
    Relax block size restrictions on fetch, no longer aligning `range`-ed
    fetch requests to block boundaries.
    
    Fix minor logging error in `_fetch`.
    asford committed Feb 16, 2018
    05928e4
  3. a15272c
  4. d67798b
  5. Update cache semantics, non-expire and no refresh on missing object.

    Updates GCSFileSystem cache configuration. Set cache as non-expiring in
    default configuration, but continue to allow configurable cache timeout.
    Do not bypass cache on `_get_object` calls if object is not present in
    cache listing.
    
    Updates GCSFileSystem docstring to include description of caching
    semantics.
    asford committed Feb 16, 2018
    ba83a10
  6. Reduce verbosity of test tracing.

    Adds explicit flag to control stacktrace debugging for traced methods.
    Reduces log size on test failures.
    asford committed Feb 16, 2018
    a4989a0
  7. Cleanup walk and fix bucket info calls.

    From review, cleanup `walk` implementation. Fix pseudodir creation on
    bucket-level `info` call. Remove `norm_path` todo.
    asford committed Feb 16, 2018
    fc857a3
  8. update recordings

    Martin Durant committed Feb 16, 2018
    1f1f5e4
  9. Flush on open read-only file should be no-op, not error.

    `flush` on an open, but read-only, file should be a no-op rather than
    raising a ValueError; compare to builtin `open("read_only", "r").flush()`.
    Updates `flush` logic and adds test case covering flush behavior.
    asford committed Feb 16, 2018
    579d38f
  10. 3af77a1
  11. 83919c0
  12. VCR updates.

    asford committed Feb 16, 2018
    51c5c21
  13. 94c14d1
  14. VCR updates.

    asford committed Feb 16, 2018
    2129f8c
  15. 8a02558
  16. ac0b4af
  17. 1f95116
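
The flush behaviour described across the commits above (skip small non-forced flushes, defer multipart when a simple upload may suffice, treat flush on a read-only file as a no-op) can be sketched as below. This is one reading of the commit messages, not the actual `GCSFile` code: `SketchFile`, its `uploads` log, and the value of `MIN_BLOCK_SIZE` are all assumptions made for illustration, and no network calls are made.

```python
import io

# Illustrative minimum size for a non-final upload block (assumed value).
MIN_BLOCK_SIZE = 2 ** 18


class SketchFile:
    """Toy write buffer that records which kind of upload a flush would do."""

    def __init__(self, mode="wb", block_size=5 * 2 ** 20):
        self.mode = mode
        self.block_size = block_size
        self.buffer = io.BytesIO()
        self.multipart_started = False
        self.uploads = []  # entries: ("simple" | "part" | "final", nbytes)

    def write(self, data):
        if self.mode != "wb":
            raise ValueError("file not open for writing")
        return self.buffer.write(data)

    def flush(self, force=False):
        if self.mode != "wb":
            return  # read-only file: no-op, like builtin file objects
        size = self.buffer.tell()
        if not force:
            if size < MIN_BLOCK_SIZE:
                return  # too small for a non-final block: keep buffering
            if not self.multipart_started and size <= self.block_size:
                return  # a simple upload may still suffice: defer multipart
        if force and not self.multipart_started:
            self.uploads.append(("simple", size))
        else:
            self.multipart_started = True
            self.uploads.append(("final" if force else "part", size))
        self.buffer = io.BytesIO()

    def close(self):
        # Mirrors the call sequence the commit messages attribute to dask.
        self.flush(force=False)
        self.flush(force=True)
```

With this logic, a small partition written and closed produces a single simple upload instead of starting a multipart upload that cannot accept an undersized non-final block.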