Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to turn off local cache #10

Closed
pjbull opened this issue Aug 17, 2020 · 5 comments · Fixed by #314
Closed

Add ability to turn off local cache #10

pjbull opened this issue Aug 17, 2020 · 5 comments · Fixed by #314
Labels
enhancement New feature or request

Comments

@pjbull
Copy link
Member

pjbull commented Aug 17, 2020

Some use cases (especially if we enable streaming) may not want to keep much of a cache around at all.

We should have an option to turn off the cache entirely, which always will grab things from the server.

@pjbull pjbull changed the title Add ability to turn of local cache Add ability to turn off local cache Aug 22, 2020
@pjbull pjbull added the enhancement New feature or request label Oct 2, 2020
@DayalStrub
Copy link

Hey. Wondering what the status/thoughts are on this one. Seems like local cache is quite central to all the functionality currently, so might be tricky to implement.

Asking as we have instances where we have write only access to S3 and cache logic is an issue - another, probably a lot less common than streaming, example of where ability to turn off cache would be great.

@jayqi
Copy link
Member

jayqi commented Sep 7, 2021

Speculation: I wonder if perhaps force_overwrite_from_cloud could be a way to make this work. So the mechanism is: someone has force_overwrite_from_cloud=True, then those actions would skip trying to read anything to create or refresh the cache before writing. Maybe we could implement that without any significant redesign.

@DayalStrub
Copy link

... someone has force_overwrite_from_cloud=True, then those actions would skip trying to read anything to create or refresh the cache before writing. ...

@jayqi are you thinking that open could be rewritten so that force_overwrite_from_cloud=True leads it to skip any reading and updating cache/stats, etc.? This would not break the API, but would change the functionality. It would also sort out my use case, provided I stick to open. [Getting write_text and _bytes to work too might require allowing one to pass force_overwrite_from_cloud=True (optionally) at S3Client (or other equivalent) level and then make it trickle down to open, and the other relevant methods. This would maybe be needed for consistency, but would also break the API - though maybe with defaults that's OK.]

It's probably not a solution for streaming and cases when the cache itself is an issue rather than the current read required to write/open.

@pjbull
Copy link
Member Author

pjbull commented Dec 19, 2022

Good additional discussion in #233.

This is still under consideration. There is a set of functionality that depends on the cache, so those are worth enumerating. The cache gets created any time _refresh_cache gets called, which is the following scenarios:

  • __fspath__ - this is important for interoperability since some libraries use it to do file opening themselves; not sure that we can't not download the file and still make this work.
  • _dispatch_to_local_cache_path and therefore these downstream methods:
    • stat - not an issue since it is implemented by all official clients, but maybe not plugins
    • read_bytes - we could implement this in memory in the Client and not cache if a variable is set, or implement this via open like we do with write_bytes
    • read_text - we could implement this in memory in the Client and not cache if a variable is set, or implement this via open like we do with write_text
  • open - we could if a setting is set delete the local file when the buffer is closed; we'd need to do this in the read scenario in addition to the write scenario, which is already patched.

Simplest implementation seems to be:

  • Send read_* through open
  • in open, patch buffer.close to delete cache if setting is set on read or write
  • in __fspath__ raise an error that local paths cannot be referenced if the cache is off
  • I think we probably want this as a env var that is read when clients are instantiated so you can set it both in the environment and programatically

A few additional observations:

  • Currently it should be the case that when the client is garbarge collected the cache is cleaned if it is a temp dir. It is a little curious this doesn't solve more of the cases of large caches that people report.
    def __del__(self) -> None:
    # make sure temp is cleaned up if we created it
    if self._cache_tmp_dir is not None:
    self._cache_tmp_dir.cleanup()
  • We could add similar functionality to CloudPath.__del__ so they are removed from local cache on garbage collection; this may happen more frequently than the client getting garbage collected and give us more leeway
  • We should definitely add an explicit remove_from_local_cache method (name tbd) to CloudPath and a empty_local_cache method to Client so there is an official API to remove the cache files if people want, both for an individual file and for a
  • download_to and upload_from do not cache files (because presumably you already are tracking where they are on your local file system), but Add caching to download_to and upload_from #292 requests that they do

Other options:

  • Bound the cache size; if new file puts cache over the limit, delete the least recently accessed file in the cache. This has the downside of incurring some additional overhead we could hit often to do size checking and then looking for least recently accessed, then removing. (Separately tracked in Add a limit to the size of the local cache #11)
  • Delete from cache just when uploads finish (downside: files downloaded for read actions may stay in the cache)
  • Implement the read_* methods with streaming if no cache so we only have a local copy when uploading (could be combined with above)

@pjbull
Copy link
Member Author

pjbull commented Feb 16, 2023

@DayalStrub Changes to options for managing the cache were just released in the latest version. See updated methods here: https://cloudpathlib.drivendata.org/stable/caching/#clearing-the-file-cache

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants