
Add caching to download_to and upload_from #292

Open
mjkanji opened this issue Nov 23, 2022 · 3 comments


mjkanji commented Nov 23, 2022

Is there any reason why download_to and upload_from do not support the caching mechanism and, instead, download from the source/cloud directly again and again?

from pathlib import Path
from cloudpathlib import S3Client

# explicitly instantiate a client that always uses the local cache
LOCAL_CACHE = Path.home() / "cloudpathlib_cache"
client = S3Client(local_cache_dir=LOCAL_CACHE)
local_path = Path("image.jpg")

ladi = client.CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")
flood_image = ladi / "DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg"

flood_image.download_to(local_path)

LOCAL_CACHE.exists()  # False, the cache was not created
flood_image._local.exists()  # False, the image was not saved in the cache

# call .fspath to create a cached copy
flood_image.fspath

LOCAL_CACHE.exists()  # now True
flood_image._local.exists()  # now True

The only disadvantage of download_to using the cache is doubled storage use (the same file is saved both in the cache and at local_path). On the other hand, repeatedly downloading the same data when it hasn't changed also seems rather wasteful, especially for a large folder. Copying to/from the cache seems much better.

Maybe we can add an argument such as force_download or use_caching to let users enable caching, while keeping the current behaviour as the default (to avoid unexpectedly doubling storage use for those who don't want it).

Alternatively, a cleverer approach that sidesteps the 2x storage use is to make download_to and upload_from obey the same mechanism the cache uses when deciding whether to re-download a file: check whether something already exists at the given path, compare it with the cloud version to see if it's outdated, and if it isn't, don't download it again.
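As a rough sketch, that check might look something like this (purely illustrative; download_if_changed is a hypothetical helper, not a cloudpathlib API, and it only uses the public exists(), stat(), and download_to() methods):

from pathlib import Path
from cloudpathlib import CloudPath

# hypothetical helper sketching the proposed skip logic
def download_if_changed(cloud_path: CloudPath, local_path: Path) -> None:
    # skip the transfer when a local copy exists and is at least as new as the cloud one
    if local_path.exists() and local_path.stat().st_mtime >= cloud_path.stat().st_mtime:
        return
    cloud_path.download_to(local_path)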

Finally, what's the use case for this? I don't want to use open on a CloudPath object directly. For one, it's easier for the user to understand that the file is available locally and they can see it somewhere they define.

Basically, I'm proposing something like the AWS CLI's s3 sync command within cloudpathlib, since AWS for whatever reason won't add it natively to boto3 itself (someone asked more than 7 years ago now). cloudpathlib already has the nuts and bolts for this built in; it's just a matter of enabling the feature.
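Building on the sketch above, a directory-level version (again hypothetical; sync_down is not a real API) is just that per-file check in a loop:

# hypothetical directory-level "sync down" built on download_if_changed
def sync_down(cloud_dir: CloudPath, local_dir: Path) -> None:
    for cloud_file in cloud_dir.rglob("*"):
        if cloud_file.is_file():
            target = local_dir / cloud_file.relative_to(cloud_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            download_if_changed(cloud_file, target)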

@mjkanji mjkanji changed the title download_to and upload_from sidestep caching Add as s3 sync functionality to download_to and upload_from via caching Nov 23, 2022
@mjkanji mjkanji changed the title Add as s3 sync functionality to download_to and upload_from via caching Add aws s3 sync functionality to download_to and upload_from via caching Nov 23, 2022
@pjbull pjbull changed the title Add aws s3 sync functionality to download_to and upload_from via caching Add caching to download_to and upload_from Nov 23, 2022

pjbull commented Nov 23, 2022

Thanks for the comments, @mjkanji. Adding some sort of syncing is discussed in #14 and #134, and I'm happy to continue that discussion in those issues. To me, whether or not we support that API is a separate issue from the caching one you discuss here.

For this one, I think we can keep it about changing the download_to and upload_from APIs and implementation to support caching.

On that point, I'm not yet convinced it's worth implementing on the library side. The intent behind download_to and upload_from is to explicitly force the CloudPath to talk to the backend server, so that if you want to manage communication between the cloud and local machines yourself, you can. Since in your example the code creates local_path, you know a file is at that local path. If you don't want to re-download a file, your code can explicitly check .exists() and .stat().st_mtime to get the behavior you want. It's not clear to me this belongs in these explicit library functions rather than in user code: if you're already being explicit about downloading/uploading, why not also be explicit about when you decide to do so?
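For example, a guard like this in user code (reusing local_path and flood_image from your snippet) gets the skip-if-fresh behavior:

# only re-download when there is no local copy or the cloud copy is newer
if not local_path.exists() or local_path.stat().st_mtime < flood_image.stat().st_mtime:
    flood_image.download_to(local_path)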

Finally, what's the use case for this? I don't want to use open on a CloudPath object directly. For one, it's easier for the user to understand that the file is available locally and they can see it somewhere they define.

Just curious, why not? A core reason for this library is so you can call .open on a CloudPath just like on a Path.


mjkanji commented Nov 23, 2022

Hi @pjbull! Thank you for your reply!

Just curious, why not? A core reason for this library is so you can call .open on a CloudPath just like on a Path.

As for why I didn't want to use open directly: for one, it's just easier to work with local files sometimes. I'd rather download and then use the local copy; it's more 'tangible'. It's one of those 'physical books are better than e-books' kinds of preferences.

I'm also working with large sets of partitioned Parquet/Arrow datasets. Arrow allows many clever optimizations (such as reading only the parts of a file that are relevant to a filter), and I wasn't sure how a CloudPath object would work with that setup. So I was really more interested in using cloudpathlib as a boto3 replacement (with a nicer interface) for simple downloads and uploads, and then letting Arrow do its usual magic once the files are stored locally, without having to re-download the data every time a new session starts.

(I've since confirmed that I can use a CloudPath wrapped in open to read the files correctly and the local copy in the cache will then allow me to use all the regular Arrow/Parquet magic, as expected, so perhaps I just needed to get a better understanding of the library.)
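Concretely, this sort of thing works (the bucket path here is hypothetical; opening the CloudPath populates the client's cache, so pyarrow ends up reading a local copy):

import pyarrow.parquet as pq
from cloudpathlib import CloudPath

# hypothetical path to one file of a partitioned Parquet dataset
parquet_file = CloudPath("s3://my-bucket/dataset/part-0000.parquet")

# open() downloads to the local cache first, so repeat reads in the
# same session come from the cached copy
with parquet_file.open("rb") as f:
    table = pq.read_table(f)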

I also see that there has been some discussion around dealing with large cloud datasets in #264 and #96, and I'm just now starting to learn about the dizzying number of options for, and minutiae of, file systems!

Coming back to caching for download_to, I really don't see the difference between caching and syncing. I would define a sync as a comparison of two PathLike objects with a conditional transfer of data based on:

  1. existence (does the item exist on both sides?) and
  2. freshness (was the item more recently updated on one end?)

(Please feel free to correct my definition or conception of what a sync constitutes.)

Then, to my mind, caching as implemented by this library is just a specific form of sync, one that is arbitrarily triggered only when you open a CloudPath. So I don't really see them as separate concerns (which may well be an uninformed opinion).

With all that said, why implement it natively (via an argument that lets you control this behavior)? Because, for one, it's more intuitive. When I read the caching page in your docs, I immediately thought: this is the aws s3 sync that boto3 is missing. So I was actually quite surprised that when I downloaded files with a client specifying an explicit cache, the cache folder was simply not created. (I spent a bit of time wondering whether I was doing something wrong, re-reading the references, etc.) At the very least, the Caching page in the docs and the reference for CloudPath.download_to() should be updated to clarify this, because if a library says it supports caching but doesn't implement it on the most intuitive method for fetching files (download), that's just confusing.

The same goes for why I don't explicitly check .exists() and .stat().st_mtime in my own code: it's nicer to work with something that has this built in. The same logic applies to caching as it's currently implemented (I could write boilerplate to do those checks before passing the path to open myself, but it's just nicer that cloudpathlib does it for me).


pjbull commented Dec 19, 2022

Syncing is a separate issue (and there is already an issue for it: #134). For example, we never delete files on sync, and we don't check for and propagate metadata changes.

We'd definitely accept an update to the docs to note that download_to and upload_from do not cache the files (since presumably, you know where your local copies are if you are calling these methods).

If we did consider an implementation here, I think, as you point out, it would be a copy_to_cache kwarg on those methods that additionally writes the file to the cache.
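To make that concrete, usage might look like the following (copy_to_cache is hypothetical and not part of the current API):

# hypothetical kwarg: also write the downloaded file into the client's cache
flood_image.download_to(local_path, copy_to_cache=True)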

I'm not particularly inclined to add that complexity, because I think keeping the caching as isolated as possible is important (see #10). Additionally, nearly everything a user wants to do from an upload/download perspective can already be done efficiently with the supported APIs, so any pitfalls there may be documentation issues.
