download_file with cache=True uses cache unconditionally #3961
Comments
There are only two hard things about computer science: cache invalidation, naming things, and off-by-one errors. SMALL ASIDE: This seems like such a general problem that I wonder if we shouldn't try to "leverage the wider community" if possible. There are a few HTTP caching packages for Python already (though I didn't find any that implement anything like these heuristics in the small amount of looking I did). If we find something sufficient, I'd say we try to use it. If not, and we're refactoring anyway, I think this is something that may be best developed as an independent package, to try to get a little more traction and maintainership help outside of astropy.
There would definitely be a use for this outside of astropy's download functions in SunPy, mainly for the more complex web-service (astroquery-like) stuff we have.
😆
I had a similar thought--that either something like this must be out there, or if it's not (or it's too buried in other code or otherwise difficult to adapt), this seems like a useful side project in its own right. I think making the changes I suggested above directly in Astropy wouldn't be too hard--an afternoon hack--and may be worth doing anyway if we can't find something better. But it may indeed be worth spinning off into its own module, or integrating into something existing like
I meant to add, this is related to, but different from, #1162. That issue suggests adding an expiration time to cached files, which could also be included in this scheme. In fact it turns out that's what
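The expiration-time idea from #1162 would be a small addition on top of the header-based checks. As a minimal sketch (a hypothetical helper, not an existing astropy function), the check could be as simple as:

```python
import time


def cache_entry_expired(downloaded_at, max_age, now=None):
    """Return True if a cached file downloaded at ``downloaded_at``
    (a Unix timestamp) is older than ``max_age`` seconds.

    Hypothetical helper for illustration only; the names and the
    idea of a per-entry download timestamp are assumptions.
    """
    if now is None:
        now = time.time()
    return (now - downloaded_at) > max_age
```

An expired entry would then trigger a conditional request to the server rather than an unconditional reuse of the cached file.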
On the other hand, CacheControl seems reasonably powerful. It allows pluggable storage backends (including a persistent store similar to what we use) and can use a combination of time-based checks and ETags to determine staleness. Not sure exactly what their heuristic is, but there is some mention in the docs that it is configurable. Might be worth playing around with...
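For reference, wiring CacheControl up with a persistent on-disk store is only a few lines; this is its documented wrapper-around-`requests` usage (the cache directory name here is arbitrary):

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

# Wrap a requests session so responses are cached on disk and
# revalidated (via ETag/Last-Modified) according to HTTP semantics.
sess = CacheControl(requests.Session(), cache=FileCache(".web_cache"))

resp = sess.get("https://example.com/data.fits")
```

Whether its revalidation heuristics are close enough to the scheme described in this issue would need to be checked against its docs.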
We should improve the caching mechanism for `download_file` so that it can still check the server (where possible) for updates to the remote file. This might also mean changing the options available for `cache` as to whether or not to use the cache unconditionally (or perhaps adding a separate option, `cache_only`, where `cache_only=True` would force use of the cache where possible, and otherwise download the file).

However, when `cache=True` and `cache_only=False` (which I think should be the default), we should always check the remote server to see if the remote content has updated, by looking at a combination of the `ETag` and `Last-Modified` headers. However, it should be noted that not all servers implement these headers, and some implement them unreliably. This paper describes an evidence-based scheme for determining whether these headers (when present, or not present) indicate that a file has changed. TL;DR, they found that the best scheme (in terms of reliably indicating a change while avoiding unnecessary downloads) is as follows. If any of these conditions are true, the file can be redownloaded:

In all other cases the file can be assumed unchanged. We may wish to mix this up a bit, for example to prefer the cached copy if we can't reliably determine whether the remote content has changed.
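The general shape of such a staleness check might look like the following. This is a simplified sketch, not the paper's evidence-based heuristic (its exact conditions aren't reproduced here): prefer `ETag` when both sides have one, fall back to `Last-Modified`, and keep the cached copy when no validator is available.

```python
def should_redownload(cached_etag, cached_last_modified, etag, last_modified):
    """Decide whether to re-fetch a file given the validators stored in
    the cache and the ones the server just reported.

    Illustrative only; the real scheme would encode the paper's
    conditions and could consult per-server reliability flags.
    """
    if cached_etag and etag:
        # Strong-ish validator available on both sides.
        return cached_etag != etag
    if cached_last_modified and last_modified:
        # Fall back to the weaker timestamp validator.
        return cached_last_modified != last_modified
    # No usable validators: prefer the cached copy, as suggested above.
    return False
```

In practice the comparison would ride along with a conditional request (`If-None-Match` / `If-Modified-Since`), so an unchanged file costs only a 304 response rather than a full download.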
Another technique they noted for improving reliability is to keep track of the reliability of specific servers. For example, if a server reports that the content changed (via a different `Last-Modified` or `ETag`), but the downloaded content turns out to be identical to the content we already had cached, we could mark that server (via its FQDN) as unreliable. On the other hand, we can't as easily catch cases where we skip a download because the server (wrongly) indicated that the content had not changed. In that case we can go ahead and use the cached copy, but if the user is definitely expecting the file to have changed, they can switch to `cache=False`. When using `cache=False` we should still compare the downloaded data to the existing data (just the hashes, that is) to determine server reliability. So we would always keep an up-to-date flag in the cache as to which servers have reliable ETags, if nothing else.

Both the server-reliability flags and storing each URL's `ETag` and `Last-Modified` headers will necessitate a change to the download cache database. I think the new format should include a version number (where caches with a missing version can be assumed to be version 0--the current version). The new format could be something like:
In all cases, there should also be better log messages about how the cache is being used: when a file is loaded from the cache, when it is downloaded, and so on.
This might also be a good opportunity for a little code refactoring. For example, it might be nice if the caching mechanism were itself implemented as a context manager of some kind, though I haven't worked out the details.
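Since the details aren't worked out in the issue, here is just one hypothetical shape such a context manager could take: look up the URL on entry, let the caller download if needed, and record the result on exit (all names here are invented for illustration).

```python
from contextlib import contextmanager


@contextmanager
def cached_download(url, cache_db):
    """Sketch of a caching context manager; not an astropy API.

    Yields a dict with ``hit`` (was the URL in the cache?) and
    ``path`` (cached file path, or None). The caller fills in
    ``path`` after downloading, and the cache is updated on exit.
    """
    entry = cache_db.get(url)
    result = {"url": url, "hit": entry is not None, "path": entry}
    try:
        yield result
    finally:
        # A real implementation would also update the ETag,
        # Last-Modified, and server-reliability bookkeeping here.
        if result["path"] is not None:
            cache_db[url] = result["path"]
```

Usage would then look like `with cached_download(url, db) as r:` followed by a download only when `r["hit"]` is false, which keeps the lookup/update bookkeeping in one place.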