Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download_file with cache=True uses cache unconditionally #3961

Open
embray opened this issue Jul 14, 2015 · 8 comments
Open

download_file with cache=True uses cache unconditionally #3961

embray opened this issue Jul 14, 2015 · 8 comments

Comments

@embray
Copy link
Member

embray commented Jul 14, 2015

We should improve the caching mechanism for download_file so that it can still check the server (where possible) for updates to the remote file. This might also mean changing the options available for cache as to whether or not to use the cache unconditionally (or perhaps adding a separate option of cache_only where cache_only=True would force use of the cache where possible, and otherwise download the file).

However, when cache=True and cache_only=False (which I think should be the default), we should always check the remote server to see if the remote content has updated, by looking at a combination of the ETag and Last-Modified headers. However, it should be noted that not all servers implement these headers, and some implement them unreliably. This paper describes an evidence-based scheme for best determining if these headers (when present, or not present) indicate whether a file has changed. TL;DR, they found the best scheme (in terms of reliability of correctly indicating a change, and avoiding unnecessary downloads) is as follows. If any of these conditions are true, the file can be redownloaded:

  1. ETag is same, but timestamp is changed
  2. ETag is changed, but timestamp is same
  3. Both ETag and timestamp changed
  4. ETag is changed, but timestamp is missing
  5. ETag is missing, but timestamp is same (this one is surprising, but turns out to more often than not indicate a poorly configured server)
  6. ETag is missing, but timestamp is changed
  7. Both ETag and timestamp are missing

In all other cases the file can be assumed unchanged. We may wish to mix this up a bit, for example to prefer the cached copy if we can't reliably determine if the remote content has changed.

Another technique, they noted, for improving reliability is to keep track of the reliability of specific servers. For example, if a server reports the content changed (via a different Last-Modified or ETag), but the downloaded content ends up being identical to the content we already had cached, we could mark that server (via its FQDN) as unreliable. On the other hand, we can't as easily catch cases where we don't download some content because the server (wrongly) indicated that the content changed. In that case we can go ahead and use the cached copy, but if the user is definitely expecting that that file should have changed, they can switch to using cache=False. When using cache=False we should still compare downloaded data to the existing data (just the hashes, that is), to determine server reliability. So we would always keep an up-to-date flag in the cache as to which servers have reliable ETags, if nothing else.

Both the server reliability flags, and storing each URL's ETag and Last-Modified headers will necessitate a change to the download cache database. I think the new format should include a version number (where caches with a missing version can be assumed version 0--the current version). The new format could be something like:

{
    'version': 1,
    'servers': {
        'fully.qualified.domain.name': {
            'reliable-etag': True/False
            # This is a dict, to allow for future server metadata
        },
    'files': {
        'http://full.url/of/file': {
            'download-dir': '/path/to/saved/file',
            'hash': 'hash-string',
            'hash-algorithm': (maybe?),
            'etag': 'etag of cached file',
            'last-modified': 'last modified timestamp of cached file'
        }
    }
}

In all cases, there should also be better log messages about how the cache is being used--when a file is being loaded from the cached, or downloaded, etc.

This might also be a good opportunity for a little code refactoring. For example, it might be nice if the caching mechanism were, itself, implemented as a context manager of some kind, though I haven't worked out the details.

@mdboom
Copy link
Contributor

mdboom commented Jul 14, 2015

There are only two hard things about computer science: cache invalidation, naming things, and off-by-one errors.

SMALL ASIDE: This seems like such a general problem, that I wonder if we shouldn't try to "leverage the wider community" if possible. There are a few HTTP caching packages for Python already (though I didn't find any that implement anything like these heuristics in the small amount of looking I did). If we find something sufficient, I'd say we try to use it. If not, and we're refactoring anyway, I think this is something that may be best developed as an independent package to try to get a little more traction and maintainership help outside of astropy.

@Cadair
Copy link
Member

Cadair commented Jul 14, 2015

There would definitely be use for this outside of astropy's download functions in SunPy, mainly for more complex webservice (astroquery like) stuff we have.

@embray
Copy link
Member Author

embray commented Jul 14, 2015

There are only two hard things about computer science: cache invalidation, naming things, and off-by-one errors.

😆

@embray
Copy link
Member Author

embray commented Jul 14, 2015

I had a similar thought--that either something like this must be out there or if it's not (or too buried in other code or otherwise difficult to adapt) this seems like a useful side-project in its own right.

I think to make the changes I suggested above directly in Astropy wouldn't be too hard--an afternoon hack--and may be worth doing anyways if we can't find something better. But it may indeed be worth spinning off into its own module, or integrating into something existing like requests (which I believe astroquery already uses...)

@embray
Copy link
Member Author

embray commented Jul 14, 2015

I meant to add, this is related to, but different from #1162. That issues suggests adding an expiration time to cached files, which could also be included in this scheme. In fact it turns out that's what requests-cache already does (it does not, however, use any of the other heuristics I've outlined).

@embray
Copy link
Member Author

embray commented Jul 14, 2015

On the other hand CacheControl seems reasonably powerful. It allows pluggable storage backends (including a persistent store similar to what we use) and can use a combination of time-based and ETags to determine staleness. Not sure exactly what their heuristic is, but there is some mention in the docs that this is configurable. Might be worth playing around with...

@pllim
Copy link
Member

pllim commented Oct 28, 2019

@aarchiba , this will be fixed by #9182 with the new cache='update' option, right?

@aarchiba
Copy link
Contributor

aarchiba commented Jul 6, 2022

@aarchiba , this will be fixed by #9182 with the new cache='update' option, right?

Sort of. cache="update" will re-download the file, and isn't clever about checking for changes. But for things where we want it to stay in the cache but pick up new versions, it does achieve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants