download_file with cache=True uses cache unconditionally #3961

embray · 2015-07-14T17:01:31Z

We should improve the caching mechanism for download_file so that it can still check the server (where possible) for updates to the remote file. This might also mean changing the options available for cache as to whether or not to use the cache unconditionally (or perhaps adding a separate option of cache_only where cache_only=True would force use of the cache where possible, and otherwise download the file).
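As a rough illustration of what "checking the server" could look like at the HTTP level, here is a minimal sketch of a conditional request built with the standard library. The helper name `build_conditional_request` and the example header values are hypothetical and not part of astropy. It only constructs the request; it does not send it.

```python
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request asking the server to answer 304 if nothing changed.

    `etag` and `last_modified` are the validators stored alongside the
    cached copy (hypothetical cache fields); either may be None.
    """
    headers = {}
    if etag is not None:
        # Server compares this against the current ETag of the resource.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # Server compares this against the resource's modification time.
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


req = build_conditional_request(
    "http://full.url/of/file",
    etag='"abc123"',
    last_modified="Tue, 14 Jul 2015 17:01:31 GMT",
)
```

When such a request is sent with `urllib.request.urlopen`, a "304 Not Modified" reply surfaces as an `urllib.error.HTTPError` with `code == 304`, which the caller would treat as "use the cached copy".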

However, when cache=True and cache_only=False (which I think should be the default), we should always check the remote server to see if the remote content has been updated, by looking at a combination of the ETag and Last-Modified headers. Note, though, that not all servers implement these headers, and some implement them unreliably. This paper describes an evidence-based scheme for deciding whether these headers (present or absent) indicate that a file has changed. TL;DR, they found the best scheme (in terms of reliably indicating a change while avoiding unnecessary downloads) is as follows. If any of these conditions is true, the file should be redownloaded:

ETag is same, but timestamp is changed
ETag is changed, but timestamp is same
Both ETag and timestamp changed
ETag is changed, but timestamp is missing
ETag is missing, but timestamp is same (this one is surprising, but turns out to more often than not indicate a poorly configured server)
ETag is missing, but timestamp is changed
Both ETag and timestamp are missing

In all other cases the file can be assumed unchanged. We may wish to mix this up a bit, for example to prefer the cached copy if we can't reliably determine if the remote content has changed.
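The seven conditions above can be sketched as a small lookup table. This is only an illustration: the function names are hypothetical, and it assumes that "missing" means the server did not send the header, and that a header the server sends but the cache never recorded counts as "changed".

```python
def header_state(cached, remote):
    """Classify one header as 'missing', 'same', or 'changed'."""
    if remote is None:
        return "missing"  # server did not send this header
    return "same" if remote == cached else "changed"


# (ETag state, timestamp state) pairs for which the file is redownloaded,
# in the same order as the list above.
REDOWNLOAD = {
    ("same", "changed"),
    ("changed", "same"),
    ("changed", "changed"),
    ("changed", "missing"),
    ("missing", "same"),
    ("missing", "changed"),
    ("missing", "missing"),
}


def should_redownload(cached_etag, cached_last_modified,
                      remote_etag, remote_last_modified):
    """Return True if the header combination calls for a fresh download."""
    state = (
        header_state(cached_etag, remote_etag),
        header_state(cached_last_modified, remote_last_modified),
    )
    return state in REDOWNLOAD
```

Written this way, the only combinations treated as unchanged are an identical ETag together with a timestamp that is either identical or missing.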

Another technique they noted for improving reliability is to keep track of the reliability of specific servers. For example, if a server reports that the content changed (via a different Last-Modified or ETag), but the downloaded content ends up being identical to the content we already had cached, we could mark that server (via its FQDN) as unreliable. On the other hand, we can't as easily catch cases where we don't download some content because the server (wrongly) indicated that the content was unchanged. In that case we go ahead and use the cached copy, but if the user definitely expects that the file should have changed, they can switch to using cache=False. When using cache=False we should still compare the downloaded data to the existing data (just the hashes, that is) to determine server reliability. So we would always keep an up-to-date flag in the cache as to which servers have reliable ETags, if nothing else.
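A minimal sketch of that bookkeeping, under some assumptions: the function name is hypothetical, SHA-256 stands in for whatever hash the cache actually uses, and `servers` is a plain dict keyed by host name. It flags a server whenever its headers disagree with what the content hashes show.

```python
import hashlib
from urllib.parse import urlsplit


def record_etag_reliability(servers, url, server_reported_change,
                            cached_hash, new_content):
    """Compare downloaded bytes with the cached hash and update the flag.

    Returns the hash of the newly downloaded content so the caller can
    store it with the cached file.
    """
    host = urlsplit(url).hostname
    new_hash = hashlib.sha256(new_content).hexdigest()
    content_changed = new_hash != cached_hash

    # Servers are assumed reliable until a mismatch is observed.
    entry = servers.setdefault(host, {"reliable-etag": True})
    if server_reported_change != content_changed:
        # The headers said one thing and the bytes said another.
        entry["reliable-etag"] = False
    return new_hash
```

With cache=False the file is downloaded regardless of the headers, so this comparison can also catch the "wrongly reported unchanged" case that a normal cached run would miss.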

Both the server reliability flags, and storing each URL's ETag and Last-Modified headers will necessitate a change to the download cache database. I think the new format should include a version number (where caches with a missing version can be assumed version 0--the current version). The new format could be something like:

{
    'version': 1,
    'servers': {
        'fully.qualified.domain.name': {
            # This is a dict, to allow for future server metadata
            'reliable-etag': True,  # or False
        },
    },
    'files': {
        'http://full.url/of/file': {
            'download-dir': '/path/to/saved/file',
            'hash': 'hash-string',
            'hash-algorithm': 'name-of-algorithm',  # (maybe?)
            'etag': 'etag of cached file',
            'last-modified': 'last modified timestamp of cached file',
        },
    },
}
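To show how the version number could be used, here is a hedged sketch of an upgrade step. It assumes (this is not stated above) that a version-0 cache is a flat mapping from URL to local path; the function name `upgrade_cache` is hypothetical.

```python
def upgrade_cache(db):
    """Return a version-1 cache dict, converting a version-0 one if needed."""
    # A cache with no 'version' key is treated as version 0.
    if db.get("version", 0) >= 1:
        return db

    # Assumption: version 0 maps each URL directly to its local path.
    files = {
        url: {
            "download-dir": path,
            # Unknown until the file is next hashed or re-requested.
            "hash": None,
            "etag": None,
            "last-modified": None,
        }
        for url, path in db.items()
    }
    return {"version": 1, "servers": {}, "files": files}
```

Entries upgraded this way have no stored ETag or timestamp, so the first check against the server after an upgrade would have nothing to compare with.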

In all cases, there should also be better log messages about how the cache is being used--when a file is being loaded from the cache, or downloaded, etc.

This might also be a good opportunity for a little code refactoring. For example, it might be nice if the caching mechanism were, itself, implemented as a context manager of some kind, though I haven't worked out the details.
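One possible shape for such a context manager, purely as a sketch since the details are left open above: the name `cached_file`, the dict-based cache, and the `fetch` callable are all hypothetical. It also shows the kind of log messages mentioned earlier.

```python
import logging
import os
import tempfile
from contextlib import contextmanager

log = logging.getLogger("download_cache_sketch")


@contextmanager
def cached_file(url, cache, fetch):
    """Yield a local path for `url`, downloading only on a cache miss.

    `cache` maps URL -> local path; `fetch(url)` downloads and returns a path.
    """
    path = cache.get(url)
    if path is not None and os.path.exists(path):
        log.info("Loading %s from cache at %s", url, path)
    else:
        log.info("Downloading %s (not in cache)", url)
        path = fetch(url)
    yield path
    # Reached only if the with-body finished without raising.
    cache[url] = path


# Usage with a stand-in fetcher that creates an empty temporary file.
calls = []


def fake_fetch(url):
    calls.append(url)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    return path


cache = {}
for _ in range(2):
    with cached_file("http://full.url/of/file", cache, fake_fetch) as path:
        pass  # read the file here

os.remove(cache["http://full.url/of/file"])  # clean up the temporary file
```

Recording the cache entry only after the body succeeds is a design choice of this sketch: a download that the caller fails to process is not remembered.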

mdboom · 2015-07-14T18:39:48Z

There are only two hard things about computer science: cache invalidation, naming things, and off-by-one errors.

SMALL ASIDE: This seems like such a general problem, that I wonder if we shouldn't try to "leverage the wider community" if possible. There are a few HTTP caching packages for Python already (though I didn't find any that implement anything like these heuristics in the small amount of looking I did). If we find something sufficient, I'd say we try to use it. If not, and we're refactoring anyway, I think this is something that may be best developed as an independent package to try to get a little more traction and maintainership help outside of astropy.

Cadair · 2015-07-14T18:42:47Z

There would definitely be use for this outside of astropy's download functions in SunPy, mainly for the more complex web-service (astroquery-like) stuff we have.

embray · 2015-07-14T18:55:26Z

> There are only two hard things about computer science: cache invalidation, naming things, and off-by-one errors.

😆

embray · 2015-07-14T18:58:18Z

I had a similar thought--that either something like this must be out there or if it's not (or too buried in other code or otherwise difficult to adapt) this seems like a useful side-project in its own right.

I think to make the changes I suggested above directly in Astropy wouldn't be too hard--an afternoon hack--and may be worth doing anyways if we can't find something better. But it may indeed be worth spinning off into its own module, or integrating into something existing like requests (which I believe astroquery already uses...)

embray · 2015-07-14T19:13:29Z

I meant to add, this is related to, but different from #1162. That issue suggests adding an expiration time to cached files, which could also be included in this scheme. In fact it turns out that's what requests-cache already does (it does not, however, use any of the other heuristics I've outlined).

embray · 2015-07-14T19:17:30Z

On the other hand, CacheControl seems reasonably powerful. It allows pluggable storage backends (including a persistent store similar to what we use) and can use a combination of time-based expiry and ETags to determine staleness. Not sure exactly what their heuristic is, but there is some mention in the docs that this is configurable. Might be worth playing around with...

pllim · 2019-10-28T20:20:25Z

@aarchiba , this will be fixed by #9182 with the new cache='update' option, right?

aarchiba · 2022-07-06T17:23:44Z

> @aarchiba , this will be fixed by #9182 with the new cache='update' option, right?

Sort of. cache="update" will re-download the file, and isn't clever about checking for changes. But for things where we want it to stay in the cache but pick up new versions, it does achieve that.

embray added utils Affects-release Package-novice Effort-medium Enhancement labels Jul 14, 2015

embray mentioned this issue Aug 26, 2015

Adding astropy.utils.data.is_url_in_cache method and test #4095

Closed

embray mentioned this issue Dec 21, 2015

Automatic download of IERS A table for times not covered by IERS B #3275

Closed

embray mentioned this issue Jan 6, 2016

Implement auto-downloading of IERS-A data #4436

Merged

pllim mentioned this issue Sep 30, 2016

limit cache size? #5368

Open

pllim added Package-expert and removed Package-novice labels Dec 21, 2017

mhvk mentioned this issue Sep 13, 2019

Possible improvements to how IERS is handled #9227

Closed

aarchiba mentioned this issue Sep 17, 2019

Cache preload #9182

Merged

pllim removed the Affects-release label Mar 30, 2020
