New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Cache #760

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
3 participants
@jrleeman
Member

jrleeman commented Mar 1, 2018

Closes #190 by providing a caching system for the example data and for future implementation of county outlines. A few notes:

  • The cache_version.txt mechanism allows us to change the cache version if we replace old data (say new county outlines) and have the old cache get blown away after the new install. We don't plan to change old test data, but this gives us a bit of future proofing.

  • The system looks to see if you're on a git install (i.e have ../staticdata) and uses that as the data location if possible. After that it uses the OS cache path. Either of these are superseded by the user setting the METPY_CACHE_LOCATION environment variable.

  • There is a mechanism for the user to manually blow away their cache.

  • Data are pulled from the tagged version on git unless this is a dev version and 'dirty', then data are pulled from master.

  • The get_test_data function still works the same

  • Tests won't pass without a git checkout because the cache has no way to get files (like nids) that are tested with a glob like fixture. Not a problem because if you're running the full tests, you are on a git clone. It solves the problem of examples and static data like county outlines.

  • Question: Should clear_cache be disabled if on a git clone? It would blow away the nids files with no way to get them other than git checkout.

  • Question: As noted in line 131 - should we be catching that 404 and returning a more useful message, or is that good enough?

  • Question: Testing is going to be... interesting. Suggestions welcome.

@jrleeman jrleeman added this to the 0.8 milestone Mar 1, 2018

@jrleeman jrleeman requested a review from dopplershift as a code owner Mar 1, 2018

# Check if there is a cache version file. If not, this is a new cache, so create it.
cache_version_file_path = os.path.join(self.cache_home, 'cache_version.txt')
if not os.path.exists(cache_version_file_path ):

This comment has been minimized.

@stickler-ci

stickler-ci Mar 1, 2018

E202 whitespace before ')'

This comment has been minimized.

@jrleeman
# Check if there is a cache version file. If not, this is a new cache, so create it.
cache_version_file_path = os.path.join(self.cache_home, 'cache_version.txt')
if not os.path.exists(cache_version_file_path ):
with open(cache_version_file_path , 'w') as f:

This comment has been minimized.

@stickler-ci

stickler-ci Mar 1, 2018

E203 whitespace before ','

This comment has been minimized.

@jrleeman
@dopplershift

I don't think this review is exhaustive, but addressing these comments should move it in the right direction. In general, this is looking pretty good, just some organizational aspects that I think should make testing easier.

return cache_path
def get_os_cache_path(self):

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

Let's make this a @staticmethod

if not os.path.exists(cache_subpath):
os.makedirs(cache_subpath)
# TODO: Error handling if this isn't a valid remote file?

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

Probably should catch the error and raise a better error message.

class CacheData(object):
def __init__(self):
self.cache_home = self.get_cache_location()
self.cache_version = 1

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

I think it makes sense to put the version as a class attribute, so put cache_version = 1 above __init__.

This comment has been minimized.

@jrleeman
with open(cache_version_file_path, 'w') as f:
f.write('{}\n'.format(self.cache_version))
def get_cache_location(self):

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

I'm thinking making this a property (as shown above) makes sense. Also, let's pull out the makedirs here so that it just does the one thing.

@@ -39,4 +37,103 @@ def get_test_data(fname, as_file_obj=True):
return path
__all__ = ('get_test_data', 'is_string_like', 'iterable')
class CacheData(object):
def __init__(self):

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

How about this takes a path argument, defaulting to None. Then:

self.cache_home = self.default_location if path is None else path
if not os.path.exists(self.cache_home):
    os.makedirs(self.cache_home)

Taking it explicitly makes this a more re-usable class, and I think it will allow us to test better.

This comment has been minimized.

@jrleeman
github_data_path = 'https://github.com/Unidata/MetPy/raw/{}/testdata'.format(gh_tag)
cache_fname = os.path.join(self.cache_home, fname)
cache_subpath = os.path.split(cache_fname)[0]

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

os.path.dirname()

os.makedirs(cache_subpath)
# TODO: Error handling if this isn't a valid remote file?
urllib.request.urlretrieve(github_data_path + '/' + fname, cache_fname)

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

Does this work on Python 2? Also, I think we can mock out this function (or whatever) for testing.

return file_path
def download_data(self, fname):

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

Maybe factor out the URL generation into a separate function? It would allow us to test the logic separately.

This comment has been minimized.

@jrleeman
def clear_cache(self):
"""Remove all cached data for MetPy."""
shutil.rmtree(self.cache_home)

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

For testing, we can mock out shultil.rmtree and just check that it's called with an acceptable parameter.

# Check if there is a cache version file. If not, this is a new cache, so create it.
cache_version_file_path = os.path.join(self.cache_home, 'cache_version.txt')
if not os.path.exists(cache_version_file_path ):

This comment has been minimized.

@dopplershift

dopplershift Mar 27, 2018

Member

If the file doesn't exist, we should probably just invoke the code to re-initialize the cache (which should probably be its own method). You could accomplish that by initializing current_cache_version to 0, and only reading the current version if the file exists.

This comment has been minimized.

@jrleeman

@jrleeman jrleeman force-pushed the jrleeman:caching branch from d5ec9a1 to c355c0c Apr 10, 2018

Implements cache versioning, custom locations, and is MetPy version aware.
"""
self.cache_version = 1

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

F821 undefined name 'self'

self.initialize_cache()
def cache_outdated(self):

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

E303 too many blank lines (2)

"""
# If the cache file is gone or not equal to the cache version locally, it's outdated.
if not os.path.exists(cache_version_file_path):

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

F821 undefined name 'cache_version_file_path'

if not os.path.exists(cache_version_file_path):
return True
with open(cache_version_file_path, 'r') as f:

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

F821 undefined name 'cache_version_file_path'

def initialize_cache(self):
"""Create a new cache."""
self.cache_home = self.get_cache_location()
with open(cache_version_file_path, 'w') as f:

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

F821 undefined name 'cache_version_file_path'

urllib.request.urlretrieve(github_url, cache_fname)
def github_data_url(self, fname):

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

E303 too many blank lines (2)

@jrleeman jrleeman force-pushed the jrleeman:caching branch from c355c0c to 778d2d9 Apr 10, 2018

# SPDX-License-Identifier: BSD-3-Clause
r"""Tests the operation of MetPy's data caching code."""
from metpy.cbook import CacheData

This comment has been minimized.

@stickler-ci

stickler-ci Apr 10, 2018

F401 'metpy.cbook.CacheData' imported but unused

@dopplershift dopplershift self-assigned this Apr 11, 2018

@jrleeman jrleeman modified the milestones: 0.8, 0.9 May 4, 2018

@dopplershift dopplershift removed this from the 0.9 milestone Aug 17, 2018

@jrleeman

This comment has been minimized.

Member

jrleeman commented Aug 17, 2018

This will be superseded by #915 as much of this is now rolled into pooch and we'll go that direction.

@jrleeman jrleeman closed this Aug 17, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment