From d049b75b157473994471b6c7547d096794a2b7b6 Mon Sep 17 00:00:00 2001
From: Jeffrey Milloy
Date: Tue, 18 May 2021 10:25:32 -0400
Subject: [PATCH] DOC: flesh out cache documentation.

* Demonstrate output caching at the top. This is the primary usage of the cache.
* Adds Disk Cache section.
* Adds S3 Cache section.
* Adds Advanced Usage section with caching other objects, cached_property, and cache expiration.
---
 doc/source/cache.md | 192 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 146 insertions(+), 46 deletions(-)

diff --git a/doc/source/cache.md b/doc/source/cache.md
index a9977ef70..c781eb4cd 100644
--- a/doc/source/cache.md
+++ b/doc/source/cache.md
@@ -2,71 +2,171 @@

 This document describes the caching methodology used in PODPAC, and how to control it. PODPAC uses a central cache shared by all nodes. Retrieval from the cache is based on the node's definition (`node.json`), the coordinates, and a key.

-Each node has a **Cache Control** (`cache_ctrl`) defined by default, and the **Cache Control** may contain multiple **Cache Stores** (.e.g 'disk', 'ram'). A **Cache Store** may also have a specific **Cache Container**.
+Each node has a **Cache Control** (`cache_ctrl`) defined by default, and the **Cache Control** may contain multiple **Cache Stores** (e.g. 'disk', 'ram').

-## Default Cache
+## Caching Outputs

-By default, every node caches their outputs to memory (RAM). These settings can be controlled using `podpac.settings`.
+By default, PODPAC caches evaluated node outputs to memory (RAM). When a node is evaluated with the same coordinates, the output is retrieved from the cache.

-**Settings and their Defaults:**
+The following example demonstrates that the output was retrieved from the cache on the second evaluation:

-* DEFAULT_CACHE : list
-  * Defines a default list of cache stores in priority order. Defaults to `['ram']`. Can include ['ram', 'disk', 's3'].
-  * This can be over-written on an individual node by specifying `cache_ctrl` when creating the node. E.g. `node = podpac.Node(cache_ctrl=['disk'])`
-  * Authors of nodes may require certain caches always be available. For example, the `podpac.datalib.smap.SMAPDateFolder` node always requires a 'disk' cache, and will add it.
-* DISK_CACHE_DIR : str
-  * Subdirectory to use for the disk cache. Defaults to ``'cache'`` in the podpac root directory.
-* S3_CACHE_DIR : str
-  * Subdirectory to use for S3 cache (within the specified S3 bucket). Defaults to ``'cache'``.
-* CACHE_OUTPUT_DEFAULT : bool
-  * Automatically cache node outputs to the default cache store(s). Outputs for nodes with `cache_output=False` will not be cached. Defaults to ``True``.
-* RAM_CACHE_ENABLED: bool
-  * Enable caching to RAM. Note that if disabled, some nodes may fail. Defaults to ``True``.
-* DISK_CACHE_ENABLED: bool
-  * Enable caching to disk. Note that if disabled, some nodes may fail. Defaults to ``True``.
-* S3_CACHE_ENABLED: bool
-  * Enable caching to S3. Note that if disabled, some nodes may fail. Defaults to ``True``.
+```python
+[.] import podpac
+[.] import podpac.datalib
+[.] coords = podpac.Coordinates([podpac.clinspace(40, 39, 16),
+                                 podpac.clinspace(-100, -90, 16),
+                                 '2015-01-01T00'], ['lat', 'lon', 'time'])
+[.] smap = podpac.datalib.smap.SMAP()
+[.] o = smap.eval(coords)
+[.] smap._from_cache
+False
+[.] o = smap.eval(coords)
+[.] smap._from_cache
+True
+```

-## Clearing Cache
-To globally clear cache use:
+Importantly, different instances of the same node share a cache. The following example demonstrates that a different instance of a node will retrieve output from the cache as well:
 ```python
-podpac.utils.clear_cache(mode)
+[.] smap2 = podpac.datalib.smap.SMAP()
+[.] o = smap2.eval(coords)
+[.] smap2._from_cache
+True
 ```

-where `mode` can be 'ram', 'disk', or 's3'. This will clean the entire cache store.
-To clear cache for an individual node:
+### Configure Output Caching

-## Examples
+Automatic caching of outputs can be controlled globally and in individual nodes. For example, to globally disable caching outputs:

-To globally disable automatic caching of outputs use:
 ```python
-import podpac
 podpac.settings["CACHE_OUTPUT_DEFAULT"] = False
-podpac.settings.save()
 ```

-To overwrite this behavior for a particular node (i.e. making sure outputs are cached) use:
+To disable output caching for a particular node:
+
 ```python
-smap = podpac.datalib.smap.SMAP(cache_output=True)
+smap = podpac.datalib.smap.SMAP(cache_output=False)
 ```

-Different instances of the same node share a cache. For example:
+## Disk Cache
+
+In addition to caching to memory (RAM), PODPAC provides a disk cache that persists across processes. For example, when the disk cache is used, a script that evaluates a node can be run multiple times and will retrieve node outputs from the disk cache on subsequent runs.
+
+Each node has a `cache_ctrl` that specifies which cache stores to use, in priority order. For example, to use the RAM cache and the disk cache:
+
 ```python
-[.] import podpac
-[.] import podpac.datalib
-[.] coords = podpac.Coordinates([podpac.clinspace(40, 39, 16),
-                                 podpac.clinspace(-100, -90, 16),
-                                 '2015-01-01T00', ['lat', 'lon', 'time']])
-[.] smap1 = podpac.datalib.smap.SMAP()
-[.] o = smap1.eval(coords)
-[.] smap1._from_cache
-False
-[.] del smap1
-[.] smap2 = podpac.datalib.smap.SMAP()
-[.] o = smap2.eval(coords)
-[.] smap2._from_cache
-True
+smap = podpac.datalib.smap.SMAP(cache_ctrl=['ram', 'disk'])
+```
+
+The default cache control can be set globally in the settings:
+
+```python
+podpac.settings["DEFAULT_CACHE"] = ['ram', 'disk']
+```
+
+### Configure Disk Caching
+
+The disk cache directory can be set using the `DISK_CACHE_DIR` setting.
+
+## S3 Cache
+
+PODPAC also provides caching to the cloud using AWS S3. Configure the S3 bucket and cache subdirectory using the `S3_BUCKET_NAME` and `S3_CACHE_DIR` settings.
+
+## Clearing the Cache
+
+To clear the entire cache use:
+
+```python
+podpac.utils.clear_cache()
+```
+
+To clear the cache for a particular node:
+
+```python
+smap.clear_cache()
+```
+
+You can also clear a particular cache store, for example, to clear the disk cache while leaving the RAM cache in place:
+
+```python
+# node
+smap.clear_cache('disk')
+
+# entire cache
+podpac.utils.clear_cache('disk')
+```
+
+## Cache Limits
+
+PODPAC provides a size limit (in bytes) for each cache store in the podpac settings:
+
+```
+RAM_CACHE_MAX_BYTES
+DISK_CACHE_MAX_BYTES
+S3_CACHE_MAX_BYTES
+```
+
+When a cache store is full, new entries are not cached.
+
+
+## Advanced Usage
+
+### Caching Other Objects
+
+Nodes can cache other data and objects using a cache key and, optionally, coordinates. The following example caches and retrieves data using the key `my_data`.
+
+```python
+[.] smap.put_cache(10, 'my_data')
+[.] smap.get_cache('my_data')
+10
+```
+
+In general, the node cache can be managed using the `Node.put_cache`, `Node.get_cache`, `Node.has_cache`, and `Node.rem_cache` methods.
+
+
+### Cache Expiration
+
+Cached entries can optionally have an expiration date, after which the entry is considered invalid and automatically removed.
+
+To specify an expiration date:
+
+```python
+# specific datetime
+node.put_cache(10, 'my_data', expires='2021-01-01T12:00:00')
+
+# timedelta, in 12 hours
+node.put_cache(10, 'my_data', expires='12,h')
+```
+
+### Cached Node Properties
+
+PODPAC provides a `cached_property` decorator that enhances the built-in `property` decorator.
+
+By default, the `cached_property` stores the value as a private attribute in the object. To use the PODPAC cache so that the property persists across objects or processes according to the node's `cache_ctrl`:
+
+```python
+class MyNode(podpac.Node):
+    @podpac.cached_property(use_cache_ctrl=True)
+    def my_cached_property(self):
+        return 10
 ```
+
+### Updating Existing Entries
+
+By default, existing cache entries are overwritten with new data.
+
+```python
+[.] smap.put_cache(10, 'my_data')
+[.] smap.put_cache(20, 'my_data')
+[.] smap.get_cache('my_data')
+20
+```
+
+To prevent overwriting existing cache entries, use `overwrite=False`:
+
+```python
+[.] smap.put_cache(100, 'my_data', overwrite=False)
+podpac.core.node.NodeException: Cached data already exists for key 'my_data' and coordinates None
+```
\ No newline at end of file