# Caching

This notebook illustrates the use of the CliMetLab cache and highlights some cache configuration settings.

The relevant CliMetLab documentation is located at https://climetlab.readthedocs.io/en/latest/guide/caching.html

Relevant CliMetLab settings are:
- cache-directory 
- maximum-cache-disk-usage 
- maximum-cache-size

In [1]:
import climetlab as cml
URL1 = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv"
URL2 = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.NI.list.v04r00.csv"

Loading data with ``cml.load_source("url", ...)`` stores it in the CliMetLab cache.

In [12]:
data = cml.load_source("url", URL1)
data.to_pandas()

  return pandas.read_csv(self.path, **pandas_read_csv_kwargs)


Unnamed: 0,SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,LAT,LON,...,BOM_GUST_PER,REUNION_GUST,REUNION_GUST_PER,USA_SEAHGT,USA_SEARAD_NE,USA_SEARAD_SE,USA_SEARAD_SW,USA_SEARAD_NW,STORM_SPEED,STORM_DIR
0,,Year,,,,,,,degrees_north,degrees_east,...,second,kts,second,ft,nmile,nmile,nmile,nmile,kts,degrees
1,1897005S10135,1897,1,SP,EA,NOT_NAMED,1897-01-04 12:00:00,NR,-10.1000,135.300,...,,,,,,,,,9,246
2,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 15:00:00,NR,-10.2755,134.902,...,,,,,,,,,8,246
3,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 18:00:00,NR,-10.4406,134.523,...,,,,,,,,,8,246
4,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 21:00:00,NR,-10.5853,134.182,...,,,,,,,,,7,247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75777,2022139S15169,2022,18,SP,MM,GINA,2022-05-21 18:00:00,NR,-20.6,171.1,...,,,,,,,,,4,55
75778,2022139S15169,2022,18,SP,MM,GINA,2022-05-21 21:00:00,NR,-20.5147,171.309,...,,,,,,,,,4,75
75779,2022139S15169,2022,18,SP,MM,GINA,2022-05-22 00:00:00,NR,-20.5,171.5,...,,,,,,,,,3,107
75780,2022139S15169,2022,18,SP,MM,GINA,2022-05-22 03:00:00,NR,-20.601,171.616,...,,,,,,,,,3,148


The next call to the same code does not download the data again.

In [17]:
data = cml.load_source("url", URL1)
data.to_pandas()

  return pandas.read_csv(self.path, **pandas_read_csv_kwargs)


Unnamed: 0,SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,LAT,LON,...,BOM_GUST_PER,REUNION_GUST,REUNION_GUST_PER,USA_SEAHGT,USA_SEARAD_NE,USA_SEARAD_SE,USA_SEARAD_SW,USA_SEARAD_NW,STORM_SPEED,STORM_DIR
0,,Year,,,,,,,degrees_north,degrees_east,...,second,kts,second,ft,nmile,nmile,nmile,nmile,kts,degrees
1,1897005S10135,1897,1,SP,EA,NOT_NAMED,1897-01-04 12:00:00,NR,-10.1000,135.300,...,,,,,,,,,9,246
2,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 15:00:00,NR,-10.2755,134.902,...,,,,,,,,,8,246
3,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 18:00:00,NR,-10.4406,134.523,...,,,,,,,,,8,246
4,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 21:00:00,NR,-10.5853,134.182,...,,,,,,,,,7,247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75777,2022139S15169,2022,18,SP,MM,GINA,2022-05-21 18:00:00,NR,-20.6,171.1,...,,,,,,,,,4,55
75778,2022139S15169,2022,18,SP,MM,GINA,2022-05-21 21:00:00,NR,-20.5147,171.309,...,,,,,,,,,4,75
75779,2022139S15169,2022,18,SP,MM,GINA,2022-05-22 00:00:00,NR,-20.5,171.5,...,,,,,,,,,3,107
75780,2022139S15169,2022,18,SP,MM,GINA,2022-05-22 03:00:00,NR,-20.601,171.616,...,,,,,,,,,3,148


The downloaded data is actually stored in a cache directory managed by CliMetLab, which tracks its content in a small database. Data is also unzipped within the cache directory if needed.

The cache can be observed and manipulated:
- Within Python, using ``cml.cache``
- With the command-line interface: ``climetlab cache`` and ``climetlab decache``
- Using the web interface GUI (in progress: summer of code project https://github.com/ecmwf-lab/climetlab-script-web)
- NOT by manipulating the cache files directly (same logic as a web browser cache).
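To make the idea of "a cache directory plus a small database" concrete, here is a minimal sketch using only the Python standard library. It is not CliMetLab's implementation: the `TinyCache` class, the table layout, the file naming scheme, and the `fake_fetch` stand-in for a real download are all assumptions made for illustration.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


class TinyCache:
    """Toy download cache: files on disk, bookkeeping in SQLite (hypothetical)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(str(self.directory / "cache.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "path TEXT PRIMARY KEY, url TEXT, size INTEGER, accesses INTEGER)"
        )

    def path_for(self, url):
        # Derive a stable file name from the URL (assumed naming scheme).
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"url-{digest}"

    def get(self, url, fetch):
        """Return the local path for url, calling fetch(url) only on a miss."""
        path = self.path_for(url)
        row = self.db.execute(
            "SELECT accesses FROM entries WHERE path = ?", (str(path),)
        ).fetchone()
        if row is not None and path.exists():
            # Cache hit: only the bookkeeping is updated.
            self.db.execute(
                "UPDATE entries SET accesses = accesses + 1 WHERE path = ?",
                (str(path),),
            )
        else:
            # Cache miss: fetch the content, store it, and register it.
            content = fetch(url)
            path.write_bytes(content)
            self.db.execute(
                "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, 1)",
                (str(path), url, len(content)),
            )
        self.db.commit()
        return path


calls = []


def fake_fetch(url):
    # Stand-in for a real HTTP download, so the sketch runs offline.
    calls.append(url)
    return b"SID,SEASON\n1897005S10135,1897\n"


cache = TinyCache(tempfile.mkdtemp())
first = cache.get("https://example.com/data.csv", fake_fetch)
second = cache.get("https://example.com/data.csv", fake_fetch)
accesses = cache.db.execute("SELECT accesses FROM entries").fetchone()[0]
print(first == second, len(calls), accesses)  # True 1 2
```

The second `get` returns the same path without calling `fake_fetch` again, which mirrors the behaviour seen above with ``cml.load_source``. Because the database and the files must stay in sync, editing the files by hand would break this bookkeeping.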

In [4]:
cml.cache

In [5]:
!climetlab cache

Cache directory:            /tmp/climetlab-mafp
Cache size:                 33.1 MiB
Number of entries in cache: 1
Most recently accessed:     33 seconds ago
Least recently accessed:    33 seconds ago
Youngest entry:             33 seconds ago
Oldest entry:               8 minutes ago


In [6]:
!climetlab cache --all

/tmp/climetlab-mafp/url-e29d836ce6a5ea6e24cbb4398dd11140a280205e0248a9ad468bde15e1727667.SP.list.v04r00.csv
  creation_date: 2022-11-28 14:09:06.098063
  last_access: 2022-11-28 14:16:56.333723
  accesses: 3
  type: file
  size: 34750568
  owner: url
  args: {'url': 'https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv', 'parts': None}
  expires: None
  extra: None
  flags: 0
  owner_data: {'date': 'Mon, 28 Nov 2022 14:09:07 GMT', 'server': 'Apache', 'strict-transport-security': 'max-age=31536000', 'last-modified': 'Sun, 27 Nov 2022 08:55:35 GMT', 'etag': '"2124068-5ee6feb0a7d0a"', 'accept-ranges': 'bytes', 'content-length': '34750568', 'content-type': 'text/csv', 'access-control-allow-origin': '*', 'access-control-allow-headers': 'X-Requested-With, Content-Type', 'connection': 'close'}
  pare
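As a quick consistency check, the `size` field of the entry listed by ``climetlab cache --all`` can be compared with the "Cache size" shown in the summary. The short sketch below does the unit conversion; `to_mib` is a helper introduced here for illustration and is not part of CliMetLab.

```python
def to_mib(size_in_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return size_in_bytes / (1024 * 1024)


entry_size = 34750568  # "size" field reported by `climetlab cache --all`
summary = f"{to_mib(entry_size):.1f} MiB"
print(summary)  # 33.1 MiB
```

With a single entry in the cache, the size of that entry matches the total reported by ``climetlab cache``.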

In [7]:
!climetlab cache --newer 1d

Entries newer than '2022-11-27 14:17:48'.
Cache directory:            /tmp/climetlab-mafp
Cache size:                 33.1 MiB
Number of entries in cache: 1
Most recently accessed:     51 seconds ago
Least recently accessed:    51 seconds ago
Youngest entry:             51 seconds ago
Oldest entry:               9 minutes ago


In [8]:
!climetlab cache --help

usage: cache [-h] [--json] [--all] [--path] [--sort KEY] [--reverse]
             [--match STRING] [--newer DATE] [--older DATE] [--accessed]
             [--larger SIZE] [--smaller SIZE]

Cache command to inspect the CliMetLab cache. The selection arguments are the
same as for the ``climetlab decache`` deletion command. Examples: climetlab
cache --all

optional arguments:
  -h, --help      show this help message and exit
  --json          produce a JSON output
  --all
  --path          print the path of cache directory and exit
  --sort KEY      sort output according to increasing values of KEY.
  --reverse       reverse the order of the sort, from larger to smaller
  --match STRING  TODO
  --newer DATE    TODO
  --older DATE    TODO
  --accessed      use the date of last access instead of the creation date
  --larger SIZE   consider only cache entries that are larger than SIZE bytes
  --smaller SIZE  consider only cache entries that are smaller than SIZE bytes

SIZE can be expressed 

In [9]:
# Delete cached data newer than 1d
# !climetlab decache --newer 1d

# Configuring CliMetLab cache settings

In [10]:
!climetlab settings cache-directory 
!climetlab settings maximum-cache-disk-usage 
!climetlab settings maximum-cache-size  

/tmp/climetlab-mafp
90
None


# Concurrent cache use

If the cache is full, older data is automatically deleted (with a log message).
When multiple scripts use the same cache, a file may therefore be deleted (because the cache is full) even while another script is still using it.
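The following sketch illustrates the hazard with a toy cleanup routine. It is not CliMetLab's eviction code: the `evict_until_fits` function, the entry dictionaries, and the choice of "least recently accessed first" as the deletion order are assumptions made for illustration only.

```python
import os
import tempfile
from pathlib import Path


def evict_until_fits(entries, max_size):
    """Delete least recently accessed files until the total size fits.

    entries: list of dicts with "path", "size" and "last_access" keys.
    Returns the list of paths that were deleted.
    """
    deleted = []
    total = sum(e["size"] for e in entries)
    for entry in sorted(entries, key=lambda e: e["last_access"]):
        if total <= max_size:
            break
        os.remove(entry["path"])
        total -= entry["size"]
        deleted.append(entry["path"])
    return deleted


# Build a tiny cache directory holding two 30-byte files.
workdir = Path(tempfile.mkdtemp())
entries = []
for name, size, last_access in [("a.csv", 30, 1), ("b.csv", 30, 2)]:
    path = workdir / name
    path.write_bytes(b"x" * size)
    entries.append({"path": str(path), "size": size, "last_access": last_access})

# "Script A" still relies on the path of a.csv, the least recently accessed file...
path_used_by_script_a = entries[0]["path"]

# ...while "script B" triggers a cleanup with a 50-byte limit.
deleted = evict_until_fits(entries, max_size=50)

print(deleted == [path_used_by_script_a])  # True
print(os.path.exists(path_used_by_script_a))  # False
```

The cleanup has no way of knowing that another process still holds the path, so the file disappears underneath it.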
 




In [11]:
import climetlab as cml
cml.settings.set("maximum-cache-size", "50M")

# Take home message

- End users do not need to manage the data. Data is downloaded on demand, with minimal duplication.

- The CliMetLab cache is a **cache**: it is managed by CliMetLab and automatically cleaned up.

- Multiple users should not share the same cache directory.

Let us reset the CliMetLab settings to their defaults, just in case.

In [10]:
cml.settings.reset()