# Intermediate Result Caching
This notebook demonstrates how to use the `CacheResult` utility to cache intermediate data processing results on disk.

    Copyright (C) 2021 Geoffrey Guy Messier

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

In [1]:
%load_ext autoreload
%autoreload 1

In [8]:
import numpy as np
import pandas as pd
import datetime, copy  # 'imp' dropped: unused here and removed in Python 3.12
import time
import os
import re
import matplotlib.pyplot as plt


from tqdm.auto import trange
from tqdm.notebook import tqdm
tqdm.pandas()  # Registers progress_apply() on pandas objects.

import sys
sys.path.insert(0, '../util/')

from data_cache import CacheResult

## Load Data Set

In [3]:
dataFileStr = '../data/MLBHospitalData.hd5'
dat = pd.read_hdf(dataFileStr,key='Data')

## Perform Analysis on the Data
This routine represents a stage of your data analysis that generates an intermediate result. It runs quickly on this data set, but imagine that it is a slow routine that you don't want to rerun every time you work on the notebook. That makes it a good candidate for caching.

In [5]:
# Summarize one individual's timeline: count each event type and compute tenure in days.
def timeline_summary(tbl,startDate='NoDate',endDate='NoDate'):
    # Restrict to the date window only when both bounds are supplied.
    if startDate != 'NoDate' and endDate != 'NoDate':
        tbl = tbl.loc[ (tbl.Date >= startDate) & (tbl.Date <= endDate) ]

    return pd.Series({
        'NumGoodTestResult': (tbl.Event == 'GoodTestResult').sum(),
        'NumStay': (tbl.Event == 'Stay').sum(),
        'NumBadTestResult': (tbl.Event == 'BadTestResult').sum(),
        'NumVitalsCrash': (tbl.Event == 'VitalsCrash').sum(),
        'Tenure': (tbl.Date.max()-tbl.Date.min()).days
    })

In [6]:
ftr = dat.groupby(level=0).progress_apply(timeline_summary)

  0%|          | 0/915 [00:00<?, ?it/s]

## This time with caching...
Thanks to Caleb John for providing this caching code. It is implemented as a Python decorator: a function that wraps another function to add behavior, in this case saving the result to disk and reloading it on later calls.
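If decorators are new to you, the following minimal sketch may help. It is not part of `data_cache`; the names `announce` and `add` are made up for illustration. It shows that `@announce` simply replaces `add` with a wrapper that runs extra code before calling the original function.

```python
import functools

def announce(func):
    """Hypothetical decorator: print a message, then call the wrapped function."""
    @functools.wraps(func)  # Preserve the wrapped function's name and docstring.
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@announce  # Equivalent to: add = announce(add)
def add(a, b):
    return a + b

result = add(2, 3)
```

A caching decorator follows the same pattern, except that the wrapper decides whether to call the original function at all.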

In [9]:
help(CacheResult)

Help on function CacheResult in module data_cache:

CacheResult(func, *args, path=None, filename=None, **kwargs)
    Wraps around a function that generates a datastructure and caches that
    datastructure to disk.  For subsequent calls, the datastructure is read
    from the cache file rather than being regenerated. Delete the cached file 
    to regenerate the data structure.
    
    NOTE: You will need to delete cache files every time you make a code change
    to the function that generates the datastructure.
    
    Separate cache files are generated for calls to the generator function with 
    different arguments.  The argument values are worked into the cache file name.
    
    It is good practice to incorporate a TQDM progress bar in your generator function.
    That way, you get visual feedback regarding whether or not you're generating new
    results or using cached results.
    
    Give an HDF file suffix (.h5, .hdf, .hd5) to save the cache as HDF, otherwise
    pickle
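To make the behavior described in the help text concrete, here is a simplified sketch of a disk-caching decorator. It is an assumption-laden stand-in, not the actual `CacheResult` implementation: the name `cache_to_disk`, the file naming scheme, and the pickle-only storage are all hypothetical, and unlike `CacheResult` it does not work argument values into the file name.

```python
import functools
import os
import pickle
import tempfile

def cache_to_disk(func):
    """Hypothetical stand-in for CacheResult: pickle the result on the first call."""
    @functools.wraps(func)
    def wrapper(*args, path=None, filename=None, **kwargs):
        cache_dir = path if path is not None else tempfile.gettempdir()
        # Assumed naming scheme: the function name plus '.pkl' unless a filename is given.
        cache_file = os.path.join(cache_dir, filename or f"{func.__name__}.pkl")
        if os.path.exists(cache_file):
            # Cache hit: load the stored result and skip the computation.
            with open(cache_file, "rb") as fh:
                return pickle.load(fh)
        # Cache miss: compute the result, then store it for later calls.
        result = func(*args, **kwargs)
        with open(cache_file, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper

calls = {"count": 0}  # Tracks how often the wrapped function really runs.

@cache_to_disk
def slow_square(x):
    calls["count"] += 1
    return x * x

demo_dir = tempfile.mkdtemp()           # Fresh, empty cache directory for the demo.
first = slow_square(4, path=demo_dir)   # Computes and writes the cache file.
second = slow_square(4, path=demo_dir)  # Reads the cache file; slow_square is not rerun.
```

Because this sketch ignores the arguments when naming the file, calling `slow_square(5, path=demo_dir)` would wrongly return the cached value for 4. Working the argument values into the file name, as the help text says `CacheResult` does, avoids that problem.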

In [10]:
@CacheResult
def cached_preprocessing_example(tbl):
    return tbl.groupby(level=0).progress_apply(timeline_summary)

In [11]:
# Directory where the cache files are stored; adjust this path for your machine.
cachePathStr = '/Users/gmessier/data/plwh/cache/'
ftr = cached_preprocessing_example(dat,path=cachePathStr)

  0%|          | 0/915 [00:00<?, ?it/s]

Note that the progress bar does not appear when you run the function a second time, because the result is read from the cache file instead of being recomputed.

In [12]:
ftr = cached_preprocessing_example(dat,path=cachePathStr)
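The help text for `CacheResult` notes that cache files must be deleted whenever the generator function changes. One way to do that from within the notebook is sketched below. The helper `clear_cache` is hypothetical, and so are the demo file names: the exact names `CacheResult` produces are not shown here, so check the contents of your cache directory before deleting anything.

```python
import os
import tempfile

def clear_cache(cache_dir, name_fragment):
    """Delete files in cache_dir whose names contain name_fragment.

    Returns the sorted list of file names that were removed.
    """
    removed = []
    for name in sorted(os.listdir(cache_dir)):
        full_path = os.path.join(cache_dir, name)
        if name_fragment in name and os.path.isfile(full_path):
            os.remove(full_path)
            removed.append(name)
    return removed

# Demonstration in a temporary directory with made-up cache file names.
demo_dir = tempfile.mkdtemp()
for name in ("cached_preprocessing_example_a.pkl", "other_result.pkl"):
    open(os.path.join(demo_dir, name), "wb").close()

removed = clear_cache(demo_dir, "cached_preprocessing_example")
remaining = os.listdir(demo_dir)
```

After clearing the matching files, the next call to the cached function regenerates the result, and the progress bar appears again.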