# Caching
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

A Dataflow can be cached to disk during a local run by calling `df_cached = df.cache(directory_path)` on the Dataflow object `df`. The call runs all steps in `df` and saves the resulting data to the specified `directory_path`. The returned Dataflow `df_cached` has a Caching Step added at the end. Any run of `df_cached` reuses the cached data, and the steps before the Caching Step are not run again.

Caching avoids running transforms multiple times, which can make local runs more efficient. Here are common places to use Caching:
- after reading data from a remote source
- after expensive transforms, such as Sort
- after transforms that change the shape of data, such as Sampling, Filter and Summarize

The Caching Step is ignored during a scale-out run invoked by `to_spark_dataframe()`.
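
The idea behind caching can be illustrated without the library. The sketch below is only a conceptual analogy in plain Python and is not how `Dataflow.cache` is implemented: `expensive_steps`, `run_with_cache`, and the `result.pkl` file name are hypothetical names made up for this example. It shows that once a result has been written to a cache directory, a second run reads it back instead of repeating the expensive work.

```python
import pickle
import tempfile
from pathlib import Path


def expensive_steps(counter):
    """Hypothetical stand-in for remote reads, sorts, etc.; counts its runs."""
    counter["runs"] += 1
    return sorted([5, 3, 1, 4, 2])


def run_with_cache(cache_dir, counter):
    """Return the stored result if present, otherwise compute and store it."""
    cache_file = Path(cache_dir) / "result.pkl"  # hypothetical file name
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    result = expensive_steps(counter)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result


with tempfile.TemporaryDirectory() as tmp:
    counter = {"runs": 0}
    first = run_with_cache(tmp, counter)   # computes and writes the cache
    second = run_with_cache(tmp, counter)  # reads the cache, no recomputation
```

Both calls return the same data, but `expensive_steps` runs only once, which is the saving that caching aims for on local runs.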

In [1]:
# read data and apply transforms
import azureml.dataprep as dprep
df = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv')
df = df.take_sample(probability=0.2, seed=7)
df = df.skip(1)
df = df.sort_asc(columns='schnam10')
df = df.keep_columns(['stnam', 'fipst', 'leaid', 'leanm10', 'ncessch', 'schnam10'])
df.head(5)

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10
0,ALABAMA,1,102100,Limestone County,10210000797,Ardmore High Sch
1,ALABAMA,1,101920,Jefferson County,10192000691,Brighton Middle Sch
2,ALABAMA,1,101920,Jefferson County,10192000720,Bryan Elem Sch
3,ALABAMA,1,102010,Lauderdale County,10201000766,Cloverdale Jr High Sch
4,ALABAMA,1,101920,Jefferson County,10192000701,Erwin Elem Sch


In [2]:
# choose a directory to store cache data
import os
from pathlib import Path
cache_dir = str(Path(os.getcwd(), 'dataflow-cache'))
cache_dir

'C:\\DPrep\\target\\Python\\debug\\azureml-dataprep\\docs\\dataflow-cache'

In [3]:
# cache the dataflow
df_cached = df.cache(directory_path=cache_dir)

In [4]:
# check steps in df_cached
[s.step_type for s in df_cached.get_steps()]

['Microsoft.DPrep.GetFilesBlock',
 'Microsoft.DPrep.ParseDelimitedBlock',
 'Microsoft.DPrep.DropColumnsBlock',
 'Microsoft.DPrep.TakeSampleBlock',
 'Microsoft.DPrep.SkipBlock',
 'Microsoft.DPrep.SortBlock',
 'Microsoft.DPrep.KeepColumnsBlock',
 'Microsoft.DPrep.CacheBlock']

In [5]:
# check the stored cache data
os.listdir(cache_dir)

['13652398-6e74-4e6a-91b5-dfee82fd7445',
 '3886425a-9ef5-4b0a-8745-f1103551f9b6.cacheIndex',
 '58479c05-f0ab-435b-b434-a7a89d9302d1',
 '7ecc948c-d166-4f15-ba82-7d0f0e197aa5',
 '855bd9c2-33c8-4613-925d-40269834be33.cacheIndex',
 '9e7bc988-501a-48c1-9978-10078e6bc53c.cacheIndex']
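
The listing above mixes entries with a `.cacheIndex` extension and entries with no extension. A small helper can summarize such a listing. This is a minimal sketch using only the standard library: `count_by_suffix` is a hypothetical helper, the demo entry names are made up, and nothing here assumes what the individual cache files contain.

```python
import os
import tempfile
from collections import Counter


def count_by_suffix(directory):
    """Count directory entries by file extension ('' means no extension)."""
    return Counter(os.path.splitext(name)[1] for name in os.listdir(directory))


# Demo on a throwaway directory with made-up entry names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a", "b", "c.cacheIndex"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_counts = count_by_suffix(demo_dir)
```

In the notebook, the same helper could be called as `count_by_suffix(cache_dir)`.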

In [6]:
# runs against df_cached reuse the cached data and skip all the previous steps
df_cached.head(5)

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10
0,ALABAMA,1,102100,Limestone County,10210000797,Ardmore High Sch
1,ALABAMA,1,101920,Jefferson County,10192000691,Brighton Middle Sch
2,ALABAMA,1,101920,Jefferson County,10192000720,Bryan Elem Sch
3,ALABAMA,1,102010,Lauderdale County,10201000766,Cloverdale Jr High Sch
4,ALABAMA,1,101920,Jefferson County,10192000701,Erwin Elem Sch


In [7]:
df1 = df_cached.take(10)
df2 = df_cached.skip(10).take(10)

# runs against df1 and df2 reuse the cached data as well
dataframe1 = df1.to_pandas_dataframe()
dataframe2 = df2.to_pandas_dataframe()

In [8]:
# clean up cache data
import shutil
shutil.rmtree(path=cache_dir)
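
As an alternative to removing the directory by hand, the standard library's `tempfile.TemporaryDirectory` can create a cache directory that is deleted automatically. The sketch below is an assumption-laden variation rather than part of the original notebook: the `dataflow-cache-` prefix is arbitrary, and the `df.cache` call is left as a comment because it needs `azureml.dataprep` and the `df` defined earlier.

```python
import os
import tempfile

with tempfile.TemporaryDirectory(prefix="dataflow-cache-") as tmp_cache_dir:
    # Inside the block the directory exists and could serve as directory_path.
    existed_inside = os.path.isdir(tmp_cache_dir)
    # df_cached = df.cache(directory_path=tmp_cache_dir)  # requires azureml.dataprep and `df`
    # ... run everything that depends on df_cached here ...

# Leaving the block removes the directory and everything in it.
exists_after = os.path.isdir(tmp_cache_dir)
```

With this pattern, all work that depends on the cached Dataflow should stay inside the `with` block, since the cached data no longer exists once the block exits.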