# Caching


### Overview
Understand how Spark caching works and how it affects execution time.



## STEP 1: Generate some large data

Under the `data/twinkle` directory, we have created some large data files for you.

You can generate more data if you'd like:
```bash
$ cd data/twinkle
$ ./create-data-files.sh
```

In [None]:
try:
    spark
except NameError:
    # initialize Spark Session
    import os
    import sys
    top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
    if top_dir not in sys.path:
        sys.path.append(top_dir)

    from init_spark import init_spark
    spark = init_spark()

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])
spark

## STEP 2: Log Level
Set the logging level to INFO so that Spark prints job execution times to the console.

In [None]:
sc = spark.sparkContext  # the init cell above defines only `spark`
sc.setLogLevel("INFO")
print("set log level to INFO")
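At INFO level, Spark logs a line for each finished job that includes its duration. Below is a minimal sketch of pulling the time out of such a line so it can be copied into the worksheet. The helper name `parse_job_time` is hypothetical, and the pattern is modeled on the sample output shown in this lab; the exact wording may vary between Spark versions.

```python
import re

# Pattern modeled on the sample log line in this lab (an assumption;
# the wording may differ in other Spark versions).
JOB_FINISHED = re.compile(r"Job (\d+) finished: (.+?), took ([0-9.]+) s")

def parse_job_time(log_line):
    """Return (job_id, description, seconds), or None if the line does not match."""
    match = JOB_FINISHED.search(log_line)
    if match is None:
        return None
    job_id, description, seconds = match.groups()
    return int(job_id), description, float(seconds)

sample = "Job 1 finished: count at <console>:30, took 3.792822 s"
print(parse_job_time(sample))
```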

## STEP 3: Recording Caching times
Download and inspect the Excel worksheet: [caching-worksheet](caching-worksheet.xlsx).  
We will fill in its values to see how caching performs.


## STEP 4: Load Data

Load a big file (e.g., `500M.data`).

In [None]:
# data_location = 'data/twinkle/500M.data'
data_location = 's3://elephantscale-public/data/text/twinkle/100M.data'
# data_location = 'https://elephantscale-public.s3.amazonaws.com/data/text/twinkle/100M.data'

f = spark.read.text(data_location)

print(f)

**=> Count the number of lines in this file**    

In [None]:
f.count()
# output might look like
# Job 1 finished: count at <console>:30, took __3.792822__ s

In [None]:
f.count()
f.count()

**=> Repeat the same `count()` operation a few times until the execution time stabilizes.**  
**=> Record the time in the spreadsheet.**  
**=> Can you explain the behavior of the `count()` execution time?**
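If reading times off the console log is tedious, the repeated runs can also be timed from the driver. This is a minimal sketch with a hypothetical helper, `time_action`, that accepts any zero-argument callable. It measures wall-clock time on the driver, so the numbers will probably be slightly higher than the job times Spark logs.

```python
import time

def time_action(action, runs=3):
    """Call `action` `runs` times and return the elapsed wall-clock seconds per call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in workload so the sketch runs without Spark;
# in the notebook, pass f.count instead.
def stand_in_count():
    return sum(range(100_000))

timings = time_action(stand_in_count, runs=5)
print(["%.4f s" % t for t in timings])
```

In the notebook, `time_action(f.count, runs=5)` would time five consecutive counts on the loaded file.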


## STEP 5:  Cache

In [None]:
f.cache()

**=> Run `count()` again and note the time. Can you explain this behavior? :-)**

**=> Run `count()` a few more times and note the execution times.**  
**=> Record the time in the spreadsheet.**  
**=> Do the timings make sense?**

In [None]:
f.count()

In [None]:
f.count()
f.count()
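While comparing timings, it can help to check what Spark reports about the DataFrame itself. A PySpark DataFrame exposes `is_cached`, which shows whether `cache()` or `persist()` has been called on it, and `storageLevel`, which shows where cached data would be kept. The sketch below wraps both in a hypothetical helper, `describe_cache`, and uses a stand-in object so it runs without Spark.

```python
from types import SimpleNamespace

def describe_cache(df):
    """Summarize caching status from a DataFrame's is_cached and storageLevel attributes."""
    state = "marked for caching" if df.is_cached else "not cached"
    return "%s (storage level: %s)" % (state, df.storageLevel)

# Stand-in object so the sketch runs without Spark;
# in the notebook, call describe_cache(f) on the real DataFrame.
fake_df = SimpleNamespace(is_cached=True, storageLevel="stand-in storage level")
print(describe_cache(fake_df))
```

To repeat the experiment from a clean state, `f.unpersist()` removes the DataFrame from the cache.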

## STEP 6:  Understanding Cache storage

Go to the Spark UI on port 4040.  
**=> Inspect the 'Storage' tab.**  

In [None]:
print('Spark UI running on http://YOURIPADDRESS:' + spark.sparkContext.uiWebUrl.split(':')[2])
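The information behind the Storage tab is also exposed as JSON through Spark's monitoring REST API, under `/api/v1/applications/[app-id]/storage/rdd`. The sketch below only builds that URL. The helper name `storage_api_url` and the sample values are hypothetical, and whether the endpoint is reachable depends on your setup.

```python
def storage_api_url(ui_url, app_id):
    """Build the monitoring REST URL that lists cached RDDs for one application."""
    return "%s/api/v1/applications/%s/storage/rdd" % (ui_url.rstrip("/"), app_id)

# Hypothetical values so the sketch runs without Spark; in the notebook use
# spark.sparkContext.uiWebUrl and spark.sparkContext.applicationId instead.
url = storage_api_url("http://localhost:4040", "local-0000000000000")
print(url)
```

Opening the resulting URL in a browser after a cached `count()` should show entries comparable to those on the Storage tab.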

### Group discussion

* Mechanics of caching
* Trade-offs between caching and memory usage

### Further Reading

* [Understanding Spark Caching by Sujee Maniyam](http://sujee.net/2015/01/22/understanding-spark-caching/)