# Lab 3.3: Caching

### Overview
Understand how Spark caching behaves by timing repeated reads of a large file with and without caching.

### Depends On 
None

### Run time
15-20 mins


## Step 1: Generate 'Large' data set
If you haven't generated the large data earlier, you can do so here.

In [None]:
! ls -lSrh ../data/text/twinkle
# sorted by size (smallest --> largest)

If the listing above does not show the generated files (for example `500M.data`), run the cell below to create them.

In [None]:
! echo "creating data..."

! [ ! -r ../data/text/twinkle/500M.data ] && cd  ../data/text/twinkle &&   ./create-data-files.sh

! echo "DONE"

! ls -lSrh ../data/text/twinkle
# sorted by size (smallest --> largest)

## Step 2: Init Spark

In [None]:
try:
    spark
except NameError:
    # initialize Spark Session
    import os
    import sys
    top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
    if top_dir not in sys.path:
        sys.path.append(top_dir)

    from init_spark import init_spark
    spark = init_spark()

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])
spark

## Step 3: Recording Caching Times
### Download and inspect the Excel worksheet: [caching-worksheet](caching-worksheet.xlsx)
- In the Jupyter file browser, locate the worksheet under the __`~/dev/spark-labs/03-rdd/`__ directory
- Right-click __`caching-worksheet.xlsx`__ and select Download

We are going to fill in the values here to understand how caching performs.

It looks like this:
<img src="../assets/images/3.6a.png" style="border: 5px solid grey; max-width:100%;"/>


## Step 4: Load Data

Load a big file (e.g., `500M.data`).

In [None]:
f = spark.read.text("../data/text/twinkle/500M.data")

## for cloud accounts use this
# f = spark.read.text("s3://elephantscale-public/data/text/twinkle/500M.data")
# f = spark.read.text("https://elephantscale-public.s3.amazonaws.com/data/text/twinkle/500M.data")

print("read file ", f)

**=> Count the number of lines in this file**    

In [None]:
%%time

print(f.count())

# output might look like
# Job 1 finished: count at <console>:30, took __3.792822__ s

**=> Observe time taken on Spark UI**  
**=> Record the time in the spreadsheet**  
**=> Run 'count' below a couple of times and observe the time**  
**=> Can you explain the behavior of the `count()` execution time?**

In [None]:
%%time

print (f.count())


In [None]:
%%time

print (f.count())
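
The repeated `%%time` cells above can also be wrapped in a small helper so that the timings for the worksheet are collected in one place. This is a minimal sketch and not part of the lab materials: `time_action` is a hypothetical name, and the `sum(range(...))` workload is only a stand-in so the block runs without Spark.

```python
import time

def time_action(action, repeats=3):
    """Call `action` `repeats` times; return (last result, list of wall-clock seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        timings.append(time.perf_counter() - start)
    return result, timings

# Stand-in workload so the sketch runs anywhere; in the notebook, pass f.count instead.
result, timings = time_action(lambda: sum(range(100_000)), repeats=3)
for i, seconds in enumerate(timings, start=1):
    print(f"run {i}: {seconds:.4f} s")
```

In the notebook, `time_action(f.count)` would return the line count together with one wall-clock time per run. The Spark UI remains the reference for per-job durations.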


## Step 5: Cache

**=> Cache the file using the `cache()` method.**

In [None]:
%%time 

f.cache()
print ("done caching")

**=> Run `count()` again and note the time. Can you explain this behavior? :-)**

In [None]:
%%time

print (f.count())

**=> Run `count()` a few more times and note the execution times.**  
**=> Record the times in the spreadsheet.**  
**=> Do the timings make sense?** 

In [None]:
%%time

print (f.count())

In [None]:
%%time

print (f.count())

### Discussions

- If you are reading the data only twice, is caching worth the cost?
- What is the **minimal number of times** you have to reuse the data for caching to be worth it?
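
One way to frame the break-even question is to compare total read time with and without caching. The sketch below assumes a simple model: the first read after `cache()` is the one that pays for populating the cache, and every later read is served from it. The function name and the sample timings are hypothetical; substitute the values recorded in your worksheet.

```python
def min_reads_for_caching(uncached_s, first_cached_s, cached_s, max_reads=1000):
    """Smallest number of reads n for which caching is faster overall, or None.

    Without caching: n * uncached_s
    With caching:    first_cached_s + (n - 1) * cached_s
    """
    for n in range(1, max_reads + 1):
        without_cache = n * uncached_s
        with_cache = first_cached_s + (n - 1) * cached_s
        if with_cache < without_cache:
            return n
    return None  # caching never pays off within max_reads

# Hypothetical timings in seconds; replace them with the values from your worksheet.
print(min_reads_for_caching(uncached_s=4.0, first_cached_s=6.0, cached_s=0.5))  # prints 2
```

With these made-up numbers, two reads already favor caching (6.5 s against 8.0 s). A slower first cached read pushes the break-even point further out.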

## Step 6: Understanding Cache Storage

Go to the Spark UI (port 4040 or higher; Step 2 prints the actual port).

**=> Inspect 'storage' tab**  

<img src="../assets/images/caching-2.png" style="border: 5px solid grey; max-width:100%;"/>

### Questions

**=> Can you see the cached data?  What is the memory size?**  
**=> Can you explain the behavior?**
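
Alongside the 'Storage' tab, the storage level can be read from the notebook through the DataFrame's `storageLevel` property. The helper below is a small sketch; `describe_cache` is a hypothetical name, and it relies only on the `useMemory`, `useDisk`, `deserialized`, and `replication` fields of `pyspark.StorageLevel`.

```python
def describe_cache(df):
    """Summarize the storage level Spark reports for a DataFrame."""
    level = df.storageLevel  # a pyspark.StorageLevel when df is a real DataFrame
    marked = level.useMemory or level.useDisk
    return (
        f"marked_for_caching={marked}, useMemory={level.useMemory}, "
        f"useDisk={level.useDisk}, deserialized={level.deserialized}, "
        f"replication={level.replication}"
    )
```

In the notebook, `print(describe_cache(f))` would show the level in effect for `f`. It reports only the level, not the cached size, which is still read from the 'Storage' tab.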



## Step 7: Caching Binary Data

Let's see how caching handles binary data.

In [None]:
## Create some random data
print("creating 100M random data")
# write next to the other lab data so the read in the next cell finds the file
!dd if=/dev/urandom of=../data/100M-rand bs=1M count=100
print("done")

In [None]:
df  = spark.read.text("../data/100M-rand")
df.cache()
print("df count ", df.count())

Now check the 'Storage' tab in the Spark UI (port 4040 or higher).  

Here is a sample output.

<img src="../assets/images/caching-3.png" style="border: 5px solid grey; max-width:100%;"/>

### Discussion

**==> Discuss your findings**
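
One hypothesis worth testing during the discussion is that repetitive text and random bytes behave very differently under any compressed representation. The standalone sketch below illustrates that general idea with `zlib`. It is only a stand-in: Spark's in-memory format and codecs differ, the repeated line is made up, and the ratios do not predict the numbers in the 'Storage' tab.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data)) / len(data)

# About 1 MB of a made-up repeated line, and the same amount of random bytes.
repetitive_text = b"twinkle twinkle little star\n" * 40_000
random_bytes = os.urandom(len(repetitive_text))

print(f"repetitive text: {compression_ratio(repetitive_text):.3f}")
print(f"random bytes:    {compression_ratio(random_bytes):.3f}")
```

The repetitive input shrinks to a small fraction of its size, while the random input barely shrinks at all.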



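## Step 8: Reducing Memory Footprint

Spark offers several storage levels for cached data. Here are two, written with their Scala API names:

* Raw caching (`data.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)`)
* Serialized caching (`data.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)`)

In PySpark, storage levels are exposed as `pyspark.StorageLevel` (for example `pyspark.StorageLevel.MEMORY_ONLY`). Recent PySpark releases do not define `MEMORY_ONLY_SER`, because Python objects are always stored serialized. To compare footprints from Python, try another available level such as `MEMORY_AND_DISK` or `DISK_ONLY`.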


**=> Try these levels with `data.persist(...)` and monitor memory consumption in the 'Storage' tab.**

**Note: The storage level cannot be changed once an RDD or DataFrame is cached. You have to call `unpersist()` first and then cache it again at the new level.**

In [None]:
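import pyspark

data = spark.read.text("../data/text/twinkle/500M.data")
## for cloud accounts use this
# data = spark.read.text("s3://elephantscale-public/data/text/twinkle/500M.data")

# note: DataFrame.cache() defaults to MEMORY_AND_DISK, so this is not identical to data.cache()
data.persist(pyspark.StorageLevel.MEMORY_ONLY)
print(data.count())  # persist() is lazy; run an action so the data shows up in the 'Storage' tab
## TODO :  measure the storage footprint using 'storage' tab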

In [None]:
data.unpersist()
## TODO :  measure the storage footprint using 'storage' tab
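
Because the level cannot be changed in place, comparing levels means repeating the same three steps each time: unpersist, persist at the new level, and run an action. The helper below sketches that cycle. `repersist` is a hypothetical name, and it uses only the `unpersist()`, `persist()`, and `count()` methods of a DataFrame.

```python
def repersist(df, level):
    """Drop any existing cache for `df`, persist it at `level`, and materialize it."""
    df.unpersist()      # the level cannot be changed while the data is still cached
    df.persist(level)   # lazy: only marks the DataFrame for caching
    return df.count()   # an action, so the cache is actually populated
```

In the notebook this could be called as `repersist(data, pyspark.StorageLevel.MEMORY_AND_DISK)`, followed by a look at the 'Storage' tab before trying the next level.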

### Group discussion

* mechanics of caching
* implications of caching vs memory

### Further Reading

* [Understanding Spark Caching by Sujee Maniyam](http://sujee.net/2015/01/22/understanding-spark-caching/)