<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Lab :  Caching


### Overview
Understanding Spark caching

### Depends On 
None

### Run time
15-20 mins


### STEP 1: 'Large' data set

Under the `/data/text/twinkle` directory, we have created some large data files for you.

<img src="../assets/images/3.1a.png" style="border: 5px solid grey; max-width:100%;"/>


#### Optional Step
You can generate more data if you'd like.
```bash
    $    cd /data/text/twinkle
    $    ./create-data-files.sh
```

In [None]:
## Identify Spark UI port
print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])
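
The `split(':')[2]` expression above assumes `sc.uiWebUrl` has the form `http://host:port`. As a slightly more defensive sketch, the port can also be read with the standard library's URL parser. `ui_port` is a hypothetical helper name, and the sample URL is made up for illustration.

```python
from urllib.parse import urlparse

def ui_port(web_url):
    """Return the port from a URL such as 'http://10.0.0.5:4040' (None if absent)."""
    return urlparse(web_url).port

# Stand-in URL for illustration; in the notebook, pass sc.uiWebUrl instead.
print(ui_port("http://10.0.0.5:4040"))
```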

## Step 2: Log Level
Set the logging level to INFO so that Spark prints job execution times on the console.

In [None]:
sc.setLogLevel("INFO")
print("set log level to INFO")

## STEP 3: Recording Caching times
### Download and inspect the Excel worksheet : [caching-worksheet](caching-worksheet.xlsx).   

We are going to fill in the values here to understand how caching performs.

It looks like this:
<img src="../assets/images/3.6a.png" style="border: 5px solid grey; max-width:100%;"/>


## STEP 4: Load Data

Load a big file (e.g. `500M.data`).

In [None]:
f = spark.read.text("/data/text/twinkle/500M.data")
print("read file ", f)

**=> Count the number of lines in this file**    

In [None]:
print(f.count())

# output might look like
# Job 1 finished: count at <console>:30, took __3.792822__ s

**=> Observe the time taken on the Spark UI**  
**=> Record the time in the spreadsheet**  
**=> Run `count()` below a couple of times and observe the time**  
**=> Can you explain the behavior of the `count()` execution time?**

In [None]:
print (f.count())
print (f.count())
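
If reading times off the console log gets tedious, a small timing wrapper can help with recording them. The following is a minimal sketch, not part of the original lab: `timed` is a hypothetical helper, and the workload shown is a stand-in so the snippet runs without Spark.

```python
import time

def timed(label, fn):
    """Call fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: result={result}, took {elapsed:.3f} s")
    return result, elapsed

# Stand-in workload so the sketch runs anywhere; in the notebook use f.count.
result, seconds = timed("demo", lambda: sum(range(1_000_000)))
```

In the notebook, the call would look like `timed("count", f.count)`. This measures wall-clock time at the driver, so it may differ slightly from the job duration shown in the Spark UI.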


## STEP 5:  Cache

**=> Cache the file using the `cache()` method.**

In [None]:
f.cache()
print ("done caching")

**=> Run `count()` again and note the time. Can you explain this behavior? :-)**

In [None]:
print (f.count())

**=> Run count() a few more times and note the execution times.**  
**=> Record the time in spreadsheet.**  
**=> Do the timings make sense?** 

In [None]:
print (f.count())
print (f.count())
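
Once a few timings are in the spreadsheet, one way to check whether they make sense is to compare the averages before and after caching. This is a rough sketch: `speedup` is a hypothetical helper, and the numbers are placeholders, not measurements.

```python
from statistics import mean

def speedup(uncached_secs, cached_secs):
    """Ratio of mean uncached time to mean cached time (> 1 means caching helped)."""
    return mean(uncached_secs) / mean(cached_secs)

# Placeholder timings in seconds; replace with the values you recorded.
uncached = [4.0, 3.0, 2.0]
cached = [1.0, 0.5]
print(f"speedup: {speedup(uncached, cached):.1f}x")
```

When filling in the lists, consider whether the first `count()` after `cache()` belongs with the cached or the uncached timings.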

## STEP 6:  Understanding Cache storage

Go to the Spark shell UI on port 4040  
**=> Inspect 'storage' tab**  

<img src="../assets/images/3.6b.png" style="border: 5px solid grey; max-width:100%;"/>

**=> Can you see the cached data?  What is the memory size?**  
**=> What are the implications?** 
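
The numbers on the 'storage' tab can also be read programmatically. The sketch below assumes Spark's monitoring REST API is reachable at the same address as the UI, and that each entry carries `name`, `memoryUsed` and `diskUsed` fields; treat those field names as assumptions to verify against your Spark version. `fetch_storage` and `summarize_storage` are hypothetical helper names, and the sample payload is made up.

```python
import json
from urllib.request import urlopen

def fetch_storage(ui_base_url, app_id):
    """Fetch the list of cached RDDs from the Spark monitoring REST API."""
    url = f"{ui_base_url}/api/v1/applications/{app_id}/storage/rdd"
    with urlopen(url) as resp:
        return json.load(resp)

def summarize_storage(entries):
    """Return (name, memory in MB, disk in MB) for each cached entry."""
    mb = 1024 * 1024
    return [
        (e.get("name", "?"), e.get("memoryUsed", 0) / mb, e.get("diskUsed", 0) / mb)
        for e in entries
    ]

# Made-up payload for illustration only; real values come from fetch_storage(...).
sample = [{"name": "example", "memoryUsed": 2 * 1024 * 1024, "diskUsed": 0}]
for name, mem_mb, disk_mb in summarize_storage(sample):
    print(f"{name}: {mem_mb:.1f} MB in memory, {disk_mb:.1f} MB on disk")
```

In the notebook, something like `summarize_storage(fetch_storage(sc.uiWebUrl, sc.applicationId))` should list the cached data while the application is running.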



## Step 8 : Caching RDD vs. Dataframe
We will load the same data using the RDD API and the DataFrame API, and compare how much memory each one uses when cached.

In [None]:
## Create some random data
print("creating 100M random data")
!dd if=/dev/urandom of=/data/100M-rand  bs=1M count=100
print("done")

In [None]:
#RDD
rdd = sc.textFile("/data/100M-rand")
rdd.cache()
print("rdd count " , rdd.count())  # force caching

df  = spark.read.text("/data/100M-rand")
df.cache()
print("df count ", df.count())

Now check the 'Storage' tab in Spark Shell UI (port 4040).  

Here is a sample output.

<img src="../assets/images/3.6c-rdd-ds-cache.png" style="border: 5px solid grey; max-width:100%;"/>

**=> Discuss your findings**



### Step 9 : Reducing memory footprint 

There are various levels of memory caching.  Here are a couple:  

* Raw caching (`data.persist(pyspark.StorageLevel.MEMORY_ONLY)`)  
* Serialized caching (`data.persist(pyspark.StorageLevel.MEMORY_ONLY_SER)`)


**=> Try both options with `f.persist(....)`. Monitor memory consumption in the 'storage' tab.**

**Note: The caching level cannot be changed once an RDD is cached. You have to unpersist the RDD and then cache it again.**

In [None]:
import pyspark

data = spark.read.text("/data/text/twinkle/500M.data")
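data.persist(pyspark.StorageLevel.MEMORY_ONLY)  # the RDD cache() default; DataFrame cache() defaults to a memory-and-disk level
data.count()  # persist() is lazy; run an action so the data is actually cached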
## TODO :  measure the storage footprint using 'storage' tab

In [None]:
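data.unpersist()
# MEMORY_ONLY_SER may not be defined in newer PySpark releases (AttributeError); check your version
data.persist(pyspark.StorageLevel.MEMORY_ONLY_SER)
data.count()  # run an action again so the cache is rebuilt at the new level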
## TODO :  measure the storage footprint using 'storage' tab
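
To build some intuition for why a serialized cache can be smaller than a raw one, here is a plain-Python analogy. It is not how Spark stores data internally: it only compares a list of separate string objects with a single pickled byte buffer holding the same content. The sample lines are made up.

```python
import pickle
import sys

# Distinct strings so that each one is a separate object in memory.
lines = [f"line {i}: twinkle twinkle little star" for i in range(10_000)]

# "Raw" form: a list of individual objects, each with its own header.
raw_bytes = sys.getsizeof(lines) + sum(sys.getsizeof(s) for s in lines)

# "Serialized" form: one contiguous byte buffer.
serialized_bytes = len(pickle.dumps(lines))

print(f"as objects:    {raw_bytes:,} bytes")
print(f"as serialized: {serialized_bytes:,} bytes")
```

The trade-off is that the serialized form must be decoded before it can be used, which costs CPU time on every read.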

### Group discussion

* mechanics of caching
* implications of caching vs memory

### Further Reading

* [Understanding Spark Caching by Sujee Maniyam](http://sujee.net/2015/01/22/understanding-spark-caching/)