# Caching

### Overview

Understanding Spark caching

### Depends On 

None

### Run time

20-30 mins


## Step 1: Generate 'Large' data set

Let's generate a data set that is 'big enough'.

In [None]:
! echo "creating data..."

! [ ! -r ../data/twinkle/500M.data ] && cd  ../data/twinkle &&   ./create-data-files.sh

! echo "DONE"

! ls -lSrh ../data/twinkle
# sorted by size (smallest --> largest)

## Step 2: Init Spark

In [None]:
try:
    spark
except NameError:
    import findspark
    findspark.init()  # uses SPARK_HOME
    print("Spark found in : ", findspark.find())

    import pyspark
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # use a unique temp dir for the warehouse dir, so we can run multiple Spark sessions in one dir
    import tempfile
    tmpdir = tempfile.TemporaryDirectory()

    config = ( SparkConf()
             .setAppName("TestApp")
             .setMaster("local[*]")
             .set('spark.executor.memory', '2g')
             .set('spark.sql.warehouse.dir', tmpdir.name)
             .set("some_property", "some_value") # another example
             )

    spark = SparkSession.builder.config(conf=config).getOrCreate()
    sc = spark.sparkContext

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])

## Step 3: Recording Caching Times

**Download and inspect the Excel worksheet: `08-caching/caching-worksheet.xlsx`**

We are going to fill in the values here to understand how caching performs.

It looks like this:
<img src="../assets/images/caching-1.png" style="border: 5px solid grey; max-width:100%;"/>
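
If you would rather collect the timings in code than read them off `%%time`, a small helper along these lines may be useful. This is only a sketch: `time_action` and `timings` are hypothetical names (they are not part of the lab files), and the helper assumes nothing more than a zero-argument callable. Once the DataFrame `f` is loaded in the next step, you could pass `f.count` in place of the stand-in workload.

```python
import time

# Hypothetical helper, not part of the lab files: runs a zero-argument
# callable several times and returns the wall-clock seconds of each run.
def time_action(action, runs=3):
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()  # e.g. f.count, which triggers a Spark job
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload so the sketch runs without Spark;
# in the notebook you would pass f.count instead.
timings = time_action(lambda: sum(range(100_000)), runs=3)
print(["%.4f s" % t for t in timings])
```

Keeping each run as a separate value (rather than an average) matches the worksheet, where the first run and the later runs are recorded separately.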


## Step 4: Load Data

Load a big file (e.g., `500M.data`).

In [None]:
f = spark.read.text("../data/twinkle/500M.data")
print("read file ", f)

**=> Count the number of lines in this file**    

In [None]:
%%time

print(f.count())


**=> Observe the time taken on the Spark UI**  
**=> Record the time in the spreadsheet**  
**=> Run `count()` below a couple of times and observe the time**  
**=> Can you explain the behavior of the `count()` execution time?**

In [None]:
%%time

print (f.count())


In [None]:
%%time

print (f.count())


## Step 5: Cache

**=> Cache the file using the `cache()` method.**

In [None]:
%%time 

f.cache()
print ("done caching")

**=> Run `count()` again and note the time. Can you explain this behavior? :-)**

In [None]:
%%time

print (f.count())

**=> Run count() a few more times and note the execution times.**  
**=> Record the time in spreadsheet.**  
**=> Do the timings make sense?** 

In [None]:
%%time

print (f.count())

In [None]:
%%time

print (f.count())

## Step 6: Understanding Cache Storage

Go to the Spark UI (port 4040 by default; Step 2 prints the actual port)  
**=> Inspect 'storage' tab**  

<img src="../assets/images/caching-2.png" style="border: 5px solid grey; max-width:100%;"/>

**=> Can you see the cached data?  What is the memory size?**  
**=> What are the implications?** 
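
Cached data keeps occupying storage until it is released, so it can help to check and clear the cache from code as well as from the UI. The sketch below assumes a PySpark DataFrame, which exposes the `is_cached` property and the `unpersist()` method. The helper name `release_cache` and the `StandInDataFrame` class are hypothetical; the stand-in exists only so the example runs without a Spark session.

```python
# Hypothetical helper: report whether a DataFrame is marked as cached,
# release the cache, and report again.
def release_cache(df):
    before = df.is_cached  # PySpark DataFrame property
    df.unpersist()         # asks Spark to drop the cached blocks
    after = df.is_cached
    return before, after

# Stand-in that mimics only the two members used above, so the sketch
# runs without Spark. In the notebook, call release_cache(f) instead.
class StandInDataFrame:
    def __init__(self):
        self.is_cached = True

    def unpersist(self):
        self.is_cached = False
        return self

result = release_cache(StandInDataFrame())
print(result)  # (True, False)
```

After running this against `f`, the entry should disappear from the 'Storage' tab, and the next `count()` would read from disk again.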



## Step 7: Caching RDD vs. DataFrame
We will load the same data using the RDD API and the DataFrame API, then compare how each one is cached.

In [None]:
## Create some random data
print("creating 100M random data")
!dd if=/dev/urandom of=../data/100M.data  bs=1M count=100
print("done")

In [None]:
# RDD API
rdd = sc.textFile("../data/100M.data")
rdd.cache()
print("rdd count " , rdd.count())  # count() is an action, so it forces caching

# DataFrame API
df  = spark.read.text("../data/100M.data")
df.cache()
print("df count ", df.count())  # same here: the action materializes the cache

Now check the 'Storage' tab in the Spark UI (port 4040 by default).  

**Do you see any noticeable difference?**
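
To compare the two from code as well, you could print the storage level each API reports. This is a sketch under the assumption that `rdd` is a PySpark RDD (which has `getStorageLevel()`) and `df` is a PySpark DataFrame (which has the `storageLevel` property). The helper name `storage_levels` and both stand-in classes are hypothetical, and the placeholder strings are not real Spark output.

```python
# Hypothetical helper: collect the storage level reported by an RDD
# and by a DataFrame so they can be compared side by side.
def storage_levels(rdd, df):
    return {
        "rdd": str(rdd.getStorageLevel()),  # RDD API: a method
        "df": str(df.storageLevel),         # DataFrame API: a property
    }

# Stand-ins with placeholder values so the sketch runs without Spark.
# In the notebook, call storage_levels(rdd, df) on the real objects.
class StandInRDD:
    def getStorageLevel(self):
        return "placeholder RDD level"

class StandInDataFrame:
    storageLevel = "placeholder DataFrame level"

levels = storage_levels(StandInRDD(), StandInDataFrame())
for name, level in levels.items():
    print(name, "->", level)
```

Compare what the real objects report with the 'Storage Level' and 'Size in Memory' columns in the UI.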


## Step 8: BONUS Lab: Try Caching HDFS Data

Now that we have a fairly good idea of caching mechanics, let's try caching HDFS data. Use the following code to get started.

Start PySpark in YARN mode:

```bash
$ pyspark --master yarn
```

Then try the following code:

In [None]:
# Adjust the data path accordingly
df1 = spark.read.csv('/user/me/transactions/csv')

# time this
df1.count()

# cache
df1.cache()

# and measure time again
df1.count()