# PySpark RDD Persistence Tutorial
PySpark cache() and persist() are optimization techniques that improve the performance of iterative and interactive RDD jobs. In this section of the PySpark RDD tutorial, I will explain how to use the persist() and cache() methods on an RDD, with examples.

Though PySpark can run some workloads up to 100 times faster than traditional MapReduce jobs, you will still see performance degrade on billions or trillions of records if your jobs are not designed to reuse repeated computations. Hence, we need to look at the computations and apply optimization techniques such as persistence to improve performance.

With the cache() and persist() methods, PySpark provides an optimization mechanism that stores the intermediate computation of an RDD so it can be reused in subsequent actions.

When you persist or cache an RDD, each worker node stores its partitions of the data in memory or on disk and reuses them in other actions on that RDD. Spark’s persisted data is also fault-tolerant: if any partition is lost, it is automatically recomputed using the original transformations that created it.

## Advantages of Persisting RDD
- Cost efficient – PySpark computations are expensive, so reusing their results saves cost.
- Time efficient – Reusing repeated computations saves a lot of time.
- Execution time – Shorter job execution time lets us run more jobs on the same cluster.

### RDD Cache
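PySpark RDD <strong>cache()</strong> method by default saves the RDD computation at storage level <strong>MEMORY_ONLY</strong>, meaning the data is kept in the memory of the executor JVMs. In Scala and Java this means deserialized objects on the heap; PySpark always pickles the Python objects before storing them.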

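PySpark cache() method in the RDD class internally calls persist() with the default storage level. Both methods are lazy: they only mark the RDD, and the data is actually stored when the first action runs on it. (The sparkSession.sharedState.cacheManager.cacheQuery path applies to DataFrames and Datasets, not to RDDs.) Let’s look at an example.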

In [2]:
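from pyspark.sql import SparkSession
# Reuse the running SparkSession, or create one if none exists yet
spark = SparkSession.builder.getOrCreate()
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)
cachedRdd = rdd.cache()  # marks the RDD for caching at MEMORY_ONLY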

In [3]:
cachedRdd

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274

### RDD Persist
PySpark persist() method stores the RDD at one of the storage levels <strong>MEMORY_ONLY</strong>, <strong>MEMORY_AND_DISK</strong>, <strong>DISK_ONLY</strong>, <strong>MEMORY_ONLY_2</strong>, <strong>MEMORY_AND_DISK_2</strong> and more. The Scala and Java API additionally defines serialized variants such as <strong>MEMORY_ONLY_SER</strong> and <strong>MEMORY_AND_DISK_SER</strong>.

PySpark persist() can be called in two ways: without an argument, in which case it uses the default <strong>MEMORY_ONLY</strong> storage level, or with a StorageLevel argument to store the RDD at a different storage level.



In [8]:
import pyspark
rddPersist = rdd.persist(pyspark.StorageLevel.MEMORY_ONLY)
rddPersist

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274

### RDD Unpersist
PySpark keeps track of every persist() and cache() call you make, checks usage on each node, and drops persisted data that is no longer used, following a least-recently-used (LRU) algorithm. You can also remove the data manually with the unpersist() method. unpersist() marks the RDD as non-persistent and removes all of its blocks from memory and disk.

In [9]:
rddPersist2 = rddPersist.unpersist()

unpersist() also accepts a boolean blocking argument (available in PySpark 3.0 and later); with blocking=True, the call waits until all blocks are deleted.

## Persistence Storage Levels
All the storage levels PySpark supports are defined in the pyspark.StorageLevel class, which mirrors org.apache.spark.storage.StorageLevel on the JVM side. A storage level defines how and where the RDD is stored.

<strong>MEMORY_ONLY</strong> – This is the default behavior of the RDD cache() method and stores the RDD in JVM memory. In Scala and Java the data is kept as deserialized objects; in PySpark the data is always pickled first. When there is not enough memory available, some partitions are not stored and are recomputed as and when required. This level takes more memory but runs faster, as reading from memory costs few CPU cycles.

<strong>MEMORY_ONLY_SER</strong> – Same as MEMORY_ONLY, except that it stores the RDD as serialized objects in JVM memory. It takes less memory (space-efficient) than MEMORY_ONLY, at the cost of a few additional CPU cycles to deserialize the objects. This level belongs to the Scala and Java API; PySpark 2.0 and later do not define a separate constant for it, because Python objects are always serialized.

<strong>MEMORY_ONLY_2</strong> – Same as the MEMORY_ONLY storage level, but replicates each partition to two cluster nodes.

<strong>MEMORY_ONLY_SER_2</strong> – Same as the MEMORY_ONLY_SER storage level, but replicates each partition to two cluster nodes.

<strong>MEMORY_AND_DISK</strong> – In this storage level, the RDD is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, the excess partitions are written to disk and read back from disk when they are required. It is slower than MEMORY_ONLY because of the I/O involved.

<strong>MEMORY_AND_DISK_SER</strong> – Same as the MEMORY_AND_DISK storage level, except that it stores the RDD as serialized objects in memory and spills the partitions that do not fit to disk.

<strong>MEMORY_AND_DISK_2</strong> – Same as the MEMORY_AND_DISK storage level, but replicates each partition to two cluster nodes.

<strong>MEMORY_AND_DISK_SER_2</strong> – Same as the MEMORY_AND_DISK_SER storage level, but replicates each partition to two cluster nodes.

<strong>DISK_ONLY</strong> – In this storage level, the RDD is stored only on disk, and access is slower because every read involves disk I/O.

<strong>DISK_ONLY_2</strong> – Same as the DISK_ONLY storage level, but replicates each partition to two cluster nodes.