# RDD Persistence

RDD persistence lets you keep frequently used RDDs, or intermediate results, for later reuse. Because Spark evaluates RDDs lazily, an RDD that is not persisted is recomputed from its lineage every time an action runs on it. Storing intermediate RDDs in the cluster avoids this repeated computation and saves time and cluster resources.

There are two ways to persist RDDs in Spark:
- cache()
- persist()

The cache() method stores the RDD in memory, using the default storage level MEMORY_ONLY. This speeds up later queries because the RDD is read from memory instead of being recomputed. Caching is lazy: the data is stored the first time an action computes the RDD.

However, memory is limited, and the tasks running on the cluster compete for the same storage and execution memory. The persist() method addresses this by letting you choose a storage level, so that intermediate RDDs can be kept in memory, on disk, or both. Four commonly used storage levels are listed below; Spark also provides others, such as replicated variants like MEMORY_ONLY_2:

- MEMORY_ONLY
- MEMORY_AND_DISK
- DISK_ONLY
- OFF_HEAP

In [None]:
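# importing SparkConf and SparkContext from PySpark
from pyspark import SparkConf, SparkContext

# importing pyspark to access pyspark.StorageLevel, used with persist()
import pyspark

# the master URL for a YARN cluster is the lowercase string "yarn"
conf = SparkConf().setMaster("yarn").setAppName("RDD Persistence")

# getOrCreate is a class method: it reuses a running SparkContext or creates one with conf
sc = SparkContext.getOrCreate(conf=conf)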

In [None]:
# loading the data into an RDD
text = sc.textFile("/common_folder/pyspark_data/Bike-Sharing-Dataset/*.csv")
text.take(5)

In [None]:
# marking the RDD to be cached in memory (it is stored when the next action runs)
text.cache()

In [None]:
# unpersisting the data from memory
text.unpersist()

In [None]:
# persisting the data on disk only
text.persist(pyspark.StorageLevel.DISK_ONLY)

In [None]:
# unpersisting the data from disk
text.unpersist()