# Persist and cache

By default, each transformed RDD may be recomputed every time you run an action on it. However, you can persist an RDD in memory with the `persist()` (or `cache()`) method, in which case Spark keeps the elements on the cluster for much faster access the next time you query them. RDDs can also be persisted on disk or replicated across multiple nodes. The same methods exist on DataFrames, which the example below uses.

In [None]:

from pyspark.sql import SparkSession

logFile = "/var/log/dpkg.log" # this should be a large file to show the effect of caching
spark = SparkSession.builder.appName("cache demo").getOrCreate()
logData = spark.read.text(logFile).cache()

# Storage levels

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html

A `StorageLevel` controls how a persisted RDD is stored: in memory, on disk, or both. It also controls whether the data is serialized and whether its partitions are replicated.

### Blooper alert
Caching is most effective when the data would otherwise have to be copied between nodes. This notebook runs on a single node, so we will not see much of the benefit.

Also, to see the time saved, a large table has to be used.
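
One way to compare the two actions is to time each of them separately, since the cache is only filled by the first action. The sketch below assumes a hypothetical helper named `timed`; it is plain Python and not part of Spark.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Usage with the notebook's DataFrame (needs the SparkSession created earlier):
# numAs, first = timed(lambda: logData.filter(logData.value.contains('a')).count())
# numBs, second = timed(lambda: logData.filter(logData.value.contains('b')).count())
# print(f"first action: {first:.2f}s, second action: {second:.2f}s")
```

With a large enough input, the second action would be expected to run faster because it reads from the cache instead of the file.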

In [None]:
%%time

numAs = logData.filter(logData.value.contains('a')).count()

# logData is now cached. 
numBs = logData.filter(logData.value.contains('b')).count()

In [None]:
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

## Cache vs. Persist

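`cache()` is shorthand for `persist()` with the default storage level: `MEMORY_ONLY` for RDDs, and a memory-and-disk level for DataFrames such as `logData` above.<br>
`persist()` also accepts an explicit `StorageLevel`, so you can choose between memory, disk, and replication (check the docs).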