# Persist and cache

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

See [Best practice for cache(), count(), and take()](https://kb.databricks.com/en_US/scala/best-practice-cache-count-take)

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("cache demo").getOrCreate()

# Storage levels

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html

`StorageLevel` controls how an RDD is stored: in memory, on disk, or both. It also controls whether the RDD is serialized and whether its partitions are replicated.

### Note
*Caching is more effective when data has to be copied between nodes. This notebook runs on a single node, so we might not see the benefit.*

Also, seeing the time saved requires a large table or a lengthy computation.

# Let's see it working

In the code below, we create a vector of N random values, convert it to a DataFrame, and then sort it.

Run the code one step at a time and watch the execution times.
## NOTE: running this notebook on a Databricks server shows the differences more clearly!

In [None]:
%%time
# Generate a fairly large DataFrame
from pyspark.sql.types import StructType, StructField, FloatType
from pyspark.mllib.random import RandomRDDs

# normalRDD is documented to take a SparkContext, so pass spark.sparkContext
u = RandomRDDs.normalRDD(spark.sparkContext, 4*1000*1000, numPartitions=10)
u = u.map(lambda x: (x,))  # wrap each value in a tuple so we can transform it into a DF
schema = StructType([StructField('c1', FloatType(), False)])

In [None]:
%%time
df = spark.createDataFrame(u, schema)

When is the data stored in the cache?

`cache()` is lazy, like a transformation, so the data is saved only when an action runs.

For example, if we call `take(6)`, only the first of the 10 partitions is accessed, so only that partition will be cached.

In [None]:
%%time 
sorted_df = df.sort(df['c1'])
#sorted_df.cache()
_ = sorted_df.count() # force an action; if cache() is enabled above, this is what fills the cache

Now **run the cell above again several times**. We care about the "wall time" which includes both the CPU and time to load the data.

You should see more or less the same values (a few seconds on my own laptop).
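Outside a notebook, where the `%%time` magic is not available, repeated wall-time measurements can be collected with a small helper. This is a minimal sketch in plain Python; `time_action` is a hypothetical name introduced here, not a Spark API.

```python
import time

def time_action(action, repeats=3):
    """Call a zero-argument function `repeats` times.

    Returns a list with the wall-clock duration of each call, in seconds.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings
```

With the notebook's variables, a call such as `time_action(sorted_df.count)` would run the action three times and return the three durations, which makes it easy to compare runs before and after caching.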


Here is another calculation based on `df`. Not really a life changer.

In [None]:
%%time
filtered = df.sort(df['c1']).filter(df['c1'] > 0.5)
filtered.count()

Finally, run the cell below to cache `df`, then run the cell above again.

In the first run after calling `cache()`, the time is still the same. (Why?)

Now run it again (the second time after calling `cache()`). What do you see?

In [None]:
df.cache()

## Cache vs. Persist

`cache()` uses the default storage level: memory only for an RDD, memory and disk for a DataFrame.<br>
`persist()` takes a storage level, so it has many storage options (check the docs).