# Caching and Persistence Example

This example briefly shows how caching and persistence work in Spark.

Open the Spark Web UI (localhost:4040) to see how RDDs are cached.

Load the tweets dataset and materialize it with an action. Check the Storage tab in the Web UI to verify that no RDD has been cached yet.

In [53]:
tweets = spark.sparkContext.textFile("/work/data/covid-tweets.json")
tweets.count()

78011

Let's check how many partitions the RDD has:

In [54]:
tweets.getNumPartitions()

2

## Cache

Cache the dataset. Check the Web UI to see whether anything changed in the Storage tab.

In [56]:
tweets.cache()

/work/data/covid-tweets.json MapPartitionsRDD[88] at textFile at <unknown>:0

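Calling cache() is lazy: it only marks the RDD for caching. The RDD is actually stored the first time it is materialized. Materialize tweets by running an action, then check the Storage tab again.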

In [57]:
tweets.count()

78011
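
The lazy behavior described above can be illustrated with a toy model in plain Python. This is a sketch, not how Spark is implemented: `ToyDataset` and its `computations` counter are hypothetical names introduced only to show that calling `cache()` stores nothing by itself and that the first action afterwards fills the cache.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, compute: Callable[[], List[str]]):
        self._compute = compute
        self._cache_requested = False
        self._cached: Optional[List[str]] = None
        self.computations = 0  # how many times the data was recomputed

    def cache(self) -> "ToyDataset":
        # Only sets a flag; nothing is computed or stored yet.
        self._cache_requested = True
        return self

    def _materialize(self) -> List[str]:
        if self._cached is not None:
            return self._cached  # served from the cache
        self.computations += 1
        data = self._compute()
        if self._cache_requested:
            self._cached = data  # stored on the first action after cache()
        return data

    def count(self) -> int:
        # An "action": forces the data to be materialized.
        return len(self._materialize())


lines = ToyDataset(lambda: ["a", "b", "c"])
lines.count()  # computed
lines.count()  # computed again, nothing is cached
lines.cache()  # marks the dataset, computes nothing
after_cache_call = lines.computations
lines.count()  # computed once more and stored
lines.count()  # read from the cache
print(after_cache_call, lines.computations)  # prints "2 3"
```

The counter stops growing after the first action that follows `cache()`, which mirrors what the Storage tab shows in the notebook: the entry appears only after `tweets.count()` runs.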

## Persist

Persist can cache an RDD in different ways, defined by the levels available in StorageLevel. For RDDs, cache() is shorthand for persist with the MEMORY_ONLY level.

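Let's cache in memory and on disk. The MEMORY_AND_DISK_2 level keeps partitions in memory, spills the ones that do not fit to disk, and replicates each partition on two cluster nodes.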

In [58]:
from pyspark import StorageLevel

tweets_mem_disk = spark.sparkContext.textFile("/work/data/covid-tweets.json")

tweets_mem_disk.persist(StorageLevel.MEMORY_AND_DISK_2)

/work/data/covid-tweets.json MapPartitionsRDD[92] at textFile at <unknown>:0

Materialize the RDD with an action to actually cache it:

In [59]:
tweets_mem_disk.count()

78011

### Times

Transformation time without caching. The function reads the file again on every call:

In [60]:
def transform_without_caching():
    tweets = spark.sparkContext.textFile("/work/data/covid-tweets.json")
    filtered = tweets.filter(lambda tweet: len(tweet) > 190)
    print(filtered.count())
%time transform_without_caching()

24838
CPU times: user 7.14 ms, sys: 3.2 ms, total: 10.3 ms
Wall time: 244 ms


Transformation time with a cached RDD. First cache and materialize the RDD, then time only the transformation:

In [61]:
tweets = spark.sparkContext.textFile("/work/data/covid-tweets.json")
tweets.cache()
tweets.count()

78011

In [62]:
def transform_with_caching():
    filtered = tweets.filter(lambda tweet: len(tweet) > 190)
    print(filtered.count())
%time transform_with_caching()

24838
CPU times: user 7.23 ms, sys: 2.33 ms, total: 9.56 ms
Wall time: 178 ms
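
A single `%time` run is noisy, so the difference between the two measurements above should be read with care. One way to get a steadier comparison is to repeat each action several times and compare medians. The helper below is a minimal sketch; `time_action` and `sample_action` are hypothetical names, and the stand-in workload is there only so the snippet runs without Spark.

```python
import statistics
import time
from typing import Callable, List


def time_action(action: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `action` `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def sample_action() -> int:
    # Stand-in workload; in the notebook, pass transform_without_caching
    # or transform_with_caching instead.
    return sum(len(str(i)) for i in range(10_000))


durations = time_action(sample_action, repeats=5)
print(f"median: {statistics.median(durations) * 1000:.2f} ms")
```

In the notebook, calling `time_action(transform_without_caching)` and `time_action(transform_with_caching)` and comparing the two medians gives a fairer picture than one run of each.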
