# Caching and Persistence Example

This example briefly shows how caching and persistence work.

Open the Spark Web UI (localhost:4040) to see how RDDs are cached.

Materialize the tweets dataset. Check the Storage tab in the Web UI to verify that no RDD has been cached.

In [1]:
tweets = spark.sparkContext.textFile("/work/data/covid-tweets.json")
tweets.count()

78011

In [15]:
import json

def parse(tweet_str):
    # Each input line is one JSON document; decode it into a dict.
    return json.loads(tweet_str)

parsed = tweets.map(parse)
parsed.take(2)

[{'Arrival_Time': 1424686735175,
  'Creation_Time': 1424686733176178965,
  'Device': 'nexus4_1',
  'Index': 35,
  'Model': 'nexus4',
  'User': 'g',
  'gt': 'stand',
  'x': 0.0014038086,
  'y': 0.00050354,
  'z': -0.0124053955},
 {'Arrival_Time': 1424686735378,
  'Creation_Time': 1424686733382813486,
  'Device': 'nexus4_1',
  'Index': 76,
  'Model': 'nexus4',
  'User': 'g',
  'gt': 'stand',
  'x': -0.0039367676,
  'y': 0.026138306,
  'z': -0.01133728}]

In [23]:
parsed.map(lambda tweet: (tweet['User'], tweet['Model'])).collect()

[('g', 'nexus4'),
 ('g', 'nexus4'),
 ('g', 'nexus4'),
 ...

Let's check how many partitions the RDD has:

In [2]:
tweets.getNumPartitions()

2

## Cache

Cache the dataset. Check the Web UI to see whether anything changed in the Storage tab. Since cache() is lazy, nothing should appear there until an action runs.

In [4]:
tweets.cache()

/work/data/covid-tweets.json MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
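
Besides the Web UI, the caching state can also be inspected from the notebook. The following is a minimal sketch, assuming the RDD exposes the PySpark attribute is_cached and the method getStorageLevel(); the helper name describe_storage is hypothetical and not part of Spark. Note that is_cached only says the RDD has been marked for caching, not that its partitions have already been computed and stored.

```python
def describe_storage(rdd):
    """Summarize how an RDD is marked for caching (hypothetical helper)."""
    level = rdd.getStorageLevel()  # PySpark StorageLevel object
    return {
        "marked_for_caching": rdd.is_cached,
        "use_memory": level.useMemory,
        "use_disk": level.useDisk,
        "replication": level.replication,
    }

# Usage in the notebook (assumes the tweets RDD defined earlier):
# describe_storage(tweets)
```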

In [7]:
upper = tweets.map(lambda t: t.upper())
upper.cache()
upper.count()

78011

An RDD is cached only once it is materialized, so run an action (such as count()) to populate the cache, then check the Storage tab again.
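
The lazy behaviour of caching can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Spark implements caching; the class ToyDataset and its methods are made up for illustration.

```python
class ToyDataset:
    """Toy stand-in for an RDD: lazy computation with an optional cache."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the records
        self._cache_requested = False
        self._cached = None            # filled by the first action after cache()
        self.computations = 0          # how many times the source was computed

    def cache(self):
        # Only marks the dataset; nothing is computed or stored here.
        self._cache_requested = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        records = list(self._compute())
        if self._cache_requested:
            self._cached = records
        return records

    def count(self):
        # An action: forces the computation and fills the cache if requested.
        return len(self._records())


data = ToyDataset(lambda: range(5)).cache()
print(data.computations)  # 0: cache() alone did not compute anything
data.count()
data.count()
print(data.computations)  # 1: the second count() was served from the cache
```

Without the call to cache(), every count() would recompute the records from the source.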

## Persist

Persist can cache an RDD in different ways, defined by the levels available in StorageLevel.

Let's cache in memory and on disk. The level used here, MEMORY_AND_DISK_2, also replicates each partition on two nodes.

In [8]:
from pyspark import StorageLevel

tweets_mem_disk = spark.sparkContext.textFile("/work/data/covid-tweets.json")

tweets_mem_disk.persist(StorageLevel.MEMORY_AND_DISK_2)

/work/data/covid-tweets.json MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0

Materialize the RDD to actually cache it.

In [9]:
tweets_mem_disk.count()

78011
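
Spark does not allow changing the storage level of an RDD that already has one assigned, so switching levels means unpersisting first. The following is a minimal sketch, assuming an RDD with PySpark's unpersist() and persist() methods; the helper name repersist is hypothetical.

```python
def repersist(rdd, new_level):
    """Switch an RDD to a different storage level (hypothetical helper)."""
    rdd.unpersist()          # drop cached blocks and reset the storage level
    rdd.persist(new_level)   # mark with the new level; this is still lazy
    return rdd

# Usage in the notebook (assumes tweets_mem_disk and StorageLevel from above):
# repersist(tweets_mem_disk, StorageLevel.DISK_ONLY)
# tweets_mem_disk.count()  # an action is needed to fill the cache again
```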

### Times

Transformation time without caching

In [17]:
def transform_without_caching():
    # Re-read the file so the filter runs on an uncached RDD.
    tweets = spark.sparkContext.textFile("/work/data/covid-tweets.json")
    filtered = tweets.filter(lambda tweet: len(tweet) > 190)
    print(filtered.count())

%time transform_without_caching()

24838
CPU times: user 8.88 ms, sys: 3.47 ms, total: 12.4 ms
Wall time: 371 ms


Transformation time with a cached RDD

In [11]:
tweets = spark.sparkContext.textFile("/work/data/covid-tweets.json")
tweets.cache()
tweets.count()

78011

In [18]:
def transform_with_caching():
    # Uses the notebook-level tweets RDD that was cached and materialized in the previous cell.
    filtered = tweets.filter(lambda tweet: len(tweet) > 190)
    print(filtered.count())

%time transform_with_caching()

24838
CPU times: user 11.2 ms, sys: 1.76 ms, total: 12.9 ms
Wall time: 234 ms
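
A single %time run is noisy, so the difference between 371 ms and 234 ms should be read as indicative only. The following is a minimal sketch for repeating the measurement, using only the standard library; the helper name time_action is hypothetical.

```python
import time

def time_action(action, repeats=5):
    """Run `action` several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations

# Usage in the notebook (assumes the two functions defined above):
# print(min(time_action(transform_without_caching)))
# print(min(time_action(transform_with_caching)))
```

Comparing the minimum (or median) of several runs reduces the influence of one-off delays such as scheduler warm-up.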
