# Persist & Cache
Notes on how to use persist/cache in PySpark.

http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

**Check persist in the UI:** https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-caching-and-persistence.html 

**Explain using persist:** https://stackoverflow.com/questions/44156365/when-to-cache-a-dataframe

Spark provides several storage levels, including: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY

On a DataFrame, cache() uses MEMORY_AND_DISK (on an RDD, cache() uses MEMORY_ONLY). If you want to use something else, use persist(StorageLevel.<*type*>).

In Scala/Java, the RDD default level (MEMORY_ONLY) stores the data in the JVM heap as deserialized objects. DataFrame.persist() defaults to MEMORY_AND_DISK, as the help() output at the end of these notes shows.

*Note: In **Python**, stored objects will always be serialized with the `Pickle` library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_AND_DISK (cache), DISK_ONLY, and their '_2' (replicated) versions.*


**Important:** cache/persist is lazy, so nothing is stored until an action (like count) is performed.
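
Because caching is lazy, one common pattern is to mark the DataFrame and then run an action right away, so the cache is filled at a known point. The sketch below assumes that pattern. `cache_and_materialize` is a hypothetical helper name (not part of PySpark); it only relies on the `cache()` and `count()` methods that Spark DataFrames provide.

```python
def cache_and_materialize(df):
    """Mark `df` for caching, then run an action so the cache is actually filled.

    `df` is assumed to be a Spark DataFrame (any object with cache() and count()).
    """
    cached = df.cache()      # lazy: only records the storage level
    n_rows = cached.count()  # action: computes the data and stores it
    return cached, n_rows


# Usage in a notebook that already has a DataFrame `df`:
# df, n_rows = cache_and_materialize(df)
```

Using `count()` as the action is a design choice: it scans every partition, whereas `show(5)` may compute only part of the data.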

Read the data and put it in a df

In [1]:
pp = "s3a://client-gsk-gsk-us-9979/toolbox-output/idmap/20200416170301KapilSharma/idmap/idspace=epsilon/"
df = spark.read.parquet(pp)
df.show(5)

+--------------------+----------+----------+--------+------------+
|                 psn|first_seen| last_seen|num_days|     segment|
+--------------------+----------+----------+--------+------------+
|1900084E030E0C004...|2020-01-23|2020-01-23|       1|       advil|
|220022940313EE005...|2019-11-05|2019-11-05|       1|topicalCreme|
|180053FE031130000...|2019-08-15|2020-01-23|       3|intAnalgesic|
|1D00F786030E0C002...|2019-08-15|2020-01-23|       3|intAnalgesic|
|1200B6D4031129000...|2019-08-15|2020-01-23|       3|intAnalgesic|
+--------------------+----------+----------+--------+------------+
only showing top 5 rows



**Which storage level is 'df' using?**

We can see below that df is not cached at all

In [2]:
df.storageLevel

StorageLevel(False, False, False, False, 1)

In [22]:
print(df.storageLevel)

Serialized 1x Replicated


OK, 'df' is defined, but not all of the data has been read yet (remember that Spark is lazy!)

Now I want to filter the data and assign it to df2; the new df is going to be small.

If I don't use cache (or persist), the filtered result is not going to 'stay': every later action re-runs the filter over the whole data, which makes the grouping below a very expensive operation.

In [3]:
df.createOrReplaceTempView("idmap_out_epsilon")

In [4]:
%%time
df2 = spark.sql("""
    select 
        * 
    from 
        idmap_out_epsilon 
    where 
        psn like '2A00467D030E0C00829114' 
    order by
        segment""").cache()

CPU times: user 1.56 ms, sys: 0 ns, total: 1.56 ms
Wall time: 2.49 s


### What storage level is 'df2' using?

We can see that it is: StorageLevel(True, True, False, True, 1)

The five arguments are flags, not storage level names:
* useDisk -> True
* useMemory -> True
* useOffHeap -> False
* deserialized -> True
* replication -> 1


In [5]:
df2.storageLevel

StorageLevel(True, True, False, True, 1)

In [48]:
print(df2.storageLevel)

Memory Serialized 1x Replicated


In [65]:
df2.unpersist()        # Remove the cache I set before

DataFrame[psn: string, first_seen: date, last_seen: date, num_days: bigint, segment: string]

In [66]:
# Using persist ...
from pyspark import StorageLevel
# df2.persist(storageLevel=StorageLevel(False, True, False, True, 1)).storageLevel
df2.unpersist()  # a new level can only be set once the old one is removed
df2.persist(storageLevel=StorageLevel.MEMORY_AND_DISK).storageLevel
df2.cache().storageLevel  # df2 already has a storage level, so this keeps it

StorageLevel(True, True, False, False, 1)

In [67]:
print(df2.storageLevel)

Disk Memory Serialized 1x Replicated


| StorageLevel (name) | StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication) | Description | Comment |
| --- | --- | --- | --- |
| |StorageLevel(True, True, False, True, 1) | Disk Memory Deserialized 1x Replicated| This is the one set by cache() on a DataFrame (Scala's MEMORY_AND_DISK) |
| |StorageLevel(False, True, False, True, 1) | Memory Deserialized 1x Replicated| Scala's MEMORY_ONLY |
|DISK_ONLY | StorageLevel(True, False, False, False, 1) | Disk Serialized 1x Replicated||
|MEMORY_AND_DISK| StorageLevel(True, True, False, False, 1) | Disk Memory Serialized 1x Replicated||
|MEMORY_ONLY | StorageLevel(False, True, False, False, 1) | Memory Serialized 1x Replicated||
|OFF_HEAP | StorageLevel(True, True, True, False, 1) | Disk Memory OffHeap Serialized 1x Replicated||
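
The Description column follows directly from the five flags. As a rough sketch, the plain-Python function below rebuilds that text from the flags. `describe_storage_level` is a hypothetical name and this is not PySpark's own code; it is only meant to mirror what `print(df.storageLevel)` shows in this notebook.

```python
def describe_storage_level(use_disk, use_memory, use_off_heap, deserialized, replication=1):
    """Build the human-readable description for a set of storage-level flags."""
    parts = []
    if use_disk:
        parts.append("Disk")
    if use_memory:
        parts.append("Memory")
    if use_off_heap:
        parts.append("OffHeap")
    # Exactly one of these two words is always present.
    parts.append("Deserialized" if deserialized else "Serialized")
    parts.append(f"{replication}x Replicated")
    return " ".join(parts)


print(describe_storage_level(True, True, False, True, 1))
# Disk Memory Deserialized 1x Replicated
print(describe_storage_level(False, False, False, False, 1))
# Serialized 1x Replicated
```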

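Notice the 'cache' keyword: it makes later access to the data really fast, because Spark keeps the filtered result instead of recomputing it from the source files. The cache is filled by the first action, so that one is still slow (33.8 s below); the queries after it finish in well under a second.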

In [29]:
%%time
df2.limit(5).toPandas()

CPU times: user 9.91 ms, sys: 811 µs, total: 10.7 ms
Wall time: 33.8 s


| | psn | first_seen | last_seen | num_days | segment |
| --- | --- | --- | --- | --- | --- |
| 0 | 2A00467D030E0C00829114 | 2019-11-05 | 2020-01-23 | 2 | advil |
| 1 | 2A00467D030E0C00829114 | 2019-08-15 | 2019-08-15 | 1 | advil |
| 2 | 2A00467D030E0C00829114 | 2020-01-23 | 2020-01-23 | 1 | advil |
| 3 | 2A00467D030E0C00829114 | 2019-08-15 | 2020-01-23 | 3 | advil |
| 4 | 2A00467D030E0C00829114 | 2019-08-15 | 2020-01-23 | 2 | advil |
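
To compare the first action on a cached DataFrame with later ones outside of `%%time` cells, a small wall-clock timer is enough. The sketch below is one way to do it; `timed` is a hypothetical helper name, and the commented usage assumes the `df2` defined earlier in this notebook.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the cached DataFrame from this notebook:
# _, first = timed(df2.count)    # fills the cache, expected to be slow
# _, second = timed(df2.count)   # served from the cache, expected to be fast
# print(f"first: {first:.2f}s, second: {second:.2f}s")
```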


In [8]:
df2.createOrReplaceTempView("lim_epsi")

In [9]:
%%time
spark.sql("""
    select 
        count(distinct *) 
    from 
        lim_epsi 
    """).show()

+-------------------------------------------------------------+
|count(DISTINCT psn, first_seen, last_seen, num_days, segment)|
+-------------------------------------------------------------+
|                                                           38|
+-------------------------------------------------------------+

CPU times: user 1.79 ms, sys: 0 ns, total: 1.79 ms
Wall time: 95.9 ms


In [10]:
%%time
spark.sql("""
    select 
        psn, min(first_seen), max(last_seen)
    from 
        lim_epsi 
    group by
        psn
    """).toPandas()

CPU times: user 4.26 ms, sys: 104 µs, total: 4.36 ms
Wall time: 278 ms


| | psn | min(first_seen) | max(last_seen) |
| --- | --- | --- | --- |
| 0 | 2A00467D030E0C00829114 | 2019-08-15 | 2020-01-23 |


In [8]:
help(df2.persist)

Help on method persist in module pyspark.sql.dataframe:

persist(storageLevel=StorageLevel(True, True, False, False, 1)) method of pyspark.sql.dataframe.DataFrame instance
    Sets the storage level to persist the contents of the :class:`DataFrame` across
    operations after the first time it is computed. This can only be used to assign
    a new storage level if the :class:`DataFrame` does not have a storage level set yet.
    If no storage level is specified defaults to (C{MEMORY_AND_DISK}).
    
    .. note:: The default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
    
    .. versionadded:: 1.3
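
The help text says a new storage level can only be assigned when the DataFrame has none set yet, which is why the cells above call `unpersist()` before `persist()`. A minimal sketch of that two-step pattern follows. `change_storage_level` is a hypothetical helper name, not a PySpark function, and it only uses the `unpersist()` and `persist()` methods shown in this notebook.

```python
def change_storage_level(df, new_level):
    """Drop the current storage level of `df`, then persist it with `new_level`.

    `df` is assumed to be a Spark DataFrame and `new_level` a pyspark StorageLevel.
    """
    df.unpersist()                # clears the existing level and cached data
    return df.persist(new_level)  # lazy: filled again by the next action


# Usage in this notebook:
# from pyspark import StorageLevel
# df2 = change_storage_level(df2, StorageLevel.DISK_ONLY)
# print(df2.storageLevel)
```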

