- Title: Persist DataFrame in Spark
- Slug: spark-persist-dataframe
- Date: 2021-02-28 23:43:44
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, persist, big data, cache, checkpoint
- Author: Ben Du

## Tips & Traps

1. `DataFrame.cache` persists a DataFrame at the default storage level (`MEMORY_AND_DISK`),
    while `DataFrame.persist` lets you choose the storage level.
    Notice that `DataFrame.persist()` without arguments is equivalent to `DataFrame.cache()`.
    To sum up,
    `DataFrame.persist` is preferred over `DataFrame.cache`.
    In addition,
    `DataFrame.persist` is preferred over `DataFrame.checkpoint`.

2. The definition of the class `pyspark.StorageLevel` is as below.

        :::python
        class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
            ...

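    And it has the following pre-defined instances.
    Python objects are always serialized (pickled) when stored,
    so in PySpark the `_SER` levels carry the same flags as their non-`_SER` counterparts.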

    - DISK_ONLY = StorageLevel(True, False, False, False, 1)

    - DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)

    - MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)

    - MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)

    - MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)

    - MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)

    - MEMORY_ONLY = StorageLevel(False, True, False, False, 1)

    - MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)

    - MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)

    - MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)

    - OFF_HEAP = StorageLevel(True, True, True, False, 1)

3. The method `DataFrame.persist` returns the DataFrame itself,
    which means that you can chain methods after it.

4. Persist a DataFrame which is used multiple times and is expensive to recompute.
    Remember to unpersist it when the DataFrame is no longer needed.
    Even though Spark evicts data from memory using the LRU (least recently used) strategy
    when the caching layer becomes full,
    it is still beneficial to unpersist data as soon as it is no longer used to reduce memory usage.

5. Persisting too many DataFrames into memory can cause memory issues.
    There are a few ways to address memory issues caused by this.

    - Increase memory.
    - Persist only the most reused DataFrames into memory.
    - Persist other DataFrames to disk.


In [None]:
# PySpark setup (the remaining cells use the Scala API).
import pandas as pd
import findspark
findspark.init("/opt/spark-3.0.0-bin-hadoop3.2/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType, StructType
spark = SparkSession.builder.appName("PySpark UDF").enableHiveSupport().getOrCreate()

In [6]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Spark SQL Parser")
    .getOrCreate()
import spark.implicits._

org.apache.spark.sql.SparkSession$implicits$@5c525748

In [8]:
val df = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0),
    (3L, "c", "foo", 5.0),
    (4L, "d", "bar", 7.0)
).toDF("col1", "col2", "col3", "col4")
df.show

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   a| foo| 3.0|
|   2|   b| bar| 4.0|
|   3|   c| foo| 5.0|
|   4|   d| bar| 7.0|
+----+----+----+----+




Persist `df` to memory.

In [9]:
df.persist(StorageLevel.MEMORY_ONLY)

[col1: bigint, col2: string ... 2 more fields]
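
The value echoed by the cell above is `df` itself: `persist` returns the Dataset it was called on, so further calls can be chained directly after it. Below is a minimal sketch of that pattern, assuming a local Spark session; the names `session`, `nums`, and `evens` are hypothetical and introduced only for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Reuse the running session if there is one; otherwise start a local one.
val session = SparkSession.builder().master("local[2]").getOrCreate()

// Hypothetical example data: a single-column DataFrame with ids 0 to 999.
val nums = session.range(0, 1000).toDF("id")

// persist returns the Dataset it was called on, so a transformation can follow directly.
val evens = nums.persist(StorageLevel.MEMORY_AND_DISK).filter(col("id") % 2 === 0)

// Persisting is lazy: the data is only materialized by the first action.
val evenCount = evens.count()

// `nums` is the DataFrame that was marked as persisted, not the derived `evens`.
println(nums.storageLevel)
println(evens.storageLevel)

// Release the cached data once it is no longer needed.
nums.unpersist()
```

Note that chaining marks the DataFrame before the chained call as persisted. If the filtered result is what gets reused, it is probably better to call `persist` on that result instead.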

Verify that `df` has been persisted to memory.

In [12]:
df.storageLevel

StorageLevel(memory, deserialized, 1 replicas)
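
Once a persisted DataFrame is no longer needed, it can be released explicitly instead of waiting for LRU eviction. The sketch below walks through a possible cache, unpersist, and re-persist cycle and records the storage level at each step. It assumes a local Spark session; `session` and `lookup` are hypothetical names that do not appear in the notebook above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Reuse the running session if there is one; otherwise start a local one.
val session = SparkSession.builder().master("local[2]").getOrCreate()

// Hypothetical DataFrame standing in for an expensive result that is reused.
val lookup = session.range(0, 100).toDF("id")

// cache() is persist() with the default storage level.
lookup.cache()
lookup.count() // an action materializes the cached data
val levelWhileCached = lookup.storageLevel

// Release the cached blocks; blocking = true waits until they are removed.
lookup.unpersist(blocking = true)
val levelAfterUnpersist = lookup.storageLevel

// To use a different storage level, unpersist first and then persist again.
lookup.persist(StorageLevel.DISK_ONLY)
val levelOnDisk = lookup.storageLevel
lookup.unpersist(blocking = true)
```

Persisting to disk, as in the last step, is one way to keep rarely reused DataFrames out of memory while still avoiding recomputation.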

## References

[Dataset (Spark Java API)](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html)

[functions (Spark Java API)](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html)

[Row (Spark Java API)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html)

[StorageLevel (Spark 2.3.0 Java API)](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/storage/StorageLevel.html)

[Caching Spark DataFrame — How & When](https://medium.com/swlh/caching-spark-dataframe-how-when-79a8c13254c0)

[PySpark - StorageLevel](https://www.tutorialspoint.com/pyspark/pyspark_storagelevel.htm)
