- Title: Persist DataFrame in Spark
- Slug: spark-persist-dataframe
- Date: 2021-01-12 10:43:17
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, persist, big data
- Author: Ben Du

## Tips & Traps

1. The method `DataFrame.persist` returns itself,
    which means that you can chain methods after it.

2. Persist a DataFrame which is used multiple times and is expensive to recompute.
    Remember to unpersist it once the DataFrame is no longer needed.
    Even though Spark evicts data from memory using the LRU (least recently used) strategy
    when the caching layer becomes full,
    it is still beneficial to unpersist data as soon as it is no longer used, to reduce memory usage.

3. Persisting too many DataFrames (into memory) can cause memory issues.
    You should only persist DataFrames
    which help boost the performance of the Spark application.
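
The tips above can be sketched in a few lines. This is a minimal illustration rather than a recommended pattern: the DataFrame `expensive`, its column names, and the app name `persist-sketch` are made up for the example, and the snippet is written in script style for `spark-shell` or a Scala notebook. Note that `persist` is lazy, so the data is only cached when the first action runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("persist-sketch") // hypothetical app name
    .getOrCreate()
import spark.implicits._

// persist returns the DataFrame itself, so it can sit at the end of a chain.
// The toy data and the filter stand in for an expensive computation.
val expensive = Seq((1L, "foo", 3.0), (2L, "bar", 4.0), (3L, "foo", 5.0))
    .toDF("id", "group", "value")
    .filter($"value" > 3.0)
    .persist(StorageLevel.MEMORY_ONLY)

// The first action materializes the cache; the second one reuses it.
val n = expensive.count()
val groups = expensive.select("group").distinct().count()

// Release the cached data once the DataFrame is no longer needed.
expensive.unpersist()
```

Persisting only pays off here because `expensive` feeds two separate actions; with a single action there would be nothing to reuse.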

In [6]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Spark SQL Parser")
    .getOrCreate()
import spark.implicits._

org.apache.spark.sql.SparkSession$implicits$@5c525748

In [8]:
val df = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0),
    (3L, "c", "foo", 5.0),
    (4L, "d", "bar", 7.0)
).toDF("col1", "col2", "col3", "col4")
df.show

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   a| foo| 3.0|
|   2|   b| bar| 4.0|
|   3|   c| foo| 5.0|
|   4|   d| bar| 7.0|
+----+----+----+----+




Persist `df` to memory.

In [9]:
df.persist(StorageLevel.MEMORY_ONLY)

[col1: bigint, col2: string ... 2 more fields]

Verify that `df` has been persisted to memory.

In [12]:
df.storageLevel

StorageLevel(memory, deserialized, 1 replicas)
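
The same `storageLevel` check can be used to confirm that a DataFrame has been released after `unpersist`. The following is a small self-contained sketch, assuming a throwaway DataFrame named `small` and a hypothetical app name; it passes `blocking = true` so that the call waits until the cached blocks are removed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("unpersist-sketch") // hypothetical app name
    .getOrCreate()
import spark.implicits._

// A throwaway DataFrame used only for this illustration.
val small = Seq((1L, "a"), (2L, "b")).toDF("col1", "col2")

// Storage level before persisting: nothing is cached yet.
val before = small.storageLevel

// Mark the DataFrame as persisted in memory and check the level again.
small.persist(StorageLevel.MEMORY_ONLY)
val whilePersisted = small.storageLevel

// Unpersist, blocking until the cached blocks are removed.
small.unpersist(blocking = true)
val after = small.storageLevel

println(s"before=$before, whilePersisted=$whilePersisted, after=$after")
```

Comparing against `StorageLevel.NONE` is a simple way to test programmatically whether a DataFrame is currently marked as persisted.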

## References

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/storage/StorageLevel.html

[Caching Spark DataFrame — How & When](https://medium.com/swlh/caching-spark-dataframe-how-when-79a8c13254c0)
