- cache() is a PySpark optimization method that stores a DataFrame (or RDD) in executor memory once the first action triggers its computation.
- Subsequent actions then reuse the stored data instead of recomputing the DataFrame, which improves performance.

- Key Points:
    - Stores data in executor memory (not driver memory).
    - Lazy evaluation: cache() takes effect only after an action (e.g., show(), count()).
    - If executor memory is insufficient, cached DataFrame partitions spill to disk; partitions held with a memory-only level (the default for RDD.cache()) are evicted and recomputed when needed.
    - For DataFrames, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK).

- Benefits of cache():
    1. Speeds up jobs when reusing the same DataFrame.
    2. Saves recomputation time in iterative algorithms (ML, Graph Processing).
    3. Useful for Exploratory Data Analysis (EDA).
    4. Optimizes performance in joins where a DataFrame is reused.

- Basic Architecture:

| Driver | | Executors |
| -- | -- | -- |
| sends tasks | -> | execute tasks |
| no caching | | cache data in memory |

In [None]:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("cacheExample").getOrCreate()

In [None]:
data = [
    (1, "Manta", 75000, "IT", 24),
    (2, "Dipankar", 30000, "Post Master", 27),
    (3, "Souvik", 60000, "Army Officer", 27),
    (4, "Soukarjya", 45000, "BDO", 26),
    (5, "Arvind", 35000, "Business Data Analyst", 28),
    (6, "Prodipta", 25000, "Data Analyst", 28),
    (7, "Padma", 20000, "Data Analyst", 27),
    (8, "Panta", 125000, "Business Analyst", 27)
]

df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

# show full DataFrame
df.show()

#### Cache the DataFrame

In [None]:
# apply cache()
df.cache()

In [None]:
# trigger an action to materialize the cache
df.count()

- Explanation:
    1. After the first action (count), Spark stores the partitions of 'df' in executor memory. count() scans every partition, so the whole DataFrame is cached.
    2. Future actions reuse the cached data.

#### Perform Actions on Cached Data

In [None]:
# show DataFrame (faster after caching)
df.show()

In [None]:
# filter records where age > 27
df.filter(df.age > 27).show()

In [None]:
# group by age and count
df.groupBy("age").count().show()

In [None]:
# complex query example: select, filter, then sort
# note: no row in the sample data has age >= 30, so the result is empty
df.select("name", "age") \
    .filter(df.age >= 30) \
    .orderBy("age") \
    .show()

In [None]:
# check if DataFrame is cached
print("Is DataFrame cached? ", df.is_cached)

#### Remove Cache (Unpersist)

In [None]:
# free up executor memory
df.unpersist()

In [None]:
# confirm cache removal
print("Is DataFrame cached after unpersist? ", df.is_cached)

#### When to use cache()? (Real-world scenarios)
- Reusing DataFrame in multiple actions/transforms
- EDA workflows with repeated filtering/grouping
- ML pipelines with repeated training data access
- Repeated joins with the same lookup DataFrame
- Iterative graph algorithms (e.g., PageRank)

#### Best Practices
- Trigger an action after cache() to store the data.
- Monitor Spark UI > Storage tab for cache status.
- Use unpersist() when caching is no longer needed.
- Use persist() if fine-grained control (disk/memory) is required.
- Avoid caching large DataFrames unnecessarily to prevent OOM issues.

### Summary:
- cache() stores data in executor memory for faster access.
- It reduces recomputation time in Spark workflows.
- It requires an action (e.g., count()) to trigger caching.
- Use unpersist() to manage memory effectively.