# Storage Levels in Apache Spark (As of Spark 3.4)

## Overview

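A storage level tells Spark where to keep a persisted RDD or DataFrame (memory, disk, or both), whether to keep it serialized, and how many replicas to hold. Choosing the right level trades memory use against CPU time and access speed.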
### Storage Level Types

| Storage Level | Space Used | CPU Time | In Memory | On Disk | Serialized |
|---|---|---|---|---|---|
| MEMORY_ONLY | High | Low | Yes | No | No |
| MEMORY_ONLY_SER | Low | High | Yes | No | Yes |
| MEMORY_AND_DISK | High | Medium | Some | Some | Some |
| MEMORY_AND_DISK_SER | Low | High | Some | Some | Yes |
| DISK_ONLY | Low | High | No | Yes | Yes |

## Storage Level Details
- DISK_ONLY: Stores data only on disk in serialized form, which saves space but gives the slowest access.
- DISK_ONLY_2 / DISK_ONLY_3: Disk storage with 2x / 3x replication.
- MEMORY_AND_DISK: Uses memory first, spills to disk if needed.
- MEMORY_AND_DISK_2: Same as above, but replicated 2x for resilience.
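- MEMORY_AND_DISK_DESER (default for DataFrame cache() and persist() in PySpark): Same as MEMORY_AND_DISK, but kept deserialized for fast access.
- MEMORY_ONLY: Keeps data in memory only. Fastest retrieval and CPU-efficient, but memory-intensive; partitions that do not fit are recomputed when needed.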
- MEMORY_ONLY_2: Same as above, with 2x replication for fault tolerance.

## Serialization vs Deserialization
- Serialized (SER): Saves memory (compact data), but CPU-intensive for access.
- Deserialized (DESER): Faster access (as JVM objects), but memory-intensive.

## When to Use What?

- Fastest access: MEMORY_ONLY
- Memory saving: MEMORY_ONLY_SER (Scala/Java API; PySpark always stores data serialized and does not expose the _SER constants)
- Balanced (spills to disk): MEMORY_AND_DISK
- Data too large for memory: DISK_ONLY
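
The rules of thumb above can be written down as a small decision function. This is only a sketch: `suggest_storage_level` and its three boolean parameters are hypothetical, and it returns the level name as a string instead of a Spark object.

```python
def suggest_storage_level(fits_in_memory: bool,
                          memory_is_scarce: bool = False,
                          far_larger_than_memory: bool = False) -> str:
    """Map the rules of thumb above to a storage level name."""
    if far_larger_than_memory:
        # Large data: keep everything on disk.
        return "DISK_ONLY"
    if not fits_in_memory:
        # Balanced: use memory first and spill the rest to disk.
        return "MEMORY_AND_DISK"
    if memory_is_scarce:
        # Memory saving: compact serialized form, at the cost of CPU time.
        return "MEMORY_ONLY_SER"
    # Fastest access when the data fits comfortably in memory.
    return "MEMORY_ONLY"


print(suggest_storage_level(fits_in_memory=True))
print(suggest_storage_level(fits_in_memory=False))
```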

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

In [0]:
# Explicit schemas, one per CSV file, so Spark does not have to infer column types.
customer_schema = StructType(
    [
        StructField('customer_id', FloatType()),
        StructField('customer_name', StringType()),
        StructField('email', StringType()),
        StructField('city', StringType())
    ]
)
product_schema = StructType(
    [
        StructField('product_id', FloatType()),
        StructField('product_name', StringType()),
        StructField('category', StringType()),
        StructField('price', FloatType())
    ]
)
sales_schema = StructType(
    [
        StructField('sale_id', IntegerType()),
        StructField('product_id', IntegerType()),
        StructField('customer_id', IntegerType()),
        StructField('store_id', IntegerType()),
        StructField('quantity', IntegerType()),
        StructField('sale_date', StringType())
    ]
)
# stock_quantity is read as a string; cast it before doing arithmetic on it.
inventory_schema = StructType(
    [
        StructField('store_id', FloatType()),
        StructField('product_id', FloatType()),
        StructField('stock_quantity', StringType())
    ]
)
stores_schema = StructType(
    [
        StructField('store_id', FloatType()),
        StructField('store_name', StringType()),
        StructField('city', StringType())
    ]
)

In [0]:
input_path = "/Volumes/cgi_dev/naval/dataset/"
input_s3 = "s3://datamaster/dataset/"
df_customer = spark.read.csv(f"{input_path}Customers.csv", schema=customer_schema, header=True)
df_inventory = spark.read.csv(f"{input_path}Inventory.csv", schema=inventory_schema, header=True)
df_store = spark.read.csv(f"{input_path}Stores.csv", schema=stores_schema, header=True)
df_product = spark.read.csv(f"{input_path}Products.csv", schema=product_schema, header=True)
df_sales = spark.read.csv(f"{input_s3}skewed_sales_data.csv", schema=sales_schema, header=True)

# Use Case: Identify Top-Selling Products by City & Optimize Inventory
Scenario: A retail company wants to find the top-selling products in each city and compare them with current inventory levels to prevent stockouts. Because the enriched sales data is reused several times in the workflow, it is persisted so that later steps do not recompute the joins.

In [0]:
from pyspark import StorageLevel

In [0]:
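# Join sales with the product, store, and customer tables.
# Both df_store and df_customer have a city column; the customer's city is kept.
df_sales_enriched = df_sales \
    .join(df_product, "product_id", "inner") \
    .join(df_store, "store_id", "inner") \
    .join(df_customer, "customer_id", "inner") \
    .select("sale_id", "product_id", "product_name", "category",
            "price", "customer_id", "customer_name", df_customer.city,
            "store_id", "store_name", "quantity", "sale_date")

# df_sales_enriched.cache()  # alternative: cache with the default storage level
# persist() is lazy; count() is the action that materializes the persisted copy.
df_sales_enriched.persist(StorageLevel.DISK_ONLY)
df_sales_enriched.count()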

In [0]:
# Takes about 1 s: the rows are read from the persisted copy.
df_sales_enriched.display()

In [0]:
df_sales_enriched.unpersist()

In [0]:
spark.catalog.clearCache()

In [0]:
df_sales_enriched.display()

In [0]:
# Takes about 2 min: the persisted copy was dropped, so the joins are recomputed.
df_sales_enriched.display()

In [0]:
df_sales_enriched.explain(True)

In [0]:
# Total units sold per city and product, sorted with the best sellers first within each city.
df_top_selling = df_sales_enriched \
    .groupBy("city", "product_id", "product_name") \
    .agg(F.sum("quantity").alias("total_quantity_sold")) \
    .orderBy("city", F.desc("total_quantity_sold"))

In [0]:
df_top_selling.explain()

In [0]:
display(df_top_selling)

In [0]:
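# Attach stock levels to the sales totals.
# inventory_schema is keyed by store_id and product_id, so joining on product_id
# alone can return several stock rows for the same product.
df_inventory_status = df_top_selling \
    .join(df_inventory, ["product_id"], "left") \
    .select("city", "product_name", "total_quantity_sold", "stock_quantity")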

In [0]:
df_inventory_status.explain()

In [0]:
display(df_inventory_status)
