### Setup

Make sure you have the files available from previous demos.

In [0]:
# This cell sets all the configuration parameters to connect to Azure Data Lake
spark.conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "****************************")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "*******************************")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/************************/oauth2/token")
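
The five settings above differ only in their key prefix, so they can be generated from a single account name. The following is a minimal sketch under assumptions: `abfs_oauth_conf` and its parameter names are hypothetical helpers, not part of Spark or Databricks, and the placeholder strings stand in for real credentials, which should not be pasted into a notebook in plain text.

```python
def abfs_oauth_conf(account_name, client_id, client_secret, tenant_id):
    """Build the OAuth settings for one ADLS Gen2 account as a plain dict."""
    suffix = f"{account_name}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholder values only (hypothetical); substitute your own.
conf = abfs_oauth_conf("<account_name>", "<client_id>", "<client_secret>", "<tenant_id>")

# In a notebook where `spark` is defined, apply the settings with:
# for key, value in conf.items():
#     spark.conf.set(key, value)
```

Keeping the keys in one function makes it harder to mistype an account name in one of the five lines.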

Verify that cloud storage is accessible

In [0]:
dbutils.fs.ls("abfss://pyspark@warnerdatalake.dfs.core.windows.net/")

[FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/exports/', name='exports/', size=0, modificationTime=1740581924000),
 FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/', name='imports/', size=0, modificationTime=1740581918000)]
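
The `modificationTime` values in the listing look like epoch timestamps in milliseconds. A small sketch for turning them into readable UTC datetimes follows; `modification_time_utc` is a hypothetical helper name, and the millisecond unit is an assumption based on the magnitude of the values.

```python
from datetime import datetime, timezone


def modification_time_utc(modification_time_ms):
    """Convert an epoch-milliseconds timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(modification_time_ms / 1000, tz=timezone.utc)


# Values copied from the listing above
for name, ms in [("exports/", 1740581924000), ("imports/", 1740581918000)]:
    print(name, modification_time_utc(ms).isoformat())
```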

Let's load the transactions data

In [0]:
from pyspark.sql import functions as F

# Path to transactions data
parquet_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/transactions_data.parquet"

# Load transactions data
df_transactions = spark.read.parquet(parquet_path)

df_transactions.limit(5).display()

transaction_id,customer_id,transaction_date,amount,category
1,3065,2025-03-17,76.1,Clothes
2,3274,2025-02-18,91.91,Clothes
3,130,2025-01-10,11.81,Accessories
4,320,2025-03-06,20.37,Furniture
5,6480,2025-03-22,12.31,Beauty
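
Doubled or missing slashes are easy to introduce when storage paths are concatenated by hand. The sketch below builds the URI from its parts; `abfss_uri` is a hypothetical helper, not a Spark or Databricks function, and the container and account names are the ones used in this notebook.

```python
def abfss_uri(container, account, *parts):
    """Join path parts into an abfss:// URI with exactly one slash between segments."""
    root = f"abfss://{container}@{account}.dfs.core.windows.net"
    # Drop empty parts and any leading/trailing slashes on each part
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([root, *cleaned])


parquet_path = abfss_uri("pyspark", "warnerdatalake", "imports", "transactions_data.parquet")
print(parquet_path)
```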


Now apply a transformation: keep only transactions with an amount above 50

In [0]:
# Filter for high-value transactions
df_filtered = df_transactions.filter(F.col("amount") > 50)

# Display sample data
df_filtered.limit(5).display()

transaction_id,customer_id,transaction_date,amount,category
1,3065,2025-03-17,76.1,Clothes
2,3274,2025-02-18,91.91,Clothes
7,4569,2025-01-07,56.95,Electronics
8,7229,2025-02-18,94.8,Furniture
10,3791,2025-03-02,99.6,Accessories


We can easily cache this DataFrame

In [0]:
# cache() uses the default storage level, which for DataFrames is MEMORY_AND_DISK:
# partitions that do not fit in memory are spilled to disk
df_filtered.cache()

# Trigger an action to load data into cache
df_filtered.count()  # Forces computation & caching


500220
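
The reason an action such as `count()` is needed is that `cache()` only marks the DataFrame; nothing is stored until the data is actually computed. The toy model below illustrates that behavior in plain Python. It is not Spark's implementation: `LazyDataset` is a hypothetical class, and the amounts are the five sample rows shown earlier.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: computes on demand, stores results only if marked."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._marked = False      # set by cache(); no work happens yet
        self._stored = None       # filled by the first action after cache()
        self.compute_calls = 0    # how many times the rows were recomputed

    def cache(self):
        self._marked = True
        return self

    def _rows(self):
        if self._stored is not None:
            return self._stored
        self.compute_calls += 1
        rows = self._compute()
        if self._marked:
            self._stored = rows
        return rows

    def count(self):
        return len(self._rows())


amounts = [76.10, 91.91, 11.81, 20.37, 12.31]
ds = LazyDataset(lambda: [a for a in amounts if a > 50])

ds.cache()
calls_after_cache = ds.compute_calls  # marking alone does no work
first = ds.count()                    # first action computes and stores the rows
second = ds.count()                   # second action reuses the stored rows
print(calls_after_cache, first, second, ds.compute_calls)
```

Without the first `count()`, the cost of filling the cache would be paid by whichever later action happens to run first.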

In [0]:
# Verify if the DataFrame is cached
print(f"Is DataFrame Cached? {df_filtered.is_cached}")


Is DataFrame Cached? True
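
`is_cached` shows that the DataFrame is marked for caching, but not how much time the cache saves. One rough way to check is to time the same action several times. The helper below is a sketch: `time_action` is a hypothetical name, and the stand-in workload exists only so the code runs without Spark.

```python
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable several times; return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload so the sketch runs anywhere. In the notebook, pass a
# DataFrame action instead, for example: time_action(df_filtered.count)
timings = time_action(lambda: sum(range(100_000)))
print([f"{t:.4f}s" for t in timings])
```

Wall-clock timings on a shared cluster are noisy, so treat a single comparison as indicative rather than conclusive.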


Check the plan: the optimized logical plan now starts from an InMemoryRelation

In [0]:
df_filtered.explain(True)


== Parsed Logical Plan ==
'Filter '`>`('amount, 50)
+- Relation [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] parquet

== Analyzed Logical Plan ==
transaction_id: bigint, customer_id: int, transaction_date: date, amount: decimal(10,2), category: string
Filter (amount#932 > cast(cast(50 as decimal(2,0)) as decimal(10,2)))
+- Relation [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] parquet

== Optimized Logical Plan ==
InMemoryRelation [transaction_id#929L, customer_id#930, transaction_date#931, amount#932, category#933], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Filter (isnotnull(amount#932) AND (amount#932 > 50.00))
      +- *(1) ColumnarToRow
         +- FileScan parquet [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] Batched: true, DataFilters: [isnotnull(amount#932), (amount#932 > 50.00)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[abfss://pyspa

Compare to an uncached DataFrame, whose plan goes straight to the Parquet scan

In [0]:
df_uncached = df_transactions.filter(F.col("amount") > 60)

df_uncached.explain(True)  # Show execution plan

== Parsed Logical Plan ==
'Filter '`>`('amount, 60)
+- Relation [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] parquet

== Analyzed Logical Plan ==
transaction_id: bigint, customer_id: int, transaction_date: date, amount: decimal(10,2), category: string
Filter (amount#932 > cast(cast(60 as decimal(2,0)) as decimal(10,2)))
+- Relation [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] parquet

== Optimized Logical Plan ==
Filter (isnotnull(amount#932) AND (amount#932 > 60.00))
+- Relation [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] parquet

== Physical Plan ==
*(1) Filter (isnotnull(amount#932) AND (amount#932 > 60.00))
+- *(1) ColumnarToRow
   +- FileScan parquet [transaction_id#929L,customer_id#930,transaction_date#931,amount#932,category#933] Batched: true, DataFilters: [isnotnull(amount#932), (amount#932 > 60.00)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[abfss://p
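
The practical difference between the two outputs above is whether the plan mentions an in-memory cache operator. A small text check can make that comparison explicit. This is a sketch: `reads_from_cache` is a hypothetical helper that works on plan text you pass in, and the two strings are abbreviated copies of lines from the outputs above.

```python
def reads_from_cache(plan_text):
    """Return True if an explain() plan mentions Spark's in-memory cache operators."""
    return "InMemoryRelation" in plan_text or "InMemoryTableScan" in plan_text


# Abbreviated lines copied from the cached and uncached plans above
cached_plan = "InMemoryRelation [transaction_id#929L], StorageLevel(disk, memory, deserialized, 1 replicas)"
uncached_plan = "Filter (isnotnull(amount#932) AND (amount#932 > 60.00))"

print(reads_from_cache(cached_plan), reads_from_cache(uncached_plan))
```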

persist() offers more control over where the data is stored

In [0]:
from pyspark import StorageLevel

# Persist DataFrame in MEMORY_AND_DISK
df_persisted = df_transactions.filter(F.col("amount") > 50).persist(StorageLevel.MEMORY_AND_DISK)

# MEMORY_ONLY     → Stores data only in RAM (fastest; partitions that do not fit are recomputed when needed).
# MEMORY_AND_DISK → Uses RAM first, then spills to disk (a reasonable default for large datasets).
# DISK_ONLY       → Stores only on disk (slowest, but takes no storage memory).

# Trigger computation
df_persisted.count()

# Verify persist status
print(f"Is DataFrame Persisted? {df_persisted.is_cached}")


Is DataFrame Persisted? True
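
The storage level actually in effect is printed in the `InMemoryRelation` line of the optimized plan, as in the cached plan shown earlier. The sketch below parses that description into flags so it can be compared against the level you asked for. `parse_storage_level` is a hypothetical helper, and the expected text format is assumed from the plan output in this notebook.

```python
import re


def parse_storage_level(description):
    """Parse text like 'StorageLevel(disk, memory, deserialized, 1 replicas)' into flags."""
    match = re.search(r"StorageLevel\(([^)]*)\)", description)
    if match is None:
        raise ValueError(f"no StorageLevel(...) found in: {description!r}")
    tokens = [t.strip() for t in match.group(1).split(",")]
    # The replica count appears as a token such as '1 replicas'
    replicas = next((int(t.split()[0]) for t in tokens if t.endswith("replicas")), 1)
    return {
        "disk": "disk" in tokens,
        "memory": "memory" in tokens,
        "deserialized": "deserialized" in tokens,
        "replicas": replicas,
    }


level = parse_storage_level("StorageLevel(disk, memory, deserialized, 1 replicas)")
print(level)
```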


Clean up once you are done: unpersist() releases the cached data

In [0]:
df_filtered.unpersist()
df_persisted.unpersist()


DataFrame[transaction_id: bigint, customer_id: int, transaction_date: date, amount: decimal(10,2), category: string]
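
It is easy to forget the `unpersist()` call, especially when a cell fails midway. One option is to wrap the persist and unpersist pair in a context manager. The sketch below is an assumption-laden illustration: `persisted` is a hypothetical helper that relies only on the object having `persist()` and `unpersist()` methods, and `FakeFrame` is a stand-in so the example runs without Spark.

```python
from contextlib import contextmanager


@contextmanager
def persisted(df, *persist_args):
    """Persist `df` for the duration of a with-block, then unpersist it even on error."""
    cached = df.persist(*persist_args)
    try:
        yield cached
    finally:
        cached.unpersist()


class FakeFrame:
    """Minimal stand-in with persist()/unpersist() so the sketch runs without Spark."""

    def __init__(self):
        self.is_cached = False

    def persist(self, *args):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


frame = FakeFrame()
with persisted(frame) as cached:
    inside = cached.is_cached   # state while the block runs
after = frame.is_cached         # state once the block has exited
print(inside, after)
```

In the notebook, the same pattern would read `with persisted(df_transactions.filter(...), StorageLevel.MEMORY_AND_DISK) as df:` followed by the actions that reuse `df`.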