# PySpark Advanced Applications - Day 4
## Delta Lake, Data Validation, and Structured Streaming

Welcome to Day 4 of the PySpark workshop! Today, we'll explore advanced applications of PySpark, focusing on Delta Lake, data validation, and an introduction to structured streaming.

## Day 4 Agenda

Today we'll cover:
1. **Delta Lake for Reliable Data Lakes**
2. **Data Validation and Quality Framework**

Let's continue our PySpark journey with these advanced topics!

## Setup and Data Loading

First, let's initialize our environment and load data for today's exercises.

In [0]:
# Check our Spark version
print(f"Spark Version: {spark.version}")

# Create paths for our workshop data
workshop_path = "/Volumes/workspace/default/spark_workshop"
raw_data_path = f"{workshop_path}/raw_data"
processed_path = f"{workshop_path}/processed"
delta_path = f"{workshop_path}/delta"

print("Spark environment initialized!")
print(f"Workshop path: {workshop_path}")


Spark Version: 4.0.0
Spark environment initialized!
Workshop path: /Volumes/workspace/default/spark_workshop


In [0]:
# Load data for today's exercises
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Create a schema for the BigMart Sales data
sales_schema = StructType([
    StructField("Item_Identifier", StringType(), False),
    StructField("Item_Weight", DoubleType(), True),
    StructField("Item_Fat_Content", StringType(), True),
    StructField("Item_Visibility", DoubleType(), True),
    StructField("Item_Type", StringType(), True),
    StructField("Item_MRP", DoubleType(), True),
    StructField("Outlet_Identifier", StringType(), False),
    StructField("Outlet_Establishment_Year", IntegerType(), True),
    StructField("Outlet_Size", StringType(), True),
    StructField("Outlet_Location_Type", StringType(), True),
    StructField("Outlet_Type", StringType(), True),
    StructField("Item_Outlet_Sales", DoubleType(), True)
])

# Read the sales data
sales_df = spark.read.format('csv')\
                  .option('header', True)\
                  .schema(sales_schema)\
                  .load(f'{workshop_path}/BigMart Sales.csv')

## Data Transformation


In [0]:
# Prepare a cleaned version similar to what we did in Day 2
from pyspark.sql.functions import col, when, trim, upper, coalesce, lit, avg

# Start from the loaded DataFrame; DataFrames are immutable, so each step below returns a new one
clean_sales_df = sales_df

# Standardize text fields
clean_sales_df = clean_sales_df.withColumn(
    "Item_Fat_Content", 
    upper(trim(col("Item_Fat_Content")))
)

# Normalize categorical values
clean_sales_df = clean_sales_df.withColumn(
    "Item_Fat_Content",
    when(col("Item_Fat_Content").isin("LOW FAT", "LF"), "LOW_FAT")
    .when(col("Item_Fat_Content").isin("REG", "REGULAR"), "REGULAR")
    .otherwise(col("Item_Fat_Content"))
)

In [0]:
# Calculate average weight by item type
avg_weight_by_type = clean_sales_df.filter(col("Item_Weight").isNotNull()) \
                        .groupBy("Item_Type") \
                        .agg(avg("Item_Weight").alias("Avg_Weight"))

In [0]:
# Join back to original data to fill missing weights
clean_sales_df = clean_sales_df.join(
    avg_weight_by_type,
    "Item_Type",
    "left"
)

In [0]:
# Fill missing Item_Weight with calculated average by type
clean_sales_df = clean_sales_df.withColumn(
    "Item_Weight",
    coalesce(col("Item_Weight"), col("Avg_Weight"))
)
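
The join-based fill above can also be written as a window aggregate, which avoids the extra join and keeps the original column order. The following is a minimal sketch, assuming the column names used above; `group_mean_fill_expr` and `fill_with_group_mean` are hypothetical helpers, not part of PySpark.

```python
def group_mean_fill_expr(target: str, group: str) -> str:
    """Build a Spark SQL expression that replaces nulls in `target`
    with the mean of `target` within each `group` value."""
    return (
        f"coalesce(`{target}`, avg(`{target}`) OVER (PARTITION BY `{group}`)) "
        f"AS `{target}`"
    )


def fill_with_group_mean(df, target: str, group: str):
    """Return `df` with nulls in `target` filled by the per-group mean.

    `df` is expected to be a PySpark DataFrame; only `columns` and
    `selectExpr` are used, so no PySpark import is needed here.
    """
    exprs = [
        group_mean_fill_expr(target, group) if name == target else f"`{name}`"
        for name in df.columns
    ]
    return df.selectExpr(*exprs)


# Example (requires a live DataFrame such as sales_df):
# filled_df = fill_with_group_mean(sales_df, "Item_Weight", "Item_Type")
```

Because `avg` ignores nulls, a group whose weights are all missing still yields null, so the overall-average fallback in the next step is still needed.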

In [0]:
# Fill remaining missing weights with overall average
overall_avg_weight = (
    clean_sales_df.filter(col("Item_Weight").isNotNull())
    .agg(avg(col("Item_Weight")).alias("overall_avg"))
    .collect()[0]["overall_avg"]
)

In [0]:
clean_sales_df = clean_sales_df.withColumn(
    "Item_Weight",
    coalesce(col("Item_Weight"), lit(overall_avg_weight))
).drop("Avg_Weight")  # Drop the temporary average column

# Fill missing Outlet_Size with 'Medium' (assuming it's the most common)
clean_sales_df = clean_sales_df.withColumn(
    "Outlet_Size",
    coalesce(col("Outlet_Size"), lit("Medium"))
)

# Display the cleaned data
display(clean_sales_df.limit(5))

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
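
After imputation, it is worth confirming that no nulls remain in the filled columns. The following is a minimal sketch, assuming the column names above; `null_count_exprs` and `null_counts` are hypothetical helpers.

```python
def null_count_exprs(columns):
    """Build one Spark SQL aggregate per column that counts its null values."""
    return [f"count_if(`{name}` IS NULL) AS `{name}`" for name in columns]


def null_counts(df, columns):
    """Return a dict mapping each column name to its number of nulls.

    `df` is expected to be a PySpark DataFrame.
    """
    row = df.selectExpr(*null_count_exprs(columns)).first()
    return row.asDict()


# Example (requires a live DataFrame such as clean_sales_df):
# print(null_counts(clean_sales_df, ["Item_Weight", "Outlet_Size"]))
```

If the cleaning steps worked as intended, every count should be zero.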


## Delta Lake for Reliable Data Lakes

[Delta Lake](https://delta.io/) is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It's a key component in the Databricks Lakehouse architecture.

### Key Features of Delta Lake:

1. **ACID Transactions**: Ensures data consistency and reliability
2. **Schema Enforcement**: Prevents bad data from corrupting your tables
3. **Schema Evolution**: Allows schema changes without breaking existing queries
4. **Time Travel**: Query historical versions of your data
5. **Audit History**: Track all changes to your data
6. **Upserts and Deletes**: Support for merge, update, and delete operations
7. **Optimization**: File compaction and Z-order indexing

Let's explore these features:

In [0]:
delta_path

'/Volumes/workspace/default/spark_workshop/delta'

In [0]:
# Create a Delta table from our cleaned sales data
delta_table_path = f"{delta_path}/sales"

# Write data to Delta format
clean_sales_df.write.format("delta").mode("overwrite").save(delta_table_path)


In [0]:
delta_table_path

'/Volumes/workspace/default/spark_workshop/delta/sales'

In [0]:
# Read the Delta table
delta_df = spark.read.format("delta").load(delta_table_path)
print(f"Delta table created with {delta_df.count()} rows")
display(delta_df.limit(5))

Delta table created with 8523 rows


Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Promotion_Type
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38,
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,


In [0]:
delta_df.write.saveAsTable("main.default.cleane_sales_data")

In [0]:
%sql
SELECT `Outlet_Type`, SUM(`Item_Outlet_Sales`) AS `Total_Sales`
FROM `main`.`default`.`cleane_sales_data`
GROUP BY `Outlet_Type`
ORDER BY `Total_Sales` DESC
LIMIT 5

Outlet_Type,Total_Sales
Supermarket Type1,12917342.262999993
Supermarket Type3,3453926.0514
Supermarket Type2,1851822.8300000008
Grocery Store,368034.266


### ACID Transactions with Delta Lake

Delta Lake ensures ACID properties, which is critical for data reliability:
- **Atomicity**: Each transaction either fully succeeds or leaves the table unchanged
- **Consistency**: Data is valid according to defined rules
- **Isolation**: Concurrent operations don't interfere with each other
- **Durability**: Committed changes remain even after system failures

In [0]:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "main.default.cleane_sales_data")

# Declare the predicate by using a SQL-formatted string.
deltaTable.update(
  condition = "Outlet_Type = 'Grocery Store'",
  set = { "Outlet_Type": "'Supermarket Type4'" }
)

DataFrame[num_affected_rows: bigint]

In [0]:
%sql
SELECT `Outlet_Type`, SUM(`Item_Outlet_Sales`) AS `Total_Sales`
FROM `main`.`default`.`cleane_sales_data`
GROUP BY `Outlet_Type`
ORDER BY `Total_Sales` DESC
LIMIT 5

Outlet_Type,Total_Sales
Supermarket Type1,12917342.262999993
Supermarket Type3,3453926.0514
Supermarket Type2,1851822.8300000008
Supermarket Type4,368034.266
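
The `update` call above changes rows that already exist. Delta Lake's `merge` covers the upsert case from the feature list. The following is a minimal sketch, assuming an `updates_df` DataFrame with the same columns as the table, and assuming that `Item_Identifier` plus `Outlet_Identifier` uniquely identify a row (the dataset description here does not confirm this). `upsert_sales` is a hypothetical helper.

```python
def upsert_sales(target_table, updates_df,
                 keys=("Item_Identifier", "Outlet_Identifier")):
    """Upsert `updates_df` into a Delta table as a single commit.

    `target_table` is expected to be a delta.tables.DeltaTable and
    `updates_df` a PySpark DataFrame with the same columns.
    Returns the join condition that was used.
    """
    # Match rows on every key column, e.g. "t.`a` = s.`a` AND t.`b` = s.`b`".
    condition = " AND ".join(f"t.`{k}` = s.`{k}`" for k in keys)
    (
        target_table.alias("t")
        .merge(updates_df.alias("s"), condition)
        .whenMatchedUpdateAll()      # overwrite existing rows with the new values
        .whenNotMatchedInsertAll()   # insert rows that are not in the table yet
        .execute()
    )
    return condition


# Example (requires a live Delta table and DataFrame):
# upsert_sales(deltaTable, updates_df)
```

Because the whole merge is one transaction, readers see either the table before the merge or the table after it, never a partial result.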


### Schema Enforcement and Evolution

Delta Lake provides schema enforcement to prevent bad data from corrupting your tables, and schema evolution to allow changes to the schema as your data evolves.
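
In practice, appending a DataFrame that has an extra column to a Delta table is rejected by schema enforcement unless schema evolution is requested with the `mergeSchema` write option. The following is a minimal sketch, assuming `delta_table_path` from above. The helper name is hypothetical, and the `Promotion_Type` column is taken from the table output shown earlier.

```python
def append_with_schema_evolution(df, path: str) -> None:
    """Append `df` to the Delta table at `path`, adding any new columns
    in `df` to the table schema instead of rejecting the write."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # opt in to schema evolution
        .save(path)
    )


# Example (requires a live DataFrame such as clean_sales_df):
# promo_df = clean_sales_df.selectExpr("*", "CAST(NULL AS STRING) AS Promotion_Type")
# append_with_schema_evolution(promo_df, delta_table_path)
```

Existing rows read back with null in the newly added column.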

### Time Travel with Delta Lake

Delta Lake maintains a transaction log that allows you to access previous versions of your data, enabling point-in-time analysis, rollbacks, and auditing.
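
To build intuition for how a log makes old versions readable, here is a toy model in plain Python. It is not Delta Lake's implementation: the class and file names are invented for illustration. It only mirrors the idea that each commit records added and removed data files, and that a version is reconstructed by replaying commits in order.

```python
class ToyDeltaLog:
    """Toy model of a Delta-style transaction log (illustration only).

    Each commit records the data files it adds and removes. The table state
    at version N is obtained by replaying commits 0..N in order.
    """

    def __init__(self):
        self._commits = []  # list of (added_files, removed_files)

    def commit(self, add=(), remove=()):
        """Append one commit and return its version number."""
        self._commits.append((frozenset(add), frozenset(remove)))
        return len(self._commits) - 1

    def files_at(self, version):
        """Return the set of live data files as of `version`."""
        if not 0 <= version < len(self._commits):
            raise ValueError(f"unknown version {version}")
        live = set()
        for added, removed in self._commits[: version + 1]:
            live -= removed
            live |= added
        return live


log = ToyDeltaLog()
v0 = log.commit(add=["part-0.parquet"])                       # initial write
v1 = log.commit(add=["part-1.parquet"])                       # a later write adds a file
v2 = log.commit(add=["part-2.parquet"],
                remove=["part-0.parquet", "part-1.parquet"])  # compaction
print(log.files_at(v0))  # {'part-0.parquet'}
print(log.files_at(v2))  # {'part-2.parquet'}
```

Removed files stay on storage until they are vacuumed, which is why an older version can still be read.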

In [0]:
display(deltaTable.history())

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
2,2025-09-16T07:11:51.000Z,8048247156126318,aiwithap@gmail.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(2678599246042750),0916-062658-e6wrfwf7-v2n,1.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 238199, p25FileSize -> 203188, numDeletionVectorsRemoved -> 1, minFileSize -> 203188, numAddedFiles -> 1, maxFileSize -> 203188, p75FileSize -> 203188, p50FileSize -> 203188, numAddedBytes -> 203188)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
1,2025-09-16T07:11:49.000Z,8048247156126318,aiwithap@gmail.com,UPDATE,"Map(predicate -> [""(Outlet_Type#13737 = Grocery Store)""])",,List(2678599246042750),0916-062658-e6wrfwf7-v2n,0.0,WriteSerializable,False,"Map(numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 1, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 2201, numDeletionVectorsUpdated -> 0, scanTimeMs -> 898, numAddedFiles -> 1, numUpdatedRows -> 1083, numAddedBytes -> 33829, rewriteTimeMs -> 1303)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
0,2025-09-16T06:48:04.000Z,8048247156126318,aiwithap@gmail.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], clusterBy -> [], description -> null, isManaged -> true, properties -> {""delta.enableDeletionVectors"":""true""}, statsOnLoad -> true)",,List(2678599246042750),0916-062658-e6wrfwf7-v2n,,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 8523, numOutputBytes -> 204370)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13


In [0]:
# Get the history of the path-based Delta table written earlier
delta_table = DeltaTable.forPath(spark, delta_table_path)
display(delta_table.history())


In [0]:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "main.default.cleane_sales_data")
deltaHistory = deltaTable.history()
display(deltaHistory.where("version == 0"))

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
0,2025-09-16T06:48:04.000Z,8048247156126318,aiwithap@gmail.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], clusterBy -> [], description -> null, isManaged -> true, properties -> {""delta.enableDeletionVectors"":""true""}, statsOnLoad -> true)",,List(2678599246042750),0916-062658-e6wrfwf7-v2n,,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 8523, numOutputBytes -> 204370)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13


In [0]:
%sql
SELECT * FROM main.default.cleane_sales_data VERSION AS OF 0

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Promotion_Type
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38,
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,
Baking Goods,FDP36,10.395,REGULAR,0.0,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088,
Snack Foods,FDO10,13.65,REGULAR,0.012741089,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528,
Snack Foods,FDP10,12.98787955465592,LOW_FAT,0.127469857,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636,
Frozen Foods,FDH17,16.2,REGULAR,0.016687114,96.9726,OUT045,2002,Medium,Tier 2,Supermarket Type1,1076.5986,
Frozen Foods,FDU28,19.2,REGULAR,0.09444959,187.8214,OUT017,2007,Medium,Tier 2,Supermarket Type1,4710.535,


In [0]:
# Time travel to a specific version
# Read the data as it was at version 0, before the Promotion_Type column was added
original_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
print("Schema at version 0:")
original_df.printSchema()  # Should not have Promotion_Type column

Schema at version 0:
root
 |-- Item_Type: string (nullable = true)
 |-- Item_Identifier: string (nullable = true)
 |-- Item_Weight: double (nullable = true)
 |-- Item_Fat_Content: string (nullable = true)
 |-- Item_Visibility: double (nullable = true)
 |-- Item_MRP: double (nullable = true)
 |-- Outlet_Identifier: string (nullable = true)
 |-- Outlet_Establishment_Year: integer (nullable = true)
 |-- Outlet_Size: string (nullable = true)
 |-- Outlet_Location_Type: string (nullable = true)
 |-- Outlet_Type: string (nullable = true)
 |-- Item_Outlet_Sales: double (nullable = true)



In [0]:
# Time travel using a valid timestamp from Delta table history
# Get the earliest commit timestamp
history_df = delta_table.history()
earliest_commit = history_df.orderBy("timestamp").first()["timestamp"]

# Format timestamp for SQL (ISO8601)
earliest_commit_str = earliest_commit.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + '+00:00'

# Use the earliest available timestamp in the time travel query
query = f"""
SELECT * FROM delta.`{delta_table_path}` TIMESTAMP AS OF '{earliest_commit_str}'
LIMIT 5
"""
display(spark.sql(query))

# Check how many versions we have
num_versions = history_df.count()
print(f"The Delta table has {num_versions} versions")

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


The Delta table has 29 versions


### Delta Lake Optimizations

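Delta Lake provides several maintenance operations that improve query performance and control storage use:

1. **OPTIMIZE**: Compacts small files into larger ones
2. **ZORDER**: Colocates related data in the same files so that filters can skip more data
3. **VACUUM**: Deletes data files that the table no longer references and that are older than the retention period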

In [0]:
# Optimize the Delta table to improve performance
delta_table.optimize().executeCompaction()
print("Compaction completed")

# Z-Order by columns that are frequently used for filtering
delta_table.optimize().executeZOrderBy("Item_Type", "Outlet_Type")
print("Z-Order optimization completed")


Compaction completed
Z-Order optimization completed


In [0]:
# Vacuum data files that the table no longer references
delta_table.vacuum(170)  # retain 170 hours, just above the 168-hour (7-day) default
print("Vacuum completed")

Vacuum completed
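
Because vacuumed files cannot be recovered, and time travel to versions that depend on them stops working, it can help to preview a vacuum first. Delta Lake's SQL `VACUUM` accepts a `DRY RUN` clause that lists the files that would be deleted without deleting them. The following is a minimal sketch; `vacuum_sql` is a hypothetical helper that only builds the statement.

```python
def vacuum_sql(path: str, retain_hours: int, dry_run: bool = True) -> str:
    """Build a VACUUM statement for the Delta table stored at `path`."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    statement = f"VACUUM delta.`{path}` RETAIN {retain_hours} HOURS"
    return f"{statement} DRY RUN" if dry_run else statement


print(vacuum_sql("/Volumes/workspace/default/spark_workshop/delta/sales", 170))

# Example (requires a live Spark session):
# display(spark.sql(vacuum_sql(delta_table_path, 170)))
```

Defaulting `dry_run` to `True` is a deliberate choice here, so that the destructive form has to be requested explicitly.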
