# PySpark Advanced Applications - Day 3
## Delta Lake, Data Validation, and Structured Streaming

Welcome to Day 3 of the PySpark workshop! Today, we'll explore advanced applications of PySpark, focusing on Delta Lake, data validation, and an introduction to structured streaming.

## Day 3 Agenda

Today we'll cover:
1. **Delta Lake for Reliable Data Lakes**
2. **Data Validation and Quality Framework**
3. **Introduction to Structured Streaming**
4. **Putting It All Together: End-to-End Project**

Let's continue our PySpark journey with these advanced topics!

## Setup and Data Loading

First, let's initialize our environment and load data for today's exercises.

In [0]:
# Check our Spark version
print(f"Spark Version: {spark.version}")

# Create paths for our workshop data
workshop_path = "/Volumes/workspace/default/spark_workshop"
raw_data_path = f"{workshop_path}/raw_data"
processed_path = f"{workshop_path}/processed"
delta_path = f"{workshop_path}/delta"

print("Spark environment initialized!")
print(f"Workshop path: {workshop_path}")


Spark Version: 4.0.0
Spark environment initialized!
Workshop path: /Volumes/workspace/default/spark_workshop


In [0]:
# Load data for today's exercises
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Create a schema for the BigMart Sales data
sales_schema = StructType([
    StructField("Item_Identifier", StringType(), False),
    StructField("Item_Weight", DoubleType(), True),
    StructField("Item_Fat_Content", StringType(), True),
    StructField("Item_Visibility", DoubleType(), True),
    StructField("Item_Type", StringType(), True),
    StructField("Item_MRP", DoubleType(), True),
    StructField("Outlet_Identifier", StringType(), False),
    StructField("Outlet_Establishment_Year", IntegerType(), True),
    StructField("Outlet_Size", StringType(), True),
    StructField("Outlet_Location_Type", StringType(), True),
    StructField("Outlet_Type", StringType(), True),
    StructField("Item_Outlet_Sales", DoubleType(), True)
])

# Read the sales data
sales_df = spark.read.format('csv')\
                  .option('header', True)\
                  .schema(sales_schema)\
                  .load(f'{workshop_path}/BigMart Sales.csv')

## Data Transformation


In [0]:
# Prepare a cleaned version similar to what we did in Day 2
from pyspark.sql.functions import col, when, trim, upper, regexp_replace, coalesce, lit, avg

# DataFrames are immutable, so this is only a second reference to the same data;
# each cleaning step below returns a new DataFrame
clean_sales_df = sales_df

# Standardize text fields
clean_sales_df = clean_sales_df.withColumn(
    "Item_Fat_Content", 
    upper(trim(col("Item_Fat_Content")))
)

# Normalize categorical values
clean_sales_df = clean_sales_df.withColumn(
    "Item_Fat_Content",
    when(col("Item_Fat_Content").isin("LOW FAT", "LF"), "LOW_FAT")
    .when(col("Item_Fat_Content").isin("REG", "REGULAR"), "REGULAR")
    .otherwise(col("Item_Fat_Content"))
)

In [0]:
# Calculate average weight by item type
avg_weight_by_type = clean_sales_df.filter(col("Item_Weight").isNotNull()) \
                        .groupBy("Item_Type") \
                        .agg(avg("Item_Weight").alias("Avg_Weight"))

In [0]:
# Join back to original data to fill missing weights
clean_sales_df = clean_sales_df.join(
    avg_weight_by_type,
    "Item_Type",
    "left"
)

In [0]:
# Fill missing Item_Weight with calculated average by type
clean_sales_df = clean_sales_df.withColumn(
    "Item_Weight",
    coalesce(col("Item_Weight"), col("Avg_Weight"))
)

In [0]:
# Fill remaining missing weights with overall average
overall_avg_weight = clean_sales_df.filter(col("Item_Weight").isNotNull()) \
                         .agg(avg(col("Item_Weight")).alias("overall_avg")).collect()[0]["overall_avg"]

In [0]:
clean_sales_df = clean_sales_df.withColumn(
    "Item_Weight",
    coalesce(col("Item_Weight"), lit(overall_avg_weight))
).drop("Avg_Weight")  # Drop the temporary average column

# Fill missing Outlet_Size with 'Medium' (assuming it's the most common)
clean_sales_df = clean_sales_df.withColumn(
    "Outlet_Size",
    coalesce(col("Outlet_Size"), lit("Medium"))
)

# Display the cleaned data
display(clean_sales_df.limit(5))

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Delta Lake for Reliable Data Lakes

[Delta Lake](https://delta.io/) is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It's a key component in the Databricks Lakehouse architecture.

### Key Features of Delta Lake:

1. **ACID Transactions**: Ensures data consistency and reliability
2. **Schema Enforcement**: Prevents bad data from corrupting your tables
3. **Schema Evolution**: Allows schema changes without breaking existing queries
4. **Time Travel**: Query historical versions of your data
5. **Audit History**: Track all changes to your data
6. **Upserts and Deletes**: Support for merge, update, and delete operations
7. **Optimization**: File compaction and Z-order indexing

Let's explore these features:

In [0]:
# Create a Delta table from our cleaned sales data
delta_table_path = f"{delta_path}/sales"

# Write data to Delta format
clean_sales_df.write.format("delta").mode("overwrite").save(delta_table_path)

# Read the Delta table
delta_df = spark.read.format("delta").load(delta_table_path)
print(f"Delta table created with {delta_df.count()} rows")
display(delta_df.limit(5))

Delta table created with 8523 rows


Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Promotion_Type
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38,
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,


### ACID Transactions with Delta Lake

Delta Lake ensures ACID properties, which is critical for data reliability:
- **Atomicity**: All operations succeed or fail together
- **Consistency**: Data is valid according to defined rules
- **Isolation**: Concurrent operations don't interfere with each other
- **Durability**: Committed changes remain even after system failures
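To build intuition for atomicity and isolation: Delta Lake records each commit as a numbered file in the table's `_delta_log` directory, and a commit only counts if its writer manages to create the next numbered file. The sketch below is a toy model of that idea in plain Python, using a temporary directory. The helper `try_commit` and the file contents are illustrative assumptions, not part of the Delta Lake API.

```python
import json
import os
import tempfile

def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version`; return True on success.

    Opening with mode "x" fails if the file already exists, so only one
    writer can claim a given version number (a toy stand-in for the
    put-if-absent behaviour a real log store provides).
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False

log_dir = tempfile.mkdtemp()  # hypothetical stand-in for a _delta_log directory

first = try_commit(log_dir, 0, [{"add": "part-0000.parquet"}])
# A second writer racing for the same version loses ...
second = try_commit(log_dir, 0, [{"add": "part-0001.parquet"}])
# ... and has to retry against the next version number
retry = try_commit(log_dir, 1, [{"add": "part-0001.parquet"}])

print(first, second, retry)
print(sorted(os.listdir(log_dir)))
```

Readers only ever see complete commit files, which is why a failed or conflicting write never shows up as a half-applied change. The toy ignores details such as conflict detection between the two writers and crash safety while the file is being written.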

In [0]:
# Import Delta Lake specific functions
from delta.tables import DeltaTable

# Get the Delta table instance
delta_table = DeltaTable.forPath(spark, delta_table_path)

# Perform update operations to standardize the Item_Fat_Content values.
# Each update() call is committed atomically as its own table version.
delta_table.update(
    condition=col("Item_Fat_Content") == "LOW_FAT",
    set={"Item_Fat_Content": lit("Low Fat")}
)

delta_table.update(
    condition=col("Item_Fat_Content") == "REGULAR",
    set={"Item_Fat_Content": lit("Regular")}
)

# Read the updated data
updated_df = spark.read.format("delta").load(delta_table_path)
print("Unique Item_Fat_Content values after update:")
display(updated_df.select("Item_Fat_Content").distinct())

Unique Item_Fat_Content values after update:


Item_Fat_Content
Low Fat
Regular


### Schema Enforcement and Evolution

Delta Lake provides schema enforcement to prevent bad data from corrupting your tables, and schema evolution to allow changes to the schema as your data evolves.
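The rule behind the next few cells can be sketched in plain Python. This is a simplified model, assuming only column names matter (real enforcement also checks data types): columns missing from the incoming data are allowed and filled with nulls, while unexpected new columns are rejected unless schema evolution is requested. The function `check_append` is a hypothetical helper written for this illustration.

```python
def check_append(table_cols, incoming_cols, merge_schema=False):
    """Return the table's columns after an append, or raise if it is rejected."""
    extra = [c for c in incoming_cols if c not in table_cols]
    if extra and not merge_schema:
        # Schema enforcement: unknown columns are refused by default
        raise ValueError(f"schema mismatch, unexpected columns: {extra}")
    # Schema evolution: new columns are added at the end of the table schema
    return list(table_cols) + extra

table_cols = ["Item_Identifier", "Item_Type", "Item_MRP", "Item_Weight"]
incoming_cols = ["Item_Identifier", "Item_Type", "Item_MRP", "Promotion_Type"]

try:
    check_append(table_cols, incoming_cols)
    rejected = False
except ValueError as e:
    rejected = True
    print(e)

evolved = check_append(table_cols, incoming_cols, merge_schema=True)
print(evolved)
```

In this model `Item_Weight` is absent from the incoming data and simply stays in the table schema, while `Promotion_Type` is only accepted when `merge_schema=True`, mirroring the `mergeSchema` write option used below.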

In [0]:
# Create a new DataFrame with a different schema
# It adds a new column (Promotion_Type) and omits several columns from the original
from pyspark.sql import Row

new_data = [
    Row(
        Item_Identifier="NEW001", 
        Item_Fat_Content="Low Fat", 
        Item_Type="Snacks", 
        Item_MRP=150.0, 
        Outlet_Identifier="OUT010", 
        Outlet_Type="Supermarket Type1", 
        Promotion_Type="BOGO"  # New column
    ),
    Row(
        Item_Identifier="NEW002", 
        Item_Fat_Content="Regular", 
        Item_Type="Dairy", 
        Item_MRP=85.0, 
        Outlet_Identifier="OUT017", 
        Outlet_Type="Supermarket Type2", 
        Promotion_Type="None"  # New column
    )
]

# Create DataFrame from the new data
new_df = spark.createDataFrame(new_data)
display(new_df)

Item_Identifier,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Type,Promotion_Type
NEW001,Low Fat,Snacks,150.0,OUT010,Supermarket Type1,BOGO
NEW002,Regular,Dairy,85.0,OUT017,Supermarket Type2,


In [0]:
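# Try to write with default schema enforcement. This fails when the table has no
# Promotion_Type column yet; if the column already exists (for example, on a re-run
# of this notebook), the append succeeds and the rows are written.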
try:
    new_df.write.format("delta").mode("append").save(delta_table_path)
except Exception as e:
    print("Error with default schema enforcement:")
    print(str(e))

# Write with schema evolution enabled
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(delta_table_path)
print("Write succeeded with schema evolution enabled")

# Read the updated table to see the new schema
evolved_df = spark.read.format("delta").load(delta_table_path)
evolved_df.printSchema()

# Check for our new data with the Promotion_Type column
display(evolved_df.filter(col("Item_Identifier").startswith("NEW")).select(
    "Item_Identifier", "Item_Type", "Promotion_Type"
))

Write succeeded with schema evolution enabled
root
 |-- Item_Type: string (nullable = true)
 |-- Item_Identifier: string (nullable = true)
 |-- Item_Weight: double (nullable = true)
 |-- Item_Fat_Content: string (nullable = true)
 |-- Item_Visibility: double (nullable = true)
 |-- Item_MRP: double (nullable = true)
 |-- Outlet_Identifier: string (nullable = true)
 |-- Outlet_Establishment_Year: integer (nullable = true)
 |-- Outlet_Size: string (nullable = true)
 |-- Outlet_Location_Type: string (nullable = true)
 |-- Outlet_Type: string (nullable = true)
 |-- Item_Outlet_Sales: double (nullable = true)
 |-- Promotion_Type: string (nullable = true)



Item_Identifier,Item_Type,Promotion_Type
NEW001,Snacks,BOGO
NEW002,Dairy,
NEW001,Snacks,BOGO
NEW002,Dairy,


### Time Travel with Delta Lake

Delta Lake maintains a transaction log that allows you to access previous versions of your data, enabling point-in-time analysis, rollbacks, and auditing.
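Conceptually, reading version N means replaying the log's add and remove actions from commit 0 through commit N to find which data files were live at that point. The following is a minimal in-memory sketch of that replay. The commit dictionaries and file names are made up for illustration and are far simpler than real log entries.

```python
def snapshot(commits, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for commit in commits[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live

# Hypothetical three-commit history
commits = [
    {"add": ["f0.parquet"]},                            # v0: initial write
    {"add": ["f1.parquet"]},                            # v1: append
    {"add": ["f2.parquet"], "remove": ["f0.parquet"]},  # v2: update rewrites f0
]

print(sorted(snapshot(commits, 0)))  # ['f0.parquet']
print(sorted(snapshot(commits, 2)))  # ['f1.parquet', 'f2.parquet']
```

Version 0 can only be read while `f0.parquet` still exists on storage, which is why removing old files limits how far back time travel can reach.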

In [0]:
# Get the history of the Delta table
display(delta_table.history())


version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
28,2025-09-15T18:01:29.000Z,8048247156126318,aiwithap@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,27.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 2090)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
27,2025-09-15T18:01:28.000Z,8048247156126318,aiwithap@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,26.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 2090)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
26,2025-09-15T18:01:08.000Z,8048247156126318,aiwithap@gmail.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,25.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 282515, p25FileSize -> 202980, numDeletionVectorsRemoved -> 1, minFileSize -> 202980, numAddedFiles -> 1, maxFileSize -> 202980, p75FileSize -> 202980, p50FileSize -> 202980, numAddedBytes -> 202980)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
25,2025-09-15T18:01:05.000Z,8048247156126318,aiwithap@gmail.com,UPDATE,"Map(predicate -> [""(Item_Fat_Content#11988 = REGULAR)""])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,23.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numRemovedBytes -> 204162, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 1, numAddedChangeFiles -> 0, executionTimeMs -> 2258, conflictDetectionTimeMs -> 461, numDeletionVectorsUpdated -> 0, scanTimeMs -> 1120, numAddedFiles -> 1, numUpdatedRows -> 3006, numAddedBytes -> 79033, rewriteTimeMs -> 1137)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
24,2025-09-15T18:01:03.000Z,8048247156126318,aiwithap@gmail.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,23.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 344095, p25FileSize -> 203482, numDeletionVectorsRemoved -> 1, minFileSize -> 203482, numAddedFiles -> 1, maxFileSize -> 203482, p75FileSize -> 203482, p50FileSize -> 203482, numAddedBytes -> 203482)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
23,2025-09-15T18:01:00.000Z,8048247156126318,aiwithap@gmail.com,UPDATE,"Map(predicate -> [""(Item_Fat_Content#11466 = LOW_FAT)""])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,22.0,WriteSerializable,False,"Map(numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 1, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 3360, numDeletionVectorsUpdated -> 0, scanTimeMs -> 1780, numAddedFiles -> 1, numUpdatedRows -> 5517, numAddedBytes -> 139933, rewriteTimeMs -> 1543)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
22,2025-09-15T18:00:38.000Z,8048247156126318,aiwithap@gmail.com,WRITE,"Map(mode -> Overwrite, statsOnLoad -> false, partitionBy -> [])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,21.0,WriteSerializable,False,"Map(numFiles -> 1, numRemovedFiles -> 1, numRemovedBytes -> 204229, numOutputRows -> 8523, numOutputBytes -> 204162)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
21,2025-09-13T15:34:49.000Z,8048247156126318,aiwithap@gmail.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(2678599246042750),0913-153231-ebnfpbhg-v2n,20.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 6, numRemovedBytes -> 216823, p25FileSize -> 204229, numDeletionVectorsRemoved -> 2, minFileSize -> 204229, numAddedFiles -> 1, maxFileSize -> 204229, p75FileSize -> 204229, p50FileSize -> 204229, numAddedBytes -> 204229)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
20,2025-09-13T15:34:47.000Z,8048247156126318,aiwithap@gmail.com,MERGE,"Map(predicate -> [""(Item_Identifier#13779 = Item_Identifier#13753)""], clusterBy -> [], matchedPredicates -> [{""actionType"":""update""}], statsOnLoad -> false, notMatchedBySourcePredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}])",,List(2678599246042750),0913-153231-ebnfpbhg-v2n,19.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 3, numTargetBytesAdded -> 9663, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 2, numTargetRowsMatchedUpdated -> 2, executionTimeMs -> 3448, materializeSourceTimeMs -> 290, numTargetRowsInserted -> 2, numTargetRowsMatchedDeleted -> 0, numTargetDeletionVectorsUpdated -> 0, scanTimeMs -> 1097, numTargetRowsUpdated -> 2, numOutputRows -> 4, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 3, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1938)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
19,2025-09-13T15:34:05.000Z,8048247156126318,aiwithap@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(2678599246042750),0913-153231-ebnfpbhg-v2n,18.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 2090)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13


In [0]:
# Time travel to a specific version
# Let's read the data as it was before we added the new columns (version 0)
original_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
print("Schema at version 0:")
original_df.printSchema()  # Should not have Promotion_Type column

Schema at version 0:
root
 |-- Item_Type: string (nullable = true)
 |-- Item_Identifier: string (nullable = true)
 |-- Item_Weight: double (nullable = true)
 |-- Item_Fat_Content: string (nullable = true)
 |-- Item_Visibility: double (nullable = true)
 |-- Item_MRP: double (nullable = true)
 |-- Outlet_Identifier: string (nullable = true)
 |-- Outlet_Establishment_Year: integer (nullable = true)
 |-- Outlet_Size: string (nullable = true)
 |-- Outlet_Location_Type: string (nullable = true)
 |-- Outlet_Type: string (nullable = true)
 |-- Item_Outlet_Sales: double (nullable = true)



In [0]:
# Time travel using a valid timestamp from Delta table history
# Get the earliest commit timestamp
history_df = delta_table.history()
earliest_commit = history_df.orderBy("timestamp").first()["timestamp"]

# Format timestamp for SQL (ISO8601)
earliest_commit_str = earliest_commit.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + '+00:00'

# Use the earliest available timestamp in the time travel query
query = f"""
SELECT * FROM delta.`{delta_table_path}` TIMESTAMP AS OF '{earliest_commit_str}'
LIMIT 5
"""
display(spark.sql(query))

# Check how many versions we have (count on the cluster instead of collecting rows to the driver)
num_versions = history_df.count()
print(f"The Delta table has {num_versions} versions")

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Dairy,FDA15,9.3,LOW_FAT,0.016047301,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
Soft Drinks,DRC01,5.92,REGULAR,0.019278216,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
Meat,FDN15,17.5,LOW_FAT,0.016760075,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
Fruits and Vegetables,FDX07,19.2,REGULAR,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
Household,NCD19,8.93,LOW_FAT,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


The Delta table has 29 versions
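The timestamp formatting step in the cell above can be checked in isolation with plain Python. The commit time below is a made-up value, and appending `+00:00` assumes the session time zone is UTC, as the cell above does.

```python
from datetime import datetime

# Hypothetical commit time; assumed to be expressed in UTC
commit_ts = datetime(2025, 9, 13, 15, 34, 5)

# Approach used above: trim microseconds to milliseconds, then append the offset by hand
manual = commit_ts.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + '+00:00'

# Equivalent result using isoformat with millisecond precision
iso = commit_ts.isoformat(timespec='milliseconds') + '+00:00'

print(manual)  # 2025-09-13T15:34:05.000+00:00
print(iso)     # 2025-09-13T15:34:05.000+00:00
```

If the session time zone is not UTC, the hard-coded offset would point the query at a different instant, so it is worth confirming the time zone before relying on this string.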


### Merge Operations (Upserts)

Delta Lake supports merge operations (upserts), which let you atomically insert, update, and delete data based on conditions.
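The matched/not-matched logic of an upsert can be sketched on plain Python rows. This toy `merge` function is an illustration written for this section, not Delta Lake's implementation, and it does not model the error a real merge raises when several source rows match the same target row.

```python
def merge(target, source, key, update_cols):
    """Upsert `source` rows into `target` (both lists of dicts) on `key`."""
    source_by_key = {row[key]: row for row in source}
    matched_keys = set()
    result = []
    for row in target:
        src = source_by_key.get(row[key])
        if src is not None:
            # "when matched": overwrite only the listed columns
            row = {**row, **{c: src[c] for c in update_cols}}
            matched_keys.add(row[key])
        result.append(row)
    # "when not matched": insert source rows whose key is absent from the target
    result.extend(dict(r) for r in source if r[key] not in matched_keys)
    return result

target = [
    {"Item_Identifier": "NEW001", "Item_MRP": 150.0},
    {"Item_Identifier": "NEW001", "Item_MRP": 150.0},  # duplicate from a repeated append
    {"Item_Identifier": "NEW002", "Item_MRP": 85.0},
]
source = [
    {"Item_Identifier": "NEW001", "Item_MRP": 155.0},
    {"Item_Identifier": "NEW003", "Item_MRP": 125.0},
]

merged = merge(target, source, "Item_Identifier", ["Item_MRP"])
for row in merged:
    print(row)
```

Every target row that matches the key is updated, so a target that already contains duplicate `NEW001` rows ends up with two updated `NEW001` rows rather than one.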

In [0]:
from pyspark.sql import Row
# Create source data for a merge operation
# This will have both updates to existing rows and new rows
source_data = [
    # Updated rows (same Item_Identifier but different values)
    Row(
        Item_Identifier="NEW001", 
        Item_Fat_Content="Low Fat", 
        Item_Type="Snacks", 
        Item_MRP=155.0,  # Updated price
        Outlet_Identifier="OUT010", 
        Outlet_Type="Supermarket Type1", 
        Promotion_Type="Discount"  # Updated promotion
    ),
    # New rows
    Row(
        Item_Identifier="NEW003", 
        Item_Fat_Content="Low Fat", 
        Item_Type="Baking Goods", 
        Item_MRP=125.0, 
        Outlet_Identifier="OUT027", 
        Outlet_Type="Supermarket Type3", 
        Promotion_Type="None"
    ),
    Row(
        Item_Identifier="NEW004", 
        Item_Fat_Content="Regular", 
        Item_Type="Frozen Foods", 
        Item_MRP=175.0, 
        Outlet_Identifier="OUT045", 
        Outlet_Type="Supermarket Type1", 
        Promotion_Type="BOGO"
    )
]

source_df = spark.createDataFrame(source_data)
display(source_df)

Item_Identifier,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Type,Promotion_Type
NEW001,Low Fat,Snacks,155.0,OUT010,Supermarket Type1,Discount
NEW003,Low Fat,Baking Goods,125.0,OUT027,Supermarket Type3,
NEW004,Regular,Frozen Foods,175.0,OUT045,Supermarket Type1,BOGO


In [0]:
# Perform a merge operation
delta_table.alias("target").merge(
    source_df.alias("source"),
    "target.Item_Identifier = source.Item_Identifier"
).whenMatchedUpdate(set={
    "Item_MRP": "source.Item_MRP",
    "Promotion_Type": "source.Promotion_Type"
}).whenNotMatchedInsert(values={
    "Item_Identifier": "source.Item_Identifier",
    "Item_Fat_Content": "source.Item_Fat_Content",
    "Item_Type": "source.Item_Type",
    "Item_MRP": "source.Item_MRP",
    "Outlet_Identifier": "source.Outlet_Identifier",
    "Outlet_Type": "source.Outlet_Type",
    "Promotion_Type": "source.Promotion_Type"
}).execute()

DataFrame[num_affected_rows: bigint, num_updated_rows: bigint, num_deleted_rows: bigint, num_inserted_rows: bigint]

In [0]:
# Check the results of the merge
merged_df = spark.read.format("delta").load(delta_table_path)
display(merged_df.filter(col("Item_Identifier").startswith("NEW")).select(
    "Item_Identifier", "Item_Type", "Item_MRP", "Promotion_Type"
).orderBy("Item_Identifier"))

# Check the history again to see the merge operation
display(delta_table.history().limit(3))

Item_Identifier,Item_Type,Item_MRP,Promotion_Type
NEW001,Snacks,155.0,Discount
NEW001,Snacks,155.0,Discount
NEW002,Dairy,85.0,
NEW002,Dairy,85.0,
NEW003,Baking Goods,125.0,
NEW004,Frozen Foods,175.0,BOGO


version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
30,2025-09-15T18:03:18.000Z,8048247156126318,aiwithap@gmail.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,29,SnapshotIsolation,False,"Map(numRemovedFiles -> 6, numRemovedBytes -> 216823, p25FileSize -> 204229, numDeletionVectorsRemoved -> 2, minFileSize -> 204229, numAddedFiles -> 1, maxFileSize -> 204229, p75FileSize -> 204229, p50FileSize -> 204229, numAddedBytes -> 204229)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
29,2025-09-15T18:03:16.000Z,8048247156126318,aiwithap@gmail.com,MERGE,"Map(predicate -> [""(Item_Identifier#13792 = Item_Identifier#13766)""], clusterBy -> [], matchedPredicates -> [{""actionType"":""update""}], statsOnLoad -> false, notMatchedBySourcePredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,28,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 3, numTargetBytesAdded -> 9663, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 2, numTargetRowsMatchedUpdated -> 2, executionTimeMs -> 3726, materializeSourceTimeMs -> 335, numTargetRowsInserted -> 2, numTargetRowsMatchedDeleted -> 0, numTargetDeletionVectorsUpdated -> 0, scanTimeMs -> 1316, numTargetRowsUpdated -> 2, numOutputRows -> 4, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 3, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1966)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13
28,2025-09-15T18:01:29.000Z,8048247156126318,aiwithap@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(2678599246042750),0915-175131-8xcaaaqy-v2n,27,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 2090)",,Databricks-Runtime/17.1.x-aarch64-photon-scala2.13


### Delta Lake Optimizations

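Delta Lake provides several maintenance operations that improve query performance and control storage use:

1. **OPTIMIZE**: Compacts small files into larger ones
2. **ZORDER**: Colocates related data for faster filtering
3. **VACUUM**: Deletes data files that the table no longer references and that are older than the retention period, reclaiming storage at the cost of time travel to the versions that needed them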
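The effect of the retention period on VACUUM can be sketched with plain Python. This is a toy model that assumes each file carries a single timestamp and that the set of live files is known. The file names, timestamps, and the helper `vacuum_candidates` are hypothetical.

```python
from datetime import datetime, timedelta

def vacuum_candidates(files, live, retention_hours, now):
    """Return unreferenced files that are older than the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        name for name, modified in files.items()
        if name not in live and modified < cutoff
    )

now = datetime(2025, 9, 15, 18, 0, 0)  # hypothetical "current" time
files = {
    "f0.parquet": now - timedelta(hours=200),  # replaced long ago
    "f1.parquet": now - timedelta(hours=50),   # replaced recently
    "f2.parquet": now - timedelta(hours=1),    # current data
}
live = {"f2.parquet"}  # files referenced by the latest table version

print(vacuum_candidates(files, live, 170, now))  # ['f0.parquet']
print(vacuum_candidates(files, live, 10, now))   # ['f0.parquet', 'f1.parquet']
```

A shorter retention period frees more storage but also removes files that recent versions still depend on, which is why Delta Lake guards retention values below the default of 7 days (168 hours) with a safety check.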

In [0]:
# Optimize the Delta table to improve performance
delta_table.optimize().executeCompaction()
print("Compaction completed")

# Z-Order by columns that are frequently used for filtering
delta_table.optimize().executeZOrderBy("Item_Type", "Outlet_Type")
print("Z-Order optimization completed")


Compaction completed
Z-Order optimization completed


In [0]:
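# Vacuum files that the table no longer references and that are older than the retention period.
# 170 hours is just above the default 168-hour (7-day) minimum, so the safety check passes.
delta_table.vacuum(170)  # 170 hours retention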
print("Vacuum completed")

Vacuum completed


In [0]:
# Optional: to vacuum with a retention shorter than the default 168 hours, disable the
# retention safety check first (demo only; not recommended for production).
# Both lines are left commented out, so no vacuum runs in this cell.
# spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# delta_table.vacuum(10)  # 10 hours retention

# List the files in the table directory with their sizes
display(dbutils.fs.ls(delta_table_path))


path,name,size,modificationTime
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/_delta_log/,_delta_log/,0,1757959644882
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_0701663f-2c0d-46d3-8eea-957c7845717d.bin,deletion_vector_0701663f-2c0d-46d3-8eea-957c7845717d.bin,6092,1757441274000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_0a5032e0-97ee-4bae-9b94-53042f4000b9.bin,deletion_vector_0a5032e0-97ee-4bae-9b94-53042f4000b9.bin,40,1757441275000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_30a44695-93c6-447d-bdae-0abf64625937.bin,deletion_vector_30a44695-93c6-447d-bdae-0abf64625937.bin,7692,1757441268000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_35f32b8b-61b1-47be-85ab-dbd1613dc4eb.bin,deletion_vector_35f32b8b-61b1-47be-85ab-dbd1613dc4eb.bin,6092,1757777623000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_46dfb757-bb19-40ac-a883-69d523e9546f.bin,deletion_vector_46dfb757-bb19-40ac-a883-69d523e9546f.bin,7692,1757775551000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_4828c9e2-3572-4079-9aa5-cd8a2cc643cb.bin,deletion_vector_4828c9e2-3572-4079-9aa5-cd8a2cc643cb.bin,6092,1757775555000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_6ddfd443-e78c-411c-9121-0e0712949324.bin,deletion_vector_6ddfd443-e78c-411c-9121-0e0712949324.bin,7692,1757959258000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_70f48d90-2f57-460a-947e-8bc4f699d986.bin,deletion_vector_70f48d90-2f57-460a-947e-8bc4f699d986.bin,40,1757777625000
dbfs:/Volumes/workspace/default/spark_workshop/delta/sales/deletion_vector_7e21e7f6-f337-4c22-8710-4ee21e24ce6d.bin,deletion_vector_7e21e7f6-f337-4c22-8710-4ee21e24ce6d.bin,7692,1757777618000


## Data Validation and Quality Framework

Data quality is critical for any data processing application. Let's build a simple data validation framework to check data quality using PySpark.

Our framework will include:
1. **Range validation**: Check if values are within expected ranges
2. **Uniqueness checks**: Verify primary key constraints
3. **Pattern validation**: Ensure text fields match expected formats
4. **Referential integrity**: Check foreign key relationships
5. **Null/missing value detection**: Identify incomplete data
6. **Data freshness**: Verify data is up-to-date
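Before wiring these checks into Spark, two of them can be prototyped on plain Python rows. The sketch below covers null detection and a uniqueness check. The helper names and sample rows are assumptions made for illustration, and `duplicate_count` uses the same "total minus distinct" idea as the uniqueness check in the class that follows.

```python
def null_counts(rows, columns):
    """Count missing (None) values per column."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

def duplicate_count(rows, key_columns):
    """Total rows minus distinct key combinations."""
    keys = [tuple(r.get(c) for c in key_columns) for r in rows]
    return len(keys) - len(set(keys))

# Hypothetical sample rows shaped like the sales data
rows = [
    {"Item_Identifier": "FDA15", "Outlet_Identifier": "OUT049", "Item_Weight": 9.3},
    {"Item_Identifier": "DRC01", "Outlet_Identifier": "OUT018", "Item_Weight": None},
    {"Item_Identifier": "FDA15", "Outlet_Identifier": "OUT049", "Item_Weight": 9.3},
]

print(null_counts(rows, ["Item_Weight"]))
print(duplicate_count(rows, ["Item_Identifier", "Outlet_Identifier"]))
```

On a DataFrame the same quantities would come from filtered and distinct counts rather than Python loops, but the pass/fail logic (zero nulls, zero duplicates) is the same.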

In [0]:
# Let's create a simple DataValidator class to handle data quality checks
from pyspark.sql.functions import col, count, when, isnan, regexp_extract, to_date, datediff, current_date
from typing import List, Dict, Any, Optional
import re
from pyspark.sql import Row, types as T

class DataValidator:
    def __init__(self, df):
        """Initialize with a DataFrame to validate."""
        self.df = df
        self.validation_results = []
    
    def validate_range(self, column_name: str, min_value: float, max_value: float, 
                      filter_condition: Optional[str] = None) -> 'DataValidator':
        """Check if values in a column are within the specified range."""
        if filter_condition:
            filtered_df = self.df.filter(filter_condition)
        else:
            filtered_df = self.df
        
        out_of_range = filtered_df.filter(
                              (col(column_name) < min_value) | (col(column_name) > max_value)
                               ).count()
        
        total = filtered_df.count()
        
        if total > 0:
            percentage = float(out_of_range) / float(total) * 100.0
        else:
            percentage = 0.0
        
        result = {
            "validation_type": "range_check",
            "column_name": column_name,
            "min_value": float(min_value),
            "max_value": float(max_value),
            "filter_condition": filter_condition,
            "out_of_range_count": float(out_of_range),
            "total_count": float(total),
            "out_of_range_percentage": float(percentage),
            "passed": out_of_range == 0
        }
        
        self.validation_results.append(result)
        return self
    
    def validate_unique(self, column_names: List[str]) -> 'DataValidator':
        """Check if the specified columns form a unique key."""
        total = self.df.count()
        distinct = self.df.select(column_names).distinct().count()
        
        result = {
            "validation_type": "uniqueness_check",
            "column_names": column_names,
            "total_count": float(total),
            "distinct_count": float(distinct),
            "duplicate_count": float(total - distinct),
            "passed": total == distinct
        }
        
        self.validation_results.append(result)
        return self
    
    def validate_pattern(self, column_name: str, pattern: str) -> 'DataValidator':
        """Check if values in a column match the specified regex pattern."""
        # Count non-null values
        total_non_null = self.df.filter(col(column_name).isNotNull()).count()
        
        # Count values matching the pattern
        matching = self.df.filter(
            col(column_name).isNotNull() & 
            (regexp_extract(col(column_name), pattern, 0) == col(column_name))
        ).count()
        
        non_matching = total_non_null - matching
        
        result = {
            "validation_type": "pattern_check",
            "column_name": column_name,
            "pattern": pattern,
            "total_non_null": float(total_non_null),
            "matching_count": float(matching),
            "non_matching_count": float(non_matching),
            "passed": non_matching == 0
        }
        
        self.validation_results.append(result)
        return self
    
    def validate_referential_integrity(self, source_column: str, 
                                      target_df, target_column: str) -> 'DataValidator':
        """Check if values in source_column exist in target_column of target_df."""
        # Get distinct values from source column
        source_values = self.df.select(source_column).distinct()
        
        # Get distinct values from target column
        target_values = target_df.select(target_column).distinct()
        
        # Find values in source that don't exist in target
        missing_values = source_values.join(
            target_values,
            source_values[source_column] == target_values[target_column],
            "left_anti"
        )
        
        missing_count = missing_values.count()
        total_count = source_values.count()
        
        result = {
            "validation_type": "referential_integrity",
            "source_column": source_column,
            "target_column": target_column,
            "total_distinct_source_values": float(total_count),
            "missing_values_count": float(missing_count),
            "passed": missing_count == 0
        }
        
        self.validation_results.append(result)
        return self
    
    def validate_freshness(self, date_column: str, max_days_old: int) -> 'DataValidator':
        """Check if data is not older than specified number of days."""
        if date_column not in self.df.columns:
            result = {
                "validation_type": "freshness_check",
                "column_name": date_column,
                "error": f"Column {date_column} not found in DataFrame",
                "passed": False
            }
        else:
            # Calculate the age of each record in days
            with_age = self.df.withColumn(
                "age_in_days", 
                datediff(current_date(), col(date_column))
            )
            
            # Count records that are too old
            too_old_count = with_age.filter(col("age_in_days") > max_days_old).count()
            total_count = with_age.count()
            
            result = {
                "validation_type": "freshness_check",
                "column_name": date_column,
                "max_days_old": float(max_days_old),
                "too_old_count": float(too_old_count),
                "total_count": float(total_count),
                "passed": too_old_count == 0
            }
        
        self.validation_results.append(result)
        return self
    
    def get_results(self) -> List[Dict[str, Any]]:
        """Return all validation results."""
        return self.validation_results
    
    def display_results(self):
        """Display validation results in a readable format."""
        # Define a schema for the validation results
        schema = T.StructType([
            T.StructField("validation_type", T.StringType(), True),
            T.StructField("column_name", T.StringType(), True),
            T.StructField("min_value", T.DoubleType(), True),
            T.StructField("max_value", T.DoubleType(), True),
            T.StructField("filter_condition", T.StringType(), True),
            T.StructField("out_of_range_count", T.DoubleType(), True),
            T.StructField("total_count", T.DoubleType(), True),
            T.StructField("out_of_range_percentage", T.DoubleType(), True),
            T.StructField("passed", T.BooleanType(), True),
            T.StructField("column_names", T.StringType(), True),
            T.StructField("distinct_count", T.DoubleType(), True),
            T.StructField("duplicate_count", T.DoubleType(), True),
            T.StructField("pattern", T.StringType(), True),
            T.StructField("total_non_null", T.DoubleType(), True),
            T.StructField("matching_count", T.DoubleType(), True),
            T.StructField("non_matching_count", T.DoubleType(), True),
            T.StructField("source_column", T.StringType(), True),
            T.StructField("target_column", T.StringType(), True),
            T.StructField("total_distinct_source_values", T.DoubleType(), True),
            T.StructField("missing_values_count", T.DoubleType(), True),
            T.StructField("max_days_old", T.DoubleType(), True),
            T.StructField("too_old_count", T.DoubleType(), True),
            T.StructField("error", T.StringType(), True)
        ])
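        # Convert each result dict to a tuple ordered like the schema.
        # Row(**d) keeps the dict's insertion order, which can differ from the
        # schema order; that mismatch is a likely cause of the ArrowInvalid
        # error raised by createDataFrame.
        double_fields = {
            f.name for f in schema.fields if isinstance(f.dataType, T.DoubleType)
        }

        def dict_to_row(d):
            d = d.copy()
            # For list columns, convert to string for display
            if isinstance(d.get("column_names"), list):
                d["column_names"] = ", ".join(d["column_names"])
            values = []
            for field in schema.fieldNames():
                value = d.get(field)  # missing keys become None
                # Cast ints to float so they fit the DoubleType columns
                if field in double_fields and value is not None:
                    value = float(value)
                values.append(value)
            return tuple(values)

        rows = [dict_to_row(d) for d in self.validation_results]
        results_df = spark.createDataFrame(rows, schema)
        display(results_df)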
    
    def all_passed(self) -> bool:
        """Check if all validations passed."""
        return all(result["passed"] for result in self.validation_results)

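Each validation method stores a dict with its own set of keys, while `display_results` needs rows that all share one column order. The sketch below shows that alignment step in plain Python, without Spark. The `FIELD_NAMES` list, the `to_row` helper, and the sample result dicts (including the `range_check` label) are made up for illustration and are not part of the `DataValidator` class.

```python
# Hypothetical, shortened column list standing in for the full schema
FIELD_NAMES = ["validation_type", "column_name", "pattern", "passed"]


def to_row(result, field_names):
    """Return the dict's values in field_names order; missing keys become None."""
    return tuple(result.get(name) for name in field_names)


# Made-up results whose keys arrive in different orders and with different keys
results = [
    {"passed": True, "validation_type": "range_check", "column_name": "Item_MRP"},
    {
        "validation_type": "pattern_check",
        "column_name": "Item_Identifier",
        "pattern": "^[A-Z]{2}[0-9]{3}$",
        "passed": False,
    },
]

rows = [to_row(r, FIELD_NAMES) for r in results]
for row in rows:
    print(row)
```

Looking values up by name keeps every tuple in the same column order no matter how each result dict was built, and it ignores keys that have no matching column.
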
### Using the Data Validation Framework

Let's use our data validation framework to check the quality of our sales data:

In [0]:
# Create a reference DataFrame for referential integrity checks
# This will be a simple list of valid outlet IDs
from pyspark.sql import Row

outlet_data = [
    Row(Outlet_ID="OUT010", Region="North"),
    Row(Outlet_ID="OUT013", Region="South"),
    Row(Outlet_ID="OUT017", Region="East"),
    Row(Outlet_ID="OUT018", Region="West"),
    Row(Outlet_ID="OUT019", Region="North"),
    Row(Outlet_ID="OUT027", Region="South"),
    Row(Outlet_ID="OUT035", Region="West"),
    Row(Outlet_ID="OUT045", Region="East"),
    Row(Outlet_ID="OUT046", Region="North"),
    Row(Outlet_ID="OUT049", Region="South")
]

outlet_df = spark.createDataFrame(outlet_data)
display(outlet_df)

Outlet_ID,Region
OUT010,North
OUT013,South
OUT017,East
OUT018,West
OUT019,North
OUT027,South
OUT035,West
OUT045,East
OUT046,North
OUT049,South


In [0]:
# Now use our DataValidator class to validate the sales data
validator = DataValidator(clean_sales_df)

# 1. Range validations
validator.validate_range("Item_MRP", 0, 500, None)
validator.validate_range("Item_Weight", 0, 50, None)
validator.validate_range("Item_Visibility", 0, 0.5, None)
validator.validate_range("Item_Outlet_Sales", 0, 10000, None)

<__main__.DataValidator at 0xff5d0f388a10>

In [0]:
# 2. Uniqueness checks
validator.validate_unique(["Item_Identifier", "Outlet_Identifier"])

<__main__.DataValidator at 0xff5d0f388a10>

In [0]:
# 3. Pattern validations
# Check if Item_Identifier follows the pattern (2 letters followed by 3 digits)
validator.validate_pattern("Item_Identifier", "^[A-Z]{2}[0-9]{3}$")

<__main__.DataValidator at 0xff5d0f388a10>
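
To see what the pattern check counts, here is a plain-Python analogue that uses the standard `re` module instead of Spark. The `pattern_check` helper and the sample identifiers are hypothetical. Spark evaluates Java regular expressions, so treat this as a sketch that holds for simple patterns like this one rather than an exact equivalent.

```python
import re

PATTERN = r"^[A-Z]{2}[0-9]{3}$"  # 2 uppercase letters followed by 3 digits


def pattern_check(values, pattern):
    """Count non-null values and how many of them match the pattern in full."""
    non_null = [v for v in values if v is not None]
    matching = sum(1 for v in non_null if re.fullmatch(pattern, v))
    non_matching = len(non_null) - matching
    return {
        "total_non_null": len(non_null),
        "matching_count": matching,
        "non_matching_count": non_matching,
        "passed": non_matching == 0,
    }


# Made-up identifiers: one valid, one with 3 letters + 2 digits, one lowercase
sample_ids = ["AB123", "FDA15", "ab123", None]
result = pattern_check(sample_ids, PATTERN)
print(result)
```

As in `validate_pattern`, nulls are excluded from the totals, and a single non-matching value is enough to fail the check.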

In [0]:
# 4. Referential integrity
validator.validate_referential_integrity("Outlet_Identifier", outlet_df, "Outlet_ID")

<__main__.DataValidator at 0xff5d0f388a10>
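
The referential integrity check is a `left_anti` join between the distinct source values and the distinct target values, which amounts to a set difference. The plain-Python sketch below mirrors that logic. The `referential_integrity` helper and the `OUT999` value are hypothetical. A null source value is treated as missing here because a null key never satisfies the `==` join condition in Spark.

```python
def referential_integrity(source_values, target_values):
    """Report distinct source values that have no match among the target values."""
    source_distinct = set(source_values)
    target_distinct = set(target_values)
    # None never equals anything in a Spark join, so it always counts as missing
    missing = {v for v in source_distinct if v is None or v not in target_distinct}
    return {
        "total_distinct_source_values": len(source_distinct),
        "missing_values_count": len(missing),
        "passed": len(missing) == 0,
    }


# OUT999 is a made-up identifier that is absent from the reference list
source = ["OUT010", "OUT013", "OUT010", "OUT999"]
target = ["OUT010", "OUT013", "OUT017"]
result = referential_integrity(source, target)
print(result)
```

Because both sides are reduced to distinct values first, the counts describe how many different identifiers are unknown, not how many rows use them.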

In [0]:
# 5. Check for establishment years in a valid range
validator.validate_range("Outlet_Establishment_Year", 1900, 2023, None)

# Display validation results
validator.display_results()

# Check if all validations passed
print(f"All validations passed: {validator.all_passed()}")

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
File <command-5352384061025337>, line 5
      2 validator.validate_range("Outlet_Establishment_Year", 1900, 2023, None)
      4 # Display validation results
----> 5 validator.display_results()
      7 # Check if all validations passed
      8 print(f"All validations passed: {validator.all_passed()}")

File <command-5352384061025334>, line 197, in DataValidator.display_results(self)
[1;32m    195[0m     [38;5;28