# 1.1 Understanding Spark's Lazy Evaluation and Immutability

This notebook demonstrates the fundamental concepts of Apache Spark's lazy evaluation and immutability, which form the foundation for functional programming in PySpark.

## Learning Objectives
- Understand how lazy evaluation works in Spark
- Explore DataFrame immutability and its implications
- Learn how these concepts support functional programming patterns
- See practical examples of lazy evaluation in action

## Environment Setup

**Platform Support**: This notebook runs on both:
- 🌐 **Databricks**: Uses a pre-configured cluster with Databricks Runtime 12.2+. Keep the `%run` line in the first code cell commented out; the `spark` session is already available.
- 💻 **Local**: Uncomment `%run 00_Environment_Setup.ipynb` in the first code cell so the setup notebook can create the `spark` session.

**Table Formats**:
- Primary examples use **Delta Lake** for ACID transactions and time travel
- Alternative examples for **Apache Iceberg** are provided in comments

## Introduction to Lazy Evaluation

Apache Spark's lazy evaluation means that transformations on DataFrames (such as `filter()`, `withColumn()`, and `select()`) are not executed immediately. Instead, they build up a **Directed Acyclic Graph (DAG)** of operations that is optimized and executed only when an **action** (such as `show()`, `count()`, or `collect()`) is called.
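
To make the idea concrete before touching Spark, here is a minimal sketch in plain Python. The `LazyPipeline` class is a hypothetical toy, not Spark's implementation: it only records transformations and applies them when an action-like method (`collect`) is called.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate):
        # Transformation: record the step and return a NEW pipeline.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: record the step and return a NEW pipeline.
        return LazyPipeline(self._rows, self._steps + (("map", func),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        rows = iter(self._rows)
        for kind, func in self._steps:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)


calls = []  # records every time the predicate actually runs


def is_over_28(row):
    calls.append(row)
    return row[1] > 28


people = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
pipeline = LazyPipeline(people).filter(is_over_28).map(lambda row: row[0])

calls_before_action = len(calls)  # the predicate has not run yet
names = pipeline.collect()        # the action triggers the work
calls_after_action = len(calls)   # the predicate ran once per input row
```

Spark goes further than this toy: because it sees the whole recorded plan before running anything, it can reorder and combine steps rather than simply replaying them.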

In [None]:
# Environment Setup
# FOR LOCAL DEVELOPMENT: Uncomment the line below to run the setup notebook
# %run 00_Environment_Setup.ipynb

# FOR DATABRICKS: Keep the above line commented out
# Databricks provides pre-configured spark session and all necessary libraries

# Import common PySpark functions (works on both platforms)
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create sample data for demonstration
sample_data = [
    ("Alice", 25, "Engineer"),
    ("Bob", 30, "Manager"),
    ("Charlie", 35, "Analyst"),
    ("Diana", 28, "Designer"),
    ("Eve", 32, "Developer")
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("role", StringType(), True)
])

# Create DataFrame
df = spark.createDataFrame(sample_data, schema)
print("Original DataFrame created")
df.show()


## Demonstrating Lazy Evaluation

Let's chain multiple transformations and observe that no actual computation happens until we call an action.

In [None]:
print("=== Applying Transformations (Lazy Evaluation) ===")

# These are all transformations - they build the execution plan but don't execute
print("1. Filtering records where age > 28...")
filtered_df = df.filter(F.col("age") > 28)

print("2. Adding a new column 'senior_level'...")
with_senior_df = filtered_df.withColumn("senior_level", 
                                      F.when(F.col("age") >= 30, "Senior").otherwise("Junior"))

print("3. Selecting specific columns...")
final_df = with_senior_df.select("name", "age", "role", "senior_level")

print("4. Ordering by age...")
ordered_df = final_df.orderBy(F.col("age").desc())

print("\nAll transformations defined! But no computation has happened yet.")
print("The execution plan (DAG) has been built, but not executed.")

## Triggering Execution with Actions

Now let's trigger the actual computation by calling an action. At this point Spark's Catalyst optimizer takes the logical plan built up by the whole chain, optimizes it, and produces the physical plan that is actually run.

In [None]:
print("=== Triggering Execution with Actions ===")

# This action triggers the execution of all transformations in the chain
print("Calling show() - this is an ACTION that triggers execution:")
ordered_df.show()

print("\nAll of the transformations above have now been executed as one optimized plan!")

# Let's see the execution plan
print("\n=== Execution Plan ===")
ordered_df.explain(True)

## Understanding DataFrame Immutability

DataFrames in Spark are immutable - once created, they cannot be modified. Any operation that appears to "change" a DataFrame actually creates a new DataFrame.
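
The same contract can be sketched in plain Python. The `Frame` class below is a hypothetical stand-in, not a Spark type: it is frozen, so in-place changes fail, and its `with_column` method returns a new object instead of editing the existing one.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Frame:
    """Hypothetical immutable record that mimics the DataFrame contract."""

    columns: tuple

    def with_column(self, name):
        # Build and return a new Frame; `self` is left untouched.
        return Frame(self.columns + (name,))


original = Frame(("name", "age", "role"))
extended = original.with_column("category")

# Attempting an in-place change is rejected by the frozen dataclass.
try:
    original.columns = ("something_else",)
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

In PySpark the common idiom `df = df.withColumn(...)` does not contradict this: it rebinds the name `df` to the new DataFrame, while the earlier DataFrame itself is never altered.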

In [None]:
print("=== Demonstrating DataFrame Immutability ===")

# Original DataFrame
print("Original DataFrame:")
df.show()

# "Modify" the DataFrame by adding a column
df_with_category = df.withColumn("category", F.lit("Employee"))

print("\nDataFrame with added column:")
df_with_category.show()

print("\nOriginal DataFrame (unchanged):")
df.show()

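print("Notice: The original DataFrame remains unchanged!")
print(f"Original columns: {df.columns}")
print(f"New columns:      {df_with_category.columns}")
# id() only shows these are two separate Python objects;
# the unchanged column list above is the real evidence of immutability
print(f"Original DataFrame ID: {id(df)}")
print(f"Modified DataFrame ID: {id(df_with_category)}")
print("withColumn() returned a new DataFrame and left the original as it was.")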

## Practical Example: Building a Functional Pipeline

Let's create a more complex example that demonstrates how lazy evaluation and immutability support functional programming patterns.

In [None]:
# Create a larger dataset for more realistic demonstration
large_sample_data = [
    ("Alice", 25, "Engineer", 75000),
    ("Bob", 30, "Manager", 85000),
    ("Charlie", 35, "Analyst", 65000),
    ("Diana", 28, "Designer", 70000),
    ("Eve", 32, "Developer", 80000),
    ("Frank", 29, "Engineer", 76000),
    ("Grace", 31, "Manager", 90000),
    ("Henry", 26, "Analyst", 62000),
    ("Ivy", 33, "Designer", 72000),
    ("Jack", 27, "Developer", 78000)
]

large_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("role", StringType(), True),
    StructField("salary", IntegerType(), True)
])

employees_df = spark.createDataFrame(large_sample_data, large_schema)

print("Employee Dataset:")
employees_df.show()

## Functional Transformation Pipeline

Now let's build a functional pipeline that demonstrates how transformations compose naturally due to immutability and lazy evaluation.
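
Composition itself needs nothing Spark-specific. As a minimal sketch, the hypothetical `compose` helper below folds any number of single-argument transformations into one function; it is shown on plain lists of dictionaries so it can run without a cluster.

```python
from functools import reduce


def compose(*transforms):
    """Return one function that applies `transforms` left to right."""

    def pipeline(data):
        return reduce(lambda acc, step: step(acc), transforms, data)

    return pipeline


def add_age_group(rows):
    # Returns new dictionaries; the input rows are not modified.
    return [
        {**row, "age_group": "Young" if row["age"] < 28 else "Senior"}
        for row in rows
    ]


def keep_young(rows):
    return [row for row in rows if row["age_group"] == "Young"]


rows = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
young_only = compose(add_age_group, keep_young)
result = young_only(rows)
```

Since `DataFrame.transform` accepts any function that takes a DataFrame and returns one, a helper like this should also work with the transformation functions defined in the next cell, for example `employees_df.transform(compose(categorize_by_age, calculate_salary_band))`.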

In [None]:
def categorize_by_age(df):
    """Pure function: categorizes employees by age group"""
    return df.withColumn("age_group", 
                        F.when(F.col("age") < 28, "Young")
                         .when(F.col("age") < 32, "Mid-Career")
                         .otherwise("Senior"))

def calculate_salary_band(df):
    """Pure function: adds salary band classification"""
    return df.withColumn("salary_band",
                        F.when(F.col("salary") < 70000, "Low")
                         .when(F.col("salary") < 80000, "Medium")
                         .otherwise("High"))

def add_performance_score(df):
    """Pure function: adds a calculated performance score"""
    return df.withColumn("performance_score",
                        (F.col("salary") / 1000 + F.col("age") * 2).cast("integer"))

# Build the functional pipeline using method chaining
print("=== Building Functional Pipeline ===")

# Each transformation returns a new DataFrame (immutability)
# The entire chain is lazy-evaluated until an action is called
result_df = (employees_df
             .transform(categorize_by_age)
             .transform(calculate_salary_band) 
             .transform(add_performance_score)
             .filter(F.col("performance_score") > 100)
             .select("name", "age", "role", "age_group", "salary_band", "performance_score")
             .orderBy(F.col("performance_score").desc()))

print("Pipeline built! No execution yet - still lazy.")

## Executing the Pipeline

Now let's trigger execution and see the results. Notice how Spark optimizes the entire pipeline as one unit.
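
One consequence of laziness is easy to miss: nothing is kept between actions, so each action re-runs the plan from the source unless the DataFrame has been cached with `cache()` or `persist()`. The toy model below illustrates this in plain Python; `CountingSource` and `LazyQuery` are hypothetical classes, not Spark APIs.

```python
class CountingSource:
    """Toy data source that counts how many times it is scanned."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0

    def scan(self):
        self.scans += 1
        return list(self._rows)


class LazyQuery:
    """Toy query: stores a predicate and evaluates it only inside actions."""

    def __init__(self, source, predicate):
        self._source = source
        self._predicate = predicate

    def _run(self):
        return [row for row in self._source.scan() if self._predicate(row)]

    def show(self):
        # Action: runs the query and prints the rows.
        for row in self._run():
            print(row)

    def count(self):
        # Action: runs the query again and counts the rows.
        return len(self._run())


source = CountingSource([("Alice", 75000), ("Bob", 85000), ("Henry", 62000)])
query = LazyQuery(source, lambda row: row[1] > 70000)

scans_after_definition = source.scans  # defining the query scanned nothing
query.show()
matching = query.count()
scans_after_two_actions = source.scans  # one scan per action
```

For a ten-row DataFrame the repeated work is negligible, but on large inputs it is usually worth either caching or computing everything needed in a single action.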

In [None]:
print("=== Executing the Functional Pipeline ===")

# Trigger execution with an action
result_df.show()

print("\n=== Pipeline Execution Statistics ===")
# count() is a second action, so Spark runs the pipeline again to compute it
print(f"Number of high-performing employees: {result_df.count()}")

# Let's also see how the original DataFrame is unchanged
print("\n=== Original DataFrame (Unchanged) ===")
employees_df.show()

## Understanding the Benefits

Let's examine why lazy evaluation and immutability are powerful for functional programming:

In [None]:
print("=== Benefits Demonstration ===")

# 1. Optimization: Spark can optimize the entire chain
print("1. OPTIMIZATION:")
print("Spark's Catalyst optimizer can see the entire transformation chain and optimize it globally.")
result_df.explain()

print("\n" + "="*50)

# 2. Reusability: Pure functions can be reused
print("2. REUSABILITY:")
print("Pure transformation functions can be reused with different DataFrames:")

# Create a different dataset
test_data = [("Test1", 30, "Tester", 72000), ("Test2", 35, "QA", 68000)]
test_df = spark.createDataFrame(test_data, large_schema)

# Reuse the same transformations
reused_result = (test_df
                .transform(categorize_by_age)
                .transform(calculate_salary_band))

reused_result.show()

print("\n" + "="*50)

# 3. Composability: Functions can be easily combined
print("3. COMPOSABILITY:")
print("Transformations compose naturally due to immutability:")

# We can branch from any point in our pipeline
alternate_branch = (employees_df
                   .transform(categorize_by_age)
                   .filter(F.col("age_group") == "Young")
                   .select("name", "age", "role", "age_group"))

print("Young employees branch:")
alternate_branch.show()

## Common Patterns and Anti-Patterns

In [None]:
print("=== GOOD PATTERNS ===")

# ✅ Good: Chain transformations for readability
good_pattern = (employees_df
               .filter(F.col("salary") > 70000)
               .withColumn("bonus", F.col("salary") * 0.1)
               .select("name", "salary", "bonus")
               .orderBy("salary"))

print("Good: Chained transformations")
good_pattern.show(5)

print("\n" + "="*50)

print("=== ANTI-PATTERNS TO AVOID ===")

# ❌ Bad: Overusing actions in the middle of transformations
print("❌ Bad: Don't call actions unnecessarily in the middle of transformations")
print("Each action launches its own Spark job, so work is done more than once:")

# This is inefficient - calling show() here runs a job just to print a preview
temp_df = employees_df.filter(F.col("salary") > 70000)
print("Intermediate result (forces execution):")
temp_df.show(3)  # Action that forces execution

# Then continuing with more transformations
final_bad = temp_df.withColumn("bonus", F.col("salary") * 0.1)
print("Final result:")
final_bad.show(3)  # Second job: the filter is evaluated again (nothing was cached)

print("\nThis pattern adds an extra job and repeats work that a single action would do once!")


## Summary

**Key Takeaways:**

1. **Lazy Evaluation**: Transformations build an execution plan (DAG) without immediate execution
2. **Immutability**: DataFrames cannot be modified; operations return new DataFrames
3. **Optimization**: Spark's Catalyst optimizer can analyze and optimize the entire transformation chain
4. **Functional Programming**: These concepts naturally support pure functions and composition
5. **Best Practices**: 
   - Chain transformations for readability
   - Use actions wisely (only when results are needed)
   - Extract transformation logic into pure functions
   - Leverage immutability for predictable, testable code

**Next Steps**: In the next notebook, we'll explore how to embrace pure functions and minimize side effects in PySpark code.

## Exercise

Try creating your own functional pipeline:
1. Start with `employees_df`
2. Create 2-3 pure transformation functions
3. Chain them together using the `.transform()` method
4. Observe that no computation happens until you call an action
5. Examine the execution plan using `.explain()`

In [None]:
# Your exercise code here
# def your_transformation1(df):
#     # Your transformation logic
#     pass

# def your_transformation2(df):
#     # Your transformation logic  
#     pass

# Exercise pipeline:
# your_result = (employees_df
#               .transform(your_transformation1)
#               .transform(your_transformation2)
#               # Add more transformations
#               )

# Don't forget to trigger execution with an action!
# your_result.show()