# 2.1 Embracing Pure Functions and Minimizing Side Effects

This notebook demonstrates how to apply functional programming principles in PySpark by creating pure functions and minimizing side effects.

## Learning Objectives
- Understand what makes a function "pure" in the context of PySpark
- Learn to separate transformation logic from side effects
- Practice creating testable, reusable transformation functions
- Understand when and how to use actions wisely

## What is a Pure Function?

A **pure function** is a function that:
1. **Given the same input, always produces the same output** (deterministic)
2. **Has no observable side effects** (doesn't modify external state, perform I/O, etc.)

In the PySpark context:
- Pure functions take DataFrames as input and return DataFrames as output
- They don't call actions such as `show()` or `collect()`, and they don't write data out through `df.write`
- They don't modify global variables or external state
- They leverage Spark's immutability to ensure predictable behavior
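
Both properties can be seen without Spark at all. The following is a minimal plain-Python sketch; the names `price_with_tax`, `price_with_global_tax`, and `TAX_RATE` are hypothetical and exist only for this illustration.

```python
# Hypothetical illustration in plain Python (no Spark session needed)

TAX_RATE = 0.08  # module-level state that the impure version depends on


def price_with_tax(amount, tax_rate):
    """Pure: the result depends only on the arguments."""
    return round(amount * (1 + tax_rate), 2)


def price_with_global_tax(amount):
    """Impure: the result silently depends on the global TAX_RATE."""
    return round(amount * (1 + TAX_RATE), 2)


# The pure function gives the same answer for the same arguments every time
first = price_with_tax(100.0, 0.08)
second = price_with_tax(100.0, 0.08)

# The impure function changes its answer when the hidden state changes
before = price_with_global_tax(100.0)
TAX_RATE = 0.10
after = price_with_global_tax(100.0)

print(first == second)   # same inputs, same output
print(before == after)   # same input, different output
```

Passing the tax rate as an argument is what makes the first function predictable; the same reasoning applies to functions that take and return DataFrames.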

In [None]:
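from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Reuse the notebook's session if one already exists; otherwise create one
spark = SparkSession.builder.appName("pure-functions-demo").getOrCreate()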

# Sample sales data for demonstration
sales_data = [
    ("2023-01-15", "Alice", "Laptop", 1200.00, "Electronics"),
    ("2023-01-16", "Bob", "Coffee Maker", 89.99, "Appliances"),
    ("2023-01-17", "Charlie", "Book", 15.99, "Books"),
    ("2023-01-18", "Diana", "Headphones", 199.99, "Electronics"),
    ("2023-01-19", "Eve", "Desk Chair", 299.99, "Furniture"),
    ("2023-01-20", "Frank", "Smartphone", 699.99, "Electronics"),
    ("2023-01-21", "Grace", "Blender", 79.99, "Appliances")
]

sales_schema = StructType([
    StructField("date", StringType(), True),
    StructField("customer", StringType(), True),
    StructField("product", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("category", StringType(), True)
])

sales_df = spark.createDataFrame(sales_data, sales_schema)
print("Sample Sales Data:")
sales_df.show()

## Examples of Pure Functions vs Impure Functions

Let's examine the difference between pure and impure functions in PySpark:

In [None]:
# ✅ PURE FUNCTION EXAMPLE
def add_tax_column_pure(df, tax_rate=0.08):
    """
    Pure function: adds tax calculation to DataFrame
    - Takes DataFrame as input, returns DataFrame as output
    - No side effects (no I/O, no external state modification)
    - Deterministic: same input always produces same output
    """
    return df.withColumn("tax", F.col("amount") * F.lit(tax_rate)) \
             .withColumn("total_with_tax", F.col("amount") + F.col("tax"))

# ❌ IMPURE FUNCTION EXAMPLE
def add_tax_column_impure(df, tax_rate=0.08):
    """
    Impure function: has side effects
    - Calls show() which is an action (side effect)
    - Prints to console (side effect)
    - Makes the function harder to test and reuse
    """
    result_df = df.withColumn("tax", F.col("amount") * F.lit(tax_rate)) \
                  .withColumn("total_with_tax", F.col("amount") + F.col("tax"))
    
    # Side effects - these make the function impure!
    print(f"Applied tax rate: {tax_rate}")
    result_df.show()  # Action with side effect
    
    return result_df

print("=== Demonstrating Pure vs Impure Functions ===")
print("\n1. Pure function - no side effects:")
pure_result = add_tax_column_pure(sales_df)
print("Pure function executed - no output yet because no action called")

print("\n2. Impure function - has side effects:")
impure_result = add_tax_column_impure(sales_df)
print("Impure function executed - notice the side effects above")

## Building a Library of Pure Transformation Functions

Let's create a collection of pure functions that can be composed together:

In [None]:
# ✅ Collection of Pure Transformation Functions

def standardize_dates(df, date_column="date"):
    """
    Pure function: converts string dates to proper date format
    """
    return df.withColumn(date_column, F.to_date(F.col(date_column), "yyyy-MM-dd"))

def categorize_purchase_size(df, amount_column="amount"):
    """
    Pure function: categorizes purchases by amount
    """
    return df.withColumn("purchase_size",
                        F.when(F.col(amount_column) < 50, "Small")
                         .when(F.col(amount_column) < 200, "Medium")
                         .when(F.col(amount_column) < 500, "Large")
                         .otherwise("Extra Large"))

def add_seasonal_info(df, date_column="date"):
    """
    Pure function: adds month and season columns based on date
    (meteorological seasons, Northern Hemisphere)
    """
    return df.withColumn("month", F.month(F.col(date_column))) \
             .withColumn("season",
                        F.when(F.col("month").isin([12, 1, 2]), "Winter")
                         .when(F.col("month").isin([3, 4, 5]), "Spring")
                         .when(F.col("month").isin([6, 7, 8]), "Summer")
                         .otherwise("Fall"))

def filter_high_value_customers(df, threshold=100.0):
    """
    Pure function: keeps purchases whose amount is at or above the threshold
    """
    return df.filter(F.col("amount") >= threshold)

def calculate_customer_metrics(df):
    """
    Pure function: calculates aggregated metrics per customer
    """
    return df.groupBy("customer") \
             .agg(F.sum("amount").alias("total_spent"),
                  F.avg("amount").alias("avg_purchase"),
                  F.count("*").alias("purchase_count"),
                  F.max("amount").alias("max_purchase"))

print("Pure transformation functions defined!")
print("These functions can be composed together without side effects.")

## Composing Pure Functions

Now let's demonstrate how pure functions can be easily composed and chained together:

In [None]:
print("=== Composing Pure Functions ===")

# Method 1: Using transform() method for clean composition
# (passing extra arguments through transform() requires Spark 3.3+)
print("\n1. Using .transform() method:")
composed_pipeline = (sales_df
                    .transform(standardize_dates)
                    .transform(add_tax_column_pure)
                    .transform(categorize_purchase_size)
                    .transform(add_seasonal_info)
                    .transform(filter_high_value_customers, 100.0))

print("Pipeline built with .transform() - no execution yet!")

# Method 2: Direct function composition
print("\n2. Direct function composition:")
step1 = standardize_dates(sales_df)
step2 = add_tax_column_pure(step1)
step3 = categorize_purchase_size(step2)
step4 = add_seasonal_info(step3)
final_result = filter_high_value_customers(step4, 100.0)

print("Pipeline built with function composition - no execution yet!")

# Method 3: Nested function calls (less readable but functional)
print("\n3. Nested composition (less readable):")
nested_result = filter_high_value_customers(
    add_seasonal_info(
        categorize_purchase_size(
            add_tax_column_pure(
                standardize_dates(sales_df)
            )
        )
    ), 100.0
)

print("Pipeline built with nested calls - no execution yet!")
print("\nAll three methods produce equivalent results due to functional purity!")
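
The chaining shown above can also be captured once in a small helper. This is a sketch only: `compose_transforms` is a hypothetical name, not a PySpark API, and the example uses lists of dictionaries as stand-ins for DataFrames so that it runs without a Spark session.

```python
from functools import reduce


def compose_transforms(*transforms):
    """Return one function that applies `transforms` left to right."""
    def composed(data):
        return reduce(lambda acc, fn: fn(acc), transforms, data)
    return composed


# Plain-Python stand-ins for DataFrame transformations (hypothetical)
def add_tax(rows, tax_rate=0.08):
    return [{**row, "tax": row["amount"] * tax_rate} for row in rows]


def keep_high_value(rows, threshold=100.0):
    return [row for row in rows if row["amount"] >= threshold]


rows = [{"amount": 1200.0}, {"amount": 15.99}]

pipeline = compose_transforms(add_tax, keep_high_value)
chained = keep_high_value(add_tax(rows))

print(pipeline(rows) == chained)  # composed and hand-chained results agree
print(rows)                       # the input list is untouched
```

Because each step is an ordinary one-argument function, the same helper should accept the DataFrame functions defined earlier, for example `compose_transforms(standardize_dates, add_tax_column_pure)(sales_df)`.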

## Separating Transformation Logic from Actions

Let's demonstrate the principle of confining side effects to the boundaries of our pipeline:

In [None]:
print("=== Separating Pure Logic from Side Effects ===")

# ✅ GOOD PATTERN: Pure transformation pipeline + controlled actions
def build_sales_analysis_pipeline(df):
    """
    Pure function: builds the entire transformation pipeline
    Returns the transformed DataFrame without executing actions
    """
    return (df
            .transform(standardize_dates)
            .transform(add_tax_column_pure)
            .transform(categorize_purchase_size)
            .transform(add_seasonal_info)
            .select("date", "customer", "product", "category", 
                   "amount", "total_with_tax", "purchase_size", "season"))

def execute_pipeline_with_actions(df):
    """
    Function that handles all side effects:
    - Applies transformations using pure functions
    - Executes actions for output/persistence
    - Confines side effects to this boundary function
    """
    # Pure transformation pipeline
    processed_df = build_sales_analysis_pipeline(df)
    
    # Side effects confined to this function
    print("=== Sales Analysis Results ===")
    processed_df.show()
    
    print(f"\nTotal records processed: {processed_df.count()}")
    
    # Could add more actions here:
    # processed_df.write.mode('overwrite').parquet('/path/to/output')
    # processed_df.createOrReplaceTempView('sales_analysis')
    
    return processed_df

# Execute the pipeline
result = execute_pipeline_with_actions(sales_df)

print("\nBenefits of this pattern:")
print("- Pure transformation logic is easily testable")
print("- Side effects are explicit and controlled")
print("- Pipeline can be reused with different actions")
print("- Transformations can be composed in different ways")

## Testing Pure Functions

Pure functions are much easier to test because they have no side effects and are deterministic:

In [None]:
print("=== Testing Pure Functions ===")

# Create test data
test_data = [
    ("2023-06-15", "TestCustomer", "TestProduct", 150.0, "TestCategory")
]

test_df = spark.createDataFrame(test_data, sales_schema)

def test_categorize_purchase_size():
    """
    Test function for purchase size categorization
    Pure functions are easy to test!
    """
    # Test with known input
    result = categorize_purchase_size(test_df)
    
    # Collect result for assertion (this is OK in tests)
    collected = result.select("purchase_size").collect()
    actual_category = collected[0]["purchase_size"]
    
    expected_category = "Medium"  # 150.0 should be "Medium"
    
    assert actual_category == expected_category, f"Expected {expected_category}, got {actual_category}"
    print(f"✅ Test passed: Purchase amount 150.0 correctly categorized as '{actual_category}'")

def test_add_tax_column():
    """
    Test function for tax calculation
    """
    # Test with known tax rate
    result = add_tax_column_pure(test_df, 0.10)  # 10% tax
    
    collected = result.select("amount", "tax", "total_with_tax").collect()
    row = collected[0]
    
    expected_tax = 15.0  # 10% of 150.0
    expected_total = 165.0  # 150.0 + 15.0
    
    # Compare floats with a small tolerance rather than exact equality
    assert abs(row["tax"] - expected_tax) < 1e-9, f"Expected tax {expected_tax}, got {row['tax']}"
    assert abs(row["total_with_tax"] - expected_total) < 1e-9, f"Expected total {expected_total}, got {row['total_with_tax']}"
    
    print(f"✅ Test passed: Tax calculation correct - Tax: {row['tax']}, Total: {row['total_with_tax']}")

def test_pipeline_composition():
    """
    Test that pure functions compose correctly
    """
    # Test the full pipeline
    result = build_sales_analysis_pipeline(test_df)
    
    # Check that all expected columns are present
    expected_columns = {"date", "customer", "product", "category", 
                       "amount", "total_with_tax", "purchase_size", "season"}
    actual_columns = set(result.columns)
    
    assert expected_columns.issubset(actual_columns), f"Missing columns: {expected_columns - actual_columns}"
    print("✅ Test passed: Pipeline produces all expected columns")
    
    # Test that data transformations work end-to-end
    collected = result.collect()
    assert len(collected) == 1, "Expected 1 row in result"
    
    row = collected[0]
    assert row["season"] == "Summer", f"Expected Summer for June, got {row['season']}"
    assert row["purchase_size"] == "Medium", f"Expected Medium, got {row['purchase_size']}"
    
    print("✅ Test passed: End-to-end pipeline transformations work correctly")

# Run the tests
test_categorize_purchase_size()
test_add_tax_column()
test_pipeline_composition()

print("\n🎉 All tests passed! Pure functions are easily testable.")
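
Columns such as `tax` are computed in binary floating point, so tests on them tend to be more robust when they allow a tolerance. A minimal sketch follows; `assert_close` is a hypothetical helper built on the standard library's `math.isclose`, not something PySpark provides.

```python
import math


def assert_close(actual, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Raise AssertionError unless the two floats agree within tolerance."""
    if not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol):
        raise AssertionError(f"Expected {expected}, got {actual}")


# Exact equality can fail for values that are mathematically equal
exact_match = (0.1 + 0.2 == 0.3)

# A tolerance-based check accepts the tiny rounding difference
assert_close(0.1 + 0.2, 0.3)

print(exact_match)
```

In a test, a call like `assert_close(row["tax"], 15.0)` would take the place of an exact comparison.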

## Handling Configuration and Parameters Functionally

Let's demonstrate how to handle configuration and parameters in a functional way:

In [None]:
print("=== Functional Configuration Patterns ===")

# ✅ GOOD: Configuration as parameters (dependency injection)
from dataclasses import dataclass

@dataclass(frozen=True)
class SalesConfig:
    """
    Configuration class - immutable configuration object
    (frozen=True makes attribute assignment raise an error)
    """
    tax_rate: float = 0.08
    high_value_threshold: float = 100.0
    date_format: str = "yyyy-MM-dd"

def create_configurable_pipeline(config):
    """
    Higher-order function: returns a configured transformation pipeline
    This is a functional approach to configuration
    """
    def pipeline(df):
        return (df
                .withColumn("date", F.to_date(F.col("date"), config.date_format))
                .withColumn("tax", F.col("amount") * F.lit(config.tax_rate))
                .withColumn("total_with_tax", F.col("amount") + F.col("tax"))
                .filter(F.col("amount") >= config.high_value_threshold)
                .transform(categorize_purchase_size)
                .transform(add_seasonal_info))
    return pipeline

# Usage with different configurations
print("\n1. Standard configuration:")
standard_config = SalesConfig()
standard_pipeline = create_configurable_pipeline(standard_config)
standard_result = standard_pipeline(sales_df)
print(f"Standard pipeline created with tax rate: {standard_config.tax_rate}")
print(f"Records after filtering (threshold ${standard_config.high_value_threshold}): {standard_result.count()}")

print("\n2. High-tax configuration:")
high_tax_config = SalesConfig(tax_rate=0.15, high_value_threshold=200.0)
high_tax_pipeline = create_configurable_pipeline(high_tax_config)
high_tax_result = high_tax_pipeline(sales_df)
print(f"High-tax pipeline created with tax rate: {high_tax_config.tax_rate}")
print(f"Records after filtering (threshold ${high_tax_config.high_value_threshold}): {high_tax_result.count()}")

print("\nBenefits of functional configuration:")
print("- No global state or side effects")
print("- Easy to test with different configurations")
print("- Immutable configuration objects")
print("- Dependency injection pattern")
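
One way to guarantee that a configuration object cannot change after it is created is a frozen dataclass from the standard library. The sketch below uses a hypothetical `PipelineConfig` class to show the behavior: variants are derived with `dataclasses.replace`, and direct assignment is rejected.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical immutable configuration object."""
    tax_rate: float = 0.08
    high_value_threshold: float = 100.0


base = PipelineConfig()

# Derive a variant without touching the original
high_tax = replace(base, tax_rate=0.15)

# Attempting to mutate a frozen instance raises an error
try:
    base.tax_rate = 0.5
    mutated = True
except FrozenInstanceError:
    mutated = False

print(base.tax_rate, high_tax.tax_rate, mutated)
```

Since no caller can alter the object, a pipeline built from it keeps the same behavior for as long as it exists.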

## Common Anti-Patterns to Avoid

Let's examine common anti-patterns that violate functional principles:

In [None]:
print("=== Anti-Patterns to Avoid ===")

# ❌ ANTI-PATTERN 1: Global state modification
print("\n❌ ANTI-PATTERN 1: Global state modification")

# Global variable (bad!)
processing_stats = {"records_processed": 0, "errors": 0}

def bad_transform_with_global_state(df):
    """
    BAD: This function modifies global state (side effect)
    Makes the function impure and hard to test
    """
    global processing_stats
    
    result = df.filter(F.col("amount") > 100)
    
    # Side effect: modifying global state
    processing_stats["records_processed"] += result.count()  # Also an action!
    
    return result

print("This function modifies global state and calls actions - both bad practices!")

# ❌ ANTI-PATTERN 2: Actions inside transformation functions
print("\n❌ ANTI-PATTERN 2: Actions inside transformation functions")

def bad_transform_with_actions(df):
    """
    BAD: Calling actions inside transformation logic
    Breaks lazy evaluation and makes function impure
    """
    filtered_df = df.filter(F.col("amount") > 100)
    
    # Bad: action inside transformation function
    count = filtered_df.count()
    print(f"Filtered {count} records")  # Side effect
    
    # Bad: another action
    if count > 0:
        filtered_df.show(5)  # Side effect
    
    return filtered_df.withColumn("processed", F.lit(True))

print("This function calls count() and show() - breaking lazy evaluation!")

# ❌ ANTI-PATTERN 3: Exception handling that masks errors
print("\n❌ ANTI-PATTERN 3: Exception handling that masks errors")

def bad_transform_with_hidden_errors(df):
    """
    BAD: Swallowing exceptions makes debugging difficult
    Pure functions should be predictable
    """
    try:
        return df.withColumn("invalid_column", F.col("nonexistent_column"))
    except Exception as e:
        print(f"Error occurred: {e}")  # Side effect
        return df  # Hiding the error!

print("This function hides errors and has side effects!")

print("\n✅ BETTER PATTERNS:")
print("- Keep transformations pure (no actions, no global state)")
print("- Handle configuration through parameters")
print("- Let errors propagate naturally for better debugging")
print("- Use actions only at pipeline boundaries")
print("- Return metrics/stats as part of the result, not side effects")
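
The last point can be sketched in plain Python: instead of updating a global counter, the function returns its statistics next to the data. `filter_with_stats` is a hypothetical name, and lists of dictionaries stand in for DataFrames so the example runs without Spark. With real DataFrames, counting is an action, so the statistics would more likely be computed in the boundary function.

```python
def filter_with_stats(rows, threshold=100.0):
    """Pure: returns the kept rows together with counts describing the step."""
    kept = [row for row in rows if row["amount"] > threshold]
    stats = {"records_in": len(rows), "records_kept": len(kept)}
    return kept, stats


rows = [{"amount": 1200.0}, {"amount": 89.99}, {"amount": 199.99}]

# The caller decides what to do with the statistics; nothing global changes
kept, stats = filter_with_stats(rows)
print(stats)
```

Calling the function twice with the same rows yields the same statistics, which is not true of a function that increments a shared counter.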

## Best Practices for Actions

When you do need to use actions, here are some best practices:

In [None]:
print("=== Best Practices for Using Actions ===")

# ✅ GOOD: Consolidate actions at pipeline boundaries
def execute_sales_pipeline_with_monitoring(df, config):
    """
    Good pattern: All side effects in one place
    - Pure transformations separated from actions
    - Monitoring and logging consolidated
    - Clear separation of concerns
    """
    print("=== Starting Sales Pipeline ===")
    
    # Pure transformation pipeline
    pipeline = create_configurable_pipeline(config)
    transformed_df = pipeline(df)
    
    # Actions consolidated at the boundary
    print(f"\nInput records: {df.count()}")
    print(f"Output records: {transformed_df.count()}")
    print(f"Filtering threshold: ${config.high_value_threshold}")
    print(f"Tax rate applied: {config.tax_rate * 100}%")
    
    print("\n=== Sample Results ===")
    transformed_df.show()
    
    print("\n=== Category Distribution ===")
    transformed_df.groupBy("purchase_size").count().show()
    
    return transformed_df

# ✅ GOOD: Actions for different purposes
def demonstrate_appropriate_actions(df):
    """
    Demonstrates appropriate use of different actions
    """
    pipeline = create_configurable_pipeline(SalesConfig())
    result_df = pipeline(df)
    
    print("=== Appropriate Actions ===")
    
    # 1. count() for monitoring/logging
    record_count = result_df.count()
    print(f"1. Monitoring: {record_count} records processed")
    
    # 2. show() for development/debugging
    print("\n2. Development preview:")
    result_df.show(3)
    
    # 3. collect() for small datasets or aggregated results
    print("\n3. Collecting aggregated metrics (small result set):")
    metrics = result_df.agg(
        F.avg("amount").alias("avg_amount"),
        F.sum("total_with_tax").alias("total_revenue")
    ).collect()[0]
    
    print(f"   Average amount: ${metrics['avg_amount']:.2f}")
    print(f"   Total revenue: ${metrics['total_revenue']:.2f}")
    
    # 4. write() for persistence (simulated)
    print("\n4. Persistence action:")
    print("   result_df.write.mode('overwrite').parquet('/path/to/output')")
    print("   (Simulated - not actually writing)")
    
    return result_df

# Execute examples
config = SalesConfig(tax_rate=0.10, high_value_threshold=150.0)
result1 = execute_sales_pipeline_with_monitoring(sales_df, config)

print("\n" + "="*60)
result2 = demonstrate_appropriate_actions(sales_df)

## Summary

**Key Takeaways:**

1. **Pure Functions**: 
   - Take DataFrames as input, return DataFrames as output
   - No side effects (no actions, no global state modification)
   - Deterministic and easily testable

2. **Side Effects Management**:
   - Confine side effects to pipeline boundaries
   - Separate transformation logic from actions
   - Use actions wisely and purposefully

3. **Functional Composition**:
   - Pure functions compose naturally
   - Use `.transform()` for clean chaining
   - Configuration through parameters, not global state

4. **Testing Benefits**:
   - Pure functions are easy to unit test
   - No mocking required for core logic
   - Predictable behavior

**Next Steps**: In the next notebook, we'll explore PySpark's built-in functions and higher-order functions to maximize performance while maintaining functional purity.

## Exercise

Create your own pure transformation functions:

1. Write a pure function that adds a "discount" column based on purchase amount
2. Write a pure function that categorizes customers by their purchase frequency
3. Compose these functions with existing ones in a pipeline
4. Write tests for your pure functions
5. Create a boundary function that executes actions on your pipeline

In [None]:
# Your exercise code here

def add_discount_column(df, discount_rate=0.05):
    """
    Your pure function: Add discount calculation
    """
    # Your implementation here
    pass

def categorize_customer_frequency(df):
    """
    Your pure function: Categorize customers by purchase frequency
    """
    # Your implementation here
    pass

def test_your_functions():
    """
    Test your pure functions
    """
    # Your tests here
    pass

def execute_your_pipeline(df):
    """
    Boundary function with actions
    """
    # Your pipeline execution here
    pass

# Run your exercise
# test_your_functions()
# execute_your_pipeline(sales_df)