# 5.4 Advanced Cluster Configuration and Tuning

This notebook explores platform-level optimizations for PySpark workloads, focusing on cluster configuration, Spark tuning parameters, and Databricks-specific features that enhance functional programming performance.

## Learning Objectives

By the end of this notebook, you will understand how to:
- Optimize Spark configuration parameters for functional workloads
- Leverage Adaptive Query Execution (AQE) for automatic optimization
- Enable and utilize Photon acceleration
- Size clusters appropriately for workload characteristics
- Configure memory and resource allocation
- Monitor and tune performance with Spark UI
- Apply autoscaling strategies for cost optimization
- Clean up legacy Spark configurations

## Prerequisites

- Understanding of PySpark fundamentals
- Familiarity with Spark execution model
- Knowledge of functional programming patterns
- Basic understanding of distributed computing

In [2]:
# Essential imports
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import *
import pyspark.sql.functions as F
from typing import Dict, List, Tuple, Optional, Any
import json
from dataclasses import dataclass, field
import time

# Initialize Spark session
try:
    spark
    print("✅ Using existing Spark session")
except NameError:
    spark = SparkSession.builder \
        .appName("ClusterConfiguration") \
        .getOrCreate()
    print("✅ Created new Spark session")

print(f"\nSpark Version: {spark.version}")
print(f"Spark Deployment Mode: {spark.sparkContext.master}")
print(f"Application Name: {spark.sparkContext.appName}")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/10/27 22:32:13 WARN Utils: Your hostname, Tarski.local, resolves to a loopback address: 127.0.0.1; using 10.0.0.28 instead (on interface en0)
25/10/27 22:32:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/27 22:32:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Created new Spark session

Spark Version: 4.0.0
Spark Deployment Mode: local[*]
Application Name: ClusterConfiguration


## 1. Understanding Spark Configuration Hierarchy

Spark configurations can be set at multiple levels with different precedence:

### Configuration Precedence (Highest to Lowest)

1. **Runtime Configuration** (via `spark.conf.set()`)
   - Set programmatically in notebooks or applications
   - Overrides other settings for runtime-modifiable options (mostly `spark.sql.*`)
   - Static settings such as `spark.executor.memory` are fixed once the session starts

2. **Cluster Configuration** (Databricks cluster UI)
   - Set when creating/configuring cluster
   - Applies to all jobs on that cluster
   - Persists across restarts

3. **Application Configuration** (via `spark-submit` or SparkSession builder)
   - Set when creating SparkSession
   - Job-specific settings

4. **Spark Defaults** (spark-defaults.conf)
   - System-wide defaults
   - Lowest precedence
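
The precedence order listed above can be modeled as a layered lookup. The sketch below is a toy model, not Spark's actual resolution code: the layer names and sample values are assumptions chosen for illustration, and `resolve` is a hypothetical helper that returns the value from the highest-precedence layer that defines a key.

```python
from collections import ChainMap
from typing import Dict, Optional

# Hypothetical values for each layer, ordered as in the list above
runtime_conf: Dict[str, str] = {"spark.sql.shuffle.partitions": "400"}
cluster_conf: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "300",
    "spark.executor.memory": "16g",
}
application_conf: Dict[str, str] = {"spark.executor.memory": "8g"}
spark_defaults: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
}

def resolve(key: str, *layers: Dict[str, str]) -> Optional[str]:
    """Return the value from the first (highest-precedence) layer that sets key."""
    # ChainMap searches its mappings left to right, so earlier layers win
    return ChainMap(*layers).get(key)

layers = (runtime_conf, cluster_conf, application_conf, spark_defaults)
print(resolve("spark.sql.shuffle.partitions", *layers))  # set in three layers
print(resolve("spark.executor.memory", *layers))         # set in two layers
print(resolve("spark.sql.adaptive.enabled", *layers))    # only in defaults
```

A key that no layer defines resolves to `None`, which mirrors the "Not Set" case handled later in this notebook.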

In [None]:
@dataclass(frozen=True)
class SparkConfig:
    """Immutable Spark configuration snapshot"""
    name: str
    value: str
    category: str
    is_default: bool
    description: str = ""

def get_current_configurations(category_filter: Optional[str] = None) -> List[SparkConfig]:
    """
    Retrieve current Spark configurations from the active session.
    Returns a sorted list of immutable configuration snapshots.
    """
    all_configs = spark.sparkContext.getConf().getAll()
    
    configs = []
    for key, value in all_configs:
        # Categorize configuration
        if "memory" in key.lower():
            category = "Memory"
        elif "shuffle" in key.lower():
            category = "Shuffle"
        elif "adaptive" in key.lower() or "aqe" in key.lower():
            category = "AQE"
        elif "executor" in key.lower():
            category = "Executor"
        elif "driver" in key.lower():
            category = "Driver"
        elif "sql" in key.lower():
            category = "SQL"
        else:
            category = "Other"
        
        if category_filter is None or category == category_filter:
            configs.append(SparkConfig(
                name=key,
                value=value,
                category=category,
                is_default=False  # Would need to compare with defaults
            ))
    
    return sorted(configs, key=lambda c: (c.category, c.name))

# Display current configurations by category
print("="*80)
print("CURRENT SPARK CONFIGURATIONS")
print("="*80)

for category in ["Memory", "Executor", "AQE", "Shuffle"]:
    configs = get_current_configurations(category)
    if configs:
        print(f"\n{category} Configurations ({len(configs)}):")
        for config in configs[:5]:  # Show first 5
            print(f"  {config.name}: {config.value}")
        if len(configs) > 5:
            print(f"  ... and {len(configs) - 5} more")

## 2. Critical Spark Configuration Parameters

Let's explore the most important Spark configurations for functional PySpark workloads.

In [3]:
print("="*80)
print("CRITICAL SPARK CONFIGURATION PARAMETERS")
print("="*80)

@dataclass
class ConfigRecommendation:
    """Configuration parameter with recommendations"""
    parameter: str
    current_value: str
    recommended_value: str
    purpose: str
    impact: str
    category: str
    
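    def is_optimal(self) -> bool:
        # Exact string match: range or annotated recommendations
        # (e.g. "200-400") never match and are flagged for manual review
        return self.current_value == self.recommended_value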
    
    def __str__(self) -> str:
        status = "✅" if self.is_optimal() else "⚠️"
        return (
            f"{status} {self.parameter}\n"
            f"   Current: {self.current_value}\n"
            f"   Recommended: {self.recommended_value}\n"
            f"   Purpose: {self.purpose}\n"
            f"   Impact: {self.impact}"
        )

# Define critical configurations with recommendations
critical_configs = [
    {
        "parameter": "spark.sql.adaptive.enabled",
        "recommended": "true",
        "purpose": "Enable Adaptive Query Execution for automatic optimization",
        "impact": "Automatic partition tuning, join strategy optimization, skew handling",
        "category": "AQE"
    },
    {
        "parameter": "spark.sql.adaptive.coalescePartitions.enabled",
        "recommended": "true",
        "purpose": "Automatically reduce partition count after shuffle",
        "impact": "Prevents too many small tasks, improves performance",
        "category": "AQE"
    },
    {
        "parameter": "spark.sql.adaptive.skewJoin.enabled",
        "recommended": "true",
        "purpose": "Automatically handle skewed joins",
        "impact": "Splits skewed partitions, balances workload",
        "category": "AQE"
    },
    {
        "parameter": "spark.sql.shuffle.partitions",
        "recommended": "200-400",
        "purpose": "Number of partitions for shuffle operations",
        "impact": "Balance between parallelism and overhead (AQE auto-tunes this)",
        "category": "Shuffle"
    },
    {
        "parameter": "spark.sql.autoBroadcastJoinThreshold",
        "recommended": "10485760 (10MB)",
        "purpose": "Maximum table size for broadcast joins",
        "impact": "Larger = more broadcast joins (faster), but OOM risk if too large",
        "category": "Join"
    },
    {
        "parameter": "spark.sql.execution.arrow.pyspark.enabled",
        "recommended": "true",
        "purpose": "Enable Apache Arrow for pandas conversions (toPandas, createDataFrame)",
        "impact": "Much faster toPandas() and createDataFrame(pandas_df); Pandas UDFs already use Arrow",
        "category": "Python"
    },
    {
        "parameter": "spark.executor.memory",
        "recommended": "Based on workload",
        "purpose": "Memory allocated per executor",
        "impact": "Reduce spills, but too much wastes resources",
        "category": "Memory"
    },
    {
        "parameter": "spark.executor.cores",
        "recommended": "4-8",
        "purpose": "CPU cores per executor",
        "impact": "Balance parallelism with resource sharing",
        "category": "Executor"
    },
]

# Check current values and create recommendations
recommendations = []
for config_def in critical_configs:
    try:
        current = spark.conf.get(config_def["parameter"])
    except Exception:  # key is not set in this session
        current = "Not Set"
    
    recommendations.append(ConfigRecommendation(
        parameter=config_def["parameter"],
        current_value=current,
        recommended_value=config_def["recommended"],
        purpose=config_def["purpose"],
        impact=config_def["impact"],
        category=config_def["category"]
    ))

# Display recommendations by category
for category in ["AQE", "Shuffle", "Join", "Python", "Memory", "Executor"]:
    category_recs = [r for r in recommendations if r.category == category]
    if category_recs:
        print(f"\n{'='*80}")
        print(f"{category} Configuration")
        print(f"{'='*80}")
        for rec in category_recs:
            print(f"\n{rec}")
            print()

# Summary
optimal_count = sum(1 for r in recommendations if r.is_optimal())
print(f"\n{'='*80}")
print(f"Configuration Health: {optimal_count}/{len(recommendations)} optimal")
print(f"{'='*80}")

CRITICAL SPARK CONFIGURATION PARAMETERS

AQE Configuration

✅ spark.sql.adaptive.enabled
   Current: true
   Recommended: true
   Purpose: Enable Adaptive Query Execution for automatic optimization
   Impact: Automatic partition tuning, join strategy optimization, skew handling


✅ spark.sql.adaptive.coalescePartitions.enabled
   Current: true
   Recommended: true
   Purpose: Automatically reduce partition count after shuffle
   Impact: Prevents too many small tasks, improves performance


✅ spark.sql.adaptive.skewJoin.enabled
   Current: true
   Recommended: true
   Purpose: Automatically handle skewed joins
   Impact: Splits skewed partitions, balances workload


Shuffle Configuration

⚠️ spark.sql.shuffle.partitions
   Current: 200
   Recommended: 200-400
   Purpose: Number of partitions for shuffle operations
   Impact: Balance between parallelism and overhead (AQE auto-tunes this)


Join Configuration

⚠️ spark.sql.autoBroadcastJoinThreshold
   Current: 10485760b
   Recommended: 1
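
The ⚠️ next to `spark.sql.shuffle.partitions` in the output above (current `200`, recommended `200-400`) is a side effect of `is_optimal()` comparing strings exactly. A more forgiving comparison could look like the sketch below. `matches_recommendation` and `_leading_int` are hypothetical helpers, not part of the notebook's classes, and the parsing rules are assumptions: integer ranges written as `low-high`, and values that start with a number followed by an optional suffix or annotation. Units are ignored, so both sides are assumed to use the same unit.

```python
import re
from typing import Optional

def _leading_int(text: str) -> Optional[int]:
    """Return the leading integer in text (e.g. '10485760b' -> 10485760), or None."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None

def matches_recommendation(current: str, recommended: str) -> bool:
    """Compare a current config value against a recommendation string."""
    # Case 1: an integer range such as "200-400"
    range_match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", recommended)
    if range_match:
        value = _leading_int(current)
        low, high = int(range_match.group(1)), int(range_match.group(2))
        return value is not None and low <= value <= high
    # Case 2: both sides start with a number, e.g. "10485760b" vs "10485760 (10MB)"
    cur_num, rec_num = _leading_int(current), _leading_int(recommended)
    if cur_num is not None and rec_num is not None:
        return cur_num == rec_num
    # Case 3: fall back to case-insensitive string equality
    return current.strip().lower() == recommended.strip().lower()

print(matches_recommendation("200", "200-400"))
print(matches_recommendation("10485760b", "10485760 (10MB)"))
print(matches_recommendation("Not Set", "4-8"))
```

Free-text recommendations such as "Based on workload" still fall through to string equality and will be reported as non-optimal, which is a reasonable prompt for manual review.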

## 3. Adaptive Query Execution (AQE) Deep Dive

AQE is a runtime optimization framework that dynamically adjusts query plans based on actual runtime statistics.

AQE provides three major optimizations:

### 1. 🔧 DYNAMIC PARTITION COALESCING

- Automatically reduces number of partitions after shuffle
- Prevents too many small tasks
- Adjusts based on actual data size
   
- Configuration:
    ```
    spark.sql.adaptive.coalescePartitions.enabled = true
    spark.sql.adaptive.advisoryPartitionSizeInBytes = 64MB (target size)
    ```
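
To make the coalescing idea concrete, the sketch below greedily merges adjacent shuffle partitions until the next one would push a group past the advisory size. It is a simplified illustration under stated assumptions, not Spark's implementation, which also considers minimum partition sizes and parallelism settings. `coalesce_partitions` and the sample sizes are hypothetical.

```python
from typing import List

MB = 1024 * 1024

def coalesce_partitions(sizes: List[int], advisory_size: int = 64 * MB) -> List[List[int]]:
    """Group adjacent partition indices so each group stays near advisory_size."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        # Start a new group if adding this partition would exceed the target
        if current and current_bytes + size > advisory_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# Six post-shuffle partitions of uneven size (hypothetical numbers)
sizes = [10 * MB, 20 * MB, 30 * MB, 50 * MB, 5 * MB, 70 * MB]
print(coalesce_partitions(sizes))  # → [[0, 1, 2], [3, 4], [5]]
```

A partition that is already larger than the advisory size, like the last one here, ends up in a group of its own; coalescing merges small partitions but does not split large ones.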

### 2. 🔀 DYNAMIC JOIN STRATEGY SWITCHING

- Converts sort-merge joins to broadcast joins at runtime
- Based on actual table sizes (not estimates)
- Avoids expensive shuffles when possible
   
- Configuration:
    ```
    spark.sql.adaptive.autoBroadcastJoinThreshold = 10MB
    ```
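
The decision behind join strategy switching can be reduced to a size check against the threshold. The sketch below is a deliberately small model of that check, not Spark's planner logic; `choose_join_strategy` and the sample sizes are hypothetical.

```python
MB = 1024 * 1024

def choose_join_strategy(smaller_side_bytes: int, broadcast_threshold: int = 10 * MB) -> str:
    """Pick a join strategy from the size of the smaller join side."""
    # A threshold of -1 disables broadcast joins in Spark
    if broadcast_threshold < 0:
        return "sort-merge"
    return "broadcast" if smaller_side_bytes <= broadcast_threshold else "sort-merge"

# Planning-time estimate: the dimension table looks large
print(choose_join_strategy(500 * MB))  # → sort-merge
# Runtime statistics after a selective filter: only 8MB remain
print(choose_join_strategy(8 * MB))    # → broadcast
```

The two calls show why runtime statistics matter: the same join gets a cheaper strategy once the measured size replaces the estimate.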

### 3. ⚖️ SKEWED JOIN OPTIMIZATION

- Detects skewed partitions during shuffle
- Splits large partitions into smaller chunks
- Duplicates smaller table data for split partitions
   
- Configuration:
    ```
    spark.sql.adaptive.skewJoin.enabled = true
    spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5
    spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256MB
    ```
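
The two skew settings work together: a partition counts as skewed only if it is larger than the factor times the median partition size and also larger than the byte threshold. The sketch below applies that rule to a list of partition sizes. `find_skewed_partitions` and the sample sizes are hypothetical, and Spark's own statistics gathering is more involved than a plain median.

```python
from statistics import median
from typing import List

MB = 1024 * 1024

def find_skewed_partitions(
    sizes: List[int],
    factor: float = 5.0,
    threshold: int = 256 * MB,
) -> List[int]:
    """Return indices of partitions that satisfy both skew conditions."""
    if not sizes:
        return []
    median_size = median(sizes)
    return [
        index
        for index, size in enumerate(sizes)
        if size > factor * median_size and size > threshold
    ]

# Median is 115MB, so the factor condition requires more than 575MB
sizes = [100 * MB, 120 * MB, 90 * MB, 110 * MB, 900 * MB, 300 * MB]
print(find_skewed_partitions(sizes))  # → [4]
```

The 300MB partition exceeds the byte threshold but not five times the median, so only the 900MB partition is flagged. Requiring both conditions keeps uniformly large partitions from being treated as skew.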

In [9]:
# Check AQE status
def check_aqe_status() -> Dict[str, Any]:
    """Read AQE settings from the active session (second argument is only a fallback)"""
    aqe_configs = {
        "enabled": spark.conf.get("spark.sql.adaptive.enabled", "false"),
        "coalesce_partitions": spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled", "false"),
        "skew_join": spark.conf.get("spark.sql.adaptive.skewJoin.enabled", "false"),
        "advisory_partition_size": spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB"),
        "broadcast_threshold": spark.conf.get("spark.sql.adaptive.autoBroadcastJoinThreshold", "10MB"),
    }
    
    return aqe_configs

print("\nCurrent AQE Configuration:")
print("="*80)
aqe_status = check_aqe_status()
for key, value in aqe_status.items():
    status = "✅" if value in ["true", "True"] or "MB" in str(value) else "⚠️"
    print(f"{status} {key}: {value}")

print("\n" + "="*80)
print("💡 Recommendation: Always enable AQE for functional PySpark workloads")
print("   It provides automatic optimization without code changes!")
print("="*80)


Current AQE Configuration:
⚠️ enabled: false
⚠️ coalesce_partitions: false
⚠️ skew_join: false
✅ advisory_partition_size: 64MB
✅ broadcast_threshold: 10MB

💡 Recommendation: Always enable AQE for functional PySpark workloads
   It provides automatic optimization without code changes!


## 4. Photon Acceleration

Photon is Databricks' vectorized query engine written in C++, providing significant performance improvements for SQL operations.

📊 PERFORMANCE BENEFITS:
- 2-10x faster for SQL aggregations and joins
- 3-5x faster for Delta Lake operations
- Improved resource utilization
- Better performance on built-in functions

✅ WHAT PHOTON ACCELERATES:
- Built-in SQL functions (F.col, F.sum, F.avg, etc.)
- Aggregations (groupBy, agg)
- Joins (all types)
- Window functions
- Delta Lake reads and writes
- Parquet operations
- Filters and projections

❌ WHAT PHOTON DOES NOT ACCELERATE:
- Python UDFs (use Pandas UDFs with Arrow instead)
- RDD operations
- Non-SQL operations
- Third-party formats (without Delta/Parquet)

🎯 BEST PRACTICES:
1. Enable Photon for all production workloads
2. Use built-in functions instead of UDFs
3. Write to Delta Lake format for best performance
4. Monitor Photon usage in Spark UI

🔧 ENABLING PHOTON:
- Databricks Runtime 9.1 LTS or higher
- Select "Photon Acceleration" when creating cluster
- Available on AWS, Azure, and GCP
- No code changes required

💰 COST CONSIDERATIONS:
- ~20% higher DBU cost
- But 2-10x faster execution
- Net result: lower total cost whenever the speedup outweighs the DBU premium
- Faster results = better productivity
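
The cost trade-off above is simple arithmetic: if cost scales with DBU rate times runtime, then relative cost is `(1 + premium) / speedup`. The sketch below works through it with the figures quoted in this section. Both function names are hypothetical, and the model assumes the premium applies to the whole bill, which ignores cloud VM charges billed separately.

```python
def relative_cost(dbu_premium: float, speedup: float) -> float:
    """Cost relative to the non-Photon baseline: (1 + premium) / speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 + dbu_premium) / speedup

def break_even_speedup(dbu_premium: float) -> float:
    """Speedup at which the accelerated job costs the same as the baseline."""
    return 1.0 + dbu_premium

# Figures from this section: ~20% premium, 2-10x faster
print(round(relative_cost(0.20, 2.0), 2))   # → 0.6
print(round(relative_cost(0.20, 10.0), 2))  # → 0.12
print(break_even_speedup(0.20))             # → 1.2
```

Under these assumptions, any job that runs more than 1.2x faster comes out cheaper; a job dominated by Python UDFs, which Photon does not accelerate, may not reach that point.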

In [4]:
# Check if Photon is available/enabled
def check_photon_status() -> Dict[str, str]:
    """Check Photon availability and status"""
    try:
        # Photon-specific configs (may not be present in all runtimes)
        photon_enabled = spark.conf.get("spark.databricks.photon.enabled", "unknown")
        runtime_version = spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "unknown")
        
        return {
            "photon_enabled": photon_enabled,
            "runtime_version": runtime_version,
            "status": "Available" if photon_enabled == "true" else "Not Enabled"
        }
    except Exception:
        return {
            "photon_enabled": "unknown",
            "runtime_version": "unknown",
            "status": "Cannot determine (may not be in Databricks environment)"
        }

print("\nPhoton Status:")
print("="*80)
photon_status = check_photon_status()
for key, value in photon_status.items():
    print(f"{key}: {value}")

print("\n" + "="*80)
print("💡 Key Insight: Photon + Functional Programming = Maximum Performance")
print("   Built-in functions are dramatically faster with Photon!")
print("="*80)


Photon Status:
photon_enabled: unknown
runtime_version: unknown
status: Not Enabled

💡 Key Insight: Photon + Functional Programming = Maximum Performance
   Built-in functions are dramatically faster with Photon!


## 5. Cluster Sizing and Resource Allocation

Proper cluster sizing is critical for both performance and cost optimization.

In [7]:
print("="*80)
print("CLUSTER SIZING GUIDELINES")
print("="*80)

@dataclass
class ClusterRecommendation:
    """Cluster sizing recommendation"""
    workload_type: str
    data_size_gb: float
    instance_type: str
    min_workers: int
    max_workers: int
    executor_memory: str
    executor_cores: int
    rationale: str
    
    def __str__(self) -> str:
        return (
            f"Workload: {self.workload_type}\n"
            f"  Data Size: {self.data_size_gb}GB\n"
            f"  Instance Type: {self.instance_type}\n"
            f"  Workers: {self.min_workers}-{self.max_workers} (autoscaling)\n"
            f"  Executor Memory: {self.executor_memory}\n"
            f"  Executor Cores: {self.executor_cores}\n"
            f"  Rationale: {self.rationale}"
        )

recommendations = [
    ClusterRecommendation(
        workload_type="Interactive Development",
        data_size_gb=10,
        instance_type="Standard_DS3_v2 (AWS: m5.xlarge)",
        min_workers=1,
        max_workers=4,
        executor_memory="8g",
        executor_cores=4,
        rationale="Small cluster for development, fast startup, autoscale for larger queries"
    ),
    ClusterRecommendation(
        workload_type="ETL Pipeline (Medium)",
        data_size_gb=500,
        instance_type="Memory Optimized: Standard_E8s_v3 (AWS: r5.2xlarge)",
        min_workers=4,
        max_workers=16,
        executor_memory="16g",
        executor_cores=4,
        rationale="Balance memory and compute, autoscale based on load"
    ),
    ClusterRecommendation(
        workload_type="ETL Pipeline (Large)",
        data_size_gb=5000,
        instance_type="Compute Optimized: Standard_F16s_v2 (AWS: c5.4xlarge)",
        min_workers=8,
        max_workers=32,
        executor_memory="16g",
        executor_cores=8,
        rationale="High parallelism for large data processing"
    ),
    ClusterRecommendation(
        workload_type="Streaming",
        data_size_gb=100,
        instance_type="Memory Optimized: Standard_E16s_v3 (AWS: r5.4xlarge)",
        min_workers=4,
        max_workers=8,
        executor_memory="32g",
        executor_cores=4,
        rationale="Stable worker count, memory for state management"
    ),
    ClusterRecommendation(
        workload_type="Machine Learning",
        data_size_gb=1000,
        instance_type="GPU Optimized: Standard_NC6s_v3 (AWS: p3.2xlarge)",
        min_workers=2,
        max_workers=8,
        executor_memory="32g",
        executor_cores=6,
        rationale="GPU acceleration for ML, memory for feature caching"
    ),
]

for rec in recommendations:
    print(f"\n{rec}")
    print("\n" + "-"*80)

print("\n" + "="*80)
print("CLUSTER SIZING PRINCIPLES")
print("="*80)

print("""
1. 🎯 MATCH INSTANCE TYPE TO WORKLOAD:
   • Compute Optimized: CPU-intensive transformations, joins
   • Memory Optimized: Large aggregations, caching, ML
   • Storage Optimized: Heavy I/O, data loading
   • GPU Optimized: Deep learning, image processing

2. 📊 SIZE FOR DATA VOLUME:
   • Rule of thumb: 1 core per 1-2GB of data
   • Target: 128-200MB per partition
   • Memory: 3-4x input data size for complex operations

3. ⚡ LEVERAGE AUTOSCALING:
   • Set min for baseline performance
   • Set max for cost control
   • Scale up quickly, scale down slowly
   • Monitor utilization to adjust

4. 💰 COST OPTIMIZATION:
   • Larger clusters often cheaper per-job (faster completion)
   • Use spot/preemptible instances for fault-tolerant workloads
   • Terminate idle clusters automatically
   • Use cluster pools for faster startup

5. 🔧 EXECUTOR CONFIGURATION:
   • Executor cores: 4-8 (sweet spot for most workloads)
   • Executor memory: Leave 10% overhead for Spark
   • Number of executors: (total_cores / executor_cores) - 1 (driver)

""")

CLUSTER SIZING GUIDELINES

Workload: Interactive Development
  Data Size: 10GB
  Instance Type: Standard_DS3_v2 (AWS: m5.xlarge)
  Workers: 1-4 (autoscaling)
  Executor Memory: 8g
  Executor Cores: 4
  Rationale: Small cluster for development, fast startup, autoscale for larger queries

--------------------------------------------------------------------------------

Workload: ETL Pipeline (Medium)
  Data Size: 500GB
  Instance Type: Memory Optimized: Standard_E8s_v3 (AWS: r5.2xlarge)
  Workers: 4-16 (autoscaling)
  Executor Memory: 16g
  Executor Cores: 4
  Rationale: Balance memory and compute, autoscale based on load

--------------------------------------------------------------------------------

Workload: ETL Pipeline (Large)
  Data Size: 5000GB
  Instance Type: Compute Optimized: Standard_F16s_v2 (AWS: c5.4xlarge)
  Workers: 8-32 (autoscaling)
  Executor Memory: 16g
  Executor Cores: 8
  Rationale: High parallelism for large data processing

-------------------------------------
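
Two of the sizing rules printed by the cell above translate directly into arithmetic: the partition count follows from the 128-200MB target, and the executor count from `(total_cores / executor_cores) - 1`. The sketch below applies both. The function names are hypothetical and the results are starting points for measurement, not guarantees.

```python
import math

def partition_count(data_size_gb: float, target_partition_mb: int = 128) -> int:
    """Partitions needed so each holds roughly target_partition_mb of data."""
    return math.ceil(data_size_gb * 1024 / target_partition_mb)

def executor_count(total_cores: int, executor_cores: int = 4) -> int:
    """Executors that fit in the cluster, reserving one slot for the driver."""
    return max(0, total_cores // executor_cores - 1)

# Medium ETL example from the table: 500GB of input
print(partition_count(500))       # → 4000
print(partition_count(500, 200))  # → 2560
# 16 workers x 4 cores = 64 cores in total
print(executor_count(64, 4))      # → 15
```

With AQE enabled, the partition count is best treated as an initial value for `spark.sql.shuffle.partitions`, since coalescing adjusts the effective number at runtime.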

## 6. Memory Configuration and Tuning

Understanding Spark's memory model is crucial for avoiding OOM errors and optimizing performance.

In [8]:
print("="*80)
print("SPARK MEMORY MODEL")
print("="*80)

print("""
Spark divides executor memory into several regions:

┌─────────────────────────────────────────────────────────────┐
│                    Executor Memory (100%)                    │
├─────────────────────────────────────────────────────────────┤
│  Reserved Memory (300MB)                                     │
│  • Fixed overhead for Spark internals                        │
├─────────────────────────────────────────────────────────────┤
│  User Memory (40% by default)                                │
│  • User data structures                                      │
│  • UDFs and custom objects                                   │
│  • RDD transformations                                       │
├─────────────────────────────────────────────────────────────┤
│  Unified Memory (60% by default)                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Execution Memory (Dynamic, ~50%)                      │  │
│  │ • Shuffles, joins, sorts, aggregations                │  │
│  │ • Spills to disk if exceeded                          │  │
│  ├───────────────────────────────────────────────────────┤  │
│  │ Storage Memory (Dynamic, ~50%)                        │  │
│  │ • Cached DataFrames                                   │  │
│  │ • Broadcast variables                                 │  │
│  │ • Can be evicted if execution needs memory            │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

KEY CONFIGURATIONS:

1. spark.executor.memory
   • Total memory per executor
   • Example: "16g" for 16GB

2. spark.memory.fraction (default: 0.6)
   • Fraction for unified memory (execution + storage)
   • Remaining goes to user memory

3. spark.memory.storageFraction (default: 0.5)
   • Fraction of unified memory protected for storage
   • Execution can borrow if storage not using it

4. spark.executor.memoryOverhead
   • Memory outside the JVM heap (VM overheads, native buffers)
   • Allocated on top of spark.executor.memory, not taken from it
   • Default: max(executorMemory * 0.10, 384MB)

""")

def calculate_memory_breakdown(executor_memory_gb: float) -> Dict[str, float]:
    """
    Pure function to calculate Spark memory allocation.
    Returns memory breakdown in GB.
    """
    reserved_mb = 300
    memory_fraction = 0.6
    storage_fraction = 0.5
    
    total_mb = executor_memory_gb * 1024
    usable_mb = total_mb - reserved_mb
    
    unified_mb = usable_mb * memory_fraction
    user_mb = usable_mb * (1 - memory_fraction)
    
    storage_mb = unified_mb * storage_fraction
    execution_mb = unified_mb * (1 - storage_fraction)
    
    overhead_mb = max(total_mb * 0.10, 384)
    
    return {
        "total_gb": executor_memory_gb,
        "reserved_gb": reserved_mb / 1024,
        "user_memory_gb": user_mb / 1024,
        "execution_memory_gb": execution_mb / 1024,
        "storage_memory_gb": storage_mb / 1024,
        "overhead_gb": overhead_mb / 1024,
    }

# Example calculations
print("\nMEMORY BREAKDOWN EXAMPLES:")
print("="*80)

for memory_size in [8, 16, 32, 64]:
    breakdown = calculate_memory_breakdown(memory_size)
    print(f"\nExecutor Memory: {memory_size}GB")
    print(f"  Reserved:  {breakdown['reserved_gb']:.2f}GB")
    print(f"  User:      {breakdown['user_memory_gb']:.2f}GB (UDFs, custom objects)")
    print(f"  Execution: {breakdown['execution_memory_gb']:.2f}GB (shuffles, joins, sorts)")
    print(f"  Storage:   {breakdown['storage_memory_gb']:.2f}GB (cache, broadcast)")
    print(f"  Overhead:  {breakdown['overhead_gb']:.2f}GB (JVM overhead)")

print("\n" + "="*80)
print("💡 Memory Tuning Tips:")
print("   • Start with defaults, only tune if seeing spills/OOM")
print("   • Monitor Spark UI Storage and Executors tabs")
print("   • For cache-heavy workloads: increase storageFraction")
print("   • For shuffle-heavy workloads: defaults usually optimal")
print("   • Use larger executors (16-32GB) for complex operations")
print("="*80)

## 7. Configuration Management Best Practices

In [None]:
print("="*80)
print("CONFIGURATION MANAGEMENT BEST PRACTICES")
print("="*80)

print("""
1. 🧹 CLEAN UP LEGACY CONFIGURATIONS

Problem: Old configurations can cause performance issues
Solution: Regularly audit and remove outdated settings

# Identify potentially problematic configs
for k, v in spark.sparkContext.getConf().getAll():
    if "deprecated" in k or "legacy" in k:
        print(f"Review: {k} = {v}")

2. 📝 DOCUMENT CONFIGURATION CHANGES

Good Practice:
• Maintain a configuration changelog
• Document why each non-default setting was changed
• Include performance test results

Example:
# configs/production.conf
# 2024-01-15: Increased shuffle partitions for 10TB daily load
# Performance improvement: 40% faster aggregations
spark.sql.shuffle.partitions=400

3. 🎯 USE ENVIRONMENT-SPECIFIC CONFIGURATIONS

Development:
• Small shuffle partitions (50-100)
• Verbose logging
• Smaller memory allocations

Production:
• Optimal shuffle partitions (200-400)
• Minimal logging
• Full resource allocation
• AQE enabled

4. 🔄 VERSION CONTROL YOUR CONFIGURATIONS

# cluster-configs/
#   dev-cluster.json
#   staging-cluster.json
#   prod-cluster.json

Track changes with git
Review configuration changes in PRs
Automate deployment with Terraform/ARM templates

5. ⚡ START WITH DEFAULTS, TUNE INCREMENTALLY

Approach:
1. Start with Spark/Databricks defaults
2. Keep AQE enabled (introduced in Spark 3.0, on by default since 3.2)
3. Monitor performance and resource usage
4. Tune one configuration at a time
5. Measure impact before moving to next

6. 📊 MONITOR CONFIGURATION EFFECTIVENESS

Metrics to track:
• Job duration
• Shuffle read/write bytes
• Spill to disk (memory/disk)
• GC time
• Task distribution

7. 🛡️ AVOID ANTI-PATTERNS

❌ Don't:
• Copy configurations blindly from StackOverflow
• Set every configuration "just in case"
• Ignore default recommendations
• Change multiple configs simultaneously
• Use extremely large/small values without testing

✅ Do:
• Understand what each configuration does
• Measure before and after changes
• Document rationale for changes
• Use AQE to auto-tune when possible
• Review Databricks best practices regularly

""")

# Utility to export current configurations
def export_spark_configurations(filename: str = "spark_config.json") -> Dict[str, Any]:
    """
    Export current Spark configurations to a JSON file and return them.
    Useful for documentation and version control.
    """
    configs = dict(spark.sparkContext.getConf().getAll())
    
    config_export = {
        "spark_version": spark.version,
        "export_timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "configurations": configs
    }
    
    # Write the snapshot so it can be committed alongside the code
    with open(filename, "w") as f:
        json.dump(config_export, f, indent=2, sort_keys=True)
    
    print(f"\nConfiguration export (first 10):")
    for key, value in list(configs.items())[:10]:
        print(f"  {key}: {value}")
    # Only report a remainder when there is one
    if len(configs) > 10:
        print(f"  ... and {len(configs) - 10} more")
    
    return config_export

# Example usage
config_export = export_spark_configurations()
print(f"\n✅ Configuration exported ({len(config_export['configurations'])} parameters)")
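
Once snapshots like `config_export["configurations"]` are kept under version control, comparing two of them shows exactly what changed between environments or over time. The sketch below is one minimal way to do that. `diff_configs` is a hypothetical helper and the two sample snapshots are made-up values for illustration.

```python
from typing import Any, Dict

def diff_configs(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Dict[str, Any]]:
    """Return keys added, removed, and changed between two config snapshots."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical snapshots from a staging and a production cluster
staging = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.legacy.timeParserPolicy": "LEGACY",
}
production = {
    "spark.sql.shuffle.partitions": "400",
    "spark.sql.adaptive.enabled": "true",
    "spark.executor.memory": "16g",
}

result = diff_configs(staging, production)
for section, entries in result.items():
    print(section, entries)
```

Each entry under `changed` holds an `(old, new)` pair, which makes the output suitable for pasting into a configuration changelog or a pull request description.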

## Summary

In this notebook, we explored advanced cluster configuration and tuning for PySpark workloads:

### Key Concepts Covered

1. **Spark Configuration Hierarchy**
   - Runtime vs cluster vs application configurations
   - Configuration precedence and override behavior
   - Critical parameters for functional workloads

2. **Adaptive Query Execution (AQE)**
   - Dynamic partition coalescing
   - Runtime join strategy optimization
   - Automatic skew handling
   - Enabled by default since Spark 3.2; keep it on unless a measurement says otherwise

3. **Photon Acceleration**
   - 2-10x performance improvements for SQL operations
   - Optimizations for built-in functions
   - Integration with functional programming patterns
   - Cost-benefit analysis

4. **Cluster Sizing Strategies**
   - Instance type selection for workload characteristics
   - Executor and core allocation
   - Autoscaling configuration
   - Cost optimization techniques

5. **Memory Management**
   - Spark memory model (execution, storage, user)
   - Memory fraction configuration
   - Avoiding OOM errors and spills
   - Overhead calculations

6. **Configuration Best Practices**
   - Clean up legacy configurations
   - Environment-specific settings
   - Version control and documentation
   - Incremental tuning approach

### Platform-Level Optimizations

These configurations enhance functional programming by:
- **Reducing manual tuning**: AQE automates many optimizations
- **Improving built-in function performance**: Photon accelerates declarative operations
- **Enabling larger datasets**: Proper memory configuration supports complex transformations
- **Cost efficiency**: Faster execution = lower total cost

### Critical Configurations for Functional PySpark

**Always Enable:**
```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```

**Tune Based on Workload:**
- `spark.sql.shuffle.partitions`: 200-400 (AQE auto-tunes)
- `spark.executor.memory`: Based on data volume and operations
- `spark.executor.cores`: 4-8 for most workloads
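
Not every setting in these lists can be changed on a running session: `spark.sql.*` options usually can, while executor sizing is fixed at cluster or session creation. The sketch below applies only the modifiable settings and returns the previous values so a change can be rolled back. `apply_runtime_configs` is a hypothetical helper that assumes `conf` behaves like `spark.conf` (with `get`, `set`, and `isModifiable`); `StandInConf` is a made-up in-memory substitute so the example runs without a cluster.

```python
from typing import Dict, Iterable, Optional

def apply_runtime_configs(conf, desired: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply modifiable settings and return their previous values for rollback."""
    previous: Dict[str, Optional[str]] = {}
    for key, value in desired.items():
        if not conf.isModifiable(key):
            print(f"Skipping {key}: set it at cluster or session creation instead")
            continue
        previous[key] = conf.get(key, None)
        conf.set(key, value)
    return previous

class StandInConf:
    """In-memory stand-in for spark.conf (illustration only)."""

    def __init__(self, values: Dict[str, str], static_keys: Iterable[str]):
        self._values = dict(values)
        self._static = set(static_keys)

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._values.get(key, default)

    def set(self, key: str, value: str) -> None:
        self._values[key] = value

    def isModifiable(self, key: str) -> bool:
        return key not in self._static

conf = StandInConf(
    {"spark.sql.shuffle.partitions": "200"},
    static_keys={"spark.executor.memory"},
)
previous = apply_runtime_configs(conf, {
    "spark.sql.shuffle.partitions": "400",
    "spark.executor.memory": "16g",
})
print(previous)  # → {'spark.sql.shuffle.partitions': '200'}
```

On a real session the call would be `apply_runtime_configs(spark.conf, {...})`. Returning the previous values fits the tune-one-thing-and-measure approach described earlier: if a change does not help, it can be reverted precisely.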

### Next Steps

- Enable AQE on all production clusters
- Evaluate Photon for your workloads
- Audit and clean legacy configurations
- Implement configuration version control
- Monitor Spark UI for optimization opportunities
- Establish baseline metrics before tuning

### Key Takeaway

Modern Spark (3.0+) with AQE and Photon provides excellent performance with minimal configuration tuning. Focus on:
1. **Writing functional, declarative code** with built-in functions
2. **Enabling platform optimizations** (AQE, Photon)
3. **Proper cluster sizing** for your workload
4. **Monitoring and incremental tuning** based on evidence

The platform handles most low-level optimizations automatically, allowing you to focus on business logic and functional design patterns.

## Exercises

Practice advanced cluster configuration and tuning.

In [None]:
print("="*80)
print("EXERCISES: Advanced Configuration and Tuning")
print("="*80)

print("""
Exercise 1: Configuration Audit
--------------------------------
Audit your current Spark configuration:

1. List all non-default configurations
2. Identify deprecated or legacy settings
3. Check if AQE is fully enabled
4. Verify Arrow is enabled for Pandas UDFs
5. Document the purpose of each custom setting

def audit_spark_config() -> Dict[str, List[str]]:
    # Your implementation
    pass

---

Exercise 2: Memory Calculation
-------------------------------
Given a workload that processes 500GB of data with complex aggregations:

Questions:
1. How much executor memory do you need?
2. How many executors should you use?
3. What should spark.executor.cores be?
4. Calculate the memory breakdown (execution, storage, user)

Assumptions:
• Target: 128MB per partition
• Executor cores: 4
• Need to cache 100GB intermediate results

---

Exercise 3: AQE Configuration
------------------------------
Design optimal AQE configuration for:

Scenario A: Streaming workload with small batches
Scenario B: Large batch ETL with multiple joins
Scenario C: Interactive analytics with ad-hoc queries

For each scenario, specify:
• AQE enabled configurations
• Advisory partition size
• Broadcast threshold
• Skew join settings

---

Exercise 4: Cluster Sizing
---------------------------
Design a cluster for each workload:

Workload 1: Daily ETL processing 2TB data
• SLA: Complete in < 2 hours
• Operations: Joins, aggregations, window functions
• Budget: Optimize for cost

Workload 2: Real-time streaming dashboard
• Latency: < 1 minute end-to-end
• Throughput: 10K events/sec
• State: 50GB

Specify:
• Instance type
• Number of workers (min/max)
• Executor memory and cores
• Key Spark configurations

---

Exercise 5: Configuration Migration
------------------------------------
You're migrating from Spark 2.4 to Spark 3.2. Update this configuration:

Old (Spark 2.4):
spark.sql.shuffle.partitions=200
spark.sql.autoBroadcastJoinThreshold=10485760
# Manual partition tuning for each job
# No skew handling

Questions:
1. What new AQE configurations should you enable?
2. Can you remove any manual tuning?
3. What performance improvements do you expect?
4. What new features can optimize your code?

---

Exercise 6: Performance Monitoring
-----------------------------------
Create a monitoring dashboard that tracks:

def create_performance_metrics() -> Dict[str, Any]:
    '''
    Collect key performance metrics:
    • Job duration
    • Shuffle read/write
    • Spill metrics
    • GC time
    • Task skew
    • AQE optimizations applied
    '''
    # Your implementation
    pass

---

Exercise 7: Cost Optimization
------------------------------
Your current cluster costs $100/hour and jobs run 8 hours/day.

Options:
A) Double cluster size (+100% cost)
B) Enable Photon (+20% cost)
C) Optimize code with AQE (no cost increase)

Expected speedups:
A) 1.8x faster
B) 3x faster
C) 1.5x faster

Questions:
1. Calculate total daily cost for each option
2. Which option provides best ROI?
3. Can you combine options? What's the impact?
4. Factor in developer time savings

""")

print("\n📝 Complete these exercises to master cluster configuration!")