
# Apache Spark Architecture - Complete Overview

## What is Apache Spark?

**Apache Spark** is a unified analytics engine for large-scale data processing.

### Key Characteristics:
- **Fast:** In-memory processing (up to 100x faster than Hadoop MapReduce on some in-memory workloads)
- **Distributed:** Processes data across multiple machines
- **General-purpose:** Batch, streaming, ML, SQL, graph processing
- **Fault-tolerant:** Recovers from failures automatically

### Why Spark?
- Process Big Data (TB to PB scale)
- Parallel processing
- In-memory computation
- Rich APIs (Python, Scala, Java, R, SQL)

---

## High-Level Spark Architecture

```text
Spark Architecture Overview
============================

           Driver Program
    (Your Application Main)
              |
              |
    +---------+---------+
    |                   |
    v                   v
Cluster Manager    SparkContext
(YARN/Mesos/K8s)   (Coordination)
    |                   |
    +-------+-----------+
            |
    +-------+-------+-------+
    |       |       |       |
    v       v       v       v
Worker   Worker  Worker  Worker
Node 1   Node 2  Node 3  Node 4
  |        |       |       |
Executor Executor Executor Executor
(Tasks)  (Tasks)  (Tasks)  (Tasks)
```

---

## Core Components

### 1. Driver Program
**What it is:**
- Your application's main program
- Contains SparkContext
- Runs the main() function

**Responsibilities:**
- Analyzes user code
- Creates execution plan (DAG)
- Schedules tasks
- Monitors execution
- Collects results

**Location:**
- Can run on client machine
- Or on cluster (cluster mode)

**Example:**
```python
# This code runs on Driver
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.parquet("/data")
result = df.count()  # Driver coordinates this
```

---

### 2. SparkContext

**What it is:**
- Entry point to Spark functionality
- Connection to Spark cluster
- Coordinates execution

**Key Functions:**
- Creates RDDs
- Broadcasts variables
- Manages configuration
- Monitors application

**In modern Spark:**
```python
# SparkSession wraps SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # Access SparkContext
```

---

### 3. Cluster Manager

**What it is:**
- External service that manages cluster resources
- Allocates resources to applications

**Types:**
1. **Standalone:** Spark's built-in manager
2. **YARN:** Hadoop's resource manager
3. **Mesos:** Apache Mesos (deprecated since Spark 3.2)
4. **Kubernetes:** Container orchestration

**Role:**
- Accept application requests
- Allocate executors to workers
- Monitor resource usage
- Handle failures

---

### 4. Worker Nodes

**What they are:**
- Physical/virtual machines in cluster
- Host executor processes

**Responsibilities:**
- Run executors
- Manage local resources (CPU, memory, disk)
- Report status to cluster manager

**Example Cluster:**
```text
Worker Node 1 (32GB RAM, 8 cores)
    - Executor 1 (8GB, 2 cores)
    - Executor 2 (8GB, 2 cores)

Worker Node 2 (32GB RAM, 8 cores)
    - Executor 3 (8GB, 2 cores)
    - Executor 4 (8GB, 2 cores)
```

---

### 5. Executors

**What they are:**
- JVM processes running on worker nodes
- Execute tasks
- Store data in memory/disk

**Responsibilities:**
- Run tasks assigned by driver
- Store computed data
- Return results to driver
- Cache data when requested

**Configuration:**
```python
spark = SparkSession.builder \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.instances", "10") \
    .getOrCreate()
```

---

## Complete Architecture Diagram

```text
Detailed Spark Architecture
============================

Client Machine / Driver Node
+----------------------------------+
|  Driver Program                  |
|  +--------------------------+    |
|  | SparkContext             |    |
|  | - DAG Scheduler          |    |
|  | - Task Scheduler         |    |
|  | - Scheduler Backend      |    |
|  +--------------------------+    |
+----------------------------------+
            |
            | Submit Application
            v
+----------------------------------+
|  Cluster Manager                 |
|  (YARN / Mesos / K8s)            |
|  - Resource Allocation           |
|  - Executor Launch               |
+----------------------------------+
            |
            | Launch Executors
            v
+----------------------------------+
|  Worker Nodes (Cluster)          |
|                                  |
|  Worker Node 1                   |
|  +----------------------------+  |
|  | Executor 1                 |  |
|  | +------------------------+ |  |
|  | | Task 1 | Task 2        | |  |
|  | +------------------------+ |  |
|  | Cache / Storage            |  |
|  +----------------------------+  |
|                                  |
|  Worker Node 2                   |
|  +----------------------------+  |
|  | Executor 2                 |  |
|  | +------------------------+ |  |
|  | | Task 3 | Task 4        | |  |
|  | +------------------------+ |  |
|  | Cache / Storage            |  |
|  +----------------------------+  |
|                                  |
|  Worker Node 3                   |
|  +----------------------------+  |
|  | Executor 3                 |  |
|  | +------------------------+ |  |
|  | | Task 5 | Task 6        | |  |
|  | +------------------------+ |  |
|  | Cache / Storage            |  |
|  +----------------------------+  |
+----------------------------------+
```

---

## Job Execution Flow

### Step-by-Step Execution

```text
Step 1: User Submits Application
=================================
spark-submit app.py
    |
    v
Driver Program Starts

Step 2: Driver Creates SparkContext
====================================
SparkContext initializes
    |
    v
Connects to Cluster Manager

Step 3: Resource Allocation
============================
Cluster Manager allocates resources
    |
    v
Launches Executors on Worker Nodes

Step 4: Code Execution
======================
Driver sends application code to Executors
    |
    v
Driver creates DAG from transformations

Step 5: Task Scheduling
========================
Driver converts DAG to stages
    |
    v
Driver creates tasks (one per partition)
    |
    v
Driver sends tasks to Executors

Step 6: Task Execution
======================
Executors run tasks in parallel
    |
    v
Process data partitions

Step 7: Results Collection
===========================
Executors send results back to Driver
    |
    v
Driver aggregates final results

Step 8: Completion
==================
Application completes
    |
    v
Resources released
```

---

## Detailed Execution Example

```python
# User code
from pyspark.sql.functions import col

df = spark.read.parquet("/data")
df = df.filter(col("age") > 30)
df = df.groupBy("city").count()
df.show()  # Action: prints the rows and returns None
```

### What Happens Internally:

```text
1. Driver: Read code, build logical plan
   Logical Plan:
   - Read /data
   - Filter age > 30
   - GroupBy city
   - Count
   - Show

2. Driver: Optimize plan (Catalyst Optimizer)
   Optimized Plan:
   - Read only needed columns
   - Push filter to source
   - Optimize groupBy

3. Driver: Create physical plan
   Physical Plan:
   - Stage 1: Read + Filter (narrow)
   - Stage 2: GroupBy + Count (wide, shuffle)

4. Driver: Create tasks
   Stage 1: 8 tasks (8 partitions)
   Stage 2: 4 tasks (4 output partitions)

5. Cluster Manager: Allocate resources
   Launch 3 executors across worker nodes

6. Driver: Send Stage 1 tasks to Executors
   Executor 1: Tasks 1, 2, 3
   Executor 2: Tasks 4, 5, 6
   Executor 3: Tasks 7, 8

7. Executors: Execute Stage 1
   Read data, apply filter, write shuffle files

8. Driver: Send Stage 2 tasks to Executors
   Executor 1: Task 1
   Executor 2: Task 2
   Executor 3: Task 3, 4

9. Executors: Execute Stage 2
   Read shuffle data, groupBy, count

10. Executors: Return results to Driver
    Driver collects and displays results
```
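
The task assignment in step 6 can be mimicked in a few lines of plain Python. This is only a toy sketch: `assign_tasks` is a hypothetical helper that splits task IDs into contiguous, near-equal chunks, whereas Spark's real scheduler also weighs data locality and which executor slots are free.

```python
from __future__ import annotations


def assign_tasks(num_tasks: int, executors: list[str]) -> dict[str, list[int]]:
    """Split task IDs 1..num_tasks into contiguous, near-equal chunks."""
    base, extra = divmod(num_tasks, len(executors))
    assignment = {}
    next_task = 1
    for i, name in enumerate(executors):
        # The first `extra` executors take one additional task each.
        size = base + (1 if i < extra else 0)
        assignment[name] = list(range(next_task, next_task + size))
        next_task += size
    return assignment


# Stage 1 from the walkthrough: 8 tasks spread over 3 executors.
stage1 = assign_tasks(8, ["Executor 1", "Executor 2", "Executor 3"])
for executor, tasks in stage1.items():
    print(executor, tasks)
```

With 8 tasks and 3 executors this reproduces the 3 / 3 / 2 split shown for Stage 1.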

---

## Driver Node vs Worker Node

### Driver Node

**Role:** Coordinator and planner

**Contains:**
- Driver Program
- SparkContext
- DAG Scheduler
- Task Scheduler

**Responsibilities:**
- Parse user code
- Create execution plan
- Schedule tasks
- Monitor progress
- Collect results

**Resources Needed:**
- Moderate CPU (for planning)
- Moderate memory (for coordination)
- Network bandwidth (for task distribution)

**Configuration:**
```bash
spark-submit \
  --driver-memory 4g \
  --driver-cores 2 \
  app.py
```

---

### Worker Node

**Role:** Executor and processor

**Contains:**
- One or more Executors
- Task execution engine
- Data cache/storage

**Responsibilities:**
- Execute tasks
- Process data
- Store intermediate results
- Return results to driver

**Resources Needed:**
- High CPU (for processing)
- High memory (for data + cache)
- Disk space (for shuffle files)

**Configuration:**
```bash
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  app.py
```

---

## Communication Flow

```text
Driver <---> Cluster Manager
    |
    +---> Executor 1 (Worker Node 1)
    |         |
    |         +---> Task execution
    |         +---> Send results
    |
    +---> Executor 2 (Worker Node 2)
    |         |
    |         +---> Task execution
    |         +---> Send results
    |
    +---> Executor 3 (Worker Node 3)
              |
              +---> Task execution
              +---> Send results

Executors <---> Executors (during shuffle)
```

---

## Job Execution Hierarchy

```text
Application
    |
    +---> Job 1 (triggered by action 1)
    |       |
    |       +---> Stage 1 (narrow transformations)
    |       |       |
    |       |       +---> Task 1 (partition 1)
    |       |       +---> Task 2 (partition 2)
    |       |       +---> Task 3 (partition 3)
    |       |
    |       +---> Stage 2 (wide transformation, shuffle)
    |               |
    |               +---> Task 1 (partition 1)
    |               +---> Task 2 (partition 2)
    |
    +---> Job 2 (triggered by action 2)
            |
            +---> Stage 1
                    |
                    +---> Tasks...
```

---

## Memory Architecture

### Driver Memory
```text
Driver Memory (4GB example; rough illustration, not a split Spark enforces)
============================================================================
+---------------------------+
| User Code Memory          | 60%
| (Application variables)   |
+---------------------------+
| Spark Internal Memory     | 30%
| (Task scheduling, DAG)    |
+---------------------------+
| Reserved Memory           | 10%
+---------------------------+
```

### Executor Memory
```text
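Executor Memory (8GB heap, default settings)
============================================
+---------------------------+
| Execution Memory          | ~30%
| (Shuffles, joins, sorts)  |
+---------------------------+
| Storage Memory            | ~30%
| (Cached data, broadcast)  |
+---------------------------+
| User Memory               | ~40%
| (User data structures)    |
+---------------------------+
| Reserved Memory           | 300MB
+---------------------------+

Percentages apply to (heap - 300MB reserved).
Execution + Storage share one unified region (spark.memory.fraction = 0.6);
spark.memory.storageFraction = 0.5 is a soft boundary between them.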
```
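
The executor heap split can be worked out with simple arithmetic. The sketch below assumes Spark's default unified memory settings (`spark.memory.fraction = 0.6`, `spark.memory.storageFraction = 0.5`, and 300 MiB of reserved memory); `executor_memory_regions` is a hypothetical helper, and the real usable heap is typically a little smaller than the configured value.

```python
RESERVED_MIB = 300  # fixed reserved memory in the unified memory model


def executor_memory_regions(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Return the approximate size of each heap region in MiB."""
    usable = heap_mib - RESERVED_MIB
    unified = usable * memory_fraction      # shared by execution and storage
    storage = unified * storage_fraction    # soft boundary, can be borrowed
    return {
        "reserved": RESERVED_MIB,
        "storage": storage,
        "execution": unified - storage,
        "user": usable - unified,
    }


regions = executor_memory_regions(8 * 1024)  # the 8GB executor example
for name, mib in regions.items():
    print(f"{name:>9}: {mib:8.1f} MiB")
```

Because the boundary between execution and storage is soft, either side can borrow from the other when it is idle, so treat these numbers as a starting estimate rather than hard limits.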

---

## Deployment Modes

### 1. Client Mode
```text
Client Machine (Driver)
    |
    | Task distribution
    v
Cluster (Executors only)
```
- Driver runs on client machine
- Good for interactive analysis
- Requires client to stay connected

### 2. Cluster Mode
```text
Cluster
    |
    +---> Driver (on cluster)
    |
    +---> Executors (on cluster)
```
- Driver runs on cluster
- Good for production jobs
- Client can disconnect
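
The two modes differ only in the `--deploy-mode` flag passed to `spark-submit`. As a small sketch, the hypothetical helper below builds the command line for either mode; the `yarn` master is an assumption for illustration, and `app.py` is the placeholder script used elsewhere in these notes.

```python
import shlex


def spark_submit_command(app, deploy_mode, master="yarn"):
    """Build a spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]


# Interactive work: the driver stays on the submitting machine.
print(shlex.join(spark_submit_command("app.py", "client")))

# Production job: the driver runs inside the cluster.
print(shlex.join(spark_submit_command("app.py", "cluster")))
```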

---

## Production Architecture Example

```text
Production Spark Cluster
=========================

Driver Node (16GB RAM, 4 cores)
    - Runs Spark Driver
    - Monitors application

Worker Node 1 (64GB RAM, 16 cores)
    - Executor 1 (16GB, 4 cores)
    - Executor 2 (16GB, 4 cores)
    - Executor 3 (16GB, 4 cores)

Worker Node 2 (64GB RAM, 16 cores)
    - Executor 4 (16GB, 4 cores)
    - Executor 5 (16GB, 4 cores)
    - Executor 6 (16GB, 4 cores)

Worker Node 3 (64GB RAM, 16 cores)
    - Executor 7 (16GB, 4 cores)
    - Executor 8 (16GB, 4 cores)
    - Executor 9 (16GB, 4 cores)

Total Resources:
- 1 Driver
- 9 Executors
- 144 GB executor memory
- 36 executor cores
```

---

## Key Concepts Summary

| Component | Role | Location |
|-----------|------|----------|
| Driver | Coordinator | Driver Node |
| SparkContext | Entry point | Driver Node |
| Cluster Manager | Resource manager | Separate service |
| Worker Node | Host for executors | Cluster |
| Executor | Task executor | Worker Node |
| Task | Unit of work | Executor |

---

## Resource Allocation Strategy

### Small Job (< 100GB data)
```bash
spark-submit \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 5 \
  app.py
```

### Medium Job (100GB - 1TB data)
```bash
spark-submit \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 20 \
  app.py
```

### Large Job (> 1TB data)
```bash
spark-submit \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 5 \
  --num-executors 50 \
  app.py
```
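
To compare the three profiles, it helps to compute how many tasks each can run at once and how much executor memory it requests in total. This is a back-of-the-envelope sketch: the `profiles` dictionary simply restates the flags above, `cluster_totals` is a hypothetical helper, and it assumes one core per task (the default for `spark.task.cpus`).

```python
profiles = {
    "small": {"executor_memory_gb": 4, "executor_cores": 2, "num_executors": 5},
    "medium": {"executor_memory_gb": 8, "executor_cores": 4, "num_executors": 20},
    "large": {"executor_memory_gb": 16, "executor_cores": 5, "num_executors": 50},
}


def cluster_totals(profile):
    """Total concurrent task slots and total executor heap for one profile."""
    return {
        "task_slots": profile["executor_cores"] * profile["num_executors"],
        "executor_memory_gb": profile["executor_memory_gb"] * profile["num_executors"],
    }


totals = {name: cluster_totals(p) for name, p in profiles.items()}
for name, t in totals.items():
    print(f"{name:>6}: {t['task_slots']:3d} task slots, {t['executor_memory_gb']:3d} GB executor memory")
```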

---

## What We've Learned

### 1. What is Apache Spark
- Distributed analytics engine
- In-memory processing
- General-purpose framework

### 2. High-Level Architecture
- Driver coordinates
- Cluster Manager allocates
- Executors process
- Workers host executors

### 3. Driver Node and Worker Node
- Driver: Planning and coordination
- Worker: Execution and processing

### 4. Job Execution Flow
- Code submission
- Resource allocation
- Task creation
- Parallel execution
- Result collection

---

## Visual Summary

```text
Complete Flow Recap
===================

User Code
    |
    v
Driver (SparkContext)
    |
    +---> Builds DAG
    +---> Optimizes Plan
    +---> Creates Jobs
    +---> Divides into Stages
    +---> Generates Tasks
    |
    v
Cluster Manager
    |
    +---> Allocates Resources
    +---> Launches Executors
    |
    v
Worker Nodes
    |
    +---> Executor 1 ---> Tasks
    +---> Executor 2 ---> Tasks
    +---> Executor 3 ---> Tasks
    |
    v
Results back to Driver
    |
    v
User gets output
```

---

## Next Steps

Now that you understand the architecture, you can learn:
- Transformations and Actions
- Partitioning strategies
- Performance tuning
- Memory management
- Shuffle optimization
- Caching strategies

---

## Quick Reference

**Remember the hierarchy:**
```text
Application > Job > Stage > Task

Application = Your entire program
Job = One action
Stage = Set of transformations (divided by shuffle)
Task = Processing one partition
```

**Remember the components:**
```text
Driver = Brain (plans)
Cluster Manager = Resource allocator
Worker = Physical machine
Executor = Worker process
Task = Smallest unit of work
```

---

**Key Takeaway:** Spark distributes your code across a cluster, the Driver coordinates, and Executors do the actual work in parallel!


# Application Master Container – PySpark to JVM Flow

## Overview

Understanding how PySpark code executes is crucial for debugging and optimization in production environments.

---

## Architecture Diagram

```text
Application Master Container (Driver Node)
============================================

   User PySpark Code
   (df.count(), df.show())
           |
           v
   PySpark Main (PySpark driver)
   (Python Process)
           |
           |  Py4J (Python <-> JVM bridge)
           v
   JVM Main (Application Driver)
   (Spark Driver - Java/Scala)
           |
           v
   Spark Core Engine
   (DAG, Jobs, Stages, Tasks)
           |
           v
   Cluster Manager
   (YARN/Mesos/Kubernetes)
           |
           v
   Executors (Worker Nodes)
   (Task Execution)

============================================
```

---

## Step-by-Step Execution Flow

### Step 1: User Writes PySpark Code
```python
# User writes code in Python
df = spark.read.parquet("/data")
result = df.filter(df.age > 30).count()
```

**What happens:**
- Code is written in Python
- Runs in PySpark environment
- Uses DataFrame API

---

### Step 2: PySpark Main Process
```text
PySpark Main (Python Process)
    |
    +-> Interprets Python code
    +-> Validates DataFrame operations
    +-> Prepares instructions for JVM
```

**Key Points:**
- PySpark Main is a Python process
- Python itself CANNOT execute Spark logic
- Acts as a client/wrapper

---

### Step 3: Py4J Bridge
```text
Python Side              Py4J Bridge              JVM Side
    |                         |                      |
    |  df.count()            |                      |
    +----------------------->|                      |
    |                         |  JavaObject.count() |
    |                         +-------------------->|
    |                         |                      |
    |                         |  <-- Result (100) --|
    |  <-- Return 100 --------|                      |
```

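**What is Py4J?**
- A library that lets Python code call methods on objects living in a JVM
- Passes method calls, arguments, and object references over a local socket
- Converts simple values (numbers, strings) between Python and Java; bulk row data travels through separate channels (for example, pickled or Arrow batches), not through Py4J

**Example:**
```python
# Python code
df.count()

# PySpark forwards this through Py4J to the JVM-side object, roughly:
# df._jdf.count()
```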
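
The forwarding idea can be illustrated without Spark. The sketch below is not how Py4J is implemented (there are no sockets or JVM here); `FakeJvmDataFrame` and `PythonProxy` are hypothetical classes that only show the pattern of a thin Python handle delegating every call to an object that owns the data.

```python
class FakeJvmDataFrame:
    """Stand-in for the JVM-side object: it owns the data and does the work."""

    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)


class PythonProxy:
    """Python-side handle that forwards each method call across the 'bridge'."""

    def __init__(self, target):
        self._target = target
        self.calls = []  # log of method names sent across the bridge

    def __getattr__(self, name):
        # Only reached for attributes the proxy itself does not define.
        def forward(*args):
            self.calls.append(name)
            return getattr(self._target, name)(*args)

        return forward


df = PythonProxy(FakeJvmDataFrame(rows=[{"age": 41}, {"age": 25}, {"age": 37}]))
print(df.count())  # the proxy holds no rows; the target computes the answer
print(df.calls)
```

Every call on the proxy is one round trip, which is why many tiny calls cost more than one larger call.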

---

### Step 4: JVM Main (Spark Driver)
```text
JVM Main (Spark Driver)
    |
    +-> Receives instruction from Py4J
    +-> Executes Spark Core logic
    +-> Creates execution plan
```

**Key Points:**
- This is the REAL Spark Driver
- Written in Scala/Java
- Contains all Spark optimization logic

---

### Step 5: Spark Core Engine
```text
Spark Core Engine
    |
    +-> Step 1: Build DAG (Directed Acyclic Graph)
    |
    +-> Step 2: Create Jobs (from Actions)
    |
    +-> Step 3: Divide into Stages (based on shuffles)
    |
    +-> Step 4: Break into Tasks (one per partition)
    |
    +-> Step 5: Schedule Tasks on Executors
```
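
Step 3 (dividing a job into stages at shuffle boundaries) can be sketched as a simple list split. This is a simplified model: the `WIDE_OPS` set and `split_into_stages` are illustrative assumptions, and real Spark decides boundaries from partitioning and dependencies (for example, a broadcast join does not shuffle).

```python
WIDE_OPS = {"groupBy", "join", "distinct", "orderBy"}  # simplified assumption


def split_into_stages(ops):
    """Start a new stage whenever a wide (shuffle) operation is reached."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


stages = split_into_stages(["read", "filter", "groupBy", "count"])
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {stage}")
```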

---

### Step 6: Task Execution
```text
Cluster Manager (YARN/Mesos/K8s)
    |
    +-> Executor 1 (Worker Node 1)
    |       |
    |       +-> Task 1 (Partition 1)
    |       +-> Task 2 (Partition 2)
    |
    +-> Executor 2 (Worker Node 2)
    |       |
    |       +-> Task 3 (Partition 3)
    |       +-> Task 4 (Partition 4)
    |
    +-> Executor 3 (Worker Node 3)
            |
            +-> Task 5 (Partition 5)
            +-> Task 6 (Partition 6)
```

---

## Complete Flow Example

```python
# User code
df = spark.read.parquet("/data")
result = df.filter(df.age > 30).count()
```

### Execution Steps:

**1. Python Process (PySpark Main):**
```text
User Code -> PySpark API
```

**2. Py4J Bridge:**
```text
Python df.count() -> Py4J -> Java DataFrame.count()
```

**3. JVM Spark Driver:**
```text
Receives count() instruction
Creates DAG
Identifies Action (count)
Creates 1 Job
```

**4. Spark Core:**
```text
Job 1:
  Stage 1: Read + Filter (narrow)
  Stage 2: Count aggregation (wide)
```

**5. Task Creation:**
```text
Stage 1: 8 tasks (8 partitions)
Stage 2: 1 task (final aggregation)
```

**6. Execution:**
```text
Tasks sent to Executors
Results collected back to Driver
Py4J returns result to Python
User gets: result = 1500
```
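
The two-stage shape of `count()` can be reproduced in plain Python. In this sketch the eight lists stand in for eight partitions (the ages are made-up sample values), each Stage 1 task filters and counts its own partition, and the single Stage 2 task adds up the partial counts.

```python
# Eight hypothetical partitions of "age" values.
partitions = [
    [25, 41, 33], [52, 19], [30, 31, 64], [18],
    [45, 47], [29, 22, 38], [70], [12, 36],
]


def stage1_task(ages):
    """Narrow work on one partition: filter age > 30, then count locally."""
    return sum(1 for age in ages if age > 30)


# Stage 1: one task per partition (these would run in parallel on executors).
partial_counts = [stage1_task(p) for p in partitions]

# Stage 2: a single task combines the partial counts.
result = sum(partial_counts)

print(partial_counts)
print(result)
```

Only the small partial counts cross the stage boundary, not the rows themselves.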

---

## Detailed Component Breakdown

### PySpark Main (Python Side)
**Role:** Client-side wrapper

**Responsibilities:**
- Parse Python code
- Validate DataFrame operations
- Manage Python-side objects
- Handle serialization

**Limitations:**
- Cannot execute Spark logic
- Depends on JVM for actual processing

---

### Py4J Bridge
**Role:** Communication layer

**How it works:**
```text
Python Object  <--Py4J-->  Java Object
   |                           |
   +-- Serialization -->       |
   |                           |
   |       <-- Deserialization-+
```

**What gets transferred:**
- DataFrame operations (lazy)
- Action triggers
- Configuration parameters
- Results (after execution)

---

### JVM Main (Spark Driver)
**Role:** Execution orchestrator

**Components:**
- SparkContext
- DAGScheduler
- TaskScheduler
- Catalyst Optimizer (for DataFrames)

**Key Functions:**
- Query optimization
- Execution planning
- Resource management
- Task distribution

---

## Why This Architecture?

### Advantages:
1. **Python ease of use** with **JVM performance**
2. Spark optimizations work regardless of language
3. Python developers can use Spark without learning Scala/Java
4. Catalyst optimizer works the same for all APIs

### Disadvantages:
1. **Serialization overhead** (Python <-> JVM)
2. **Extra process** (Python process)
3. **Memory duplication** (data in both Python and JVM)

---

## Performance Implications

### Where Performance is Lost:

**1. Py4J Communication:**
```python
from pyspark.sql.functions import lit

# BAD: Frequent small operations
for i in range(1000):
    df = df.withColumn(f"col_{i}", lit(i))  # 1000 Py4J calls, and the plan grows with each one

# GOOD: Single operation
df = df.select("*", *[lit(i).alias(f"col_{i}") for i in range(1000)])
```

**2. UDF (User Defined Functions):**
```python
# BAD: Python UDF (slow, serialization overhead)
from pyspark.sql.functions import col, udf
@udf("int")
def add_one(x):
    return x + 1

df = df.withColumn("new_col", add_one(col("age")))

# GOOD: Built-in functions (JVM, fast)
df = df.withColumn("new_col", col("age") + 1)
```

**3. collect() Operations:**
```python
# BAD: Bringing data to Python
data = df.collect()  # Serializes ALL data to Python
for row in data:
    print(row.age)

# GOOD: Process in Spark
df.select("age").show()
```

---

## Production Best Practices

### 1. Minimize Python UDFs
```python
from pyspark.sql.functions import col, udf, upper

# AVOID Python UDFs
@udf("string")
def format_name(name):
    return name.upper()

# USE Built-in functions
df = df.withColumn("name_upper", upper(col("name")))
```

### 2. Use Pandas UDFs for Better Performance
```python
# Pandas UDF (vectorized, faster than regular UDF)
from pyspark.sql.functions import col, pandas_udf
import pandas as pd

@pandas_udf("double")
def multiply_by_two(series: pd.Series) -> pd.Series:
    return series * 2

df = df.withColumn("doubled", multiply_by_two(col("value")))
```

### 3. Avoid collect() on Large Data
```python
# BAD
all_data = df.collect()  # Brings all data to driver

# GOOD
df.write.parquet("/output")  # Process in distributed manner
```

### 4. Use Arrow for Pandas Conversion
```python
# Enable Arrow optimization
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Faster conversion
pandas_df = spark_df.toPandas()
```

### 5. Batch Operations
```python
from pyspark.sql.functions import lit

# BAD: Multiple API calls
df = df.withColumn("col1", lit(1))
df = df.withColumn("col2", lit(2))
df = df.withColumn("col3", lit(3))

# GOOD: Single call
df = df.select(
    "*",
    lit(1).alias("col1"),
    lit(2).alias("col2"),
    lit(3).alias("col3"),
)
```

---

## Debugging Tips

### 1. Check Py4J Connection
```python
# Verify Py4J is working
print(spark._jvm)  # Should show JVM gateway object
```

### 2. Monitor Serialization
```python
# Check serialization time in Spark UI
# Go to Stage Details -> Task Metrics -> Serialization Time
```

### 3. Profile Python Code
```python
import cProfile

# Profiles only the Python side; time spent in the JVM shows up as waiting on the Py4J socket
cProfile.run('df.count()')
```

### 4. Check Driver Logs
```bash
# Driver logs show Py4J communication
# Look for: py4j.java_gateway logs
```

---

## Common Issues and Solutions

### Issue 1: Py4J Gateway Timeout
**Symptom:** `Py4JNetworkError: An error occurred while trying to connect`

**Solution:**
```python
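# Increase timeout
# Note: driver JVM options only apply if set before the driver JVM starts. In client mode,
# pass them via spark-submit (--driver-java-options) or spark-defaults.conf instead.
# Confirm that the property below is honored by your Spark/Py4J version.
spark = SparkSession.builder \
    .config("spark.driver.extraJavaOptions",
            "-Dpy4j.gateway.startup.timeout=60") \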
    .getOrCreate()
```

### Issue 2: Serialization Errors
**Symptom:** `PicklingError` or `SerializationError`

**Solution:**
```python
# Use built-in functions instead of UDFs
# Or ensure objects are serializable
```

### Issue 3: Slow Performance
**Symptom:** Job takes much longer than expected

**Solution:**
```python
# 1. Avoid Python UDFs
# 2. Use Pandas UDFs if needed
# 3. Minimize collect() calls
# 4. Enable Arrow
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```

---

## Architecture Comparison

### PySpark vs Scala Spark

```text
PySpark:
Python Code -> Py4J -> JVM -> Spark Core -> Executors

Scala Spark:
Scala Code -> JVM -> Spark Core -> Executors

Key Difference: Extra Py4J layer in PySpark
```

**Performance Impact:**
- DataFrame operations: Similar (optimized by Catalyst)
- RDD operations: Scala is faster
- UDFs: Scala is much faster
- Complex transformations: Scala is faster

---

## Memory Layout

```text
Driver Node Memory Layout
==========================

[Python Process Memory]
    - PySpark objects
    - Python variables
    - Py4J client
        |
        | (IPC - Inter Process Communication)
        v
[JVM Process Memory]
    - Spark Driver
    - Java objects
    - Execution plans
    - Metadata
```

**Important:** Data exists in BOTH processes during certain operations!

---

## Quick Reference

| Component | Language | Purpose |
|-----------|----------|---------|
| PySpark Main | Python | User interface |
| Py4J | Python/Java | Bridge |
| Spark Driver | Java/Scala | Execution planning |
| Spark Core | Scala | Core engine |
| Executors | JVM | Task execution |

---

## Summary Diagram

```text
Full PySpark Execution Flow
============================

User Python Code
    |
    v
PySpark API (Python)
    |
    v
Py4J Serialization
    |
    v
JVM Spark Driver (Scala/Java)
    |
    +-> DAG Creation
    +-> Query Optimization (Catalyst)
    +-> Job Creation
    +-> Stage Division
    +-> Task Generation
    |
    v
Cluster Manager
    |
    v
Executors (Distributed)
    |
    +-> Task Execution
    +-> Data Processing
    |
    v
Results
    |
    v
Py4J Deserialization
    |
    v
Python Results
```

---

## Key Takeaways

1. **PySpark is a wrapper** around Spark's JVM engine
2. **Py4J bridges** Python and JVM
3. **Actual execution** happens in JVM (Scala/Java)
4. **Performance loss** occurs at serialization boundaries
5. **Avoid Python UDFs** when possible
6. **Use built-in functions** for best performance
7. **DataFrame API** is optimized regardless of language

---

**Remember:** PySpark gives you Python syntax with JVM performance, but minimize crossing the Python-JVM boundary for best results!


# Lazy Evaluation & Actions in Apache Spark

## What is Lazy Evaluation?

**Lazy Evaluation** means:
Spark does NOT execute transformations immediately.

Spark only:
- Records transformations
- Builds a logical plan (DAG)
- Waits until an ACTION is called

**Execution starts ONLY when an ACTION is triggered.**

---

## Visual Representation

```text
What You Think Happens:
========================
df.select() -> Execute -> Result
df.filter() -> Execute -> Result
df.groupBy() -> Execute -> Result

What Actually Happens:
========================
df.select() -> Record in plan
df.filter() -> Record in plan
df.groupBy() -> Record in plan
df.show() -> NOW EXECUTE ALL!
```

---

## Transformations

Transformations are operations that:
- Create a new DataFrame / RDD
- Do NOT trigger execution
- Are lazily evaluated

### Examples:
```python
from pyspark.sql.functions import col

# All these are transformations (NO execution)
df.select("name", "age")
df.filter(df.age > 30)
df.withColumn("bonus", col("salary") * 0.1)
df.groupBy("department").count()
df.join(df2, "id")
```

### Key Point:
**None of these execute immediately!**

---

## Transformation Flow Example

```python
# User writes code
df1 = spark.read.parquet("/data")
df2 = df1.select("name", "salary")
df3 = df2.filter(df2.salary > 50000)
df4 = df3.groupBy("name").count()
```

### What Spark is doing internally (NO execution yet):

```text
Step 1: df1 = spark.read.parquet("/data")
    Spark: "Okay, I'll remember to read from /data"
    Status: NO data read (only file metadata is checked to get the schema)

Step 2: df2 = df1.select("name", "salary")
    Spark: "Okay, I'll remember to select these columns"
    Status: NO execution

Step 3: df3 = df2.filter(df2.salary > 50000)
    Spark: "Okay, I'll remember to filter"
    Status: NO execution

Step 4: df4 = df3.groupBy("name").count()
    Spark: "Okay, I'll remember to group and count"
    Status: STILL NO execution!
```

**Important:** At this point Spark does NOT:
- Read the data
- Compute results
- Run any tasks on the executors

Spark only remembers the steps.
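
The "remember now, run later" behavior can be imitated with a tiny class. This is a toy model, not Spark's implementation: `LazyFrame` is a hypothetical class whose transformations only append a step to a plan, while `collect()` plays the role of the action that finally runs the steps.

```python
class LazyFrame:
    """Toy DataFrame: transformations record steps, collect() executes them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of recorded (name, function) steps

    def select(self, *cols):
        step = ("select", lambda rows: [{c: r[c] for c in cols} for r in rows])
        return LazyFrame(self._rows, self._plan + (step,))

    def filter(self, predicate):
        step = ("filter", lambda rows: [r for r in rows if predicate(r)])
        return LazyFrame(self._rows, self._plan + (step,))

    def collect(self):
        # The action: only now do the recorded steps touch the data.
        rows = self._rows
        for _name, fn in self._plan:
            rows = fn(rows)
        return rows


source = [{"name": "Ana", "salary": 72000}, {"name": "Bo", "salary": 48000}]
lazy = LazyFrame(source).select("name", "salary").filter(lambda r: r["salary"] > 50000)

recorded = [name for name, _ in lazy._plan]
print(recorded)        # steps are recorded, nothing has run yet
print(lazy.collect())  # the action executes the whole plan
```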

---

## DAG (Directed Acyclic Graph)

Spark builds a DAG internally to represent the execution plan.

**DAG** = Directed Acyclic Graph = Flow of operations

```text
Logical Plan (DAG):
===================

Source Data (/data)
     |
     v
select(name, salary)
     |
     v
filter(salary > 50000)
     |
     v
groupBy(name).count()
     |
     v
[Waiting for ACTION...]
```

### What is DAG?
- **Directed:** Operations flow in one direction
- **Acyclic:** No loops, no circular dependencies
- **Graph:** Series of connected operations

---

## Actions

**Actions** are operations that:
- Trigger execution
- Return results to Driver
- OR write data to storage

### Common Actions:

| Action | What it does |
|--------|--------------|
| show() | Display rows in console |
| count() | Count number of rows |
| collect() | Bring all data to driver |
| take(n) | Bring first n rows to driver |
| first() | Get first row |
| write.parquet(path) | Write data to storage as Parquet |
| write.save(path) | Save to file system |
| foreach() | Apply function to each row (runs on the executors) |

---

## Action Triggers Execution

```python
# All transformations (no execution)
df1 = spark.read.parquet("/data")
df2 = df1.select("name", "salary")
df3 = df2.filter(df2.salary > 50000)
df4 = df3.groupBy("name").count()

# ACTION - triggers execution
df4.show()  # NOW everything executes!
```

### What happens when show() is called:

```text
1. Spark finalizes the DAG
2. Catalyst optimizer optimizes the plan
3. Creates physical execution plan
4. Divides into Jobs, Stages, Tasks
5. Executes on cluster
6. Returns results to driver
7. Displays in console
```

---

## Complete Lazy Evaluation Flow

```text
User Code Flow:
================

Transformation 1 (select)
      |
      v
Transformation 2 (filter)
      |
      v
Transformation 3 (groupBy)
      |
      v
 (NO execution yet)
      |
      v
   ACTION (show/count/write)
      |
      v
Spark starts execution
      |
      v
Optimizer runs
      |
      v
Physical plan created
      |
      v
Jobs, Stages, Tasks created
      |
      v
Execute on cluster
      |
      v
Results returned
```

---

## Why Lazy Evaluation?

### 1. Performance Optimization
Spark can see the ENTIRE plan before execution and optimize it.

**Example:**
```python
df1 = spark.read.parquet("/data")
df2 = df1.select("name", "age", "salary", "department", "city")
df3 = df2.filter(df2.age > 30)
df4 = df3.select("name", "age")  # Only need 2 columns!
df4.show()
```

**Without lazy evaluation:**
- Read all columns
- Filter
- Then select 2 columns

**With lazy evaluation (Spark optimizes):**
- Read ONLY "name" and "age" columns (column pruning)
- Apply filter
- Much faster!

### 2. Avoids Unnecessary Computation
```python
df1 = spark.read.parquet("/data")
df2 = df1.filter(df1.status == "active")
df3 = df1.filter(df1.status == "inactive")

# If only df2 is used
df2.count()  # Only df2 executes, df3 is never computed
```

### 3. Enables Query Optimization
Spark's Catalyst optimizer can apply:
- Predicate pushdown
- Column pruning
- Constant folding
- Join reordering

### 4. Reduces Disk and Memory Usage
Only necessary data is read and processed.

---

## Detailed Example with Optimization

```python
from pyspark.sql.functions import col, sum as spark_sum

# Transformations (lazy)
df = spark.read.parquet("/large_dataset")  # 100 columns, 1TB data
df = df.filter(col("year") == 2024)
df = df.filter(col("country") == "USA")
df = df.select("user_id", "sales")
df = df.groupBy("user_id").agg(spark_sum("sales"))  # Spark's sum, not Python's builtin

# Action (triggers execution)
result = df.count()
```

### What Spark Optimizes:

**Before Optimization (Naive approach):**
```text
1. Read all 100 columns, 1TB data
2. Filter by year
3. Filter by country
4. Select 2 columns
5. Group and aggregate
```

**After Optimization (Catalyst):**
```text
1. Read ONLY user_id, sales, year, country columns (column pruning)
2. Push filters to storage layer (predicate pushdown)
3. Read only 2024 USA data
4. Group and aggregate
Result: Far less data read (for example, ~10GB instead of 1TB, if the files are partitioned or carry usable statistics)
```
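
Column pruning amounts to walking the plan backwards and keeping only the columns that later steps actually use. The sketch below is a deliberately simplified model, not Catalyst: `required_columns` is a hypothetical helper, and the plan is written as `(operation, columns)` pairs for the example above.

```python
def required_columns(plan):
    """Walk the plan backwards and collect the source columns it needs."""
    needed = None  # None means "no restriction known yet" (all columns)
    for op, cols in reversed(plan):
        cols = set(cols)
        if op == "aggregate":
            needed = cols  # only grouping keys and aggregated inputs matter
        elif op == "select":
            needed = cols if needed is None else needed & cols
        elif op == "filter":
            if needed is not None:
                needed = needed | cols  # predicate columns must be read too
        else:
            raise ValueError(f"unknown operation: {op}")
    return needed


plan = [
    ("filter", ["year"]),
    ("filter", ["country"]),
    ("select", ["user_id", "sales"]),
    ("aggregate", ["user_id", "sales"]),
]
columns = sorted(required_columns(plan))
print(columns)
```

The result is the same four columns listed in the optimized plan: the two output columns plus the two columns used only by the filters.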

---

## Transformations vs Actions

### Transformations (Lazy):
```python
# None of these execute
df.select("col1")
df.filter(df.col1 > 10)
df.withColumn("new", lit(1))
df.groupBy("col1").sum("col2")
df.join(df2, "key")
df.distinct()
df.orderBy("col1")
```

### Actions (Eager):
```python
# All of these trigger execution
df.show()
df.count()
df.collect()
df.take(10)
df.first()
df.write.parquet("/output")
df.foreach(lambda x: print(x))
```

---

## Production Examples

### Example 1: Multiple Transformations, Single Action
```python
# Good: All transformations, single action
df = spark.read.parquet("/data")
df = df.filter(col("active") == True)
df = df.select("user_id", "revenue")
df = df.groupBy("user_id").agg(sum("revenue"))
df.write.parquet("/output")  # Single action

# Result: 1 job execution
```

### Example 2: Multiple Actions (Avoid)
```python
# Bad: Multiple actions
df = spark.read.parquet("/data")
df = df.filter(col("active") == True)

count = df.count()  # Action 1 - Full execution
df.show()           # Action 2 - Full execution again!
df.write.parquet("/output")  # Action 3 - Full execution again!

# Result: 3 separate job executions!
```

### Example 3: Cache for Multiple Actions
```python
# Better: Cache when multiple actions needed
df = spark.read.parquet("/data")
df = df.filter(col("active") == True)
df.cache()  # Cache the result

count = df.count()  # Action 1 - Executes and caches
df.show()           # Action 2 - Uses cache
df.write.parquet("/output")  # Action 3 - Uses cache

# Result: 1 execution, 2 cache reads
```
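
The difference between Example 2 and Example 3 can be imitated without a cluster. The sketch below is a toy model, not Spark code: `ToyFrame` is a hypothetical class that counts how often its "read + filter" pipeline is recomputed, with and without a cache.

```python
class ToyFrame:
    """Toy lazy dataset that counts how often its pipeline is recomputed."""

    def __init__(self, rows):
        self._rows = rows
        self._cached = None
        self._use_cache = False
        self.executions = 0

    def cache(self):
        self._use_cache = True  # lazy: nothing is computed yet
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.executions += 1  # simulate the full read + filter
        result = [row for row in self._rows if row["active"]]
        if self._use_cache:
            self._cached = result
        return result

    def count(self):
        return len(self._compute())

    def take(self, n):
        return self._compute()[:n]


rows = [{"id": i, "active": i % 2 == 0} for i in range(10)]

uncached = ToyFrame(rows)
uncached.count()
uncached.take(3)
uncached.count()
print(uncached.executions)  # 3

cached = ToyFrame(rows).cache()
cached.count()
cached.take(3)
cached.count()
print(cached.executions)  # 1
```

As in Spark, calling `cache()` by itself computes nothing; the first action fills the cache and later actions reuse it.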

---

## How to Check DAG

### In Spark UI:
1. Go to Spark UI (port 4040)
2. Click on "SQL" or "Jobs" tab
3. Click on a job
4. See "DAG Visualization"

### In Code:
```python
# Show the logical plans and the physical plan
df.explain(True)

# Output shows:
# == Parsed Logical Plan ==
# == Analyzed Logical Plan ==
# == Optimized Logical Plan ==
# == Physical Plan ==
```

---

## Optimization Examples

### Example 1: Filter Pushdown
```python
# Code
df = spark.read.parquet("/data")
df = df.select("name", "age", "city")
df = df.filter(df.age > 30)

# Spark optimizes to:
# Read parquet with filter age > 30 pushed down
# Read only name, age, city columns
```

### Example 2: Column Pruning
```python
# Code
df = spark.read.parquet("/data")  # 50 columns
df = df.select("col1", "col2")

# Spark optimizes to:
# Read ONLY col1 and col2 from parquet
# (Doesn't read all 50 columns)
```

### Example 3: Predicate Pushdown
```python
# Code
df = spark.read.jdbc(url, "table")
df = df.filter(col("date") == "2024-01-01")

# Spark optimizes to:
# SELECT * FROM table WHERE date = '2024-01-01'
# (Filter happens in database, not in Spark)
```

---

## Common Mistakes

### Mistake 1: Calling collect() on Large Data
```python
# BAD - Brings all data to driver
data = df.collect()  # Can crash driver!

# GOOD - Process in distributed manner
df.write.parquet("/output")
```

### Mistake 2: Multiple Actions Without Cache
```python
# BAD - Recomputes every time
df.count()
df.show()
df.write.parquet("/output")

# GOOD - Cache first
df.cache()
df.count()
df.show()
df.write.parquet("/output")
```

### Mistake 3: Unnecessary Intermediate Actions
```python
# BAD
df1 = spark.read.parquet("/data")
print(f"Count: {df1.count()}")  # Action 1
df2 = df1.filter(col("active") == True)
print(f"Count: {df2.count()}")  # Action 2
result = df2.groupBy("user").count()
result.show()  # Action 3

# GOOD
df1 = spark.read.parquet("/data")
df2 = df1.filter(col("active") == True)
result = df2.groupBy("user").count()
result.show()  # Single action
```

---

## Understanding with Timeline

```text
Time 0: Code starts
==================
df = spark.read.parquet("/data")
Status: No execution, just plan

Time 1: More transformations
=============================
df = df.select("col1", "col2")
df = df.filter(df.col1 > 10)
Status: Still no execution, plan grows

Time 2: Even more transformations
==================================
df = df.groupBy("col1").count()
Status: Still no execution, plan continues to grow

Time 3: ACTION called
=====================
df.show()
Status: NOW execution starts!
- Optimizer runs
- Physical plan created
- Jobs launched
- Cluster resources used
- Results returned

Time 4: Complete
================
Results displayed
Execution finished
```

---

## Key Concepts Summary

| Concept | Meaning |
|---------|---------|
| Lazy Evaluation | Delay execution until action |
| Transformation | Create new DF, no execution |
| Action | Trigger execution |
| DAG | Directed acyclic graph of the operations to run |
| Catalyst | Spark's optimizer |
| Physical Plan | Actual execution strategy |

---

## Benefits Recap

1. **Optimization:** Spark sees entire plan, optimizes globally
2. **Efficiency:** Only necessary work is done
3. **Resource Savings:** Read only needed data
4. **Performance:** Push filters/projections to source
5. **Smart Execution:** Combines operations when possible

---

## Quick Decision Tree

```text
Should I use .cache() or .persist()?
=====================================

Will you use this DataFrame multiple times?
    |
    +-- NO -> Don't cache (waste of memory)
    |
    +-- YES -> How many times?
              |
              +-- 2 times -> Maybe cache
              |
              +-- 3+ times -> Definitely cache
```

---

## Final Example: Complete Flow

```python
from pyspark.sql.functions import col, sum

# Start: All lazy transformations
df = spark.read.parquet("/sales_data")  # Not executed
df = df.filter(col("year") == 2024)     # Not executed
df = df.filter(col("region") == "USA")  # Not executed
df = df.select("product", "revenue")    # Not executed
df = df.groupBy("product").agg(sum("revenue"))  # Not executed

# At this point: ZERO execution, ZERO data read
# Spark only has a plan (DAG)

# Now action
df.write.parquet("/output")  # EXECUTES EVERYTHING

# What Spark does:
# 1. Optimizes: Read only year, region, product, revenue
# 2. Pushes filters: year=2024 AND region=USA to storage
# 3. Reads only filtered data
# 4. Groups and aggregates
# 5. Writes output
```

---

## Remember

**Transformations are lazy, Actions are eager!**

Every transformation just adds to the plan.
Every action executes the entire plan.

**Optimize by:**
- Minimizing actions
- Using cache when needed
- Letting Spark optimize the full plan

---

**Key Takeaway:** Lazy evaluation allows Spark to be smart about execution. Trust it, use it, benefit from it!

## How to Create a SparkSession (Databricks)

```python
# SparkSession is already available as `spark` in Databricks

from pyspark.sql.functions import *
from pyspark.sql.types import *

# `data` and `schema` must be defined before this call
df = spark.createDataFrame(data, schema=schema)
```


# Partitions & RDD in Apache Spark

## What is a Partition?

- A partition is a small chunk of data
- Spark divides large data into multiple partitions
- Each partition is processed independently

### Key Points:
- One partition = one task
- One task runs on one executor core
- More partitions = more parallelism (up to the number of available cores)

### Visual Representation:
```text
Large Dataset (1 TB)
        |
        v
Divided into partitions
        |
        v
[P1] [P2] [P3] [P4] [P5] [P6] [P7] [P8]
 |    |    |    |    |    |    |    |
 v    v    v    v    v    v    v    v
Task Task Task Task Task Task Task Task
 |    |    |    |    |    |    |    |
 v    v    v    v    v    v    v    v
Core Core Core Core Core Core Core Core
```

---

## Why Partitions are Important?

Partitions decide:
- Parallel execution
- Job performance
- Resource utilization
- Execution speed

**Bad partitioning = slow job**  
**Good partitioning = fast job**

### Example Impact:
```text
Scenario 1: Too Few Partitions (2 partitions, 8 cores)
[P1======] [P2======]
Core1      Core2      Core3 Core4 Core5 Core6 Core7 Core8
(busy)     (busy)     (idle)(idle)(idle)(idle)(idle)(idle)
Result: 6 cores wasted! Poor performance.

Scenario 2: Good Partitions (8 partitions, 8 cores)
[P1] [P2] [P3] [P4] [P5] [P6] [P7] [P8]
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Result: All cores utilized! Good performance.

Scenario 3: Too Many Partitions (10,000 tiny partitions, 8 cores)
[P1][P2][P3]...[P10000]
Result: Scheduling overhead dominates the tiny tasks! Slow performance.
```

---

## What is RDD?

**RDD** stands for **Resilient Distributed Dataset**.

### Meaning:
- **Resilient** = fault tolerant (can recover from failures)
- **Distributed** = spread across multiple machines
- **Dataset** = collection of data

RDD is the fundamental data structure of Spark.

### RDD Characteristics:
- Immutable (cannot be changed once created)
- Lazily evaluated (computed only when action is called)
- Partitioned across cluster
- Fault-tolerant through lineage

---

## Relationship Between RDD and Partition

- An RDD is made up of multiple partitions
- Data inside an RDD is always partitioned
- Spark processes RDD partition by partition

### Visual Diagram:
```text
         RDD (Logical View)
              |
              v
     [Complete Dataset]
              |
              v
    Physical Distribution
              |
    +---------+---------+
    |         |         |
    v         v         v
Partition 1  Partition 2  Partition 3
(on Node A)  (on Node B)  (on Node C)
```

---

## How Spark Executes Using Partitions

```text
RDD with 4 Partitions
        |
        +---> Partition 1 -> Task 1 -> Executor Core 1
        |
        +---> Partition 2 -> Task 2 -> Executor Core 2
        |
        +---> Partition 3 -> Task 3 -> Executor Core 3
        |
        +---> Partition 4 -> Task 4 -> Executor Core 4

Each partition is processed in parallel
```
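
The one-task-per-partition flow above can be imitated in plain Python. This is only an analogy under stated assumptions: worker threads stand in for executor cores, and `split_into_partitions` and `run_task` are hypothetical helpers, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """One task processes exactly one partition."""
    return sum(partition)


data = list(range(1, 101))
partitions = split_into_partitions(data, 4)

# Four worker threads play the role of four executor cores
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(run_task, partitions))

print(partial_sums)       # [325, 950, 1575, 2200]
print(sum(partial_sums))  # 5050
```

Each task sees only its own partition; the partial results are combined afterwards, which is the same shape as a distributed aggregation.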

---

## Important Execution Rules

1. Spark never processes full data at once
2. Spark always processes data partition-wise
3. Number of partitions = number of tasks in a stage
4. Tasks run in parallel on available cores

### Code Example:
```python
# Check number of partitions
df = spark.read.parquet("/data")
print(f"Number of partitions: {df.rdd.getNumPartitions()}")

# Spark creates one task per partition, so 8 partitions means 8 tasks per stage
df.rdd.getNumPartitions()  # e.g. 8
```

---

## RDD vs DataFrame (Basic Understanding)

| Aspect | RDD | DataFrame |
|--------|-----|-----------|
| Level | Low-level | High-level |
| Optimization | Manual | Automatic (Catalyst) |
| Type Safety | Compile-time (Scala/Java only) | Runtime |
| Ease of Use | Complex | Easy |
| Performance | Good | Better (optimized) |
| API | Functional | SQL + Functional |

### Relationship:
```text
DataFrame (High-level API)
        |
        v
   Uses internally
        |
        v
RDD (Low-level execution engine)
        |
        v
Partitions (Physical data chunks)
```

**Note:** DataFrame internally uses RDD but adds optimization layer.

---

## Production Best Practices

### 1. Optimal Partition Size
**Rule of Thumb:** Each partition should be 100-200 MB

```python
# Calculate ideal partitions
data_size_gb = 100  # GB
partition_size_mb = 128  # MB
ideal_partitions = (data_size_gb * 1024) // partition_size_mb  # integer count
print(f"Ideal partitions: {ideal_partitions}")  # 800 partitions
```

### 2. Partition Count
**Rule:** Number of partitions = 2-3x number of cores in cluster

```python
# If you have 100 cores
ideal_partitions = 100 * 3  # 300 partitions
df = df.repartition(300)
```

### 3. Check Partition Sizes
```python
# Check partition distribution (row count per partition, not bytes)
partition_sizes = df.rdd.glom().map(len).collect()
print(f"Partition sizes: {partition_sizes}")

# Check for data skew
max_size = max(partition_sizes)
min_size = min(partition_sizes)
if max_size > min_size * 2:
    print("Warning: Data skew detected!")
```

### 4. When to Repartition
```python
# BEFORE several wide transformations on the same key
df = spark.read.parquet("/large_data")
df = df.repartition(200, "user_id")  # Partition by key (this is itself a shuffle)
result = df.groupBy("user_id").agg(...)  # Can reuse the partitioning instead of shuffling again
```

### 5. When to Coalesce
```python
# AFTER filtering
df_large = spark.read.parquet("/data")  # 1000 partitions
df_filtered = df_large.filter(col("active") == True)  # 90% data removed
df_optimized = df_filtered.coalesce(100)  # Reduce partitions
df_optimized.write.parquet("/output")
```

---

## Common Production Problems

### Problem 1: Too Few Partitions
**Symptom:** Some cores idle, job takes long time

**Solution:**
```python
# Increase partitions
df = df.repartition(num_cores * 3)
```

### Problem 2: Too Many Partitions
**Symptom:** High scheduling overhead, many small tasks

**Solution:**
```python
# Reduce partitions
df = df.coalesce(optimal_count)
```

### Problem 3: Data Skew
**Symptom:** Most tasks finish fast, few tasks take forever

**Solution:**
```python
# Add salt to skewed key
from pyspark.sql.functions import col, concat, lit, rand

df = df.withColumn("salt", (rand() * 10).cast("int"))
df = df.withColumn("salted_key", concat(col("user_id"), lit("_"), col("salt")))
df = df.repartition("salted_key")
```
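
Why salting helps can be seen in a small plain-Python simulation. The assumptions are loud here: `toy_partition` is a made-up hash (sum of character codes) standing in for Spark's real hash partitioner, and the salt is assigned round-robin instead of with `rand()` so the output is reproducible.

```python
from collections import Counter


def toy_partition(key, num_partitions):
    """Toy hash partitioner (sum of character codes), not Spark's real hash."""
    return sum(ord(ch) for ch in key) % num_partitions


NUM_PARTITIONS = 8
SALT_BUCKETS = 10

# 1,000 rows that all share one hot key
keys = ["hot"] * 1000

# Without salt, every row of the hot key lands in the same partition
unsalted = Counter(toy_partition(key, NUM_PARTITIONS) for key in keys)

# With salt, the hot key becomes 10 distinct keys: hot_0 ... hot_9
salted = Counter(
    toy_partition(f"{key}_{i % SALT_BUCKETS}", NUM_PARTITIONS)
    for i, key in enumerate(keys)
)

print(max(unsalted.values()))  # 1000
print(max(salted.values()))    # 200
```

The largest partition shrinks from 1,000 rows to 200. Keep in mind that after salting, an aggregation has to run in two steps: first per salted key, then again per original key.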

### Problem 4: Small Files Problem
**Symptom:** Thousands of tiny output files

**Solution:**
```python
# Coalesce before writing
df.coalesce(10).write.parquet("/output")  # Creates 10 files
```

---

## Advanced: Partition Control

### 1. Default Partitioning
```python
# Spark decides based on file size
df = spark.read.parquet("/data")
# Default: ~128 MB per partition
```

### 2. Custom Partitioning
```python
# Cap the bytes packed into one read partition (a session setting, not a reader option)
spark.conf.set("spark.sql.files.maxPartitionBytes", "256MB")
df = spark.read.parquet("/data")
```

### 3. Range Partitioning
```python
# Partition by value ranges
df = df.repartitionByRange(10, "user_id")
# Good when each partition should hold a contiguous, ordered range of keys
```

### 4. Hash Partitioning
```python
# Partition by hash of column
df = df.repartition(20, "user_id")
# Ensures same user_id in same partition
```
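
The difference between hash and range partitioning is easiest to see on a handful of keys. The sketch below is plain Python with hypothetical helpers; the modulo "hash" and the boundaries `[5, 10, 15]` are assumptions chosen for illustration (Spark picks range boundaries by sampling the data).

```python
from bisect import bisect_right


def hash_partition(key, num_partitions):
    """Hash partitioning: equal keys always map to the same partition."""
    return key % num_partitions  # toy hash for integer keys


def range_partition(key, boundaries):
    """Range partitioning: each partition holds a contiguous key range."""
    return bisect_right(boundaries, key)


user_ids = [3, 17, 8, 12, 3, 5, 19, 10]

by_hash = [hash_partition(key, 4) for key in user_ids]
by_range = [range_partition(key, [5, 10, 15]) for key in user_ids]

print(by_hash)   # [3, 1, 0, 0, 3, 1, 3, 2]
print(by_range)  # [0, 3, 1, 2, 0, 1, 3, 2]
```

In both schemes the two rows with `user_id` 3 end up together, but only range partitioning keeps neighbouring keys in the same or adjacent partitions.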

---

## Monitoring Partitions

### In Spark UI:
1. Go to Spark UI (usually port 4040)
2. Check "Stages" tab
3. Look at "Number of Tasks"
4. Each task = one partition

### In Code:
```python
# Get partition count
print(df.rdd.getNumPartitions())

# See data distribution
df.rdd.glom().map(len).collect()

# Check partition details
df.explain()
```

---

## Complete Example: Optimizing Partitions

```python
from pyspark.sql.functions import col, sum

# Step 1: Read data (default partitioning)
df = spark.read.parquet("/large_dataset")
print(f"Initial partitions: {df.rdd.getNumPartitions()}")

# Step 2: Filter reduces data significantly
df_filtered = df.filter(col("year") == 2024)

# Step 3: Coalesce after filtering
# (optional here: the repartition in Step 4 reshuffles the data anyway)
df_filtered = df_filtered.coalesce(50)

# Step 4: Repartition by key before groupBy
df_prepared = df_filtered.repartition(100, "department")

# Step 5: Perform aggregation (benefits from partitioning)
result = df_prepared.groupBy("department").agg(
    sum("sales").alias("total_sales")
)

# Step 6: Coalesce before writing
result.coalesce(5).write.mode("overwrite").parquet("/output")

print("Optimization complete!")
```

---

## Key Formulas to Remember

```text
Ideal Partitions = Data Size (MB) / Target Partition Size (128 MB)

Or

Ideal Partitions = Number of Cores × 2 to 3

Partition Size Check = Total Data Size / Number of Partitions
```
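
These formulas translate directly into a few helper functions. The function names below are hypothetical, and the 128 MB target and 2-3x multiplier are simply the rules of thumb quoted in this section, not values read from Spark.

```python
import math


def partitions_by_size(data_size_mb, target_partition_mb=128):
    """Partitions needed so each holds about target_partition_mb of data."""
    return max(1, math.ceil(data_size_mb / target_partition_mb))


def partitions_by_cores(total_cores, low=2, high=3):
    """Suggested (min, max) partition count for a given core count."""
    return total_cores * low, total_cores * high


def average_partition_mb(data_size_mb, num_partitions):
    """Partition size check: average data per partition."""
    return data_size_mb / num_partitions


print(partitions_by_size(100 * 1024))          # 800
print(partitions_by_cores(100))                # (200, 300)
print(average_partition_mb(100 * 1024, 800))   # 128.0
```

When the two rules disagree, as they do here (800 versus 200-300), treat them as a range to experiment within rather than as exact targets.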

---

## One-Line Summary

RDD is a distributed dataset divided into partitions, and Spark processes each partition in parallel using one task per partition.

---

## Quick Reference Card

| Concept | Key Point |
|---------|-----------|
| Partition | Small chunk of data |
| RDD | Distributed collection divided into partitions |
| Task | Processing unit for one partition |
| Narrow Transform | No shuffle, same partition count |
| Wide Transform | Shuffle occurs, new partition count |
| Repartition | Increase or decrease with shuffle |
| Coalesce | Decrease only, no shuffle |
| Optimal Size | 100-200 MB per partition |
| Optimal Count | 2-3x number of cores |

---

**Remember:** Good partitioning = Good performance. Always monitor and optimize partition count and size in production!



# Transformations in Apache Spark

## What is a Transformation?

- A transformation is an operation that creates a new RDD or DataFrame
- Transformations are lazily evaluated
- Execution does NOT happen until an action is called

Examples:
- select()
- filter()
- map()
- groupBy()

---

## Types of Transformations

Spark transformations are classified into two types:
1. Narrow Transformations
2. Wide Transformations

---

## Narrow Transformation

### Definition
A narrow transformation is one where:
- Each output partition depends on only ONE input partition
- No data movement across partitions
- No shuffle occurs

### Partition Flow
```text
Partition 1 -> Transformation -> Partition 1
Partition 2 -> Transformation -> Partition 2
Partition 3 -> Transformation -> Partition 3
```

### Characteristics
- Fast execution
- One-to-one partition dependency
- Same stage execution
- No network overhead

### Examples
```python
# Narrow transformations
df.select("name", "age")
df.filter(df.age > 30)
df.withColumn("new_col", col("salary") * 2)
df.drop("unwanted_column")
```

---

## Wide Transformation

### Definition
A wide transformation is one where:
- Output partitions depend on MULTIPLE input partitions
- Data moves across partitions
- Shuffle occurs

### Partition Flow
```text
Partition 1 \
Partition 2  \
              -> Shuffle -> New Partition A
Partition 3  /
Partition 4 /

Data is redistributed across the cluster
```

### Characteristics
- Slower execution
- Many-to-one partition dependency
- Creates new stage
- Network overhead (shuffle)

### Examples
```python
# Wide transformations
df.groupBy("department").count()
df.orderBy("salary")
df1.join(df2, "user_id")
df.distinct()
```
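
The two dependency patterns can be mimicked with plain Python lists standing in for partitions. This is a sketch under assumptions: the data is made up, and the character-sum hash in `shuffle_by_key` is a toy replacement for Spark's partitioner.

```python
from collections import defaultdict

# Two input partitions of (department, salary) rows
partitions = [
    [("eng", 100), ("ops", 50)],
    [("eng", 200), ("ops", 70), ("hr", 40)],
]

# Narrow: each output partition is built from exactly one input partition
doubled = [[(dept, salary * 2) for dept, salary in part] for part in partitions]


def shuffle_by_key(parts, num_output):
    """Wide: rows move so that equal keys land in the same output partition."""
    output = [[] for _ in range(num_output)]
    for part in parts:
        for dept, salary in part:
            target = sum(ord(ch) for ch in dept) % num_output  # toy hash
            output[target].append((dept, salary))
    return output


def sum_per_key(part):
    """Aggregate one shuffled partition locally."""
    totals = defaultdict(int)
    for dept, salary in part:
        totals[dept] += salary
    return dict(totals)


shuffled = shuffle_by_key(partitions, 2)
totals = [sum_per_key(part) for part in shuffled]
merged = {dept: total for part in totals for dept, total in part.items()}
print(merged == {"eng": 300, "ops": 120, "hr": 40})  # True
```

The narrow step never looks outside its own partition, while the wide step has to read every input partition before any output partition is complete.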

---

## Common Transformations: Narrow vs Wide

### Narrow Transformations (No Shuffle)

These transformations operate on data within the same partition.

**Functions:**
- select()
- filter()
- map()
- withColumn()
- drop()
- limit() (narrow only for the per-partition step; the final global limit gathers rows into one partition)
- sample()
- union()

**Example:**
```python
# All these are narrow transformations
df.select("name", "department") \
  .filter(df.salary > 50000) \
  .withColumn("bonus", col("salary") * 0.1) \
  .drop("temp_column")
```

---

### Wide Transformations (Shuffle Happens)

These transformations require data movement across partitions.

**Functions:**
- groupBy()
- reduceByKey()
- aggregateByKey()
- join()
- distinct()
- orderBy()
- sort()
- repartition()

**Example:**
```python
# All these are wide transformations
df.groupBy("department").agg(sum("salary")) \
  .orderBy("department") \
  .distinct()
```

---

## Comparison Table

| Aspect | Narrow Transformation | Wide Transformation |
|--------|----------------------|---------------------|
| Shuffle | No shuffle | Shuffle happens |
| Speed | Faster | Slower |
| Dependency | One-to-one | Many-to-one |
| Stage | Same stage | Creates new stage |
| Network | No network transfer | Network transfer required |
| Examples | select, filter, map | groupBy, join, orderBy |

---

## Performance Impact

### Narrow Transformations
- Minimal overhead
- Can be pipelined together
- Execute in memory
- No disk I/O for shuffle

### Wide Transformations
- High overhead
- Requires disk write/read
- Network data transfer
- Creates shuffle files

---

## Production Best Practices

### DO:
- Chain multiple narrow transformations together
- Use filter() early to reduce data size before wide transformations
- Minimize the number of wide transformations
- Use broadcast joins for small tables

### AVOID:
- Unnecessary groupBy() or distinct()
- Multiple joins in sequence without optimization
- orderBy() on large datasets unless required

---

## Full Example

```python
from pyspark.sql.functions import col, sum

# Load data
df = spark.read.parquet("/data")

# Narrow transformations (fast, no shuffle)
df_filtered = df.select("user_id", "department", "salary") \
                .filter(col("salary") > 50000) \
                .withColumn("tax", col("salary") * 0.3)

# Wide transformation (shuffle happens, creates new stage)
result = df_filtered.groupBy("department") \
                    .agg(sum("salary").alias("total_salary"))

# Another wide transformation
final = result.orderBy("total_salary", ascending=False)

# Action triggers execution
final.show()
```

### Execution breakdown:
1. select, filter, withColumn = Narrow (Stage 1)
2. groupBy = Wide (creates Stage 2, shuffle occurs)
3. orderBy = Wide (creates Stage 3, shuffle occurs)
4. show() = Action (triggers all stages)

---

## Key Takeaways

- Narrow transformations are fast and efficient
- Wide transformations cause shuffle and are expensive
- Minimize wide transformations for better performance
- Use filters early to reduce data before shuffles
- Every wide transformation creates a new stage

---

**Remember:** Narrow = No Shuffle = Fast | Wide = Shuffle = Slow


# repartition() vs coalesce() in Apache Spark

## Why do we need them?
- Spark divides data into partitions
- Each partition is processed by one executor core
- Too few partitions = resources are wasted
- Too many partitions = overhead and slow performance

So we adjust partitions using `repartition()` and `coalesce()`.

---

## repartition()

**What it does:**  
Reshuffles data to change the number of partitions.

### Key Points
- Causes shuffle (expensive)
- Can INCREASE or DECREASE partitions
- Creates balanced partitions

### Examples
```python
# Original DataFrame
df = spark.range(1000)

# Increase partitions for better parallelism
df_new = df.repartition(8)

# Repartition by column (useful before join/groupBy)
# spark.range() only produces an `id` column, so partition on that
df_by_id = df.repartition(4, "id")
```

---

## coalesce()

**What it does:**  
Merges existing partitions to reduce their count without reshuffling data.

### Key Points
- No shuffle (fast)
- Can ONLY decrease partitions
- May create uneven partitions

### Examples
```python
# Start with more partitions
df = spark.range(1000).repartition(8)

# Reduce partitions quickly
df_small = df.coalesce(2)

# Before writing output files
df.coalesce(1).write.parquet("/path")  # Creates a single file
```

---

## When to use repartition()

- When you need MORE partitions for parallel processing
- Before wide operations like join or groupBy
- When even data distribution is required

---

## When to use coalesce()

- After filtering when data size is reduced
- Before writing data to disk (to reduce number of files)
- When you want to reduce partitions quickly with low cost

---

## Simple Rules to Remember

- Need more partitions? Use repartition()
- Need fewer partitions? Use coalesce()
- Before join or groupBy? Use repartition(column)
- Before writing files? Use coalesce()

---

## Full Example

```python
from pyspark.sql.functions import col

# 1. Load data
df = spark.read.parquet("/data")

# 2. Repartition by key before aggregation
df = df.repartition(10, "user_id")

# 3. Perform wide transformation
result = df.groupBy("user_id").count()

# 4. Filter reduces data size
result = result.filter(col("count") > 100)

# 5. Reduce partitions before writing
result.coalesce(1).write.parquet("/output")
```

---

## Quick Comparison

| Feature | repartition() | coalesce() |
|---------|---------------|------------|
| Shuffle? | Yes | No |
| Speed | Slow | Fast |
| Direction | Up or Down | Only Down |
| Distribution | Even | May be uneven |
| Use case | Before joins/groupBy | Before writing files |

---

**Summary:** Use repartition() when you need balanced partitions or more parallelism. Use coalesce() when you just want to reduce partitions quickly.
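
The trade-off in the table can be reproduced with lists standing in for partitions. Both functions below are hypothetical toy models, not Spark's implementation: the toy `coalesce` merges neighbouring partitions without moving individual rows, and the toy `repartition` deals rows out round-robin.

```python
def coalesce(partitions, target):
    """Merge neighbouring partitions; rows are never redistributed."""
    if target >= len(partitions):
        return partitions  # coalesce cannot increase the count
    merged = [[] for _ in range(target)]
    for index, part in enumerate(partitions):
        merged[index * target // len(partitions)].extend(part)
    return merged


def repartition(partitions, target):
    """Full shuffle: deal every row out round-robin for balanced sizes."""
    rows = [row for part in partitions for row in part]
    return [rows[i::target] for i in range(target)]


skewed = [list(range(90)), [90], [91], [92]]  # sizes: 90, 1, 1, 1

print([len(p) for p in coalesce(skewed, 2)])     # [91, 2]
print([len(p) for p in repartition(skewed, 2)])  # [47, 46]
print(len(coalesce(skewed, 8)))                  # 4
```

Coalescing is cheap but inherits the skew of its inputs; repartitioning pays for a shuffle and gets even partitions in return.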

The notebook cell below reads a CSV file. Each `.option()` call passes extra configuration to the reader; with `inferSchema` enabled, Spark infers the best schema from the data instead of requiring you to define one.

```python
from pyspark.sql.functions import *

# A job always has at least one stage, and a stage always has at least one task
df = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("/FileStore/tables/StudentDataExample.csv")

# Transformations
df = df.filter(col("product_name") == "Sneakers")

# Select only the order ID and product name
df = df.select("order_id", "product_name")

# groupBy performs a wide transformation
df = df.groupBy("product_name").agg(count(col("order_id")))
```



# Jobs, Stages, and Tasks in Apache Spark

This section explains how Spark actually executes code internally.
Understanding this is critical for performance tuning and debugging in production.

---

## Big Picture

Spark executes work in the following hierarchy:

**Job -> Stage -> Task**

- **Job** answers: WHY Spark runs
- **Stage** answers: HOW work is divided (shuffle or no shuffle)
- **Task** answers: WHO does the actual work

---

## What is a Job?

### Definition
A **Job** is created whenever an **ACTION** is called in Spark.

Actions include:
- show()
- count()
- collect()
- write()

### Example
```python
df.filter(df.age > 30).count()
# This creates ONE Job
```

### Key Points
- No action = no job
- One action = one job (in practice some actions, such as `show()`, can launch more than one)
- Multiple actions = multiple jobs

### Production Note
Too many actions in code = multiple jobs = slower pipelines

---

## What is a Stage?

A **Stage** is a set of transformations that can be executed without shuffle.

- **Narrow transformations** stay in the same stage
- **Wide transformations** (shuffle) create a new stage

### Example
```python
df.select("name", "salary", "department") \
  .filter(df.salary > 50000) \
  .groupBy("department") \
  .count()
```

### Spark creates:
- **Stage 1** - select + filter (no shuffle)
- **Stage 2** - groupBy + count (shuffle)

### Key Points
- Stages are divided by shuffle boundaries
- More shuffles = more stages = slower jobs

---

## What is a Task?

A **Task** is the smallest unit of execution in Spark.

### Important rule:
**One partition = one task**

### Example
If a stage has 8 partitions, Spark creates 8 tasks.

These tasks run in parallel on executor cores.

### Key Points
- Tasks run on executors
- Tasks do the actual data processing
- Number of tasks depends on number of partitions

---

## Full Execution Flow Example

```python
df = spark.read.parquet("/data")
df2 = df.filter(df.age > 30)          # Narrow transformation
df3 = df2.groupBy("dept").count()     # Wide transformation (shuffle)
df3.write.parquet("/output")          # Action
```

### Spark execution:
1. write() triggers ONE Job
2. filter() runs in Stage 1
3. groupBy() causes shuffle - creates Stage 2
4. Each stage is executed using tasks (one per partition)
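
The Job -> Stage -> Task breakdown above can be written as a tiny planning function. This is a toy model with hypothetical names (`WIDE_OPS`, `plan_job`); it assumes every wide transformation adds exactly one shuffle and that post-shuffle stages use 200 partitions, which is the default value of `spark.sql.shuffle.partitions` (adaptive query execution may change the real number).

```python
WIDE_OPS = {"groupBy", "join", "orderBy", "distinct", "repartition"}


def plan_job(transformations, input_partitions, shuffle_partitions=200):
    """Return the task count of each stage for one job (toy model)."""
    stage_tasks = [input_partitions]  # Stage 1 reads the input partitions
    for op in transformations:
        if op in WIDE_OPS:
            # A shuffle ends the current stage and starts a new one
            stage_tasks.append(shuffle_partitions)
    return stage_tasks


stages = plan_job(["filter", "groupBy"], input_partitions=8)
print(len(stages))  # 2
print(stages)       # [8, 200]
```

For the example above, one action yields one job with two stages: 8 tasks before the shuffle and 200 tasks after it.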

---

## Relationship Summary

| Concept | Creates |
|---------|---------|
| Action | Job |
| Shuffle | New Stage |
| Partition | Task |

---

## Common Production Mistakes

### 1. Too many actions
```python
df.count()
df.show()
df.write.parquet("/out")
# Creates 3 jobs (avoid if possible)
```

### 2. Too many shuffles
- Unnecessary groupBy
- Bad joins
- Wrong repartition usage
- Results in slow performance

### 3. Bad partitioning
- Too few partitions = CPU underutilized
- Too many partitions = scheduling overhead

---

## What to Remember (Very Important)

- Every action creates a job
- Every shuffle creates a new stage
- Every partition creates a task

### Performance depends on:
- Number of shuffles
- Number of partitions
- Number of actions

---

**Optimize by reducing shuffles and actions, and tuning partition count!**


# Shuffle Join vs Broadcast Join in Apache Spark

## Introduction

Joins are among the most expensive operations in Spark. Knowing when Spark uses a Shuffle Join and when a Broadcast Join is possible can substantially reduce both cloud costs and execution time.

---

## What is a Join?

A join combines two tables based on a common key.

**Example:**
```python
# Customers table (DataFrame `customers`): customer_id, name, country

# Orders table (DataFrame `orders`): order_id, customer_id, amount

# Join to get customer name with each order
result = orders.join(customers, "customer_id")
```

**The Problem:**
- Tables are distributed across multiple partitions
- Matching rows might be on different machines
- How do we bring them together?

**Two Solutions:**
1. Shuffle Join (move both tables)
2. Broadcast Join (copy small table to everyone)

---

## Shuffle Join (Sort-Merge Join)

### What is Shuffle Join?

Shuffle Join redistributes data across the network so that rows with the same join key end up on the same partition.

### Visual Explanation:

```text
BEFORE SHUFFLE:
===============

Table A (Customers)          Table B (Orders)
Partition 1:                 Partition 1:
  customer_id=1                order: customer_id=5
  customer_id=2                order: customer_id=6

Partition 2:                 Partition 2:
  customer_id=3                order: customer_id=1
  customer_id=4                order: customer_id=2

Problem: customer_id=1 is in different partitions!


AFTER SHUFFLE (Data Movement):
===============================

Table A (Customers)          Table B (Orders)
Partition 1:                 Partition 1:
  customer_id=1                order: customer_id=1  <- Now together!
  customer_id=3                order: customer_id=3

Partition 2:                 Partition 2:
  customer_id=2                order: customer_id=2  <- Now together!
  customer_id=4                order: customer_id=4

Now: Join happens locally in each partition
```

### How It Works:

```text
Step 1: Hash the join key
    Spark calculates hash(customer_id) for both tables

Step 2: Shuffle data
    Rows with same hash go to same partition
    Data moves across network (expensive!)

Step 3: Sort data in each partition
    Sort by join key

Step 4: Merge join
    Walk through sorted data and match rows

Step 5: Return results
```
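
The five steps can be traced in a short plain-Python sketch. It is an illustration under assumptions, not Spark's implementation: rows are tuples with the join key first, the "hash" is a modulo on integer keys, the sample data is made up, and `merge_join` assumes keys are unique on the left side.

```python
def shuffle(rows, num_partitions):
    """Steps 1-2: route each row to a partition chosen by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row[0] % num_partitions].append(row)  # toy integer hash
    return parts


def merge_join(left, right):
    """Steps 3-4: sort both sides by key, then walk them together."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            # Keys match: emit the pair and advance the right side only,
            # so one customer can match several orders
            joined.append((left[i][0], left[i][1], right[j][1]))
            j += 1
    return joined


customers = [(1, "Ana"), (2, "Bo"), (3, "Cy"), (4, "Di")]
orders = [(1, "o-10"), (2, "o-11"), (1, "o-12"), (5, "o-13")]

result = []
for cust_part, order_part in zip(shuffle(customers, 2), shuffle(orders, 2)):
    result.extend(merge_join(cust_part, order_part))  # join is local per partition

print(sorted(result))
# [(1, 'Ana', 'o-10'), (1, 'Ana', 'o-12'), (2, 'Bo', 'o-11')]
```

Because both tables are shuffled with the same hash, matching keys are guaranteed to meet in the same partition, and each partition can be joined independently.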

### Code Example:

```python
# Both tables are large
customers = spark.read.parquet("/customers")  # 1M rows, 2 GB
orders = spark.read.parquet("/orders")        # 100M rows, 50 GB

# Shuffle Join happens (default for large tables)
result = customers.join(orders, "customer_id")

# What Spark does internally:
# 1. Shuffle 2 GB of customers data
# 2. Shuffle 50 GB of orders data
# 3. Total shuffle: 52 GB across network!
# 4. Creates 52 GB of shuffle files on disk
```

### Costs of Shuffle Join:

```text
Network Cost:
- Both tables transferred across network
- Example: 50 GB + 2 GB = 52 GB transfer

Disk I/O Cost:
- Write shuffle files to disk
- Read shuffle files from disk
- Example: 52 GB write + 52 GB read = 104 GB I/O

Memory Cost:
- Buffer data during shuffle
- Sort data in memory

Time Cost:
- Network transfer time
- Disk I/O time
- Sorting time
- Example: Can take 10-30 minutes for large tables
```

### When Shuffle Join Happens:

- Both tables exceed the auto-broadcast threshold (`spark.sql.autoBroadcastJoinThreshold`, 10 MB by default)
- No broadcast hint provided
- Default join strategy

---

## Broadcast Join (Map-Side Join)

### What is Broadcast Join?

Broadcast Join sends the entire small table to every executor, eliminating the need to shuffle the large table.

### Visual Explanation:

```text
BROADCAST STRATEGY:
===================

Table A (Countries - Small)    Table B (Customers - Large)
195 rows, 10 KB               100M rows, 20 GB

Step 1: Broadcast Table A
    Send 10 KB to ALL executors
    Each executor gets a complete copy

Step 2: No shuffle needed!
    Large table stays in place
    
Executor 1:                   
  Memory: Countries (10 KB) <- Broadcasted
  Processes: Partition 1 of Customers
  Joins locally (fast!)

Executor 2:
  Memory: Countries (10 KB) <- Broadcasted
  Processes: Partition 2 of Customers
  Joins locally (fast!)

Executor 3:
  Memory: Countries (10 KB) <- Broadcasted
  Processes: Partition 3 of Customers
  Joins locally (fast!)

Total Network Transfer: 10 KB x 10 executors = 100 KB
vs Shuffle Join would transfer: 20 GB!
```

### How It Works:

```text
Step 1: Collect small table
    Driver collects entire small table

Step 2: Broadcast to executors
    Send small table to all executors
    Each executor stores it in memory

Step 3: Hash join
    Each executor joins its partition with broadcasted table
    No shuffle needed!

Step 4: Return results
```
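
The same steps in plain Python look like a dictionary lookup per partition. This is a sketch with made-up data and a hypothetical function name; a Python `dict` stands in for the hash table each executor builds from the broadcast copy.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Join each partition of the large table against a full copy of the small one."""
    # Steps 1-2: the small table becomes an in-memory lookup every task can use
    lookup = dict(small_table)

    results = []
    for partition in large_partitions:  # the large table's rows never move
        joined = [
            (customer, code, lookup[code])
            for customer, code in partition
            if code in lookup  # inner join: drop rows without a match
        ]
        results.append(joined)
    return results


countries = [("US", "United States"), ("IN", "India")]
customers = [
    [("alice", "US"), ("bob", "IN")],   # partition 1
    [("carol", "FR"), ("dave", "US")],  # partition 2
]

joined_partitions = broadcast_hash_join(customers, countries)
for part in joined_partitions:
    print(part)
# [('alice', 'US', 'United States'), ('bob', 'IN', 'India')]
# [('dave', 'US', 'United States')]
```

The output keeps the partitioning of the large table, which is exactly why no shuffle is needed.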

### Code Example:

```python
from pyspark.sql.functions import broadcast

# Small table
countries = spark.read.parquet("/countries")  # 195 rows, 10 KB

# Large table
customers = spark.read.parquet("/customers")  # 100M rows, 20 GB

# Broadcast Join (explicit)
result = customers.join(broadcast(countries), "country_code")

# What Spark does internally:
# 1. Send 10 KB to all executors (minimal network)
# 2. No shuffle of 20 GB customers table!
# 3. Join happens locally on each executor
# 4. Total network: ~100 KB vs 20 GB!
```

### Benefits of Broadcast Join:

```text
Network Cost: Minimal
- Only small table transferred
- Example: 10 KB x 10 executors = 100 KB

Disk I/O Cost: Zero
- No shuffle files written
- No shuffle files read

Memory Cost: Low
- Small table in executor memory
- Example: 10 KB per executor

Time Cost: Very fast
- No shuffle overhead
- Example: Can be 10-100x faster!
```

### Limitations:

- Small table must fit in memory on the driver (which collects it first) and on every executor
- Default auto-broadcast threshold: 10 MB
- Can increase but be careful with memory

---

## Shuffle Join vs Broadcast Join Comparison

| Aspect | Shuffle Join | Broadcast Join |
|--------|--------------|----------------|
| Network Transfer | Both tables | Only small table |
| Disk I/O | High (shuffle files) | Zero |
| Speed | Slow | Fast |
| Memory Usage | Moderate | Low (per executor) |
| Use Case | Both tables large | One table small |
| Cost | High | Low |
| Typical Time | 10-30 minutes | 1-5 minutes |

---

## Real Production Examples

### Example 1: E-Commerce Order Analysis

**Scenario:** Join 100 million orders with 50 stores

```python
from pyspark.sql.functions import broadcast

# Tables
stores = spark.read.parquet("/stores")      # 50 rows, 2 KB
orders = spark.read.parquet("/orders")      # 100M rows, 50 GB

# ❌ BAD: Shuffle Join (what you get without broadcast() when Spark cannot
#    auto-broadcast, e.g. no size estimate for stores or auto-broadcast disabled)
result = orders.join(stores, "store_id")
# Cost: Shuffles 50 GB
# Time: 15 minutes
# Money: $200 per run

# ✅ GOOD: Broadcast Join
result = orders.join(broadcast(stores), "store_id")
# Cost: Broadcasts 2 KB
# Time: 2 minutes  
# Money: $20 per run
# Savings: $180 per run!
```

---

### Example 2: Banking Fraud Detection

**Scenario:** Check 500M transactions against 1000 fraud rules

```python
from pyspark.sql.functions import broadcast

# Tables
transactions = spark.read.parquet("/transactions")  # 500M rows, 100 GB
fraud_rules = spark.read.parquet("/fraud_rules")    # 1000 rows, 50 KB
categories = spark.read.parquet("/categories")      # 500 rows, 20 KB

# ❌ WRONG: Multiple shuffle joins
result = transactions.join(fraud_rules, "rule_id") \
                    .join(categories, "category_id")
# Problem: Shuffles 100 GB twice!
# Time: 45 minutes
# Often fails with OOM errors

# ✅ CORRECT: Broadcast both small tables
result = transactions.join(broadcast(fraud_rules), "rule_id") \
                    .join(broadcast(categories), "category_id")
# Network: Only 70 KB broadcasted
# Time: 5 minutes
# No shuffle of the 100 GB table, so far less memory pressure
# Savings: 9x faster!
```

---

### Example 3: Ad Tech - Real-Time Bidding

**Scenario:** Process 10 billion ad impressions with dimension tables

```python
from pyspark.sql.functions import broadcast

# Fact table (huge)
impressions = spark.read.parquet("/impressions")    # 10B rows, 5 TB

# Dimension tables (small)
advertisers = spark.read.parquet("/advertisers")    # 50K rows, 5 MB
campaigns = spark.read.parquet("/campaigns")        # 200K rows, 20 MB
geo = spark.read.parquet("/geo_locations")          # 10K rows, 500 KB

# ❌ DISASTER: Shuffle joins
result = impressions.join(advertisers, "advertiser_id") \
                   .join(campaigns, "campaign_id") \
                   .join(geo, "geo_id")
# Would shuffle 5 TB multiple times!
# Time: 8 hours
# Cost: $2000 per run
# Likely to fail!

# ✅ PRODUCTION-READY: Broadcast all dimensions
result = impressions.join(broadcast(advertisers), "advertiser_id") \
                   .join(broadcast(campaigns), "campaign_id") \
                   .join(broadcast(geo), "geo_id")
# Broadcasts only 25.5 MB total
# Time: 30 minutes
# Cost: $120 per run
# Monthly savings: $56,400!
```

---

## When to Use Which Join?

### Use Broadcast Join When:

1. One table is small (< 100 MB)
2. Joining fact table with dimension tables
3. You have sufficient executor memory
4. Performance is critical

**Common Broadcast Candidates:**
- Countries (195 rows)
- US States (50 rows)
- Product categories (< 10K rows)
- Status codes (< 100 rows)
- Currency codes (180 rows)
- Date dimensions (< 10K rows)
- Configuration tables
- Lookup tables

---

### Use Shuffle Join When:

1. Both tables are large (> 1 GB each)
2. Cannot fit small table in memory
3. No dimension tables involved
4. Fact-to-fact table joins

**Example:**
```python
# Both tables are huge - no choice but to shuffle
sales_2023 = spark.read.parquet("/sales/2023")  # 500 GB
sales_2024 = spark.read.parquet("/sales/2024")  # 600 GB

# Shuffle join (unavoidable)
result = sales_2023.join(sales_2024, "product_id")
```

---

## Configuration and Tuning

### Broadcast Threshold

```python
# Default broadcast threshold: 10 MB
# Check current value
current = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
print(f"Current threshold: {current}")  # returned as a string, in bytes

# Increase to 100 MB (if you have memory)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

# Disable auto broadcast (force manual control)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```

### Broadcast Timeout

```python
# Default: 300 seconds (5 minutes)
# Increase if broadcasting large tables
spark.conf.set("spark.sql.broadcastTimeout", 600)  # 10 minutes
```

### Check Table Size Before Joining

```python
def should_broadcast(df, max_size_mb=100):
    """
    Check if DataFrame should be broadcasted
    
    Args:
        df: DataFrame to check
        max_size_mb: Maximum size in MB for broadcasting
    
    Returns:
        Boolean indicating if table should be broadcasted
    """
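    # Cache and count so the statistics reflect the materialized size
    # (note: this scans df and leaves it cached; call df.unpersist() if unwanted)
    df.cache()
    row_count = df.count()
    
    # Get size from Spark's statistics. _jdf is an internal handle, and
    # sizeInBytes() returns a Scala BigInt, so convert it via its string form
    # before doing arithmetic in Python.
    stats = df._jdf.queryExecution().optimizedPlan().stats()
    size_bytes = int(stats.sizeInBytes().toString())
    size_mb = size_bytes / (1024 * 1024)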
    
    print(f"Table size: {size_mb:.2f} MB ({row_count:,} rows)")
    
    if size_mb < max_size_mb:
        print(f"✅ Recommend BROADCAST join")
        return True
    else:
        print(f"❌ Use SHUFFLE join")
        return False

# Usage
if should_broadcast(countries, 100):
    result = orders.join(broadcast(countries), "country_id")
else:
    result = orders.join(countries, "country_id")
```

---

## How to Verify Join Type

### Method 1: Check Execution Plan

```python
# Create join
result = orders.join(broadcast(countries), "country_id")

# Check plan
result.explain()

# Look for:
# ✅ "BroadcastHashJoin" = Broadcast join (good!)
# ❌ "SortMergeJoin" = Shuffle join
```

**Example Output:**
```text
== Physical Plan ==
BroadcastHashJoin [country_id#123], [country_id#456]
:- LocalTableScan [country_id#123, name#124]
+- *(1) FileScan parquet [order_id#789, country_id#456]

This shows BroadcastHashJoin - Perfect!
```
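If you want to check plans in an automated way, the same inspection can be done on the plan text. The sketch below is plain Python and works on a string; `join_strategies` is a hypothetical helper, and how you obtain the plan text from a DataFrame (for example by capturing the output of `explain()`) is left as an assumption.

```python
# Physical join operators that can appear in Spark plans, grouped by strategy
JOIN_OPERATORS = {
    "BroadcastHashJoin": "broadcast",
    "BroadcastNestedLoopJoin": "broadcast",
    "SortMergeJoin": "shuffle",
    "ShuffledHashJoin": "shuffle",
}


def join_strategies(plan_text: str) -> dict:
    """Return the join operators found in a physical plan and their strategy."""
    return {name: kind for name, kind in JOIN_OPERATORS.items() if name in plan_text}


sample_plan = """
== Physical Plan ==
BroadcastHashJoin [country_id#123], [country_id#456]
:- LocalTableScan [country_id#123, name#124]
+- *(1) FileScan parquet [order_id#789, country_id#456]
"""

found = join_strategies(sample_plan)
print(found)
```

A plan with several joins can contain both strategies, which is why the helper returns every operator it finds rather than a single label.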

---

### Method 2: Spark UI

```text
Steps to check in Spark UI:
1. Open Spark UI (http://localhost:4040 or your cluster UI)
2. Click "SQL" tab
3. Click on your query
4. Look at the DAG visualization:
   
   ✅ Broadcast join shows:
      - "BroadcastExchange" node
      - No "Exchange" (shuffle) on large table
   
   ❌ Shuffle join shows:
      - "Exchange" nodes on both tables
      - "Sort" operations
```

---

### Method 3: Check Metrics

```python
# In Spark UI, go to Stages tab
# Check stage metrics:

# ✅ Broadcast Join:
#    Shuffle Write: 0 B
#    Shuffle Read: 0 B

# ❌ Shuffle Join:
#    Shuffle Write: 50 GB
#    Shuffle Read: 50 GB
```

---

## Production Best Practices

### Practice 1: Always Broadcast Dimension Tables

```python
from pyspark.sql.functions import broadcast

# ✅ Do this for dimension tables that comfortably fit in executor memory
result = fact_table.join(broadcast(dim_table), "key")

# Common dimensions to broadcast:
countries = broadcast(spark.read.parquet("/countries"))
categories = broadcast(spark.read.parquet("/categories"))
products = broadcast(spark.read.parquet("/products"))
statuses = broadcast(spark.read.parquet("/statuses"))
```

---

### Practice 2: Order Your Joins

```python
# When joining multiple tables, broadcast small ones first

# ✅ GOOD ORDER:
result = (
    large_fact
    .join(broadcast(small_dim1), "key1")
    .join(broadcast(small_dim2), "key2")
    .join(medium_table, "key3")  # Only this shuffles
)

# ❌ BAD ORDER:
result = (
    large_fact
    .join(medium_table, "key3")           # Shuffles the full fact table first
    .join(broadcast(small_dim1), "key1")  # Rows dropped here were already shuffled
    .join(broadcast(small_dim2), "key2")
)
```

---

### Practice 3: Cache Broadcasted Tables

```python
# If using same small table multiple times, cache it

# Load and cache dimension
countries = spark.read.parquet("/countries").cache()
countries.count()  # Materialize cache

# Use in multiple joins
result1 = orders.join(broadcast(countries), "country_id")
result2 = customers.join(broadcast(countries), "country_id")
result3 = stores.join(broadcast(countries), "country_id")

# All three joins use cached version!
```

---

### Practice 4: Monitor and Alert

```python
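# Set up monitoring for shuffle operations
import re

def check_shuffle_size(df):
    """Warn if the physical plan contains a shuffle exchange"""
    
    # df.explain() only prints and returns None, so read the plan text
    # through the internal _jdf handle instead.
    # (In production, integrate with monitoring system)
    plan = df._jdf.queryExecution().executedPlan().toString()
    
    # Match "Exchange" (shuffle) but not "BroadcastExchange"
    if re.search(r"(?<!Broadcast)Exchange", plan):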
        print("⚠️ WARNING: Shuffle detected!")
        print("Consider using broadcast join if possible")
        return True
    else:
        print("✅ No shuffle - optimized join")
        return False

# Usage
result = orders.join(countries, "country_id")
check_shuffle_size(result)
```

---

## Common Production Mistakes

### Mistake 1: Broadcasting Too Large Tables

```python
# ❌ WRONG: Broadcasting 2 GB table
huge_dim = spark.read.parquet("/huge_dimension")  # 2 GB!
result = facts.join(broadcast(huge_dim), "key")

# What happens:
# - Each executor gets 2 GB in memory
# - 50 executors x 2 GB = 100 GB total memory!
# - Out of memory errors
# - Job fails

# ✅ CORRECT: Check size first
if should_broadcast(huge_dim, max_size_mb=100):
    result = facts.join(broadcast(huge_dim), "key")
else:
    result = facts.join(huge_dim, "key")
```

---

### Mistake 2: Forgetting to Broadcast

```python
# ❌ COMMON MISTAKE: Small table not broadcasted
products = spark.read.parquet("/products")  # Only 5 MB!
sales = spark.read.parquet("/sales")        # 1 TB

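result = sales.join(products, "product_id")  # No broadcast() hint

# Impact when Spark cannot auto-broadcast (for example, the size estimate
# for products is missing or exceeds the threshold):
# - Shuffles 1 TB of sales data unnecessarily
# - Takes 2 hours instead of 5 minutes
# - Costs $800 instead of $40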

# ✅ FIX: Add broadcast()
result = sales.join(broadcast(products), "product_id")
```

---

### Mistake 3: Multiple Unnecessary Shuffles

```python
# ❌ WRONG: Multiple shuffle joins
customers = spark.read.parquet("/customers")  # 50 GB
orders = spark.read.parquet("/orders")        # 100 GB
products = spark.read.parquet("/products")    # 200 MB
countries = spark.read.parquet("/countries")  # 10 KB

result = (
    customers.join(orders, "customer_id")    # Shuffle 1
             .join(products, "product_id")   # Shuffle 2
             .join(countries, "country_id")  # Shuffle 3
)

# Impact: 150 GB shuffled 3 times!

# ✅ CORRECT: Broadcast small tables
result = (
    customers.join(orders, "customer_id")               # Shuffle (unavoidable)
             .join(broadcast(products), "product_id")   # No shuffle!
             .join(broadcast(countries), "country_id")  # No shuffle!
)

# Impact: Only ONE shuffle!
```

---

## Decision Tree

```text
Need to join two tables?
    |
    v
Is one table < 10 MB?
    |
    +-- YES --> ✅ Use BROADCAST JOIN (automatic)
    |           result = large.join(broadcast(small), "key")
    |
    +-- NO --> Is one table < 100 MB?
               |
               +-- YES --> Do you have enough memory?
               |           |
               |           +-- YES --> ✅ Use BROADCAST JOIN (manual)
               |           |           spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100*1024*1024)
               |           |           result = large.join(broadcast(small), "key")
               |           |
               |           +-- NO --> ❌ Use SHUFFLE JOIN
               |                      result = table1.join(table2, "key")
               |
               +-- NO --> Are BOTH tables > 10 GB?
                          |
                          +-- YES --> ❌ Use SHUFFLE JOIN (no choice)
                          |           result = table1.join(table2, "key")
                          |
                          +-- NO --> Test both strategies
                                     Pick the faster one
```
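The decision tree can also be written as a small function. This is a sketch in plain Python that mirrors the thresholds in the tree (10 MB, 100 MB, 10 GB); `choose_join_strategy` and its parameter names are hypothetical, and real table sizes would have to come from your own measurements.

```python
MB = 1024 * 1024
GB = 1024 * MB


def choose_join_strategy(size_a, size_b, enough_executor_memory,
                         auto_threshold=10 * MB, manual_limit=100 * MB,
                         huge_table=10 * GB):
    """Walk the decision tree using the size of the smaller table."""
    smaller = min(size_a, size_b)
    if smaller < auto_threshold:
        return "broadcast (automatic)"
    if smaller < manual_limit:
        return "broadcast (manual)" if enough_executor_memory else "shuffle"
    if smaller > huge_table:  # the smaller side is huge, so both are
        return "shuffle"
    return "test both"


print(choose_join_strategy(2 * 1024, 50 * GB, True))
print(choose_join_strategy(20 * MB, 5 * 1024 * GB, True))
print(choose_join_strategy(500 * GB, 600 * GB, True))
```

Only the smaller table matters for the first two branches, because that is the side that would be broadcast.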

---

## Performance Comparison

### Real Numbers from Production

**Scenario:** Join 100M orders with 50 stores

| Metric | Shuffle Join | Broadcast Join | Improvement |
|--------|--------------|----------------|-------------|
| Network Transfer | 50 GB | 2 KB | 25,000,000x |
| Disk I/O | 100 GB | 0 GB | Eliminated |
| Execution Time | 15 min | 2 min | 7.5x faster |
| Cost per run | $200 | $20 | $180 saved |
| Monthly cost | $6,000 | $600 | $5,400 saved |

---

## Complete Working Example

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col, sum

# Initialize Spark
spark = SparkSession.builder \
    .appName("BroadcastJoinExample") \
    .config("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024) \
    .getOrCreate()

# Load dimension tables (small - always broadcast these)
countries = spark.read.parquet("/data/countries")  # 195 rows
categories = spark.read.parquet("/data/categories")  # 1000 rows
products = spark.read.parquet("/data/products")  # 50000 rows, 10 MB

# Cache small tables if used multiple times
countries.cache()
categories.cache()
products.cache()

# Materialize cache
countries.count()
categories.count()
products.count()

# Load fact tables (large)
orders = spark.read.parquet("/data/orders")  # 100M rows, 50 GB
customers = spark.read.parquet("/data/customers")  # 10M rows, 5 GB

# Perform joins with broadcast
result = orders \
    .join(broadcast(products), "product_id") \
    .join(broadcast(categories), "category_id") \
    .join(broadcast(countries), "country_id") \
    .join(customers, "customer_id")  # Only this shuffles

# Aggregate
final_result = result.groupBy("country_id", "category_id") \
    .agg(sum("amount").alias("total_sales"))

# Check execution plan
print("Execution Plan:")
final_result.explain()

# Execute and save
final_result.write.mode("overwrite").parquet("/output/sales_summary")

print("✅ Job completed successfully!")
```

---

## Summary

### Key Takeaways:

1. **Broadcast Join** is for small tables (< 100 MB)
   - Fast, no shuffle, minimal network
   - Use for dimension tables

2. **Shuffle Join** is for large tables (> 1 GB)
   - Slow, requires shuffle, lots of network
   - Use when both tables are huge

3. **Broadcast small dimension tables**
   - Countries, categories, statuses, etc.
   - Can be 10-100x faster than shuffling the fact table

4. **Check execution plans**
   - Look for "BroadcastHashJoin" vs "SortMergeJoin"
   - Monitor in Spark UI

5. **Production impact is HUGE**
   - Can save thousands of dollars per month
   - Can reduce execution time by 10x

---

## Quick Reference

```python
# ✅ CORRECT PATTERNS

# Pattern 1: Explicit broadcast
result = large_table.join(broadcast(small_table), "key")

# Pattern 2: Multiple broadcasts
result = large_table \
    .join(broadcast(dim1), "key1") \
    .join(broadcast(dim2), "key2") \
    .join(broadcast(dim3), "key3")

# Pattern 3: Cache + broadcast
small_table.cache().count()
result = large_table.join(broadcast(small_table), "key")

# Pattern 4: Configure threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

# ❌ WRONG PATTERNS

# Wrong 1: Forget broadcast
result = large_table.join(small_table, "key")  # Shuffles unnecessarily!

# Wrong 2: Broadcast huge table
result = table1.join(broadcast(huge_table), "key")  # OOM errors!

# Wrong 3: Multiple shuffles
result = t1.join(t2, "k1").join(t3, "k2").join(t4, "k3")  # All shuffle!
```

---

**Remember:** Broadcast small, shuffle large. Your wallet will thank you! 💰

# Spark SQL Engine & Catalyst Optimizer

## What is Spark SQL Engine?

Spark SQL Engine is the component that processes DataFrame and SQL queries in Spark. It optimizes your queries before execution to make them faster and more efficient.

---

## The Catalyst Optimizer Flow

```text
User Query
    |
    v
Unresolved Logical Plan
    |
    v
[Analysis Phase]
    |
    v
Resolved Logical Plan
    |
    v
[Optimization Phase]
    |
    v
Optimized Logical Plan
    |
    v
[Physical Planning]
    |
    v
Physical Plans
    |
    v
[Cost Model Selection]
    |
    v
Best Physical Plan
    |
    v
[Code Generation]
    |
    v
Executors (Execution)
```

---

## Phase 1: Unresolved Logical Plan

**What happens:**
- Spark parses your SQL or DataFrame code
- Creates initial logical plan
- Column names and table names are NOT verified yet

**Example:**
```python
df.select("name", "age").filter(col("age") > 30)
```

**Unresolved Plan:**
```text
Filter (age > 30)
  |
  v
Project (name, age)
  |
  v
Scan (table)
```

**Status:** Column names might be wrong, table might not exist - not checked yet!

---

## Phase 2: Analysis (Using Catalog)

**What happens:**
- Spark checks the **Catalog** (metadata store)
- Verifies tables exist
- Verifies columns exist
- Resolves data types
- Checks if query is valid

**Catalog contains:**
- Table schemas
- Column names and types
- Table locations
- Partitioning information
- Statistics

**Example:**
```python
# If "age" column doesn't exist, the Analysis phase raises an
# AnalysisException (the exact message varies by Spark version)
```

**After Analysis:**
```text
Resolved Logical Plan
- All tables verified ✓
- All columns verified ✓
- Data types resolved ✓
```
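To make the Analysis step concrete, here is a toy version in plain Python. It is a sketch of the idea only: `CATALOG`, `AnalysisError`, and `resolve_columns` are hypothetical names, and Spark's real analyzer works on plan trees rather than on lists of column names.

```python
class AnalysisError(Exception):
    """Raised when a table or column cannot be resolved (stand-in for Spark's AnalysisException)."""


# A tiny stand-in for the catalog: table name -> {column name: data type}
CATALOG = {
    "people": {"name": "string", "age": "int", "salary": "double"},
}


def resolve_columns(table, columns, catalog=CATALOG):
    """Check that the table and columns exist and return their resolved types."""
    if table not in catalog:
        raise AnalysisError(f"Table not found: {table}")
    schema = catalog[table]
    missing = [c for c in columns if c not in schema]
    if missing:
        raise AnalysisError(f"Cannot resolve columns {missing} in table {table}")
    return {c: schema[c] for c in columns}


resolved = resolve_columns("people", ["name", "age"])
print(resolved)

try:
    resolve_columns("people", ["height"])
except AnalysisError as err:
    print(f"Analysis failed: {err}")
```

The key point is that the failure happens before any data is read: the check only needs metadata.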

---

## Phase 3: Optimized Logical Plan

**What happens:**
- Catalyst Optimizer applies optimization rules
- Improves the query plan
- Makes query faster without changing results

**Common Optimizations:**

### 1. Predicate Pushdown
```python
# Your code
df = spark.read.parquet("/data")
df = df.select("name", "age", "salary")
df = df.filter(col("age") > 30)

# Spark optimizes to:
# Read parquet with filter age > 30 (filter happens at source!)
```

### 2. Column Pruning
```python
# Your code
df = spark.read.parquet("/data")  # Table has 50 columns
df = df.select("name", "age")     # Only need 2 columns

# Spark optimizes to:
# Read ONLY name and age columns (doesn't read all 50!)
```

### 3. Constant Folding
```python
# Your code (requires: from pyspark.sql.functions import col, lit)
df = df.filter(col("age") > (lit(20) + lit(10)))

# Spark optimizes to:
# df.filter(col("age") > 30)  # 20 + 10 is evaluated once, at planning time
```
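Constant folding is a rewrite rule over the expression tree: any sub-expression made only of literals is evaluated once during planning. The toy rule below shows the mechanism in plain Python. The tuple-based tree format and `fold_constants` are assumptions made for this sketch, not Catalyst's actual classes.

```python
import operator

# Arithmetic operators the toy rule knows how to evaluate
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(expr):
    """Recursively replace literal-only arithmetic with a single literal."""
    kind = expr[0]
    if kind in ("col", "lit"):
        return expr  # leaves are returned unchanged
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if op in OPS and left[0] == "lit" and right[0] == "lit":
        return ("lit", OPS[op](left[1], right[1]))
    return (op, left, right)


# age > (20 + 10)
predicate = (">", ("col", "age"), ("+", ("lit", 20), ("lit", 10)))
folded = fold_constants(predicate)
print(folded)
```

The column reference is left alone because its value is only known at run time; only the literal part of the tree is collapsed.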

### 4. Filter Combining
```python
# Your code
df = df.filter(col("country") == "USA")  # Keeps 10% of data
df = df.filter(col("age") > 30)          # Keeps 50% of data

# Spark combines adjacent filters into one:
# Filter (country = 'USA' AND age > 30)
# Both conditions run in a single pass and can be pushed to the source together
```

---

## Phase 4: Physical Plans

**What happens:**
- Catalyst generates MULTIPLE physical execution plans
- Each plan is a different way to execute the query

**Example Physical Plans:**

**Plan 1:** Broadcast Join
```text
BroadcastHashJoin
  - Broadcast small table
  - Join with large table
Cost: Low (if small table fits in memory)
```

**Plan 2:** Shuffle Join
```text
SortMergeJoin
  - Shuffle both tables
  - Sort and merge
Cost: High (shuffle overhead)
```

**Plan 3:** Different partition strategy
```text
Use different number of partitions
Cost: Varies
```

---

## Phase 5: Cost Model

**What happens:**
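- Spark estimates the **cost** of each candidate plan (in practice many choices, such as the join strategy, come from size-based rules; full cost-based optimization is opt-in via `spark.sql.cbo.enabled`)
- Cost based on: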
  - Data size
  - Number of partitions
  - Shuffle operations
  - Memory usage
  - CPU usage

**Cost Model considers:**
```text
- How much data to shuffle?
- How much memory needed?
- How many disk I/O operations?
- Network bandwidth required?
- Number of tasks to create?
```

**Example (illustrative scores, not values Spark reports):**
```text
Plan 1 (Broadcast Join):
  - Shuffle: 0 GB
  - Memory: 10 MB
  - Cost Score: 100

Plan 2 (Shuffle Join):
  - Shuffle: 50 GB
  - Memory: 5 GB
  - Cost Score: 10000

Winner: Plan 1 (lowest cost!)
```
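Selecting the cheapest plan boils down to scoring each candidate and taking the minimum. The sketch below shows that step in plain Python. The `CandidatePlan` class, the `estimated_cost` function, and its weights are invented for illustration and do not correspond to Spark's internal cost formulas.

```python
from dataclasses import dataclass


@dataclass
class CandidatePlan:
    name: str
    shuffle_gb: float
    memory_gb: float


def estimated_cost(plan, shuffle_weight=200.0, memory_weight=1.0):
    """Toy cost: shuffled data is weighted far more heavily than memory use."""
    return plan.shuffle_gb * shuffle_weight + plan.memory_gb * memory_weight


candidates = [
    CandidatePlan("BroadcastHashJoin", shuffle_gb=0.0, memory_gb=0.01),
    CandidatePlan("SortMergeJoin", shuffle_gb=50.0, memory_gb=5.0),
]

best = min(candidates, key=estimated_cost)
print(f"Selected plan: {best.name}")
```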

---

## Phase 6: Best Physical Plan Selection

**What happens:**
- Spark picks the plan with LOWEST cost
- This becomes the execution plan
- Sent to executors for actual execution

---

## Phase 7: Execution on Executors

**What happens:**
- Best physical plan is executed
- Distributed across executors
- Tasks run in parallel
- Results collected

---

## Complete Example

```python
# User writes query
df = spark.read.parquet("/sales")
df = df.filter(col("year") == 2024)
df = df.filter(col("country") == "USA")
df = df.select("product", "revenue")
df = df.groupBy("product").sum("revenue")
df.show()  # show() prints the rows and returns None
```

### Step-by-Step through Catalyst:

**1. Unresolved Logical Plan:**
```text
Aggregate (product, sum(revenue))
  |
Project (product, revenue)
  |
Filter (country = USA)
  |
Filter (year = 2024)
  |
Scan (/sales)
```

**2. Analysis (check Catalog):**
```text
✓ Table /sales exists
✓ Columns: product, revenue, country, year exist
✓ Data types match
Result: Resolved Logical Plan
```

**3. Optimization:**
```text
Optimized plan:
- Push filters to parquet read
- Read only: product, revenue, year, country
- Filter at source: year=2024 AND country=USA
- Then aggregate

Result: Read 1GB instead of 100GB!
```

**4. Physical Plans generated:**
```text
Plan A: Hash aggregation, 200 partitions
Plan B: Sort-based aggregation, 100 partitions
Plan C: Partial aggregation, 200 partitions
```

**5. Cost Model evaluates:**
```text
Plan A cost: 1000
Plan B cost: 1500
Plan C cost: 800 <- WINNER!
```

**6. Execute Plan C on executors**

---

## Why This Matters in Production

### Without Catalyst Optimizer:
```python
# Naive execution
df = spark.read.parquet("/data")  # Reads 100GB, all 50 columns
df = df.filter(col("age") > 30)   # Filters in Spark
df = df.select("name", "age")     # Drops columns after reading

# Time: 30 minutes
# Data read: 100 GB
```

### With Catalyst Optimizer:
```python
# Same code, but Catalyst optimizes
df = spark.read.parquet("/data")  # Reads only name, age columns with filter
df = df.filter(col("age") > 30)   # Filter pushed to parquet
df = df.select("name", "age")     # Already optimized

# Time: 3 minutes (10x faster!)
# Data read: 5 GB (95% less!)
```

---

## Key Components Summary

| Component | Role |
|-----------|------|
| Unresolved Logical Plan | Initial parse of query |
| Catalog | Metadata store (tables, columns, types) |
| Analysis | Verify query is valid |
| Optimizer | Apply optimization rules |
| Physical Plans | Different execution strategies |
| Cost Model | Evaluate each plan's cost |
| Best Plan | Lowest cost plan selected |
| Executors | Execute the plan |

---

## Common Optimizations Applied

1. **Predicate Pushdown** - Push filters to data source
2. **Column Pruning** - Read only needed columns
3. **Constant Folding** - Simplify expressions
4. **Join Reordering** - Optimize join sequence
5. **Broadcast Join** - Broadcast small tables
6. **Partition Pruning** - Skip irrelevant partitions

---

## How to See the Plan

```python
# See all phases
df.explain(True)

# Output shows:
# == Parsed Logical Plan ==
# == Analyzed Logical Plan ==
# == Optimized Logical Plan ==
# == Physical Plan ==
```

**Example Output:**
```text
== Optimized Logical Plan ==
Aggregate [product], [sum(revenue)]
  +- Project [product, revenue]
      +- Filter (year = 2024 AND country = USA)
          +- Relation [product, revenue, year, country]
          
== Physical Plan ==
HashAggregate
  +- Exchange hashpartitioning(product)
      +- HashAggregate
          +- FileScan parquet [product, revenue, year, country]
              PushedFilters: [year=2024, country=USA]
```

---

## Production Tips

### Tip 1: Trust the Optimizer
```python
# Don't over-optimize manually
# Catalyst is very smart!

# Version A: three separate steps
df = df.filter(col("age") > 30)
df = df.select("name", "age")
df = df.filter(col("name").isNotNull())

# Version B: one combined filter
df = df.filter((col("age") > 30) & col("name").isNotNull())
df = df.select("name", "age")
# Catalyst produces the same optimized plan for both, so write whichever reads best
```

### Tip 2: Check Execution Plans
```python
# Always check plans for expensive queries
df.explain()

# Look for:
# ✓ PushedFilters (good!)
# ✓ BroadcastHashJoin (good for small tables!)
# ✗ CartesianProduct (very bad!)
# ✗ No pushed filters (might be bad)
```

### Tip 3: Collect Statistics
```python
# Help Catalyst make better decisions
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

# Statistics improve size estimates (e.g. for broadcast decisions);
# the full cost-based optimizer also needs spark.sql.cbo.enabled=true
```

---

## Summary

**Catalyst Optimizer automatically:**
- Validates your query (Analysis)
- Optimizes execution plan (Optimization)
- Generates multiple strategies (Physical Plans)
- Picks the fastest one (Cost Model)
- Executes efficiently (Executors)

**Result:** Queries often run much faster than a naive execution of the same code, with no manual tuning.

---

## Quick Reference

```text
Query Flow:
User Code -> Unresolved Plan -> Analysis -> Resolved Plan 
-> Optimization -> Optimized Plan -> Physical Plans 
-> Cost Model -> Best Plan -> Executors -> Results

Key Insight:
Spark doesn't execute what you write.
It executes the OPTIMIZED version of what you write!
```

---

**Remember:** Catalyst Optimizer is working behind the scenes to make your queries faster. Trust it, but verify with explain()! 🚀

# Spark Memory Management & Advanced Concepts

## Driver Memory Management

### Two Main Sections:

**1. JVM Heap Memory**
- Core memory used by Driver's JVM
- Stores DAG, metadata, broadcast variables, task scheduling info

**2. Overhead Memory**
- Non-heap memory for JVM threads, shared libraries, native code

```bash
# Configure driver memory
spark-submit \
  --driver-memory 4g \
  --conf spark.driver.memoryOverhead=1g \
  app.py
```

---

## Executor Memory Management

### Three Main Sections:

```text
Total Executor Memory (Example: 4 GB)
==========================================
[Reserved Memory: 300 MB] - Fixed for Spark
[Spark Memory Pool: 60% of (heap - 300 MB)] - For Spark operations
[User Memory: 40% of (heap - 300 MB)] - For your code
```

### Spark Memory Pool (60% of (total heap - 300 MB))

**Split into two parts:**

**1. Storage Memory (50% of Spark Pool)**
- Used for caching (cache(), persist())
- Long-term storage
- Data persists across operations

**2. Execution Memory (50% of Spark Pool)**
- Used for transformations (joins, sorts, aggregations)
- Short-term, temporary
- Released after operation completes

```bash
# Configure executor memory
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  app.py
```

**Configuration:**
```python
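# Launch-time settings: set them when the session is created
# (or via spark-submit --conf); they cannot be changed afterwards.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.memory.fraction", "0.6")         # Default: 60% for Spark Pool
    .config("spark.memory.storageFraction", "0.5")  # Default: 50% of Spark Pool for Storage
    .getOrCreate()
)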
```
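The region sizes follow directly from these two fractions and the 300 MB reserve. Here is a minimal sketch of the arithmetic in plain Python; `executor_memory_regions` is a hypothetical helper and it models the JVM heap only (overhead memory is not included).

```python
RESERVED_MB = 300  # fixed reserve described above


def executor_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap into the regions described in this section (sizes in MB)."""
    usable = heap_mb - RESERVED_MB
    spark_pool = usable * memory_fraction
    storage = spark_pool * storage_fraction
    return {
        "reserved": RESERVED_MB,
        "spark_pool": spark_pool,
        "storage": storage,
        "execution": spark_pool - storage,
        "user": usable - spark_pool,
    }


regions = executor_memory_regions(4096)  # the 4 GB example
for name, size_mb in regions.items():
    print(f"{name:>10}: {size_mb:8.1f} MB")
```

For a 4 GB heap this gives roughly 1.1 GB each for storage and execution, which is noticeably less than a naive "half of 60% of 4 GB".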

---

## Unified Memory Management

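**Key Feature:** Storage and Execution Memory can borrow from each other! Execution can evict cached blocks to reclaim borrowed space (down to the `storageFraction` floor), but storage cannot evict execution.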

```text
Normal State:
[Storage: 50%] [Execution: 50%]

When Execution needs more:
[Storage: 30%] [Execution: 70%] <- Borrowed!

When Storage needs more:
[Storage: 70%] [Execution: 30%] <- Borrowed!
```

**Benefits:**
- Dynamic allocation
- Better memory utilization
- Automatic optimization

---

## Out of Memory (OOM) Errors

### What Causes OOM?

**Example Scenario:**
```python
# Huge groupBy on skewed data
df.groupBy("ProductCategory").count()

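# If "Food" category has 1M rows (far more than any other key):
# - One task must process almost all of the data
# - Tries to fit in Execution Memory (about 1.1 GB on a 4 GB executor)
# - Memory fills up
# - Borrows from Storage
# - Still not enough
# - Spills to disk (slow)
# - If the data cannot be spilled, or overhead memory runs out: OOM ERROR!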
```

### Data Spill

When memory is full, Spark writes to disk temporarily.

**In Spark UI, look for:**
- Spill (Memory): Size of the spilled data as it was held in memory (deserialized)
- Spill (Disk): Size of the same data after it was written to disk (serialized)

**High spill = Performance problem!**

---

## Salting (Fix for Data Skew)

### Problem:
```text
Product Categories:
- Food: 1,000,000 rows (SKEWED!)
- Shoes: 100 rows
- Dairy: 200 rows

Result: One executor overloaded, others idle
```

### Solution - Add Salt:
```python
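from pyspark.sql.functions import rand, concat, lit, col, sum as spark_sum

# Add random salt (0-3)
df_salted = df.withColumn("salt", (rand() * 4).cast("int"))
df_salted = df_salted.withColumn("salted_key", 
                                  concat(col("ProductCategory"), lit("_"), col("salt")))

# Stage 1: groupBy on salted key (keep the category for stage 2)
partial = df_salted.groupBy("ProductCategory", "salted_key").count()

# Food is split into:
# Food_0, Food_1, Food_2, Food_3
# Distributed evenly across executors!

# Stage 2: combine the partial counts into one row per category
result = partial.groupBy("ProductCategory").agg(spark_sum("count").alias("count"))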
```
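Salting only helps if the partial results are merged again afterwards. The plain-Python sketch below shows the two stages on a small skewed list. `salted_count` is a hypothetical helper, and a seeded `random.Random` stands in for Spark's `rand()`.

```python
import random
from collections import Counter


def salted_count(keys, num_salts, seed=0):
    """Two-stage count: per (key, salt) first, then per key."""
    rng = random.Random(seed)
    # Stage 1: the hot key is spread over up to num_salts sub-keys
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and add the partial counts back together
    totals = Counter()
    for (key, _salt), count in partial.items():
        totals[key] += count
    return partial, totals


keys = ["Food"] * 1000 + ["Shoes"] * 10 + ["Dairy"] * 20
partial, totals = salted_count(keys, num_salts=4)

largest_food_bucket = max(n for (key, _), n in partial.items() if key == "Food")
print(f"Largest Food bucket after salting: {largest_food_bucket} of 1000 rows")
print(f"Final totals: {dict(totals)}")
```

The final totals match an unsalted count, while no single bucket has to hold all the rows of the hot key.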

---

## Caching & Persist

### When to Cache?

```python
# BAD - Used only once
df = spark.read.parquet("/data")
df.count()  # No cache needed

# GOOD - Used multiple times
df = spark.read.parquet("/data")
df.cache()  # Cache it!

df.filter(col("age") > 30).count()  # Use 1
df.filter(col("salary") > 50000).count()  # Use 2
df.groupBy("dept").count().show()  # Use 3

# Without cache: Reads /data 3 times
# With cache: Reads /data once, reuses cached data
```

### cache() vs persist()

```python
from pyspark import StorageLevel

# For DataFrames, cache() is a shortcut for persist() with MEMORY_AND_DISK
df.cache()

# persist() gives you control
df.persist(StorageLevel.MEMORY_ONLY)
df.persist(StorageLevel.DISK_ONLY)
```

### Unpersist (Free Memory)

```python
df.cache()
# ... use df ...
df.unpersist()  # Free the memory!
```

---

## Storage Levels

### 1. MEMORY_ONLY
- Fastest, but partitions that do not fit in memory are not cached and are recomputed when needed
- Use when you have enough memory

```python
df.persist(StorageLevel.MEMORY_ONLY)
```

### 2. MEMORY_AND_DISK (Default)
- Tries memory first, spills to disk if needed
- Safe, balanced approach
- Default for cache()

```python
df.cache()  # Uses MEMORY_AND_DISK
```

### 3. DISK_ONLY
- Stores only on disk
- Slow, rarely used

```python
df.persist(StorageLevel.DISK_ONLY)
```

### 4. MEMORY_ONLY_2
- Stores in memory on TWO executors (replication)
- Fault tolerant
- Uses 2x memory

```python
# Critical data that must not be lost
df.persist(StorageLevel.MEMORY_ONLY_2)
```

### 5. OFF_HEAP
- Stores outside JVM heap (in OS memory)
- Avoids garbage collection pauses
- For large executors (> 32 GB)

```python
# Launch-time settings (set via SparkSession.builder.config or spark-submit --conf):
#   spark.memory.offHeap.enabled=true
#   spark.memory.offHeap.size=10g

df.persist(StorageLevel.OFF_HEAP)
```

---

## Storage Levels Comparison

| Storage Level | Memory | Disk | Speed | Use Case |
|--------------|--------|------|-------|----------|
| MEMORY_ONLY | Yes | No | Fastest | Small data, lots of RAM |
| MEMORY_AND_DISK | Yes | Yes | Fast | Default, safe choice |
| DISK_ONLY | No | Yes | Slow | Rarely used |
| MEMORY_ONLY_2 | Yes (2x) | No | Fastest | Critical, fault-tolerant |
| OFF_HEAP | Yes (off-heap) | No | Fast | Large executors, avoid GC |

---

## Client Mode vs Cluster Mode

### Client Mode
- Driver runs on the machine where you run spark-submit (your laptop or an edge node)
- Good for development, notebooks
- Your machine must stay connected

```bash
spark-submit --deploy-mode client app.py
```

**Visual:**
```text
Your Laptop (Driver)
    |
    v
Cluster (Executors only)
```

**Use for:**
- Development
- Jupyter notebooks
- Interactive analysis

---

### Cluster Mode
- Driver runs ON the cluster
- Production ready
- Can submit job and disconnect

```bash
spark-submit --deploy-mode cluster app.py
```

**Visual:**
```text
Your Laptop (submit and disconnect)
    |
    v
Cluster (Driver + Executors)
```

**Use for:**
- Production jobs
- Scheduled jobs
- Long-running jobs

---

## Edge Node

**What it is:**
Gateway machine between you and the cluster.

```text
Your Laptop
    |
    | SSH
    v
Edge Node (Gateway)
    |
    v
Hadoop/Spark Cluster
```

**Purpose:**
- Submit jobs to cluster
- Security (don't expose cluster directly)
- Centralized tools

**Workflow:**
```bash
# SSH to edge node
ssh user@edge-node.company.com

# Submit job from edge node
spark-submit --master yarn app.py
```

---

## Partition Pruning

**What it is:**
Skip reading partitions that don't contain relevant data.

### Example:

**Partitioned Data:**
```python
# Write partitioned by year
df.write.partitionBy("year").parquet("/sales")

# Directory structure:
# /sales/year=2020/
# /sales/year=2021/
# /sales/year=2022/
# /sales/year=2023/
# /sales/year=2024/
```

**Query with Pruning:**
```python
from pyspark.sql.functions import col

df = spark.read.parquet("/sales")
result = df.filter(col("year") == 2024)

# Spark reads ONLY /sales/year=2024/
# Skips 2020, 2021, 2022, 2023
# Reads ~20% of the data instead of 100% (assuming similar-sized years)
```

**Performance Impact:**
```text
Illustrative numbers (actual timings depend on cluster and storage):
Without Pruning: Read 1 TB, filter to 200 GB (30 min)
With Pruning: Read 200 GB directly (5 min)
Savings: 5x less data read, about 6x faster
```
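
To make the mechanism concrete, here is a pure-Python sketch of the idea behind pruning: the partition value is encoded in the directory name, so whole directories can be ruled out before any file is opened. It is a simplified model, not Spark's implementation, and `prune_partitions` is a hypothetical helper.

```python
def prune_partitions(partition_dirs, column, predicate):
    """Keep only directories whose `column=value` segment satisfies the predicate."""
    kept = []
    for path in partition_dirs:
        # Hive-style layout: each partition column appears as a `name=value` segment.
        segments = dict(
            seg.split("=", 1) for seg in path.strip("/").split("/") if "=" in seg
        )
        if column in segments and predicate(segments[column]):
            kept.append(path)
    return kept


dirs = [f"/sales/year={year}/" for year in range(2020, 2025)]
to_read = prune_partitions(dirs, "year", lambda value: int(value) == 2024)
print(to_read)  # ['/sales/year=2024/']
```

In Spark itself, you can check that pruning happened by calling `result.explain()` and looking for the filter under `PartitionFilters` in the file scan node.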

---

## AQE (Adaptive Query Execution)

**What it is:**
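Spark re-optimizes the query plan DURING execution, using statistics collected at runtime (for example, actual shuffle sizes) instead of pre-execution estimates.

**Enable AQE** (already on by default since Spark 3.2):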
```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
```

### Three Main Features:

### 1. Dynamically Coalescing Partitions

**Problem:**
```python
df = df.filter(col("active") == True)  # 99% filtered out
# The next shuffle (join, groupBy, ...) still creates 1000 partitions
# (spark.sql.shuffle.partitions), and most of them are tiny!
```

**AQE Solution:**
```text
Detects many small partitions
Merges them automatically
1000 partitions -> 50 partitions
Less overhead, better performance!
```

```python
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```
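
The merging step can be pictured as a greedy pass over adjacent shuffle partitions. The sketch below is a simplified model under assumed sizes, not Spark's actual algorithm. `coalesce_partitions` is a hypothetical helper, and `target_mb` plays the role that `spark.sql.adaptive.advisoryPartitionSizeInBytes` plays in Spark.

```python
def coalesce_partitions(sizes_mb, target_mb=64):
    """Greedily merge adjacent partitions so each group stays within target_mb."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes = [1] * 1000  # 1000 shuffle partitions of about 1 MB each
groups = coalesce_partitions(sizes, target_mb=20)
print(len(sizes), "->", len(groups))  # 1000 -> 50
```

A partition that is already larger than the target stays in a group of its own, so coalescing never makes large partitions bigger.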

---

### 2. Dynamically Switching Join Strategies

**Problem:**
```python
# Estimated: 50 MB (shuffle join chosen)
# Actual after filter: 5 MB (should broadcast!)
```

**AQE Solution:**
```text
Realizes actual size is small
Switches to broadcast join mid-execution
Much faster!
```

```python
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "10MB")
```
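
The decision itself is a size comparison. What AQE changes is when the comparison is made and which size it uses. A minimal sketch, assuming a 10 MB threshold as in the setting above (`choose_join` is a hypothetical helper, not a Spark API):

```python
MB = 1024 * 1024


def choose_join(size_bytes, broadcast_threshold=10 * MB):
    """Pick a join strategy from the (estimated or measured) size of the smaller side."""
    return "broadcast" if size_bytes <= broadcast_threshold else "sort_merge"


planned = choose_join(50 * MB)   # planning time: based on the 50 MB estimate
replanned = choose_join(5 * MB)  # runtime: based on the 5 MB actually produced
print(planned, "->", replanned)  # sort_merge -> broadcast
```

Without AQE only the first call happens, so the plan stays on the shuffle-based join even though the data turned out to be small enough to broadcast.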

---

### 3. Dynamically Optimizing Skew Joins

**Problem:**
```text
User "popular": 1M records (skewed!)
Other users: 100 records each
One executor overloaded!
```

**AQE Solution:**
```text
Detects skew
Splits large partition
Distributes across executors
Balanced execution!
```

```python
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```
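
The detect-and-split steps can be sketched in plain Python. This is a simplified model, not Spark's code. The default values are meant to mirror `spark.sql.adaptive.skewJoin.skewedPartitionFactor` (5), `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` (256 MB) and `spark.sql.adaptive.advisoryPartitionSizeInBytes` (64 MB), so treat them as assumptions and check the documentation for your Spark version.

```python
import math
from statistics import median


def plan_skew_splits(sizes_mb, factor=5, threshold_mb=256, target_mb=64):
    """Return {partition index: number of sub-partitions to process it in}."""
    med = median(sizes_mb)
    plan = {}
    for idx, size in enumerate(sizes_mb):
        # Skewed = much larger than the median AND large in absolute terms.
        skewed = size > factor * med and size > threshold_mb
        plan[idx] = math.ceil(size / target_mb) if skewed else 1
    return plan


sizes = [10, 12, 11, 1000]  # partition 3 holds the "popular" key
plan = plan_skew_splits(sizes)
print(plan)  # {0: 1, 1: 1, 2: 1, 3: 16}
```

Requiring both conditions keeps small jobs untouched: a 50 MB partition among 1 MB partitions is far above the median but still cheap to process as a single task.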

---

## Complete AQE Configuration

```python
# Enable AQE (Spark 3.0+, on by default since 3.2)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Auto broadcast at runtime (Spark 3.2+)
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "10MB")

# Skew join
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

---

## Production Best Practices

### Memory Management
```bash
# Don't over-allocate
# Rule of thumb: 4-8 GB of executor memory per executor
# More, smaller executors usually beat a few huge ones

spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 20 \
  app.py
```
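
To see what the command above asks the cluster for, it helps to add the per-executor memory overhead to the heap size. The sketch assumes Spark's documented default for `spark.executor.memoryOverhead` (10% of executor memory, with a 384 MiB minimum). Some deployments, such as non-JVM jobs on Kubernetes, use a different factor. `container_memory_mib` is a hypothetical helper.

```python
def container_memory_mib(executor_memory_mib, overhead_factor=0.10, min_overhead_mib=384):
    """Approximate memory requested per executor container: heap plus overhead."""
    overhead = max(min_overhead_mib, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead


# Values from the spark-submit example: 20 executors, 4 cores and 8g each.
num_executors, executor_cores, executor_memory_mib = 20, 4, 8 * 1024

per_container = container_memory_mib(executor_memory_mib)
total_cores = num_executors * executor_cores
total_memory_gib = num_executors * per_container / 1024
print(per_container, total_cores, round(total_memory_gib))  # 9011 80 176
```

So the job needs roughly 176 GiB and 80 cores from the cluster manager, noticeably more than the 160 GB suggested by multiplying `--executor-memory` by `--num-executors` alone.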

### Caching
```python
# Cache only what's reused
df.cache()
# ... use multiple times ...
df.unpersist()  # Always unpersist when done!
```
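
One way to make "always unpersist" hard to forget is a context manager that pairs `cache()` with `unpersist()`. This is a sketch, not a Spark API: `cached` is a hypothetical helper, and `FakeFrame` is a stand-in class so the example runs without a cluster. A real DataFrame exposes the same two methods.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of the with-block, then always unpersist it."""
    df.cache()
    try:
        yield df
    finally:
        # Runs even if the block raises, so cached blocks are not leaked.
        df.unpersist()


class FakeFrame:
    """Hypothetical stand-in for a DataFrame (only cache/unpersist are modeled)."""

    def __init__(self):
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


frame = FakeFrame()
with cached(frame) as f:
    inside = f.is_cached   # cached while the block runs
after = frame.is_cached    # released once the block exits
print(inside, after)       # True False
```

With a real DataFrame the usage is the same: `with cached(df) as d:` followed by the actions that reuse `d`.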

### Storage Levels
```python
from pyspark import StorageLevel

# For most cases
df.cache()  # MEMORY_AND_DISK for DataFrames

# For critical small data
df.persist(StorageLevel.MEMORY_ONLY_2)

# For large executors (requires spark.memory.offHeap.enabled=true)
df.persist(StorageLevel.OFF_HEAP)
```

### Partition Pruning
```python
# Partition large tables on low-cardinality columns you filter by
df.write.partitionBy("date", "country").parquet("/data")

# Filter on partition columns when reading the data back
spark.read.parquet("/data").filter(col("date") == "2024-01-01")  # Uses pruning!
```

### AQE
```python
# Available in Spark 3.0+, enabled by default since 3.2
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

---

## Common Issues & Solutions

### Issue 1: OOM Errors
**Solution:**
- Increase executor memory
- Reduce partition size (more partitions)
- Use salting for skewed data
- Avoid caching too much

### Issue 2: Slow Performance with Cache
**Solution:**
- Use MEMORY_ONLY if you have RAM
- Unpersist when done
- Check if data actually fits in memory

### Issue 3: Skewed Joins
**Solution:**
- Use salting technique
- Enable AQE skew join optimization
- Repartition by skewed key with salt
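
The salting technique can be sketched without Spark. The skewed side gets a salt appended to its key so one hot key becomes several, and the other side is duplicated once per salt so every salted key still finds its match. The names and the salt count are illustrative assumptions. In a real job the salt is usually a random integer column rather than the row number used here to keep the output deterministic.

```python
from collections import Counter

NUM_SALTS = 4  # assumed value; tune to the degree of skew


def salt_key(key, row_number, num_salts=NUM_SALTS):
    """Skewed side: spread one key over num_salts salted keys."""
    return f"{key}_{row_number % num_salts}"


def explode_key(key, num_salts=NUM_SALTS):
    """Other side: emit one copy per salt so every salted key can match."""
    return [f"{key}_{salt}" for salt in range(num_salts)]


# 1000 rows for the hot key, 10 rows for a normal key.
big_side = [("popular", i) for i in range(1000)] + [("rare", i) for i in range(10)]
rows_per_key = Counter(salt_key(key, i) for key, i in big_side)

print(rows_per_key["popular_0"])  # 250 rows instead of 1000 in one partition
print(explode_key("popular"))     # ['popular_0', 'popular_1', 'popular_2', 'popular_3']
```

The cost is that the non-skewed side grows by a factor of `NUM_SALTS`, so salting pays off mainly when that side is small relative to the skewed one.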

### Issue 4: Too Many Small Files
**Solution:**
- Coalesce before writing
- Avoid over-partitioning (don't `partitionBy` high-cardinality columns)
- Enable AQE coalescing

---

## Quick Reference

### Memory Formula
```text
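Usable Memory = Executor Heap - Reserved (300 MB)
Spark Pool = 60% of Usable (spark.memory.fraction), User Memory = remaining 40%
Spark Pool = Storage (50%) + Execution (50%), soft boundary (spark.memory.storageFraction)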
```
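
As a worked example of the formula, the sketch below splits an 8 GB executor heap using the default fractions (`spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`). `memory_layout` is a hypothetical helper. The figures are approximate, because the heap the JVM actually reports is slightly smaller than the configured `--executor-memory`.

```python
def memory_layout(heap_mb, memory_fraction=0.6, storage_fraction=0.5, reserved_mb=300):
    """Split an executor heap into reserved, user, storage and execution memory (MB)."""
    usable = heap_mb - reserved_mb
    spark_pool = usable * memory_fraction
    storage = spark_pool * storage_fraction
    return {
        "reserved": reserved_mb,
        "user": round(usable - spark_pool, 1),
        "spark_pool": round(spark_pool, 1),
        "storage": round(storage, 1),
        "execution": round(spark_pool - storage, 1),
    }


layout = memory_layout(8 * 1024)
for region, size_mb in layout.items():
    print(f"{region:>10}: {size_mb} MB")
```

For an 8 GB heap this gives about 4.6 GB for the Spark pool, shared between storage and execution, and about 3.1 GB of user memory. Because the storage/execution boundary is soft, either side can borrow unused space from the other.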

### Cache Decision
```text
Used once? -> No cache
Used 2-3 times? -> Maybe cache
Used 4+ times? -> Definitely cache
```

### Deployment Mode
```text
Development? -> Client mode
Production? -> Cluster mode
```

### Storage Level
```text
Small, reused often? -> MEMORY_ONLY
Default case? -> MEMORY_AND_DISK (cache())
Critical data? -> MEMORY_ONLY_2
Large executors? -> OFF_HEAP
```

---

## Summary

**Memory Management:**
- Understand executor memory layout
- Monitor spills in Spark UI
- Use salting for skewed data

**Caching:**
- Cache only reused data
- Choose appropriate storage level
- Unpersist when done

**Deployment:**
- Client mode for development
- Cluster mode for production
- Use edge nodes for security

**Optimization:**
- Use partition pruning (partition tables by common filters)
- Enable AQE in Spark 3.0+
- Monitor and tune based on metrics

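**Key Insight:** Proper memory management and caching can speed up Spark jobs substantially, sometimes by an order of magnitude or more, but the gain depends on the workload, so confirm it in the Spark UI.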