In [3]:
# --- 1) Point PySpark to the exact Python you're running right now ---
import os, sys

py = sys.executable  # e.g., C:\Users\PX\anaconda3\envs\music-chatbot\python.exe
os.environ["PYSPARK_DRIVER_PYTHON"] = py
os.environ["PYSPARK_PYTHON"] = py

# --- 2) Stop any existing Spark session cleanly (if already created) ---
try:
    spark.stop()
except Exception:
    pass

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("PySpark-Windows-Fix")
    # Ensures executors pick the same Python as the driver
    .config("spark.pyspark.driver.python", py)
    .config("spark.pyspark.python", py)
    # Optional: Arrow speeds up Pandas ↔ Spark conversions if available
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# 💤 Lazy Evaluation in Spark

**Definition:**  
Spark does not execute transformations immediately.  
Instead, it **builds a logical plan (DAG)** of all transformations and only executes them when an **action** is called.

---

## 🧠 How it works:
1. **Transformations** (e.g., `select`, `filter`, `map`, `withColumn`)
   - **Lazy:** They just *record* the steps in a DAG.
   - No actual computation or data movement happens yet.

2. **Actions** (e.g., `show`, `count`, `collect`, `write`)
   - **Trigger execution:** Spark optimizes the DAG, splits it into stages, schedules tasks, and computes results.
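
The split between recording and executing can be sketched in plain Python. This is a toy model, not Spark's implementation: the `LazyPipeline` class and its methods are hypothetical names chosen for illustration, and the "plan" is just a tuple of recorded steps.

```python
# Toy model of lazy evaluation in plain Python (NOT Spark's actual implementation).
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # the recorded "plan"

    def filter(self, predicate):
        # Transformation: return a new pipeline with one more recorded step.
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only recorded, never executed here.
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def collect(self):
        # Action: the recorded steps are executed now, in order.
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


calls = []  # every predicate evaluation is logged here


def is_even(x):
    calls.append(x)
    return x % 2 == 0


pipeline = LazyPipeline(range(1, 10)).filter(is_even).map(lambda x: (x, x * 2))
calls_before = len(calls)    # 0: defining the pipeline evaluated nothing
result = pipeline.collect()  # the action runs the recorded steps

print(calls_before)  # 0
print(result)        # [(2, 4), (4, 8), (6, 12), (8, 16)]
print(len(calls))    # 9
```

The predicate is not called while the pipeline is being defined. It runs once per input row only when `collect()` is invoked, which mirrors how `show()` or `count()` triggers a Spark job.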

---

## 📊 Benefits of Lazy Evaluation
- **Optimization:** Spark can reorder operations, combine filters, and minimize shuffles before running.
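- **Efficiency:** Avoids doing unnecessary work (e.g., if a query only uses a few columns of a wide table, Spark can prune the unused columns early in the plan instead of carrying them through every step).
- **Fault tolerance:** The DAG records each partition's lineage, so Spark recomputes only the lost partitions when a node fails.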

---

## 🎯 Typical Interview Questions
- What is lazy evaluation in Spark?
- Give an example of a transformation and an action.
- Why does Spark use lazy evaluation?
- How does lazy evaluation improve performance?


In [4]:
from pyspark.sql import functions as F
import time

# Create a DataFrame
df_lazy = spark.range(1, 10)

# Define transformations (LAZY, nothing runs yet)
df_transformed = df_lazy.filter(df_lazy.id % 2 == 0) \
                        .withColumn("double_val", df_lazy.id * 2)

print("✅ Transformations defined, but no computation has run yet!")

# Trigger an action (ACTUAL EXECUTION happens here)
start = time.time()
df_transformed.show()
end = time.time()

print(f"⏱️ Action triggered execution; took {end - start:.4f} seconds")


✅ Transformations defined, but no computation has run yet!
+---+----------+
| id|double_val|
+---+----------+
|  2|         4|
|  4|         8|
|  6|        12|
|  8|        16|
+---+----------+

⏱️ Action triggered execution; took 1.5328 seconds


# 🎯 Lazy Evaluation – Interview Answers

---

## ❓ What is lazy evaluation in Spark?
💡 Lazy evaluation means Spark does **not execute transformations immediately**.  
Instead, it builds a **logical plan (DAG)** of all transformations and only executes it when an **action** (like `show`, `count`, `collect`, `write`) is called.

---

## ❓ Give an example of a transformation and an action
- **Transformation (lazy):** `df.filter(df.id % 2 == 0)` only adds a step to the DAG (no computation yet).  
- **Action:** `df.count()` triggers Spark to optimize the DAG, run tasks, and produce results.

---

## ❓ Why does Spark use lazy evaluation?
- 🧠 **Global optimization** – Spark can optimize the whole job before running it.  
- 🛑 **Avoids unnecessary work** – skips computations for unused transformations.  
- 🔄 **Fault tolerance** – DAG lineage lets Spark recompute only missing data if a node fails.

---

## ❓ How does lazy evaluation improve performance?
- 🔗 **Pipelining:** Fuses consecutive narrow transformations into a single stage, so each partition is processed in one pass.  
- 🎯 **Predicate pushdown:** Moves filters as close to the data source as possible, shrinking data before shuffles.  
- 📦 **Minimized shuffles:** Reduces costly network transfers.  
- ⚡ **Efficiency:** Computes only what’s required, when it’s required.


# 🔀 Narrow vs Wide Transformations in Spark

Understanding narrow vs wide transformations is crucial for interviews — it explains how Spark decides **when to create a new stage** in the DAG.

---

## 🔹 Narrow Transformations
- Each input partition contributes to **at most one output partition**.
- No data needs to move across the cluster.
- **No shuffle** occurs.
- Examples:
  - `map()`, `filter()`, `select()`, `withColumn()`, `union()`, `sample()`

**Effect:** These operations can be **pipelined** together in the same stage → faster execution.

---

## 🔸 Wide Transformations
- Input partitions contribute to **multiple output partitions**.
- **Data must be shuffled** across the network (expensive).
- Examples:
  - `groupBy()`, `reduceByKey()`, `join()`, `distinct()`, `orderBy()` (a broadcast join is the exception: it avoids the shuffle)

**Effect:** Spark starts a **new stage** at each wide transformation because data has to be redistributed across executors.
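
The difference can be made concrete with a toy model of partitions in plain Python. This is a sketch under stated assumptions, not Spark's partitioner or shuffle: the three input partitions, the key `x % 3`, and the `key % num_out` placement rule are all hypothetical stand-ins.

```python
# Toy model of partitions in plain Python (NOT Spark's real partitioner or shuffle).
partitions = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Narrow: each input partition is transformed on its own into one output partition.
narrow = [[x * 2 for x in part if x % 2 == 0] for part in partitions]

# Wide: records are regrouped by key, so rows leave their original partition.
num_out = 2  # assumed number of output partitions
shuffled = [[] for _ in range(num_out)]
destinations = []  # for each input partition, the output partitions it feeds

for part in partitions:
    targets = set()
    for x in part:
        key = x % 3             # hypothetical grouping key
        target = key % num_out  # hypothetical stand-in for hash partitioning
        shuffled[target].append((key, x))
        targets.add(target)
    destinations.append(sorted(targets))

print(narrow)        # [[0, 4], [8, 12], [16, 20]]
print(destinations)  # [[0, 1], [0, 1], [0, 1]]
```

In the narrow case the output keeps the same three partitions and no row moves. In the wide case every input partition sends rows to both output partitions, which is the all-to-all data movement that makes a shuffle expensive.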

---

## 🧠 Why This Matters
- Wide transformations are **more expensive** (network + disk I/O).
- Optimizing pipelines means:
  - Filter early (reduce data before shuffle).
  - Repartition intelligently to control shuffle size.
  - Cache if needed to avoid recomputation after expensive shuffles.

---

## 🎯 Typical Interview Questions
- What is the difference between narrow and wide transformations?
- Which type causes a shuffle and why?
- Give examples of narrow and wide transformations.
- Why does Spark create a new stage after a wide transformation?


In [5]:
from pyspark.sql import functions as F

# Sample DataFrame
df_demo = spark.range(0, 20)

# Narrow transformations (no shuffle)
df_narrow = df_demo.filter(df_demo.id % 2 == 0) \
                   .withColumn("double_val", df_demo.id * 2)

# Wide transformation (shuffle required)
df_wide = df_narrow.groupBy("double_val").count()

# Trigger execution
print("=== Narrow transformations output ===")
df_narrow.show()

print("\n=== Wide transformation output (shuffle happens here) ===")
df_wide.show()

# Tip: print(spark.sparkContext.uiWebUrl) to get the Spark UI address, then open
# the DAG visualization to confirm that groupBy caused a shuffle (new stage).


=== Narrow transformations output ===
+---+----------+
| id|double_val|
+---+----------+
|  0|         0|
|  2|         4|
|  4|         8|
|  6|        12|
|  8|        16|
| 10|        20|
| 12|        24|
| 14|        28|
| 16|        32|
| 18|        36|
+---+----------+


=== Wide transformation output (shuffle happens here) ===
+----------+-----+
|double_val|count|
+----------+-----+
|         0|    1|
|         8|    1|
|         4|    1|
|        12|    1|
|        16|    1|
|        20|    1|
|        28|    1|
|        24|    1|
|        32|    1|
|        36|    1|
+----------+-----+



# 🧩 Jobs, Stages, and Tasks in Spark

Understanding **jobs → stages → tasks** is key to explaining Spark's execution model.

---

## 🏗️ Execution Hierarchy

1️⃣ **Job**
- A job is triggered by an **action** (e.g., `count()`, `collect()`, `show()`).
- One Spark application can have many jobs.
- Example: Calling `.count()` triggers a job.

2️⃣ **Stage**
- A job is divided into **stages** based on **shuffle boundaries**.
- Each stage contains **narrow transformations** that can be pipelined together.
- A new stage starts when Spark needs to **shuffle data** (e.g., `groupBy`, `join`).

3️⃣ **Task**
- The smallest unit of work in Spark.
- Each task runs on **one partition** of the data.
- Number of tasks = number of partitions in the stage.

---

## 🔗 Relationship

**Action ➜ Job ➜ Stages ➜ Tasks**

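Example flow for a `groupBy` aggregation followed by an action such as `show()`:
- The action → creates **1 Job** (with adaptive query execution enabled, the Spark UI may list more than one)
- Spark breaks it into **2 Stages** (before and after the shuffle)
- Each stage runs multiple **Tasks** in parallel (one per partition)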
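
The hierarchy above can be sketched as a small stage-splitting function in plain Python. This is an illustration, not Spark's scheduler: `split_into_stages`, the `WIDE_OPS` set, and the partition counts are assumptions (8 input partitions is arbitrary; 200 is the default value of `spark.sql.shuffle.partitions`, which adaptive query execution may coalesce to fewer).

```python
# Toy model of stage splitting (an illustration, NOT Spark's scheduler).
WIDE_OPS = {"groupBy", "join", "distinct", "orderBy"}


def split_into_stages(ops):
    """Cut a linear list of operations into stages at each wide operation."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS:
            stages.append([])  # the shuffle ends one stage and starts the next
        stages[-1].append(op)
    return stages


plan = ["range", "withColumn", "groupBy", "agg"]
stages = split_into_stages(plan)

# Assumed partition counts: 8 input partitions, 200 shuffle partitions.
partitions_per_stage = [8, 200]
total_tasks = sum(partitions_per_stage)  # one task per partition per stage

print(stages)       # [['range', 'withColumn'], ['groupBy', 'agg']]
print(total_tasks)  # 208
```

The model simplifies one detail: in real Spark, part of the aggregation (the partial aggregate) runs in the first stage before the shuffle. The task arithmetic still holds, since each stage runs one task per partition.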

---

In [6]:
from pyspark.sql import functions as F

# Sample DataFrame
df_jobs = spark.range(0, 100).withColumn("key", F.col("id") % 10)

# Trigger an action with a shuffle (groupBy causes a wide transformation)
grouped = df_jobs.groupBy("key").agg(F.count("*").alias("cnt"))

# Action: triggers a job with multiple stages & tasks
grouped.show()

print("\n✅ Check Spark UI (driver host printed earlier) → you'll see:")
print("- 1 job created for the show() action")
print("- 2 stages (before and after shuffle)")
print("- Multiple tasks per stage (1 per partition)")


+---+---+
|key|cnt|
+---+---+
|  0| 10|
|  7| 10|
|  6| 10|
|  9| 10|
|  5| 10|
|  1| 10|
|  3| 10|
|  8| 10|
|  2| 10|
|  4| 10|
+---+---+


✅ Check Spark UI (driver host printed earlier) → you'll see:
- 1 job created for the show() action
- 2 stages (before and after shuffle)
- Multiple tasks per stage (1 per partition)
