## 📑 Table of Contents
- [Spark DataFrame Partitions](#spark-dataframe-partitions)
- [Spark Transformations and Dependencies](#spark-transformations--dependencies)
    - [Narrow Dependency](#2-narrow-dependency-transformation)
    - [Wide Dependency](#3-wide-dependency-transformation) 
- [Spark Execution Plan](#spark-execution-plan--jobs-stages-and-tasks)
    - [How Spark code executes (jobs, stages, and tasks)](#how-this-spark-code-executes)


# Spark DataFrame Partitions

## 1. Data Stored in Distributed Systems
- In practice, files (CSV, JSON, etc.) are stored in **distributed storage** such as HDFS or Amazon S3.  
- A large file is stored in pieces (blocks in HDFS) spread across nodes, and Spark reads it as a set of **partitions**.  
  - Example: 100 partitions across 10 nodes.  
- This splitting makes **parallel reading** possible.

---

## 2. Logical DataFrame
- When you call `spark.read.csv("file.csv")`, Spark creates a **logical DataFrame**:
  - Stores **metadata**: schema, partition info, and how to read the data.
  - **No data is loaded yet** (lazy evaluation).  
- Think of it like a **recipe**: instructions exist, but the meal (data) isn’t cooked until you trigger an action.
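
The recipe idea can be sketched in plain Python. This is a toy stand-in, not the Spark API: the `LazyFrame` class, `read_source`, and the sample rows are all hypothetical names made up for illustration. It only shows that transformations record steps, and that the source is read once an action runs.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not the Spark API)."""

    def __init__(self, source, steps=()):
        self.source = source        # callable that produces rows when invoked
        self.steps = tuple(steps)   # recorded transformations, not yet executed

    def filter(self, predicate):
        # Record the step and return a NEW object; nothing is read here.
        return LazyFrame(self.source, self.steps + (("filter", predicate),))

    def select(self, *columns):
        # Same pattern: extend the recipe, do no work.
        return LazyFrame(self.source, self.steps + (("select", columns),))

    def collect(self):
        # The action: only now is the source read and the recipe applied.
        rows = self.source()
        for kind, arg in self.steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


reads = []  # records each time the source is actually read


def read_source():
    reads.append("read")
    return [{"name": "a", "age": 25}, {"name": "b", "age": 52}]


plan = LazyFrame(read_source).filter(lambda r: r["age"] < 40).select("name")
reads_before_action = len(reads)  # still 0: only the recipe exists
result = plan.collect()           # the action triggers the read
```

Until `collect()` is called, `reads` stays empty, which mirrors "no data is loaded yet".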

---

## 3. Role of the Driver (SparkSession)
- The driver (your `spark` session) is the **brain**:
  - Contacts cluster manager + storage.
  - Collects info about partitions.
  - Creates the **plan** to process them.
- Still, the driver does not load data itself.

---

## 4. Executors (the workers)
- Executors are **JVM processes** launched by the cluster manager.  
- They do the **real work**:
  - Load their assigned partitions into memory.
  - Run tasks (filter, join, aggregate, etc.).  
- Each executor has one or more **cores**, and each core runs one task, that is, processes **one partition at a time**.

---

## 5. Answer to the Doubt ✅
- If you have **5 executors** and **5 cores per executor**:  
  - Total = `5 × 5 = 25 cores`.  
- That means **25 partitions can be processed in parallel**.  
- ⚠️ Notes:
  - If partitions < 25, some cores stay idle.  
  - If partitions > 25, Spark processes them in **waves** (25 at a time).  
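
The arithmetic above can be written out as a small sketch. It assumes every task uses one core (Spark's default for `spark.task.cpus`) and that all tasks take about the same time; the function names are hypothetical.

```python
import math


def parallel_slots(executors, cores_per_executor):
    # One task slot per core, assuming one core per task.
    return executors * cores_per_executor


def waves_needed(partitions, slots):
    # Each wave processes at most `slots` partitions.
    return math.ceil(partitions / slots)


slots = parallel_slots(5, 5)  # 5 executors x 5 cores

for partitions in (10, 25, 100, 110):
    idle_in_first_wave = max(slots - partitions, 0)
    print(partitions, waves_needed(partitions, slots), idle_in_first_wave)
```

With 110 partitions the last wave runs only 10 tasks, so 15 cores sit idle while it finishes. That is why partition counts that are a multiple of the slot count tend to use the cluster more evenly.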

---

## 6. Data Locality Optimization
- Spark tries to assign each partition to an executor **close to where the data is stored**.  
- This reduces network traffic and speeds up jobs.  
- If not possible, Spark still works, but with some data transfer over the network.

---

## 7. Final Picture
A **distributed DataFrame** is created:
- Driver manages the plan.  
- Executors (with multiple cores) load and process partitions in parallel.  
- Together, they form a **scalable system** for big data.

---

## 📊 Diagram
![Spark DataFrame Partitions](./images/driver_execution.png)

# Spark Transformations & Dependencies

## 1. What are Transformations?
- Spark DataFrames are **immutable**.  
- To process data, you don’t “modify” a DataFrame. Instead, you apply **transformations**:
  - Examples: `select()`, `filter()`, `groupBy()`, `orderBy()`.
- Each transformation produces a **new DataFrame** (logically), building a chain of operations.
- Together, transformations form a **DAG (Directed Acyclic Graph)** of operations.

📊 Example:  
![Spark Transformations](./images/transformations.png)

---

## 2. Narrow Dependency Transformation
- A transformation is **narrow** when each partition can be processed **independently**, without needing data from other partitions.
- Executors can process partitions locally → results are later combined.  
- These are usually **map-style operations**.

### 🔹 Examples of Narrow Dependencies:
- `select("col1", "col2")` → choosing columns.  
- `filter(df.col > 10)` / `where(df.col < 40)` → row filtering.  
- `withColumn("newCol", df.col * 2)` → column-level transformation.  
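- `map()` / `flatMap()` (RDD API) → functional transformations.  
- The per-partition counting step of `groupBy().count()` (after the shuffle is done) → counts within a partition. Note that `df.count()` on its own is an action, not a transformation.  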

📊 Visualization:  
![Narrow Dependency](./images/narrow_dependency.png)
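
A minimal sketch in plain Python (not Spark) of what "narrow" means: each output partition is computed from exactly one input partition. The partition contents and the helper name `narrow_filter_select` are made up for illustration.

```python
# Three input partitions, as they might sit on three different executors.
partitions = [
    [{"Age": 25, "Country": "US"}, {"Age": 61, "Country": "IN"}],
    [{"Age": 33, "Country": "IN"}],
    [{"Age": 47, "Country": "UK"}, {"Age": 19, "Country": "US"}],
]


def narrow_filter_select(partition):
    # Uses only rows from this one partition: filter, then keep one column.
    return [{"Country": row["Country"]} for row in partition if row["Age"] < 40]


# Each partition is handled on its own; no rows move between partitions.
result = [narrow_filter_select(p) for p in partitions]
```

The number of partitions is unchanged and no row needed to see data from another partition, so the work can run fully in parallel.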

---

## 3. Wide Dependency Transformation
- A transformation is **wide** when data from **multiple partitions must be shuffled** to produce a correct result.
- Spark performs a **shuffle/sort exchange** → redistributes data across partitions.  
- These are usually **reduce-style operations**.

### 🔹 Examples of Wide Dependencies:
- `groupBy("col").count()` → needs all rows of a group together.  
- `orderBy("col")` / `sort()` → needs global ordering.  
- `distinct()` → identical rows must be brought into the same partition before duplicates can be dropped.  
- `df1.join(df2, "col")` → rows with the same key must meet (a broadcast join of a small table can avoid the shuffle).  
- `reduceByKey()` / `aggregateByKey()` (in RDDs).  

📊 Problem (before shuffle):  
![Wide Dependency Problem](./images/wide_dependecy_prob.png)

📊 Solution (after shuffle + repartition):  
![Wide Dependency Solution](./images/wide_denpendency_sol.png)
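
The shuffle can be imitated in plain Python. This is a sketch, not Spark's implementation: `zlib.crc32` stands in for Spark's own hash function, and the helper names and sample keys are hypothetical. It shows rows being redistributed so that every key ends up in exactly one partition.

```python
import zlib
from collections import Counter


def target_partition(key, num_partitions):
    # Deterministic hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    # Every input partition may send rows to every output partition.
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key in partition:
            shuffled[target_partition(key, num_partitions)].append(key)
    return shuffled


# Before the shuffle, each country is scattered over several partitions.
input_partitions = [["US", "IN", "US"], ["IN", "UK"], ["US", "UK", "IN"]]

shuffled = shuffle_by_key(input_partitions, num_partitions=2)

# After the shuffle, counting inside each partition gives the correct totals.
counts = [Counter(p) for p in shuffled]
```

Counting before the shuffle would give several partial answers per country; after it, each country's rows live together, so a per-partition count is already final.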

---

## 4. ❓ Your Doubt — What if a Group is Too Big for One Partition?
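- Spark has **no fixed size** for shuffle partitions. The familiar **128 MB** figure is the default of `spark.sql.files.maxPartitionBytes`, which limits *input* partitions when reading files; the *number* of shuffle partitions comes from `spark.sql.shuffle.partitions` (default 200).  
- After `groupBy()`, Spark ensures all rows of the same group key land in the **same partition**.  
- But what if one group is **far larger** than the others, say well over 128 MB?

### ✅ The Answer:
1. **Partition size is not a hard cap**  
   - A shuffle partition simply grows to hold whatever rows hash to it.  
   - The big group is stored in one partition, even if it exceeds the “ideal” size.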

2. **Aggregation correctness is preserved**  
   - Example: the post-shuffle `count()` for a key runs within the single partition that holds that key.  
   - So this final counting step needs no further data movement; the **wide** dependency is the shuffle introduced by `groupBy()`.

3. **Performance concerns (data skew)**  
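   - One task may get a huge partition → it becomes a straggler and slows the whole stage.  
   - Spark can spill to disk when execution memory runs short, which reduces (but does not eliminate) the risk of out-of-memory failures.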

4. **Optimizations in practice**  
   - Increase shuffle partitions (`spark.sql.shuffle.partitions`).  
   - Apply *salting keys* to break very large groups into subgroups.  
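   - Use Spark’s **skew join optimization** from Adaptive Query Execution (Spark 3+, `spark.sql.adaptive.skewJoin.enabled`); note that it targets joins, not `groupBy` aggregations.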

👉 So: correctness is always maintained, but performance may degrade if data skew is high.
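
Salting can be illustrated with a plain-Python sketch (not Spark code). The function name `salted_count`, the number of salts, and the sample data are assumptions chosen for the example. The idea is to count by `(key, salt)` first, so one huge group is split across several partitions, and then sum the partial counts per key.

```python
import random
from collections import Counter


def salted_count(keys, num_salts, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Step 1: attach a random salt and count per (key, salt) subgroup.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Step 2: drop the salt and add up the partial counts per key.
    final = Counter()
    for (key, _salt), n in partial.items():
        final[key] += n
    return partial, final


# A skewed dataset: one key dominates.
keys = ["US"] * 1000 + ["IN"] * 10 + ["UK"] * 5

partial, final = salted_count(keys, num_salts=4)
largest_subgroup = max(partial.values())
```

The final counts match an unsalted count, but the largest unit of work is now a fraction of the 1000-row group. The cost is a second aggregation step, so salting pays off only when skew is severe.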

---

# 📝 Summary
- **Narrow dependency**: map-style operations → `select`, `filter`, `withColumn`, `map`, and the per-partition `count` that follows a `groupBy` shuffle.  
- **Wide dependency**: reduce-style operations → `groupBy`, `orderBy`, `join`, `distinct`, `reduceByKey`.  
- **Large group in wide dependency**: Spark allows big partitions → results are correct, but skew can slow down jobs.

# Spark Execution Plan — Jobs, Stages, and Tasks

![Spark Execution Plan](./images/spark_execution_plan_.png)

## Why an “execution plan”?
Spark builds a **logical DAG** from your transformations (lazy).  
When you run an **action** (e.g., `show()`, `count()`, `collect()`, `write`), Spark **materializes** that DAG into:
1) **Jobs** → split into  
2) **Stages** (cut at shuffles) → executed as many  
3) **Tasks** (one per partition in that stage)

---

## Quick definitions
- **Job**: Triggered by an **action**. If your code has 3 actions, you’ll see **at least 3 jobs**; some operations (such as schema inference during `read`) run extra internal jobs.
- **Stage**: A block of work that can run **without a shuffle** (only **narrow** deps). Stages are split at **wide** deps (e.g., `groupBy`, `join`, `orderBy`, `distinct`, `repartition`).
- **Task**: A per-partition execution of a stage on an executor core.  
  **#Tasks in a stage ≈ #Partitions for that stage** (e.g., default `spark.sql.shuffle.partitions = 200`).
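
The rule "stages are cut at wide dependencies" can be sketched as a tiny planner in plain Python. This is a simplification under stated assumptions: the `WIDE_OPS` set and the function `split_into_stages` are hypothetical, and real Spark plans can differ (for example, a broadcast join needs no shuffle).

```python
# Operations treated as wide (shuffle-producing) in this simplified model.
WIDE_OPS = {"groupBy", "join", "orderBy", "distinct", "repartition"}


def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE_OPS:
            # The shuffle write ends the current stage.
            stages.append(current)
            current = []
    # Whatever follows the last shuffle forms the final stage.
    stages.append(current)
    return stages


ops = ["read", "repartition", "where", "select", "groupBy", "count"]
stages = split_into_stages(ops)
num_shuffles = sum(1 for op in ops if op in WIDE_OPS)
```

In this model the number of stages is always the number of shuffles plus one, which is a useful first guess when reading the Spark UI.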

---

## From code → plan (example)

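    from pyspark.sql import SparkSession

    # Assumes a local or cluster Spark install; the app name is arbitrary.
    spark = SparkSession.builder.appName("count-by-country").getOrCreate()

    def count_by_country(df):
        return (
            df.where("Age < 40")                        # narrow
              .select("Age","Gender","Country","state") # narrow
              .groupBy("Country")                       # wide → shuffle
              .count()                                  # aggregate after shuffle
        )

    # `path` must point to the survey CSV file (not specified here).
    survey_df   = spark.read.csv(path, header=True, inferSchema=True)
    partitioned = survey_df.repartition(2)              # wide (shuffle); runs only when an action triggers
    result_df   = count_by_country(partitioned)

    result = result_df.collect()                        # ACTION → creates a Job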

**What Spark does:**
- **Job A** (created by `.collect()`):
  - **First stage (repartition)**: scan the input → **shuffle write** for `repartition(2)`.
  - **Second stage (map side, narrow)**: shuffle-read the 2 partitions → `where` + `select` → partial `count` per partition → **shuffle write** (because of `groupBy("Country")`). The optimizer may move the filter earlier, into the scan stage.
  - **Third stage (reduce side)**: shuffle-read the partial counts for each key; run the final aggregate (`count`) → produces grouped result partitions.
  - **Collect** pulls final partitions to driver (no extra shuffle).

If you later do `result_df.write.parquet(...)`, that is **another action → another Job**, with its own stages.

---

## Narrow vs Wide in the plan
- **Narrow ops** (stay within a stage; no shuffle):  
  `select`, `filter/where`, `withColumn`, map-style ops, and post-shuffle per-partition `count`.
- **Wide ops** (cause shuffle → new stage):  
  `groupBy`/aggregations, `join`, `orderBy/sort`, `distinct`, `repartition`, `reduceByKey`/`aggregateByKey` (RDD).

---

## How tasks run
- Each stage spawns **one task per partition**.
- Tasks run in parallel across **executor cores**.  
  Example: 5 executors × 5 cores = **25 concurrent tasks** max.  
- If a stage has 100 partitions, Spark runs them in **waves** (25 at a time until 100 complete).



# How This Spark Code Executes 

![Execution Plan (Jobs)](./images/ex_plan.png)

## 1) Why are Jobs created during `read`?
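- `spark.read.csv(...)` is usually **lazy**, but certain options force Spark to **touch data**:
  - `header=True` → Spark must read the first line to get the column names.
  - `inferSchema=True` → Spark must scan the data to determine column types (the whole input by default; the `samplingRatio` option can reduce this).
- This triggers internal work that shows up as **Jobs** in the UI, typically:
  - **Job 0**: a small initial read (file listing / header line)
  - **Job 1**: schema inference scan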
> You didn’t call an action yet, but these options require data inspection, so Spark runs small internal jobs.
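
Why inference has to read data can be seen in a plain-Python sketch. This is not Spark's inference algorithm: `infer_type`, `infer_schema`, and the sample CSV text are hypothetical, and the type names are only labels. It shows that a column's type cannot be known without looking at its values.

```python
import csv
import io


def infer_type(values):
    # Try the narrowest type first and widen on failure.
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)   # reading the header already touches the file
    rows = list(reader)     # inference needs the data rows as well
    columns = zip(*rows)    # turn rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


sample = "Age,Country,Score\n25,US,3.5\n61,IN,4\n"
schema = infer_schema(sample)
```

In Spark, passing an explicit `schema=` to `spark.read.csv(...)` removes the need for the inference scan, which is a common choice for large inputs.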

---

## 2) Why `repartition(2)` appears first in the collect-job
- `repartition(2)` is a **wide transformation** (requires **shuffle**) that sets the target partition count to **2**.
- It’s **lazy** when written, but at the next action Spark inserts a **stage boundary** here:
  - Stage for **repartition → shuffle write** so later stages run with 2 partitions.
- Reasons to do this in demos:
  - Make parallelism explicit/controlled (2 tasks instead of many).
  - Keep the UI simple to read.

---

## 3) Narrow ops before `groupBy`
- `where("Age < 40")` (an alias of `filter`) and `select("Age","Gender","Country","state")` are **narrow** transformations.
- Narrow ops run **within the same stage** (no shuffle).

---

## 4) How `groupBy().count()` is planned
- `groupBy("Country")` is **wide**: all rows of the same key must meet → **shuffle** → **new stage**.
- `.count()` **after `groupBy` is a transformation** (aggregation), **not an action**.
- So aggregation happens **inside the post-shuffle stage**, and only the final **`collect()`** is the action.
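
The aggregation is usually done in two phases: a partial count inside each pre-shuffle partition, then a final sum after the shuffle. A plain-Python sketch of that idea follows; the sample partitions are made up, and `Counter` stands in for Spark's aggregation operators.

```python
from collections import Counter

# Two partitions of "Country" values, as they exist before the groupBy shuffle.
stage_input = [["US", "IN", "US", "US"], ["IN", "UK", "US"]]

# Map side: each partition counts its own rows (no data movement yet).
partial_counts = [Counter(p) for p in stage_input]

# Reduce side: after the shuffle, partial counts for the same key are summed.
final_counts = Counter()
for partial in partial_counts:
    final_counts.update(partial)

# Only one record per (partition, key) crosses the network, not every row.
shuffled_records = sum(len(p) for p in partial_counts)
raw_rows = sum(len(p) for p in stage_input)
```

Here 5 partial records are shuffled instead of 7 raw rows; on real data with few distinct keys the saving is far larger.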

---

![Stage Breakdown](./images/stage.png)

### Putting it together (the collect job)
1. **Stage 2 — Repartition**
   - Scan the CSV → do `repartition(2)` → **shuffle write to Exchange**.
2. **Stage 3 — Narrow ops + group keying**
   - **Read Exchange** → `where` → `select` → partial `count` per partition for the `groupBy` → **shuffle write**.
3. **Stage 4 — Aggregation + Collect**
   - **Read Exchange** → per-key `count()` (aggregation) → driver **`collect()`**.

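**Tasks per stage:** ≈ **# partitions** for that stage. Here Stage 2 reads 1 input partition (1 task), and Stage 3 reads the 2 partitions produced by `repartition(2)` (2 tasks). Stage 4 has one task per shuffle partition, so it shows **2 tasks** only when `spark.sql.shuffle.partitions` is set to 2 (or adaptive execution coalesces the partitions); with the default of 200 it would show many more.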
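
The task counts for the collect job can be tabulated with a small sketch. The function `tasks_per_stage` and its stage labels are hypothetical, and the model ignores adaptive execution, which may merge small shuffle partitions at runtime.

```python
def tasks_per_stage(input_partitions, repartition_to, shuffle_partitions):
    # One task per partition that the stage reads.
    return {
        "repartition stage": input_partitions,
        "narrow ops + partial aggregation": repartition_to,
        "final aggregation": shuffle_partitions,
    }


# Demo setup: 1 input partition, repartition(2), shuffle partitions set to 2.
demo = tasks_per_stage(1, 2, 2)

# Same code with the default spark.sql.shuffle.partitions of 200.
default = tasks_per_stage(1, 2, 200)

for name in demo:
    print(name, demo[name], default[name])
```

The only difference between the two runs is the last stage, which is why demos often lower `spark.sql.shuffle.partitions` to keep the UI readable.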

---

## TL;DR
- Jobs during `read` appear because `inferSchema` forces scans (internal actions).
- `repartition(2)` is a shuffle point; Spark inserts it as the **first stage** of the collect job to set parallelism.
- `groupBy` causes another shuffle (new stage); `count()` here is an **aggregation transformation**, not an action.
- The **action** is `collect()` → it triggers the entire **Job** made of those **Stages**, each executed as many **Tasks** as partitions.