## 📑 Table of Contents
- [Spark DataFrame Partitions](#spark-dataframe-partitions)
- [Spark Transformations & Dependencies](#spark-transformations--dependencies)
    - [Narrow Dependency](#2-narrow-dependency-transformation)
    - [Wide Dependency](#3-wide-dependency-transformation)

# Spark DataFrame Partitions

## 1. Data Stored in Distributed Systems
- In practice, files (CSV, JSON, etc.) live in **distributed storage** such as HDFS or Amazon S3.  
- A large file is **split into partitions** that are spread across nodes.  
  - Example: 100 partitions across 10 nodes.  
- This splitting makes **parallel reading** possible.
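
The arithmetic behind "100 partitions across 10 nodes" can be sketched in plain Python. This is a toy model, not Spark or HDFS code: the 12,800 MB file size, the 128 MB split size, the node names, and the round-robin placement are all assumptions chosen for illustration (real storage systems place and replicate blocks with their own policies).

```python
import math

def plan_input_partitions(file_size_mb, split_size_mb, nodes):
    """Cut a file into fixed-size splits and spread them round-robin over nodes."""
    num_partitions = math.ceil(file_size_mb / split_size_mb)
    placement = {node: [] for node in nodes}
    for pid in range(num_partitions):
        # Hypothetical placement rule: partition i goes to node i mod N.
        placement[nodes[pid % len(nodes)]].append(pid)
    return placement

nodes = [f"node-{i}" for i in range(10)]  # hypothetical node names
placement = plan_input_partitions(file_size_mb=12_800, split_size_mb=128, nodes=nodes)

total_partitions = sum(len(pids) for pids in placement.values())
print(total_partitions)           # number of partitions for the whole file
print(len(placement["node-0"]))   # partitions stored on one node
```

Because every node holds its own share of the splits, each node can read its share at the same time as the others.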

---

## 2. Logical DataFrame
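- When you call `spark.read.csv("file.csv")`, Spark creates a **logical DataFrame**:
  - Stores **metadata**: the schema, partition info, and how to read each partition.
  - **No rows are loaded yet** (lazy evaluation). Spark may still touch the file to work out the schema, for example when `inferSchema` is enabled.  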
- Think of it like a **recipe**: instructions exist, but the meal (data) isn’t cooked until you trigger an action.
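
The recipe idea can be modeled with a small class that records steps and runs them only when an action is called. `LazyFrame`, `read_rows`, and the sample rows are hypothetical names invented for this sketch; they are not part of the Spark API, which builds a far richer logical plan.

```python
class LazyFrame:
    """Toy stand-in for a logical DataFrame: it records steps instead of running them."""

    def __init__(self, source, steps=()):
        self.source = source        # callable that returns rows when finally invoked
        self.steps = tuple(steps)   # the recorded transformations (the "recipe")

    def filter(self, predicate):
        # Returns a new object; nothing is read or computed here.
        return LazyFrame(self.source, self.steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self.source, self.steps + (("select", columns),))

    def collect(self):
        # The action: only now is the source read and every step applied.
        rows = self.source()
        for kind, arg in self.steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)

reads = []

def read_rows():
    reads.append("read")  # records each time the "file" is actually touched
    return [{"id": 1, "age": 25}, {"id": 2, "age": 41}]

df = LazyFrame(read_rows).filter(lambda r: r["age"] > 30).select("id")
reads_before_action = len(reads)  # still 0: only the plan exists so far
result = df.collect()             # triggers the read and the two steps
print(reads_before_action, result)
```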

---

## 3. Role of the Driver (SparkSession)
- The driver (the process that hosts your `spark` session) is the **brain**:
  - Contacts cluster manager + storage.
  - Collects info about partitions.
  - Creates the **plan** to process them.
- The driver does not load the data itself; rows reach it only when an action such as `collect()` brings results back.

---

## 4. Executors (the workers)
- Executors are **JVM processes** launched by the cluster manager.  
- They do the **real work**:
  - Load their assigned partitions into memory.
  - Run tasks (filter, join, aggregate, etc.).  
- Each executor has multiple **cores**, and each core runs **one task (one partition) at a time**.

---

## 5. Answer to the Doubt ✅
- If you have **5 executors** and **5 cores per executor**:  
  - Total = `5 × 5 = 25 cores`.  
- That means **25 partitions can be processed in parallel**.  
- ⚠️ Notes:
  - If partitions < 25, some cores stay idle.  
  - If partitions > 25, Spark processes them in **waves** (25 at a time).  
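
The core and wave arithmetic above can be checked with a few lines of Python. This is a simplified model under two assumptions: every task takes one core, and all tasks take the same time. In a real cluster, a new task starts as soon as any core frees up, so "waves" are an approximation. The function name `parallelism_report` is made up for this sketch.

```python
import math

def parallelism_report(executors, cores_per_executor, partitions):
    """Return task slots, number of waves, and idle slots in the last wave."""
    slots = executors * cores_per_executor
    waves = math.ceil(partitions / slots)
    # Slots left without work in the final (possibly partial) wave.
    idle_in_last_wave = (slots - partitions % slots) % slots
    return {"slots": slots, "waves": waves, "idle_in_last_wave": idle_in_last_wave}

print(parallelism_report(5, 5, 20))   # fewer partitions than cores
print(parallelism_report(5, 5, 100))  # more partitions than cores
```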

---

## 6. Data Locality Optimization
- Spark tries to assign each partition to an executor **close to where the data is stored**.  
- This reduces network traffic and speeds up jobs.  
- If not possible, Spark still works, but with some data transfer over the network.
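
A rough sketch of the locality preference, in plain Python. This is not how Spark's scheduler is implemented (Spark works with several locality levels and waiting timeouts); the function, host names, and executor names are hypothetical and only show the "prefer local, fall back to remote" idea.

```python
def assign_with_locality(partition_hosts, executor_hosts):
    """partition_hosts: {partition_id: set of hosts storing that partition}
    executor_hosts:  {executor_id: host the executor runs on}"""
    assignment = {}
    load = {executor: 0 for executor in executor_hosts}
    for pid, hosts in sorted(partition_hosts.items()):
        # Executors that sit on a host holding this partition's data.
        local = [e for e, h in executor_hosts.items() if h in hosts]
        candidates = local or list(executor_hosts)  # fall back to any executor
        chosen = min(candidates, key=lambda e: (load[e], e))  # least loaded first
        load[chosen] += 1
        assignment[pid] = (chosen, "local" if local else "remote")
    return assignment

executor_hosts = {"exec-1": "host-a", "exec-2": "host-b"}
partition_hosts = {0: {"host-a"}, 1: {"host-b"}, 2: {"host-c"}}
assignment = assign_with_locality(partition_hosts, executor_hosts)
print(assignment)
```

Partition 2 lives on a host with no executor, so it is still processed, just with its data pulled over the network.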

---

## 7. Final Picture
A **distributed DataFrame** is created:
- Driver manages the plan.  
- Executors (with multiple cores) load and process partitions in parallel.  
- Together, they form a **scalable system** for big data.

---

## 📊 Diagram
![Spark DataFrame Partitions](./images/driver_execution.png)

# Spark Transformations & Dependencies

## 1. What are Transformations?
- Spark DataFrames are **immutable**.  
- To process data, you don’t “modify” a DataFrame. Instead, you apply **transformations**:
  - Examples: `select()`, `filter()`, `groupBy()`, `orderBy()`.
- Each transformation produces a **new DataFrame** (logically), building a chain of operations.
- Together, transformations form a **DAG (Directed Acyclic Graph)** of operations.

📊 Example:  
![Spark Transformations](./images/transformations.png)

---

## 2. Narrow Dependency Transformation
- A transformation is **narrow** when each partition can be processed **independently**, without needing data from other partitions.
- Executors can process partitions locally → results are later combined.  
- These are usually **map-style operations**.

### 🔹 Examples of Narrow Dependencies:
- `select("col1", "col2")` → choosing columns.  
- `filter(df.col > 10)` / `where(df.col < 40)` → row filtering.  
- `withColumn("newCol", df.col * 2)` → column-level transformation.  
- `map()` / `flatMap()` → functional transformations.  
- `count()` in `groupBy(...).count()`, once the shuffle stage is done → each partition counts its own groups.  

📊 Visualization:  
![Narrow Dependency](./images/narrow_dependency.png)
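
To make "processed independently" concrete, here is a plain-Python model in which each inner list stands for one partition. The sample rows and the helper name are made up; the point is only that the function receives a single partition and never looks at the others.

```python
partitions = [
    [{"id": 1, "age": 25}, {"id": 2, "age": 41}],
    [{"id": 3, "age": 38}],
    [{"id": 4, "age": 52}, {"id": 5, "age": 19}],
]

def narrow_filter_and_derive(partition):
    # Uses only rows from this partition: like filter(...) followed by withColumn(...).
    return [
        {**row, "age_doubled": row["age"] * 2}
        for row in partition
        if row["age"] < 40
    ]

# Each partition maps to exactly one output partition; no rows move between them.
result = [narrow_filter_and_derive(p) for p in partitions]
print(result)
```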

---

## 3. Wide Dependency Transformation
- A transformation is **wide** when data from **multiple partitions must be shuffled** to produce a correct result.
- Spark performs a **shuffle/sort exchange** → redistributes data across partitions.  
- These are usually **reduce-style operations**.

### 🔹 Examples of Wide Dependencies:
- `groupBy("col").count()` → needs all rows of a group together.  
- `orderBy("col")` / `sort()` → needs global ordering.  
- `distinct()` → duplicates may sit in different partitions, so equal rows must be shuffled together.  
- `df1.join(df2, "col")` → rows with the same key must meet.  
- `reduceByKey()` / `aggregateByKey()` (in RDDs).  

📊 Problem (before shuffle):  
![Wide Dependency Problem](./images/wide_dependecy_prob.png)

📊 Solution (after shuffle + repartition):  
![Wide Dependency Solution](./images/wide_denpendency_sol.png)
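
The shuffle step can be sketched as hash partitioning in plain Python. Assumptions: `zlib.crc32` stands in for Spark's own hash function, the country codes are sample data, and the helper names are hypothetical. The sketch shows why a `groupBy(...).count()` is correct only after the shuffle: before it, rows of one key are scattered; after it, each key lives in exactly one partition.

```python
import zlib
from collections import Counter

def target_partition(key, num_partitions):
    # Deterministic hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) rows so equal keys end up together."""
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            shuffled[target_partition(key, num_partitions)].append((key, value))
    return shuffled

before = [
    [("IN", 1), ("US", 1), ("IN", 1)],
    [("US", 1), ("UK", 1)],
    [("IN", 1), ("UK", 1), ("US", 1)],
]
after = shuffle_by_key(before, num_partitions=2)

# After the shuffle, counting inside each partition gives complete group counts.
counts = [Counter(key for key, _ in part) for part in after]
print(counts)
```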

---

## 4. ❓ Your Doubt — What if a Group is Too Big for One Partition?
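- **128 MB** is Spark’s default cap on an *input* partition when reading files (`spark.sql.files.maxPartitionBytes`). Shuffle partitions have no fixed size: their *number* defaults to 200 (`spark.sql.shuffle.partitions`), and each one holds whatever rows hash to it.  
- After `groupBy()`, Spark ensures all rows of the same group key land in the **same partition**.  
- But if one group is far bigger than the rest, say well over 128 MB, what happens?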

### ✅ The Answer:
1. **Partition size is not a hard cap**  
   - Spark can create partitions larger than 128 MB if needed.  
   - The big group is stored in one partition, even if it exceeds the “ideal” size.

2. **Aggregation correctness is preserved**  
   - Example: once the shuffle has placed the whole group in one partition, `count()` runs within that partition only.  
   - So that final counting step behaves like a **narrow dependency** (independent of other partitions).

3. **Performance concerns (data skew)**  
   - One executor may get a huge partition → slower job.  
   - Spark reduces the risk of crashes by spilling to disk when memory runs low, but a badly skewed partition can still cause out-of-memory errors.

4. **Optimizations in practice**  
   - Increase shuffle partitions (`spark.sql.shuffle.partitions`).  
   - Apply *salting keys* to break very large groups into subgroups.  
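   - Use Spark’s **skew join optimization** (part of adaptive query execution, Spark 3+). It targets skewed joins; a skewed `groupBy` still needs salting or more partitions.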

👉 So: correctness is always maintained, but performance may degrade if data skew is high.
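
The salting idea from the list above can be sketched in plain Python as a two-stage count. Assumptions: the key names and row counts are invented, and the salt is assigned round-robin so the output is reproducible (in practice the salt is usually random). Stage 1 counts per salted key, so the hot group is split into smaller subgroups that can land in different partitions; stage 2 strips the salt and adds the partial counts.

```python
from collections import Counter

NUM_SALTS = 8  # hypothetical number of subgroups per key

# Skewed sample data: one "hot" key dominates.
keys = ["hot"] * 9_000 + ["a"] * 500 + ["b"] * 500

# Stage 1: append a salt so the hot key becomes hot#0 ... hot#7,
# then count each salted key separately.
salted = [f"{key}#{i % NUM_SALTS}" for i, key in enumerate(keys)]
partial = Counter(salted)

# Stage 2: strip the salt and combine the partial counts.
final = Counter()
for salted_key, n in partial.items():
    original_key = salted_key.rsplit("#", 1)[0]
    final[original_key] += n

largest_group_before = max(Counter(keys).values())
largest_group_after = max(partial.values())
print(largest_group_before, largest_group_after, dict(final))
```

The final counts match the unsalted result, while the largest group any single partition has to hold shrinks by roughly the number of salts. The cost is an extra aggregation stage.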

---

# 📝 Summary
- **Narrow dependency**: map-style operations → `select`, `filter`, `withColumn`, `map`, `count` (post-shuffle).  
- **Wide dependency**: reduce-style operations → `groupBy`, `orderBy`, `join`, `distinct`, `reduceByKey`.  
- **Large group in wide dependency**: Spark allows big partitions → results are correct, but skew can slow down jobs.