## 📑 Table of Contents
- [Spark DataFrame Partitions](#spark-dataframe-partitions)

# Spark DataFrame Partitions

## 1. Data Stored in Distributed Systems
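- In production, files (CSV, JSON, etc.) usually live in **distributed storage** such as HDFS or Amazon S3.
- On HDFS, a large file is **split into blocks** that are spread (and replicated) across nodes. Spark then reads the file as a set of **partitions**, typically one per block-sized chunk. On object stores such as S3 there are no user-visible blocks, so Spark splits the file by byte range instead.
  - Example: 100 partitions across 10 nodes.
- This splitting makes **parallel reading** possible.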
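
As a rough sketch of where a number like "100 partitions" can come from, the snippet below counts fixed-size blocks for a file. The 128 MiB block size is the usual HDFS default. The 12.5 GiB file size and the helper name `count_blocks` are assumptions made up for illustration. Real partition counts also depend on Spark settings such as `spark.sql.files.maxPartitionBytes`.

```python
import math

MIB = 1024 * 1024


def count_blocks(file_size_bytes: int, block_size_bytes: int = 128 * MIB) -> int:
    """Number of fixed-size blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_bytes / block_size_bytes)


file_size = 12_800 * MIB   # hypothetical 12.5 GiB file
nodes = 10                 # matches the 10-node example above

blocks = count_blocks(file_size)
print(blocks)              # 100
print(blocks / nodes)      # 10.0 blocks per node on average
```

The even 10-per-node split is only an average; actual block placement is decided by the storage system.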

---

## 2. Logical DataFrame
- When you call `spark.read.csv("file.csv")`, Spark creates a **logical DataFrame**:
  - It holds **metadata** only: the schema, the file locations and partition layout, and how to read them.
  - **No data is loaded yet** (lazy evaluation), apart from the small reads Spark may need for a header line or for schema inference.
- Think of it like a **recipe**: the instructions exist, but the meal (data) isn't cooked until you trigger an action such as `count()` or `show()`.

---

## 3. Role of the Driver (SparkSession)
- The driver (the process that hosts your `spark` session) is the **brain**:
  - Contacts the cluster manager and the storage system.
  - Collects information about the partitions.
  - Creates the **plan** to process them.
- The driver does not load the data itself; it only coordinates the work.

---

## 4. Executors (the workers)
- Executors are **JVM processes** launched by the cluster manager.  
- They do the **real work**:
  - Load their assigned partitions into memory.
  - Run tasks (filter, join, aggregate, etc.).  
- Each executor usually has multiple **cores**, and each core runs one task, which processes **one partition at a time**.

---

## 5. Answer to the Doubt ✅
- If you have **5 executors** and **5 cores per executor**:  
  - Total = `5 × 5 = 25 cores`.  
- That means **up to 25 partitions can be processed in parallel**.
- ⚠️ Notes:
  - If partitions < 25, some cores stay idle.
  - If partitions > 25, Spark processes them in **waves** (up to 25 at a time); as soon as a core finishes, it picks up the next pending partition.
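
The arithmetic above can be sketched in plain Python. The helper names `parallel_slots` and `num_waves` are hypothetical, and the sketch assumes one core per task (Spark's default). "Waves" is a simplification, since real tasks finish at different times.

```python
import math


def parallel_slots(num_executors: int, cores_per_executor: int) -> int:
    """Tasks that can run at once, assuming each task needs one core."""
    return num_executors * cores_per_executor


def num_waves(num_partitions: int, slots: int) -> int:
    """Rounds needed if every task took the same time."""
    return math.ceil(num_partitions / slots)


slots = parallel_slots(5, 5)
print(slots)  # 25

for partitions in (10, 25, 100):
    idle_cores = max(0, slots - partitions)
    print(partitions, num_waves(partitions, slots), idle_cores)
# 10 1 15
# 25 1 0
# 100 4 0
```

With 10 partitions, 15 cores sit idle; with 100 partitions, the job needs roughly four waves.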

---

## 6. Data Locality Optimization
- Spark tries to assign each partition to an executor **close to where the data is stored**, ideally on the same node.
- This reduces network traffic and speeds up jobs.
- If no such executor is free, Spark waits briefly (see `spark.locality.wait`) and then runs the task elsewhere, at the cost of transferring the data over the network.
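
A toy model of the locality preference, not Spark's actual scheduler: prefer a free executor on a host that holds the data, otherwise fall back to any free executor. The function `pick_executor` and the host and executor names are hypothetical. `NODE_LOCAL` and `ANY` are two of Spark's real locality levels; the sketch ignores the others and the wait timeout.

```python
from typing import Dict, List, Tuple


def pick_executor(replica_hosts: List[str], free_executors: Dict[str, str]) -> Tuple[str, str]:
    """Pick an executor for one partition.

    replica_hosts: hosts that store a copy of the partition's data.
    free_executors: executor id -> host, for executors with a free core.
    Returns (executor_id, locality_level).
    """
    if not free_executors:
        raise ValueError("no free executors")
    # Prefer an executor on a host that already holds the data.
    for executor_id, host in free_executors.items():
        if host in replica_hosts:
            return executor_id, "NODE_LOCAL"
    # Otherwise take the first free executor; the data travels over the network.
    return next(iter(free_executors)), "ANY"


free = {"exec-1": "node-a", "exec-2": "node-b"}
print(pick_executor(["node-b", "node-c"], free))  # ('exec-2', 'NODE_LOCAL')
print(pick_executor(["node-d"], free))            # ('exec-1', 'ANY')
```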

---

## 7. Final Picture
A **distributed DataFrame** is created:
- Driver manages the plan.  
- Executors (with multiple cores) load and process partitions in parallel.  
- Together, they form a **scalable system** for big data.

---

## 📊 Diagram
![Spark DataFrame Partitions](./images/driver_execution.png)