# Table of Contents
- [Executing Spark Programs in Apache Spark](#executing-spark-programs-in-apache-spark)
    - [Interactive Clients](#interactive-clients)
    - [spark-submit](#spark-submit)
- [Spark on YARN Driver/Executor Flow](#spark-on-yarn-driverexecutor-flow)
- [How Spark Runs: Local, Interactive Client, and Cluster Mode](#how-spark-runs-local-interactive-client-and-cluster-mode)

# Executing Spark Programs in Apache Spark

There are two primary ways to run Spark code: **Interactive Clients** and **`spark-submit`**. Use interactive tools for learning/exploration, and `spark-submit` for production-grade jobs.

---

## Interactive Clients

**What it is**
- Tools like the **Spark shells** (`spark-shell` for Scala, `pyspark` for Python) and **notebooks** (Jupyter, Databricks) where you run code interactively, **one statement or cell at a time**.

**Why**
- Fast feedback loop for **learning**, **data exploration**, and **debugging**.

**When**
- During **development** or when prototyping logic before packaging an application.

**Example (PySpark shell)**
    
    $ pyspark
    # Inside the shell:
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.groupBy("category").count().show()

**Example (Notebook cell)**
    
    df = spark.read.parquet("/mnt/raw/events/")
    # display() is Databricks-specific; in plain PySpark or Jupyter, use .show() instead
    df.select("user_id", "event_type").where(df.event_type == "purchase").display()

---

## spark-submit

**What it is**
- A **universal CLI tool** that launches a packaged Spark application on any supported master: `local`, Standalone, YARN, or Kubernetes. It does not build the package itself; you supply the `.py` file or JAR.

**Why**
- Suitable for **production**: scheduled **batch** jobs and **streaming** apps with resource configs, retries, and logs handled by the cluster.

**When**
- After prototyping is done; for **automated**, **repeatable**, and **scalable** execution.

**Example (submit a Python app locally)**
    
    spark-submit \
      --master "local[*]" \
      my_app.py

**Example (submit to YARN in cluster mode)**
    
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executor.memory=4g \
      --conf spark.executor.cores=2 \
      jobs/daily_agg.py
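
When a scheduler launches the job, the same command is often assembled in code rather than typed by hand. The sketch below rebuilds the YARN example above as an argument list. `build_spark_submit` is a hypothetical helper, not part of Spark, and it only emits the flags already shown in this section (`--master`, `--deploy-mode`, `--conf`).

```python
import shlex


def build_spark_submit(app, master, deploy_mode=None, conf=None):
    """Return the argv list for a spark-submit invocation.

    Hypothetical helper for illustration; not a Spark API.
    """
    argv = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        argv += ["--deploy-mode", deploy_mode]
    # Sort the keys so the generated command is stable across runs.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    return argv


cmd = build_spark_submit(
    "jobs/daily_agg.py",
    master="yarn",
    deploy_mode="cluster",
    conf={"spark.executor.memory": "4g", "spark.executor.cores": 2},
)
# shlex.join quotes values such as local[*] that a shell could otherwise expand.
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues altogether.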

**Minimal Python app structure**
    
    # my_app.py
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("DailyAgg").getOrCreate()
    df = spark.read.parquet("s3://bucket/input/")
    out = df.groupBy("category").count()
    out.write.mode("overwrite").parquet("s3://bucket/output/daily_agg/")
    spark.stop()

---

## ✅ Quick Summary

- **Interactive Clients** → best for **exploration & development**; immediate results, iterate quickly.  
- **`spark-submit`** → best for **production**; packaged jobs (batch or streaming) running reliably on a cluster.

# Spark on YARN Driver/Executor Flow

This walk-through explains **what happens after you submit a Spark app** to **YARN** (the cluster manager) in cluster mode, and how the **Driver (master)** and **Executors (slaves)** are created and interact.

---

## 0) Submit
You run `spark-submit --master yarn ...`. The submission is sent to the **YARN ResourceManager (RM)**.

![Submission enters the cluster via RM](./images/a1_application.png)

---

## 1) Launch the Driver (via ApplicationMaster)
- **RM** selects a **NodeManager (NM)** with capacity and **launches one container** for your application’s **ApplicationMaster (AM)**.  
- For **Spark on YARN (cluster mode)**, the **Driver runs inside this AM container** (i.e., the AM hosts the Driver process).
- The Driver initializes the **SparkSession/SparkContext** and starts building the job DAG, while the AM registers the application with the RM.

**🔹 Important Clarification**
- The AM is not separate from the Driver in Spark-on-YARN (cluster mode).
- The first container hosts both AM + Driver.
- The AM’s only job is to talk to RM for resource requests (executors).
- Once the Driver is running, it coordinates all Executors directly.
- Executors are always in separate containers (one container = one executor).

![A1 Driver starts inside the cluster](./images/driver_process_a1.png)

---

## 2) Ask for Executor Containers
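- The **Driver (through the AM)** requests more containers from the **RM** for **Executors**. Each container is sized from the app's executor settings (for example `spark.executor.memory` and `spark.executor.cores`), which apply to every executor of that app.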
- **RM** allocates containers across available **NodeManagers**.
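
The memory requested for each executor container is larger than `spark.executor.memory`, because Spark adds an off-heap overhead on top of the JVM heap. The sketch below assumes the defaults documented for recent Spark releases (overhead = 10% of executor memory, with a 384 MiB floor); check the configuration reference for your version. `executor_container_mib` is a hypothetical name, and the sketch ignores any rounding YARN applies to match its allocation increment.

```python
def executor_container_mib(executor_memory_mib,
                           overhead_factor=0.10,
                           min_overhead_mib=384):
    """Estimate the memory (MiB) requested per executor container.

    Assumed defaults: 10% overhead with a 384 MiB minimum.
    """
    overhead = max(int(executor_memory_mib * overhead_factor), min_overhead_mib)
    return executor_memory_mib + overhead


# spark.executor.memory=4g -> 4096 MiB heap + 409 MiB overhead
print(executor_container_mib(4096))  # 4505
# Small executors hit the 384 MiB floor instead of the 10% rule
print(executor_container_mib(1024))  # 1408
```

If the total exceeds what a NodeManager can offer, the container cannot be placed, which is why executor memory is usually chosen with this overhead in mind.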

---

## 3) Launch Executors (one per container)
- Each allocated container is used to **start one Executor JVM**.
- Executors register back with the **Driver** and are ready to run tasks.
- **Important:** there is **one container for the Driver** and **one container per Executor** (Executors are *not* colocated in the Driver’s container).

![Driver with multiple Executors (each executor in its own container)](./images/a1_executor.png)

---

## 4) Schedule & Execute Tasks
- The **Driver** turns each job's DAG into **stages** (split at shuffle boundaries) and each stage into **tasks**, one per partition.  
- Tasks are assigned to **Executors**, preferring **data locality** when possible.  
- Executors:
  - Run tasks on their **partitions** (map/filter/joins/aggregations).  
  - Materialize shuffle outputs when needed.  
  - **Report status & metrics** back to the Driver.  
- On failures, the Driver can **relaunch tasks** or **request new containers** for lost executors.
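
The way shuffles cut a job into stages can be illustrated with a toy model. This is only a sketch: `split_into_stages` and the operation list are hypothetical, and Spark's real scheduler works on RDD dependencies rather than on a flat list of names. The assumption here is simply that a step needing a shuffle starts a new stage, while other steps are pipelined into the current one.

```python
def split_into_stages(ops):
    """Group (name, needs_shuffle) steps into stages.

    Toy model only: a step that needs a shuffle starts a new stage,
    while steps that do not are pipelined into the current stage.
    """
    stages = [[]]
    for name, needs_shuffle in ops:
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


# Hypothetical job: two shuffles, so three stages.
job = [
    ("read", False),
    ("filter", False),
    ("reduceByKey", True),
    ("mapValues", False),
    ("sortByKey", True),
]
stages = split_into_stages(job)
for index, stage in enumerate(stages):
    print(index, stage)
```

Each stage then runs as one task per partition, and a stage can start only after the stages it depends on have written their shuffle output.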

---

## 5) Completion & Cleanup
- When jobs finish:
  - Executors flush/write results (e.g., HDFS/S3/Delta/DB).
  - **Driver** stops, **AM** unregisters, and **RM** **releases all containers**.
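  - Container logs remain available through YARN (for example `yarn logs -applicationId <appId>` when log aggregation is enabled). The live Spark UI stops with the Driver; finished apps can be inspected in the Spark History Server if event logging (`spark.eventLog.enabled`) was turned on.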

---

## 6) Multiple Independent Apps
Each Spark application has **its own Driver and its own set of Executors**, isolated from other applications. Submitting a second app (A2) repeats the same flow with **separate** containers.

![A1 and A2 each have their own Driver and Executors](./images/a1_a2_application.png)

---

## 🧭 Roles Recap
- **YARN (RM/NM/AM)** = **Resource orchestration** (containers, placement, lifecycle).  
- **Spark Driver** = **Coordinator / “master”** (DAG, scheduling, task assignment, app state).  
- **Spark Executors** = **Workers / “slaves”** (run tasks, cache/shuffle data, report to Driver).

**Rule of thumb (cluster mode):** *1 Driver container* (shared with the AM) + *N Executor containers* per Spark app.
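
As a quick check of that rule, the tally below counts the YARN containers one app occupies. It is a sketch under stated assumptions: `yarn_container_plan` is a hypothetical helper, and the executor count is taken as fixed (for example via `spark.executor.instances`) with no dynamic allocation. In client mode YARN still starts an AM container to request resources, but the Driver runs outside the cluster.

```python
def yarn_container_plan(num_executors, deploy_mode="cluster"):
    """Tally the YARN containers one Spark app occupies.

    Hypothetical helper for illustration; not a Spark or YARN API.
    """
    if deploy_mode not in ("cluster", "client"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    return {
        # YARN starts one ApplicationMaster container per app.
        "am_containers": 1,
        # One container per executor, never shared with the AM.
        "executor_containers": num_executors,
        "total_containers": 1 + num_executors,
        # Only cluster mode places the Driver inside the AM container.
        "driver_in_am_container": deploy_mode == "cluster",
    }


plan = yarn_container_plan(num_executors=4, deploy_mode="cluster")
print(plan["total_containers"])  # 5
```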

# How Spark Runs: Local, Interactive Client, and Cluster Mode

Spark can run in different environments depending on whether you are testing locally, exploring interactively, or submitting production jobs to a cluster.

---

## 1) Spark Local Mode

**What it is**
- Runs entirely on a single machine, inside a single JVM; no real cluster manager.
- Configure with `local[n]`, where `n` is the number of worker threads that run tasks (`local[*]` uses one per logical core).

**How it behaves**
- The Driver and the executor logic share one JVM process; there are no separate executor processes.
- `local` or `local[1]` → one worker thread; tasks run serially.
- `local[n]` (n > 1) → up to `n` tasks run in parallel on threads inside that JVM.

**Best for**
- Quick development, debugging, unit tests.

**Example master config**
- `local[3]` (three worker threads, so up to three tasks at a time)

**Diagrams**
![Local mode with 1 worker thread](./images/local_1.png)
![Local mode with 3 worker threads](./images/local_3.png)
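
Per the Spark documentation, `local` runs with one worker thread, `local[K]` with K worker threads, and `local[*]` with as many worker threads as logical cores. The sketch below maps a master string to that thread count. `local_worker_threads` is a hypothetical helper, not Spark's own parser; it handles only these three forms and approximates the core count with `os.cpu_count()`.

```python
import os
import re


def local_worker_threads(master):
    """Return the number of task-running threads for a local master URL."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a supported local master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # Approximation of "one thread per logical core".
        return os.cpu_count() or 1
    return int(value)


print(local_worker_threads("local"))     # 1
print(local_worker_threads("local[3]"))  # 3
```

In a real session, the chosen value shows up as the default parallelism of the SparkContext.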

---

## 2) Interactive Client Mode (spark-shell / notebooks)

**What it is**
- The **Driver runs on your client machine** (laptop/IDE/notebook server).
- The Driver talks to the **cluster manager** (YARN/Kubernetes/Standalone) to start **Executors on the cluster**.

**Important**
- If the **client logs off or dies**, the **Driver dies**, and **all executors terminate** because they depend on the Driver.

**Best for**
- Ad-hoc analysis, exploration, iterative development.

**Diagrams**
![Client mode: Driver on client, Executors on cluster](./images/client_mode.png)
![If client/driver dies, executors die](./images/client_mode_dies.png)

---

## 3) Cluster Mode (production submissions)

**What it is**
- You submit a packaged app; the **Driver runs inside the cluster** (in a container), and **each Executor runs in its own container** on worker nodes.

**Why it matters**
- After submission, you can disconnect; the job continues to run.
- Preferred for **batch** and **streaming** production workloads.

**Diagram**
![Cluster mode: Driver and Executors inside the cluster](./images/cluster_mode.png)

---

## ✅ Quick Summary

- **Local mode** → Single machine, single JVM; `local[n]` = `n` worker threads running tasks (`local[1]` = serial).  
- **Interactive client mode** → Driver on client, executors on cluster; **if client logs off, the whole Spark app stops**.  
- **Cluster mode** → Driver + executors both in the cluster; safe to disconnect; ideal for production.