# 📘 Apache Beam Architecture

Welcome, students 👋

In this lecture, we will discuss the **architecture of Apache Beam**: its main components and the role each one plays.

---

## 🎯 What is Apache Beam?
- The **Beam model** provides *abstractions for unified data processing*.  
- It is the **core** of the Apache Beam framework.  
- Beam supports both:
  - **Batch processing** (finite, or *bounded*, datasets)
  - **Stream processing** (infinite, or *unbounded*, datasets)
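
To make "unified" concrete, here is a minimal plain-Python sketch (it does not use the Beam API). It assumes a single piece of transform logic, applied once to a finite list and once to a never-ending generator. The names `keep_even_squares`, `bounded`, and `unbounded` are made up for this illustration.

```python
import itertools
from typing import Iterable, Iterator


def keep_even_squares(values: Iterable[int]) -> Iterator[int]:
    """Lazily square each value and keep only the even squares."""
    for v in values:
        sq = v * v
        if sq % 2 == 0:
            yield sq


# Bounded (batch) input: a finite dataset that can be processed to completion.
bounded = [1, 2, 3, 4, 5, 6]
batch_result = list(keep_even_squares(bounded))

# Unbounded (streaming) input: a source that never ends,
# so we can only ever look at a prefix of the output.
unbounded = itertools.count(start=1)
stream_prefix = list(itertools.islice(keep_even_squares(unbounded), 3))

print(batch_result)   # [4, 16, 36]
print(stream_prefix)  # [4, 16, 36]
```

The transform itself does not know whether its input is finite or not; that is the idea behind one programming model for both batch and streaming.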

---

## 🖼️ Beam Architecture Diagram

Below is the high-level architecture of Apache Beam 👇

![image.png](attachment:image.png)

---

## 🔹 Step 1 — Beam SDKs (Software Development Kits)

- SDK = *Software Development Kit* → a library that lets the user build Beam pipelines.
- Beam SDKs provide a **unified programming model** that works for both batch and streaming.
- Current popular SDKs:
  - **Java SDK** ✅ (most mature)
  - **Python SDK** ✅ (actively used and evolving fast)
  - **Other SDKs** (for example Go; Scala users typically go through Scio, a third-party API built on the Java SDK)

👉 In simpler words: **Beam SDKs are the language-specific libraries you use to create pipelines.**
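
As a rough sketch of what "building a pipeline" with an SDK means, the toy class below only *records* the transforms the user asks for; nothing runs yet. This is a simplified stand-in written in plain Python, not the Beam SDK. `ToyPipeline`, `apply`, and `describe` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyPipeline:
    """Records transforms instead of running them (deferred execution)."""

    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def apply(self, kind: str, fn: Callable) -> "ToyPipeline":
        # Remember what the user asked for; do not touch any data yet.
        self.steps.append((kind, fn))
        return self  # returning self allows chaining

    def describe(self) -> List[str]:
        # The kinds of the recorded steps, in the order they were applied.
        return [kind for kind, _ in self.steps]


pipeline = (
    ToyPipeline()
    .apply("Filter", lambda x: x > 2)
    .apply("Map", lambda x: x * 10)
)

print(pipeline.describe())  # ['Filter', 'Map']
```

The point of the design: the SDK's job is to describe the work, and a separate component decides how and where to execute it.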

---

## 🔹 Step 2 — Beam Runner API

- Once a pipeline is defined in any SDK (Java/Python/etc.), it is **converted into a language-independent format**.
- The **Beam Runner API** defines this format; each SDK translates its pipelines into it.
- What gets converted:
  - The pipeline graph and its generic transforms, such as `Map`, `GroupByKey`, and `Filter`.
- What does **NOT** get fully converted:
  - **Language-specific functions (UDFs)** → they travel as opaque payloads and stay tied to the SDK language.
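
The split between "generic" and "language-specific" parts can be sketched as follows. This toy uses JSON as the language-independent format and base64-encoded source text as the opaque UDF payload; both are assumptions made for this illustration and not how Beam actually encodes pipelines. The `toy:` identifiers and the `to_portable` helper are invented for this example.

```python
import base64
import json
from typing import List, Tuple

# Each step: (name, generic transform id, UDF source in the SDK language).
steps: List[Tuple[str, str, str]] = [
    ("KeepLarge", "toy:transform:filter", "lambda x: x > 2"),
    ("TimesTen", "toy:transform:map", "lambda x: x * 10"),
]


def to_portable(pipeline_steps: List[Tuple[str, str, str]]) -> str:
    """Turn the steps into a JSON document any language could parse."""
    transforms = []
    for name, transform_id, udf_source in pipeline_steps:
        transforms.append({
            "name": name,
            "urn": transform_id,          # generic: readable by any runner
            "environment": "python",      # which SDK worker understands the payload
            # opaque: the runner carries these bytes around without interpreting them
            "payload": base64.b64encode(udf_source.encode("utf-8")).decode("ascii"),
        })
    return json.dumps({"transforms": transforms}, indent=2)


portable = to_portable(steps)
decoded = json.loads(portable)

print([t["urn"] for t in decoded["transforms"]])
# ['toy:transform:filter', 'toy:transform:map']
```

A runner written in any language can read the `urn` fields, while the `payload` only makes sense to a worker for the named environment.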

---

## 🔹 Step 3 — Runners

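- Runners execute the pipeline on different distributed systems.  
- Examples:
  - **Spark**
  - **Flink**
  - **Apex** (older Beam releases only; this runner has since been removed)
  - **Google Cloud Dataflow**
  - **Others** (for example Samza, or the local Direct Runner for testing)

👉 The same Beam code can run on any of these runners, because every runner consumes the same Runner API representation. Runners do differ in which Beam features they support, so check the capabilities of the runner you choose.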
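
To illustrate "same pipeline, different runners", the sketch below executes one list of steps on two toy runners: one sequential, one that splits the data across threads. These classes are simplified stand-ins in plain Python and not real Beam runners. `SequentialRunner`, `ThreadedRunner`, and `apply_steps` are hypothetical names, and the toy only handles element-wise steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Step = Tuple[str, Callable]


def apply_steps(steps: List[Step], data: List[int]) -> List[int]:
    """Apply element-wise steps (filter/map) to a list, in order."""
    for kind, fn in steps:
        if kind == "filter":
            data = [x for x in data if fn(x)]
        elif kind == "map":
            data = [fn(x) for x in data]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return data


class SequentialRunner:
    def run(self, steps: List[Step], data: List[int]) -> List[int]:
        return apply_steps(steps, data)


class ThreadedRunner:
    def __init__(self, workers: int = 2) -> None:
        self.workers = workers

    def run(self, steps: List[Step], data: List[int]) -> List[int]:
        # Split the input into chunks and process each chunk on its own thread.
        size = max(1, -(-len(data) // self.workers))  # ceiling division
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            parts = pool.map(lambda chunk: apply_steps(steps, chunk), chunks)
            # pool.map preserves chunk order, so the output order is stable.
            return [x for part in parts for x in part]


steps: List[Step] = [("filter", lambda x: x > 2), ("map", lambda x: x * 10)]
data = [1, 2, 3, 4, 5, 6]

sequential_result = SequentialRunner().run(steps, data)
threaded_result = ThreadedRunner(workers=2).run(steps, data)

print(sequential_result)  # [30, 40, 50, 60]
print(threaded_result)    # [30, 40, 50, 60]
```

The pipeline definition (`steps`) never changes; only the object that executes it does.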

---

## 🔹 Step 4 — SDK Workers

- Each SDK has a **corresponding worker** that executes language-specific code.
- Workers run the **language-specific parts** of a pipeline, above all the **user-defined functions (UDFs)**, on behalf of the runner.
- Examples:
  - Java SDK → **Java Worker**
  - Python SDK → **Python Worker**
  - Other SDKs → **Other Workers**
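
A minimal sketch of the idea: the runner does not call UDFs itself; it looks up the worker for the right language and hands the call over. Everything here is plain Python, so the "java" worker is only a stand-in for a separate JVM process. `ToyWorker`, `workers`, and `run_step` are hypothetical names.

```python
from typing import Any, Callable, Dict


class ToyWorker:
    """Owns the UDFs for one language and executes them on request."""

    def __init__(self, language: str, functions: Dict[str, Callable[[Any], Any]]) -> None:
        self.language = language
        self._functions = functions

    def execute(self, fn_name: str, element: Any) -> Any:
        return self._functions[fn_name](element)


# One worker per SDK language (assumption: each UDF lives in exactly one worker).
workers: Dict[str, ToyWorker] = {
    "python": ToyWorker("python", {"shout": lambda s: s.upper()}),
    "java": ToyWorker("java", {"length": lambda s: len(s)}),  # stand-in for a JVM worker
}


def run_step(environment: str, fn_name: str, element: Any) -> Any:
    """What a runner does conceptually: route the call to the right worker."""
    return workers[environment].execute(fn_name, element)


print(run_step("python", "shout", "beam"))  # BEAM
print(run_step("java", "length", "beam"))   # 4
```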

---

## ✅ Recap

- **Beam SDKs**: Let users write pipelines in their favorite language.  
- **Beam Runner API**: Defines the common representation that pipelines are converted into.  
- **Runners**: Systems that actually execute the pipeline (Spark, Flink, Dataflow, etc.).  
- **Workers**: Execute language-specific functions.  

👉 Thanks to this architecture, **the same pipeline code can run on multiple runners**, typically with no change beyond selecting the runner.

---


# 📘 Apache Beam — Dynamic View

---

## 🖼️ Multi-language Example

Here is a more dynamic view of what **Beam** is doing 👇

![3-](attachment:image.png)

- You write your program in **any language that has a Beam SDK** (Python, Java, Go, etc.).  
- Example in the diagram: the **same filter operation** is written in multiple languages.  
- Beam converts your program into a **language-agnostic format**.  
- Any **language-specific code** (such as UDFs) is executed by the corresponding **SDK Worker**.  
- Finally, the program can be executed on **any supported runner** (Spark, Flink, Apex, Dataflow, Samza, etc.).  

---

## 🔑 Why add Runner API & SDK Workers?

- **Separation of concerns**:
  - Add a new **SDK** without worrying about runners.
  - Add a new **runner** without worrying about SDK details.  

👉 This modular design lets the Beam ecosystem evolve in parallel.
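
One way to see the benefit is to count integrations. The sketch below uses a deliberately simplified model: without a shared layer, every SDK needs its own integration with every runner; with a shared layer, each SDK and each runner integrates once with that layer. The function name and the example numbers are made up for illustration.

```python
def integrations_needed(num_sdks: int, num_runners: int, shared_layer: bool) -> int:
    """Count the integrations to build under a simplified model."""
    if shared_layer:
        # Each SDK and each runner talks only to the common layer.
        return num_sdks + num_runners
    # Every SDK must be wired to every runner directly.
    return num_sdks * num_runners


without_layer = integrations_needed(3, 5, shared_layer=False)
with_layer = integrations_needed(3, 5, shared_layer=True)

print(without_layer)  # 15
print(with_layer)     # 8
```

Adding a fourth SDK costs one more integration with the shared layer, instead of five more direct ones.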

---

## ❓ Common Student Question

**Q: Why do we need Beam when we already have Spark and Flink?**  
- Students often confuse **Beam** with execution engines.  

**Answer:**  
- **Beam = Programming model** ✅  
- **Spark / Flink / Dataflow = Execution engines** ⚡  

Beam’s primary goal is **portability**, not raw performance:  
- One codebase → multiple engines (batch + streaming).  
- Performance depends on the chosen execution engine (e.g., Spark vs Flink vs Dataflow).  

---

## ⚡ Key Takeaways

- Beam is a **unified programming model**, not an execution engine.  
- It provides **portability** across runners and languages (write once, run on many engines).  
- Performance depends on the **underlying runner** (Spark, Flink, Dataflow, etc.).  
- Communities of Spark, Flink, Samza, etc. are adapting their systems to support Beam.  
- Beam acts as an **abstraction layer** for data pipelines: pipeline code is insulated from the choice of engine, which makes it easier to switch engines later.  

---


# 📘 Apache Beam — Adding Fn API

---

## 🖼️ Extended Architecture with Fn API

Now, one more important component to add in the architecture is the **Fn API** 👇

![image.png](attachment:image.png)

---

## 🔹 What is Fn API?

- The Fn API defines a set of **protocols**.
- It allows **SDK Workers** (Java Worker, Python Worker, etc.) to **communicate with Runners** in a consistent way.
- Purpose:
  - Provides a **common channel** for moving information (such as work instructions, data, state, and logs) between workers and runners.
  - Lets a runner execute UDFs written in any SDK language without having to understand that language.
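
The request/response idea behind such a protocol can be sketched with plain JSON messages. This is a toy in a single Python process: the real Fn API is not JSON and its message shapes differ. The field names (`instruction_id`, `udf`, `elements`, `outputs`) and the classes `ToySdkWorker` and `ToyRunner` are assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict, List


class ToySdkWorker:
    """Receives text messages, runs the requested UDF, replies with text."""

    def __init__(self, udfs: Dict[str, Callable[[Any], Any]]) -> None:
        self._udfs = udfs

    def handle(self, request_json: str) -> str:
        request = json.loads(request_json)
        udf_name = request["udf"]
        udf = self._udfs.get(udf_name)
        if udf is None:
            response = {"instruction_id": request["instruction_id"],
                        "error": f"unknown udf: {udf_name}"}
        else:
            response = {"instruction_id": request["instruction_id"],
                        "outputs": [udf(e) for e in request["elements"]]}
        return json.dumps(response)


class ToyRunner:
    """Knows nothing about the UDF's language; it only speaks the protocol."""

    def __init__(self, worker: ToySdkWorker) -> None:
        self._worker = worker
        self._next_id = 0

    def process_bundle(self, udf_name: str, elements: List[Any]) -> List[Any]:
        self._next_id += 1
        request = json.dumps({"instruction_id": str(self._next_id),
                              "udf": udf_name,
                              "elements": elements})
        response = json.loads(self._worker.handle(request))
        if "error" in response:
            raise RuntimeError(response["error"])
        return response["outputs"]


runner = ToyRunner(ToySdkWorker({"double": lambda x: x * 2}))
doubled = runner.process_bundle("double", [1, 2, 3])

print(doubled)  # [2, 4, 6]
```

Because both sides agree only on the message format, either side could be replaced (a different runner, or a worker in another language) without touching the other.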

---

## 🔹 Full Flow of Beam Execution

1. **User → SDK**
   - User interacts with a language-specific SDK (Java, Python, Go, etc.)  
   - Writes Beam code (also called **creating a pipeline**).

2. **SDK → Runner API**
   - The SDK converts the pipeline into the **language-independent format** defined by the Runner API.

3. **Runner API → Execution Engine**
   - Execution engines (Spark, Flink, Apex, Cloud Dataflow, etc.) take the generic instructions.
   - They are responsible for:
     - Running the pipeline
     - Managing resources, tasks, and worker machines.

4. **Runner ↔ SDK Workers (via Fn API)**
   - SDK Workers execute the **language-specific functions (UDFs)** that the runner cannot run itself.
   - The Fn API ensures **consistent communication** between runners and SDK workers.
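
The four steps above can be sketched end to end in a few lines of plain Python. This is a toy model under simplifying assumptions (one process, JSON as the portable format, UDFs looked up by name) and not how Beam is implemented. All names (`UDFS`, `to_portable`, `worker_call`, `run`) are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# Step 1 (User → SDK): the user defines UDFs and lists the pipeline steps.
UDFS: Dict[str, Callable[[Any], Any]] = {
    "is_large": lambda x: x > 2,
    "times_ten": lambda x: x * 10,
}
user_pipeline: List[Tuple[str, str]] = [("filter", "is_large"), ("map", "times_ten")]


# Step 2 (SDK → Runner API): produce a language-independent description.
def to_portable(pipeline: List[Tuple[str, str]]) -> str:
    return json.dumps([{"kind": kind, "udf": name} for kind, name in pipeline])


# Step 4 (Runner ↔ SDK Worker): the only place where UDFs are actually called.
def worker_call(udf_name: str, element: Any) -> Any:
    return UDFS[udf_name](element)


# Step 3 (Runner API → Execution Engine): the runner reads the portable form,
# decides how to execute it, and delegates every UDF call to the worker.
def run(portable: str, data: List[Any]) -> List[Any]:
    for step in json.loads(portable):
        if step["kind"] == "filter":
            data = [x for x in data if worker_call(step["udf"], x)]
        elif step["kind"] == "map":
            data = [worker_call(step["udf"], x) for x in data]
        else:
            raise ValueError(f"unknown step kind: {step['kind']}")
    return data


result = run(to_portable(user_pipeline), [1, 2, 3, 4])
print(result)  # [30, 40]
```

Note that `run` never touches a lambda directly: it only sees step kinds and UDF names, which mirrors the runner's language-independent view of the pipeline.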

---

## ✅ Recap of Beam Architecture

- **Beam SDKs** → User writes pipelines in their favorite language.  
- **Runner API** → Defines the language-independent form that pipelines are converted into.  
- **Execution Engines (Runners)** → Actually run the pipelines (Spark, Flink, Dataflow, etc.).  
- **SDK Workers** → Execute language-specific code (UDFs).  
- **Fn API** → Standard protocol for smooth communication between workers and runners.  

👉 With this, the **Beam architecture is complete**:  
A flexible, unified programming model where one pipeline can run on multiple engines, across multiple languages, with consistent communication.

---

🎉 That’s the **basic architecture of Apache Beam**.  
I hope this was clear. Thank you!
