# 📖 Basic Terminologies in Apache Beam

To understand Beam well, we first need a few **key terms** that appear repeatedly throughout this course.  

---

## 1️⃣ Pipeline

👉 **What is a Pipeline?**

- A **Pipeline** encapsulates your entire data processing task (from start to finish).  
- This includes:
  1. Reading input data  
  2. Applying transformations  
  3. Writing the output data  

📌 **Key Points**  
- All Beam driver programs must create a **Pipeline**.  
- When we create a Pipeline, we also specify **execution options** (Beam calls these *pipeline options*):  
  - These tell the pipeline **where and how to run**.  
  - Examples:  
    - **Direct Runner** → Local machine  
    - **Dataflow Runner** → Google Cloud Dataflow  
    - **Flink Runner** → Apache Flink  
    - **Spark Runner** → Apache Spark  

💡 Think of a **Pipeline** as the counterpart of an **entire Spark application** (input → transformations → output).

---

## 2️⃣ PCollection

👉 **What is a PCollection?**  

- A **PCollection** is the **data structure** used by Beam.  
- It is roughly comparable to **RDDs/DataFrames in Spark** or **DataSet/DataStream in Flink**.  
- A PCollection represents a **distributed dataset** your pipeline operates on.

📌 **Bounded vs Unbounded PCollections**  
- **Bounded** → Comes from a finite source (e.g., a file).  
- **Unbounded** → Comes from a continuous source (e.g., Kafka, Pub/Sub, socket).  

💡 Where the classic Spark and Flink APIs use separate abstractions for batch and streaming (for example, Flink's DataSet vs. DataStream), Beam uses **one unified abstraction (PCollection)** for both bounded and unbounded data.  

---

### 🔑 Characteristics of a PCollection

1. **Immutability**  
   - Once created, you **cannot modify elements**.  
   - Applying a transform creates a **new PCollection**.  
   - Example:  
     - `P1` → Read file  
     - `P2` → Apply Filter on `P1`  
     - `P3` → Apply GroupBy on `P2`  

2. **Element Type**  
   - All elements in a PCollection must be of the **same type**.  
   - Beam must be able to encode each element as bytes; the SDK provides **coders** for common types and supports **custom coders**.

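3. **No Random Access**  
   - You cannot look up or update an **individual element** by position or index.  
   - A transform is applied to the **entire PCollection**; to affect only some elements, filter them or put the condition inside the processing function.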

4. **Element Timestamps**  
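   - Every element has an associated timestamp, initially assigned by the source that creates the PCollection.  
   - For **unbounded sources** → the timestamp often corresponds to when the element was read or added.  
   - For **bounded sources** → the source often assigns the same timestamp to every element.  
   - Developers can **manually assign timestamps**, for example from an event-time field in the data.  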

⏰ **Why do timestamps matter?**  
They are essential for **Windowing**, **Watermarks**, and handling **late data**.

---

## 3️⃣ PTransform

👉 **What is a PTransform?**

- A **PTransform** is a **step in your pipeline**.  
- Comparable to operations such as **map, filter, reduce, aggregations** in Spark/Flink.  

📌 **Definition**:  
- A PTransform takes **zero or more PCollections** as input (root transforms, such as reads, start from the pipeline itself).  
- Performs an operation (processing function) on the elements.  
- Produces **zero or more output PCollections**.  

💡 Developers spend most of their time **writing and chaining PTransforms** to meet business requirements.  

---

## 📝 Quick Recap

- **Pipeline** → Whole workflow (Input → Transforms → Output).  
- **PCollection** → Data container (bounded/unbounded, immutable, typed, timestamped).  
- **PTransform** → Operations/steps applied to PCollections.  

---

📌 In the next sections, we will explore these concepts with **hands-on coding** and real-world examples.
