# Evolution of Distributed Data Processing Frameworks

## 1. MapReduce (2004 - Google)
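- Introduced by Google in a 2004 research paper (*MapReduce: Simplified Data Processing on Large Clusters*).
- Let developers write two simple functions while the framework handled parallelization, data distribution, and fault tolerance.
- Works in **two steps**, with a framework-managed shuffle (group by key) in between:
  1. **Map Phase** → transforms input records into intermediate key-value pairs.
  2. **Reduce Phase** → aggregates all values that share a key.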
- **Limitations**:
  - Only suitable for problems that can be expressed as Map + Reduce.
  - Real-world problems usually need several chained MapReduce jobs.
  - Each job writes its output to disk and the next job reads it back, so chains incur many **disk reads/writes** and run slowly.
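
The two phases can be sketched in plain Python with a word count. This is a single-process illustration, not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and the `shuffle` step stands in for the grouping a real framework performs between the phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (normally done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def word_count(documents):
    # Run map over every document, then group by key, then reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data moves fast"])
```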

---

## 2. Hadoop (Apache, Open Source)
- Open-source implementation of the ideas in Google’s MapReduce and GFS papers (Hadoop MapReduce + HDFS).
- Enabled **low-cost distributed processing** using commodity hardware.
- Benefits:
  - Scalability
  - Simplicity
- Limitations:
  - Still dependent on MapReduce as the core.
  - Chains of jobs → more **HDFS I/O**, slower performance.

---

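## 3. Flume / FlumeJava (Optimization of MapReduce)
- Refers to Google’s FlumeJava library, not Apache Flume (a separate log-collection tool).
- Provided a **high-level API** for building data pipelines.
- Created a **graph of operations** instead of just Map + Reduce.
- Optimized the graph into fewer MapReduce jobs → fewer intermediate disk writes.
- Its programming model influenced later systems such as Spark and Flink, and it is a direct ancestor of the Dataflow/Beam model.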
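
The saving from optimizing a chain can be illustrated with a toy comparison. This is only a sketch of the idea of fusing element-wise stages, under the assumption that each separately run stage materializes its output once. The helper names `run_chained` and `fuse` are hypothetical, and a real optimizer also has to respect shuffle boundaries, which this example ignores.

```python
def run_chained(records, stages):
    """Run each stage as its own job and count the materialized outputs."""
    writes = 0
    for stage in stages:
        records = [stage(r) for r in records]  # stands in for a write to disk
        writes += 1
    return records, writes


def fuse(stages):
    """Combine several element-wise stages into one function."""
    def fused(record):
        for stage in stages:
            record = stage(record)
        return record
    return fused


stages = [str.strip, str.lower, len]
data = ["  Map ", " REDUCE  "]

# Three separate jobs: three intermediate outputs.
chained_result, chained_writes = run_chained(data, stages)
# One fused job: same answer, a single output.
fused_result, fused_writes = run_chained(data, [fuse(stages)])
```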

---

## 4. Apache Spark
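- Advertised as up to **100x faster than Hadoop MapReduce** for in-memory workloads, mainly by keeping intermediate data in memory instead of writing it to disk.
- Supported **both Batch and Stream processing**.
- Core Concepts:
  - **Lazy Evaluation** → transformations only build a DAG (Directed Acyclic Graph) of operations; nothing runs until an action (e.g. `collect`, `count`) is called.
  - The scheduler then executes the DAG in stages, pipelining transformations where possible.
- Spark Streaming:
  - Processes streams as micro-batches; fast, but the original API lacked support for **late/out-of-order events** (event-time watermarks arrived later with Structured Streaming).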
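
Lazy evaluation can be sketched without Spark itself. The `LazyDataset` class below is a hypothetical stand-in, not the Spark API: transformations only record a step, and the recorded plan (here a simple linear chain, the degenerate case of a DAG) runs when the `collect` action is called.

```python
class LazyDataset:
    """Hypothetical lazily evaluated dataset (illustration only, not Spark)."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # recorded transformations, in order

    def map(self, fn):
        # Transformation: records the step, executes nothing.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: records the step, executes nothing.
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def collect(self):
        # Action: the whole recorded plan now runs in a single pass.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def square(x):
    calls.append(x)  # side effect so we can observe when work happens
    return x * x


dataset = LazyDataset(range(6)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has executed yet
result = dataset.collect()        # triggers execution of the plan
```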

---

## 5. Apache Flink
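- Similar to Spark but with:
  - Lower streaming latency, because records are processed one at a time instead of in micro-batches.
  - Automatic optimizations (for example, a cost-based optimizer for batch programs).
- **True Stream Processing Engine**:
  - Treats streams as **first-class citizens**; batch is handled as a bounded stream.
  - Handles **out-of-order events** using:
    - Watermarking → tracks how far event time has progressed.
    - State management → keeps per-key and per-window state across events.
    - Triggers → decide when a window emits its result.
- Comparable to Google’s **MillWheel**.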
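
How watermarks, state, and triggers interact can be sketched in plain Python. This is a simplified illustration, not Flink's API. It assumes tumbling event-time windows, a watermark that trails the newest event time by a fixed bound, and that events arriving after their window has fired are dropped. `WINDOW_SIZE`, `MAX_OUT_OF_ORDER`, and `windowed_counts` are names invented for this example.

```python
from collections import defaultdict

WINDOW_SIZE = 10      # seconds per tumbling window (assumed value)
MAX_OUT_OF_ORDER = 5  # how far the watermark trails the newest event (assumed)


def windowed_counts(events):
    """Count events per event-time window, emitting windows on watermark."""
    open_windows = defaultdict(int)  # state: window start -> running count
    fired, dropped = [], []
    watermark = float("-inf")
    for event_time, payload in events:  # arrival order, not time order
        window_start = event_time - event_time % WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            # The window for this event has already fired: treat it as too late.
            dropped.append((event_time, payload))
            continue
        open_windows[window_start] += 1
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDER)
        # Trigger: emit every window whose end the watermark has passed.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                fired.append((start, open_windows.pop(start)))
    return fired, dropped, dict(open_windows)


# Event (3, "d") arrives after (12, "c") but is still counted in window 0;
# event (2, "f") arrives after window 0 has fired and is dropped.
events = [(1, "a"), (4, "b"), (12, "c"), (3, "d"), (16, "e"), (2, "f"), (27, "g")]
fired, dropped, still_open = windowed_counts(events)
```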

---

## 6. Apache Ecosystem Expansion
- Alongside Hadoop/Spark/Flink:
  - **Hive** → SQL-like queries (HiveQL) over data stored in Hadoop
  - **HBase** → Wide-column NoSQL store modeled on Bigtable
  - **Tez** → DAG-based execution engine, used in place of MapReduce
  - **Oozie** → Workflow scheduler
- Built a complete **open-source Big Data ecosystem**.

---


![image.png](attachment:image.png)


---
## 7. Google’s Evolution
- Google internally built:
  - Bigtable, Spanner, Megastore, Dremel, Colossus, Pub/Sub, MillWheel.
- **2014**: Announced **Cloud Dataflow**:
  - Fully managed service for batch + stream pipelines.
  - Unified model for processing.
  - SDK + Execution engine.
- **2016**: Donated the Dataflow SDK to the Apache Software Foundation → became **Apache Beam**.

---

## 8. Apache Beam (Bridging Both Worlds)
- Based on the Google Dataflow programming model.
- Provides a **unified API**:
  - Write pipelines in **any supported language**.
  - Run pipelines on **any execution engine** (Spark, Flink, Dataflow, etc.).
- Key Benefits:
  - **Language-agnostic** → choose Python, Java, Go, etc.
  - **Runner-agnostic** → run on Spark, Flink, Cloud Dataflow, or another supported runner.
  - Decouples **SDKs (programming model)** and **Runners (execution engines)**.
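
The SDK/runner split can be illustrated with a small plain-Python sketch. It deliberately does not use the Beam library: `Pipeline`, `InMemoryRunner`, and `PerElementRunner` are hypothetical names, and each step is assumed to be a function that returns a list of outputs for one input. The point is only that the pipeline description is independent of how it is executed.

```python
class Pipeline:
    """Hypothetical pipeline description: an ordered list of named steps."""

    def __init__(self):
        self.steps = []

    def apply(self, name, fn):
        # fn takes one element and returns a list of output elements.
        self.steps.append((name, fn))
        return self


class InMemoryRunner:
    """Runs each step over the whole collection, one step at a time."""

    def run(self, pipeline, data):
        for _name, fn in pipeline.steps:
            data = [out for item in data for out in fn(item)]
        return data


class PerElementRunner:
    """Pushes each element through every step before reading the next one."""

    def run(self, pipeline, data):
        results = []
        for item in data:
            current = [item]
            for _name, fn in pipeline.steps:
                current = [out for x in current for out in fn(x)]
            results.extend(current)
        return results


# The pipeline is written once...
pipeline = (
    Pipeline()
    .apply("split", lambda line: line.split())
    .apply("drop_short", lambda word: [word] if len(word) > 2 else [])
    .apply("upper", lambda word: [word.upper()])
)
lines = ["to be or not", "that is the question"]

# ...and executed by two different runners with the same result.
batch_result = InMemoryRunner().run(pipeline, lines)
element_result = PerElementRunner().run(pipeline, lines)
```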

---

# Summary
- **MapReduce (Google, 2004)** → Introduced a simple programming model for large-scale distributed data processing.
- **Hadoop (Apache)** → Open-source revolution, scalable but slow.
- **Flume (FlumeJava)** → Pipeline abstraction, fewer disk writes.
- **Spark** → Fast, DAG-based, batch + streaming.
- **Flink** → True streaming, handles out-of-order data.
- **Google Dataflow** → Unified batch + stream, fully managed.
- **Apache Beam** → One programming model, run anywhere.
