# Hadoop Ecosystem

Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware.

## Core Hadoop Components

<img src="./pic/2_HADOOP-ECOSYSTEM.png" width=700>

- Processing Layer: MapReduce / Spark / Hive / Pig
- Resource Management: YARN (ResourceManager, NodeManager, ApplicationMaster)
- Storage Layer: HDFS (NameNode, DataNode)
- Data Ingestion: Kafka, Flume, Sqoop



## HDFS (Hadoop Distributed File System)

### Why HDFS?

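**Problem**: How do you store a 1TB log file?
- A single machine may not have enough disk capacity to hold it
- A single machine is a single point of failure (a hardware failure loses the data)
- A single machine can't read it fast enough (one disk's throughput is the bottleneck)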

**Solution**: HDFS **splits, distributes, and replicates data**


### Core Design Principles

```text
Original large File (1TB)
       │
       ▼
┌──────────────────────────────────────────────────────────────┐
│        1. Split into Blocks (default 128 MB each)            │
│  [Block 1] [Block 2] [Block 3] ... [Block 8192]              │
└──────────────────────────────────────────────────────────────┘
       │
       ▼
┌────────────────────────────────────────────────────────────────┐
│     2. Blocks are distributed across DataNodes (machines)      │
│                                                                │
│  DataNode 1    DataNode 2    DataNode 3    DataNode 4          │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐         │
│  │Block 1  │   │Block 1  │   │Block 2  │   │Block 1  │ ←Replica│
│  │Block 2  │   │Block 3  │   │Block 3  │   │Block 4  │         │
│  │Block 5  │   │Block 4  │   │Block 6  │   │Block 5  │         │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘         │
│  3. Each block is stored with multiple replicas (default: 3)   │
└────────────────────────────────────────────────────────────────┘
```
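
The numbers in the diagram can be checked with a short calculation. This is a minimal sketch assuming a 1 TiB file and the defaults shown above (128 MB blocks, replication factor 3); the function name `hdfs_footprint` is made up for illustration and is not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    # The last block may be smaller than block_size_bytes, hence the ceiling.
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Every block is stored `replication` times; blocks are not padded.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes

one_tib = 1024**4
blocks, raw = hdfs_footprint(one_tib)
print(blocks)           # number of 128 MB blocks
print(raw / 1024**4)    # raw storage consumed, in TiB
```

With the defaults, a file costs three times its size in raw disk space, which is the price paid for surviving DataNode failures.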



#### Three Key Roles:   

**1. NameNode (Master)**

| Responsibility | Description |
|---------------|-------------|
| Metadata Management | Tracks which blocks make up each file |
| Block Location | Knows where each block is stored |
| Namespace Management | Maintains file system directory tree |
| Client Coordination | Directs clients to appropriate DataNodes |


NameNode Metadata Example:
```text
/user/data/sales.csv
├── Block 1 → [DataNode1, DataNode3, DataNode5]
├── Block 2 → [DataNode2, DataNode4, DataNode6]
└── Block 3 → [DataNode1, DataNode2, DataNode4]
```

**2. DataNode (Worker)**

| Responsibility | Description |
|---------------|-------------|
| Block Storage | Stores actual data blocks |
| Block Operations | Handles read/write requests |
| Heartbeat | Reports status to NameNode |
| Replication | Participates in block replication |

**3. Client**

- Applications that read/write data (**Spark, MapReduce, Hive**)
- Communicates with NameNode for metadata
- Communicates with DataNodes for actual data
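
How the three roles interact on a read can be sketched with a toy in-memory model. This is an illustration only: the classes `NameNode` and `DataNode`, the function `read_file`, and the block IDs are hypothetical stand-ins, not the real HDFS client API.

```python
# Toy in-memory model of the HDFS read path (not the real HDFS client API).

class NameNode:
    """Holds metadata only: file path -> ordered blocks -> DataNodes with replicas."""
    def __init__(self):
        self.block_map = {}

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Holds the actual block bytes."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Client: ask the NameNode where blocks live, then read from DataNodes."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        # Use the first replica that is reachable and holds the block.
        for node_name in replicas:
            node = datanodes.get(node_name)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return data


# Two blocks, each replicated on two DataNodes.
datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
datanodes["dn1"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_2"] = b"world"
datanodes["dn3"].blocks["blk_2"] = b"world"

nn = NameNode()
nn.add_file("/user/data/sales.csv",
            [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])

before_failure = read_file("/user/data/sales.csv", nn, datanodes)

# Simulate losing dn1: the read still succeeds from the replica on dn2.
del datanodes["dn1"]
after_failure = read_file("/user/data/sales.csv", nn, datanodes)
print(before_failure, after_failure)
```

The NameNode never touches file contents in this model; it only answers "where is each block?", which mirrors the split between metadata and data traffic described above.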



#### NameNode vs DataNode

| Aspect | NameNode | DataNode |
|------|---------|----------|
| Role | Master / Controller | Worker / Storage |
| Stores actual data | No | Yes |
| Stores metadata | Yes | No |
| Metadata includes | File names, directory structure, permissions, block locations | None for the namespace; only IDs and checksums of its own blocks |
| Client interaction | Provides block locations to clients | Serves data directly to clients |
| Failure impact | Critical (cluster becomes unavailable unless a standby takes over) | Non-critical (data replicated elsewhere) |
| Scalability | Vertical | Horizontal |
| Typical count | 1 active (+ standby) | Many (hundreds or thousands) |



#### HDFS Configuration Defaults

| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| Block Size | 128 MB | Size of each data block |
| Replication Factor | 3 | Number of copies of each block |
| Heartbeat Interval | 3 seconds | DataNode health check frequency |



## YARN (Yet Another Resource Negotiator)

### What Does YARN Do?

YARN is the **resource management** layer of Hadoop. It decides **who** runs what, **where** it runs, and with **how much** CPU and memory:
- Tracks available machines in the cluster
- Manages CPU and memory resources
- Schedules jobs
- Monitors execution; YARN restarts a failed ApplicationMaster, while retries of individual tasks are handled by the ApplicationMaster



### YARN Components

```text
┌─────────────────────────────────────────────────────────────────────┐
│                          YARN Architecture                          │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                    ResourceManager (RM)                        │ │
│  │                    [Global Authority]                          │ │
│  │    • Manages cluster-wide resources                            │ │
│  │    • Accepts job submissions                                   │ │
│  │    • Schedules applications                                    │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                              │                                      │
│              ┌───────────────┼───────────────┐                      │
│              ▼               ▼               ▼                      │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐     │
│  │  NodeManager 1   │ │  NodeManager 2   │ │  NodeManager 3   │     │
│  │  [Worker Node]   │ │  [Worker Node]   │ │  [Worker Node]   │     │
│  │                  │ │                  │ │                  │     │
│  │  ┌────────────┐  │ │  ┌────────────┐  │ │  ┌────────────┐  │     │
│  │  │ Container  │  │ │  │ Container  │  │ │  │ Container  │  │     │
│  │  │  (App      │  │ │  │  (Task)    │  │ │  │  (Task)    │  │     │
│  │  │  Master)   │  │ │  │            │  │ │  │            │  │     │
│  │  └────────────┘  │ │  └────────────┘  │ │  └────────────┘  │     │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
```



#### Component Responsibilities

| Component | Scope | Responsibilities |
|-----------|-------|------------------|
| **ResourceManager (RM)** | Cluster-wide | Global resource allocation, job scheduling |
| **NodeManager (NM)** | Per-machine | Reports local resources, manages containers |
| **ApplicationMaster (AM)** | Per-application | Requests resources, coordinates tasks |
| Container | Per-task | Isolated execution environment |



### YARN Job Execution Flow

<img src="./pic/2_Job-execution-process-in-YARN.png" width=700>



1. <font color=brown>Client</font> **submits application** to <font color=orange>ResourceManager</font>,   

   <font color=orange>ResourceManager</font> **allocates container** for <font color=blue>ApplicationMaster</font>

2. <font color=blue>ApplicationMaster</font> starts and **registers** with <font color=orange>ResourceManager</font>

3. <font color=blue>ApplicationMaster</font> **requests containers** for tasks

4. <font color=orange>ResourceManager</font> **allocates containers** on <font color=green>NodeManagers</font>

5. <font color=blue>ApplicationMaster</font> **launches tasks** in containers,    

   **Tasks execute and report progress** to <font color=blue>ApplicationMaster</font>

6. Application **completes**, <font color=blue>ApplicationMaster</font> **deregisters**
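
The six steps can be traced in a toy simulation. This is a sketch under simplifying assumptions: the classes `ResourceManager` and `ApplicationMaster`, the function `submit`, and the idea of counting "slots" per NodeManager are hypothetical and do not correspond to the YARN API, where containers are sized by memory and vcores.

```python
# Toy model of the YARN job flow (hypothetical classes, not the YARN API).

class ResourceManager:
    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)   # NodeManager name -> free container slots
        self.registered_apps = set()

    def allocate(self, count):
        """Grant up to `count` containers; return the NodeManagers hosting them."""
        granted = []
        for node in self.free_slots:
            while self.free_slots[node] > 0 and len(granted) < count:
                self.free_slots[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes):
        for node in nodes:
            self.free_slots[node] += 1


class ApplicationMaster:
    def __init__(self, app_id, rm):
        self.app_id, self.rm = app_id, rm

    def run(self, tasks):
        self.rm.registered_apps.add(self.app_id)        # step 2: register with the RM
        containers = self.rm.allocate(len(tasks))       # steps 3-4: request / allocate
        # Step 5: launch one task per granted container. A real AM would wait
        # for more containers if fewer were granted than requested.
        results = [task() for task, _ in zip(tasks, containers)]
        self.rm.release(containers)
        self.rm.registered_apps.discard(self.app_id)    # step 6: deregister
        return results


def submit(rm, app_id, tasks):
    am_container = rm.allocate(1)                       # step 1: container for the AM
    if not am_container:
        raise RuntimeError("no capacity for the ApplicationMaster")
    try:
        return ApplicationMaster(app_id, rm).run(tasks)
    finally:
        rm.release(am_container)


rm = ResourceManager({"nm1": 2, "nm2": 2})
results = submit(rm, "app_0001", [lambda: 1 + 1, lambda: 2 * 3, lambda: 10 - 4])
print(results)
```

Note that the ApplicationMaster itself occupies a container, so a cluster with four slots can run at most three tasks of this application at once.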




## MapReduce vs Spark
> Each was covered in detail in earlier notes.

### MapReduce (First Generation Hadoop compute engine)

**Characteristics**:
- **Disk-based** execution model
- Data written to disk after each stage
- High latency due to I/O overhead
- Reliable and scalable
- **Best For**: **Long-running batch jobs** where latency isn't critical

**Execution Pattern**:
```text
Input → Map → [Disk] → Shuffle → [Disk] → Reduce → Output
                ↑                    ↑
           Disk I/O            Disk I/O
```


**Example**: Word Count in MapReduce
```text
Map Phase:    "hello world" → [("hello", 1), ("world", 1)]
Shuffle:      Group by key
Reduce Phase: [("hello", [1,1,1])] → [("hello", 3)]
```
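
The same three phases can be written out in plain Python to make the data flow concrete. This is a single-process sketch of the pattern, not Hadoop's MapReduce API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the list of counts for one word into a total."""
    return key, sum(values)

lines = ["hello world", "hello hadoop", "hello spark"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network, which is where the disk and network costs shown in the execution pattern come from.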



### Spark (Next Generation Hadoop compute engine)

In a Hadoop context, "Spark" means Apache Spark running on a Hadoop cluster, using YARN as its cluster manager and often HDFS for storage.

It was designed to replace MapReduce for most workloads.

**Characteristics**:
- **In-memory** execution
- DAG-based optimization
- Low latency, interactive analytics
- Unified engine (supports batch, SQL, streaming, ML)
- **Best For**: Interactive analytics, iterative algorithms, near-real-time processing

**Execution Pattern**:
```text
Input → Transform → Transform → Transform → Output
              ↑           ↑           ↑
           Memory     Memory     Memory
           (cached)   (cached)   (cached)
```



### Comparison Table

| Aspect | MapReduce | Spark |
|--------|-----------|-------|
| Execution Model | Disk-based | In-memory |
| Speed | Slower | Up to 10-100x faster (workload-dependent) |
| Ease of Use | Complex (Java) | Simple (Python/Scala/SQL) |
| Iterative Algorithms | Poor | Excellent |
| Interactive Queries | Not suitable | Excellent |
| Streaming | Not native | Native support |
| Machine Learning | Limited | MLlib integrated |




# Apache Spark

## What is Apache Spark?

Apache Spark is a **unified, distributed data processing engine** designed for large-scale data analytics.   

Clearing up the "Hadoop Spark" confusion:
- Apache Spark is the actual **distributed computing framework**, an open-source project maintained by the Apache Software Foundation.
- It is a standalone engine for large-scale data processing that can run on **various cluster managers**.
- It was originally developed to work within the Hadoop ecosystem as a replacement for MapReduce.
- Spark itself is independent of Hadoop: it can also run on Kubernetes, on Mesos, or in standalone mode without any Hadoop components.
- So there is only one Spark (Apache Spark); "Hadoop Spark" just means Spark deployed on a Hadoop cluster.

### Core Capabilities

```text
┌────────────────────────────────────────────────────────────────────┐
│                        Apache Spark                                │
│                                                                    │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐   │
│  │  Spark SQL  │ │  Spark      │ │   MLlib     │ │  GraphX     │   │
│  │  (SQL &     │ │  Streaming  │ │  (Machine   │ │  (Graph     │   │
│  │  DataFrames)│ │  (Real-time)│ │  Learning)  │ │  Processing)│   │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘   │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                     Spark Core (RDD)                        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │              Cluster Manager (YARN / K8s / Standalone)      │   │
│  └─────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────┘
```
- Large-scale batch data processing
- SQL analytics via Spark SQL
- Structured Streaming for real-time data
- Graph processing
- Machine Learning via MLlib

### Key Characteristics

| Characteristic | Description |
|---------------|-------------|
| Distributed | Processes data across multiple machines |
| Fault-tolerant | Automatically recovers from failures |
| In-memory–first execution mode | Keeps data in RAM for speed |
| Highly scalable across clusters | Handles petabytes of data |
| Multi-language | Scala, Java, Python, R |
| Open source | One of the largest open-source data projects |



## Spark Application

<img src="./pic/2_spark_app.png" width=500>

### Component Responsibilities

| Component | Responsibilities |
|-----------|------------------|
| **Driver** | Runs main program, builds DAG, splits into stages/tasks, schedules work, collects results |
| **Cluster Manager** | Allocates resources for executors (YARN, Kubernetes, Standalone) |
| **Executors** (on worker nodes) | Execute tasks in parallel, store cached data in memory, report status to driver |
| **Tasks** | Smallest unit of work, typically 1 task per partition per stage |

### Execution Flow

```text
User Code → SparkContext → DAG → Stages → Tasks → Executors
```

1. User writes transformations and actions
2. SparkContext builds **DAG (Directed Acyclic Graph)**
3. DAG is divided into **stages** (at shuffle boundaries)
4. Stages are divided into **tasks** (one per partition)
5. Tasks are sent to executors for **parallel execution**
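
Steps 3 and 4 can be illustrated with a small sketch that cuts a linear chain of operations into stages at each shuffle. This is a simplification under stated assumptions: real Spark plans are graphs rather than lists, and the names `WIDE_OPS` and `split_into_stages` are hypothetical, not Spark internals.

```python
# Operations treated as wide (shuffle-inducing) in this sketch.
WIDE_OPS = {"groupBy", "reduceByKey", "join", "distinct", "repartition"}

def split_into_stages(operations):
    """Cut a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])    # a shuffle starts a new stage
        stages[-1].append(op)
    return stages

job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "count"]
stages = split_into_stages(job)

# One task per partition per stage; assume 4 partitions in every stage.
num_partitions = 4
total_tasks = len(stages) * num_partitions

for index, stage in enumerate(stages):
    print(f"Stage {index}: {stage} -> {num_partitions} tasks")
print("Total tasks:", total_tasks)
```

The narrow operations before and after `reduceByKey` are pipelined inside one stage each, so this job needs only two stages despite having six operations.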




## Why Spark is Fast

### 1. RDD-Based Data Abstraction

**RDD (Resilient Distributed Dataset)**:
- **Immutable** distributed collections
- Data split into logical partitions
- Parallel processing across cluster nodes

```python
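# RDD partitioning example
# Assumes `sc` is an existing SparkContext (e.g. the one the pyspark shell provides)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

rdd = sc.parallelize(data, 3)  # request 3 partitions

# Elements are spread roughly evenly; Spark decides the exact split, e.g.:
# Partition 1: [1, 2, 3]
# Partition 2: [4, 5, 6]
# Partition 3: [7, 8, 9, 10]
print(rdd.glom().collect())  # inspect the contents of each partition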
```

### 2. In-Memory Computing

- Intermediate results cached in RAM
- Reduced disk I/O
- Optimized for iterative workloads

```text
Traditional Disk-Based Processing:
Step 1 → Write to Disk → Read from Disk → Step 2 → Write to Disk → Step 3
         ~100ms          ~100ms                    ~100ms

Spark In-Memory Processing:
Step 1 → Memory Cache → Step 2 → Memory Cache → Step 3
         ~0.1ms         ~0.1ms

Latencies are illustrative orders of magnitude, not measurements.
Speed improvement: up to 10-100x on iterative workloads, depending on the job
```
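
Why caching matters for iterative workloads can be shown by counting how often the input is loaded. This is a plain-Python analogy, not Spark code: `load_dataset` is a hypothetical stand-in for an expensive read, and keeping the list in a variable plays the role that `rdd.cache()` or `persist()` plays in Spark.

```python
# Plain-Python analogy: re-reading the input on every pass vs. caching it once.
load_calls = 0

def load_dataset():
    """Stand-in for an expensive read from disk or remote storage."""
    global load_calls
    load_calls += 1
    return list(range(1, 1001))

def iterate_without_cache(iterations):
    total = 0
    for _ in range(iterations):
        total += sum(load_dataset())   # input is re-loaded on every pass
    return total

def iterate_with_cache(iterations):
    cached = load_dataset()            # loaded once, then kept in memory
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total

load_calls = 0
result_without_cache = iterate_without_cache(10)
calls_without_cache = load_calls

load_calls = 0
result_with_cache = iterate_with_cache(10)
calls_with_cache = load_calls

print(calls_without_cache, calls_with_cache)
```

Both versions compute the same result; the cached version simply pays the loading cost once instead of once per iteration, which is the effect that makes in-memory execution attractive for algorithms that loop over the same data.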


### 3. Execution Efficiency

- **Pipeline execution**: Multiple operations combined within stages
- **Lazy evaluation**: Operations only execute when results are needed
- **Minimized data movement**: Intelligent task placement
- **Optimized scheduling**: Tasks run where data resides



## Creating RDDs

### Method 1: Parallelizing Existing Collections

```python
# Convert in-memory Python data to RDD
# Best for: testing, learning, small datasets

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# With specified partitions
rdd = sc.parallelize(data, numSlices=4)
```

### Method 2: Loading from External Storage

```python
# From local file system
rdd = sc.textFile("file:///path/to/data.txt")

# From HDFS
rdd = sc.textFile("hdfs://namenode:9000/data/file.txt")

# From Amazon S3
rdd = sc.textFile("s3a://bucket-name/data/file.txt")

# Multiple files (wildcard)
rdd = sc.textFile("hdfs:///data/logs/*.log")
```



## ⭐️ Transformations and Actions

### Lazy Evaluation

Key Rule: **No action → No execution**

Transformations build a plan (DAG)   
Actions trigger the plan execution -> stages -> tasks   

Example:
```python
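# Parentheses allow the chained calls to span several lines
result = (
    rdd.map(lambda x: x * 2)      # Transformation: just builds plan, nothing executes
       .filter(lambda x: x > 5)   # Transformation: just builds plan, nothing executes
       .count()                   # ACTION! Now everything executes
)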
```
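
The "no action, no execution" rule can be mimicked in plain Python to see exactly when work happens. This is a sketch of the idea only: the class `LazyPipeline` is hypothetical and far simpler than Spark's planner, and the `evaluated` list exists purely to observe when the mapped function is called.

```python
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""
    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    # Transformations: return a new pipeline with one more recorded step.
    def map(self, fn):
        return LazyPipeline(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self.source, self.steps + (("filter", fn),))

    def _run(self):
        items = iter(self.source)
        for kind, fn in self.steps:
            # Built-in map/filter are themselves lazy iterators.
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: consume the pipeline and produce a concrete result.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


evaluated = []

def double(x):
    evaluated.append(x)    # side effect so we can observe when work happens
    return x * 2

plan = LazyPipeline([1, 2, 3, 4, 5]).map(double).filter(lambda x: x > 5)
work_before_action = len(evaluated)    # nothing has run yet
result = plan.collect()                # action: the whole plan runs now
work_after_action = len(evaluated)
print(work_before_action, result, work_after_action)
```

Because the full chain is known before anything runs, a planner has the chance to pipeline steps and skip unnecessary work, which is the reason Spark defers execution until an action.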

### Transformations (Lazy, build a plan)

**Return a new RDD (or DataFrame) without executing immediately**.

```python
# Common transformations
rdd.map(lambda x: x * 2)           # Apply function to each element
rdd.filter(lambda x: x > 5)        # Keep elements matching condition
rdd.flatMap(lambda x: x.split())   # Map + flatten results
rdd.distinct()                      # Remove duplicates
rdd.union(other_rdd)               # Combine two RDDs
```

### Actions (Trigger Execution)

**Start a job and return results to driver or storage**.

```python
# Common actions
rdd.count()                        # Count elements
rdd.collect()                      # Return all elements to driver
rdd.take(n)                        # Return first n elements
rdd.first()                        # Return first element
rdd.reduce(lambda a, b: a + b)     # Aggregate elements
```



## Narrow vs Wide Transformations

### Narrow Transformations

**Definition**: **Each output partition** depends on **ONE input partition**

**Characteristics**:
- No shuffle required
- No network data movement
- Can execute within same stage
- Fast and efficient

```text
Narrow Transformation (map):

Partition 1 ──map──▶ Partition 1'
Partition 2 ──map──▶ Partition 2'
Partition 3 ──map──▶ Partition 3'

Each output depends only on its corresponding input
```

**Examples**: `map`, `filter`, `select`, `withColumn`, `union`

### Wide Transformations

**Definition**: Output partitions depend on **MULTIPLE input partitions**

**Characteristics**:
- Requires shuffle across network
- Creates stage boundary
- More expensive (I/O and network)

```text
Wide Transformation (groupBy):

Partition 1 ─┐
             ├──shuffle──▶ Partition 1' (all keys "A")
Partition 2 ─┤
             ├──shuffle──▶ Partition 2' (all keys "B")
Partition 3 ─┘
             └──shuffle──▶ Partition 3' (all keys "C")

Data must move between partitions based on keys
```

**Examples**: `groupBy`, `reduceByKey`, `join`, `distinct`, `repartition`
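
The difference between the two kinds of transformation can be modeled with lists standing in for partitions. This is a plain-Python sketch, not Spark's shuffle implementation; the helper names `narrow_map`, `partition_for`, and `wide_group_by_key` are hypothetical, and CRC32 is used only to get a hash that is stable across runs.

```python
import zlib

def narrow_map(partitions, fn):
    """Narrow: each output partition is computed from exactly one input partition."""
    return [[fn(record) for record in part] for part in partitions]

def partition_for(key, num_partitions):
    """Pick the target partition for a key with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def wide_group_by_key(partitions, num_partitions):
    """Wide: records are shuffled so all values for a key land in one partition."""
    shuffled = [{} for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            target = partition_for(key, num_partitions)
            shuffled[target].setdefault(key, []).append(value)
    return shuffled

inputs = [[("A", 1), ("B", 2)], [("A", 3), ("C", 4)], [("B", 5), ("A", 6)]]

doubled = narrow_map(inputs, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(inputs, num_partitions=3)

print(doubled)   # same partition layout as the input
print(grouped)   # values for each key gathered from all input partitions
```

In `narrow_map` no record ever leaves its partition, while `wide_group_by_key` has to read every input partition to build each output partition; on a cluster that second pattern becomes network traffic and a stage boundary.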



## Spark Challenges

### System Dependencies

| Challenge | Description |
|-----------|-------------|
| No built-in storage | Relies on external storage (HDFS, S3) |
| Cluster manager required | Needs YARN, Kubernetes, or Standalone for scale |
| External coordination | Depends on ZooKeeper for some features (e.g. high availability of the Standalone master) |

### Security

| Challenge | Description |
|-----------|-------------|
| Limited native security | Basic authentication and authorization |
| Platform-dependent | Security typically from: Kerberos, IAM, Ranger |
| Network security | Requires proper firewall configuration |

### Performance & Cost

| Challenge | Description |
|-----------|-------------|
| Performance tuning is hard | Complex: partitions, shuffle, skew, memory |
| Capacity management | Difficult to right-size clusters |
| Cost optimization | Challenging in cloud environments |
| Data skew | Uneven data distribution causes bottlenecks |
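
Data skew is easy to reproduce with a hot key. The sketch below is plain Python with made-up data: the key names, the 90/10 split, and the helpers `partition_sizes` and `skew_ratio` are all assumptions for illustration, and CRC32 stands in for whatever hash partitioner the engine uses.

```python
import zlib
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records a hash partitioner sends to each partition."""
    sizes = Counter()
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes[i] for i in range(num_partitions)]

def skew_ratio(sizes):
    """Largest partition relative to the mean; 1.0 means perfectly balanced."""
    return max(sizes) / (sum(sizes) / len(sizes))

# 90% of records share one hot key; the rest are spread over 100 keys.
keys = ["hot_customer"] * 900 + [f"customer_{i}" for i in range(100)]
sizes = partition_sizes(keys, num_partitions=4)

print(sizes)
print(round(skew_ratio(sizes), 2))
```

All records for the hot key must land in the same partition, so one task processes at least 900 of the 1000 records while the others sit nearly idle; the stage finishes only when that slowest task does.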



## Complete Spark Example

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum()

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis") \
    .getOrCreate()

# Read data (lazy: only the schema and file metadata are resolved here)
df = spark.read.parquet("s3://data/sales/")

# Apply transformations (all lazy)
result = (
    df
    .filter(F.col("year") == 2024)             # Narrow
    .filter(F.col("amount") > 0)               # Narrow
    .groupBy("region", "product_category")     # Wide (shuffle)
    .agg(
        F.sum("amount").alias("total_sales"),
        F.avg("amount").alias("avg_sale"),
        F.count("*").alias("num_transactions")
    )
    .orderBy(F.col("total_sales").desc())      # Wide (shuffle)
)

# Action - triggers execution
result.show(10)

# Another action - write results
result.write.mode("overwrite").parquet("s3://output/sales_summary/")

# Stop Spark
spark.stop()
```

---

