<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### ⚖️ Dynamic Resource Allocation in a Shared Cluster

This notebook demonstrates how to configure **Spark Dynamic Resource Allocation**
so a job can **scale up when busy** and **release resources when idle**
in a **shared YARN / Kubernetes cluster**.

We use a **shuffle-heavy aggregation** on `big_events_50k.csv`
to illustrate how Spark dynamically adjusts executors.


## 📂 Dataset

**Dataset Name:** `big_events_50k.csv`  

### Example Columns:
- `event_id`
- `event_time`
- `country`
- `device`
- `amount`

The dataset is large enough to:
- create **shuffle-heavy stages**
- demonstrate executor scaling behavior

> ⚠️ In real production systems, this dataset could be **hundreds of GBs**.
> The same dynamic allocation pattern applies regardless of size.


## 🗂️ Scenario

Your Spark job runs in a **shared cluster** (YARN or Kubernetes)
used by multiple teams.

Observed issues:
- Sometimes your job **consumes too many executors**
- Sometimes it **runs slowly due to lack of resources**
- Static executor allocation causes:
  - wasted resources when the job is idle
  - unfair usage during peak times

You want Spark to:
- automatically scale up during heavy processing
- release executors when work finishes
- avoid impacting other teams’ jobs

---

## 🎯 Task

Using `big_events_50k.csv`, design a Spark job that:

1. Enables **dynamic executor allocation**
2. Sets sensible **min / initial / max executors**
3. Safely handles shuffle data when executors are removed
4. Releases idle executors automatically
5. Works well in a **multi-tenant cluster**
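
Before submitting a job, it can help to sanity-check the executor bounds from step 2. The helper below is a minimal sketch, not Spark's own validation; the function name `validate_executor_bounds` and its messages are hypothetical and exist only for this illustration.

```python
from typing import List


def validate_executor_bounds(min_executors: int,
                             initial_executors: int,
                             max_executors: int) -> List[str]:
    """Return a list of problems; an empty list means the bounds are consistent."""
    problems = []
    if min_executors < 0:
        problems.append("minExecutors must be >= 0")
    if max_executors < 1:
        problems.append("maxExecutors must be >= 1")
    if min_executors > max_executors:
        problems.append("minExecutors must not exceed maxExecutors")
    if not (min_executors <= initial_executors <= max_executors):
        problems.append("initialExecutors should lie between min and max")
    return problems


# The bounds used later in this notebook: min=2, initial=4, max=20
print(validate_executor_bounds(2, 4, 20))
```

An empty list means the three values are ordered as `min <= initial <= max`, which is the shape this notebook assumes.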

---

## 🧩 Assumptions

- Spark runs on **YARN or Kubernetes**
- Cluster is shared by multiple teams
- The job contains **shuffle-heavy operations**
- Spark configs are set via:
  - `spark-submit`
  - cluster defaults
  - or SparkSession builder

> ⚠️ Databricks Serverless manages resources automatically
> and does **not allow manual dynamic allocation configuration**.


---

## 📦 Deliverables

- A Spark job using **dynamic resource allocation**
- Executors scale up during shuffle-heavy stages
- Executors are released when idle
- Job avoids starving other teams

### Expected Behavior

| Workload State | Executor Behavior |
|----------------|-------------------|
| High shuffle load | Executors scale up |
| Idle periods | Executors released |
| Shared cluster | Fair resource usage |

---

## 🧠 Notes

- Dynamic allocation is most useful in **shared clusters**
- Spark decides executor count **at runtime**
- Shuffle handling is critical for safe executor removal
- This is a **configuration-level optimization**





## 🧠 Solution Strategy (High-Level)

1. Enable Spark dynamic allocation
2. Configure min, initial, and max executors
3. Enable safe shuffle handling:
   - external shuffle service (YARN)
   - shuffle tracking (Kubernetes)
4. Tune idle and backlog timeouts
5. Run a shuffle-heavy aggregation on `big_events_50k.csv`

Spark automatically:
- adds executors when tasks back up
- removes executors after idle timeout
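
To build intuition for the scale-up behavior, here is a small pure-Python simulation. Spark's documentation describes executor requests happening in rounds, with the number added per round growing exponentially (1, 2, 4, 8, ...) while tasks stay backlogged. The sketch below is a simplification: it ignores the cap Spark applies based on how many tasks are actually pending, and the function name `ramp_up_schedule` is hypothetical.

```python
from typing import List


def ramp_up_schedule(current: int, max_executors: int, rounds: int) -> List[int]:
    """Simulate the total executor target after each backlog round."""
    totals = []
    to_add = 1  # executors requested in the first round
    for _ in range(rounds):
        current = min(current + to_add, max_executors)
        totals.append(current)
        if current == max_executors:
            break  # the upper bound is reached, nothing more to request
        to_add *= 2  # each round asks for twice as many as the previous one
    return totals


# Start at initialExecutors=4 with maxExecutors=20
print(ramp_up_schedule(4, 20, 10))  # [5, 7, 11, 19, 20]
```

Under these assumptions the job reaches its cap of 20 executors in five rounds, which is why a short backlog timeout lets a busy job scale quickly.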


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


## ⚙️ Spark Configuration (Conceptual – Shared Cluster)

⚠️ These settings apply to **YARN / Kubernetes clusters**.  
They are typically set via `spark-submit` or cluster config, because they must be in place before the application starts.



In [0]:
spark = (
    SparkSession.builder
        .appName("dynamic-allocation-big-events")

        # Enable dynamic allocation
        .config("spark.dynamicAllocation.enabled", "true")

        # Executor bounds
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.initialExecutors", "4")
        .config("spark.dynamicAllocation.maxExecutors", "20")

        # Backlog & idle handling
        .config("spark.dynamicAllocation.schedulerBacklogTimeout", "5s")
        .config("spark.dynamicAllocation.executorIdleTimeout", "60s")

        # Kubernetes: track shuffle files so executors that still hold them are kept
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")

        # YARN alternative: external shuffle service (use instead of shuffle tracking)
        # .config("spark.shuffle.service.enabled", "true")

        .getOrCreate()
)
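
Since the same settings are often passed through `spark-submit` rather than the builder, the sketch below shows one way to pick the shuffle option per cluster manager and render everything as `--conf key=value` flags. The helper names `dynamic_allocation_conf` and `to_spark_submit_args` are hypothetical; the configuration keys and values are the ones used in the cell above.

```python
from typing import Dict, List


def dynamic_allocation_conf(cluster_manager: str) -> Dict[str, str]:
    """Build the dynamic allocation settings for 'yarn' or 'kubernetes'."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.initialExecutors": "4",
        "spark.dynamicAllocation.maxExecutors": "20",
        "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
        "spark.dynamicAllocation.executorIdleTimeout": "60s",
    }
    if cluster_manager == "yarn":
        # YARN: shuffle files are served by the external shuffle service
        conf["spark.shuffle.service.enabled"] = "true"
    elif cluster_manager == "kubernetes":
        # Kubernetes: rely on shuffle tracking instead
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    else:
        raise ValueError(f"unsupported cluster manager: {cluster_manager}")
    return conf


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_spark_submit_args(dynamic_allocation_conf("yarn"))))
```

Keeping the settings in one dict makes it easy to apply the same values either through `spark-submit` or through repeated `.config(key, value)` calls on the builder.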


## 🛢️ Input Data


In [0]:
events_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
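         .csv("your_data")  # placeholder: replace with the path to big_events_50k.csv
)

# display() renders a table in Databricks notebooks;
# in other environments, events_df.show(5) prints the same rows
display(events_df.limit(5))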


## 🔄 Shuffle-Heavy Aggregation

This aggregation forces a **wide shuffle**, which allows us to
observe dynamic executor scaling.


In [0]:
agg_df = (
    events_df
        .groupBy("country", "device")
        .agg(
            F.count("*").alias("event_count"),
            F.sum("amount").alias("total_amount")
        )
)

# Action to trigger execution
agg_df.count()
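
To see why `groupBy` forces a shuffle, the following pure-Python sketch mimics hash partitioning: every row is routed to a partition chosen from its grouping key, so all rows for one `(country, device)` pair end up together. This is an illustration only. Spark uses its own hash function and partitioner; `zlib.crc32` is a stand-in, and the helper names and sample rows are made up.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def shuffle_partition(key: Tuple[str, str], num_partitions: int) -> int:
    """Pick a target partition from the grouping key (stand-in hash)."""
    encoded = "|".join(key).encode("utf-8")
    return zlib.crc32(encoded) % num_partitions


def simulate_shuffle(rows: List[dict], num_partitions: int) -> Dict[int, List[dict]]:
    """Route each row to the partition that owns its (country, device) key."""
    partitions = defaultdict(list)
    for row in rows:
        key = (row["country"], row["device"])
        partitions[shuffle_partition(key, num_partitions)].append(row)
    return dict(partitions)


sample_rows = [
    {"country": "IN", "device": "mobile", "amount": 10.0},
    {"country": "US", "device": "web", "amount": 5.0},
    {"country": "IN", "device": "mobile", "amount": 2.5},
]
for partition_id, rows in sorted(simulate_shuffle(sample_rows, 4).items()):
    print(partition_id, rows)
```

In a real cluster the source rows for one key are spread over many executors, so this routing step means moving data across the network. That movement is the work that keeps executors busy during the aggregation.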


## 🔍 What Happens at Runtime

- Spark starts with **initialExecutors**
- If shuffle tasks queue up:
  - Spark requests more executors (up to maxExecutors)
- After tasks complete:
  - Idle executors are released after executorIdleTimeout
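- Shuffle tracking (Kubernetes) or the external shuffle service (YARN) keeps shuffle data readable, so removing an idle executor does not lose shuffle files that later stages still need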
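
One way to watch this behavior is to poll Spark's monitoring REST API, which exposes an `/api/v1/applications/<app-id>/executors` endpoint on the driver UI. The sketch below assumes the UI is reachable from the notebook and that each executor entry carries `id` and `isActive` fields. The helper names are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen


def count_active_executors(executor_summaries: List[dict]) -> int:
    """Count active executors, excluding the driver entry."""
    return sum(
        1
        for entry in executor_summaries
        if entry.get("id") != "driver" and entry.get("isActive", False)
    )


def fetch_executor_summaries(ui_url: str, app_id: str) -> List[dict]:
    """Fetch executor summaries from the driver's REST API."""
    endpoint = f"{ui_url}/api/v1/applications/{app_id}/executors"
    with urlopen(endpoint) as response:
        return json.load(response)


# Offline example with a hand-written response
sample = [
    {"id": "driver", "isActive": True},
    {"id": "1", "isActive": True},
    {"id": "2", "isActive": True},
]
print(count_active_executors(sample))  # 2
```

In a live session, calling `fetch_executor_summaries(spark.sparkContext.uiWebUrl, spark.sparkContext.applicationId)` before, during, and after the aggregation should show the count rising and then falling back toward `minExecutors`.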


## 🧠 Shared Cluster Best Practices

- Set a **reasonable maxExecutors**
- Use **fair scheduler / queues**
- Avoid large static executor counts
- Let Spark adapt to workload size
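
As a rough way to pick a "reasonable" `maxExecutors`, you can derive it from the share of the cluster your team is entitled to. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the share sizes and executor sizes are hypothetical numbers, memory overhead is ignored, and the function name is made up for this example.

```python
def max_executors_for_share(share_cores: int,
                            share_memory_gb: float,
                            executor_cores: int,
                            executor_memory_gb: float) -> int:
    """Largest executor count that fits in the team's core and memory share."""
    by_cores = share_cores // executor_cores
    by_memory = int(share_memory_gb // executor_memory_gb)
    # The tighter of the two limits wins; never return less than one executor
    return max(1, min(by_cores, by_memory))


# Hypothetical share: 80 cores and 320 GB, with 4-core / 8 GB executors
print(max_executors_for_share(80, 320, 4, 8))  # 20
```

With these example numbers the core limit is the binding one, which gives the cap of 20 used in the configuration cell.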

Dynamic allocation benefits everyone in the cluster.


## ✅ Summary

- The `groupBy` aggregation on `big_events_50k.csv` forces a shuffle across executors
- Dynamic allocation enables elastic executor scaling
- Idle executors are released automatically
- Best suited for shared YARN / Kubernetes clusters

This is a **production-grade pattern** for multi-tenant Spark environments.
