<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 🐢 Optimizing a Slow Daily Spark Batch Job

This notebook demonstrates how to **diagnose and optimize a slow-running daily Spark batch job**
using a combination of **Spark UI analysis** and **code-level optimizations**.

The job aggregates **daily sales by region** and has gradually slowed down
as data volume increased.


## 📂 Dataset

**Dataset Name:** `sales_orders_large.csv`  
**Description:** 30 days of historical sales data

**Example Columns:**
- `order_id`
- `order_date`
- `region`
- `customer_id`
- `category`
- `quantity`
- `amount`

The dataset is assumed to be available in **your catalog / database storage**.

> ⚠️ In real production systems, this dataset typically grows every day,
> which can slow batch jobs down if they are not designed for that growth.


## 🗂️ Scenario

You own a **daily Spark batch job** that aggregates total sales by region.

Initially:
- Runtime ≈ **15 minutes**

Over time:
- Runtime slowly increased to **2 hours**

No single change caused the slowdown. Instead, it happened gradually as:
- Data volume increased
- More historical data accumulated
- Inefficient read patterns became more expensive
- Shuffle-heavy operations started processing much more data

This is a **very common real-world problem** in batch data pipelines.

Your goal is to:
- **Diagnose** the performance bottlenecks using Spark UI
- **Optimize** the job so it continues to scale as data grows

---

## 🎯 Task

Perform the following steps:

1. Use **Spark UI** to identify where the job is slow.
2. Determine whether the job is:
   - reading too much data
   - performing large or unnecessary shuffles
   - affected by data skew
3. Apply **code-level optimizations** to:
   - reduce the amount of data scanned
   - reduce shuffle cost during aggregation
   - use efficient storage formats
4. Validate that the optimized job processes **only the required day's data**.

---

## 🧩 Assumptions

- The job runs **once per day** for a single `process_date`.
- Historical sales data is stored in **cloud object storage**.
- Data volume **grows continuously** as new days are added.
- Spark Serverless or classic Spark clusters may be used.
- No changes were made to business logic; only data volume increased.

---

## 📦 Deliverables

- An **optimized daily aggregation job** that scales with growing data.
- Aggregated output containing **total sales by region for a single day**.
- A clear demonstration of **performance improvement** compared to the naive approach.

### **Expected Output Columns**

| order_date | region | total_sales_amount |
|-----------|--------|--------------------|

---

## 🧠 Notes

- Performance problems usually appear **gradually** as data grows, not suddenly.
- Always start optimization by **observing Spark UI**, not changing code blindly.
- The biggest performance wins usually come from **reading less data**, not adding more compute.
- Using **Parquet or Delta** enables:
  - column pruning
  - predicate pushdown
- Partitioning data by frequently filtered columns (such as `order_date`) enables **partition pruning**.
- Code that works well for small datasets may **fail to scale** without these optimizations.

> 💡 Tip:  
> If a batch job becomes slower over time, assume a **data growth and data layout problem first**, not a Spark bug.
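
To make partition pruning concrete: `partitionBy("order_date")` writes one subdirectory per date using the `column=value` naming convention, and a filter on that column lets Spark list only the matching directories. The sketch below imitates that selection in plain Python. The base path `/data/sales` and both helper names are hypothetical, and Spark's real pruning happens inside its file index rather than through string matching.

```python
def partition_dir(base_path: str, column: str, value: str) -> str:
    """Return the Hive-style directory that holds one partition value."""
    return f"{base_path.rstrip('/')}/{column}={value}"


def prune_partitions(dirs: list[str], column: str, value: str) -> list[str]:
    """Keep only directories whose last path segment is `column=value`."""
    wanted = f"{column}={value}"
    return [d for d in dirs if d.rstrip("/").split("/")[-1] == wanted]


# Hypothetical layout of a 30-day dataset partitioned by order_date.
all_dirs = [
    partition_dir("/data/sales", "order_date", f"2025-01-{day:02d}")
    for day in range(1, 31)
]

# A daily job for 2025-01-20 needs only one of the 30 directories.
selected = prune_partitions(all_dirs, "order_date", "2025-01-20")
print(selected)
```

The other 29 directories are never opened, which is why the cost of the daily read stops depending on how much history has accumulated.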



## 🧠 Solution Strategy (High-Level)

1. Use **Spark UI** to identify slow stages and expensive operations.
2. Confirm whether the job is scanning **more data than necessary**.
3. Convert raw CSV data into an **efficient columnar format** (Parquet or Delta).
4. Ensure the dataset is **partitioned by order_date**.
5. Modify the job to:
   - read only the required date partition
   - select only necessary columns
6. Reduce shuffle overhead during aggregation.
7. Write optimized output for downstream consumption.

Spark handles:
- Distributed execution
- Partition pruning
- Column pruning
- Shuffle-based aggregations


In [0]:
from pyspark.sql import functions as F


## 🧱 One-Time Setup: Raw → Parquet (Recommended)

This conversion is done **once**, not daily.


In [0]:
raw_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("your_data")
)

(
    raw_df.write
          .mode("overwrite")
          .partitionBy("order_date")
          .parquet("your_directory")
)
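
One possible refinement of the setup step: `inferSchema` makes Spark read the CSV an extra time to guess column types, and the guessed type of `order_date` determines how the partition directories are named. Supplying the schema up front avoids both issues. The sketch below passes a DDL string to `DataFrameReader.schema`. The column types are assumptions derived from the column names only, and `read_raw_sales` is a hypothetical helper name.

```python
# Column types are assumptions based on the column names; adjust to the real data.
SALES_SCHEMA_DDL = (
    "order_id STRING, "
    "order_date DATE, "
    "region STRING, "
    "customer_id STRING, "
    "category STRING, "
    "quantity INT, "
    "amount DOUBLE"
)


def read_raw_sales(spark, path):
    """Read the raw CSV with a fixed schema instead of inferSchema."""
    return (
        spark.read
             .option("header", "true")
             .schema(SALES_SCHEMA_DDL)  # schema() accepts a DDL-formatted string
             .csv(path)
    )
```

With this helper, the first statement of the setup cell would become `raw_df = read_raw_sales(spark, "your_data")`, and the Parquet write stays unchanged.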


## 🛢️ Input Data


In [0]:
display(raw_df.limit(5))


## ❌ Naive (Slow) Daily Job

This version represents how many batch jobs are initially written.


In [0]:
process_date = "2025-01-20"  # defined, but never applied as a filter below

# ❌ BAD: reads all data, all columns, CSV format
sales_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("your_data")
)

agg_df = (
    sales_df
        .groupBy("region")
        .agg(F.sum("amount").alias("total_sales_amount"))
)

agg_df.show()


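### ❌ Problems with the Naive Version

- Reads **all historical data**, not just one day: `process_date` is defined but never used as a filter
- Uses **CSV**, a row-based text format, so every file is read and parsed in full (no column pruning, no predicate pushdown to skip data)
- `inferSchema` adds an **extra pass** over the data just to determine column types
- Aggregates the **full history**, so the shuffle stage processes far more rows than one day requires
- Job runtime increases as data grows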
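
The growth pattern can be illustrated with a deliberately simple model: the naive job scans the whole history on every run, while a partition-pruned job scans one day. The figures below (5 GB of new data per day, 30 versus 240 days of history) are hypothetical and only show the shape of the curve. Real runtime also depends on cluster size, file layout, and shuffle cost.

```python
def scanned_gb(days_of_history: int, gb_per_day: float, pruned: bool) -> float:
    """Data scanned by one daily run under a simple linear model.

    pruned=False: the naive job reads the whole history.
    pruned=True:  the partition-pruned job reads only the processed day.
    """
    return gb_per_day if pruned else days_of_history * gb_per_day


# Hypothetical numbers: 5 GB of new data per day.
for days in (30, 240):
    naive = scanned_gb(days, 5.0, pruned=False)
    optimized = scanned_gb(days, 5.0, pruned=True)
    print(f"{days:>3} days of history: naive={naive:.0f} GB, optimized={optimized:.0f} GB")
```

Under this model the naive scan grows linearly with the number of stored days, while the pruned scan stays constant.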


## 🔍 Diagnosing the Problem Using Spark UI

When analyzing this job in Spark UI, look for:

### Stages Tab
- One or more stages taking significantly longer
- Aggregation (`groupBy`) stages causing wide shuffles

### Tasks (Stage Detail Page)
- Tasks with very large input sizes
- Long-running tasks compared to others (possible skew)
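
A quick way to put a number on "long-running compared to others" is to compare the slowest task with the median task of the same stage, using the durations shown in the stage's task list. The helper below is a minimal sketch. The name `skew_ratio` and the sample durations are hypothetical, and what counts as a worrying ratio is a judgment call for each workload.

```python
import statistics


def skew_ratio(task_durations_s: list[float]) -> float:
    """Ratio of the slowest task to the median task duration in one stage."""
    median = statistics.median(task_durations_s)
    if median == 0:
        return float("inf")
    return max(task_durations_s) / median


# Hypothetical durations (seconds) copied from a stage's task list.
durations = [12, 11, 13, 12, 95]
print(f"slowest task is {skew_ratio(durations):.1f}x the median")
```

A ratio close to 1 suggests evenly sized tasks. A ratio of several times the median points at one partition (for example, one dominant `region`) holding much more data than the rest.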

### SQL / DataFrame Tab
- Full table scan (no partition filters on the scan node) instead of partition pruning
- Large shuffle read/write sizes

This confirms the job is **doing more work than necessary**.
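
The same check can be made from code by looking at the physical plan: a file scan that benefits from partition pruning lists the partition column under `PartitionFilters`. The sketch below captures the text printed by `DataFrame.explain()` and searches it. Both helper names are hypothetical, and the exact plan text differs between Spark versions, so treat the string match as a convenience rather than a guarantee.

```python
import contextlib
import io
import re


def plan_text(df) -> str:
    """Capture the physical plan that DataFrame.explain() prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def has_partition_filter(plan: str, column: str) -> bool:
    """True if any scan in the plan lists `column` under PartitionFilters."""
    filters = re.findall(r"PartitionFilters: \[(.*?)\]", plan)
    return any(column in f for f in filters)
```

For example, `has_partition_filter(plan_text(sales_df), "order_date")` is expected to be `False` for a DataFrame read from the unpartitioned CSV and `True` for one that filters a dataset partitioned by `order_date`.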


## ✅ Optimized Daily Job


In [0]:
process_date = "2025-01-20"

# ✅ Read only required partition and columns from Parquet
sales_df = (
    spark.read
         .parquet("your_directory")
         .where(F.col("order_date") == process_date)
         .select("order_date", "region", "amount")
)

agg_df = (
    sales_df
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("total_sales_amount"))
)

display(agg_df)


## 💾 Writing Optimized Output


In [0]:
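# Write to a location separate from the Parquet input ("your_directory");
# "your_output_directory" is a placeholder.
# replaceWhere limits the overwrite to the processed day, so earlier days
# are kept and reruns of the same day stay idempotent.
(
    agg_df.write
          .format("delta")
          .mode("overwrite")
          .option("replaceWhere", f"order_date = '{process_date}'")
          .partitionBy("order_date")
          .save("your_output_directory")
)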


## 🧠 Why This Is Faster

- **Partition pruning**: only one day's data is read
- **Column pruning**: only required columns are scanned
- **Parquet/Delta**: efficient columnar reads
- **Reduced shuffle size**: less data moved across the network
- Job runtime tracks the size of **one day's data**, not the total history, so it stays stable as data grows
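
The claim that only one day is processed can be checked directly on the result before it is written. The sketch below collects the distinct `order_date` values and fails loudly if anything other than the processed date shows up. Both function names are hypothetical. An empty result also raises, on the assumption that a day without any sales should be investigated rather than written silently.

```python
def distinct_order_dates(df) -> set:
    """Collect the distinct order_date values of a DataFrame as strings."""
    rows = df.select("order_date").distinct().collect()
    return {str(row["order_date"]) for row in rows}


def check_single_day(df, process_date: str) -> None:
    """Raise if the DataFrame holds any date other than process_date."""
    found = distinct_order_dates(df)
    if found != {process_date}:
        raise ValueError(f"expected only {process_date}, found {sorted(found)}")
```

In the notebook this would be called as `check_single_day(agg_df, process_date)` right after the aggregation. The aggregated result has one row per region, so collecting it to the driver is cheap.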


## ⚙️ Additional Optimizations (Conceptual)

Depending on the workload, you may also:
- Tune shuffle partitions (on classic clusters)
- Broadcast small dimension tables (if joins exist)
- Handle skewed keys if one region dominates
- Remove repeated expensive transformations

⚠️ On **Databricks Serverless**, execution configs are platform-managed,
so focus on **data layout and query design**.
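
For classic clusters, tuning shuffle partitions usually means replacing the default of 200 (`spark.sql.shuffle.partitions`) with a count derived from the shuffle size seen in Spark UI. The sketch below uses a target of roughly 128 MiB per partition. That target is a common rule of thumb and an assumption here, not a value from this notebook, and `suggest_shuffle_partitions` is a hypothetical helper.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * 1024 * 1024,  # assumed rule-of-thumb target
    minimum: int = 1,
) -> int:
    """Partition count that keeps each shuffle partition near the target size."""
    return max(minimum, math.ceil(shuffle_bytes / target_partition_bytes))


# Example: a stage that shuffles about 10 GiB according to Spark UI.
n = suggest_shuffle_partitions(10 * 1024**3)
print(n)

# On a classic cluster the value could then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

For the optimized job in this notebook, the shuffle after per-region aggregation is small, so this kind of tuning matters more for joins and high-cardinality group-bys. Where adaptive query execution is enabled, Spark may also coalesce small shuffle partitions on its own.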


## ✅ Summary

- Slow batch jobs usually degrade due to **data growth**, not bugs.
- Spark UI is the first tool to identify performance bottlenecks.
- Reading less data is the **biggest optimization**.
- Partitioning and columnar formats are critical for scalable batch jobs.
- This pattern is widely used in **production data platforms**.

This notebook demonstrates a **real-world, industry-standard approach**
to diagnosing and optimizing slow Spark batch workloads.
