<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 📅 Daily Sales Aggregation (Batch)

This notebook shows how to design a **daily Spark batch job**
that processes **one day’s sales CSV file**, aggregates total sales
by region, and writes the results to a curated table used by BI dashboards.


## 📂 Dataset

Daily sales files arrive as **individual CSVs**:

- `sales_orders_2025-01-01.csv`
- `sales_orders_2025-01-02.csv`
- `sales_orders_2025-01-03.csv`
- `sales_orders_2025-01-04.csv`
- `sales_orders_2025-01-05.csv`

**Location (example – Databricks Volume)**


### Schema
- `order_id`
- `order_date`
- `region`
- `customer_id`
- `amount`



## 🗂️ Scenario

Your company receives **one sales CSV per day** in cloud storage
(ADLS / Blob Storage or S3).

Business requirements:
- Process **only the new day’s file**
- Compute **total sales amount per region**
- Store results in a **curated table** for BI dashboards
- Ensure the job is **safe to re-run** for the same day (idempotent)

---


## 🎯 Task

Build a Spark batch job that:

1. Accepts a `process_date` parameter
2. Reads only the corresponding daily CSV file
3. Aggregates total sales per region
4. Writes results to a curated table
5. Overwrites that day’s data if the job is re-run

---


## 🧩 Assumptions

- Storage (ADLS / S3 / Volume) is already accessible in Spark
- One CSV file exists per day
- Job is scheduled daily via Airflow / ADF / cron
- BI dashboards read from the curated table

---

## 📦 Deliverables

- Daily sales totals per region
- Output stored in Delta / Parquet
- Data partitioned by `order_date`

### Expected Output Schema

| order_date | region | total_sales_amount |
|------------|--------|--------------------|

---

## 🧠 Notes

- Always **parameterize dates** in batch jobs
- Never scan all files when only one day is required
- A consistent date suffix in file names (`YYYY-MM-DD`) lets a daily job locate its input without listing the directory
- Design jobs to be **idempotent** so re-runs don’t duplicate data




## 🧠 Solution Strategy (High-Level)

1. Receive `process_date` from scheduler
2. Build input file path using the date
3. Read only that file
4. Aggregate sales by region
5. Write output partitioned by `order_date`
6. Overwrite the day’s partition for idempotency


In [0]:
from pyspark.sql import functions as F


## ⚙️ Job Parameters


In [0]:
# In production, this comes from Airflow / ADF / scheduler
process_date = "2025-01-02"

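# Build input path from the date
# ("your_data" is a placeholder for the folder that holds the daily CSVs)
input_dir = "your_data"
input_path = f"{input_dir}/sales_orders_{process_date}.csv"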

# Curated output location
output_path = "your_directory"
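Because the whole job hinges on `process_date`, it can help to validate the value before building a path from it. The following is a minimal sketch in plain Python. The helper name `build_input_path` is hypothetical, and `"your_data"` is the same placeholder folder used above. The file name pattern comes from the dataset listing.

```python
from datetime import date


def build_input_path(base_dir: str, process_date: str) -> str:
    """Return the path of the daily sales file for process_date (YYYY-MM-DD)."""
    # date.fromisoformat raises ValueError if the string is not a valid ISO date
    parsed = date.fromisoformat(process_date)
    # isoformat() always renders the date as YYYY-MM-DD
    return f"{base_dir.rstrip('/')}/sales_orders_{parsed.isoformat()}.csv"


print(build_input_path("your_data", "2025-01-02"))
# your_data/sales_orders_2025-01-02.csv
```

Failing fast on a malformed date keeps a scheduler typo from turning into a confusing "file not found" error further down the job.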


## 🛢️ Read Only That Day’s File


In [0]:
sales_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(input_path)
)

display(sales_df.limit(5))
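With `inferSchema`, Spark makes an extra pass over the file to guess column types, and the guessed type of `order_date` can vary between Spark versions. One alternative is to declare the schema up front. The sketch below builds a DDL-style schema string in plain Python. The column names come from the Schema section, but the types are assumptions that should be checked against the real files.

```python
# Column types are assumptions; adjust them to match the real files.
SALES_COLUMNS = {
    "order_id": "STRING",
    "order_date": "DATE",
    "region": "STRING",
    "customer_id": "STRING",
    "amount": "DOUBLE",
}

# DDL-formatted schema string: "order_id STRING, order_date DATE, ..."
sales_schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in SALES_COLUMNS.items())
print(sales_schema_ddl)
```

In the notebook, this string could be passed as `spark.read.schema(sales_schema_ddl)` in place of the `inferSchema` option.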


## 🔄 Aggregation Logic

Business Question:  
**What is the total sales amount per region for the given day?**


In [0]:
agg_df = (
    sales_df
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("total_sales_amount"))
)

display(agg_df)
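To make the shape of the result concrete, here is the same group-and-sum logic in plain Python on a few made-up rows. The regions and amounts are invented for illustration only and are not taken from the real files.

```python
from collections import defaultdict

# Made-up sample rows (order_date, region, amount), for illustration only
rows = [
    ("2025-01-02", "North", 100.0),
    ("2025-01-02", "North", 50.0),
    ("2025-01-02", "South", 75.0),
]

# Sum amount per (order_date, region), mirroring groupBy(...).agg(F.sum("amount"))
totals = defaultdict(float)
for order_date, region, amount in rows:
    totals[(order_date, region)] += amount

for (order_date, region), total_sales_amount in sorted(totals.items()):
    print(order_date, region, total_sales_amount)
# 2025-01-02 North 150.0
# 2025-01-02 South 75.0
```

Each output row corresponds to one row of the expected schema: `order_date`, `region`, `total_sales_amount`.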


## 💾 Write to Curated Layer (Idempotent)

We overwrite **only the current day’s data**
so the job can be safely re-run.


In [0]:
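(
    agg_df
        .write
        .mode("overwrite")
        # Delta: replace only the rows for this date.
        # Without replaceWhere, "overwrite" replaces the whole table.
        .option("replaceWhere", f"order_date = '{process_date}'")
        .partitionBy("order_date")
        # For Parquet, drop replaceWhere and set
        # spark.sql.sources.partitionOverwriteMode=dynamic instead
        .format("delta")
        .save(output_path)
)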


## 🔁 What Happens on Re-run?

- Same `process_date`
- Same input file
- Same output partition
- Old data is replaced

✅ No duplicates  
✅ Safe reprocessing
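The re-run behaviour can be illustrated with a toy model in plain Python (no Spark), where a dict keyed by `order_date` stands in for the partitioned table. The function names and sample values are hypothetical. The point is only to contrast appending with replacing a single partition.

```python
# Toy stand-in for a table partitioned by order_date: {order_date: [rows]}
def write_append(table, order_date, rows):
    """Append rows to the partition (not idempotent)."""
    table.setdefault(order_date, []).extend(rows)


def write_replace_partition(table, order_date, rows):
    """Replace the partition for order_date, leaving other dates untouched."""
    table[order_date] = list(rows)


day_rows = [("North", 150.0), ("South", 75.0)]  # made-up aggregates

appended = {}
replaced = {"2025-01-01": [("North", 90.0)]}  # an earlier day already in the table

for _ in range(2):  # simulate running the job twice for the same date
    write_append(appended, "2025-01-02", day_rows)
    write_replace_partition(replaced, "2025-01-02", day_rows)

print(len(appended["2025-01-02"]))  # 4 rows: the re-run duplicated the data
print(len(replaced["2025-01-02"]))  # 2 rows: the re-run left the result unchanged
print(replaced["2025-01-01"])       # the earlier day is still there
```

Replacing exactly one partition is what makes the job idempotent: running it any number of times for the same date yields the same table.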


## ✅ Summary

- Daily batch jobs should be **date-driven**
- Read less data, not more
- Partition outputs for BI performance
- Always design for safe re-runs

This is a **production-grade Spark batch pattern**
used across real data platforms.
