<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 🔁 Reusing Computation Efficiently in Spark

This notebook demonstrates how Apache Spark can **avoid recomputing the same expensive transformations**
when the **same cleaned DataFrame is used multiple times** within a single job.

We focus on **performance optimization** using Spark's **cache / persist** mechanism,
which is critical when working with **large datasets**.


## 📂 Dataset

**Primary Dataset:** `sales_orders_raw_with_issues.csv`  
**Optional Large Dataset:** `sales_orders_large.csv`

> ⚠️ In real-world scenarios, sales datasets can be **very large (GBs or more)**.  
To keep this exercise practical, we assume the dataset already exists in  
**your catalog / database storage**.

### Example Columns:
- `order_id`
- `order_date`
- `customer_id`
- `region`
- `category`
- `amount`


## 🗂️ Scenario

You are working with a **sales orders dataset** that requires multiple cleaning steps
before it can be used for reporting.

After cleaning, the same cleaned DataFrame is used to generate **5 different reports**
inside the **same Spark job**.

Without optimization, Spark will **re-run the same cleaning transformations**
every time a report is computed, leading to:
- unnecessary recomputation
- increased job runtime
- wasted cluster resources

Your goal is to **optimize the job** so that the cleaning logic runs **only once**.

The input data already exists in **your catalog / database storage** and needs to be
cleaned, reused, and analyzed efficiently.

---

## 🎯 Task

Perform the following steps using Spark:

1. **Read** the raw sales orders dataset.
2. Apply all **cleaning and standard transformations** to create a cleaned DataFrame.
3. **Cache or persist** the cleaned DataFrame, then trigger an action so Spark materializes it once.
4. Reuse the cleaned DataFrame to generate **multiple reports**.
5. **Unpersist** the DataFrame after all reports are completed to free resources.

---

## 🧩 Assumptions

- The raw dataset contains data quality issues (nulls, duplicates, invalid values).
- The cleaned DataFrame is reused multiple times within the same Spark job.
- Spark uses **lazy evaluation**, so transformations are not executed until an action occurs.
- The dataset is large enough that recomputation would be expensive.

---

## 📦 Deliverables

- **Cleaned DataFrame:** reused across multiple reports  
- **Reports:** Aggregations derived from the same cached DataFrame  

### **Example Reports**
- Total sales by region
- Total sales by category
- Daily sales trends
- (Additional reports can be added without re-running cleaning logic)

---

## 🧠 Notes

- Spark does **not** automatically keep intermediate DataFrame results between actions.
- Without caching, every action **re-runs the full chain of transformations**, starting from the source read.
- Caching is useful when:
  - a DataFrame is expensive to compute
  - the same DataFrame is reused multiple times
- Always unpersist cached data when it is no longer needed.


## 🧠 Solution Strategy (High-Level)

1. **Read the raw sales orders dataset** from your catalog / database storage using Spark.
2. Apply all **data cleaning and standard transformations** once to create a cleaned DataFrame (`clean_df`).
3. **Cache or persist** the cleaned DataFrame so Spark stores the computed result after the first action.
4. Trigger an **action** (such as `count()` or the first report) to materialize the cached DataFrame.
5. Reuse the cached `clean_df` across **multiple reports** (aggregations, groupings, joins).
6. After all reports are generated, **unpersist** the DataFrame to free up cluster memory.

Once `clean_df` is cached and materialized, Spark handles:
- Avoiding repeated recomputation of expensive transformations
- Efficient reuse of intermediate results
- Memory management for cached data
- Parallel execution across executors
