<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### ⚖️ Handling Skewed Keys in Spark Joins

This notebook demonstrates how **data skew** can severely impact Spark join performance
and how to **detect and handle skewed keys** using key salting and Adaptive Query Execution (AQE).

We focus on a common real-world issue where **a few keys dominate the data distribution**
and cause one or more tasks to run significantly slower than others.


## 📂 Dataset

### Dataset A (Large & Skewed)
**Dataset Name:** `transactions_a_skewed_large.csv`

### Dataset B (Large & Skewed)
**Dataset Name:** `transactions_b_skewed_large.csv`

> ⚠️ These datasets simulate a real-world scenario where  
a small number of `customer_id`s have **millions of records**,  
while most customers have only a few.

Both datasets are assumed to be available in **your catalog / database storage**.

### Example Columns:
- `transaction_id`
- `customer_id`
- `transaction_date`
- `amount`


## 🗂️ Scenario

You are joining **two large transactional DataFrames** on `customer_id`.

During execution, you notice that:
- One or two tasks take **much longer** than others
- Most tasks finish quickly, but a few keep running
- Overall job time is dominated by a **single slow task**

This usually indicates **data skew**, where a small number of keys
(e.g., certain customers) have a **disproportionately large number of records**.

Your goal is to:
- **Detect** the skew
- **Understand** why it happens
- **Fix** the skew so the join runs efficiently

---

## 🎯 Task

Perform the following steps using Spark:

1. **Read** both skewed transaction datasets.
2. **Detect skewed keys** by analyzing record counts per `customer_id`.
3. Confirm skew symptoms using **Spark UI**.
4. **Handle skew** using key salting.
5. (Optional) Enable **Adaptive Query Execution (AQE)** for automatic skew handling.
6. Perform the join efficiently.

---

## 🧩 Assumptions

- Both datasets are large and distributed across the cluster.
- A small number of `customer_id`s are extremely frequent (hot keys).
- Spark join performance is impacted by uneven partition sizes.
- Spark Serverless or classic clusters may be used.

---

## 📦 Deliverables

- **Joined DataFrame** with balanced execution
- Reduced task skew and improved join performance

### **Join Key**
- `customer_id`

---

## 🧠 Notes

- In a shuffle join, Spark distributes rows **by join key**: all rows with the same key are sent to the same partition.
- If one key has far more records, the **task processing that partition gets overloaded**.
- Skew causes:
  - slow tasks
  - poor CPU utilization
  - long job runtimes
- Detecting skew early is critical for scalable pipelines.
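
The effect described in these notes can be reproduced without a cluster. The following is a minimal plain-Python model, not Spark code: it assumes rows are routed with `hash(key) % num_partitions` (Spark uses its own hash function, so real partition numbers differ), and the key distribution is hypothetical.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition when each row goes to hash(key) % num_partitions."""
    sizes = Counter()
    for key in keys:
        sizes[hash(key) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


# Hypothetical data: customer 1 is a hot key with 9,000 rows,
# customers 2..1001 have one row each.
keys = [1] * 9_000 + list(range(2, 1_002))

sizes = partition_sizes(keys, num_partitions=8)
print("rows per partition:", sizes)
print("largest / average:", max(sizes) / (sum(sizes) / len(sizes)))
```

Because every row of the hot key hashes to the same partition, one partition ends up far larger than the rest, and the task that processes it determines the stage's runtime.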


## 🧠 Solution Strategy (High-Level)

1. Read both large transaction datasets into Spark DataFrames.
2. Detect skew by grouping on `customer_id` and identifying unusually high counts.
3. Validate skew by observing slow or heavy tasks in the Spark UI.
4. Identify **hot keys** (customers with extremely high record counts).
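5. Apply **salting** to the hot keys: add a random `salt` to hot-key rows on one side and replicate the matching rows on the other side once per salt value, so each hot key is spread across multiple partitions.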
6. Join the salted DataFrames on both `customer_id` and `salt`.
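7. Optionally enable **Adaptive Query Execution (AQE)** (`spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`, both on by default since Spark 3.2) to let Spark split skewed partitions automatically.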
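
The detection logic in steps 2 and 4 can be sketched in plain Python before running it at scale. This is an illustrative model only: `find_hot_keys`, the `factor` threshold, and the sample IDs are hypothetical. In PySpark, the per-key counts would typically come from `df.groupBy("customer_id").count()` sorted by `count` in descending order.

```python
from collections import Counter
from statistics import median


def find_hot_keys(customer_ids, factor=100):
    """Return {customer_id: row_count} for keys far above the median row count."""
    counts = Counter(customer_ids)
    typical = median(counts.values())
    # A key is "hot" here if it has more than `factor` times the typical count.
    return {cid: n for cid, n in counts.most_common() if n > factor * typical}


# Hypothetical sample: two hot customers, 500 ordinary ones with 2 rows each.
customer_ids = ["C001"] * 5_000 + ["C002"] * 3_000
customer_ids += [f"C{i:03d}" for i in range(100, 600) for _ in range(2)]

hot_keys = find_hot_keys(customer_ids)
print(hot_keys)
```

Comparing against the median rather than the mean keeps the threshold from being pulled upward by the hot keys themselves. The right `factor` depends on the data and is a tuning choice.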

Spark handles:
- Distributed join execution
- Task scheduling and partitioning
- Optimizations via AQE when enabled
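
Steps 5 and 6 of the strategy rely on salting not changing the join result. The sketch below checks that on a tiny plain-Python model, under stated assumptions: rows are `(customer_id, payload)` tuples, `"hot"` is the only hot key, and the helper names and `num_salts = 4` are hypothetical. One side gets a random salt on hot-key rows; the other side is replicated once per salt value so every salted row still finds its matches.

```python
import random
from collections import defaultdict


def salt_rows(rows, hot_keys, num_salts, rng):
    """Attach a random salt to hot-key rows; all other rows get salt 0."""
    return [
        (cid, rng.randrange(num_salts) if cid in hot_keys else 0, payload)
        for cid, payload in rows
    ]


def replicate_rows(rows, hot_keys, num_salts):
    """Copy each hot-key row once per salt value; other rows keep salt 0."""
    out = []
    for cid, payload in rows:
        salts = range(num_salts) if cid in hot_keys else (0,)
        out.extend((cid, s, payload) for s in salts)
    return out


def inner_join(left, right):
    """Inner-join rows shaped (*key, payload) on their key fields."""
    index = defaultdict(list)
    for *key, payload in right:
        index[tuple(key)].append(payload)
    return [
        (key[0], left_payload, right_payload)
        for *key, left_payload in left
        for right_payload in index.get(tuple(key), ())
    ]


rng = random.Random(0)
hot_keys = {"hot"}
num_salts = 4  # hypothetical; tune to how dominant the hot keys are

side_a = [("hot", f"a{i}") for i in range(6)] + [("c1", "a6"), ("c2", "a7")]
side_b = [("hot", "b0"), ("hot", "b1"), ("c1", "b2"), ("c3", "b3")]

plain = inner_join(side_a, side_b)
salted = inner_join(
    salt_rows(side_a, hot_keys, num_salts, rng),
    replicate_rows(side_b, hot_keys, num_salts),
)

# Same rows either way; only the join key they were matched on differs.
assert sorted(plain) == sorted(salted)
print(len(plain), "rows either way")
```

The trade-off is that the replicated side grows by a factor of `num_salts` for its hot-key rows, so salting only the detected hot keys keeps that overhead small. In PySpark, the same idea is usually expressed with `rand` for the salt column and `explode` over an array of salt values for the replicated side.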
