<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### ⚖️ Handling Skewed Keys in Spark Joins

This notebook demonstrates how **data skew** can severely impact Spark join performance
and how to **detect and handle skewed keys** using proven Spark techniques.

We focus on a common real-world issue where **a few keys dominate the data distribution**
and cause one or more tasks to run significantly slower than others.


## 📂 Dataset

### Dataset A (Large & Skewed)
**Dataset Name:** `transactions_a_skewed_large.csv`

### Dataset B (Large & Skewed)
**Dataset Name:** `transactions_b_skewed_large.csv`

> ⚠️ These datasets simulate a real-world scenario where  
a small number of `customer_id`s have **millions of records**,  
while most customers have only a few.

Both datasets are assumed to be available in **your catalog / database storage**.

### Example Columns:
- `transaction_id`
- `customer_id`
- `transaction_date`
- `amount`


## 🗂️ Scenario

You are joining **two large transactional DataFrames** on `customer_id`.

During execution, you notice that:
- One or two tasks take **much longer** than others
- Most tasks finish quickly, but a few keep running
- Overall job time is dominated by a **single slow task**

This usually indicates **data skew**, where a small number of keys
(e.g., certain customers) have a **disproportionately large number of records**.

Your goal is to:
- **Detect** the skew
- **Understand** why it happens
- **Fix** the skew so the join runs efficiently

---

## 🎯 Task

Perform the following steps using Spark:

1. **Read** both skewed transaction datasets.
2. **Detect skewed keys** by analyzing record counts per `customer_id`.
3. Confirm skew symptoms using **Spark UI**.
4. **Handle skew** using key salting.
5. (Optional) Enable **Adaptive Query Execution (AQE)** for automatic skew handling.
6. Perform the join efficiently.

---

## 🧩 Assumptions

- Both datasets are large and distributed across the cluster.
- A small number of `customer_id`s are extremely frequent (hot keys).
- Spark join performance is impacted by uneven partition sizes.
- Databricks Serverless compute or classic clusters may be used.

---

## 📦 Deliverables

- **Joined DataFrame** with balanced execution
- Reduced task skew and improved join performance

### **Join Key**
- `customer_id`

---

## 🧠 Notes

- In a shuffle join, Spark hash-partitions rows **by join key**, so all rows with the same key go to the same task.
- If one key has far more records than the rest, **the task that receives it is overloaded**.
- Skew causes:
  - slow tasks
  - poor CPU utilization
  - long job runtimes
- Detecting skew early is critical for scalable pipelines.
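
The effect described in these notes can be sketched without a cluster. The snippet below is a minimal simulation, not Spark's actual partitioner: it uses CRC32 as a stand-in hash (Spark uses its own hash function), and the key names, row counts, and partition count are made-up values for illustration.

```python
import zlib

def partition_of(key, num_partitions):
    """Map a key to a partition with a deterministic stand-in hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_loads(keys, num_partitions):
    """Count how many rows each partition receives when rows are routed by key."""
    loads = [0] * num_partitions
    for key in keys:
        loads[partition_of(key, num_partitions)] += 1
    return loads

# Hypothetical data: one hot customer with 10,000 rows,
# plus 1,000 ordinary customers with 10 rows each.
keys = ["cust_hot_1"] * 10_000 + [
    f"cust_{i}" for i in range(1_000) for _ in range(10)
]

loads = partition_loads(keys, num_partitions=8)
average = sum(loads) / len(loads)

print("rows per partition:", loads)
print("largest partition vs. average:", round(max(loads) / average, 2))
```

Because every row of the hot key hashes to the same partition, that partition holds at least 10,000 of the 20,000 rows regardless of how the remaining keys are spread.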


## 🧠 Solution Strategy (High-Level)

1. Read both large transaction datasets into Spark DataFrames.
2. Detect skew by grouping on `customer_id` and identifying unusually high counts.
3. Validate skew by observing slow or heavy tasks in the Spark UI.
4. Identify **hot keys** (customers with extremely high record counts).
5. Apply **salting** to the hot keys to spread their records across multiple partitions.
6. Join the salted DataFrames on both `customer_id` and `salt`.
7. Optionally enable **Adaptive Query Execution (AQE)** to let Spark handle skew automatically.

Spark handles:
- Distributed join execution
- Task scheduling and partitioning
- Optimizations via AQE when enabled


In [0]:
from pyspark.sql import functions as F


In [0]:
# Read skewed datasets ("your_data" is a placeholder; point each read at its own dataset path)
df_a = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("your_data")
)

df_b = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("your_data")
)


In [0]:
df_a.printSchema()
df_b.printSchema()


## 🛢️ Input Data


In [0]:
display(df_a.limit(5))
display(df_b.limit(5))


## 🔍 Detecting Skewed Keys


In [0]:
skew_stats = (
    df_a
        .groupBy("customer_id")
        .count()
        .orderBy(F.desc("count"))
)

skew_stats.show(10)


### 🔎 What You’ll Observe

- A few `customer_id`s appear at the top with **very high counts**
- In the Spark UI:
  - One or two join tasks take much longer
  - These tasks process far more data than others

This confirms **data skew**.
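
Once the counts are known, the hot keys still have to be chosen. The sketch below shows one possible heuristic in plain Python: flag any key whose count exceeds a multiple of the median count. The function name `find_hot_keys`, the `factor` of 10, and the sample counts are all assumptions for illustration, not values taken from the datasets.

```python
from statistics import median

def find_hot_keys(key_counts, factor=10.0):
    """Return keys whose row count exceeds `factor` times the median count."""
    typical = median(key_counts.values())
    return sorted(
        key for key, count in key_counts.items() if count > factor * typical
    )

# Hypothetical per-customer counts, shaped like the output of a groupBy/count.
key_counts = {
    "cust_hot_1": 2_000_000,
    "cust_hot_2": 1_500_000,
    "cust_3": 12,
    "cust_4": 9,
    "cust_5": 15,
    "cust_6": 7,
}

hot_keys = find_hot_keys(key_counts)
print(hot_keys)
```

In the notebook, a dictionary like `key_counts` could be built from `skew_stats.collect()`, provided the number of distinct customers is small enough to fit on the driver.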


## ⚖️ Handling Skew with Salting

To distribute the load of hot keys across multiple tasks,
we use **key salting**.


In [0]:
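# Placeholder IDs; replace with the top customer_id values reported by skew_stats
HOT_CUSTOMERS = ["cust_hot_1", "cust_hot_2"]
# Number of salt buckets each hot key is spread across
N_SALTS = 8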


In [0]:
df_a_salted = df_a.withColumn(
    "salt",
    F.when(
        F.col("customer_id").isin(HOT_CUSTOMERS),
        (F.rand() * N_SALTS).cast("int")
    ).otherwise(F.lit(0))
)


In [0]:
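# Replicate hot-key rows in df_b once per salt value (0..N_SALTS-1) so that
# every randomly salted df_a row still finds all of its matches.
# Salting both sides randomly would silently drop most hot-key matches.
df_b_salted = df_b.withColumn(
    "salt",
    F.explode(
        F.when(
            F.col("customer_id").isin(HOT_CUSTOMERS),
            F.array(*[F.lit(i) for i in range(N_SALTS)])
        ).otherwise(F.array(F.lit(0)))
    )
)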


## 🔗 Join Using Salted Keys


In [0]:
joined_df = (
    df_a_salted.alias("a")
        .join(
            df_b_salted.alias("b"),
            # Joining on column names keeps a single customer_id and salt column
            on=["customer_id", "salt"],
            how="inner"
        )
        # The salt is only a join helper, so drop it from the result
        .drop("salt")
)


## ⚙️ About Adaptive Query Execution (AQE)

In some Spark environments, **Adaptive Query Execution (AQE)** can automatically
detect and mitigate skewed joins at runtime.

⚠️ On **Databricks Serverless compute**, Spark execution configurations
(such as `spark.sql.adaptive.*`) are **managed by the platform** and
cannot be manually enabled or disabled by users.

For this reason, we rely on **explicit techniques like key salting**
to handle skew in a predictable and portable way.
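
On classic clusters, where these settings remain user-configurable, enabling AQE skew handling might look like the sketch below. The configuration keys are the ones documented for Spark 3.x; the helper name `apply_aqe_skew_settings` and the chosen values are assumptions for illustration, so check them against the Spark version in use.

```python
# Illustrative values; the factor and threshold shown match the defaults
# commonly documented for Spark 3.x, but verify them for your version.
AQE_SKEW_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # A partition counts as skewed when it is this many times larger than the median
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    # ...and also larger than this absolute size
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}

def apply_aqe_skew_settings(session, settings=AQE_SKEW_SETTINGS):
    """Apply each setting via session.conf.set and return what was applied."""
    for key, value in settings.items():
        session.conf.set(key, value)
    return dict(settings)
```

In a notebook attached to a classic cluster, this would be called as `apply_aqe_skew_settings(spark)` before running the join.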


## 🧠 Why This Works

- Skewed keys are **split across multiple partitions**
- No single task is overloaded
- Cluster resources are used more evenly
- Job runtime improves because the slowest task no longer dominates the stage

Salting is a **manual but reliable** solution,  
while AQE provides **automatic skew mitigation** in many cases.
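
Salting only preserves the join result if the two sides are salted differently: one side gets a random salt per row, and the other side's hot-key rows are replicated once per salt value. The plain-Python sketch below illustrates this with made-up keys and row counts. It models each row by its join key only and counts the rows an inner join on `(key, salt)` would produce.

```python
import random
from collections import Counter

N_SALTS = 8                 # same illustrative value as in the notebook
HOT_KEYS = {"cust_hot_1"}   # hypothetical hot key

def salt_randomly(keys, rng):
    """Give each hot-key row a random salt in [0, N_SALTS); other rows get 0."""
    return [(key, rng.randrange(N_SALTS) if key in HOT_KEYS else 0) for key in keys]

def replicate_salts(keys):
    """Copy each hot-key row once per salt value; other rows get salt 0 only."""
    return [
        (key, salt)
        for key in keys
        for salt in (range(N_SALTS) if key in HOT_KEYS else (0,))
    ]

def inner_join_count(left, right):
    """Row count of an inner equi-join on the (key, salt) pair."""
    right_counts = Counter(right)
    return sum(right_counts[pair] for pair in left)

rng = random.Random(42)
left_keys = ["cust_hot_1"] * 1_000 + ["cust_2", "cust_3"]
right_keys = ["cust_hot_1"] * 50 + ["cust_2"]

# Baseline: the unsalted join (every row carries salt 0).
expected = inner_join_count(
    [(k, 0) for k in left_keys], [(k, 0) for k in right_keys]
)
# Random salt on the left, replicated salts on the right.
salted = inner_join_count(salt_randomly(left_keys, rng), replicate_salts(right_keys))
# Random salt on both sides: rows only match when their salts happen to agree.
both_random = inner_join_count(
    salt_randomly(left_keys, rng), salt_randomly(right_keys, rng)
)

print("unsalted:", expected)
print("random + replicated:", salted)
print("random on both sides:", both_random)
```

The random-plus-replicated variant returns the same number of rows as the unsalted join, while random salts on both sides lose every hot-key pair whose salts differ. The price of replication is that the replicated side's hot-key rows grow by a factor of `N_SALTS`.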


## ✅ Summary

- Data skew is a common cause of slow Spark joins.
- Skew can be detected using aggregation and Spark UI.
- Key salting spreads hot keys across partitions.
- AQE can automatically mitigate skew in supported environments.

This notebook demonstrates a **production-grade strategy**
for handling skewed joins in large-scale Spark workloads.
