# Identifying and Debugging Data Skew - Interactive Demo

Welcome! This demo will teach you how to identify and debug one of the most common Spark performance issues: **data skew**.

---

## 📈 What is Data Skew?

**Data skew** occurs when data is unevenly distributed across partitions, causing some tasks to process significantly more data than others.

**Example:**
* Partition 1: 1,000 records
* Partition 2: 1,000 records  
* Partition 3: 1,000 records
* Partition 4: **1,000,000 records** ← SKEWED!

A stage finishes only when its slowest task does, so the whole job waits on Partition 4 while the other executors sit idle.
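
To make the cost concrete, here is a minimal plain-Python sketch (no Spark needed). It assumes, purely for illustration, that task time is proportional to record count and that every partition runs in parallel on its own slot. The function name `straggler_report` is hypothetical.

```python
from statistics import median

def straggler_report(partition_sizes):
    """Summarize how much one oversized partition dominates a stage.

    Assumes task time is proportional to record count and that every
    partition runs in parallel on its own slot (a simplification).
    """
    slowest = max(partition_sizes)
    typical = median(partition_sizes)
    # Wall-clock time is set by the slowest task; useful work is the sum of all tasks.
    busy = sum(partition_sizes)
    capacity = slowest * len(partition_sizes)
    return {
        "skew_ratio": slowest / typical,
        "slot_utilization": busy / capacity,
    }

report = straggler_report([1_000, 1_000, 1_000, 1_000_000])
print(f"Max/median ratio: {report['skew_ratio']:.0f}x")
print(f"Slot utilization: {report['slot_utilization']:.1%}")
```

Under these assumptions, the four slots are doing useful work only about a quarter of the time; the rest is spent waiting for Partition 4.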

---

## ⚠️ Why Does Data Skew Matter?

* 🐌 **Slow Performance** - One task takes 100x longer than others
* 💸 **Wasted Resources** - Most executors sit idle waiting for the slow task
* 🚫 **Job Failures** - Skewed tasks may run out of memory (OOM errors)
* 🔄 **Stragglers** - The "straggler" task delays the entire job

---

## 🎯 What You'll Learn

1. 📦 **Generate skewed data** for demonstration
2. 🔴 **Observe the problem** - See skew in action
3. 🔍 **Identify skew in Spark UI** - Learn what to look for
4. 📊 **Detect skew programmatically** - Check data distribution
5. ✅ **Fix data skew** - Apply mitigation techniques

---

**Let's get started!** 🚀

## 1. Generate Skewed Data 📦

We'll create a realistic e-commerce dataset with **intentional data skew**.

**Scenario:** Customer order data where:
* Most customers have only a handful of orders (about 5 on average)
* One "power user" (customer_id = 1) has 500,000 orders
* This creates severe skew when grouping by customer_id

**Why this matters:** This pattern is common in real-world data:
* Popular products with millions of sales
* High-volume customers or accounts
* Hot keys in time-series data (e.g., peak hours)
* Geographic concentrations (e.g., major cities)

In [0]:
from pyspark.sql.functions import col, rand, lit

print("Generating skewed customer order data...")
print("This will take a moment to create enough data to demonstrate skew.\n")

# Create a large dataset with intentional skew
# Customer 1 will have 500,000 orders (the skewed key)
# Other customers (2-1000) will share ~5,000 orders (about 5 each on average)

# Generate the skewed customer (customer_id = 1) with 500,000 orders
skewed_orders = spark.range(0, 500000).select(
    lit(1).alias("customer_id"),
    (rand() * 1000).cast("int").alias("order_amount"),
    (rand() * 100).cast("int").alias("product_id")
)

# Generate normal customers: 5,000 orders with customer_id drawn uniformly from 2-1000
# Total: 5,000 orders across up to 999 customers (about 5 each on average)
normal_orders = spark.range(0, 5000).select(
    ((rand() * 999) + 2).cast("int").alias("customer_id"),
    (rand() * 1000).cast("int").alias("order_amount"),
    (rand() * 100).cast("int").alias("product_id")
)

# Combine both datasets
orders_df = skewed_orders.union(normal_orders)

print(f"✅ Generated {orders_df.count():,} total orders")
print("   - Customer 1 (skewed): 500,000 orders")
print("   - Other customers: ~5,000 orders")
print("\n⚠️  Customer 1 has 100x more data than all other customers combined!")

In [0]:
# Let's look at the data
print("Sample of the orders data:")
display(orders_df.limit(20))

In [0]:
# Let's see the distribution of orders per customer
print("Orders per customer (top 10):")

customer_counts = orders_df.groupBy("customer_id").count().orderBy(col("count").desc())
display(customer_counts.limit(10))

print("\n⚠️  Notice: Customer 1 has 500,000 orders while others have < 20!")
print("This is SEVERE data skew!")

## 2. Observe the Skew Problem 🔴

Now let's perform operations that will suffer from data skew.

**Operations that expose skew:**
* **groupBy()** - Shuffles data by key, so a hot key's rows all end up in one partition
* **join()** - Skewed join keys create hot partitions on the shuffled side
* **distinct()** - Deduplication shuffles by value
* **repartition()** on a column - Hash-partitions by that column, so a hot key produces an uneven partition
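
Why does a hot key end up in a single partition? A shuffle picks the target partition from a hash of the key, so every row with the same key maps to the same place. The plain-Python sketch below imitates that with `key % num_partitions`. This is only a stand-in for illustration (Spark uses its own hash function), and the data mirrors the demo's shape at a much smaller scale.

```python
from collections import Counter

def simulate_shuffle(keys, num_partitions):
    """Count rows per partition when rows are routed by key.

    `key % num_partitions` stands in for Spark's hash partitioner:
    identical keys always land in the same partition.
    """
    return Counter(key % num_partitions for key in keys)

# 5,000 rows for one hot key plus 5 rows for each of 99 ordinary keys
keys = [1] * 5_000 + [k for k in range(2, 101) for _ in range(5)]

sizes = simulate_shuffle(keys, num_partitions=4)
for partition_id in sorted(sizes):
    print(f"partition {partition_id}: {sizes[partition_id]:,} rows")
```

Adding more partitions would not help here: the hot key's 5,000 rows still travel together, which is why the mitigation strategies later in this demo change the key itself or avoid the shuffle.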

We'll run a **groupBy** aggregation that will clearly show the skew problem.

In [0]:
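# Note: these imports shadow Python's built-in sum() in the notebook's namespace
from pyspark.sql.functions import sum, avg, count

print("🔴 Running aggregation with skewed data...")
print("This will trigger a shuffle and you'll see skew in the Spark UI.\n")

# This aggregation shuffles by customer_id, so everything for customer 1
# is reduced by ONE task. Spark pre-aggregates count/sum/avg within each input
# partition before the shuffle, so the imbalance can be milder than in a join.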
customer_summary = orders_df.groupBy("customer_id").agg(
    count("*").alias("total_orders"),
    sum("order_amount").alias("total_spent"),
    avg("order_amount").alias("avg_order_value")
).orderBy(col("total_orders").desc())

print("⏳ Running query... Watch the Spark UI!")
print("\n👉 IMPORTANT: Open the Spark UI now to see the skew!")
print("   Click on the 'Spark UI' link that appears below after running this cell.\n")

# Force execution with an action
result = customer_summary.collect()

print("\n✅ Query completed!")
print(f"   Processed {len(result):,} customers")
print("\n⚠️  Did you notice one task took much longer than others?")

In [0]:
# Show the results
print("Top customers by order count:")
display(customer_summary.limit(10))

print("\n📊 Notice: Customer 1 has 500,000 orders while others have < 20")

## 3. Identify Skew in Spark UI 🔍

The **Spark UI** is your primary tool for identifying data skew. Let's learn how to use it!

### 📍 How to Access Spark UI

1. **During job execution:** Click the "Spark UI" link that appears below the cell
2. **After execution:** Go to the cluster page and click "Spark UI"
3. **From notebook:** Look for the Spark UI icon in the cell output

---

### 🔍 What to Look For in Spark UI

Data skew shows up in several places in the Spark UI. Here's where to look:

### 📊 1. Jobs Tab - Overview

**Navigation:** Spark UI → Jobs Tab

**What to look for:**

* **Job Duration** - One job taking much longer than expected
* **Active Tasks** - Most tasks complete quickly, but 1-2 tasks still running
* **Event Timeline** - Visual representation shows tasks finishing at different times

**Signs of Skew:**
* ⚠️ Most tasks complete in seconds, but a few take minutes/hours
* ⚠️ Job progress stuck at 99% for a long time (waiting for straggler tasks)
* ⚠️ Event timeline shows long tail of tasks

**Example:**
```
Task 1: ========== (10 seconds)
Task 2: ========== (10 seconds)
Task 3: ========== (10 seconds)
Task 4: ============================================== (5 minutes) ← SKEWED!
```

### 📊 2. Stages Tab - Detailed View

**Navigation:** Spark UI → Stages Tab → Click on a Stage

**What to look for:**

#### **Summary Metrics Section:**

Look at the **Task Metrics** table:

| Metric | Min | 25th % | Median | 75th % | Max |
|--------|-----|--------|--------|--------|-----|
| Duration | 1s | 2s | 2s | 3s | **300s** ← SKEW! |
| Input Size | 1 MB | 2 MB | 2 MB | 3 MB | **500 MB** ← SKEW! |
| Records | 1K | 2K | 2K | 3K | **500K** ← SKEW! |

**Signs of Skew:**
* ⚠️ **Max >> Median** - Max is 10x-100x larger than median
* ⚠️ **Max >> 75th percentile** - One task is an outlier
* ⚠️ **Shuffle Read Size** - One task reads much more data
* ⚠️ **Shuffle Write Size** - One task writes much more data
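
The "Max >> Median" rule of thumb is easy to apply to task durations copied out of the Stages tab. A minimal sketch in plain Python follows. The helper name `looks_skewed` and the 10x default are illustrative choices based on the range mentioned above, not a Spark API.

```python
from statistics import median

def looks_skewed(task_durations_s, factor=10.0):
    """Return (flag, ratio), where ratio = slowest task / median task.

    flag is True when the ratio exceeds `factor`.
    Durations are expected to be positive numbers of seconds.
    """
    typical = median(task_durations_s)
    if typical <= 0:
        raise ValueError("median task duration must be positive")
    ratio = max(task_durations_s) / typical
    return ratio > factor, ratio

print(looks_skewed([2, 2, 2, 180]))  # one straggler   -> (True, 90.0)
print(looks_skewed([2, 2, 3, 3]))    # balanced tasks  -> flag is False
```

The median is used rather than the mean because a single straggler inflates the mean and would hide its own effect.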

#### **Tasks Table:**

Scroll down to see individual tasks:

* **Sort by Duration** - Find the slowest tasks
* **Sort by Input Size** - Find tasks processing the most data
* **Look for outliers** - One task with 100x more data than others

**Example:**
```
Task 0: Duration: 2s,  Input: 2 MB,   Records: 2,000
Task 1: Duration: 2s,  Input: 2 MB,   Records: 2,000
Task 2: Duration: 2s,  Input: 2 MB,   Records: 2,000
Task 3: Duration: 180s, Input: 500 MB, Records: 500,000 ← SKEWED!
```

### 📊 3. SQL Tab - Query Execution

**Navigation:** Spark UI → SQL Tab → Click on a Query

**What to look for:**

#### **Query Execution Plan:**

Look at the visual DAG (Directed Acyclic Graph):

* **Exchange nodes** - These are shuffle operations (potential skew points)
* **HashAggregate** - GroupBy operations (common skew source)
* **SortMergeJoin** - Join operations (common skew source)

#### **Metrics for Each Operator:**

Each node in the plan diagram shows its own metrics:

* **Number of output rows** - Shows how many rows flow out of each operator
* **Data size** - Shows how much data each Exchange shuffles
* **Time spent** - Shows which operators are slow

**Signs of Skew:**
* ⚠️ An Exchange's per-task metrics show a max far above the median (for example, one partition with 100x more rows than others)
* ⚠️ HashAggregate shows uneven data distribution
* ⚠️ SortMergeJoin has one side much larger than expected

### 📊 4. Executors Tab - Resource Usage

**Navigation:** Spark UI → Executors Tab

**What to look for:**

#### **Executor Metrics:**

| Executor | Tasks | Duration | Input | Shuffle Read | Shuffle Write |
|----------|-------|----------|-------|--------------|---------------|
| 0 | 10 | 20s | 20 MB | 20 MB | 20 MB |
| 1 | 10 | 20s | 20 MB | 20 MB | 20 MB |
| 2 | 10 | 20s | 20 MB | 20 MB | 20 MB |
| 3 | 1 | **300s** | **500 MB** | **500 MB** | **500 MB** ← SKEWED! |

**Signs of Skew:**
* ⚠️ One executor has much longer task duration
* ⚠️ One executor processes much more data
* ⚠️ Uneven task distribution across executors
* ⚠️ Most executors idle while one is working

### ✅ Quick Skew Identification Checklist

Use this checklist when debugging slow Spark jobs:

**☐ Jobs Tab:**
- [ ] Is the job stuck at 99% for a long time?
- [ ] Are most tasks complete but 1-2 still running?
- [ ] Is there a long tail in the event timeline?

**☐ Stages Tab:**
- [ ] Is Max duration >> Median duration (10x or more)?
- [ ] Is Max input size >> Median input size?
- [ ] Are there outlier tasks in the tasks table?
- [ ] Is shuffle read/write size uneven?

**☐ SQL Tab:**
- [ ] Do Exchange nodes show uneven data distribution?
- [ ] Are HashAggregate or Join operations slow?
- [ ] Is one partition processing 100x more rows?

**☐ Executors Tab:**
- [ ] Is one executor much busier than others?
- [ ] Are most executors idle?
- [ ] Is task distribution uneven?

**If you checked 3+ boxes, you likely have data skew!**

## 4. Detect Skew Programmatically 📊

You can detect skew **before** running expensive operations by analyzing data distribution.

**Why detect skew programmatically?**
* ✅ Catch skew early in development
* ✅ Automate skew detection in data pipelines
* ✅ Monitor data quality over time
* ✅ Make informed decisions about optimization strategies

In [0]:
from pyspark.sql.functions import col, count, max as spark_max, min as spark_min, avg, stddev

print("🔍 Method 1: Analyze key distribution\n")

# Count records per key
key_distribution = orders_df.groupBy("customer_id").count()

# Calculate statistics
stats = key_distribution.select(
    spark_min("count").alias("min_records"),
    avg("count").alias("avg_records"),
    spark_max("count").alias("max_records"),
    stddev("count").alias("stddev_records")
).collect()[0]

print(f"Key Distribution Statistics:")
print(f"  Min records per key:    {stats['min_records']:,}")
print(f"  Avg records per key:    {stats['avg_records']:,.2f}")
print(f"  Max records per key:    {stats['max_records']:,}")
print(f"  Std deviation:          {stats['stddev_records']:,.2f}")
skew_ratio = stats['max_records'] / stats['avg_records']
print(f"\n  Skew ratio (max/avg):   {skew_ratio:.2f}x")

if skew_ratio > 10:
    print("\n⚠️  SEVERE SKEW DETECTED! Max is >10x the average.")
elif skew_ratio > 3:
    print("\n⚠️  MODERATE SKEW DETECTED! Max is >3x the average.")
else:
    print("\n✅ Data distribution looks balanced.")

In [0]:
print("🔍 Method 2: Identify hot keys (top skewed keys)\n")

# Find the top 10 keys with the most records
hot_keys = orders_df.groupBy("customer_id") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(10)

print("Top 10 customers by order count:")
display(hot_keys)

print("\n⚠️  These are your 'hot keys' that will cause skew in groupBy/join operations!")

In [0]:
print("🔍 Method 3: Calculate what % of data is in the top key\n")

# Get total records
total_records = orders_df.count()

# Get records in the top key (fetch the row once; each first() call runs a job)
top_key_row = hot_keys.first()
top_key_records = top_key_row['count']
top_key_id = top_key_row['customer_id']

# Calculate percentage
skew_percentage = (top_key_records / total_records) * 100

print(f"Total records:           {total_records:,}")
print(f"Top key (customer {top_key_id}):  {top_key_records:,} records")
print(f"Skew percentage:         {skew_percentage:.2f}%")

if skew_percentage > 50:
    print("\n⚠️  CRITICAL! One key contains >50% of all data!")
elif skew_percentage > 20:
    print("\n⚠️  WARNING! One key contains >20% of all data!")
else:
    print("\n✅ Data distribution is acceptable.")

In [0]:
print("🔍 Method 4: Analyze partition sizes (advanced)\n")

# Repartition by the skewed key to see partition distribution
from pyspark.sql.functions import spark_partition_id

partitioned_df = orders_df.repartition(4, "customer_id")

# Count records per partition
partition_sizes = partitioned_df.groupBy(spark_partition_id().alias("partition_id")) \
    .count() \
    .orderBy("partition_id")

print("Records per partition after repartitioning by customer_id:")
display(partition_sizes)

print("\n⚠️  Notice: One partition holds all 500,000 of customer 1's records, while the others have only ~1,250 each!")
print("This is what causes the performance problem in Spark.")

### 🛠️ Reusable Skew Detection Function

Here's a function you can use in your own projects:

In [0]:
def detect_skew(df, key_column, threshold=10):
    """
    Detect data skew in a DataFrame by analyzing key distribution.
    
    Args:
        df: Input DataFrame
        key_column: Column to check for skew
        threshold: Skew ratio threshold (default: 10x)
    
    Returns:
        Dictionary with skew metrics and recommendations
    """
    from pyspark.sql.functions import col, count, max as spark_max, avg
    
    # Calculate key distribution
    key_dist = df.groupBy(key_column).count()
    
    # Get statistics
    stats = key_dist.select(
        avg("count").alias("avg"),
        spark_max("count").alias("max")
    ).collect()[0]
    
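    # An empty DataFrame yields None for both statistics; treat that as "no skew"
    skew_ratio = stats['max'] / stats['avg'] if stats['avg'] else 0.0
    has_skew = skew_ratio > threshold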
    
    # Get hot keys
    hot_keys = key_dist.orderBy(col("count").desc()).limit(5).collect()
    
    result = {
        'has_skew': has_skew,
        'skew_ratio': skew_ratio,
        'avg_records': stats['avg'],
        'max_records': stats['max'],
        'hot_keys': [(row[key_column], row['count']) for row in hot_keys],
        'recommendation': 'Apply skew mitigation techniques' if has_skew else 'No action needed'
    }
    
    return result

# Test the function
print("🛠️ Testing skew detection function:\n")
result = detect_skew(orders_df, "customer_id", threshold=10)

print(f"Has skew: {result['has_skew']}")
print(f"Skew ratio: {result['skew_ratio']:.2f}x")
print(f"Avg records per key: {result['avg_records']:,.2f}")
print(f"Max records per key: {result['max_records']:,}")
print(f"\nTop 5 hot keys:")
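# Use a loop variable other than `count` so the imported Spark count() function is not overwritten
for key, n_records in result['hot_keys']:
    print(f"  Key {key}: {n_records:,} records")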
print(f"\nRecommendation: {result['recommendation']}")

## 5. Fix Data Skew ✅

Now that we can identify skew, let's learn how to fix it!

**Common mitigation strategies:**

1. **Salting** - Add random values to break up hot keys
2. **Adaptive Query Execution (AQE)** - Let Spark optimize automatically
3. **Broadcast Joins** - Avoid shuffling small tables
4. **Isolated Processing** - Handle hot keys separately
5. **Repartitioning** - Increase parallelism
6. **Data Preprocessing** - Fix skew at the source

The sections below walk through the first four techniques in detail.

### 🧂 Strategy 1: Salting

**What is salting?**

Add a random "salt" value to skewed keys to distribute them across multiple partitions.

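**How it works:**
1. Append a random salt (an integer from 0 to N-1) to each key
2. Aggregate on the salted key, so a hot key is split into up to N smaller groups
3. Aggregate the partial results across salt values to get one final row per original key (count and sum combine directly; derive an average from the summed totals)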

**When to use:**
* GroupBy operations with hot keys
* Aggregations with severe skew
* When you can't change the data source
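
Before running it on Spark, the two-stage idea can be checked in plain Python. The sketch below is a simulation under simplifying assumptions, not Spark code: `salted_count` is a hypothetical helper, and the dataset is a scaled-down stand-in for the demo's orders.

```python
import random
from collections import Counter

def salted_count(keys, num_salts, seed=0):
    """Two-stage count: first per (key, salt), then per original key."""
    rng = random.Random(seed)
    # Stage 1: partial counts on the salted key (key, salt)
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and add up the partial counts
    final = Counter()
    for (key, _salt), n in partial.items():
        final[key] += n
    return partial, final

keys = [1] * 50_000 + list(range(2, 502))  # one hot key plus 500 ordinary keys

partial, final = salted_count(keys, num_salts=10)
print(f"Largest group without salting: {max(Counter(keys).values()):,}")
print(f"Largest group with salting:    {max(partial.values()):,}")
print(f"Totals unchanged: {final == Counter(keys)}")
```

The largest group shrinks by roughly the number of salts while the final totals stay identical, which is the property the Spark version below relies on.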

In [0]:
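# Re-import col, count and sum so this cell does not depend on names bound in earlier cells
from pyspark.sql.functions import col, count, sum, rand, floor, concat, lit

print("🧂 Applying salting technique...\n")

# Add a salt column (random integer 0-9)
# This splits customer 1's data across 10 salted keys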
salted_df = orders_df.withColumn(
    "salt",
    floor(rand() * 10).cast("int")
).withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), col("salt"))
)

print("Original key distribution (customer 1):")
print(f"  Customer 1: 500,000 records in 1 partition\n")

print("After salting (customer 1):")
salt_distribution = salted_df.filter(col("customer_id") == 1) \
    .groupBy("salted_key").count() \
    .orderBy("salted_key")

display(salt_distribution)

print("\n✅ Customer 1's data is now split across 10 keys (~50,000 each)!")

In [0]:
print("🧂 Performing aggregation with salted keys...\n")

# Step 1: Aggregate by salted key
salted_agg = salted_df.groupBy("customer_id", "salt").agg(
    count("*").alias("partial_count"),
    sum("order_amount").alias("partial_sum")
)

# Step 2: Aggregate across salt values to get final result
final_result = salted_agg.groupBy("customer_id").agg(
    sum("partial_count").alias("total_orders"),
    sum("partial_sum").alias("total_spent")
).orderBy(col("total_orders").desc())

print("⏳ Running salted aggregation... This should be faster!\n")

result = final_result.collect()

print(f"✅ Completed! Processed {len(result):,} customers")
print("\nTop customers:")
display(final_result.limit(5))

print("\n📊 Check the Spark UI - tasks should be more evenly distributed!")

### ⚡ Strategy 2: Adaptive Query Execution (AQE)

**What is AQE?**

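Spark's built-in optimization that re-plans a query at runtime, using statistics from shuffle stages that have already completed.

**How it works:**
* Spark collects size statistics for each shuffle partition as stages complete
* For joins, it detects skewed partitions and splits them into smaller ones
* It coalesces many small shuffle partitions into fewer, larger ones
* It reoptimizes the rest of the query plan dynamically

**When to use:**
* Databricks Runtime 7.3 LTS+ or Apache Spark 3.2+ (enabled by default)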
* When you want automatic optimization
* For complex queries with multiple stages

**Configuration:**
```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```
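
For joins, AQE decides whether a shuffle partition is skewed using two settings described in the Spark SQL performance tuning guide: `spark.sql.adaptive.skewJoin.skewedPartitionFactor` and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`. A partition counts as skewed when it is larger than the factor times the median partition size and also larger than the byte threshold. The sketch below restates that rule in plain Python. The defaults used here (5 and 256 MB) are the values listed for Spark 3.x and are worth checking against your runtime's documentation; the function name is hypothetical.

```python
from statistics import median

MB = 1024 * 1024

def aqe_skewed_partitions(partition_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Return the indexes of partitions that satisfy both skew conditions."""
    typical = median(partition_bytes)
    return [
        i for i, size in enumerate(partition_bytes)
        if size > factor * typical and size > threshold_bytes
    ]

sizes = [20 * MB, 20 * MB, 20 * MB, 500 * MB]
print(aqe_skewed_partitions(sizes))  # [3]: 500 MB > 5 x 20 MB and > 256 MB

# A dataset whose largest partition is below the byte threshold does not qualify
small = [1 * MB, 1 * MB, 1 * MB, 40 * MB]
print(aqe_skewed_partitions(small))  # []
```

This is one reason a small demo dataset may show no skew-join optimization in the plan even though its key distribution is heavily skewed: the hot partition also has to exceed the byte threshold.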

In [0]:
print("⚡ Enabling Adaptive Query Execution (AQE)...\n")

# Enable AQE (usually enabled by default in Databricks)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Check current settings
print("AQE Configuration:")
print(f"  adaptive.enabled: {spark.conf.get('spark.sql.adaptive.enabled')}")
print(f"  skewJoin.enabled: {spark.conf.get('spark.sql.adaptive.skewJoin.enabled')}")
print(f"  coalescePartitions.enabled: {spark.conf.get('spark.sql.adaptive.coalescePartitions.enabled')}")

print("\n✅ AQE is enabled! Spark can now split skewed partitions in joins and coalesce small shuffle partitions.")

In [0]:
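print("⚡ Running aggregation with AQE enabled...\n")

# Run the same aggregation with AQE on. Note that AQE's skew splitting targets joins;
# for a groupBy like this one, the main visible effect is coalesced shuffle partitions.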
aqe_result = orders_df.groupBy("customer_id").agg(
    count("*").alias("total_orders"),
    sum("order_amount").alias("total_spent"),
    avg("order_amount").alias("avg_order_value")
).orderBy(col("total_orders").desc())

print("⏳ Running query with AQE...\n")
result = aqe_result.collect()

print("✅ Completed with AQE enabled.")
print("\n📊 Check the Spark UI SQL tab:")
print("   Look for 'AdaptiveSparkPlan' at the root of the query plan")
print("   For this groupBy you may see coalesced shuffle partitions; 'OptimizeSkewedJoin' applies to joins")

### 📡 Strategy 3: Broadcast Join

**What is a broadcast join?**

Send a small table to all executors instead of shuffling data.

**How it works:**
* Small table is copied to every executor
* No shuffle needed for the small table
* Avoids skew in join operations

**When to use:**
* Joining a large skewed table with a small table
* Small table < 10 MB for automatic broadcast by default (configurable via `spark.sql.autoBroadcastJoinThreshold`)
* Dimension tables in star schema

**Example:**
```python
from pyspark.sql.functions import broadcast

# Force broadcast of small table
result = large_df.join(broadcast(small_df), "key")
```

**Benefits:**
* No shuffle for small table
* Avoids skew in join keys
* Much faster for small dimension tables
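
Conceptually, a broadcast hash join gives every task a full copy of the small table as an in-memory lookup, so the large table's rows are joined where they already are and never have to be regrouped by key. A plain-Python sketch of that idea follows. The table contents and the `broadcast_hash_join` name are made up for illustration, and this is not how Spark implements the join internally.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large table against a shared lookup.

    `large_partitions` is a list of partitions (lists of dict rows);
    `small_rows` plays the role of the broadcast table.
    """
    # Built once and shared with every task. If the small table has duplicate
    # keys, the last row wins (kept simple for this sketch).
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for partition in large_partitions:
        # Rows stay in their original partition, so a hot key is never
        # funnelled into a single task.
        joined.append([
            {**row, **lookup[row[key]]} for row in partition if row[key] in lookup
        ])
    return joined

orders = [
    [{"customer_id": 1, "order_amount": 10}, {"customer_id": 2, "order_amount": 7}],
    [{"customer_id": 1, "order_amount": 25}, {"customer_id": 3, "order_amount": 4}],
]
customers = [{"customer_id": 1, "tier": "gold"}, {"customer_id": 2, "tier": "basic"}]

for i, part in enumerate(broadcast_hash_join(orders, customers, "customer_id")):
    print(i, part)
```

The hot key (customer 1) appears in both partitions and is joined in both, without any partition growing. The trade-off is memory: every task needs room for the whole lookup, which is why the technique is limited to small tables.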

### 🎯 Strategy 4: Isolated Processing

**What is isolated processing?**

Handle hot keys separately from normal keys.

**How it works:**
1. Identify hot keys
2. Filter data into two DataFrames: hot keys and normal keys
3. Process each separately (possibly with different strategies)
4. Union the results

**When to use:**
* Few hot keys with extreme skew
* Different processing logic for hot keys
* When salting isn't enough

**Example:**
```python
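from pyspark.sql.functions import col, count

hot_key_value = 1  # the known hot key (customer 1 in this demo)

# Separate hot keys (rows with a NULL key match neither filter)
hot_key_df = orders_df.filter(col("customer_id") == hot_key_value)
normal_df = orders_df.filter(col("customer_id") != hot_key_value)

# Process separately
hot_result = hot_key_df.groupBy("customer_id").agg(count("*").alias("total_orders"))
normal_result = normal_df.groupBy("customer_id").agg(count("*").alias("total_orders"))

# Combine (union matches columns by position, so both schemas must line up)
final_result = hot_result.union(normal_result)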
```

### 📊 Comparison of Skew Mitigation Strategies

| Strategy | Complexity | Effectiveness | When to Use |
|----------|------------|---------------|-------------|
| **AQE** | 🟢 Low | 🟡 Medium-High | Default choice, works automatically |
| **Salting** | 🟡 Medium | 🟢 High | GroupBy with hot keys |
| **Broadcast Join** | 🟢 Low | 🟢 High | Joins with small tables |
| **Isolated Processing** | 🔴 High | 🟢 High | Few extreme hot keys |
| **Repartitioning** | 🟢 Low | 🟡 Medium | Increase parallelism |
| **Data Preprocessing** | 🔴 High | 🟢 High | Fix at source if possible |

### 💡 Decision Tree

```
Do you have data skew?
└─ YES → Is AQE enabled?
    ├─ NO → Enable AQE first!
    └─ YES → Is it a join?
        ├─ YES → Is one table small?
        │   ├─ YES → Use Broadcast Join
        │   └─ NO → Use Salting or Isolated Processing
        └─ NO → Is it a GroupBy?
            └─ YES → Use Salting
```
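
The same tree can be written as a small helper, for example to attach a suggestion to an automated skew report. This is a sketch: the function name, parameters, and returned strings are illustrative and simply mirror the branches above.

```python
def choose_strategy(aqe_enabled, is_join=False, one_side_small=False, is_group_by=False):
    """Walk the decision tree above and return the suggested next step."""
    if not aqe_enabled:
        return "Enable AQE first"
    if is_join:
        return "Broadcast Join" if one_side_small else "Salting or Isolated Processing"
    if is_group_by:
        return "Salting"
    # The tree does not cover other operations
    return "No specific suggestion in this tree"

print(choose_strategy(aqe_enabled=True, is_join=True, one_side_small=True))
print(choose_strategy(aqe_enabled=True, is_group_by=True))
```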

## 🎉 Summary

### What We Learned:

✅ **What is data skew** - Uneven data distribution across partitions  
✅ **Why it matters** - Causes slow performance and wasted resources  
✅ **How to identify in Spark UI** - Check Jobs, Stages, SQL, and Executors tabs  
✅ **How to detect programmatically** - Analyze key distribution and partition sizes  
✅ **How to fix it** - Salting, AQE, broadcast joins, and more  

---

### 📚 Best Practices:

1. **Enable AQE** - Let Spark handle skew automatically (enabled by default in Databricks)
2. **Monitor regularly** - Check Spark UI for skew patterns
3. **Detect early** - Use programmatic detection in development
4. **Choose the right strategy** - Match the solution to your specific skew pattern
5. **Test and measure** - Compare performance before and after optimization
6. **Document hot keys** - Keep track of known skewed keys in your data

---

### 🚀 Next Steps:

* Apply these techniques to your own datasets
* Experiment with different strategies
* Monitor Spark UI regularly
* Share knowledge with your team
* Consider data preprocessing to prevent skew at the source

---

**Remember:** Data skew is one of the most common Spark performance issues. 
Mastering these techniques will make you a more effective data engineer! 💪