# Adaptive Query Execution for Skewed Joins

Skewed data sets are a serious problem, especially for join operations. Before Spark 3, such cases were hard to optimize: a job could easily run for a very long time because a single task dominated the overall runtime, or it could fail with an out-of-memory error (OOM). To cope, people either increased the number of shuffle partitions via `spark.sql.shuffle.partitions` or salted the join keys (i.e. appended random bits to them). The first approach affects every shuffle in the session, while the second is complex to implement.
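
To make the salting idea mentioned above concrete, here is a minimal sketch in plain Python (no Spark involved). The data, the number of salt values `NUM_SALTS` and all variable names are assumptions chosen for illustration. The skewed side gets a random salt per row, the other side is replicated once per salt value, and the join then runs on the composite key.

```python
import random
from collections import Counter

NUM_SALTS = 8  # hypothetical number of salt values, chosen for illustration
rng = random.Random(42)

# Skewed side: about 80% of all rows reference the same key (100)
sales_keys = [100 if rng.random() < 0.8 else rng.randrange(1000) for _ in range(10_000)]

# Step 1: attach a random salt to every row of the skewed side
salted_sales = [(key, rng.randrange(NUM_SALTS)) for key in sales_keys]

# Step 2: replicate every row of the other side once per salt value,
# so that each salted key still finds exactly one match
cars = {key: f"config-{key}" for key in range(1000)}
salted_cars = {(key, salt): cfg for key, cfg in cars.items() for salt in range(NUM_SALTS)}

# Step 3: join on the composite key (original key, salt)
joined = [(key, salted_cars[(key, salt)]) for key, salt in salted_sales]

largest_plain = Counter(sales_keys).most_common(1)[0][1]
largest_salted = Counter(salted_sales).most_common(1)[0][1]
print(f"rows joined:          {len(joined)}")
print(f"largest key group:    {largest_plain}")
print(f"largest salted group: {largest_salted}")
```

The join result has the same number of rows as before, but the hot key is now spread over `NUM_SALTS` groups. The price is the replicated second table and the extra bookkeeping, which is why salting is tedious to apply by hand.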

Fortunately, Spark 3 improved the situation considerably with Adaptive Query Execution (AQE). This framework, built into Spark, lets the query planner change the execution plan dynamically once parts of a query have been executed and additional runtime information is available. AQE also supports skewed joins: it automatically splits huge partitions into smaller ones while still executing the join correctly.

Let's have a look at how this works. This notebook is heavily influenced by [a Medium article by Mario Cartia](https://medium.com/agile-lab-engineering/spark-3-0-first-hands-on-approach-with-adaptive-query-execution-part-3-ea6012a8f216).

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pandas as pd

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","24G") \
        .getOrCreate()

spark

# 1 Create Skewed Test Data

First we need a skewed data set, so we create our own data set about cars.

## 1.1 Create Car Models

First we create a small list of car makes, which will be used to fill the `make` column of the `cars` table created afterwards. We will also implement a small function `random_make` which returns a random entry of this list, but with a small twist: every odd `id` yields a Ford, while even ids pick uniformly from all ten makes, so roughly 55% of all cars are Fords. This gives the `make` column a heavy skew. We also implement an additional function `random_config` which simply creates a random 8-letter string representing the configuration of a specific car (colour, sports package, interior, ...).

In [None]:
import random
import string

makes = [
    "Ford",
    "Nissan",
    "Hyundai",
    "Suzuki",
    "Mercedes-Benz",
    "Fiat",
    "Skoda",
    "Kia",
    "Vw",
    "Porsche"
]


# Helper function returning a random make: odd ids always yield "Ford",
# even ids pick uniformly from all makes (including "Ford")
def random_make(id):
    is_ford = (id % 2 == 1)
    if is_ford:
        return makes[0]
    else:
        rnd = random.randint(0, len(makes) - 1)
        return makes[rnd]

# Helper function returning a random 8-letter configuration code (the id is ignored)
def random_config(id):
    letters = string.ascii_uppercase
    return "".join(random.choice(letters) for _ in range(8))


for i in range(0,10):
    print(random_make(i))

for i in range(0,10):
    print(random_config(i))
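
The share of Fords produced by `random_make` can be checked without Spark. The sketch below repeats the function so that it runs on its own; the sample size and the seed are arbitrary choices for illustration. Odd ids contribute 50% Fords, and the even ids add another 1/10 of the remaining 50%.

```python
import random
from collections import Counter

makes = ["Ford", "Nissan", "Hyundai", "Suzuki", "Mercedes-Benz",
         "Fiat", "Skoda", "Kia", "Vw", "Porsche"]

def random_make(id):
    # Same logic as in the cell above: odd ids are always Ford
    if id % 2 == 1:
        return makes[0]
    return makes[random.randint(0, len(makes) - 1)]

random.seed(0)  # arbitrary seed, only to make the run repeatable
n = 100_000
counts = Counter(random_make(i) for i in range(n))

expected_ford_share = 0.5 + 0.5 / len(makes)
observed_ford_share = counts["Ford"] / n
print(f"expected Ford share: {expected_ford_share:.3f}")
print(f"observed Ford share: {observed_ford_share:.3f}")
```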

### Pandas UDF

Now we wrap the previous pure Python functions in Pandas UDFs, so that they can be applied to Spark columns.

In [None]:
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def random_make_udf(ids: pd.Series) -> pd.Series:
    return ids.apply(random_make)

@pandas_udf('string')
def random_config_udf(ids: pd.Series) -> pd.Series:
    return ids.apply(random_config)

### Create DataFrame / Table

Now create a DataFrame containing lots of cars. These represent specific configurations from specific manufacturers. Each line is identified by a `car_id` column.

In [None]:
num_cars = 100000000

cars = spark.range(0,num_cars).select(
        f.col("id").alias("car_id"),
        random_make_udf('id').alias('make'),
        random_config_udf('id').alias('config')
    )

cars.limit(10).toPandas()

## 1.2 Create Sales Table

Now we create an additional table that references the cars and is again highly skewed. We use a fictional `sales` table, where each line has a `sales_id`, a reference to a specific car configuration via `car_id` and a sales date. The skew is created by assigning about 80% of all entries to the car with id `100`.

In [None]:
num_sales = 100000000

sales = spark.range(0,num_sales).select(
        f.col("id").alias("sales_id"),
        f.date_add(f.current_date(), -(f.rand() * 360).cast('int')).alias("sales_date"),
        f.when(f.rand() < 0.8, 100).otherwise((f.rand()*num_cars).cast('long')).alias("car_id")
    )

sales.limit(10).toPandas()

# 2 Perform JOIN

Finally we will join the two tables on the join key `car_id`. Note that the join key is unique in `cars` but not in `sales`, where it is highly skewed: about 80% of all sales rows reference the car with id `100`.
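
To see why such a key distribution hurts a shuffle join, the following plain Python sketch distributes skewed `car_id` values over 200 partitions, which is the default value of `spark.sql.shuffle.partitions`. It is only a rough model: `car_id % NUM_PARTITIONS` stands in for Spark's real hash partitioner, and the row counts are scaled down for illustration.

```python
import random
from collections import Counter
from statistics import median

NUM_PARTITIONS = 200   # default of spark.sql.shuffle.partitions
NUM_CARS = 100_000     # scaled down for this illustration
NUM_SALES = 200_000    # scaled down for this illustration
rng = random.Random(1)

# Same skew as in the sales table: about 80% of all rows point to car 100
car_ids = [100 if rng.random() < 0.8 else rng.randrange(NUM_CARS) for _ in range(NUM_SALES)]

# Assumption: modulo partitioning as a stand-in for Spark's hash partitioning
rows_per_partition = Counter(car_id % NUM_PARTITIONS for car_id in car_ids)
sizes = [rows_per_partition.get(p, 0) for p in range(NUM_PARTITIONS)]

largest_share = max(sizes) / NUM_SALES
print(f"median partition size:  {median(sizes)}")
print(f"largest partition size: {max(sizes)}")
print(f"share of all rows in the largest partition: {largest_share:.1%}")
```

All rows with the same key end up in the same partition, so one task has to process roughly 80% of the data regardless of how many partitions are configured.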

## 2.1 Unoptimized Skewed Join

First we will use a non-adaptive join as the performance baseline.

In [None]:
# Disable automatic broadcast. Default: 10MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Disable AQE for the baseline. Default: False in Spark 3.0/3.1, True since Spark 3.2
spark.conf.set("spark.sql.adaptive.enabled", False)

In [None]:
%%time
# Count the number of sales per manufacturer. You need to join the sales table with the cars table
# and then count the number of records per make
# YOUR CODE HERE

result.toPandas()

### Execution Plan

Let's inspect the execution plan after the query has succeeded.

In [None]:
# YOUR CODE HERE

## 2.2 Optimized Skewed Join (AQE)

Now we will enable the Adaptive Query Execution in Spark and configure some thresholds such that it will work nicely with our rather small data sets.

In [None]:
# Enable AQE. Default: False in Spark 3.0/3.1, True since Spark 3.2
spark.conf.set("spark.sql.adaptive.enabled", True)
# Enable skewed join optimization. Default: True
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)

# The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "8KB")
# A partition is considered as skewed if its size is larger than this factor multiplying the median partition size and also larger than 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'. Default: 5
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", 2)
# A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than 'spark.sql.adaptive.skewJoin.skewedPartitionFactor' multiplying the median partition size. Default: 256MB
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "16KB")
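
The skew rule described in the comments above can be written down in a few lines of plain Python. This is a simplified model and not Spark's actual implementation: the function names and the example partition sizes are made up, and Spark's real splitting logic is more involved than dividing by the advisory size.

```python
from math import ceil
from statistics import median

# The values configured in the cell above, in bytes
ADVISORY_SIZE = 8 * 1024    # advisoryPartitionSizeInBytes
SKEW_FACTOR = 2             # skewJoin.skewedPartitionFactor
SKEW_THRESHOLD = 16 * 1024  # skewJoin.skewedPartitionThresholdInBytes

def is_skewed(size, median_size):
    # Both conditions must hold: larger than factor * median AND larger than the threshold
    return size > SKEW_FACTOR * median_size and size > SKEW_THRESHOLD

def plan_splits(partition_sizes):
    # Simplification: a skewed partition is cut into chunks of roughly the advisory size
    med = median(partition_sizes)
    return [ceil(size / ADVISORY_SIZE) if is_skewed(size, med) else 1
            for size in partition_sizes]

# Hypothetical shuffle partition sizes in bytes, with one huge partition
sizes = [4_000, 5_000, 6_000, 5_500, 900_000]
splits = plan_splits(sizes)
for size, n in zip(sizes, splits):
    print(f"{size:>8} bytes -> {n} task(s)")
```

With the default settings (factor 5, threshold 256MB) none of these tiny example partitions would count as skewed, which is why the notebook lowers the thresholds.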

In [None]:
%%time
# Perform the very same query, but now with a different configuration
# YOUR CODE HERE

result.toPandas()

### Execution Plan

Again let's inspect the execution plan after the query has succeeded. Note that it will look significantly different now.

In [None]:
# YOUR CODE HERE