# Adaptive Query Execution for Skewed Joins

Skewed data sets are a common problem, especially for join operations. Before Spark 3 such joins were difficult to optimize: a single task could easily dominate the overall runtime of a job, or the job could even fail with out-of-memory errors (OOMs). To cope with such situations, people either increased the number of shuffle partitions via `spark.sql.shuffle.partitions` or salted the join keys (i.e. added random bits). The first approach affects every shuffle in the Spark session, while the second one is complex to implement.
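To make the salting idea concrete, here is a minimal pure-Python sketch (no Spark involved) that hash-partitions a skewed key column with and without a random salt. The partition count, the salt range, the key names and the helper functions are all assumptions chosen for illustration. In a real join the other side would additionally have to be replicated once per salt value so that every salted key still finds its match, which is part of what makes salting cumbersome.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # illustrative; spark.sql.shuffle.partitions defaults to 200
NUM_SALTS = 4        # hypothetical salt range


def partition_of(key: str) -> int:
    # Deterministic stand-in for a hash partitioner
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def largest_partition(keys) -> int:
    # Number of rows in the fullest partition
    sizes = Counter(partition_of(k) for k in keys)
    return max(sizes.values())


rng = random.Random(42)

# 55% of all rows share one hot key, the rest is spread over nine other keys
keys = ["FORD/FIESTA"] * 5500 + [f"OTHER/{i % 9}" for i in range(4500)]

# Salting appends a random suffix, so one hot key becomes NUM_SALTS distinct keys
salted_keys = [f"{k}#{rng.randrange(NUM_SALTS)}" for k in keys]

unsalted_max = largest_partition(keys)
salted_max = largest_partition(salted_keys)
print("largest partition without salt:", unsalted_max)
print("largest partition with salt:   ", salted_max)
```

Without the salt, all rows of the hot key necessarily end up in the same partition; with the salt they are distributed over several partitions, at the price of a more complicated join.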

Luckily, the situation improved a lot with Spark 3, thanks to the new Adaptive Query Execution (AQE). This Spark-internal framework allows Spark to change the execution plan of a query dynamically once some parts have been executed and additional information is available to the query planner. AQE also supports skewed joins: it automatically splits huge partitions into smaller ones while still executing the join operation correctly.
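Why splitting is safe can be illustrated without Spark: if the oversized partition of one side is cut into chunks and each chunk is joined against the complete matching partition of the other side, the union of the chunk results equals the unsplit join. The sketch below shows this in plain Python under simplified assumptions (an inner equi-join on a single key, made-up rows, hypothetical helper names). It is an illustration of the idea, not Spark's actual implementation.

```python
from itertools import product


def inner_join(left, right):
    """Naive inner equi-join on the first tuple element."""
    return [(lk, lv, rv) for (lk, lv), (rk, rv) in product(left, right) if lk == rk]


def split(rows, chunk_size):
    """Cut a partition into consecutive chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


# One partition per side; the FIESTA key dominates both
left_partition = [("FIESTA", f"reg{i}") for i in range(10)] + [("RIO", "reg10")]
right_partition = [("FIESTA", price) for price in (11000, 12000, 13000)] + [("RIO", 9000)]

# Baseline: a single task joins the whole partition
unsplit = inner_join(left_partition, right_partition)

# Split variant: every left chunk is joined with the full right partition,
# so the work could be spread over several tasks
split_result = []
for chunk in split(left_partition, chunk_size=4):
    split_result.extend(inner_join(chunk, right_partition))

print(len(unsplit), sorted(unsplit) == sorted(split_result))
```

The price of the split is that the other side's partition has to be read once per chunk, which is usually far cheaper than letting one task process the entire skewed partition.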

Let's have a look at how this works. This notebook is heavily influenced by [a Medium article by Mario Cartia](https://medium.com/agile-lab-engineering/spark-3-0-first-hands-on-approach-with-adaptive-query-execution-part-3-ea6012a8f216).

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","24G") \
        .getOrCreate()

spark

# 1 Create Skewed Test Data

First we need to have a skewed data set. We create our own data set about cars. 

## 1.1 Create Car Models

First we create a small data set with car models, which will serve as the join key of two additional tables, which will be created afterwards. We will also implement a small function `random_make_model` which returns a random entry of the table - but with a small twist. With a chance of over 50%, the returned car model will be a Ford Fiesta, which will later be responsible for the skewed partition.

In [3]:
from pyspark.sql import Row
import random

MakeModel = Row("make", "model")

make_models = [
    MakeModel("FORD", "FIESTA"),
    MakeModel("NISSAN", "QASHQAI"),
    MakeModel("HYUNDAI", "I20"),
    MakeModel("SUZUKI", "SWIFT"),
    MakeModel("MERCEDES-BENZ", "E CLASS"),
    MakeModel("FIAT", "500"),
    MakeModel("SKODA", "OCTAVIA"),
    MakeModel("KIA", "RIO"),
    MakeModel("VW", "TIGUAN"),
    MakeModel("PORSCHE", "911"),
]


# Helper function to create random make & model
def random_make_model():
    is_ford = random.choice([True, False])
    if is_ford:
        return make_models[0]
    else:
        rnd = random.randint(0, len(make_models) - 1)
        return make_models[rnd]
    
random_make_model()

Row(make='FORD', model='FIESTA')
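The size of the skew follows directly from the helper: the coin flip returns the Ford Fiesta half of the time, and the uniform draw in the other branch returns it once in ten. The small standalone check below re-implements the selection logic with list indices instead of `Row` objects (the function name, the seed and the sample size are arbitrary choices made for this sketch) and compares the exact probability with a simulation.

```python
import random
from fractions import Fraction

NUM_MODELS = 10  # length of the make_models list

# Exact probability: P(coin flip) + P(no coin flip) * P(uniform draw hits index 0)
expected_share = Fraction(1, 2) + Fraction(1, 2) * Fraction(1, NUM_MODELS)


def random_model_index(rng):
    # Same selection logic as random_make_model, but returning the list index
    if rng.choice([True, False]):
        return 0
    return rng.randint(0, NUM_MODELS - 1)


rng = random.Random(0)
samples = 100_000
observed_share = sum(random_model_index(rng) == 0 for _ in range(samples)) / samples

print("expected:", float(expected_share))
print("observed:", round(observed_share, 3))
```

The expected share is 11/20, i.e. 55%, and every other model is expected to receive 5% of the rows.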

## 1.2 Create First Car Table

The first table simply contains car registrations. Each row consists of a randomly generated registration and a random car model from above. Remember that Ford Fiestas make up over 50% of the picks, so the data set is already skewed with regard to the car model.

In [4]:
import string

Table1 = Row("registration", "make", "model", "engine_size")

def random_t1():
    def random_registration():
        # Eight random uppercase letters
        return "".join(random.choices(string.ascii_uppercase, k=8))

    def random_engine_size():
        return 1 + random.randint(0,9)/10.0
    
    make_model = random_make_model()
    return Table1( random_registration(), make_model.make, make_model.model, random_engine_size())

random_t1()

Row(registration='FAABIXHZ', make='PORSCHE', model='911', engine_size=1.1)

### Create DataFrame

With the definitions above, let's create a Spark DataFrame containing random car registrations.

In [5]:
t1 = spark.createDataFrame([random_t1() for i in range(20000)])

### Inspect DataFrame

Now let's count the occurrences of each car model. We expect that the Ford Fiesta will make up over 50%.

In [6]:
t1.groupBy(["make", "model"]).count().orderBy(f.col("count").desc()).show()

+-------------+-------+-----+
|         make|  model|count|
+-------------+-------+-----+
|         FORD| FIESTA|11080|
|       SUZUKI|  SWIFT| 1028|
|      HYUNDAI|    I20| 1021|
|      PORSCHE|    911| 1016|
|       NISSAN|QASHQAI| 1006|
|         FIAT|    500| 1003|
|MERCEDES-BENZ|E CLASS| 1001|
|           VW| TIGUAN|  961|
|        SKODA|OCTAVIA|  961|
|          KIA|    RIO|  923|
+-------------+-------+-----+



## 1.3 Create Second Table

Now we create an additional table containing car information, again highly skewed.

In [8]:
Table2 = Row("make", "model", "engine_size", "sales_price")

def random_t2():
    def random_engine_size():
        return 1 + random.randint(0,9)/10.0

    def random_sales_price():
        return random.randint(10000, 40000)
    
    make_model = random_make_model()
    return Table2(make_model.make, make_model.model, random_engine_size(), random_sales_price())

random_t2()

Row(make='FORD', model='FIESTA', engine_size=1.8, sales_price=22637)

### Create DataFrame

In [9]:
t2 = spark.createDataFrame([random_t2() for i in range(200000)])

### Inspect DataFrame

In [10]:
t2.groupBy(["make", "model"]).count().orderBy(f.col("count").desc()).show()

+-------------+-------+------+
|         make|  model| count|
+-------------+-------+------+
|         FORD| FIESTA|110152|
|      HYUNDAI|    I20| 10077|
|      PORSCHE|    911| 10049|
|          KIA|    RIO| 10048|
|       SUZUKI|  SWIFT| 10020|
|           VW| TIGUAN|  9982|
|         FIAT|    500|  9962|
|       NISSAN|QASHQAI|  9954|
|MERCEDES-BENZ|E CLASS|  9884|
|        SKODA|OCTAVIA|  9872|
+-------------+-------+------+



# 2 Perform JOIN

Finally we will join the two tables on the join keys `make` and `model`. Note that the join keys are unique in neither DataFrame and that they are highly skewed in both DataFrames.
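Because neither side has unique keys, every key contributes the product of its row counts on both sides to the join output (before the `engine_size` filter is applied). Plugging in the counts printed in sections 1.2 and 1.3 gives a rough idea of how lopsided the work is. The numbers below are copied from that particular run and will differ slightly on every execution; the dictionary keys are just labels made up for this sketch.

```python
# Row counts per make/model, copied from the two groupBy outputs above
t1_counts = {
    "FORD FIESTA": 11080, "SUZUKI SWIFT": 1028, "HYUNDAI I20": 1021,
    "PORSCHE 911": 1016, "NISSAN QASHQAI": 1006, "FIAT 500": 1003,
    "MERCEDES-BENZ E CLASS": 1001, "VW TIGUAN": 961, "SKODA OCTAVIA": 961,
    "KIA RIO": 923,
}
t2_counts = {
    "FORD FIESTA": 110152, "HYUNDAI I20": 10077, "PORSCHE 911": 10049,
    "KIA RIO": 10048, "SUZUKI SWIFT": 10020, "VW TIGUAN": 9982,
    "FIAT 500": 9962, "NISSAN QASHQAI": 9954, "MERCEDES-BENZ E CLASS": 9884,
    "SKODA OCTAVIA": 9872,
}

# An equi-join pairs every left row with every right row of the same key
pairs_per_key = {key: t1_counts[key] * t2_counts[key] for key in t1_counts}
total_pairs = sum(pairs_per_key.values())
fiesta_share = pairs_per_key["FORD FIESTA"] / total_pairs

print(f"{total_pairs:,} candidate pairs in total")
print(f"Ford Fiesta share: {fiesta_share:.1%}")
```

Although the Ford Fiesta accounts for only a little more than half of the rows in each table, it is responsible for the vast majority of the row pairs the join has to examine, and without further measures all of them are handled by a single task.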

## 2.1 Unoptimized Skewed Join

First we will use a non-adaptive join as the performance baseline.

In [11]:
# Disable automatic broadcast. Default: 10MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Disable AQE for the baseline run. Default in Spark 3.0: False
spark.conf.set("spark.sql.adaptive.enabled", False)

In [12]:
joined = t1.join(t2, ["make", "model"]) \
    .filter(f.abs(t1["engine_size"] - t2["engine_size"]) < 0.1) \
    .groupBy("registration") \
    .agg(f.avg("sales_price").alias("avg_sales_price"))

joined.explain()

== Physical Plan ==
*(6) HashAggregate(keys=[registration#0], functions=[avg(sales_price#35L)])
+- Exchange hashpartitioning(registration#0, 200), true, [id=#109]
   +- *(5) HashAggregate(keys=[registration#0], functions=[partial_avg(sales_price#35L)])
      +- *(5) Project [registration#0, sales_price#35L]
         +- *(5) SortMergeJoin [make#1, model#2], [make#32, model#33], Inner, (abs((engine_size#3 - engine_size#34)) < 0.1)
            :- *(2) Sort [make#1 ASC NULLS FIRST, model#2 ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(make#1, model#2, 200), true, [id=#94]
            :     +- *(1) Filter ((isnotnull(engine_size#3) AND isnotnull(make#1)) AND isnotnull(model#2))
            :        +- *(1) Scan ExistingRDD[registration#0,make#1,model#2,engine_size#3]
            +- *(4) Sort [make#32 ASC NULLS FIRST, model#33 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(make#32, model#33, 200), true, [id=#100]
                  +- *(3) Filt

In [13]:
%%time

joined.count()

CPU times: user 46.2 ms, sys: 28.4 ms, total: 74.6 ms
Wall time: 1min 15s


20000

## 2.2 Optimized Skewed Join (AQE)

Now we will enable the Adaptive Query Execution in Spark and configure some thresholds such that it will work nicely with our rather small data sets.

In [14]:
# Enable AQE. Default in Spark 3.0: False
spark.conf.set("spark.sql.adaptive.enabled", True)
# Enable skewed join optimization. Default: True
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)

# The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). Default: 64MB
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "8KB")
# A partition is considered as skewed if its size is larger than this factor multiplying the median partition size and also larger than 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'. Default: 5
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", 2)
# A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than 'spark.sql.adaptive.skewJoin.skewedPartitionFactor' multiplying the median partition size. Default: 256MB
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "16KB")
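Read together, the two skew settings form a conjunction: a shuffle partition counts as skewed only if it exceeds both the factor times the median partition size and the absolute threshold. The following plain-Python sketch mirrors that rule as described in the configuration comments, using the values set above. The partition sizes are made up for illustration, the helper name is hypothetical, and the chunk count is only a rough approximation of how the advisory size guides the split, not Spark's exact algorithm.

```python
import math
from statistics import median

# Values from the configuration cell above, converted to bytes
ADVISORY_SIZE = 8 * 1024     # spark.sql.adaptive.advisoryPartitionSizeInBytes
SKEW_FACTOR = 2              # spark.sql.adaptive.skewJoin.skewedPartitionFactor
SKEW_THRESHOLD = 16 * 1024   # spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes


def is_skewed(size, median_size):
    # Both conditions must hold for a partition to be treated as skewed
    return size > SKEW_FACTOR * median_size and size > SKEW_THRESHOLD


# Hypothetical shuffle partition sizes in bytes
partition_sizes = [4_000, 5_000, 6_000, 5_500, 120_000]
median_size = median(partition_sizes)

for size in partition_sizes:
    if is_skewed(size, median_size):
        chunks = math.ceil(size / ADVISORY_SIZE)
        print(f"{size} bytes: skewed, split into roughly {chunks} chunks")
    else:
        print(f"{size} bytes: not skewed")
```

With the default threshold of 256MB none of the partitions in this small experiment would ever qualify as skewed, which is why the thresholds are lowered to a few kilobytes here.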

In [15]:
joined = t1.join(t2, ["make", "model"]) \
    .filter(f.abs(t1["engine_size"] - t2["engine_size"]) < 0.1) \
    .groupBy("registration") \
    .agg(f.avg("sales_price").alias("avg_sales_price"))

joined.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[registration#0], functions=[avg(sales_price#35L)])
   +- Exchange hashpartitioning(registration#0, 200), true, [id=#234]
      +- HashAggregate(keys=[registration#0], functions=[partial_avg(sales_price#35L)])
         +- Project [registration#0, sales_price#35L]
            +- SortMergeJoin [make#1, model#2], [make#32, model#33], Inner, (abs((engine_size#3 - engine_size#34)) < 0.1)
               :- Sort [make#1 ASC NULLS FIRST, model#2 ASC NULLS FIRST], false, 0
               :  +- Exchange hashpartitioning(make#1, model#2, 200), true, [id=#226]
               :     +- Filter ((isnotnull(engine_size#3) AND isnotnull(make#1)) AND isnotnull(model#2))
               :        +- Scan ExistingRDD[registration#0,make#1,model#2,engine_size#3]
               +- Sort [make#32 ASC NULLS FIRST, model#33 ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(make#32, model#33, 200), true, [id=#227]


In [16]:
%%time

joined.count()

CPU times: user 116 ms, sys: 79.8 ms, total: 196 ms
Wall time: 14.8 s


20000