# Lab : Spark AQE

Experiment with Spark Adaptive Query Execution (AQE).

References:
- http://blog.madhukaraphatak.com/spark-aqe-part-2/
- https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
- https://docs.databricks.com/spark/latest/spark-sql/aqe.html
- https://docs.databricks.com/_static/notebooks/aqe-demo.html
- https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution

## Step-1: Enable AQE in Spark Config

AQE is off by default in Spark 3.0 and 3.1, and on by default from Spark 3.2 onward. To turn it on explicitly, set `spark.sql.adaptive.enabled=true` in the Spark config.

In [None]:
import findspark
findspark.init()  # uses SPARK_HOME
print("Spark found in : ", findspark.find())

import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession



# use a unique temp dir as the warehouse dir, so we can run multiple Spark sessions from one directory
import tempfile
tmpdir = tempfile.TemporaryDirectory()

config = ( SparkConf()
         .setAppName("TestApp")
         .setMaster("local[*]")
         #.setMaster("spark://f96e0987354e:7077")
         .set('spark.executor.memory', '2g')
         .set('spark.sql.warehouse.dir', tmpdir.name)
         .set('spark.sql.adaptive.enabled', 'true')
         .set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
         )

print("Spark config:\n\t", config.toDebugString().replace("\n", "\n\t"))
spark = SparkSession.builder.config(conf=config).getOrCreate()
print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])
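The `split(':')[2]` call above assumes the UI URL always looks like `http://host:port`. A slightly more defensive sketch parses the port with the standard library. `ui_port` is a hypothetical helper name, not part of Spark. It returns `None` when no URL or port is available, for example when the UI is disabled.

```python
from urllib.parse import urlparse


def ui_port(ui_web_url):
    """Return the port of a Spark UI URL, or None if it cannot be determined.

    `ui_port` is an illustrative helper, not a Spark API.
    """
    if not ui_web_url:
        return None
    return urlparse(ui_web_url).port


# Usage in the notebook (requires the `spark` session created above):
# print('Spark UI running on port', ui_port(spark.sparkContext.uiWebUrl))
```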

In [None]:
# check if AQE is enabled
spark.conf.get('spark.sql.adaptive.enabled')

# spark.conf.set('spark.sql.adaptive.coalescePartitions.minPartitionNum', 1)

### Verify in Spark App UI

Check the **environment** tab to see if Adaptive mode is turned on.

![](../assets/images/aqe-4.png)
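The same check can be done from the notebook instead of the UI. The sketch below collects a few AQE-related settings into a dictionary. `aqe_settings` and `AQE_KEYS` are hypothetical names chosen for this lab, and the list of keys is only a sample of the `spark.sql.adaptive.*` family. The helper takes any `get(key, default)` callable, so it should work with `spark.conf.get`.

```python
# A sample of AQE-related configuration keys (not an exhaustive list).
AQE_KEYS = [
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes",
]


def aqe_settings(conf_get, keys=AQE_KEYS):
    """Look up each key with conf_get(key, default) and return a dict.

    `conf_get` is any callable with that signature, e.g. `spark.conf.get`.
    """
    return {key: conf_get(key, "<not set>") for key in keys}


# Usage in the notebook (requires the `spark` session created above):
# for key, value in aqe_settings(spark.conf.get).items():
#     print(key, "=", value)
```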

## Step-2: Generate some large data

We will generate some clickstream data

In [None]:
%%time 
# generate large clickstream data


! [ ! -d /data/click-stream/json/ ] && cd /data/click-stream  && python gen-clickstream-json.py 

! ls -lh  /data/click-stream/json/

## Step-3: Load Data

In [None]:
%%time

# load clickstream JSON -- this is a large table, about 1.4 GB in size
clickstream = spark.read.json("../data/click-stream/json/")

## The following tests the 'spark.sql.adaptive.coalescePartitions.enabled' optimization
## by deliberately creating too many small partitions
# clickstream = spark.read.json("../data/click-stream/json/").repartition(500)

print ("Partitions # : ", clickstream.rdd.getNumPartitions())

## Step-4: Query

In [None]:
clickstream.createOrReplaceTempView("clickstream")

In [None]:
s = """
select domain, count(*) total
from clickstream
where cost > 100
group by domain
order by total DESC
"""

## Step-5: See Execution Plan

You will see **AdaptiveSparkPlan**, which means AQE is active.

Also notice **isFinalPlan=false**: the query has not run yet, so AQE has no runtime statistics to re-optimize the plan with.

```text
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
```

In [None]:
spark.sql(s).explain(extended=True)
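To check the plan programmatically rather than by eye, the plan text can be scanned for the two markers mentioned above. This is a minimal sketch: `summarize_plan` is a hypothetical helper, and it assumes the plan text contains `AdaptiveSparkPlan` followed by `isFinalPlan=true` or `isFinalPlan=false`, as in the sample output shown in this step.

```python
import re


def summarize_plan(plan_text):
    """Report whether a plan string mentions AdaptiveSparkPlan and its isFinalPlan flag.

    Returns {"adaptive": bool, "final": bool or None}; "final" is None
    when no AdaptiveSparkPlan node is found.
    """
    # \W* tolerates a space or an opening parenthesis between the two tokens.
    match = re.search(r"AdaptiveSparkPlan\W*isFinalPlan=(true|false)", plan_text)
    return {
        "adaptive": match is not None,
        "final": (match.group(1) == "true") if match else None,
    }


sample = "== Physical Plan ==\nAdaptiveSparkPlan isFinalPlan=false\n"
print(summarize_plan(sample))
```

One way to get the plan text in the notebook, assuming `explain()` prints from Python, is to wrap the call in `contextlib.redirect_stdout` and pass the captured string to the helper.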

## Step-6: Execute Query

In [None]:
spark.sql(s).show()

## Step-7: Inspect Spark Application UI (SQL Tab)

On the query's page, expand the **Details** section at the bottom.

**==>Compare the initial plan and final plan.**

![](../assets/images/dataframe-7-sql.png)

![](../assets/images/aqe-2-final-plan.png)

## Step-8: Experiment

We can test whether AQE can handle too many small partitions. This behavior is controlled by the `spark.sql.adaptive.coalescePartitions.enabled` property.

In Step-3 (Load Data), change the load call as follows:

```python
clickstream = spark.read.json("../data/click-stream/json/").repartition(500)
```

Here we create 500 partitions, which is too many, and each one is too small.

Then rerun the notebook with **Kernel --> restart kernel and run all cells**.

Inspect the query in the SQL tab.

Search for the 'partition' keyword. You will see how the partition counts change.

![](../assets/images/aqe-3-partitions.png)
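To build intuition for what partition coalescing does, here is a simplified model in plain Python. It is not Spark's actual algorithm. It greedily merges adjacent partitions until adding the next one would exceed a target size. The 64 MB target mirrors the default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`. The 10 MB partition sizes are invented for illustration, and `coalesce_partitions` is a hypothetical helper name.

```python
def coalesce_partitions(sizes, target_size):
    """Greedily group adjacent partition sizes so each group stays <= target_size.

    A single partition larger than target_size is kept as its own group.
    Simplified model only; real AQE applies additional rules.
    """
    groups = []
    current = []
    current_size = 0
    for size in sizes:
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
small_partitions = [10 * MB] * 500        # made-up sizes: 500 partitions of 10 MB
merged = coalesce_partitions(small_partitions, 64 * MB)
print(len(small_partitions), "->", len(merged))  # 500 -> 84
```

In this model each merged partition holds six of the small ones (60 MB), so 500 partitions collapse to 84. The numbers in the SQL tab will differ, because they depend on the real shuffle sizes.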