# Execution Plan

In this notebook we try to understand Spark execution plans. We will use the weather example and analyse all the steps in order to get a better understanding.

## Execution Model of Spark

In contrast to many other (mainly non-distributed) frameworks, Spark does not execute any transformation immediately, but only records the step and builds a so-called execution plan. This plan is the basis for Spark's resilience against the failure of individual nodes (since a lost result can be reconstructed from the execution plan), and it also allows Spark to perform optimizations which span all transformation steps.
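The idea of recording steps instead of executing them can be illustrated with a small toy class in plain Python. This is only a sketch of the concept, not Spark's implementation: `LazyFrame`, its methods and the step descriptions are hypothetical names chosen for this example.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame (hypothetical, not a Spark class)."""

    def __init__(self, rows, plan=()):
        self._rows = list(rows)    # source data, never modified
        self._plan = tuple(plan)   # recorded steps as (description, function)

    def _extend(self, description, step):
        # A transformation only returns a new object with a longer plan
        return LazyFrame(self._rows, self._plan + ((description, step),))

    def select(self, fn, description):
        return self._extend(description, lambda rows: (fn(r) for r in rows))

    def where(self, predicate, description):
        return self._extend(description, lambda rows: (r for r in rows if predicate(r)))

    def explain(self):
        # Shows the recorded plan without touching any data
        return [description for description, _ in self._plan]

    def collect(self):
        # The action: only now are the recorded steps applied to the data
        rows = iter(self._rows)
        for _, step in self._plan:
            rows = step(rows)
        return list(rows)


calls = []

def double(x):
    calls.append(x)   # lets us observe when the function really runs
    return 2 * x

frame = LazyFrame([1, 2, 3]).select(double, "Project(double)").where(lambda x: x > 2, "Filter(> 2)")
plan = frame.explain()
calls_before_action = len(calls)   # still 0, nothing has been executed yet
result = frame.collect()
```

Because the plan and the source data are kept, calling `collect()` a second time recomputes the same result from scratch, which mirrors how a lost result can be reconstructed.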

Specifically with Spark DataFrames (as opposed to the more low-level RDD interface), Spark uses an advanced optimizer. The general steps of query processing in response to an action (like a "show" or "save" action) are always as follows:
1. Parse logical execution plan
2. Analyze logical execution plan and resolve all symbols (tables, columns, functions)
3. Optimize logical execution plan
4. Create physical execution plan by mapping all steps to RDD operations

## Relation to RDDs
Note that RDDs are only used in the very last step, although the common conception is that DataFrames sit on top of RDDs. The point is that a DataFrame first collects all transformations on a higher level of abstraction, and RDDs only come into play in this very last step.

You can actually access the RDD of any DataFrame, but this access creates the physical execution plan for this specific RDD. The RDD did not even exist before it was accessed. This also means that using a DataFrame's RDD is an optimization barrier, since Spark cannot optimize across operations that are performed on the RDD itself.

## Weather Example

In the following steps, we will try to understand how Spark executes a simplified version of the weather analysis including aggregations and joins.

In [None]:
spark.conf.set("spark.sql.adaptive.enabled", False)

# 1. Load Data

First we load the weather data, which consists of the measurement data and some station metadata.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

## 1.1 Load Measurements

Measurements are stored in multiple directories (one per year).

In [None]:
from pyspark.sql.functions import *
from functools import reduce

# Read in all years, store them in a Python list
raw_weather_per_year = [spark.read.text(storageLocation + "/" + str(i)).withColumn("year", lit(i)) for i in range(2003,2006)]

# Union all years together
raw_weather = reduce(lambda l,r: l.union(r), raw_weather_per_year)                        

# Display first 10 records
raw_weather.limit(10).toPandas()

### Extract Measurements

Measurements are stored in a proprietary text-based format, with values at fixed positions. We extract these values with a simple `SELECT` statement.

In [None]:
weather = raw_weather.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    substring(col("value"),16,8).alias("date"),
    substring(col("value"),24,4).alias("time"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)
    
weather.limit(10).toPandas()

## 1.2 Load Station Metadata

We also need to load the weather station metadata, which contains information about the geolocation, country, etc. of individual weather stations.

In [None]:
stations = spark.read \
    .option("header", True) \
    .csv(storageLocation + "/isd-history")

# Display first 10 records    
stations.limit(10).toPandas()

## 1.3 Perform Analysis

Now, for completeness' sake, let's redo the analysis (minimum and maximum temperature and wind speed per year and country) using `JOIN` and `GROUP BY` operations.

In [None]:
df = weather.join(stations, (weather.usaf == stations.USAF) & (weather.wban == stations.WBAN))
result = df.groupBy(df.CTRY, df.year).agg(
        min(when(df.air_temperature_qual == lit(1), df.air_temperature)).alias('min_temp'),
        max(when(df.air_temperature_qual == lit(1), df.air_temperature)).alias('max_temp'),
        min(when(df.wind_speed_qual == lit(1), df.wind_speed)).alias('min_wind'),
        max(when(df.wind_speed_qual == lit(1), df.wind_speed)).alias('max_wind')
    )

pdf = result.toPandas()    
pdf

# 2. Investigate Execution Plans

Now that we have redone the whole analysis, let's try to understand how Spark actually executes these steps. In order to understand the whole aggregation, we start simple, add one step after the other and see how the execution plans change.

## 2.1 Reading Data

The first step is to read in data. In order to start simple, we only load a single year into a DataFrame called `raw_weather_2003`. We can inspect the execution plan that would create the records of that DataFrame with the `explain()` method.

In [None]:
raw_weather_2003 = spark.read.text(storageLocation + "/2003")
## YOUR CODE HERE

As we can see, the execution plan contains a single operation: reading data from disk. Note two things:
* The physical execution plan has been created specifically for the `explain()` command. It is not stored in the DataFrame; the DataFrame only contains the basis for a *parsed logical plan*.
* The plan is not executed, only printed to the console.

We can also inspect a more detailed execution plan if we pass `True` to the `explain()` method as follows:

In [None]:
## YOUR CODE HERE

As you can see, the explanation now contains all four steps:
* Parsed logical execution plan. This directly corresponds to the operations as specified.
* Analyzed logical plan. This resolves all relations and columns and data types.
* Optimized logical plan. This plan is already optimized (we'll see some optimizations later)
* Physical execution plan. This maps all operations and transformations to RDD operations.

## 2.2 Adding Columns

Let's see how the execution plan changes if we add a new column.

In [None]:
raw_weather_2003 = spark.read.text(storageLocation + "/2003").withColumn("year", lit(2003))
## YOUR CODE HERE

### Remarks
We see that a `Project` operation was inserted into all execution plans, which is responsible for adding the `year` column.

## 2.3 SELECT Operation

Now let's perform an additional `SELECT` operation after adding the year. We do not add all columns yet in order to keep the output small and more readable. We will add more columns later when we really require them.

In [None]:
weather_2003 = raw_weather_2003.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban")
)
## YOUR CODE HERE

### Remarks
Here we see that the original parsed plan and the analyzed plan each contain two `Project` operations. Each of them corresponds to a single transformation (`withColumn` and `select`). But the optimizer merged these operations into a single one, thus simplifying execution.
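The effect of merging two stacked projections can be sketched in plain Python. This is only an illustration of the idea and not how the Spark optimizer is implemented: the function `collapse_projects`, the dictionaries and the sample record are hypothetical. The Python slices `[4:10]` and `[10:15]` correspond to `substring(value, 5, 6)` and `substring(value, 11, 5)`.

```python
def collapse_projects(outer, inner):
    """Merge two stacked projections into a single one.

    A projection maps an output column name to a function of the input record.
    The merged projection evaluates the outer expressions on top of the inner
    ones, so each input record is only passed through one projection.
    """
    def bind(expr):
        return lambda row: expr({name: f(row) for name, f in inner.items()})
    return {name: bind(expr) for name, expr in outer.items()}


# Corresponds to withColumn("year", lit(2003)): keep "value", add "year"
with_column = {
    "value": lambda row: row["value"],
    "year": lambda row: 2003,
}

# Corresponds to the select with two substring expressions
select = {
    "year": lambda row: row["year"],
    "usaf": lambda row: row["value"][4:10],
    "wban": lambda row: row["value"][10:15],
}

merged = collapse_projects(select, with_column)

record = {"value": "0000ABCDEF12345"}   # made-up record, not real weather data
output = {name: expr(record) for name, expr in merged.items()}
```

The merged projection produces `year`, `usaf` and `wban` directly from the raw `value` column, just as the single `Project` in the optimized plan does.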

## 2.4 UNION Operation

Just for completeness, let's see what a `UNION` operation does. We required it after loading all years into individual DataFrames.

In [None]:
# Read in all years, store them in a Python list
raw_weather_per_year = [spark.read.text(storageLocation + "/" + str(i)).withColumn("year", lit(i)) for i in range(2003,2015)]

# Union all years together
raw_weather = reduce(lambda l,r: l.union(r), raw_weather_per_year)                        

# Print execution plan
## YOUR CODE HERE

## 2.5 JOIN Operation

The next operation we had to perform was a `JOIN` between the measurements and the station metadata. We will use only a single year instead of the unioned data to keep output small and thereby increase readability of the execution plans.

In [None]:
df = ## YOUR CODE HERE

### Remarks
Now a `JOIN` results in an interesting execution plan:
* Spark filters out records with `NULL` values in the join columns, since an inner JOIN can only match non-null values
* Filtering is actually pushed down before the projection. This reduces the amount of data as early as possible
* JOIN operation is performed in two steps:
  * Load data and broadcast it to all nodes (`BroadcastExchange`)
  * Perform the join (`BroadcastHashJoin`)

In addition to the *broadcast join* Spark also supports a different join implementation - more on that later.
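The two steps of a broadcast join, together with the implicit filtering of `NULL` keys, can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions (partitions are plain lists, the broadcast is just a shared dictionary); the function names and the sample records are made up for this example.

```python
def make_key(*columns):
    """Build a key function; the key is None if any join column is NULL."""
    def key(row):
        values = tuple(row[c] for c in columns)
        return None if None in values else values
    return key


def broadcast_hash_join(large_partitions, small_rows, large_key, small_key):
    # Step 1 (BroadcastExchange): build a hash table from the small side.
    # In a cluster, a copy of this table would be sent to every node.
    lookup = {}
    for row in small_rows:
        key = small_key(row)
        if key is not None:                 # NULL keys can never match
            lookup.setdefault(key, []).append(row)

    # Step 2 (BroadcastHashJoin): every partition of the large side probes
    # the table independently, so the large side is never shuffled.
    joined_partitions = []
    for partition in large_partitions:
        joined = []
        for row in partition:
            key = large_key(row)
            if key is None:                 # implicit IS NOT NULL filter
                continue
            for match in lookup.get(key, []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


weather_partitions = [
    [{"usaf": "010010", "wban": "99999", "air_temperature": 1.5},
     {"usaf": None, "wban": "99999", "air_temperature": 3.0}],
    [{"usaf": "999999", "wban": "00001", "air_temperature": 7.0},
     {"usaf": "010010", "wban": "99999", "air_temperature": -2.0}],
]
station_rows = [
    {"USAF": "010010", "WBAN": "99999", "CTRY": "NO"},
    {"USAF": "020020", "WBAN": "99999", "CTRY": "SW"},
]

joined = broadcast_hash_join(
    weather_partitions, station_rows,
    make_key("usaf", "wban"), make_key("USAF", "WBAN"),
)
```

The record with a `NULL` station ID and the record without a matching station are both dropped, which is the behaviour of an inner join.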

### Implicit Filtering

Actually let's have a look at what happens with a left outer join. This should not filter away `NULL` values on the left side:

In [None]:
## YOUR CODE HERE

## 2.6 Aggregation

Finally we want to perform an aggregation on the joined data. We need to restart from the measurement extraction, since we did not extract all required columns so far. So we will perform the following steps:
* Reuse `raw_weather_2003`, which already contains the `year` column
* Extract all required measurements
* Join with stations metadata
* Perform grouped aggregation

Again we will only analyze the temperature, just to keep the execution plans a little bit smaller. This means that some columns are missing, but the basic operations are all the same.

### Extract Measurements

In [None]:
weather_2003 = raw_weather_2003.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    substring(col("value"),16,8).alias("date"),
    substring(col("value"),24,4).alias("time"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)
## YOUR CODE HERE

### Join with Stations Metadata

In [None]:
df = weather_2003.join(stations, (weather_2003.usaf == stations.USAF) & (weather_2003.wban == stations.WBAN))
## YOUR CODE HERE

### Perform Grouped Aggregation

In [None]:
## YOUR CODE HERE

### Remarks

Again we can see that Spark performs some simple but clever optimizations:
* Projections only contain the required columns, not all available columns of `df`. The required columns are recursively *pushed up* the transformation chain from the last operation (grouped aggregation) to the first transformations
* The aggregation is performed in three steps: 
  * Partial aggregation (`HashAggregate` with `partial_...` functions)
  * Shuffle (`Exchange hashpartitioning`)
  * Final aggregation of partial results (`HashAggregate`)
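The three aggregation steps can be mimicked in plain Python to see why they produce the correct result. This is a minimal sketch under simplifying assumptions: partitions are plain lists of `(key, value)` pairs, the partitioner is a CRC32 checksum chosen only for this example, and all function names and sample values are hypothetical.

```python
import zlib
from collections import defaultdict


def partial_aggregate(partition):
    """Step 1: pre-aggregate inside one partition, one (min, max) state per key."""
    partial = {}
    for key, value in partition:
        if key not in partial:
            partial[key] = (value, value)
        else:
            lo, hi = partial[key]
            # Plain comparisons instead of min()/max(), since the notebook's
            # wildcard import from pyspark.sql.functions shadows those built-ins
            partial[key] = (value if value < lo else lo, value if value > hi else hi)
    return partial


def shuffle(partials, num_partitions):
    """Step 2: send all partial states of the same key to the same partition."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for partial in partials:
        for key, state in partial.items():
            target = zlib.crc32(repr(key).encode("utf-8")) % num_partitions
            shuffled[target][key].append(state)
    return shuffled


def final_aggregate(partition):
    """Step 3: merge the partial states of each key into the final result."""
    final = {}
    for key, states in partition.items():
        lo, hi = states[0]
        for other_lo, other_hi in states[1:]:
            lo = other_lo if other_lo < lo else lo
            hi = other_hi if other_hi > hi else hi
        final[key] = (lo, hi)
    return final


# Made-up records of the form ((country, year), air_temperature)
partitions = [
    [(("NO", 2003), -5.0), (("SW", 2003), 2.0), (("NO", 2003), 10.0)],
    [(("NO", 2003), -12.5), (("SW", 2003), 21.0)],
]

partials = [partial_aggregate(p) for p in partitions]
shuffled = shuffle(partials, num_partitions=2)

result = {}
for partition in shuffled:
    result.update(final_aggregate(partition))
```

The benefit of the partial aggregation is that only one small state per key and partition has to cross the network, instead of every single record.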

## 2.7 Sorting

The last operation we would like to analyze is sorting. To keep execution plans simple, we just sort the `stations` DataFrame by the station IDs.

In [None]:
## YOUR CODE HERE

### Remarks

In order to have a globally sorted result, it is not enough to sort within each Spark partition. This implies that some kind of shuffle operation has to be executed. In contrast to all our previous examples, this time Spark uses a `rangepartitioning`, which splits up all data according to the range of the sorting key. After that is done, records are sorted independently within each partition. Since the ranges are non-overlapping, this is enough for a global ordering covering all partitions.
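The combination of range partitioning and local sorting can be sketched in plain Python. This is a simplified illustration: the range boundaries are passed in as fixed values here, and the function name and the station IDs are made up for this example.

```python
import bisect


def range_partition_sort(partitions, boundaries):
    """Shuffle keys into non-overlapping ranges, then sort each range locally."""
    # One output partition per range: (-inf, b0], (b0, b1], ..., (b_last, +inf)
    ranges = [[] for _ in range(len(boundaries) + 1)]
    for partition in partitions:
        for key in partition:
            ranges[bisect.bisect_right(boundaries, key)].append(key)
    # Each range partition is sorted independently of all the others
    return [sorted(r) for r in ranges]


# Made-up station IDs spread over two unsorted input partitions
input_partitions = [
    ["725300", "010010", "999999"],
    ["030050", "485600", "010020"],
]

sorted_ranges = range_partition_sort(input_partitions, boundaries=["300000", "600000"])

# Reading the range partitions one after the other yields a global order
flat = [key for r in sorted_ranges for key in r]
```

No merge step is needed at the end, because every key in one range partition is smaller than every key in the next one.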