# Repartitioning DataFrames

Partitions are a central concept in Apache Spark. They are used for distributing and parallelizing work onto different executors, which run on multiple servers. 

### Determining Partitions
Spark uses two different strategies for splitting up data into multiple partitions:
1. When Spark loads data, the records are put into partitions along natural borders. For example, every HDFS block (and thereby every file) is represented by a different partition. Therefore the number of partitions of a DataFrame read from disk is largely determined by the number of HDFS blocks.
2. Certain operations like `JOIN`s and aggregations require that records with the same key are physically in the same partition. This is achieved by a shuffle phase. The number of partitions is specified by the global Spark configuration variable `spark.sql.shuffle.partitions` which has a default value of 200.

### Repartitioning Data
Since partitions have a huge influence on the execution, Spark also allows you to explicitly change the partitioning scheme of a DataFrame. This makes sense only in a very limited (but still important) set of cases, which we will discuss in this notebook.

### Weather Example
Surprise, surprise, we will again use the weather example and see what explicit repartitioning gives us.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","24G") \
        .getOrCreate()

spark

### Disable Automatic Broadcast JOINs
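In order to see the shuffle operations, we need to prevent Spark from executing `JOIN` operations as broadcast joins. Again this can be turned off by setting the Spark configuration variable `spark.sql.autoBroadcastJoinThreshold` to -1. The cell below also disables adaptive query execution (`spark.sql.adaptive.enabled`), so that the execution plans are printed in their static form.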

In [2]:
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# 1 Load Data

First we load the weather data, which consists of the measurement data and some station metadata.

In [3]:
storageLocation = "s3://dimajix-training/data/weather"
# storageLocation = "/dimajix/data/weather-noaa-sample"

## 1.1 Load Measurements

Measurements are stored in multiple directories (one per year). But we will limit ourselves to a single year in the analysis to improve readability of execution plans.

In [5]:
from functools import reduce

# Read in all years, store them in a Python list
raw_weather_per_year = [spark.read.text(storageLocation + "/" + str(i)).withColumn("year", f.lit(i)) for i in range(2003,2006)]

# Union all years together
raw_weather = reduce(lambda l,r: l.union(r), raw_weather_per_year)                        

Use a single year to keep execution plans small

In [6]:
raw_weather = spark.read.text(storageLocation + "/2003").withColumn("year", f.lit(2003))

### Extract Measurements

Measurements were stored in a proprietary text-based format, with some values at fixed positions. We need to extract these values with a simple `SELECT` statement.

In [7]:
weather = raw_weather.select(
    f.col("year"),
    f.substring(f.col("value"),5,6).alias("usaf"),
    f.substring(f.col("value"),11,5).alias("wban"),
    f.substring(f.col("value"),16,8).alias("date"),
    f.substring(f.col("value"),24,4).alias("time"),
    f.substring(f.col("value"),42,5).alias("report_type"),
    f.substring(f.col("value"),61,3).alias("wind_direction"),
    f.substring(f.col("value"),64,1).alias("wind_direction_qual"),
    f.substring(f.col("value"),65,1).alias("wind_observation"),
    (f.substring(f.col("value"),66,4).cast("float") / f.lit(10.0)).alias("wind_speed"),
    f.substring(f.col("value"),70,1).alias("wind_speed_qual"),
    (f.substring(f.col("value"),88,5).cast("float") / f.lit(10.0)).alias("air_temperature"),
    f.substring(f.col("value"),93,1).alias("air_temperature_qual")
)

## 1.2 Load Station Metadata

We also need to load the weather station metadata containing information about the geo location, country etc. of individual weather stations.

In [8]:
stations = spark.read \
    .option("header", True) \
    .csv(storageLocation + "/isd-history")

                                                                                

# 2 Partitions

Since partitioning is a concept at the RDD level and a DataFrame does not expose its partitions directly, we need to access the underlying RDD in order to inspect the number of partitions.

In [9]:
weather.rdd.getNumPartitions()

5

## 2.1 Repartitioning Data

You can repartition any DataFrame by specifying the target number of partitions and the partitioning columns. While it should be clear what *number of partitions* actually means, the term *partitioning columns* might require some explanation.

### Partitioning Columns
Except for the case when Spark initially reads data, all DataFrames are partitioned along *partitioning columns*, which means that all records having the same values in the corresponding columns will end up in the same partition. Spark implicitly performs such repartitioning as shuffle operations for `JOIN`s and grouped aggregations (except when a DataFrame already has the correct partitioning columns and number of partitions).
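As a rough mental model of partitioning columns, the following pure-Python sketch assigns each record to a partition by hashing its key columns. It is only an illustration and does not use Spark at all: `partition_for` and `hash_partition` are hypothetical helper names, and `zlib.crc32` merely stands in for the hash function Spark uses internally.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key tuple to a partition index (crc32 is a stand-in hash)."""
    digest = zlib.crc32("|".join(key).encode("utf-8"))
    return digest % num_partitions


def hash_partition(records, key_columns, num_partitions):
    """Group records (dicts) into partitions by the hash of their key columns."""
    partitions = defaultdict(list)
    for record in records:
        key = tuple(record[c] for c in key_columns)
        partitions[partition_for(key, num_partitions)].append(record)
    return partitions


# Made-up sample records, two of them share the same (usaf, wban) key
records = [
    {"usaf": "010010", "wban": "99999", "air_temperature": -1.5},
    {"usaf": "010014", "wban": "99999", "air_temperature": 3.0},
    {"usaf": "010010", "wban": "99999", "air_temperature": 0.5},
]
partitions = hash_partition(records, ["usaf", "wban"], num_partitions=10)
```

Records with the same `usaf` and `wban` values always hash to the same index, so they end up in the same partition no matter where they started.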

### Manual Repartitioning
As already mentioned, you can explicitly repartition a DataFrame using the `repartition()` method.

In [10]:
result = weather.repartition(10, weather["usaf"], weather["wban"])
result.rdd.getNumPartitions()

10

### Effect of Repartition

Apart from introducing an additional shuffle operation, repartitioning a dataset will effectively control the level of parallelism.

In [11]:
result = weather.repartition(20).select(f.count("*"))
result.toPandas()

                                                                                

|   | count(1) |
|---|----------|
| 0 | 1807253  |


In [12]:
result.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=66]
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- Exchange RoundRobinPartitioning(20), REPARTITION_BY_NUM, [plan_id=58]
         +- FileScan text [] Batched: false, DataFilters: [], Format: Text, Location: InMemoryFileIndex(1 paths)[s3://dimajix-training/data/weather/2003], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
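The plan above shows `RoundRobinPartitioning(20)`: when no partitioning columns are given, records are distributed without looking at their content. A minimal sketch of that idea in plain Python follows. The helper name `round_robin_partition` is made up for this illustration, and Spark's real implementation differs in its details.

```python
def round_robin_partition(records, num_partitions):
    """Distribute records evenly over partitions, ignoring their content."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


# 103 dummy records spread over 20 partitions
partitions = round_robin_partition(list(range(103)), num_partitions=20)
sizes = [len(p) for p in partitions]
```

The resulting partitions differ in size by at most one record, which is why this kind of repartitioning is a simple way to balance work across tasks.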




# 3 Repartition & Joins

As already mentioned, Spark implicitly performs a repartitioning aka shuffle for `JOIN` operations.

### Execution Plan

So let us inspect the execution plan of a `JOIN` operation.

In [13]:
result = weather.join(stations, ["usaf", "wban"])
result.explain()

== Physical Plan ==
*(5) Project [usaf#122, wban#123, 2003 AS year#118, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133, STATION NAME#166, CTRY#167, STATE#168, ICAO#169, LAT#170, LON#171, ELEV(M)#172, BEGIN#173, END#174]
+- *(5) SortMergeJoin [usaf#122, wban#123], [USAF#164, WBAN#165], Inner
   :- *(2) Sort [usaf#122 ASC NULLS FIRST, wban#123 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(usaf#122, wban#123, 200), ENSURE_REQUIREMENTS, [plan_id=114]
   :     +- *(1) Project [substring(value#116, 5, 6) AS usaf#122, substring(value#116, 11, 5) AS wban#123, substring(value#116, 16, 8) AS date#124, substring(value#116, 24, 4) AS time#125, substring(value#116, 42, 5) AS report_type#126, substring(value#116, 61, 3) AS wind_direction#127, substring(value#116, 64, 1) AS wind_direction_qual#128, substring(value#116, 65, 1) AS wind_observation#1

### Remarks

As we already discussed, each `JOIN` is executed with the following steps:
1. Filter `NULL` values (it's an inner join)
2. Repartition DataFrame on the join columns with 200 partitions
3. Sort each partition independently
4. Perform a `SortMergeJoin`
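The steps above can be sketched for a single pair of partitions in plain Python. This is only an illustration of the sort-merge idea under simplified assumptions (rows are dicts, everything fits in memory). The names `prepare`, `join_key` and `sort_merge_join` and the sample rows are invented for this example.

```python
def join_key(row):
    return (row["usaf"], row["wban"])


def prepare(rows):
    """Steps 1 and 3: drop rows with NULL keys, then sort by the join key."""
    return sorted((r for r in rows if None not in join_key(r)), key=join_key)


def sort_merge_join(left, right):
    """Step 4: inner-join two lists that are already sorted by join_key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = join_key(left[i]), join_key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right side sharing this key
            j_end = j
            while j_end < len(right) and join_key(right[j_end]) == lk:
                j_end += 1
            # Combine every left row of this key with the whole right run
            while i < len(left) and join_key(left[i]) == lk:
                result.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return result


weather_rows = [
    {"usaf": "010014", "wban": "99999", "air_temperature": 3.0},
    {"usaf": "010010", "wban": "99999", "air_temperature": -1.5},
    {"usaf": None, "wban": "99999", "air_temperature": 9.9},
    {"usaf": "010010", "wban": "99999", "air_temperature": 0.5},
    {"usaf": "999999", "wban": "00001", "air_temperature": 7.0},
]
station_rows = [
    {"usaf": "010014", "wban": "99999", "name": "station B"},
    {"usaf": "010010", "wban": "99999", "name": "station A"},
    {"usaf": "010020", "wban": "99999", "name": "station C"},
]
joined = sort_merge_join(prepare(weather_rows), prepare(station_rows))
```

Step 2 (the shuffle) is left out here: it is what guarantees that both lists contain all rows for the keys they share, so that each pair of partitions can be merged independently.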

## 3.1 Pre-partition data (first try)

Now let us try what happens when we explicitly repartition the data before the join operation.

In [14]:
weather_rep = weather.repartition(10, weather["usaf"], weather["wban"])
weather_rep.rdd.getNumPartitions()

10

In [15]:
stations_rep = stations.repartition(10, stations["usaf"], stations["wban"])
stations_rep.rdd.getNumPartitions()

10

#### Execution Plan

Let's analyze the resulting execution plan. Ideally the explicit repartitioning would make an additional shuffle before the `SortMergeJoin` unnecessary.

In [16]:
result = weather_rep.join(stations_rep, ["usaf","wban"])
result.explain()

== Physical Plan ==
*(5) Project [usaf#122, wban#123, 2003 AS year#118, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133, STATION NAME#166, CTRY#167, STATE#168, ICAO#169, LAT#170, LON#171, ELEV(M)#172, BEGIN#173, END#174]
+- *(5) SortMergeJoin [usaf#122, wban#123], [USAF#164, WBAN#165], Inner
   :- *(2) Sort [usaf#122 ASC NULLS FIRST, wban#123 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(usaf#122, wban#123, 200), REPARTITION_BY_NUM, [plan_id=315]
   :     +- *(1) Project [substring(value#116, 5, 6) AS usaf#122, substring(value#116, 11, 5) AS wban#123, substring(value#116, 16, 8) AS date#124, substring(value#116, 24, 4) AS time#125, substring(value#116, 42, 5) AS report_type#126, substring(value#116, 61, 3) AS wind_direction#127, substring(value#116, 64, 1) AS wind_direction_qual#128, substring(value#116, 65, 1) AS wind_observation#12

### Observations

Spark removed our explicit repartition, since it doesn't help, and replaced it with the implicit repartition with 200 partitions.

## 3.2 Pre-partition and Cache (second try)

Now let us see whether we can cache the shuffle (repartition) and sort operation. This is useful in cases where you have to perform multiple joins on the same set of columns, for example with different DataFrames.

So let's simply repartition the `weather` DataFrame on the two columns `usaf` and `wban`.

In [17]:
weather_rep = weather.repartition(20, weather["usaf"], weather["wban"])
weather_rep.cache()

DataFrame[year: int, usaf: string, wban: string, date: string, time: string, report_type: string, wind_direction: string, wind_direction_qual: string, wind_observation: string, wind_speed: double, wind_speed_qual: string, air_temperature: double, air_temperature_qual: string]

#### Execution Plan

Let's analyze the resulting execution plan. Ideally all the preparation work before the `SortMergeJoin` happens before the `cache` operation.

In [18]:
result = weather_rep.join(stations, ["usaf","wban"])
result.explain()

== Physical Plan ==
*(4) Project [usaf#122, wban#123, year#118, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133, STATION NAME#166, CTRY#167, STATE#168, ICAO#169, LAT#170, LON#171, ELEV(M)#172, BEGIN#173, END#174]
+- *(4) SortMergeJoin [usaf#122, wban#123], [USAF#164, WBAN#165], Inner
   :- *(1) Sort [usaf#122 ASC NULLS FIRST, wban#123 ASC NULLS FIRST], false, 0
   :  +- *(1) Filter (isnotnull(usaf#122) AND isnotnull(wban#123))
   :     +- InMemoryTableScan [year#118, usaf#122, wban#123, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133], [isnotnull(usaf#122), isnotnull(wban#123)]
   :           +- InMemoryRelation [year#118, usaf#122, wban#123, date#124, time#125, report_type#126, wind_direction#127, wind_direction

### Remarks

Caching seems to pin the new partitioning, and the second DataFrame (`stations`) will be repartitioned accordingly.

## 3.3 Pre-partition and Cache (third try)

We already partially achieved our goal of caching all preparatory work of the `SortMergeJoin`, but the sorting was still performed after the caching. So let's try to insert an appropriate sort operation.

In [19]:
# Release cache to simplify execution plan
weather_rep.unpersist()

weather_rep = weather.repartition(200, weather["usaf"], weather["wban"]) \
    .orderBy(weather["usaf"], weather["wban"])
weather_rep.cache()

DataFrame[year: int, usaf: string, wban: string, date: string, time: string, report_type: string, wind_direction: string, wind_direction_qual: string, wind_observation: string, wind_speed: double, wind_speed_qual: string, air_temperature: double, air_temperature_qual: string]

#### Execution Plan

In [20]:
result = weather_rep.join(stations, ["usaf","wban"])
result.explain()

== Physical Plan ==
*(5) Project [usaf#122, wban#123, year#118, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133, STATION NAME#166, CTRY#167, STATE#168, ICAO#169, LAT#170, LON#171, ELEV(M)#172, BEGIN#173, END#174]
+- *(5) SortMergeJoin [usaf#122, wban#123], [USAF#164, WBAN#165], Inner
   :- *(2) Sort [usaf#122 ASC NULLS FIRST, wban#123 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(usaf#122, wban#123, 200), ENSURE_REQUIREMENTS, [plan_id=479]
   :     +- *(1) Filter (isnotnull(usaf#122) AND isnotnull(wban#123))
   :        +- InMemoryTableScan [year#118, usaf#122, wban#123, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133], [isnotnull(usaf#122), isnotnull(wban#123)]
   :              +- InMemoryRelati

#### Remarks

We actually created a worse situation: Now we have two sort operations! Definitely not what we wanted to have.

So let's think for a moment: The `SortMergeJoin` requires that each partition is sorted, but only after the repartitioning has occurred. The `orderBy` operation we used above creates a global order over all partitions (and thereby immediately destroys all the repartitioning work). So we need something else, which keeps the current partitions and only sorts within each partition independently.
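The difference between the two kinds of sorting can be illustrated with a small plain-Python sketch. Both helper functions are hypothetical, and the global sort is heavily simplified (it gathers everything and cuts it into contiguous ranges, which is not how Spark implements it), but it shows which operation moves records between partitions.

```python
def sort_within_partitions(partitions, key):
    """Sort each partition on its own; records never change partition."""
    return [sorted(p, key=key) for p in partitions]


def global_sort(partitions, key):
    """Total order: gather, sort, cut into contiguous ranges (records move)."""
    all_records = sorted((r for p in partitions for r in p), key=key)
    n = len(partitions)
    size = -(-len(all_records) // n)  # ceiling division
    return [all_records[i * size:(i + 1) * size] for i in range(n)]


partitions = [["c", "a", "e"], ["d", "b", "f"]]
local = sort_within_partitions(partitions, key=lambda r: r)
total = global_sort(partitions, key=lambda r: r)
# local keeps the membership: [['a', 'c', 'e'], ['b', 'd', 'f']]
# total reshuffles records:   [['a', 'b', 'c'], ['d', 'e', 'f']]
```

After the global sort, the records no longer sit in the partitions chosen by the earlier repartitioning, which matches the extra `Exchange` in the plan above.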

## 3.4 Pre-partition and Cache (final try)

Fortunately Spark provides a `sortWithinPartitions` method, which does exactly what it sounds like.

In [21]:
# Release cache to simplify execution plan
weather_rep.unpersist()

weather_rep = weather.repartition(200, weather["usaf"], weather["wban"]) \
    .sortWithinPartitions(weather["usaf"], weather["wban"])
weather_rep.cache()

DataFrame[year: int, usaf: string, wban: string, date: string, time: string, report_type: string, wind_direction: string, wind_direction_qual: string, wind_observation: string, wind_speed: double, wind_speed_qual: string, air_temperature: double, air_temperature_qual: string]

#### Execution Plan

In [22]:
result = weather_rep.join(stations, (weather["usaf"] == stations["usaf"]) & (weather["wban"] == stations["wban"]))
result.explain()

== Physical Plan ==
*(4) SortMergeJoin [usaf#122, wban#123], [usaf#164, wban#165], Inner
:- *(1) Filter (isnotnull(usaf#122) AND isnotnull(wban#123))
:  +- InMemoryTableScan [year#118, usaf#122, wban#123, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133], [isnotnull(usaf#122), isnotnull(wban#123)]
:        +- InMemoryRelation [year#118, usaf#122, wban#123, date#124, time#125, report_type#126, wind_direction#127, wind_direction_qual#128, wind_observation#129, wind_speed#130, wind_speed_qual#131, air_temperature#132, air_temperature_qual#133], StorageLevel(disk, memory, deserialized, 1 replicas)
:              +- *(2) Sort [usaf#122 ASC NULLS FIRST, wban#123 ASC NULLS FIRST], false, 0
:                 +- Exchange hashpartitioning(usaf#122, wban#123, 200), REPARTITION_BY_NUM, [plan_id=593]
:                    +- *(1) Project [2003 AS year#118, substri

### Remarks

That looks really good. The filter operation is still executed after the cache, but even if we filtered before caching, Spark would not be able to use that information.

So whenever you want to prepartition data, you need to execute the following steps:
* repartition with the join columns and default number of partitions
* sortWithinPartitions with the join columns
* probably cache (otherwise there is no benefit at all)
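The steps above could be wrapped into a small helper. This is a sketch rather than an established API: the function name `prepartition` is made up, and the default of 200 partitions is an assumption that mirrors the default value of `spark.sql.shuffle.partitions` mentioned at the beginning. It only relies on the DataFrame methods `repartition`, `sortWithinPartitions` and `cache` used in this notebook.

```python
def prepartition(df, key_columns, num_partitions=200):
    """Repartition by the join keys, sort inside each partition, then cache.

    df is expected to be a Spark DataFrame, key_columns a list of column names.
    """
    return (
        df.repartition(num_partitions, *key_columns)
          .sortWithinPartitions(*key_columns)
          .cache()
    )


# Hypothetical usage with the DataFrames of this notebook:
# weather_rep = prepartition(weather, ["usaf", "wban"])
# result = weather_rep.join(stations, ["usaf", "wban"])
```

If you changed `spark.sql.shuffle.partitions`, pass the same value as `num_partitions`, because the pre-partitioning only pays off when it matches what the join would request anyway.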

### Inspect WebUI

We can also inspect the WebUI and see how everything is executed.

Phase 1: Build cache

In [23]:
result.count()

                                                                                

1807253

Phase 2: Use cache

In [24]:
result.count()

                                                                                

1807253

# 4 Repartition & Aggregations

Similar to `JOIN` operations, Spark also requires an appropriate partitioning in grouped aggregations. Again, we can use the same strategy and appropriately prepartition data in cases where multiple joins and aggregations are performed using the same columns.

## 4.1 Simple Aggregation

So let's perform the usual aggregation (but this time without a previous `JOIN`) with groups defined by the station id (`usaf` and `wban`).

In [25]:
result = weather.groupBy(weather["usaf"], weather["wban"]).agg(
        f.min(f.when(weather.air_temperature_qual == f.lit(1), weather.air_temperature)).alias('min_temp'),
        f.max(f.when(weather.air_temperature_qual == f.lit(1), weather.air_temperature)).alias('max_temp'),
)
result.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[usaf#122, wban#123], functions=[min(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END), max(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END)])
+- Exchange hashpartitioning(usaf#122, wban#123, 200), ENSURE_REQUIREMENTS, [plan_id=830]
   +- *(1) HashAggregate(keys=[usaf#122, wban#123], functions=[partial_min(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END), partial_max(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END)])
      +- *(1) Project [substring(value#116, 5, 6) AS usaf#122, substring(value#116, 11, 5) AS wban#123, (cast(cast(substring(value#116, 88, 5) as float) as double) / 10.0) AS air_temperature#132, substring(value#116, 93, 1) AS air_temperature_qual#133]
         +- FileScan text [value#116] Batched: false, DataFilters: [], Format: Text, Location: InMemoryFileIndex(1 paths)[s3://dimajix-training

### Remarks
Each grouped aggregation is executed with the following steps:
1. Perform partial aggregation (`HashAggregate`)
2. Shuffle intermediate result (`Exchange hashpartitioning`)
3. Perform final aggregation (`HashAggregate`)
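The three steps can be mimicked in plain Python for a min/max aggregation. This is a simplified sketch with invented names (`partial_aggregate`, `final_aggregate`) and made-up sample values. The shuffle is modelled only implicitly, by handing all partial results to the final step.

```python
def partial_aggregate(partition):
    """Step 1: per-partition (min, max) for each key, before the shuffle."""
    acc = {}
    for key, value in partition:
        lo, hi = acc.get(key, (value, value))
        acc[key] = (min(lo, value), max(hi, value))
    return acc


def final_aggregate(partials):
    """Steps 2 and 3: bring the partial results per key together and merge."""
    merged = {}
    for acc in partials:
        for key, (lo, hi) in acc.items():
            cur_lo, cur_hi = merged.get(key, (lo, hi))
            merged[key] = (min(cur_lo, lo), max(cur_hi, hi))
    return merged


# Two input partitions with (station id, air temperature) pairs
input_partitions = [
    [("010010", -1.5), ("010014", 3.0), ("010010", 0.5)],
    [("010014", -2.0), ("010010", 4.0)],
]
partials = [partial_aggregate(p) for p in input_partitions]
result = final_aggregate(partials)
```

Note that only the partial results travel through the shuffle: at most one entry per key and input partition instead of every single record.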

## 4.2 Aggregation after repartition

Now let us perform the same aggregation, but this time let's use the prepartitioned weather data set `weather_rep` instead.

In [26]:
weather_rep = weather.repartition(87, weather["usaf"], weather["wban"])
weather_rep.unpersist()

DataFrame[year: int, usaf: string, wban: string, date: string, time: string, report_type: string, wind_direction: string, wind_direction_qual: string, wind_observation: string, wind_speed: double, wind_speed_qual: string, air_temperature: double, air_temperature_qual: string]

In [27]:
result = weather_rep.groupBy(weather["usaf"], weather["wban"]).agg(
        f.min(f.when(weather_rep.air_temperature_qual == f.lit(1), weather_rep.air_temperature)).alias('min_temp'),
        f.max(f.when(weather_rep.air_temperature_qual == f.lit(1), weather_rep.air_temperature)).alias('max_temp'),
)
result.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[usaf#122, wban#123], functions=[min(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END), max(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END)])
+- *(2) HashAggregate(keys=[usaf#122, wban#123], functions=[partial_min(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END), partial_max(CASE WHEN (cast(air_temperature_qual#133 as int) = 1) THEN air_temperature#132 END)])
   +- Exchange hashpartitioning(usaf#122, wban#123, 87), REPARTITION_BY_NUM, [plan_id=862]
      +- *(1) Project [substring(value#116, 5, 6) AS usaf#122, substring(value#116, 11, 5) AS wban#123, (cast(cast(substring(value#116, 88, 5) as float) as double) / 10.0) AS air_temperature#132, substring(value#116, 93, 1) AS air_temperature_qual#133]
         +- FileScan text [value#116] Batched: false, DataFilters: [], Format: Text, Location: InMemoryFileIndex(1 paths)[s3://dimajix-training/d

### Remarks
Spark obviously detects the correct partitioning of the `weather_rep` DataFrame. An additional sort within the partitions is not required for the aggregation, but would not hurt either (except performance...). Therefore only two steps are executed after the repartitioning:
1. Partial aggregation (`HashAggregate`)
2. Final aggregation (`HashAggregate`)

But note that although you saved a shuffle operation of partial aggregates, in most cases it is not advisable to prepartition data only for aggregations for the following reasons:
* You could perform all aggregations in a single `groupBy` and `agg` chain
* In most cases the preaggregated data is significantly smaller than the original data, therefore the shuffle doesn't hurt that much

# 5 Interaction between Join, Aggregate & Repartition

Now we have seen two operations which require a shuffle of the data. Of course Spark is clever enough to avoid an additional shuffle operation in chains of `JOIN`s and grouped aggregations which use the same key columns.

## 5.1 Aggregation after Join on same key

So let's see what happens with a grouped aggregation after a join operation.

In [29]:
joined = weather.join(stations, (weather["usaf"] == stations["usaf"]) & (weather["wban"] == stations["wban"]))
result = joined.groupBy(weather["usaf"], weather["wban"]).agg(
        f.min(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('min_temp'),
        f.max(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('max_temp'),
)
result.explain()

== Physical Plan ==
*(5) HashAggregate(keys=[usaf#100, wban#101], functions=[min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
+- *(5) HashAggregate(keys=[usaf#100, wban#101], functions=[partial_min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), partial_max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
   +- *(5) Project [usaf#100, wban#101, air_temperature#110, air_temperature_qual#111]
      +- *(5) SortMergeJoin [usaf#100, wban#101], [usaf#142, wban#143], Inner
         :- *(2) Sort [usaf#100 ASC NULLS FIRST, wban#101 ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(usaf#100, wban#101, 200), ENSURE_REQUIREMENTS, [plan_id=790]
         :     +- *(1) Project [substring(value#94, 5, 6) AS usaf#100, substring(value#94, 11, 5) AS wban#101, (cast(cast(substring(value

### Remarks

As you can see, Spark performs a single shuffle operation. The order of operation is as follows:
1. Filter `NULL` values (it's an inner join)
2. Shuffle data on `usaf` and `wban`
3. Sort partitions by `usaf` and `wban`
4. Perform `SortMergeJoin`
5. Perform partial aggregation `HashAggregate`
6. Perform final aggregation `HashAggregate`

## 5.2 Aggregation after Join using repartitioned data

Of course we can also use the pre-repartitioned weather DataFrame. This will work as expected: Spark does not add any additional shuffle operation.

In [30]:
weather_rep = weather.repartition(84, weather["usaf"], weather["wban"])

joined = weather_rep.join(stations, ["usaf","wban"])
result = joined.groupBy(weather["usaf"], weather["wban"]).agg(
        f.min(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('min_temp'),
        f.max(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('max_temp'),
)
result.explain()

== Physical Plan ==
*(5) HashAggregate(keys=[usaf#100, wban#101], functions=[min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
+- *(5) HashAggregate(keys=[usaf#100, wban#101], functions=[partial_min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), partial_max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
   +- *(5) Project [usaf#100, wban#101, air_temperature#110, air_temperature_qual#111]
      +- *(5) SortMergeJoin [usaf#100, wban#101], [USAF#142, WBAN#143], Inner
         :- *(2) Sort [usaf#100 ASC NULLS FIRST, wban#101 ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(usaf#100, wban#101, 200), REPARTITION_BY_NUM, [plan_id=875]
         :     +- *(1) Project [substring(value#94, 5, 6) AS usaf#100, substring(value#94, 11, 5) AS wban#101, (cast(cast(substring(value#

Note that the explicit repartition has been removed by Spark - therefore it doesn't make any sense to `repartition` before a join operation.

## 5.3 Aggregation after Join with different key

So far we only looked at join and grouping operations using the same keys. If we use different keys (for example the country) in both operations, we expect Spark to add an additional shuffle operation. Let's see...

In [31]:
joined = weather.join(stations, ["usaf","wban"])
result = joined.groupBy(stations["ctry"]).agg(
        f.min(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('min_temp'),
        f.max(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('max_temp'),
)
result.explain()

== Physical Plan ==
*(6) HashAggregate(keys=[ctry#145], functions=[min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
+- Exchange hashpartitioning(ctry#145, 200), ENSURE_REQUIREMENTS, [plan_id=975]
   +- *(5) HashAggregate(keys=[ctry#145], functions=[partial_min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), partial_max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
      +- *(5) Project [air_temperature#110, air_temperature_qual#111, CTRY#145]
         +- *(5) SortMergeJoin [usaf#100, wban#101], [USAF#142, WBAN#143], Inner
            :- *(2) Sort [usaf#100 ASC NULLS FIRST, wban#101 ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(usaf#100, wban#101, 200), ENSURE_REQUIREMENTS, [plan_id=958]
            :     +- *(1) Project [substring(value#94, 5, 6) AS usaf#100

## 5.4 Aggregation after Broadcast-Join 

If we use a broadcast join instead of a sort merge join, then we will have a shuffle operation for the aggregation again (since the broadcast join just avoids the shuffle). Let's verify that theory...

In [32]:
joined = weather.join(f.broadcast(stations), ["usaf","wban"])
result = joined.groupBy(weather["usaf"], weather["wban"]).agg(
        f.min(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('min_temp'),
        f.max(f.when(joined.air_temperature_qual == f.lit(1), joined.air_temperature)).alias('max_temp'),
)
result.explain()

== Physical Plan ==
*(3) HashAggregate(keys=[usaf#100, wban#101], functions=[min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
+- Exchange hashpartitioning(usaf#100, wban#101, 200), ENSURE_REQUIREMENTS, [plan_id=1067]
   +- *(2) HashAggregate(keys=[usaf#100, wban#101], functions=[partial_min(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END), partial_max(CASE WHEN (cast(air_temperature_qual#111 as int) = 1) THEN air_temperature#110 END)])
      +- *(2) Project [usaf#100, wban#101, air_temperature#110, air_temperature_qual#111]
         +- *(2) BroadcastHashJoin [usaf#100, wban#101], [USAF#142, WBAN#143], Inner, BuildRight, false
            :- *(2) Project [substring(value#94, 5, 6) AS usaf#100, substring(value#94, 11, 5) AS wban#101, (cast(cast(substring(value#94, 88, 5) as float) as double) / 10.0) AS air_temperature#110, su
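Why the aggregation needs its own shuffle after a broadcast join can be seen in a small plain-Python sketch. The function `broadcast_hash_join` and the sample rows are invented for this illustration. The point is that every partition of the large side is joined in place against a full copy of the small side.

```python
def broadcast_hash_join(big_partitions, small_table, key):
    """Join each partition of the big side against a copy of the small side."""
    # Build a lookup table from the small side (this is what gets broadcast)
    lookup = {}
    for row in small_table:
        lookup.setdefault(key(row), []).append(row)
    joined = []
    for partition in big_partitions:
        # Each partition is processed where it is; no record changes partition
        joined.append([
            {**row, **match}
            for row in partition
            for match in lookup.get(key(row), [])
        ])
    return joined


weather_partitions = [
    [{"usaf": "010010", "air_temperature": -1.5},
     {"usaf": "010014", "air_temperature": 3.0}],
    [{"usaf": "010010", "air_temperature": 0.5},
     {"usaf": "999999", "air_temperature": 7.0}],
]
stations_small = [
    {"usaf": "010010", "ctry": "NO"},
    {"usaf": "010014", "ctry": "NO"},
]
joined = broadcast_hash_join(weather_partitions, stations_small,
                             key=lambda r: r["usaf"])
```

Station `010010` still appears in both partitions after the join, so a grouped aggregation on `usaf` has to shuffle the (partial) results, just as in the plan above.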

# 6 Coalesce

There is another use case for changing the number of partitions: Writing results to HDFS/S3/whatever. By design, Spark writes each partition into a separate file, and there is no way around that. But when partitions do not contain many records, this is not only ugly, but also inefficient and might cause additional trouble. Specifically, HDFS is not designed to handle many small files and prefers fewer large files instead.

Therefore it is often desirable to reduce the number of partitions of a DataFrame just before writing the result to disk. You could perform this task by a `repartition` operation, but this is an expensive operation requiring an additional shuffle operation. Therefore Spark provides an additional method called `coalesce` which can be used to reduce the number of partitions without incurring an additional shuffle. Spark simply logically concatenates multiple partitions into new partitions.

### Inspect Number of Partitions

For this example, we will use the `weather_rep` DataFrame, which contains exactly 200 partitions.

In [28]:
weather_rep = weather.repartition(200, weather["usaf"], weather["wban"])
weather_rep.cache()

DataFrame[year: int, usaf: string, wban: string, date: string, time: string, report_type: string, wind_direction: string, wind_direction_qual: string, wind_observation: string, wind_speed: double, wind_speed_qual: string, air_temperature: double, air_temperature_qual: string]

In [29]:
weather_rep.rdd.getNumPartitions()

200

## 6.1 Reducing partitions before writing

In order to reduce the number of partitions, we simply use the `coalesce` method. This is often used to reduce the number of files before writing. But we will see that this might not be the best option and using a `repartition` might be faster in some cases.

### Write without `coalesce`

In [30]:
joined = weather_rep.join(stations, ["usaf", "wban"])
joined.write.mode("overwrite").parquet("/tmp/weather_200")

                                                                                

In [31]:
!hdfs dfs -ls /tmp/weather_200

Found 91 items
-rw-r--r--   1 hadoop hdfsadmingroup          0 2023-12-05 13:53 /tmp/weather_200/_SUCCESS
-rw-r--r--   1 hadoop hdfsadmingroup       2187 2023-12-05 13:53 /tmp/weather_200/part-00000-561efd9a-67f4-4ea2-b070-fba80ab86166-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup      27660 2023-12-05 13:53 /tmp/weather_200/part-00003-561efd9a-67f4-4ea2-b070-fba80ab86166-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup      70444 2023-12-05 13:53 /tmp/weather_200/part-00005-561efd9a-67f4-4ea2-b070-fba80ab86166-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup      36221 2023-12-05 13:53 /tmp/weather_200/part-00006-561efd9a-67f4-4ea2-b070-fba80ab86166-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup     136849 2023-12-05 13:53 /tmp/weather_200/part-00011-561efd9a-67f4-4ea2-b070-fba80ab86166-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup      95227 2023-12-05 13:53 /tmp/weather_200/part-00013-561efd9a-67f4-4ea2-b070-fba80ab86166-c000.sn

### Write with `coalesce`

First we try to reduce the number of partitions by using `coalesce`.

In [36]:
result = joined.coalesce(8)
result.write.mode("overwrite").parquet("/tmp/weather_coalesce")

                                                                                

Now let's inspect the result. Check both the files in HDFS and the Spark web UI for performance metrics. Note the uneven file size and that the (expensive) join operation is executed using only 8 tasks.

In [38]:
!hdfs dfs -ls /tmp/weather_coalesce

Found 9 items
-rw-r--r--   1 hadoop hdfsadmingroup          0 2023-12-05 13:57 /tmp/weather_coalesce/_SUCCESS
-rw-r--r--   1 hadoop hdfsadmingroup     581677 2023-12-05 13:57 /tmp/weather_coalesce/part-00000-8331a7ca-8e0f-49b2-9b25-3f15bbb3dddb-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup    1370508 2023-12-05 13:57 /tmp/weather_coalesce/part-00001-8331a7ca-8e0f-49b2-9b25-3f15bbb3dddb-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup     622394 2023-12-05 13:57 /tmp/weather_coalesce/part-00002-8331a7ca-8e0f-49b2-9b25-3f15bbb3dddb-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup     338500 2023-12-05 13:57 /tmp/weather_coalesce/part-00003-8331a7ca-8e0f-49b2-9b25-3f15bbb3dddb-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup     954745 2023-12-05 13:57 /tmp/weather_coalesce/part-00004-8331a7ca-8e0f-49b2-9b25-3f15bbb3dddb-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup     738033 2023-12-05 13:57 /tmp/weather_coalesce/part-00005-8331a7ca-8

### Write with `repartition`

In [37]:
result = joined.repartition(8)
result.write.mode("overwrite").parquet("/tmp/weather_repartition")

                                                                                

Now let's inspect the result. Check both the files in HDFS and the Spark web UI for performance metrics. Note that the files are much more balanced in size and also note that the join operation is now executed using 200 tasks.

In [39]:
!hdfs dfs -ls /tmp/weather_repartition

Found 9 items
-rw-r--r--   1 hadoop hdfsadmingroup          0 2023-12-05 13:57 /tmp/weather_repartition/_SUCCESS
-rw-r--r--   1 hadoop hdfsadmingroup    1856429 2023-12-05 13:57 /tmp/weather_repartition/part-00000-dabeabe3-b108-4413-9c51-c4142528e112-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup    1871473 2023-12-05 13:57 /tmp/weather_repartition/part-00001-dabeabe3-b108-4413-9c51-c4142528e112-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup    1831624 2023-12-05 13:57 /tmp/weather_repartition/part-00002-dabeabe3-b108-4413-9c51-c4142528e112-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup    1874020 2023-12-05 13:57 /tmp/weather_repartition/part-00003-dabeabe3-b108-4413-9c51-c4142528e112-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup    1859269 2023-12-05 13:57 /tmp/weather_repartition/part-00004-dabeabe3-b108-4413-9c51-c4142528e112-c000.snappy.parquet
-rw-r--r--   1 hadoop hdfsadmingroup    1877768 2023-12-05 13:57 /tmp/weather_repartition/

## 6.2 `coalesce` vs `repartition`

At first sight, `coalesce` seems to be the more performant option, because it is essentially only a management operation which merges multiple tasks into one. But that also implies that the work of these tasks won't be executed with a parallelism higher than the number of tasks. And if the last operation is expensive (e.g. a join), then overall performance might suffer. In these cases, a repartition might be the better option, although it will introduce an additional shuffle.
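The uneven file sizes seen with `coalesce` can be reproduced with a toy model in plain Python. Both functions are hypothetical simplifications: the real `coalesce` chooses which partitions to merge differently, and the real `repartition` is a distributed shuffle. The sketch only contrasts merging whole partitions with redistributing individual records.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions without moving individual records."""
    groups = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        groups[i * n // len(partitions)].extend(p)
    return groups


def repartition(partitions, n):
    """Redistribute every record round-robin (this is the shuffle)."""
    groups = [[] for _ in range(n)]
    records = [r for p in partitions for r in p]
    for i, r in enumerate(records):
        groups[i % n].append(r)
    return groups


# Six skewed input partitions with 1, 2, 3, 10, 20 and 30 records
skewed = [[0] * size for size in (1, 2, 3, 10, 20, 30)]
coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]        # [6, 60]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]  # [33, 33]
```

The coalesced result inherits the skew of its inputs, while the repartitioned result is balanced at the price of touching every record.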

## 6.3 Sort before Writing

When writing Parquet or ORC files (even indirectly when writing to Hive), you can reduce the file size by sorting the data by appropriate keys. This will help compression algorithms to achieve better file sizes.

In [None]:
result = joined.repartition(8).sortWithinPartitions("wban", "usaf", "date", "time")
result.write.mode("overwrite").parquet("/tmp/weather_sorted")

In [None]:
!hdfs dfs -ls /tmp/weather_sorted
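The effect of sorting on compression can be tried out without Spark. The following sketch is only an analogy: it compresses plain text with `zlib`, whereas Parquet uses columnar encodings and (in the listings above) Snappy. The station ids are synthetic and generated with a fixed random seed.

```python
import random
import zlib

rng = random.Random(0)

# 500 synthetic station ids, 20000 records that reference them at random
station_ids = [f"{i:06d}-99999" for i in range(500)]
records = [rng.choice(station_ids) for _ in range(20000)]

# Compress the same records once in arrival order and once sorted by key
unsorted_size = len(zlib.compress("\n".join(records).encode("utf-8")))
sorted_size = len(zlib.compress("\n".join(sorted(records)).encode("utf-8")))

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting places identical keys next to each other, so the compressor sees long repeated runs instead of scattered repetitions. The same reasoning is why `sortWithinPartitions` before writing can shrink the output files.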