# Checkpointing DataFrames

Execution plans can grow very long, and Spark may then run into trouble. Typical cases are iterative algorithms, such as machine learning or graph algorithms, whose big outer loop transforms a DataFrame over and over again. Every iteration appends to the plan, so the resulting execution plan becomes huge.

In these cases you can use `cache()` or `persist()` to improve performance; otherwise every iteration would execute all steps of the loop again from the very beginning, leading to a runtime of O(n^2). However, caching does not cut off the lineage.
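The O(n^2) estimate can be checked with a little arithmetic. The following sketch is plain Python, not Spark code, and it assumes that every iteration adds one transformation and triggers one action (for example a convergence check). The function names are made up for this illustration.

```python
def steps_without_caching(n_iterations):
    # Iteration i has to re-execute all i transformations accumulated so far,
    # so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    return sum(range(1, n_iterations + 1))


def steps_with_caching(n_iterations):
    # With the previous result cached, each iteration only executes
    # the single transformation it adds.
    return n_iterations
```

For 100 iterations this gives 5050 executed steps without caching versus 100 with caching.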

Checkpointing is the right solution for these cases. It persists the data of a DataFrame in reliable distributed storage (most commonly HDFS) and cuts off the lineage.
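For an iterative algorithm, this could look like the following minimal sketch. It assumes that a checkpoint directory has already been configured and that `step` is some function that takes a DataFrame and returns the transformed DataFrame. The helper name `iterate_with_checkpoints` and the parameter `checkpoint_every` are hypothetical and not part of Spark; only `DataFrame.checkpoint()` is the real API call.

```python
def iterate_with_checkpoints(df, step, n_iterations, checkpoint_every=10):
    """Apply `step` repeatedly and truncate the lineage every few iterations.

    Assumes a checkpoint directory was set beforehand via
    SparkContext.setCheckpointDir().
    """
    for i in range(1, n_iterations + 1):
        df = step(df)
        if i % checkpoint_every == 0:
            # Materializes the current data and returns a DataFrame
            # whose execution plan no longer contains the earlier steps.
            df = df.checkpoint()
    return df
```

Checkpointing only every few iterations is a trade-off: each checkpoint writes the full DataFrame to storage, so doing it in every iteration may cost more than it saves.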

## Create or Reuse Spark Session

In [1]:
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","24G") \
        .getOrCreate()

spark

In [None]:
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# 1 Load Data

We will load the weather data again for this example.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

## 1.1 Load Measurements

In [None]:
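# Note: the wildcard import shadows Python's built-in min() and max()
# with Spark's column functions of the same name, which are used below.
from pyspark.sql.functions import *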

raw_weather = spark.read.text(storageLocation + "/2003").withColumn("year", lit(2003))

### Extract Measurements

The measurements are stored in a proprietary text-based format, with some values at fixed positions. We extract these values with a simple `select` that cuts out the relevant substrings.

In [None]:
weather = raw_weather.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    substring(col("value"),16,8).alias("date"),
    substring(col("value"),24,4).alias("time"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)
    
weather.limit(10).toPandas()

## 1.2 Load Station Metadata

In [None]:
stations = spark.read \
    .option("header", True) \
    .csv(storageLocation + "/isd-history")

# Display first 10 records    
stations.limit(10).toPandas()

# 2 Join Data

Now we perform the join between the station master data and the measurements, as we did before.

In [None]:
joined_weather = weather.join(stations, (weather["usaf"] == stations["usaf"]) & (weather["wban"] == stations["wban"]))

# 3 Truncating Execution Plans

Now we want to understand the effect of checkpointing. First we will use the traditional aggregation and print the execution plan.

## 3.1 Traditional Aggregation

In [None]:
result = joined_weather.groupBy(joined_weather["ctry"], joined_weather["year"]).agg(
        min(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('min_temp'),
        max(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('max_temp')
)

result.explain(True)

## 3.2 Reliable Checkpointing

Now we first checkpoint the joined weather data set and then perform the aggregation on the checkpointed DataFrame.

### Set Checkpoint directory

First we need to specify a checkpoint directory on a reliable shared file system.

In [None]:
# YOUR CODE HERE
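One possible way to fill in the cell above, sketched as a small helper. `SparkContext.setCheckpointDir` is the actual PySpark call; the helper name `set_checkpoint_dir` and the path in the note below are assumptions for illustration only.

```python
def set_checkpoint_dir(spark, checkpoint_dir):
    """Register the directory where reliable checkpoints are written.

    `spark` is expected to be a SparkSession. On a real cluster,
    `checkpoint_dir` should point to a reliable shared file system
    such as HDFS.
    """
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    return checkpoint_dir
```

In this notebook the call could be `set_checkpoint_dir(spark, "/tmp/checkpoints")`, where the path is only a placeholder. With a `local[*]` master a local directory is sufficient, because driver and executors share the same file system.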

### Create checkpoint

Now we can create a checkpoint for the joined weather data. Note that this takes some time: by default, checkpointing is not a lazy operation but is executed immediately. This is also conceptually necessary, because checkpointing cuts off the whole lineage, so there is no way around executing the computation to materialize the DataFrame inside the checkpoint directory.

In [None]:
# YOUR CODE HERE
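A minimal sketch of the checkpointing step, assuming the checkpoint directory has already been set. `DataFrame.checkpoint()` and its `eager` parameter are part of PySpark; the wrapper name `checkpoint_dataframe` is hypothetical and only serves to make the step explicit.

```python
def checkpoint_dataframe(df, eager=True):
    """Write `df` to the checkpoint directory and return the truncated DataFrame.

    With eager=True (the default) the data is computed and written
    immediately; with eager=False the work is deferred until the
    first action on the returned DataFrame.
    """
    return df.checkpoint(eager=eager)
```

The later cells refer to the result as `cp_weather`, so here the call would be `cp_weather = checkpoint_dataframe(joined_weather)`, and `cp_weather.explain(True)` then prints the plan of the checkpointed DataFrame.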

### Inspect Checkpoint directory

In [None]:
# YOUR CODE HERE
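If the checkpoint directory lives on the local file system (an assumption that holds for a `local[*]` session, but not for HDFS or S3), its contents can be listed with plain Python. The helper name `list_checkpoint_files` is made up for this sketch; on HDFS you would use the Hadoop command line tools instead.

```python
import os


def list_checkpoint_files(checkpoint_dir):
    """Return (relative_path, size_in_bytes) for every file below checkpoint_dir."""
    entries = []
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            path = os.path.join(root, name)
            relative = os.path.relpath(path, checkpoint_dir)
            entries.append((relative, os.path.getsize(path)))
    # Sort so that the listing is stable across runs
    return sorted(entries)
```

Printing the result after creating the checkpoint should show the part files that Spark wrote for the checkpointed data.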

### Inspect execution plan

Let us have a look at the execution plan of the checkpointed DataFrame.

In [None]:
# YOUR CODE HERE

As you can see, the lineage is gone and has been replaced by a `Scan ExistingRDD`, which refers to the data in the checkpoint directory.

### Perform aggregation

Now we can perform the aggregation with the checkpointed variant of the joined weather DataFrame.

In [None]:
result = cp_weather.groupBy(cp_weather["ctry"], cp_weather["year"]).agg(
        min(when(cp_weather.air_temperature_qual == lit(1), cp_weather.air_temperature)).alias('min_temp'),
        max(when(cp_weather.air_temperature_qual == lit(1), cp_weather.air_temperature)).alias('max_temp')
)

# YOUR CODE HERE

As expected, the execution plan now essentially only contains the aggregation in three steps (partial aggregation, shuffle, final aggregation). The lineage of the join is not present any more.

## 3.3 Unreliable Checkpointing

In addition to *reliable* checkpointing, Spark also supports *unreliable* checkpointing, where the data is not stored in HDFS but locally on the worker nodes, using the caching backend.

Note that using unreliable checkpointing is strongly discouraged with dynamic allocation, where executors (and the checkpointed data they hold) can be released again.

In [None]:
# YOUR CODE HERE
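A minimal sketch of unreliable checkpointing. `DataFrame.localCheckpoint()` is the PySpark call for this; the wrapper name `checkpoint_locally` is hypothetical. No checkpoint directory is needed in this variant.

```python
def checkpoint_locally(df, eager=True):
    """Checkpoint `df` on the executors and return the truncated DataFrame.

    The data is kept in executor-local storage through the caching
    backend, so it is lost if the executors holding it go away.
    """
    return df.localCheckpoint(eager=eager)
```

The aggregation below expects the result under the name `cpu_weather`, so the call would be `cpu_weather = checkpoint_locally(joined_weather)`.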

### Inspect Checkpoint data

After running the cell above, you can see the checkpointed data in the "Storage" section of the Spark web interface.

### Perform aggregation

Now we can perform the aggregation with the checkpointed variant of the joined weather DataFrame.

In [None]:
result = cpu_weather.groupBy(cpu_weather["ctry"], cpu_weather["year"]).agg(
        min(when(cpu_weather.air_temperature_qual == lit(1), cpu_weather.air_temperature)).alias('min_temp'),
        max(when(cpu_weather.air_temperature_qual == lit(1), cpu_weather.air_temperature)).alias('max_temp')
)

# YOUR CODE HERE

In [None]:
result.limit(5).toPandas()

## 3.4 Checkpoint cleanup

Spark can automatically remove checkpoint directories if the configuration property `spark.cleaner.referenceTracking.cleanCheckpoints` is set to `true` (the default is `false` as of Spark 2.3). Otherwise you have to remove the checkpoint data from HDFS manually.
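The property could be set while building the session, as in the following sketch. It assumes that the setting has to be in place when the Spark context is created, so a session that is already running and merely reused by `getOrCreate()` would probably not pick it up. The helper name `enable_checkpoint_cleanup` is hypothetical.

```python
def enable_checkpoint_cleanup(builder):
    """Return the session builder with automatic checkpoint cleanup enabled.

    `builder` is expected to be a SparkSession.Builder, for example
    SparkSession.builder.master("local[*]").
    """
    return builder.config(
        "spark.cleaner.referenceTracking.cleanCheckpoints", "true"
    )
```

A session would then be created with `enable_checkpoint_cleanup(SparkSession.builder.master("local[*]")).getOrCreate()`.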