
## Tip of the Week: Use checkpoint to break DataFrame lineage

### What is lineage?

As we know, Spark uses lazy evaluation. As we chain transformations, e.g. `.withColumn()`, Spark builds up an execution plan for the DataFrame, and the plan is only executed when an action is called, e.g. `.show()`. This execution plan represents the DataFrame's lineage.

Sometimes the DataFrame's lineage can grow long and complex, which slows down processing and may even cause an (unhelpful) error. We can get around this by breaking the lineage.

### What is `.checkpoint()`?

There is more than one way of breaking the lineage, and we've covered some ideas already in tip of the week, e.g. [using staging tables](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week/blob/master/tip_15_staging_tables.ipynb). Here's another one called checkpoint, which is essentially a shortcut for writing the DataFrame out to disk and reading it back in.
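To make the "write out and read back in" idea concrete, here is a minimal sketch of doing it by hand. The helper name `write_and_read_back` and the choice of parquet are assumptions for illustration. This is not how `.checkpoint()` is implemented internally, only a manual equivalent that also returns a DataFrame whose plan starts from saved files.

```python
def write_and_read_back(spark, df, path):
    """Manually break lineage by saving `df` to `path` and loading it again.

    Hypothetical helper: `path` is any location you are able to write to.
    """
    # Writing is an action, so the plan built up so far is executed here
    df.write.mode("overwrite").parquet(path)
    # The DataFrame read back has a plan that starts from the saved files
    return spark.read.parquet(path)
```

It would be used as `df = write_and_read_back(spark, df, path)`, with `path` chosen by you. As with checkpointing, the files written are not removed automatically.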

### How we will test it

To demonstrate the benefit of `.checkpoint()`, we'll time how long it takes to create a DataFrame without `.checkpoint()`, then again with it.

We'll create a new Spark session each time just in case there's an advantage when processing the DataFrame a second time in the same session. For this reason we will put the Spark session into a function. We will also use the Python module `time` to measure the time taken to create the DataFrame. 
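As an aside, the timing pattern can be wrapped in a small helper. The sketch below is an assumption for illustration, not part of this notebook: the name `time_call` is hypothetical, and it swaps `time()` for `perf_counter()`, which is a clock designed for measuring intervals.

```python
from time import perf_counter


def time_call(func, *args, **kwargs):
    """Call `func` and return (result, seconds taken)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - start


# Example with a plain Python function; no Spark needed
result, seconds = time_call(sum, range(1000))
print(f"Time taken: {seconds:.6f} seconds")
```

With Spark, remember that only actions trigger execution, so the function being timed needs to end in an action such as `.show()` or `.count()`.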

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from time import time

def spark_session():
    return (
        SparkSession.builder.appName("checkpoint-tip")
        .config("spark.executor.memory", "1g")
        .config("spark.executor.cores", 1)
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.maxExecutors", 3)
        .config("spark.sql.shuffle.partitions", 12)
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.ui.showConsoleProgress", "false")
        .enableHiveSupport()
        .getOrCreate()
    )

In [2]:
spark = spark_session()

We're going to create a new DataFrame with an `id` column and a column called `col_0` that will consist of random numbers. We'll then create a loop to add new columns where the values depend on a previous column. In general, we try to avoid using loops with Spark and this example shows why. But sometimes using loops to create new columns is the only way of achieving what we need.

We will set a `seed_num` when creating the random numbers to make the results repeatable. The DataFrame will have `num_rows` rows, which we will set to a thousand, and the loop will iterate 11 times to create `col_1` to `col_11`.

The contents of the columns aren't important here. What is important is that Spark is creating an execution plan that is getting longer with each iteration of the loop.

In [3]:
new_cols = 12
seed_num = 42
num_rows = 10**3

### Without `.checkpoint()`

In [4]:
start_time = time()

df = spark.range(num_rows)
df = df.withColumn("col_0", F.ceil(F.rand(seed_num) * new_cols))

for i in range(1, new_cols):
    df = df.withColumn("col_"+str(i), F.when(F.col("col_"+str(i-1)) > i, F.col("col_"+str(i-1))).otherwise(0))

df.show()

time_taken = time() - start_time
print(f"Time taken to create the DataFrame:  {time_taken}")

+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
| id|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
|  0|    8|    8|    8|    8|    8|    8|    8|    8|    0|    0|     0|     0|
|  1|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  2|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  3|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  4|    6|    6|    6|    6|    6|    6|    0|    0|    0|    0|     0|     0|
|  5|    7|    7|    7|    7|    7|    7|    7|    0|    0|    0|     0|     0|
|  6|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  7|    2|    2|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  8|    4|    4|    4|    4|    0|    0|    0|    0|    0|    0|     0|     0|
|  9|    9|    9|    9|    9|    9|    9

The result above shows how long Spark took to create the plan and execute it to show the top 20 rows.
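The contents of the columns are not the point of this tip, but the rule applied in the loop can be checked in plain Python. The sketch below is an illustration only: the function name `expected_row` is hypothetical and no Spark is involved. It applies the same rule as the `F.when(...).otherwise(0)` expression to a single starting value.

```python
def expected_row(col_0, new_cols=12):
    """Return the values of col_0 to col_{new_cols - 1} for one row."""
    values = [col_0]
    for i in range(1, new_cols):
        previous = values[-1]
        # Same rule as F.when(previous > i, previous).otherwise(0)
        values.append(previous if previous > i else 0)
    return values


print(expected_row(8))  # [8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0]
```

This matches the row with `id` 0 in the output above: the value is carried forward until it is no longer greater than the column index, and is 0 from then on.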

In [5]:
spark.stop()

spark = spark_session()

### With `.checkpoint()`

Now let's do the same thing again, but this time we will checkpoint the DataFrame every 3 iterations of the loop so that the lineage doesn't grow as long.
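To see which iterations that means in practice, the short sketch below lists them using the same loop range as before. The name `checkpoint_every` is a hypothetical variable introduced for illustration.

```python
new_cols = 12
checkpoint_every = 3  # hypothetical name for the checkpoint interval

# The loop runs over i = 1, ..., 11; checkpoint when i is divisible by the interval
checkpoint_iterations = [i for i in range(1, new_cols) if i % checkpoint_every == 0]
print(checkpoint_iterations)  # [3, 6, 9]
```

So the plan never accumulates more than three `.withColumn()` steps before it is cut, and the final two columns are added after the last checkpoint.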

To perform a checkpoint we need to set up a checkpoint directory on HDFS, which is where the checkpointed DataFrames will be stored. It's important to practice good housekeeping with this directory because new files are created with every checkpoint, but they are **not** automatically deleted.

In [6]:
import os
username = os.getenv("HADOOP_USER_NAME")
checkpoint_path = f"/user/{username}/checkpoints"
spark.sparkContext.setCheckpointDir(checkpoint_path)

In [7]:
start_time = time()

df = spark.range(num_rows)
df = df.withColumn("col_0", F.ceil(F.rand(seed_num) * new_cols))

for i in range(1, new_cols):
    df = df.withColumn("col_"+str(i), F.when(F.col("col_"+str(i-1)) > i, F.col("col_"+str(i-1))).otherwise(0))
    if i % 3 == 0:  # i.e. if i is divisible by three then...
        df = df.checkpoint() # here is the checkpoint
        
df.show()

time_taken = time() - start_time
print(f"Time taken to create the DataFrame:  {time_taken}")

+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
| id|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
|  0|    8|    8|    8|    8|    8|    8|    8|    8|    0|    0|     0|     0|
|  1|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  2|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  3|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  4|    6|    6|    6|    6|    6|    6|    0|    0|    0|    0|     0|     0|
|  5|    7|    7|    7|    7|    7|    7|    7|    0|    0|    0|     0|     0|
|  6|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  7|    2|    2|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  8|    4|    4|    4|    4|    0|    0|    0|    0|    0|    0|     0|     0|
|  9|    9|    9|    9|    9|    9|    9

The exact times will vary with each run of this notebook, but hopefully you will see that the version using `.checkpoint()` was faster.

As mentioned earlier, the checkpoint files are not deleted on HDFS automatically. The files are not intended to be used after you stop the Spark session, so make sure you delete these files after a session. There have been cases in ONS where checkpoint directories have grown to be terabytes in size, which can create problems for other DAP users.

The easiest way to delete files is in HUE, but the cell below is handy to have at the end of your scripts to make sure you don't forget to empty the checkpoint folder.

In [8]:
import subprocess
# Pass the command as a list so the path is not interpreted by a shell
cmd = ["hdfs", "dfs", "-rm", "-r", "-skipTrash", checkpoint_path]
p = subprocess.run(cmd)

Here `checkpoint_path` is the path passed to `.setCheckpointDir()`.
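Because `-skipTrash` deletes permanently, it may be worth checking the path before running the command. The sketch below is one possible guard, not an established convention: the function names are hypothetical, and the checks assume the `/user/<username>/...` layout used in this notebook. It also catches the case where `os.getenv("HADOOP_USER_NAME")` returned `None`, which would otherwise produce a path like `/user/None/checkpoints`.

```python
import subprocess


def build_cleanup_command(checkpoint_path):
    """Return the argument list for deleting `checkpoint_path` on HDFS."""
    # Assumption: checkpoint directories live under /user/<username>/
    if not checkpoint_path.startswith("/user/") or "/None/" in checkpoint_path:
        raise ValueError(f"Refusing to delete suspicious path: {checkpoint_path}")
    return ["hdfs", "dfs", "-rm", "-r", "-skipTrash", checkpoint_path]


def delete_checkpoints(checkpoint_path):
    """Delete the checkpoint directory, raising an error if the command fails."""
    # check=True raises CalledProcessError if hdfs returns a non-zero exit code
    subprocess.run(build_cleanup_command(checkpoint_path), check=True)
```

Calling `delete_checkpoints(checkpoint_path)` in a `finally` block at the end of a script means the clean-up is attempted even if the processing fails part way through.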

### Is 3 the magic number?

How did we decide to checkpoint every 3 iterations? Trial and error. Unfortunately, you may not have the luxury of searching for the optimum number, but have a go at checkpointing and see whether performance improves.
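If you do have time to experiment, the trial and error can be organised with a small harness like the sketch below. It is an illustration under assumptions: `compare_intervals` is a hypothetical name, and `run` stands for a function you would write yourself that builds the DataFrame, checkpoints every `interval` iterations, and finishes with an action.

```python
from time import perf_counter


def compare_intervals(run, intervals):
    """Time `run(interval)` for each candidate and return {interval: seconds}."""
    timings = {}
    for interval in intervals:
        start = perf_counter()
        run(interval)  # must end in an action, otherwise nothing is executed
        timings[interval] = perf_counter() - start
    return timings
```

For example, `timings = compare_intervals(run, [2, 3, 4, 6])` followed by `min(timings, key=timings.get)` gives the fastest candidate for that run. Timings are noisy, so treat small differences with caution, and consider starting a fresh Spark session inside `run`, as we did above.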

### Why not use `.cache()`?

`.cache()` and `.checkpoint()` work in different ways, and each way of persisting has its own pros and cons. For more information on the different ways of persisting data in Spark, please see our [persist](http://np2rvlapxx507/DAP_CATS/Training/more-spark/blob/master/more_spark/notebooks/material/persist.ipynb) notebook in the [More Spark](http://np2rvlapxx507/DAP_CATS/Training/more-spark) course.

### Documentation

[`.checkpoint()`](https://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html?highlight=checkpoint#pyspark.sql.DataFrame.checkpoint)