# Checkpoint and Staging tables

## Persisting to disk

Spark uses lazy evaluation. As we build up many transformations, e.g. `.withColumn()`, Spark creates an execution plan for the DataFrame and the plan is executed when an action is called, e.g. `show()`. This execution plan represents the DataFrame's lineage.

Sometimes the DataFrame's lineage can grow long and complex, which can slow down processing and may even cause an error. However, we can get around this by breaking the lineage.

There is more than one way of breaking the lineage; this is discussed in more detail in the **Persistence** article. In this article we cover two methods of persisting to disk. The first is `.checkpoint()`, which is a shortcut for a write/read operation. The second is a more manual process that does essentially the same thing by writing a Hive table.

## Experiment

To demonstrate the benefit of persisting to disk we'll time how long it takes to create a DataFrame using an iterative calculation. We will run the process without persisting, then again using `.checkpoint()`, then a third time using staging tables.

We'll create a new Spark session each time just in case there's an advantage when processing the DataFrame a second time in the same session. For this reason we will put the Spark session into a function. We will also use the Python module `time` to measure the time taken to create the DataFrame. 
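The cells in this article call `time()` before and after the code being measured. As an optional alternative, the same measurement can be wrapped in a small context manager so that each run is timed in the same way. This is only a sketch: the names `timer` and `timings` are hypothetical and are not used elsewhere in this article, and the `sum()` call stands in for the DataFrame code.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timer(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs are still timed
        results[label] = perf_counter() - start


timings = {}
with timer("example", timings):
    total = sum(range(1000))  # stand-in for creating and showing a DataFrame

print(f"Time taken: {timings['example']:.4f} seconds")
```

Storing each result under a label makes it easier to compare the three approaches side by side at the end of the notebook.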

### No persistence

In [8]:
import os
from pyspark.sql import SparkSession, functions as F
from time import time
import yaml

spark = (SparkSession.builder.master("local[2]")
         .appName("checkpoint")
         .getOrCreate())

We're going to create a new DataFrame with an `id` column and a column called `col_0` that will consist of random numbers. We'll then create a loop to add new columns where the values depend on a previous column. In general, we try to avoid using loops with Spark and this example shows why. But sometimes using loops to create new columns is the only way of achieving what we need.

We will set a `seed_num` when creating the random numbers to make the results repeatable. The DataFrame will have `num_rows` rows, which we will set to a thousand, and the loop will iterate 11 times to create `col_1` to `col_11`.

The contents of the columns aren't important here. What is important is that Spark is creating an execution plan that is getting longer with each iteration of the loop.

In [9]:
new_cols = 12
seed_num = 42
num_rows = 10**3

with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)

checkpoint_path = config["checkpoint_path"] + "/temp.parquet"

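import subprocess

def write_delete(sdf, checkpoint_path):
    # Write the DataFrame to parquet, then delete the files from HDFS
    sdf.write.mode("overwrite").parquet(checkpoint_path)
    cmd = f'hdfs dfs -rm -r -skipTrash {checkpoint_path}'
    p = subprocess.run(cmd, shell=True)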

### Without `.checkpoint()`

In [18]:
start_time = time()

df = spark.range(num_rows)
df = df.withColumn("col_0", F.ceil(F.rand(seed_num) * new_cols))

for i in range(1, new_cols):
    df = df.withColumn("col_"+str(i), F.when(F.col("col_"+str(i-1)) > i, F.col("col_"+str(i-1))).otherwise(0))

df.show()

time_taken = time() - start_time
print(f"Time taken to create the DataFrame:  {time_taken}")

+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
| id|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
|  0|    8|    8|    8|    8|    8|    8|    8|    8|    0|    0|     0|     0|
|  1|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  2|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  3|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  4|    6|    6|    6|    6|    6|    6|    0|    0|    0|    0|     0|     0|
|  5|    7|    7|    7|    7|    7|    7|    7|    0|    0|    0|     0|     0|
|  6|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  7|    2|    2|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  8|    4|    4|    4|    4|    0|    0|    0|    0|    0|    0|     0|     0|
|  9|    9|    9|    9|    9|    9|    9

The result above shows how long Spark took to create the plan and execute it to show the top 20 rows.

In [19]:
spark.stop()

spark = (SparkSession.builder.master("local[2]")
         .appName("checkpoint")
         .getOrCreate())

### With `.checkpoint()`

Now let's do the same thing again, but this time we will checkpoint the DataFrame every 3 iterations of the loop so that the lineage doesn't grow as long.

To perform a checkpoint we need to set up a checkpoint directory on HDFS, which is where the checkpointed DataFrames will be stored. It's important to practice good housekeeping with this directory because new files are created with every checkpoint, but they are **not** automatically deleted.

In [20]:
import os
username = os.getenv("HADOOP_USER_NAME")
checkpoint_path = f"/user/{username}/checkpoints"
spark.sparkContext.setCheckpointDir(checkpoint_path)

In [21]:
start_time = time()

df = spark.range(num_rows)
df = df.withColumn("col_0", F.ceil(F.rand(seed_num) * new_cols))

for i in range(1, new_cols):
    df = df.withColumn("col_"+str(i), F.when(F.col("col_"+str(i-1)) > i, F.col("col_"+str(i-1))).otherwise(0))
    if i % 3 == 0: # this means if i is divisible by three then...
        df = df.checkpoint() # here is the checkpoint
        
df.show()

time_taken = time() - start_time
print(f"Time taken to create the DataFrame:  {time_taken}")

+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
| id|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+
|  0|    8|    8|    8|    8|    8|    8|    8|    8|    0|    0|     0|     0|
|  1|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  2|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  3|   11|   11|   11|   11|   11|   11|   11|   11|   11|   11|    11|     0|
|  4|    6|    6|    6|    6|    6|    6|    0|    0|    0|    0|     0|     0|
|  5|    7|    7|    7|    7|    7|    7|    7|    0|    0|    0|     0|     0|
|  6|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  7|    2|    2|    0|    0|    0|    0|    0|    0|    0|    0|     0|     0|
|  8|    4|    4|    4|    4|    0|    0|    0|    0|    0|    0|     0|     0|
|  9|    9|    9|    9|    9|    9|    9

The exact times will vary with each run of this notebook, but hopefully you will see that using `.checkpoint()` was more efficient.

As mentioned earlier, the checkpoint files are not deleted on HDFS automatically. The files are not intended to be used after you stop the Spark session, so make sure you delete these files after a session. There have been cases in ONS where checkpoint directories have grown to be terabytes in size, which can create problems for other DAP users.

The easiest way to delete files is in HUE, but the cell below is handy to have at the end of your scripts to make sure you don't forget to empty the checkpoint folder.

In [16]:
import subprocess
cmd = f'hdfs dfs -rm -r -skipTrash {checkpoint_path}' 
p = subprocess.run(cmd, shell=True)

Here `checkpoint_path` is the path that was passed to `.setCheckpointDir()`.
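Because this command deletes files permanently, it can be worth guarding against an empty or root path before running it. The sketch below builds the same `hdfs dfs -rm -r -skipTrash` command as an argument list, which avoids `shell=True`. The function names `hdfs_rm_command` and `delete_checkpoints` and the example path are hypothetical, and `delete_checkpoints` assumes the `hdfs` command is available on the machine running the script.

```python
import subprocess


def hdfs_rm_command(path):
    """Build the argument list for deleting an HDFS directory, skipping the trash."""
    if not path or path.strip() in ("", "/"):
        raise ValueError("Refusing to build a delete command for an empty or root path")
    return ["hdfs", "dfs", "-rm", "-r", "-skipTrash", path]


def delete_checkpoints(path):
    """Run the delete command; requires the hdfs command line tool."""
    # check=True raises CalledProcessError if the hdfs command fails
    return subprocess.run(hdfs_rm_command(path), check=True)


# Only build and print the command here; nothing is deleted
cmd = hdfs_rm_command("/user/example_user/checkpoints")
print(" ".join(cmd))
```

Calling `delete_checkpoints(checkpoint_path)` at the end of a script would then remove the checkpoint directory, or raise an error if the command fails.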

### Is 3 the magic number?

How did we come up with the number 3 for number of iterations to checkpoint? Trial and error. Unfortunately, you may not have the luxury of trying to find the optimum number, but have a go at checkpointing and see if you can get any improvements in performance.
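When experimenting with different intervals, it can help to see which iterations of the loop would trigger a checkpoint and how many iterations the lineage is allowed to grow between breaks. The helper below is a hypothetical sketch that mirrors the `for i in range(1, new_cols)` loop and the `i % 3 == 0` condition used above; it does not run Spark and does not predict timings.

```python
def checkpoint_plan(new_cols, interval):
    """Return the loop iterations that checkpoint and the longest run of
    loop iterations between lineage breaks."""
    # Same condition as in the loop: checkpoint when i is divisible by interval
    iterations = [i for i in range(1, new_cols) if i % interval == 0]
    # The lineage starts at 0 and the loop finishes after iteration new_cols - 1
    boundaries = [0] + iterations + [new_cols - 1]
    longest_run = max(b - a for a, b in zip(boundaries, boundaries[1:]))
    return iterations, longest_run


for interval in (2, 3, 5):
    iterations, longest_run = checkpoint_plan(12, interval)
    print(f"interval={interval}: checkpoints at {iterations}, longest run {longest_run}")
```

A smaller interval keeps the lineage shorter but writes to disk more often, so the best value still has to be found by timing real runs.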

### Why not use `.cache()`?

`.cache()` and `.checkpoint()` work in different ways, and each way of persisting has its own pros and cons. For more information on the different ways of persisting data in Spark, please see our [persist](http://np2rvlapxx507/DAP_CATS/Training/more-spark/blob/master/more_spark/notebooks/material/persist.ipynb) notebook in the [More Spark](http://np2rvlapxx507/DAP_CATS/Training/more-spark) courses.

### Documentation

[`.checkpoint()`](https://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html?highlight=checkpoint#pyspark.sql.DataFrame.checkpoint)

## Staging Tables

Staging tables are an alternative way of checkpointing data in Spark, in which the data is written out as a named Hive table in a database, rather than to the checkpointing location.

### A quick recap of persisting

Persisting in Spark is where we store the data at an intermediate point of the code in memory or on disk. This is generally done with `.cache()` (to store the data in memory) or `.checkpoint()` (to write it to disk). If you are not familiar with the concept of persisting, please read the [Persisting in Spark notebook](http://np2rvlapxx507/DAP_CATS/Training/more-spark/blob/master/more_spark/notebooks/material/persist.ipynb) which explains these concepts in more detail.

Importantly, if your code is short and simple then it is unlikely you will need any form of persisting, but if you have long and complex code, sensible persisting can help make it more efficient.

### Staging tables: the concept

You can write a staging table to HDFS with `df.write.mode("overwrite").saveAsTable(table_name, format="parquet")` or `df.write.insertInto(table_name, overwrite=True)` (of course, if using `.insertInto()` you will need to create the table first). You can then read the table back in with `spark.read.table()`. As with checkpointing, this breaks the lineage of the DataFrame, so staging tables can be useful in large, complex pipelines, or those that involve processes in a loop. As Spark is more efficient at reading in tables than CSV files, another use case is staging CSV files as tables at the start of your code before doing any complex calculations.

Staging has some advantages over checkpointing:
- The same table can be overwritten, meaning there is no need to clean up old checkpointed files
- It is stored in a location that is easier to access, rather than the checkpointing folder, which can help with debugging and testing changes to the code
- They can be re-used elsewhere
- If `.insertInto()` is used, you can take advantage of the table schema, as an exception will be raised if the DataFrame and table schemas do not match
- It is more efficient for Spark to read Hive tables than CSV files as the underlying format is Parquet, so if your data are delivered as CSV files you may want to stage them as Hive tables first. For more information see the [Storing as a Parquet tip](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week/blob/master/tip_18_parquet.ipynb).

There are also some disadvantages:
- It takes longer to write the code
- The code is more difficult to maintain, especially if `.insertInto()` is used, as you will have to alter the table if the DataFrame structure changes
- There is a risk of using them unnecessarily (the same is true of any method of persisting data)

The examples here use PySpark, but the same principles apply to R users who are using sparklyr in DAP.

In [24]:
table_name = "train_tmp.checkpoint"

In [26]:
start_time = time()

df = spark.range(num_rows)
df = df.withColumn("col_0", F.ceil(F.rand(seed_num) * new_cols))

for i in range(1, new_cols):
    df = df.withColumn("col_"+str(i), F.when(F.col("col_"+str(i-1)) > i, F.col("col_"+str(i-1))).otherwise(0))
    if i % 3 == 0: # this means if i is divisible by three then...
        df.withColumn("iteration", F.lit(i)).write.mode("overwrite").saveAsTable(table_name, format="parquet")
        df = spark.read.table(table_name)
        
df.show()

time_taken = time() - start_time
print(f"Time taken to create the DataFrame:  {time_taken}")

AnalysisException: 'Cannot overwrite table train_tmp.checkpoint that is also being read from;'
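The error above is raised at the second write. After the first write, `df` is read from `train_tmp.checkpoint`, so the next `.saveAsTable()` call tries to overwrite the table that its own plan still needs to read. One possible workaround, sketched below, is to alternate between two staging tables so that the table being written is never the one being read. This is an assumption-laden sketch rather than a tested fix for this notebook: the helper names `staging_table_for` and `build_with_staging` and the `_a`/`_b` table suffixes are hypothetical, and the `iteration` column from the cell above is left out for brevity.

```python
def staging_table_for(base_name, write_number):
    """Alternate between two table names: base_a, base_b, base_a, ..."""
    suffix = "a" if write_number % 2 == 0 else "b"
    return f"{base_name}_{suffix}"


def build_with_staging(spark, base_name, num_rows=10**3, new_cols=12,
                       seed_num=42, interval=3):
    """Rebuild the example DataFrame, staging to alternating tables."""
    # Imported here so the name helper above works without PySpark installed
    from pyspark.sql import functions as F

    df = spark.range(num_rows)
    df = df.withColumn("col_0", F.ceil(F.rand(seed_num) * new_cols))

    write_number = 0
    for i in range(1, new_cols):
        prev = F.col(f"col_{i - 1}")
        df = df.withColumn(f"col_{i}", F.when(prev > i, prev).otherwise(0))
        if i % interval == 0:
            # Write to the table that the current plan is NOT reading from
            table_name = staging_table_for(base_name, write_number)
            df.write.mode("overwrite").saveAsTable(table_name, format="parquet")
            df = spark.read.table(table_name)
            write_number += 1
    return df


# The names alternate, so consecutive writes never target the same table
print([staging_table_for("train_tmp.checkpoint", n) for n in range(3)])
```

With an active Spark session, `build_with_staging(spark, table_name).show()` would run the loop. Both staging tables remain in the database afterwards, so they should be dropped once they are no longer needed.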