# Append table with deletion and vacuum

In [None]:
from utils.spark import get_spark

from pyspark.sql import functions as F
from delta import DeltaTable

from datetime import datetime
from random import randint
from time import sleep

In [None]:
spark = get_spark()

## Create some mock data

In this tutorial we assume that data arrives as snapshots that describe the current state of a set of operations.
The data might be retrieved actively by scraping a server, but a streaming source is equally conceivable.

In this use case, we are interested not only in the most recent state but also in the history of the states. Unlike in the previous tutorial, we therefore do not want to overwrite existing values.

In [None]:
def create_data():
    return spark.createDataFrame([
            {"operation": 1, "state": randint(0, 5)},
            {"operation": 2, "state": randint(0, 5)},
            {"operation": 3, "state": randint(0, 5)},
            {"operation": 4, "state": randint(0, 5)},
        ]).withColumn("timestamp", F.lit(datetime.utcnow().isoformat()))

In [None]:
df = create_data()
df.show()

## Write into Delta Table

We will append this data to a Delta table. Delta enforces the table schema, so an append with an incompatible schema raises an error.
We also partition by `operation` for demonstration purposes. Partitioning can pay off if we often filter by operation.
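
To make the effect of `partitionBy("operation")` more concrete, here is a minimal sketch in plain Python of how rows map to Hive-style partition directories such as `operation=1`. The helper names `partition_dir` and `group_by_partition` are hypothetical and not part of Spark or Delta, and the sketch ignores details such as escaping of special characters.

```python
from collections import defaultdict


def partition_dir(column, value):
    """Return the Hive-style directory name for one partition value."""
    return f"{column}={value}"


def group_by_partition(rows, column):
    """Group rows by the partition directory that would hold them."""
    layout = defaultdict(list)
    for row in rows:
        layout[partition_dir(column, row[column])].append(row)
    return dict(layout)


# Hypothetical sample rows, shaped like the mock data in this notebook.
rows = [
    {"operation": 1, "state": 3},
    {"operation": 2, "state": 0},
    {"operation": 1, "state": 5},
]

layout = group_by_partition(rows, "operation")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

A filter on `operation` then only needs to touch the matching directory, which is why partitioning by a frequently filtered column can be worthwhile.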

In [None]:
!rm -rf /data/operation-history

In [None]:
df.write.format("delta").mode("append").partitionBy("operation").save("/data/operation-history")

In [None]:
!ls /data/operation-history

In [None]:
deltaTable = DeltaTable.forPath(spark, "/data/operation-history")

In [None]:
deltaTable.history().toPandas()

Now we append more data in a loop. Since this is a Delta table, we can inspect the history as usual:

In [None]:
for i in range(4):
    df = create_data()
    df.write.format("delta").mode("append").partitionBy("operation").save("/data/operation-history")
    sleep(3)

In [None]:
deltaTable.history().orderBy("version").toPandas()

## Visualize the state history

Let's now read this table and make a line plot with pandas to visualize the state history.

In [None]:
df = spark.read.format("delta").load("/data/operation-history")

In [None]:
df.orderBy("timestamp", "operation").toPandas().head()

In [None]:
df.toPandas().pivot(index='timestamp', columns='operation', values='state').plot()

## Delete some data

Delta tables allow deletion of data. We could, for example, imagine a GDPR request to delete operation 1.
The deletion appears to work: operation 1 no longer shows up in the plot.

In [None]:
deltaTable.delete('operation=1')

In [None]:
df = spark.read.format("delta").load("/data/operation-history")

In [None]:
df.toPandas().pivot(index='timestamp', columns='operation', values='state').plot()

In [None]:
deltaTable.history().orderBy("version").toPandas()

## Vacuum data

However, when we look into the data folder, we see that the data files are still there.
Using time travel, we can also reproduce the previous state and see the plot with operation 1 again.
This is useful in case somebody made a mistake!
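
The version to travel back to does not have to be hard-coded. As a minimal sketch, the last version before a delete can be derived from the history rows. The function name `last_version_before` and the sample rows are hypothetical. The sketch assumes each row exposes `version` and `operation` fields, as the history table does, and that versions are consecutive integers.

```python
def last_version_before(history_rows, operation_name="DELETE"):
    """Return the version just before the first commit with the given operation.

    Returns None if no such commit exists or if it is the very first commit.
    """
    ordered = sorted(history_rows, key=lambda row: row["version"])
    for row in ordered:
        if row["operation"] == operation_name:
            previous = row["version"] - 1
            return previous if previous >= 0 else None
    return None


# Hypothetical history: five appends followed by one delete.
sample_history = [
    {"version": 0, "operation": "WRITE"},
    {"version": 1, "operation": "WRITE"},
    {"version": 2, "operation": "WRITE"},
    {"version": 3, "operation": "WRITE"},
    {"version": 4, "operation": "WRITE"},
    {"version": 5, "operation": "DELETE"},
]

print(last_version_before(sample_history))
```

In this notebook the rows could presumably be taken from `deltaTable.history().select("version", "operation").collect()`, and the result passed to the `versionAsOf` option.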

In [None]:
!ls /data/operation-history/operation=1

In [None]:
df = spark.read.format("delta").option("versionAsOf", 4).load("/data/operation-history")

In [None]:
df.toPandas().pivot(index='timestamp', columns='operation', values='state').plot()

For GDPR requests, however, this is still a problem: the data must actually be deleted.
The `vacuum` command removes data files that the current table version no longer references.
By default, only such files older than 168 hours (7 days) are removed.
Here, we want to remove all of them immediately. For this we have to disable a safety check in the Spark config.
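
The retention rule can be illustrated with a small sketch in plain Python. It is a simplification under stated assumptions: the names `vacuum_cutoff` and `is_removable` are hypothetical, and the sketch only models the age check for files that the current table version no longer references, not the actual Delta implementation.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_HOURS = 168  # 7 days


def vacuum_cutoff(now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Return the timestamp before which unreferenced files count as expired."""
    return now - timedelta(hours=retention_hours)


def is_removable(file_modified, now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Check the age rule for a file that is no longer referenced by the table."""
    return file_modified < vacuum_cutoff(now, retention_hours)


# Hypothetical timestamps: the file was written one day before "now".
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
file_modified = datetime(2024, 1, 7, tzinfo=timezone.utc)

print(is_removable(file_modified, now))      # one day old, default retention
print(is_removable(file_modified, now, 0))   # retention of 0 hours
```

With the default retention, a file that is one day old is kept. With a retention of 0 hours it qualifies for removal, which is the behaviour this notebook relies on.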

In [None]:
# Expected to fail: the retention duration check rejects a retention below the default
deltaTable.vacuum(0)

In [None]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

In [None]:
deltaTable.vacuum(0)

Now vacuuming works and, as we see below, the data files are really deleted.

In [None]:
!ls /data/operation-history/operation=1