# Deleting at Partition Boundaries

While we've deleted our PII from our silver tables, we haven't dealt with the fact that this data still exists in our **`bronze`** table.

Note that because of stream composability and the design choice to use a multiplex bronze pattern, enabling Delta Change Data Feed (CDF) to propagate delete information would require redesigning each of our pipelines to take advantage of this output. Without using CDF, modification of data in a table will break downstream composability.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_bronze.png" width="60%" />

In this notebook, you'll learn how to delete partitions of data from Delta tables and how to configure incremental reads to allow for these deletes.

This pattern is useful beyond permanently deleting PII: companies can also apply it to expunge data older than a certain age from a given table. Similarly, data could be backed up to a cheaper storage tier and then safely deleted from "active" or "hot" Delta tables to drive savings on cloud storage.

## Learning Objectives
By the end of this notebook, students will be able to:
- Delete data using partition boundaries
- Configure downstream incremental reads to safely ignore these deletions
- Use **`VACUUM`** to review files to be deleted and commit deletes
- Union archived data with production tables to recreate a full historic dataset

Begin by running our setup script.

In [0]:
%run ../Includes/Classroom-Setup-7.3

Our Delta table is partitioned by two fields. 

Our top level partition is the **`topic`** column. 

Run the cell to note the 3 partition directories (and the Delta Log directory) that collectively comprise our **`bronze`** table.

In [0]:
files = dbutils.fs.ls(f"{DA.paths.user_db}/bronze")
display(files)

Our second-level partition is on the **`week_part`** column, which we derived as the year and week of year. There are around 20 directories currently present at this level.

In [0]:
files = dbutils.fs.ls(f"{DA.paths.user_db}/bronze/topic=user_info")
display(files)
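
The filters used later in this notebook compare **`week_part`** values as strings, which only matches chronological order because the week number is zero-padded. The sketch below is a plain-Python illustration, not the code that built the **`bronze`** table. It assumes ISO week numbering, which may differ from the pipeline's own derivation, and `week_part_for` is a hypothetical helper name.

```python
from datetime import date


def week_part_for(d: date) -> str:
    """Format a date as 'YYYY-WW' using ISO year and ISO week (an assumption)."""
    iso_year, iso_week, _ = d.isocalendar()
    # Zero-pad the week so lexical order matches chronological order.
    return f"{iso_year}-{iso_week:02d}"


sunday = week_part_for(date(2019, 12, 1))   # last day of ISO week 48
monday = week_part_for(date(2019, 12, 2))   # first day of ISO week 49

# With padding, string comparison behaves like a date comparison.
padded_ok = week_part_for(date(2019, 2, 25)) <= "2019-48"
# Without padding, week 9 would wrongly sort after week 48.
unpadded_wrong = "2019-9" <= "2019-48"
```

Under these assumptions, a filter such as `week_part <= '2019-48'` selects every week up to and including week 48 of 2019.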

Note that in our current dataset, we're tracking only a small number of total users in these files.

In [0]:
total = (spark.table("bronze")
              .filter("topic='user_info'")
              .filter("week_part<='2019-48'")
              .count())
         
print(f"Total: {total}")

## Archiving Data
If a company wishes to maintain an archive of historic records (but only maintain recent records in production tables), cloud-native settings for auto-archiving data can be configured to move data files automatically to lower-cost storage locations.

The cell below simulates this process (here using copy instead of move). 

Note that because only the data files and partition directories are being relocated, the resultant table will be Parquet by default.

**NOTE**: For best performance, directories should have **`OPTIMIZE`** run to condense small files. Because valid and stale data files are stored side-by-side in Delta Lake directories, partitions should also have **`VACUUM`** executed prior to moving any Delta Lake data to a pure Parquet table to ensure only valid files are copied.

In [0]:
archive_path = f"{DA.paths.working_dir}/pii_archive"
source_path = f"{DA.paths.user_db}/bronze/topic=user_info"

files = dbutils.fs.ls(source_path)
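# Copy each week directory at or before 2019-48; f[1] is the directory name, e.g. "week_part=2019-48/"
for f in files:
    if f[1][-8:-1] <= '2019-48':
        dbutils.fs.cp(f[0], f"{archive_path}/{f[1]}", True)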

spark.sql(f"""
CREATE TABLE IF NOT EXISTS user_info_archived
USING parquet
LOCATION '{archive_path}'
""")

spark.sql("MSCK REPAIR TABLE user_info_archived")

display(spark.sql("SELECT COUNT(*) FROM user_info_archived"))
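
The slice `f[1][-8:-1]` in the cell above relies on every directory name ending in a seven-character week value followed by a trailing slash. A slightly more defensive way to select directories is sketched below in plain Python. `partition_value` and `dirs_to_archive` are hypothetical helper names, and the sample names only mimic what `dbutils.fs.ls` returns.

```python
def partition_value(dir_name: str, column: str = "week_part"):
    """Return the value from a 'column=value/' directory name, else None."""
    prefix = f"{column}="
    name = dir_name.rstrip("/")
    if not name.startswith(prefix):
        return None  # e.g. '_delta_log/' or an unrelated directory
    return name[len(prefix):]


def dirs_to_archive(dir_names, cutoff: str):
    """Keep directories whose week_part is at or before the cutoff."""
    selected = []
    for name in dir_names:
        value = partition_value(name)
        if value is not None and value <= cutoff:
            selected.append(name)
    return selected


sample = ["week_part=2019-47/", "week_part=2019-48/", "week_part=2019-49/", "_delta_log/"]
selected = dirs_to_archive(sample, "2019-48")
```

Parsing on the `=` separator keeps the selection correct even if a directory name has a different length than expected.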

Note that the directory structure was maintained as files were copied.

In [0]:
files = dbutils.fs.ls(archive_path)
display(files)

## Deleting at a Partition Boundary
Here we'll model deleting all **`user_info`** that was received before week 49 of 2019.

Note that we are deleting cleanly along partition boundaries. All the data contained in the specified **`week_part`** directories will be removed from our table.

In [0]:
%sql
DELETE FROM bronze 
WHERE topic = 'user_info'
AND week_part <= '2019-48'

We can confirm this delete processed successfully by looking at the history. The **`operationMetrics`** column will indicate the number of removed files.

In [0]:
%sql
DESCRIBE HISTORY bronze

When deleting along partition boundaries, we don't write out new data files; recording the files as removed in the Delta log is sufficient. 

However, file deletion will not actually occur until we **`VACUUM`** our table. 

Note that all of our week partitions still exist in our **`user_info`** directory and that data files still exist in each week directory.

In [0]:
files = dbutils.fs.ls(f"{source_path}/week_part=2019-48")
display(files)
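
To see why a partition-aligned delete needs no new data files, here is a toy model in plain Python. It is not how Delta Lake is implemented: `ToyTable` and its methods are hypothetical, the "log" is just a list of add and remove actions, and this toy `vacuum` ignores retention thresholds entirely.

```python
class ToyTable:
    """Toy stand-in for a transaction log plus a storage directory."""

    def __init__(self):
        self.log = []          # ordered ("add" | "remove", path) actions
        self.storage = set()   # files physically present

    def add_file(self, path):
        self.storage.add(path)
        self.log.append(("add", path))

    def live_files(self):
        """Replay the log to find files that belong to the current version."""
        live = set()
        for action, path in self.log:
            if action == "add":
                live.add(path)
            else:
                live.discard(path)
        return live

    def delete_partition(self, partition_dir):
        """Logical delete: record removals, leave storage untouched."""
        for path in sorted(self.live_files()):
            if path.startswith(partition_dir + "/"):
                self.log.append(("remove", path))

    def vacuum(self):
        """Physical delete: drop stored files the current version no longer uses."""
        stale = self.storage - self.live_files()
        self.storage -= stale
        return sorted(stale)


table = ToyTable()
table.add_file("week_part=2019-48/part-0.parquet")
table.add_file("week_part=2019-49/part-0.parquet")

table.delete_partition("week_part=2019-48")
files_after_delete = sorted(table.storage)    # both files still on disk
live_after_delete = sorted(table.live_files())

vacuumed = table.vacuum()
files_after_vacuum = sorted(table.storage)
```

In this model the delete only appends to the log, which mirrors the observation above that the week directories and their data files are still present after the `DELETE`.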

## Reviewing and Committing Deletes

By default, the Delta engine will prevent **`VACUUM`** operations with less than 7 days of retention. The cell below overrides this check.

In [0]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)

Adding the **`DRY RUN`** keyword to the end of our **`VACUUM`** statement allows us to preview files to be deleted before they are permanently removed. 

Note that at this moment we could still recover our deleted data by running:

<strong><code>
RESTORE bronze<br/>
TO VERSION AS OF {version}
</code></strong>

In [0]:
%sql
VACUUM bronze RETAIN 0 HOURS DRY RUN

Executing the **`VACUUM`** command below permanently deletes these files.

In [0]:
%sql
VACUUM bronze RETAIN 0 HOURS

For safety, it's best to always re-enable our **`retentionDurationCheck`**. In production, you should avoid overriding this check whenever possible (if other operations are acting against files not yet committed to a Delta table and written before the retention threshold, **`VACUUM`** can result in data corruption).

In [0]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", True)
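
The retention threshold can be pictured as a cutoff timestamp: only files that left the table before the cutoff are candidates for physical deletion. The plain-Python sketch below illustrates that idea under simplifying assumptions. `vacuum_candidates` is a hypothetical helper, not a Delta Lake API, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def vacuum_candidates(removed_at, now, retention_hours):
    """Return paths whose removal time is older than the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(path for path, ts in removed_at.items() if ts < cutoff)


now = datetime(2020, 1, 10, 12, 0)
removed_at = {
    "week_part=2019-47/part-0.parquet": now - timedelta(days=9),
    "week_part=2019-48/part-0.parquet": now - timedelta(hours=1),
}

# A 7-day window protects the recently removed file.
default_window = vacuum_candidates(removed_at, now, retention_hours=7 * 24)
# A 0-hour window makes every removed file eligible immediately.
zero_window = vacuum_candidates(removed_at, now, retention_hours=0)
```

With a zero-hour window nothing is protected, which is why the override should stay limited to cases like this demonstration.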

Note that empty directories will eventually be cleaned up by **`VACUUM`**, but they may not be deleted at the moment they are emptied of data files.

In [0]:
try:
    files = dbutils.fs.ls(source_path)
    print("The files have NOT YET been deleted.")
except Exception:
    print("The files WERE deleted.")

As such, querying the **`bronze`** table with the same filters used in our delete statement should yield 0 records.

In [0]:
%sql
SELECT * 
FROM bronze 
WHERE topic='user_info' AND 
      week_part <= '2019-48'

## Recreating Full Table History

Note that because Parquet uses directory partitions as columns in the resulting dataset, the data that was backed up no longer has a **`topic`** field in its schema.

The logic below addresses this while calling **`UNION`** on the archived and production datasets to recreate the full history of the **`user_info`** topic.

In [0]:
%sql
WITH full_bronze_user_info AS (

  SELECT key, value, partition, offset, timestamp, date, week_part 
  FROM bronze 
  WHERE topic='user_info'
  
  UNION SELECT * FROM user_info_archived) 
  
SELECT COUNT(*) FROM full_bronze_user_info
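
The missing **`topic`** column follows from how partition columns are inferred from directory names below a table's root. The sketch below mimics that inference in plain Python. `partition_columns` is a hypothetical helper and the paths are shortened examples, not the real storage locations.

```python
def partition_columns(file_path: str, table_root: str) -> dict:
    """Collect 'column=value' directory segments found below table_root."""
    relative = file_path[len(table_root):].strip("/")
    columns = {}
    for segment in relative.split("/")[:-1]:   # last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            columns[name] = value
    return columns


# In the bronze table, both partition levels sit below the table root.
bronze_cols = partition_columns(
    "/db/bronze/topic=user_info/week_part=2019-48/part-0.parquet", "/db/bronze")

# Only the week_part directories were copied, so the archive root sits
# where topic=user_info used to be and topic is no longer in the path.
archive_cols = partition_columns(
    "/work/pii_archive/week_part=2019-48/part-0.parquet", "/work/pii_archive")
```

This is why the query above lists the **`bronze`** columns explicitly, without **`topic`**, before the **`UNION`**.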

## Updating Streaming Reads to Ignore Changes

The cell below condenses all the code used to perform streaming updates to our **`users`** table.

Running the original version of this code against **`bronze`** now would raise an exception:
> Detected deleted data from streaming source

Line 22 of the cell below adds **`.option("ignoreDeletes", True)`** to the DataStreamReader. This option is all that is necessary to enable streaming processing from Delta tables with partition deletes.

In [0]:
from pyspark.sql import functions as F

schema = """
    user_id LONG, 
    update_type STRING, 
    timestamp FLOAT, 
    dob STRING, 
    sex STRING, 
    gender STRING, 
    first_name STRING, 
    last_name STRING, 
    address STRUCT<
        street_address: STRING, 
        city: STRING, 
        state: STRING, 
        zip: INT
    >"""

salt = "BEANS"

unpacked_df = (spark.readStream
                    .option("ignoreDeletes", True)     # This is new!
                    .table("bronze")
                    .filter("topic = 'user_info'")
                    .dropDuplicates()
                    .select(F.from_json(F.col("value").cast("string"), schema).alias("v")).select("v.*")
                    .select(F.sha2(F.concat(F.col("user_id"), F.lit(salt)), 256).alias("alt_id"),
                            F.col('timestamp').cast("timestamp").alias("updated"),
                            F.to_date('dob','MM/dd/yyyy').alias('dob'),
                            'sex', 'gender','first_name','last_name','address.*', "update_type"))



def batch_rank_upsert(microBatchDF, batchId):
    from pyspark.sql.window import Window
    
    window = Window.partitionBy("alt_id").orderBy(F.col("updated").desc())
    
    (microBatchDF
        .filter(F.col("update_type").isin(["new", "update"]))
        .withColumn("rank", F.rank().over(window)).filter("rank == 1").drop("rank")
        .createOrReplaceTempView("ranked_updates"))
    
    microBatchDF._jdf.sparkSession().sql("""
        MERGE INTO users u
        USING ranked_updates r
        ON u.alt_id=r.alt_id
        WHEN MATCHED AND u.updated < r.updated
          THEN UPDATE SET *
        WHEN NOT MATCHED
          THEN INSERT *
    """)

In [0]:
query = (unpacked_df.writeStream
                    .foreachBatch(batch_rank_upsert)
                    .outputMode("update")
                    .option("checkpointLocation", f"{DA.paths.checkpoints}/batch_rank_upsert")
                    .trigger(once=True)
                    .start())    

query.awaitTermination()
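
The ranking and **`MERGE`** logic inside `batch_rank_upsert` can be traced on plain Python dictionaries. This sketch is only an illustration: `latest_per_key` and `merge_into` are hypothetical helpers, rows are reduced to three fields, and unlike `F.rank()` it keeps a single row when two updates tie on `updated`.

```python
def latest_per_key(rows):
    """Keep the newest 'new' or 'update' row for each alt_id."""
    latest = {}
    for row in rows:
        if row["update_type"] not in ("new", "update"):
            continue
        current = latest.get(row["alt_id"])
        if current is None or row["updated"] > current["updated"]:
            latest[row["alt_id"]] = row
    return latest


def merge_into(target, updates):
    """Mimic the MERGE: update only if strictly newer, insert if missing."""
    for alt_id, row in updates.items():
        existing = target.get(alt_id)
        if existing is None or existing["updated"] < row["updated"]:
            target[alt_id] = row
    return target


users = {"a": {"alt_id": "a", "updated": 5, "update_type": "new"}}
batch = [
    {"alt_id": "a", "updated": 3, "update_type": "update"},  # older than target
    {"alt_id": "b", "updated": 1, "update_type": "new"},
    {"alt_id": "b", "updated": 2, "update_type": "update"},  # newest for "b"
    {"alt_id": "c", "updated": 9, "update_type": "delete"},  # filtered out
]
merge_into(users, latest_per_key(batch))
```

Here the stale update for `"a"` is ignored, `"b"` is inserted with its newest row, and `"c"` never reaches the target.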

Note that we may see the table version increment as this code completes.

In [0]:
%sql
DESCRIBE HISTORY users

However, by examining the Delta log file for this version, we'll note that the file written out just indicates the data change, and that no new records were added or modified.

In [0]:
users_log_path = f"{DA.paths.user_db}/users/_delta_log"
files = dbutils.fs.ls(users_log_path)

max_version = max([file.name for file in files if file.name.endswith(".json")])
display(spark.read.json(f"{users_log_path}/{max_version}"))
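
Taking `max` over the file names works because Delta commit files are named with a zero-padded version number, so lexical and numeric order agree. The sketch below makes the version number explicit. `latest_commit` is a hypothetical helper and the file names are illustrative.

```python
def latest_commit(file_names):
    """Return (version, name) for the highest-numbered JSON commit file."""
    commits = []
    for name in file_names:
        stem, _, ext = name.partition(".")
        # Skip checkpoints, checksums, and anything that is not a commit file.
        if ext == "json" and stem.isdigit():
            commits.append((int(stem), name))
    return max(commits)


names = [
    "00000000000000000002.json",
    "00000000000000000010.json",
    "00000000000000000010.checkpoint.parquet",
    "00000000000000000009.crc",
]
version, name = latest_commit(names)
```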

## Next Steps
Although we did not modify data in our **`workout`** or **`bpm`** partitions, the streams for those topics read from the same **`bronze`** table, so we'll also need to update their DataStreamReader logic to ignore these deletes.

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()