# Iceberg PII Data Deletion Demo


This notebook walks through the process of creating an Iceberg table, adding data, deleting PII, and then permanently removing the history containing the PII.


## 1. Setup


First, we need to import pyspark and set up our Spark session. The configuration for the S3 endpoint and Iceberg catalog is already handled by the `docker-compose.yml` file.


In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IcebergPIIDemo").getOrCreate()


## 2. Create the Iceberg Table


Next, we'll create an Iceberg table called `pii_data` in our `demo` catalog. The schema will include the PII columns we want to manage.


In [None]:
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.pii_data (
    case_id STRING,
    first_name STRING,
    email_address STRING,
    key_nm STRING,
    secure_txt STRING,
    secure_key STRING,
    update_date DATE
)
USING iceberg
""")


## 3. Seed the Table with Data


Now, let's insert some sample data into our table. We'll add two records, one of which we will target for PII deletion.


In [None]:
spark.sql("""
INSERT INTO demo.pii_data VALUES
('case-1', 'John', 'john.doe@example.com', 'key1', 'secret text 1', 'secret_key_1', DATE '2023-01-01'),
('case-2', 'Jane', 'jane.doe@example.com', 'key2', 'secret text 2', 'secret_key_2', DATE '2023-01-02')
""")


Let's verify the data is there.


In [None]:
spark.table("demo.pii_data").show()


We can also inspect the table's history to see the snapshot that was created when we inserted the data.


In [None]:
initial_snapshots = spark.table("demo.pii_data.history")
initial_snapshots.show()


## 4. Delete PII


Now, we will "delete" the PII for `case-1`. In this context, "deletion" means updating the PII columns to `NULL`. This is a common strategy for retaining the record for referential integrity while removing the sensitive information.


In [None]:
def delete_pii(case_id):
    spark.sql(f"""
    UPDATE demo.pii_data
    SET
        first_name = NULL,
        email_address = NULL,
        secure_txt = NULL
    WHERE case_id = '{case_id}'
    """)

delete_pii('case-1')
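
Because `delete_pii` interpolates `case_id` straight into the SQL text, a value containing a quote could break or alter the statement. The sketch below is one way to guard against that. The helper name `build_pii_update_sql`, the `PII_COLUMNS` tuple, and the allowed-character pattern are illustrative assumptions, not part of the notebook's setup.

```python
import re

# Columns that delete_pii sets to NULL in this notebook.
PII_COLUMNS = ("first_name", "email_address", "secure_txt")

# Assumed ID format: letters, digits, underscores, and hyphens (e.g. 'case-1').
_CASE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def build_pii_update_sql(case_id, table="demo.pii_data", columns=PII_COLUMNS):
    """Return an UPDATE statement that nulls the PII columns for one case."""
    if not _CASE_ID_PATTERN.fullmatch(case_id):
        raise ValueError(f"Unexpected case_id format: {case_id!r}")
    assignments = ", ".join(f"{col} = NULL" for col in columns)
    return f"UPDATE {table} SET {assignments} WHERE case_id = '{case_id}'"


print(build_pii_update_sql("case-1"))
```

Passing the returned string to `spark.sql(...)` should behave like the `delete_pii` cell, while rejecting IDs that do not match the assumed format.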


Let's check the data again. We should see that the PII for `case-1` is now gone.


In [None]:
spark.table("demo.pii_data").show()


If we look at the table history, we'll see a new snapshot has been added.


In [None]:
spark.table("demo.pii_data.history").show()


## 5. The Problem: Time Travel


Even though we've "deleted" the PII from the current view of the table, the old values still exist in the previous snapshot. Anyone with read access to the table can use time travel to see them.


In [None]:
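# `initial_snapshots` is evaluated lazily and may now contain both snapshots,
# so sort by commit time to pick the oldest one explicitly.
first_snapshot_id = initial_snapshots.orderBy("made_current_at").select("snapshot_id").first()[0]
spark.read.option("snapshot-id", first_snapshot_id).table("demo.pii_data").show()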
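
The same read can be expressed in SQL. A minimal sketch, assuming Spark 3.3 or later, where Iceberg supports the `VERSION AS OF` clause. The helper name `time_travel_sql` is hypothetical, and the snapshot ID in the example is made up.

```python
def time_travel_sql(table, snapshot_id):
    """Build a query that reads `table` as of a specific Iceberg snapshot ID."""
    # int() rejects anything that is not a plain numeric snapshot ID.
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


# Example with a made-up snapshot ID; in the notebook, pass first_snapshot_id.
print(time_travel_sql("demo.pii_data", 1234567890123456789))
```

The resulting string can be passed to `spark.sql(...).show()` to display the table as it looked at that snapshot.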


## 6. Permanent Deletion with Maintenance


To permanently remove the PII, we run two maintenance operations:
1.  **Expire Snapshots**: This removes old snapshots from the table's metadata, making time travel to those versions impossible. It also deletes data files that were referenced only by the expired snapshots.
2.  **Rewrite Data Files (VACUUM)**: This compacts the current data into new files and commits a new snapshot. It does not delete anything by itself; the files it replaces are removed once the snapshots that reference them are expired.


### Expire Old Snapshots


We'll expire every snapshot older than the current timestamp. By default, `expire_snapshots` always retains the most recent snapshot (`retain_last` defaults to 1), so only the snapshot created by the PII update survives.


In [None]:
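# Take the cutoff from Spark's current timestamp.
now = spark.sql("SELECT current_timestamp()").collect()[0][0]
# Expire every snapshot older than the cutoff; the latest snapshot is retained by default.
spark.sql(f"CALL demo.system.expire_snapshots(table => 'pii_data', older_than => TIMESTAMP '{now}')")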
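
A small helper can make the cutoff and retention explicit. This is a sketch that uses the `table`, `older_than`, and `retain_last` arguments of `expire_snapshots`; the helper name `expire_snapshots_sql`, the `catalog` default, and the example date are assumptions for illustration.

```python
from datetime import datetime


def expire_snapshots_sql(table, older_than, retain_last=1, catalog="demo"):
    """Build a CALL statement for Iceberg's expire_snapshots procedure."""
    # Format explicitly so the literal does not depend on datetime's str() output.
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S.%f")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {int(retain_last)})"
    )


print(expire_snapshots_sql("pii_data", datetime(2023, 1, 3, 12, 0, 0)))
```

Keeping `retain_last` at 1 means only the current snapshot is preserved regardless of the cutoff.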


Now, if we look at the history, we should only see the most recent snapshot.


In [None]:
spark.table("demo.pii_data.history").show()


### Rewrite Data Files (VACUUM)


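Expiring snapshots also deletes the data files that only those snapshots referenced, so with Iceberg's default copy-on-write update mode the Parquet file holding the original `case-1` values should already be gone from S3. The `rewrite_data_files` procedure is a compaction step rather than a true VACUUM: it consolidates the current data into new files and commits a new snapshot. It matters for PII removal mainly when the table uses merge-on-read updates, where the current snapshot still references the original data file and hides the old values with delete files. In that case, expire snapshots again after the rewrite so that the replaced files are physically deleted.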


In [None]:
spark.sql("CALL demo.system.rewrite_data_files('pii_data')")
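
For a merge-on-read table, or when stray files might remain in storage, a fuller cleanup sequence could look like the sketch below. It assumes the `rewrite-all` option of `rewrite_data_files` and the `dry_run` argument of `remove_orphan_files`. The helper name `maintenance_statements` and the example cutoff are hypothetical, and the statements are only printed here, not executed.

```python
def maintenance_statements(table, cutoff, catalog="demo"):
    """Return the CALL statements for a physical cleanup, in execution order."""
    prefix = f"CALL {catalog}.system"
    return [
        # 1. Rewrite all data files so the current snapshot stops referencing old ones.
        f"{prefix}.rewrite_data_files(table => '{table}', "
        "options => map('rewrite-all', 'true'))",
        # 2. Expire older snapshots, which deletes files only they referenced.
        f"{prefix}.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}')",
        # 3. List files not tracked by any table metadata, without deleting them.
        f"{prefix}.remove_orphan_files(table => '{table}', dry_run => true)",
    ]


for statement in maintenance_statements("pii_data", "2023-01-03 12:00:00"):
    print(statement)
```

The cutoff passed to the second statement needs to be later than the rewrite's commit time, otherwise the pre-rewrite snapshot is not expired. Note also that `remove_orphan_files` by default only considers files older than three days.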


## 7. Validation


Now, let's try to time travel back to the first snapshot. This should fail because the snapshot no longer exists.


In [None]:
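try:
    spark.read.option("snapshot-id", first_snapshot_id).table("demo.pii_data").show()
    # Reaching this line means the snapshot is still readable.
    print("Time travel still works: the snapshot was not expired.")
except Exception as e:
    # The read fails because the snapshot ID is no longer in the table metadata.
    print("Successfully prevented time travel!")
    print(e)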


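This confirms that the snapshot containing the PII can no longer be read through our Iceberg table. To be sure the data is physically gone, also check that the old data files no longer exist in S3.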
