# Iceberg PII Data Deletion Demo


This notebook walks through the process of creating an Iceberg table, adding data, deleting PII, and then permanently removing the history containing the PII.


## 1. Setup


First, we need to import pyspark and set up our Spark session. The configuration for the S3 endpoint and Iceberg catalog is already handled by the `docker-compose.yml` file.


In [20]:
# Import all utility functions using the import script
with open('import_utils.py') as f:
    exec(f.read())


Current directory: /home/iceberg/notebooks
Utils directory: /home/iceberg/notebooks/utils
Utils exists: True
Successfully imported utilities from: /home/iceberg/notebooks/utils


In [21]:
# 1) Wire up the s3 -> s3a mappings and MinIO creds on the JVM Hadoop conf
print("Spark:", spark.version)
print(spark._jsc.sc().listJars())
hconf = spark._jsc.hadoopConfiguration()

hconf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf.set("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")

hconf.set("fs.s3a.endpoint", "http://minio:9000")
hconf.set("fs.s3a.path.style.access", "true")
hconf.set("fs.s3a.access.key", "admin")
hconf.set("fs.s3a.secret.key", "password")
hconf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hconf.set("fs.s3a.connection.ssl.enabled", "false")

# 2) Sanity checks: these MUST NOT throw now
spark._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jvm.org.apache.hadoop.fs.FileSystem.get(
    spark._jvm.java.net.URI.create("s3://warehouse"),
    spark._jsc.hadoopConfiguration()
)
# spark._jvm.java.lang.Class.forName("org.apache.iceberg.actions.Actions")


Spark: 3.5.5
Vector()


JavaObject id=o160

In [22]:
for k, v in spark.sparkContext.getConf().getAll():
    if "catalog" in k:
        print(k, "=", v)

spark.sql.catalog.demo.s3.endpoint = http://minio:9000
spark.sql.catalogImplementation = in-memory
spark.sql.catalog.demo.warehouse = s3://warehouse/wh/
spark.sql.catalog.demo.io-impl = org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.uri = http://rest:8181
spark.sql.catalog.demo.type = rest
spark.sql.catalog.demo = org.apache.iceberg.spark.SparkCatalog
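
For reference, the catalog settings printed above can be captured in a small helper. This is only a sketch: the function name `rest_catalog_conf` is hypothetical, the keys and values are copied from the output above, and how `docker-compose.yml` actually passes them to Spark is not shown in this notebook.

```python
def rest_catalog_conf(name, uri, warehouse, s3_endpoint):
    """Return Spark conf entries for an Iceberg REST catalog on S3-compatible storage."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
    }


# Values taken from the catalog settings printed by the notebook
conf = rest_catalog_conf(
    "demo", "http://rest:8181", "s3://warehouse/wh/", "http://minio:9000"
)
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

Printing the entries as `--conf` flags is just one way to use the dictionary; the same pairs could also be set on a `SparkSession` builder.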


In [23]:
import sys, os
import datetime
print("cwd:", os.getcwd())
print("sys.path:", sys.path)


cwd: /home/iceberg/notebooks
sys.path: ['/home/iceberg/notebooks/utils', '/opt/spark/python/lib/py4j-0.10.9.7-src.zip', '/tmp/spark-11fe5469-f066-4d05-ae79-0816cc0abd95/userFiles-0152f7ec-3077-490b-94f5-14d18ec2e5a1', '/opt/spark/python', '/home/iceberg/notebooks', '/usr/local/lib/python310.zip', '/usr/local/lib/python3.10', '/usr/local/lib/python3.10/lib-dynload', '', '/usr/local/lib/python3.10/site-packages']


In [24]:
!curl -X DELETE http://rest:8181/v1/namespaces/default/tables/pii_data

{"error":{"message":"Table does not exist: default.pii_data","type":"NoSuchTableException","code":404,"stack":["org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist: default.pii_data","\tat org.apache.iceberg.rest.CatalogHandlers.dropTable(CatalogHandlers.java:310)","\tat org.apache.iceberg.rest.RESTCatalogAdapter.handleRequest(RESTCatalogAdapter.java:405)","\tat org.apache.iceberg.rest.RESTServerCatalogAdapter.handleRequest(RESTServerCatalogAdapter.java:42)","\tat org.apache.iceberg.rest.RESTCatalogAdapter.execute(RESTCatalogAdapter.java:628)","\tat org.apache.iceberg.rest.RESTCatalogAdapter.execute(RESTCatalogAdapter.java:609)","\tat org.apache.iceberg.rest.RESTCatalogServlet.execute(RESTCatalogServlet.java:108)","\tat org.apache.iceberg.rest.RESTCatalogServlet.doDelete(RESTCatalogServlet.java:84)","\tat jakarta.servlet.http.HttpServlet.service(HttpServlet.java:526)","\tat jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587)","\tat org.eclipse.jetty.serv

In [25]:
import re
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
import datetime

# All utility functions are already imported from import_utils.py

# Define the base path for our table for easy reuse 
table_base_path = "s3a://warehouse/default/pii_data"


In [26]:
import pandas as pd
import numpy as np

# All utility functions are already imported from import_utils.py


What each of these metadata file types tells you:

- `m*.avro` (manifests): each row is a data-file entry with partition info, per-column stats, and an entry status (EXISTING, ADDED, or DELETED). A manifest is tied to the snapshot that wrote it, which you can look up via `added_snapshot_id`.

- `snap-*.avro` (manifest lists): the index for a single snapshot; each row points to one `m*.avro` manifest used by that snapshot.

- `0000*-*.metadata.json` (table metadata versions): the table's high-level state over time, including which snapshots are still valid and which one is current.
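
As a rough illustration of the three file types, the following sketch sorts file names from a table's `metadata/` directory into those categories by name alone. The function `classify_metadata_file` and the example file names are hypothetical; real names contain UUIDs and snapshot IDs that are not reproduced here.

```python
def classify_metadata_file(path):
    """Classify an Iceberg metadata file by its name (a naming heuristic only)."""
    name = path.rsplit("/", 1)[-1]
    if name.endswith(".metadata.json"):
        return "table metadata"
    if name.startswith("snap-") and name.endswith(".avro"):
        return "manifest list"
    if name.endswith(".avro"):
        # Any other Avro file in metadata/ is assumed to be a manifest
        return "manifest"
    return "other"


# Hypothetical file names, for illustration only
for example in ["snap-1-1-abc.avro", "abc-m0.avro", "00001-abc.metadata.json"]:
    print(example, "->", classify_metadata_file(example))
```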

## 2. Create the Table


Next, we'll create an Iceberg table called `pii_data` in our `demo` catalog. The schema will include the PII columns we want to manage.


With the REST catalog, we need to create a namespace before we can create a table. We'll create a namespace called `default` inside our `demo` catalog.


In [27]:
# What Hadoop version is Spark using?
print("Hadoop version:", spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

# Can JVM see the S3AFileSystem class?
try:
    spark._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    print("S3AFileSystem is present.")
except Exception as e:
    print("S3AFileSystem NOT present:", e)


Hadoop version: 3.3.4
S3AFileSystem is present.


In [28]:
spark.sql("DROP TABLE IF EXISTS demo.default.pii_data;")
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.default")
df = spark.sql("SHOW TABLES IN demo.default;")
df.show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



In [29]:
# Check files after namespace creation (the table does not exist yet, so both summaries are expected to be unavailable)
_, _, all_previous = summarize_files(spark, table_base_path, "After Table Creation")


--- File Summary (After Table Creation) ---
Using table name for better reliability...
Metadata file summary unavailable: TABLE_OR_VIEW_NOT_FOUND] The table or view `demo`.`default`.`pii_data`.`snapshots` cannot be found. Verify the spelling and correctness of the schema and catalog.
Data file summary unavailable: TABLE_OR_VIEW_NOT_FOUND] The table or view `demo`.`default`.`pii_data`.`entries` cannot be found. Verify the spelling and correctness of the schema and catalog.


In [30]:
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.default.pii_data (
    case_id STRING,
    first_name STRING,
    email_address STRING,
    key_nm STRING,
    secure_txt STRING,
    secure_key STRING,
    update_date DATE
)
USING iceberg
""")


DataFrame[]

In [31]:
# Check files after table creation
_, _, all_previous = summarize_files(spark, table_base_path, "After Table Creation")
all_previous

--- File Summary (After Table Creation) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+--------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation           |
+--------+--------------------+-----------+-------------------+-------------+--------------------+--------------------+
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|1            |2025-09-10T18-36-40Z|After Table Creation|
+--------+--------------------+-----------+-------------------+-------------+--------------------+--------------------+


Data file summary:
+------+---------+-----------+--------------+-------------+------+---------+
|prefix|file_type|file_format|created_minute|files_created|run_id|operation|
+------+---------+-----------+--------------+-------------+------+---------+
+------+---------+-----------+-------

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,metadata_log_entries,json,2025-09-10 18:36:00,1,2025-09-10T18-36-40Z,After Table Creation


## 3. Seed the Table with Data


Now, let's insert some sample data into our table. We'll add two records, one of which we will target for PII deletion.


In [32]:
spark.sql("""
INSERT INTO demo.default.pii_data VALUES
('case-1', 'John', 'john.doe@example.com', 'key1', 'secret text 1', 'secret_key_1', DATE('2023-01-01')),
('case-2', 'Jane', 'jane.doe@example.com', 'key2', 'secret text 2', 'secret_key_2', DATE('2023-01-02'))
""")


DataFrame[]

Let's verify the data is there.

In [33]:
spark.table("demo.default.pii_data").show()

+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-1|      John|john.doe@example.com|  key1|secret text 1|secret_key_1| 2023-01-01|
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
+-------+----------+--------------------+------+-------------+------------+-----------+



In [34]:
initial_snapshots = spark.table("demo.default.pii_data.history")
initial_snapshots.show()


+--------------------+-------------------+---------+-------------------+
|     made_current_at|        snapshot_id|parent_id|is_current_ancestor|
+--------------------+-------------------+---------+-------------------+
|2025-09-10 18:36:...|7545001130093648888|     NULL|               true|
+--------------------+-------------------+---------+-------------------+



In [35]:
# Check files after data insertion
_, _, all_current = summarize_files(spark, table_base_path, "After Data Insertion")
all_current

--- File Summary (After Data Insertion) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+--------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation           |
+--------+--------------------+-----------+-------------------+-------------+--------------------+--------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-36-43Z|After Data Insertion|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|2            |2025-09-10T18-36-43Z|After Data Insertion|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-36-43Z|After Data Insertion|
+--------+--------------------+-----------+-------------------+-------------+--------------------+--------------------+


Data file summary:
+------+---------+----------

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,snapshots,avro,2025-09-10 18:36:00,1,2025-09-10T18-36-43Z,After Data Insertion
1,metadata,manifests,avro,2025-09-10 18:36:00,1,2025-09-10T18-36-43Z,After Data Insertion
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2,2025-09-10T18-36-43Z,After Data Insertion
3,data,data,parquet,2025-09-10 18:36:00,2,2025-09-10T18-36-43Z,After Data Insertion


In [36]:
# Compare
diff = diff_summaries(all_previous, all_current)
diff

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,0,2,2,ADDED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,0,1,1,ADDED
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,1,2,1,CHANGED
3,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,0,1,1,ADDED
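
The implementation of `diff_summaries` lives in the utilities and is not shown in this notebook. The sketch below is an assumption about how such a comparison could work on plain dictionaries of file counts; the function name `diff_counts` and the `REMOVED` status are hypothetical, while the other statuses and the counts are taken from the output above.

```python
def diff_counts(old, new):
    """Compare two {key: file_count} mappings and label each key's change."""
    rows = []
    for key in sorted(set(old) | set(new)):
        old_count = old.get(key, 0)
        new_count = new.get(key, 0)
        if old_count == new_count:
            status = "UNCHANGED"
        elif old_count == 0:
            status = "ADDED"
        elif new_count == 0:
            status = "REMOVED"  # assumed label; not seen in the notebook output
        else:
            status = "CHANGED"
        rows.append({
            "key": key,
            "old_count": old_count,
            "new_count": new_count,
            "delta": new_count - old_count,
            "status": status,
        })
    return rows


# Counts taken from the summaries before and after the insert
before = {("metadata", "metadata_log_entries"): 1}
after = {
    ("data", "data"): 2,
    ("metadata", "manifests"): 1,
    ("metadata", "metadata_log_entries"): 2,
    ("metadata", "snapshots"): 1,
}
for row in diff_counts(before, after):
    print(row)
```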


## 3.5. Create Orphaned Files (After Data Insertion)


Now let's create some orphaned files to demonstrate cleanup later. These files will exist in S3 but won't be tracked by Iceberg metadata, simulating a failed write operation.
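
Conceptually, an orphan is any file present in storage that no table metadata refers to. The sketch below shows that idea as a plain set difference; `find_orphans` is a hypothetical helper, and the paths are the ones this notebook produces in its file listings.

```python
def find_orphans(storage_files, referenced_files):
    """Return files that exist in storage but are not referenced by the table."""
    referenced = set(referenced_files)
    return sorted(path for path in storage_files if path not in referenced)


base = "s3://warehouse/default/pii_data/data/"
referenced = [
    base + "00000-22-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet",
    base + "00001-23-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet",
]
in_storage = referenced + [
    base + "ZZZ_orphan_test.parquet/_SUCCESS",
    base + "ZZZ_orphan_test.parquet/part-00000-423ce4a8-d427-4e58-aa04-155fd5bf4d5c-c000.snappy.parquet",
    base + "ZZZ_orphan_test.parquet/part-00001-423ce4a8-d427-4e58-aa04-155fd5bf4d5c-c000.snappy.parquet",
]

orphans = find_orphans(in_storage, referenced)
for path in orphans:
    print(path)
```

Iceberg's own cleanup also takes file age into account, which this simplified comparison leaves out.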


First, let's list the data files that currently exist under the table's `data/` prefix.


In [37]:
# 🔍 List files + modification dates
print("All files under data/ recursively:")
ls_s3_recursive(spark, "s3://warehouse/default/pii_data/data")

All files under data/ recursively:
2025-09-10 18:36:43  s3://warehouse/default/pii_data/data/00000-22-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet
2025-09-10 18:36:43  s3://warehouse/default/pii_data/data/00001-23-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet


In [38]:
df = spark.createDataFrame([("x",)], ["dummy"])
df.write.mode("overwrite").parquet("s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet")

In [39]:
# 🔍 List files + modification dates
print("All files under data/ recursively:")
ls_s3_recursive(spark, "s3://warehouse/default/pii_data/data")

All files under data/ recursively:
2025-09-10 18:36:43  s3://warehouse/default/pii_data/data/00000-22-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet
2025-09-10 18:36:43  s3://warehouse/default/pii_data/data/00001-23-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet/_SUCCESS
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet/part-00000-423ce4a8-d427-4e58-aa04-155fd5bf4d5c-c000.snappy.parquet
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet/part-00001-423ce4a8-d427-4e58-aa04-155fd5bf4d5c-c000.snappy.parquet


In [40]:
spark.table("demo.default.pii_data").show()


+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-1|      John|john.doe@example.com|  key1|secret text 1|secret_key_1| 2023-01-01|
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
+-------+----------+--------------------+------+-------------+------------+-----------+



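The table's history still contains only the single snapshot created by the insert (see cell In [34] above): writing Parquet files directly to S3 does not go through Iceberg, so no new snapshot is recorded.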


## 4. Delete PII


Now, we will "delete" the PII for `case-1`. In this context, "deletion" means updating the PII columns to `NULL`. This is a common strategy for retaining the record for referential integrity while removing the sensitive information.
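
The `delete_pii` utility is imported from the helpers, so its body is not visible at this point. As a minimal sketch of the same idea, the hypothetical function `build_pii_null_update` below builds an `UPDATE` statement that sets a list of columns to `NULL` for one case. The validation rules are an assumption added here because the values are interpolated into SQL text.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_CASE_ID = re.compile(r"[A-Za-z0-9_-]+")


def build_pii_null_update(table, case_id, pii_columns):
    """Build an UPDATE that nulls the given columns for a single case_id."""
    if not _CASE_ID.fullmatch(case_id):
        raise ValueError(f"Unexpected case_id: {case_id!r}")
    for column in pii_columns:
        if not _IDENTIFIER.fullmatch(column):
            raise ValueError(f"Unexpected column name: {column!r}")
    assignments = ", ".join(f"{column} = NULL" for column in pii_columns)
    return f"UPDATE {table} SET {assignments} WHERE case_id = '{case_id}'"


sql = build_pii_null_update(
    "demo.default.pii_data",
    "case-1",
    ["first_name", "email_address", "secure_txt"],
)
print(sql)
```

The resulting string could be passed to `spark.sql(...)`; the row for `case-1` stays in the table, only the listed columns are cleared.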


In [41]:
# All utility functions are already imported from import_utils.py

delete_pii(spark, 'case-1')


In [42]:
# Check files after PII deletion
print("=== After PII Deletion ===")
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After PII Deletion")

# Show the difference
print("\n=== File Summary Comparison (After PII Deletion) ===")
diff_pii = diff_summaries(all_previous, all_current)
diff_pii


=== After PII Deletion ===
--- File Summary (After PII Deletion) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation         |
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|3            |2025-09-10T18-36-45Z|After PII Deletion|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|3            |2025-09-10T18-36-45Z|After PII Deletion|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-45Z|After PII Deletion|
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------+


Data file summary:
+------+---------

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,2,3,1,CHANGED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,1,3,2,CHANGED
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,2,3,1,CHANGED
3,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,1,2,1,CHANGED


## 5. The Problem: Time Travel


Even though we've "deleted" the PII from the current view of the table, the old data still exists in the previous snapshot. Anyone with access can use time travel to see the PII.
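
To make the exposure concrete, the sketch below walks the `parent_id` chain of the history table and returns every older snapshot that is still reachable from the current one. `reachable_ancestors` is a hypothetical helper working on plain dictionaries; the snapshot IDs are the ones this notebook's history output reports.

```python
def reachable_ancestors(history, current_id):
    """Follow parent_id links from current_id and list ancestors still in history."""
    parents = {row["snapshot_id"]: row["parent_id"] for row in history}
    ancestors = []
    parent = parents.get(current_id)
    # Stop at the root (parent is None) or at a parent that has been expired
    while parent is not None and parent in parents:
        ancestors.append(parent)
        parent = parents.get(parent)
    return ancestors


history = [
    {"snapshot_id": 7545001130093648888, "parent_id": None},
    {"snapshot_id": 5013975016408165402, "parent_id": 7545001130093648888},
]
exposed = reachable_ancestors(history, 5013975016408165402)
print("Snapshots that can still be read via time travel:", exposed)
```

Any ID in that list can be passed as the `snapshot-id` read option, which is exactly the risk this section describes.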


In [43]:
# Show the current data: the PII columns for case-1 are now NULL
spark.table("demo.default.pii_data").show()

# Inspect the table's history to see both snapshots
initial_snapshots = spark.table("demo.default.pii_data.history")
initial_snapshots.show()

# Time travel back to the first snapshot to see the PII.
# Sort explicitly: the row order of the history table is not guaranteed.
first_snapshot_id = initial_snapshots.orderBy("made_current_at").select("snapshot_id").first()[0]
print(f"\nTime traveling back to snapshot {first_snapshot_id} to see the PII:")
spark.read.option("snapshot-id", first_snapshot_id).table("demo.default.pii_data").show()


+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
| case-1|      NULL|                NULL|  key1|         NULL|secret_key_1| 2023-01-01|
+-------+----------+--------------------+------+-------------+------------+-----------+

+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2025-09-10 18:36:...|7545001130093648888|               NULL|               true|
|2025-09-10 18:36:...|5013975016408165402|7545001130093648888|               true|
+--------------------+-------------------+--------------

## 6. Expire Snapshots (Time Travel Cleanup)


To permanently remove the PII, we need to expire old snapshots. This removes them from the table's metadata, making time travel to those versions impossible, and deletes the data files that are no longer referenced by any remaining snapshot.
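
The sketch below builds an `expire_snapshots` call using the procedure's named arguments (`table`, `older_than`, `retain_last`), which can be easier to read than positional ones. The helper name `expire_snapshots_sql` and the example timestamp are illustrative assumptions.

```python
from datetime import datetime


def expire_snapshots_sql(catalog, table, older_than, retain_last=1):
    """Build a CALL statement for Iceberg's expire_snapshots procedure."""
    timestamp = older_than.strftime("%Y-%m-%d %H:%M:%S.%f")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{timestamp}', "
        f"retain_last => {int(retain_last)})"
    )


# Illustrative cutoff; in the notebook this would be the current timestamp
sql = expire_snapshots_sql("demo", "default.pii_data", datetime(2025, 9, 10, 18, 36, 47))
print(sql)
```

With `retain_last => 1`, the most recent snapshot is kept even when the cutoff is "now", so the table stays readable after the call.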


In [44]:
# Check files before expiring snapshots
print("=== Before Expiring Snapshots ===")
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "Before Expiring Snapshots")

# Expire all snapshots older than the current timestamp.
# Iceberg always keeps the current snapshot (retain_last defaults to 1).
now = spark.sql("SELECT current_timestamp()").collect()[0][0]
print(f"Expiring snapshots older than: {now}")

spark.sql(f"CALL demo.system.expire_snapshots(table => 'default.pii_data', older_than => TIMESTAMP '{now}')")

# Check files after expiring snapshots
print("\n=== After Expiring Snapshots ===")
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After Expiring Snapshots")

# Show the difference
print("\n=== File Summary Comparison (After Expiring Snapshots) ===")
diff_expire = diff_summaries(all_previous, all_current)
diff_expire


=== Before Expiring Snapshots ===
--- File Summary (Before Expiring Snapshots) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                |
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|3            |2025-09-10T18-36-47Z|Before Expiring Snapshots|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|3            |2025-09-10T18-36-47Z|Before Expiring Snapshots|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-47Z|Before Expiring Snapshots|
+--------+--------------------+-----------+-------------------+-------------+--------------------+--

                                                                                


=== After Expiring Snapshots ===
--- File Summary (After Expiring Snapshots) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation               |
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-49Z|After Expiring Snapshots|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|4            |2025-09-10T18-36-49Z|After Expiring Snapshots|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-36-49Z|After Expiring Snapshots|
+--------+--------------------+-----------+-------------------+-------------+--------------------+---------

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,3,2,-1,CHANGED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,3,2,-1,CHANGED
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,3,4,1,CHANGED
3,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,2,1,-1,CHANGED


## 7. Orphaned Files Cleanup


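Now let's demonstrate that the orphaned files we created earlier still exist, and then clean them up. These files exist in S3 but are not referenced by Iceberg metadata, so in this demo they can be safely removed. On a live table, files from in-progress writes look the same, which is why `remove_orphan_files` only considers files older than three days by default.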
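
Before deleting anything, it can be useful to preview what would be removed. The sketch below builds a `remove_orphan_files` call with the procedure's `dry_run` and `older_than` arguments; the helper name `remove_orphan_files_sql` is hypothetical, and whether a given Iceberg version accepts a very recent cutoff is something to verify in your environment.

```python
from datetime import datetime


def remove_orphan_files_sql(catalog, table, older_than=None, dry_run=True):
    """Build a CALL statement for Iceberg's remove_orphan_files procedure."""
    args = [f"table => '{table}'"]
    if older_than is not None:
        timestamp = older_than.strftime("%Y-%m-%d %H:%M:%S")
        args.append(f"older_than => TIMESTAMP '{timestamp}'")
    # dry_run => true lists candidate files without deleting them
    args.append("dry_run => " + ("true" if dry_run else "false"))
    return f"CALL {catalog}.system.remove_orphan_files(" + ", ".join(args) + ")"


preview_sql = remove_orphan_files_sql("demo", "default.pii_data")
print(preview_sql)
```

Running the preview first and the real call second keeps the destructive step explicit.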


In [45]:
# 🔍 List files + modification dates
print("All files under data/ recursively:")
ls_s3_recursive(spark, "s3://warehouse/default/pii_data/data")

All files under data/ recursively:
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/00000-58-5b319bd6-f4a9-425f-a060-16e95f36f6da-0-00001.parquet
2025-09-10 18:36:43  s3://warehouse/default/pii_data/data/00001-23-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet/_SUCCESS
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet/part-00000-423ce4a8-d427-4e58-aa04-155fd5bf4d5c-c000.snappy.parquet
2025-09-10 18:36:45  s3://warehouse/default/pii_data/data/ZZZ_orphan_test.parquet/part-00001-423ce4a8-d427-4e58-aa04-155fd5bf4d5c-c000.snappy.parquet


In [46]:
import datetime

# All utility functions are already imported from import_utils.py


In [47]:
cleanup_orphan_files(spark, "demo.default.pii_data", method="action", cutoff="immediate")

Running Iceberg DeleteOrphanFiles Action …
✓ Orphaned files cleanup (Action) completed for demo.default.pii_data


JavaObject id=o739

In [48]:
spark.sql("SELECT file_path FROM demo.default.pii_data.files").show(truncate=False)

+--------------------------------------------------------------------------------------------------+
|file_path                                                                                         |
+--------------------------------------------------------------------------------------------------+
|s3://warehouse/default/pii_data/data/00000-58-5b319bd6-f4a9-425f-a060-16e95f36f6da-0-00001.parquet|
|s3://warehouse/default/pii_data/data/00001-23-13d4bdd3-e848-41fd-847d-7a767e44108c-0-00001.parquet|
+--------------------------------------------------------------------------------------------------+



In [49]:
# Verify the cleanup worked
print("\n=== After Orphaned Files Cleanup ===")
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After Orphaned Files Cleanup")

# Show the difference
print("\n=== File Summary Comparison (After Orphaned Files Cleanup) ===")
diff_cleanup = diff_summaries(all_previous, all_current)
print(diff_cleanup)  # a bare expression is only rendered when it is the last line of a cell

print("\n🎯 Key Points about remove_orphan_files:")
print("   • Scans the actual storage (S3) and compares against Iceberg metadata")
print("   • Removes files that exist in storage but are NOT referenced by any snapshot")
print("   • Different from expire_snapshots (which removes old snapshots and the files only they reference)")
print("   • Essential for cleaning up failed writes, manual copies, or direct S3 operations")



=== After Orphaned Files Cleanup ===
--- File Summary (After Orphaned Files Cleanup) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                   |
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-51Z|After Orphaned Files Cleanup|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|4            |2025-09-10T18-36-51Z|After Orphaned Files Cleanup|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-36-51Z|After Orphaned Files Cleanup|
+--------+--------------------+-----------+-------------------+------------

## 8. VACUUM Operation (Rewrite Data Files)


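Even though the expired snapshots are gone, it is worth confirming that no Parquet files containing the PII remain in S3. The `rewrite_data_files` procedure (called VACUUM in this notebook, although it is a compaction step rather than a file purge) consolidates data into new files and commits a new snapshot. It does not delete the replaced files itself: they stay in storage until the snapshot that references them is expired. In this run the comparison below reports no change, most likely because the table holds too few files for the default settings to trigger a rewrite.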
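
The file comparison for this step reports every count as unchanged, which suggests the rewrite did not touch anything. One plausible reason is that the default strategy only rewrites a group of files once it holds at least `min-input-files` files (5 by default in recent Iceberg releases, to the best of our knowledge). The sketch below builds a call that passes an `options` map to override this; the helper name `rewrite_data_files_sql` is hypothetical.

```python
def rewrite_data_files_sql(catalog, table, options=None):
    """Build a CALL statement for Iceberg's rewrite_data_files procedure."""
    args = [f"table => '{table}'"]
    if options:
        # The procedure takes options as a SQL map of string keys and values
        pairs = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
        args.append(f"options => map({pairs})")
    return f"CALL {catalog}.system.rewrite_data_files(" + ", ".join(args) + ")"


sql = rewrite_data_files_sql("demo", "default.pii_data", {"rewrite-all": "true"})
print(sql)
```

Note that a forced rewrite creates a new snapshot, so the replaced files only disappear after another `expire_snapshots` run.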


In [50]:
# Check files before VACUUM
print("=== Before VACUUM (Rewrite Data Files) ===")
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "Before VACUUM")

# Run VACUUM (rewrite_data_files)
print("\n=== Running VACUUM (Rewrite Data Files) ===")
result = spark.sql("CALL demo.system.rewrite_data_files('default.pii_data')")
print("✓ VACUUM completed")
result.show()

# Check files after VACUUM
print("\n=== After VACUUM (Rewrite Data Files) ===")
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After VACUUM")

# Show the difference
print("\n=== File Summary Comparison (After VACUUM) ===")
diff_vacuum = diff_summaries(all_previous, all_current)
diff_vacuum


=== Before VACUUM (Rewrite Data Files) ===
--- File Summary (Before VACUUM) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation    |
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-52Z|Before VACUUM|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|4            |2025-09-10T18-36-52Z|Before VACUUM|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-36-52Z|Before VACUUM|
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------+


Data file summary:
+------+---------+-----------+-----------

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,2,2,0,UNCHANGED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,2,2,0,UNCHANGED
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,4,4,0,UNCHANGED
3,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,1,1,0,UNCHANGED


## 9. Validation


Now, let's try to time travel back to the first snapshot. This should fail because the snapshot no longer exists.
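
The checks in this section can also be written as small assertions over collected rows. The sketch below uses two hypothetical helpers, `pii_is_cleared` and `snapshot_is_gone`, on plain Python data; the row values and snapshot IDs are copied from this notebook's outputs.

```python
def pii_is_cleared(rows, case_id, pii_columns):
    """True if the case exists and all of its PII columns are None."""
    matches = [row for row in rows if row["case_id"] == case_id]
    return bool(matches) and all(
        row[column] is None for row in matches for column in pii_columns
    )


def snapshot_is_gone(history_ids, snapshot_id):
    """True if the snapshot no longer appears in the table history."""
    return snapshot_id not in set(history_ids)


current_rows = [
    {"case_id": "case-2", "first_name": "Jane",
     "email_address": "jane.doe@example.com", "secure_txt": "secret text 2"},
    {"case_id": "case-1", "first_name": None,
     "email_address": None, "secure_txt": None},
]
pii_columns = ["first_name", "email_address", "secure_txt"]

print(pii_is_cleared(current_rows, "case-1", pii_columns))
print(snapshot_is_gone([5013975016408165402], 7545001130093648888))
```

In the notebook, `current_rows` could come from `spark.table(...).collect()` converted with `Row.asDict()`.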


In [51]:
# Try to time travel back to the first snapshot - this should fail
try:
    spark.read.option("snapshot-id", first_snapshot_id).table("demo.default.pii_data").show()
except Exception as e:
    print("✅ Successfully prevented time travel!")
    print(f"Error: {e}")

# Verify the current data (PII should be gone)
print("\n=== Current Data (PII Should Be Gone) ===")
spark.table("demo.default.pii_data").show()

# Check the table history (should only have one snapshot)
print("\n=== Table History (Should Only Have One Snapshot) ===")
spark.table("demo.default.pii_data.history").show()

print("\n🎉 PII has been permanently deleted!")
print("   • Time travel to old snapshots is no longer possible")
print("   • Orphaned files have been cleaned up")
print("   • Data files referenced only by expired snapshots have been removed")


✅ Successfully prevented time travel!
Error: Cannot find snapshot with ID 7545001130093648888

=== Current Data (PII Should Be Gone) ===
+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
| case-1|      NULL|                NULL|  key1|         NULL|secret_key_1| 2023-01-01|
+-------+----------+--------------------+------+-------------+------------+-----------+


=== Table History (Should Only Have One Snapshot) ===
+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2025-09-10 18:36:...|50139750

In [52]:
# Now let's use remove_orphan_files to clean up the orphaned files
print("\n=== Cleaning Up Orphaned Files ===")
print("Running remove_orphan_files to scan storage and remove unreferenced files...")

try:
    # Remove orphaned files - this scans the storage and compares against table metadata
    result = spark.sql("CALL demo.system.remove_orphan_files('default.pii_data')")
    print("✓ Orphaned files cleanup completed")
    result.show()
except Exception as e:
    print(f"Error during orphaned files cleanup: {e}")



=== Cleaning Up Orphaned Files ===
Running remove_orphan_files to scan storage and remove unreferenced files...
✓ Orphaned files cleanup completed
+--------------------+
|orphan_file_location|
+--------------------+
+--------------------+



In [53]:
# Verify the cleanup worked
print("\n=== After Orphaned Files Cleanup ===")
_, _, all_after_cleanup = summarize_files(spark, table_base_path, "After Orphaned Files Cleanup")

# Show the difference
print("\n=== File Summary Comparison (After Cleanup) ===")
diff_cleanup = diff_summaries(all_current, all_after_cleanup)
print(diff_cleanup)  # a bare expression is only rendered when it is the last line of a cell

print("\n🎯 Key Points about remove_orphan_files:")
print("   • Scans the actual storage (S3) and compares against Iceberg metadata")
print("   • Removes files that exist in storage but are NOT referenced by any snapshot")
print("   • Different from expire_snapshots (which removes old snapshots and the files only they reference)")
print("   • Essential for cleaning up failed writes, manual copies, or direct S3 operations")



=== After Orphaned Files Cleanup ===
--- File Summary (After Orphaned Files Cleanup) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                   |
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-53Z|After Orphaned Files Cleanup|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|4            |2025-09-10T18-36-53Z|After Orphaned Files Cleanup|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-36-53Z|After Orphaned Files Cleanup|
+--------+--------------------+-----------+-------------------+------------

In [54]:
initial_snapshots = spark.table("demo.default.pii_data.history")
initial_snapshots.show()


+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2025-09-10 18:36:...|5013975016408165402|7545001130093648888|               true|
+--------------------+-------------------+-------------------+-------------------+



## 4. Delete PII


Now, we will "delete" the PII for `case-1`. In this context, "deletion" means updating the PII columns to `NULL`. This is a common strategy for retaining the record for referential integrity while removing the sensitive information.


In [55]:
def delete_pii(case_id):
    spark.sql(f"""
    UPDATE demo.default.pii_data
    SET
        first_name = NULL,
        email_address = NULL,
        secure_txt = NULL
    WHERE case_id = '{case_id}'
    """)

delete_pii('case-1')
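
Because `delete_pii` interpolates `case_id` straight into the SQL text, a value containing a quote would change the statement. A minimal sketch of one way to guard against that, assuming case IDs only ever contain letters, digits, hyphens, and underscores (that pattern is an assumption, as is the helper name `build_pii_update`):

```python
import re

# Assumed format for case IDs; adjust to whatever the real IDs look like.
_CASE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def build_pii_update(case_id: str, table: str = "demo.default.pii_data") -> str:
    """Return the UPDATE statement that nulls the PII columns for one case."""
    # Reject anything that does not look like a plain identifier before it
    # gets anywhere near the SQL text.
    if not _CASE_ID_PATTERN.fullmatch(case_id):
        raise ValueError(f"Unexpected case_id format: {case_id!r}")
    return (
        f"UPDATE {table} "
        "SET first_name = NULL, email_address = NULL, secure_txt = NULL "
        f"WHERE case_id = '{case_id}'"
    )


print(build_pii_update("case-1"))
```

In the notebook this would be used as `spark.sql(build_pii_update("case-1"))`.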


In [56]:
# Check files after data deletion (updates)
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After Data Deletion")
all_current

--- File Summary (After Data Deletion) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation          |
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|5            |2025-09-10T18-36-54Z|After Data Deletion|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|5            |2025-09-10T18-36-54Z|After Data Deletion|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-54Z|After Data Deletion|
+--------+--------------------+-----------+-------------------+-------------+--------------------+-------------------+


Data file summary:
+------+---------+-----------+------

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,snapshots,avro,2025-09-10 18:36:00,2,2025-09-10T18-36-54Z,After Data Deletion
1,metadata,manifests,avro,2025-09-10 18:36:00,5,2025-09-10T18-36-54Z,After Data Deletion
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,5,2025-09-10T18-36-54Z,After Data Deletion
3,data,data,parquet,2025-09-10 18:36:00,3,2025-09-10T18-36-54Z,After Data Deletion


## 7. Orphaned Files Cleanup


Sometimes files can become "orphaned" - they exist in storage but are not referenced by Iceberg metadata. This can happen due to:

- **Partial/failed writes** that left files behind
- **Manual file copies** to the wrong location
- **Failed operations** that created files but didn't update metadata
- **Direct S3 operations** that bypassed Iceberg

In our demo, we already created orphaned files during the initial data load (see the `simulate_failed_write()` function above). Let's now demonstrate how to clean them up.


In [57]:
# Check the current state - note that orphaned files don't appear in Iceberg metadata
print("=== Current State (Including Orphaned Files) ===")
print("Note: Orphaned files don't appear in the Iceberg metadata summary")
print("because they were written directly to S3, bypassing Iceberg's metadata tracking.")
_, _, all_current = summarize_files(spark, table_base_path, "Current State with Orphaned Files")


=== Current State (Including Orphaned Files) ===
Note: Orphaned files don't appear in the Iceberg metadata summary
because they were written directly to S3, bypassing Iceberg's metadata tracking.
--- File Summary (Current State with Orphaned Files) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+---------------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                        |
+--------+--------------------+-----------+-------------------+-------------+--------------------+---------------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|5            |2025-09-10T18-36-55Z|Current State with Orphaned Files|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|5            |2025-09-10T18-36-55Z|Current State with Orphaned Files|
|metadata|snaps

In [58]:
# Now let's use remove_orphan_files to clean up the orphaned files
print("\n=== Cleaning Up Orphaned Files ===")
print("Running remove_orphan_files to scan storage and remove unreferenced files...")

try:
    # Remove orphaned files - this scans the storage and compares against table metadata
    result = spark.sql("CALL demo.system.remove_orphan_files('default.pii_data')")
    print("✓ Orphaned files cleanup completed")
    result.show()
except Exception as e:
    print(f"Error during orphaned files cleanup: {e}")



=== Cleaning Up Orphaned Files ===
Running remove_orphan_files to scan storage and remove unreferenced files...
✓ Orphaned files cleanup completed
+--------------------+
|orphan_file_location|
+--------------------+
+--------------------+
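
The empty result is what one would expect when the orphaned files are only minutes old: according to the Iceberg documentation, `remove_orphan_files` only considers files older than `older_than`, which defaults to three days ago. The sketch below builds the call with an explicit cutoff and `dry_run`, so candidates can be listed before anything is deleted. The helper name `orphan_cleanup_sql` and the cutoff are illustrative assumptions, and recent Iceberg releases may reject cutoffs less than 24 hours old when called through the SQL procedure.

```python
from datetime import datetime, timedelta, timezone


def orphan_cleanup_sql(table: str, older_than: datetime, dry_run: bool = True,
                       catalog: str = "demo") -> str:
    """Build a remove_orphan_files call using named arguments."""
    # The TIMESTAMP literal is interpreted in the Spark session time zone;
    # this sketch assumes the session runs in UTC.
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"dry_run => {str(dry_run).lower()})"
    )


# Illustrative cutoff: slightly more than one day ago.
cutoff = datetime.now(timezone.utc) - timedelta(days=1, minutes=5)
print(orphan_cleanup_sql("default.pii_data", cutoff))
```

With `dry_run => true` the procedure is expected to list matching files without deleting them. In a fresh demo run, files written a few minutes ago will not match this cutoff either.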



In [59]:
# Verify the cleanup worked
print("\n=== After Orphaned Files Cleanup ===")
_, _, all_after_cleanup = summarize_files(spark, table_base_path, "After Orphaned Files Cleanup")

print("\n🎯 Key Points about remove_orphan_files:")
print("   • Scans the actual storage (S3) and compares against Iceberg metadata")
print("   • Removes files that exist in storage but are NOT referenced by any snapshot")
print("   • Different from expire_snapshots (which removes files referenced only by expired snapshots)")
print("   • Essential for cleaning up failed writes, manual copies, or direct S3 operations")



=== After Orphaned Files Cleanup ===
--- File Summary (After Orphaned Files Cleanup) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                   |
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|5            |2025-09-10T18-36-56Z|After Orphaned Files Cleanup|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|5            |2025-09-10T18-36-56Z|After Orphaned Files Cleanup|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-56Z|After Orphaned Files Cleanup|
+--------+--------------------+-----------+-------------------+------------

In [60]:
# Compare
diff = diff_summaries(all_previous, all_current)
diff

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,2,3,1,CHANGED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,2,5,3,CHANGED
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,4,5,1,CHANGED
3,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,1,2,1,CHANGED


The next cells repeat the exercise end to end: they attempt to create orphaned files explicitly by writing to S3 outside of Iceberg, and then run the cleanup again.


In [61]:
# First, let's check the current state before creating orphaned files
print("=== Before Creating Orphaned Files ===")
_, _, all_before_orphans = summarize_files(spark, table_base_path, "Before Creating Orphaned Files")


=== Before Creating Orphaned Files ===
--- File Summary (Before Creating Orphaned Files) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                     |
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|5            |2025-09-10T18-36-56Z|Before Creating Orphaned Files|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|5            |2025-09-10T18-36-56Z|Before Creating Orphaned Files|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-56Z|Before Creating Orphaned Files|
+--------+--------------------+-----------+-----------------

In [62]:
# Create some orphaned files by writing directly to S3 (bypassing Iceberg)
# This simulates what might happen with failed writes or manual operations

create_orphaned_files(spark)


Error creating orphaned files: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.


In [63]:
# Check the state after creating orphaned files
# Note: The orphaned files won't show up in our Iceberg metadata summary
# because they're not tracked by Iceberg
print("\n=== After Creating Orphaned Files ===")
_, _, all_after_orphans = summarize_files(spark, table_base_path, "After Creating Orphaned Files")

print("\n📝 Note: The orphaned files don't appear in the Iceberg metadata summary")
print("   because they were written directly to S3, bypassing Iceberg's metadata tracking.")



=== After Creating Orphaned Files ===
--- File Summary (After Creating Orphaned Files) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+-----------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                    |
+--------+--------------------+-----------+-------------------+-------------+--------------------+-----------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|5            |2025-09-10T18-36-57Z|After Creating Orphaned Files|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|5            |2025-09-10T18-36-57Z|After Creating Orphaned Files|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-57Z|After Creating Orphaned Files|
+--------+--------------------+-----------+-------------------+----

In [64]:
# Now let's use remove_orphan_files to clean up the orphaned files
print("\n=== Cleaning Up Orphaned Files ===")
print("Running remove_orphan_files to scan storage and remove unreferenced files...")

try:
    # Remove orphaned files - this scans the storage and compares against table metadata
    result = spark.sql("CALL demo.system.remove_orphan_files('default.pii_data')")
    print("✓ Orphaned files cleanup completed")
    result.show()
except Exception as e:
    print(f"Error during orphaned files cleanup: {e}")



=== Cleaning Up Orphaned Files ===
Running remove_orphan_files to scan storage and remove unreferenced files...
✓ Orphaned files cleanup completed
+--------------------+
|orphan_file_location|
+--------------------+
+--------------------+



In [65]:
# Verify the cleanup worked
print("\n=== After Orphaned Files Cleanup ===")
_, _, all_after_cleanup = summarize_files(spark, table_base_path, "After Orphaned Files Cleanup")

print("\n🎯 Key Points about remove_orphan_files:")
print("   • Scans the actual storage (S3) and compares against Iceberg metadata")
print("   • Removes files that exist in storage but are NOT referenced by any snapshot")
print("   • Different from expire_snapshots (which removes files referenced only by expired snapshots)")
print("   • Essential for cleaning up failed writes, manual copies, or direct S3 operations")



=== After Orphaned Files Cleanup ===
--- File Summary (After Orphaned Files Cleanup) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation                   |
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------------+
|metadata|manifests           |avro       |2025-09-10 18:36:00|5            |2025-09-10T18-36-58Z|After Orphaned Files Cleanup|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|5            |2025-09-10T18-36-58Z|After Orphaned Files Cleanup|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-36-58Z|After Orphaned Files Cleanup|
+--------+--------------------+-----------+-------------------+------------

Let's check the data again. We should see that the PII for `case-1` is now gone.


In [66]:
spark.table("demo.default.pii_data").show()


+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
| case-1|      NULL|                NULL|  key1|         NULL|secret_key_1| 2023-01-01|
+-------+----------+--------------------+------+-------------+------------+-----------+



If we look at the table history, we'll see a new snapshot has been added.


In [67]:
spark.table("demo.default.pii_data.history").show()


+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2025-09-10 18:36:...|5013975016408165402|7545001130093648888|               true|
|2025-09-10 18:36:...|3743604877359090994|5013975016408165402|               true|
+--------------------+-------------------+-------------------+-------------------+



## 5. The Problem: Time Travel


Even though we've "deleted" the PII from the current view of the table, the old data still exists in the previous snapshot. Anyone with access can use time travel to see the PII.


In [68]:
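# The history DataFrame is lazy and is re-read here, so sort explicitly to pick the oldest snapshot
first_snapshot_id = initial_snapshots.orderBy("made_current_at").select("snapshot_id").first()[0]
spark.read.option("snapshot-id", first_snapshot_id).table("demo.default.pii_data").show()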


+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-1|      NULL|                NULL|  key1|         NULL|secret_key_1| 2023-01-01|
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
+-------+----------+--------------------+------+-------------+------------+-----------+



## 6. Permanent Deletion with Maintenance


To permanently remove the PII, we need to perform two maintenance operations:
1.  **Expire Snapshots**: This removes old snapshots from the table's metadata, making time travel to those versions impossible, and deletes the data files that only those snapshots referenced.
2.  **Rewrite Data Files (VACUUM)**: This compacts the live data into new files. The files it replaces become unreferenced only after the snapshot that preceded the rewrite has itself been expired.


### Expire Old Snapshots


We'll expire all snapshots that are older than the current one. We can get the current timestamp and use that to expire anything older.


In [69]:
all_current

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,snapshots,avro,2025-09-10 18:36:00,2,2025-09-10T18-36-55Z,Current State with Orphaned Files
1,metadata,manifests,avro,2025-09-10 18:36:00,5,2025-09-10T18-36-55Z,Current State with Orphaned Files
2,metadata,metadata_log_entries,json,2025-09-10 18:36:00,5,2025-09-10T18-36-55Z,Current State with Orphaned Files
3,data,data,parquet,2025-09-10 18:36:00,3,2025-09-10T18-36-55Z,Current State with Orphaned Files


In [70]:
# Expire every snapshot older than "now"; Iceberg always keeps the current snapshot
now = spark.sql("SELECT current_timestamp()").collect()[0][0]
print(now)
spark.sql(f"CALL demo.system.expire_snapshots('default.pii_data', TIMESTAMP '{now}')")


2025-09-10 18:36:58.916577


DataFrame[deleted_data_files_count: bigint, deleted_position_delete_files_count: bigint, deleted_equality_delete_files_count: bigint, deleted_manifest_files_count: bigint, deleted_manifest_lists_count: bigint, deleted_statistics_files_count: bigint]
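
The call above returns a DataFrame of deletion counts that the cell never displays. Below is a sketch of the same call written with named arguments, assuming the `older_than` and `retain_last` parameters documented for Iceberg's `expire_snapshots` procedure. The helper name `expire_snapshots_sql` is hypothetical.

```python
from datetime import datetime


def expire_snapshots_sql(table: str, older_than: datetime, retain_last: int = 1,
                         catalog: str = "demo") -> str:
    """Build an expire_snapshots call using named arguments."""
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    # Keep microseconds so the cutoff matches the timestamp printed by Spark.
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.%f")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )


print(expire_snapshots_sql("default.pii_data",
                           datetime(2025, 9, 10, 18, 36, 58, 916577)))
```

Calling `spark.sql(...).show()` on the resulting statement would print the counts (`deleted_data_files_count` and the other columns listed above) instead of only the DataFrame schema.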

Now, if we look at the history, we should only see the most recent snapshot.


In [71]:
spark.table("demo.default.pii_data.history").show()


+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2025-09-10 18:36:...|3743604877359090994|5013975016408165402|               true|
+--------------------+-------------------+-------------------+-------------------+



In [72]:
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After Expire Snapshots")
all_current

--- File Summary (After Expire Snapshots) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation             |
+--------+--------------------+-----------+-------------------+-------------+--------------------+----------------------+
|metadata|manifests           |avro       |NULL               |1            |2025-09-10T18-37-00Z|After Expire Snapshots|
|metadata|manifests           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-37-00Z|After Expire Snapshots|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|6            |2025-09-10T18-37-00Z|After Expire Snapshots|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-37-00Z|After Expire Snapshots|
+--------+--------------------+--

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,snapshots,avro,2025-09-10 18:36:00,1,2025-09-10T18-37-00Z,After Expire Snapshots
1,metadata,manifests,avro,NaT,1,2025-09-10T18-37-00Z,After Expire Snapshots
2,metadata,manifests,avro,2025-09-10 18:36:00,2,2025-09-10T18-37-00Z,After Expire Snapshots
3,metadata,metadata_log_entries,json,2025-09-10 18:36:00,6,2025-09-10T18-37-00Z,After Expire Snapshots
4,data,data,parquet,2025-09-10 18:36:00,2,2025-09-10T18-37-00Z,After Expire Snapshots


In [73]:
# Compare
diff = diff_summaries(all_previous, all_current)
diff

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,3,2,-1,CHANGED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,5,2,-3,CHANGED
2,metadata,manifests,avro,NaT,,0,1,1,ADDED
3,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,5,6,1,CHANGED
4,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,2,1,-1,CHANGED


### Rewrite Data Files (VACUUM)


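Expiring the snapshots already deleted the data file that only the expired snapshot referenced (the data file count dropped from 3 to 2 in the comparison above). As a further step, the `rewrite_data_files` procedure (loosely comparable to VACUUM in other systems) consolidates live data into new files. The files it replaces are physically removed only once the snapshots that still reference them have been expired.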


In [74]:
all_current

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,snapshots,avro,2025-09-10 18:36:00,1,2025-09-10T18-37-00Z,After Expire Snapshots
1,metadata,manifests,avro,NaT,1,2025-09-10T18-37-00Z,After Expire Snapshots
2,metadata,manifests,avro,2025-09-10 18:36:00,2,2025-09-10T18-37-00Z,After Expire Snapshots
3,metadata,metadata_log_entries,json,2025-09-10 18:36:00,6,2025-09-10T18-37-00Z,After Expire Snapshots
4,data,data,parquet,2025-09-10 18:36:00,2,2025-09-10T18-37-00Z,After Expire Snapshots


In [75]:
spark.sql("CALL demo.system.rewrite_data_files('default.pii_data')")


DataFrame[rewritten_data_files_count: int, added_data_files_count: int, rewritten_bytes_count: bigint, failed_data_files_count: int]

In [76]:
all_previous = all_current.copy(deep=True)
_, _, all_current = summarize_files(spark, table_base_path, "After Rewrite Data Files")
all_current


--- File Summary (After Rewrite Data Files) ---
Using table name for better reliability...

Metadata file summary:
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------------+
|prefix  |file_type           |file_format|created_minute     |files_created|run_id              |operation               |
+--------+--------------------+-----------+-------------------+-------------+--------------------+------------------------+
|metadata|manifests           |avro       |NULL               |1            |2025-09-10T18-37-01Z|After Rewrite Data Files|
|metadata|manifests           |avro       |2025-09-10 18:36:00|2            |2025-09-10T18-37-01Z|After Rewrite Data Files|
|metadata|metadata_log_entries|json       |2025-09-10 18:36:00|6            |2025-09-10T18-37-01Z|After Rewrite Data Files|
|metadata|snapshots           |avro       |2025-09-10 18:36:00|1            |2025-09-10T18-37-01Z|After Rewrite Data Files|
+--------+-------

Unnamed: 0,prefix,file_type,file_format,created_minute,files_created,run_id,operation
0,metadata,snapshots,avro,2025-09-10 18:36:00,1,2025-09-10T18-37-01Z,After Rewrite Data Files
1,metadata,manifests,avro,NaT,1,2025-09-10T18-37-01Z,After Rewrite Data Files
2,metadata,manifests,avro,2025-09-10 18:36:00,2,2025-09-10T18-37-01Z,After Rewrite Data Files
3,metadata,metadata_log_entries,json,2025-09-10 18:36:00,6,2025-09-10T18-37-01Z,After Rewrite Data Files
4,data,data,parquet,2025-09-10 18:36:00,2,2025-09-10T18-37-01Z,After Rewrite Data Files


In [77]:
# Compare
diff = diff_summaries(all_previous, all_current)
diff

Unnamed: 0,prefix,file_type,file_format,created_minute,minute_str,old_count,new_count,delta,status
0,data,data,parquet,2025-09-10 18:36:00,2025-09-10 18:36:00,2,2,0,UNCHANGED
1,metadata,manifests,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,2,2,0,UNCHANGED
2,metadata,manifests,avro,NaT,,1,1,0,UNCHANGED
3,metadata,metadata_log_entries,json,2025-09-10 18:36:00,2025-09-10 18:36:00,6,6,0,UNCHANGED
4,metadata,snapshots,avro,2025-09-10 18:36:00,2025-09-10 18:36:00,1,1,0,UNCHANGED
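
Every row is `UNCHANGED`, which suggests the rewrite found nothing it considered worth compacting in this small table. One possible way to force the issue is sketched below. It assumes the `rewrite-all` option of `rewrite_data_files`, followed by another `expire_snapshots` so that the pre-rewrite snapshot and the files only it references are dropped. The helper name and the cutoff timestamp are illustrative.

```python
from datetime import datetime


def forced_rewrite_statements(table: str, expire_before: datetime,
                              catalog: str = "demo") -> list:
    """Return the two CALL statements, in the order they should be run."""
    # expire_before must be later than the commit time of the rewrite,
    # otherwise the pre-rewrite snapshot is not old enough to be expired.
    ts = expire_before.strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map('rewrite-all', 'true'))",
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{ts}', retain_last => 1)",
    ]


for stmt in forced_rewrite_statements("default.pii_data",
                                      datetime(2025, 9, 10, 18, 40)):
    print(stmt)
```

Each statement would be passed to `spark.sql(...)` in turn. Comparing the file summaries before and after is a reasonable way to check whether the data file names actually changed.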


## 8. Validation


Now, let's try to time travel back to the first snapshot. This should fail because the snapshot no longer exists.


In [78]:
try:
    spark.read.option("snapshot-id", first_snapshot_id).table("demo.default.pii_data").show()
except Exception as e:
    print("Successfully prevented time travel!")
    print(e)


Successfully prevented time travel!
Cannot find snapshot with ID 5013975016408165402


## 9. Manual Parquet File Reader


This section adds a utility for uploading a Parquet file from your local computer and displaying its contents. It is useful for inspecting individual data files: for example, you can download a file from the MinIO bucket and confirm that the PII has been physically removed from it.


In [79]:
# Install necessary libraries for the uploader widget and Parquet reader
%pip install ipywidgets pandas pyarrow


[notice] A new release of pip is available: 23.0.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [80]:
import ipywidgets as widgets
import pandas as pd
import io
from IPython.display import display, clear_output

# Create a file upload widget
uploader = widgets.FileUpload(
    accept='.parquet',
    description='Upload Parquet File',
    multiple=False
)

# Create an output widget to display the DataFrame
output = widgets.Output()

def on_upload_change(change):
    """
    This function is triggered when a file is uploaded.
    It reads the Parquet file and displays it as a Pandas DataFrame.
    """
    if not uploader.value:
        return
        
    # Get the uploaded file info
    uploaded_file = uploader.value[0]
    file_content = uploaded_file['content']
    
    # Read the Parquet file content into a Pandas DataFrame
    df = pd.read_parquet(io.BytesIO(file_content))
    
    # Display the DataFrame in the output widget
    with output:
        clear_output(wait=True)
        print(f"Contents of {uploaded_file['name']}:")
        display(df)
        
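    # Reset the uploader so the same file can be uploaded again if needed.
    # In ipywidgets 8 the value is an immutable tuple, so assign an empty one;
    # the resulting change event is ignored by the early return above.
    uploader.value = ()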


# Observe changes in the uploader's value
uploader.observe(on_upload_change, names='value')

# Display the uploader and the output area
display(uploader, output)


FileUpload(value=(), accept='.parquet', description='Upload Parquet File')

Output()
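
Once a file has been loaded into a DataFrame, the check itself can be automated rather than done by eye. A minimal sketch, assuming the rows are available as dictionaries (for example via pandas' `df.to_dict("records")`) and that the PII columns are the three nulled by `delete_pii`. The function name `find_pii_leaks` is hypothetical.

```python
PII_COLUMNS = ("first_name", "email_address", "secure_txt")


def find_pii_leaks(rows, case_id, pii_columns=PII_COLUMNS):
    """Return (column, value) pairs where PII is still present for case_id."""
    leaks = []
    for row in rows:
        if row.get("case_id") != case_id:
            continue
        for col in pii_columns:
            value = row.get(col)
            # pandas may represent missing values as None or NaN;
            # NaN is the only value that is not equal to itself.
            if value is not None and value == value:
                leaks.append((col, value))
    return leaks


# Rows mirroring the table contents shown earlier in this notebook.
rows = [
    {"case_id": "case-1", "first_name": None, "email_address": None,
     "secure_txt": None},
    {"case_id": "case-2", "first_name": "Jane",
     "email_address": "jane.doe@example.com", "secure_txt": "secret text 2"},
]
print(find_pii_leaks(rows, "case-1"))
```

An empty list for `case-1` means none of the PII columns hold a value for that case in the inspected file.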

Together with the failed time travel above, inspecting the remaining data files this way confirms that the PII has been permanently deleted from our Iceberg table.


In [81]:
result = spark.sql("SELECT * FROM demo.default.pii_data TIMESTAMP AS OF '2026-09-02 09:40:00'")
result.show()

+-------+----------+--------------------+------+-------------+------------+-----------+
|case_id|first_name|       email_address|key_nm|   secure_txt|  secure_key|update_date|
+-------+----------+--------------------+------+-------------+------------+-----------+
| case-1|      NULL|                NULL|  key1|         NULL|secret_key_1| 2023-01-01|
| case-2|      Jane|jane.doe@example.com|  key2|secret text 2|secret_key_2| 2023-01-02|
+-------+----------+--------------------+------+-------------+------------+-----------+
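
The timestamp in this query lies after every snapshot in the table, so it resolves to the current snapshot rather than to a historical one, and the result matches the live table. Conceptually, `TIMESTAMP AS OF` selects the most recent snapshot committed at or before the given time. Below is a sketch of that rule in plain Python; the commit time is illustrative, since the real one is truncated in the history output.

```python
from datetime import datetime


def snapshot_as_of(history, ts):
    """Return the ID of the latest snapshot committed at or before ts.

    history is a list of (made_current_at, snapshot_id) pairs. This sketch
    returns None when no snapshot qualifies; a real engine is expected to
    raise an error in that case instead.
    """
    eligible = [(made_at, sid) for made_at, sid in history if made_at <= ts]
    if not eligible:
        return None
    # Tuples sort by commit time first, so max() picks the newest snapshot.
    return max(eligible)[1]


# The one snapshot left after expiration, with an assumed commit time.
history = [(datetime(2025, 9, 10, 18, 36, 54), 3743604877359090994)]

print(snapshot_as_of(history, datetime(2026, 9, 2, 9, 40)))
print(snapshot_as_of(history, datetime(2025, 9, 1)))
```

The first call resolves to the surviving snapshot, while the second finds nothing because the requested time precedes every commit.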

