To keep metadata size and storage costs under control, Iceberg provides snapshot lifecycle management procedures such as expire_snapshots, along with other table maintenance procedures such as remove_orphan_files and rewrite_data_files.
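
A hard-coded cutoff timestamp goes stale quickly, so one option is to derive it from a retention window. The sketch below only formats the SQL text; `build_expire_snapshots_call` and its parameters `keep_days`, `retain_last` and `now` are hypothetical names introduced for illustration and are not part of Iceberg or PySpark. It assumes the `table`, `older_than` and `retain_last` arguments of the expire_snapshots procedure.

```python
from datetime import datetime, timedelta


def build_expire_snapshots_call(table, keep_days, retain_last=1, now=None):
    """Format a CALL statement that expires snapshots older than `keep_days` days.

    Hypothetical helper: it builds the SQL string and does not talk to Spark.
    """
    if keep_days < 0 or retain_last < 1:
        raise ValueError("keep_days must be >= 0 and retain_last must be >= 1")
    # `now` can be injected so that the result is reproducible.
    now = now or datetime.now()
    cutoff = (now - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "CALL system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


statement = build_expire_snapshots_call(
    "default.titanic",
    keep_days=7,
    retain_last=5,
    now=datetime(2024, 7, 29, 11, 12, 59),
)
print(statement)
```

With a Spark session configured for Iceberg, as in the cell that follows, the resulting string could be passed to spark.sql(statement).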

In [15]:
from pyspark.sql import SparkSession

# Set the absolute paths to the Iceberg tables and JAR files
iceberg_tables_path = "/Users/france.cama/code/iceberg-practice/iceberg_tables"
iceberg_jars_path = "/Users/france.cama/code/iceberg-practice/jars/iceberg-spark-runtime-3.5_2.12-1.5.1.jar"

# Create a Spark session
spark = SparkSession.builder \
    .appName("Iceberg table management") \
    .config("spark.driver.extraJavaOptions", "-Dderby.system.home=" + iceberg_tables_path) \
    .config("spark.jars", iceberg_jars_path) \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.sql.catalog.spark_catalog.warehouse", iceberg_tables_path) \
    .getOrCreate()

spark.sql("SELECT * FROM default.titanic.history;").show()

print("Branches of the table. If none has been created: just the main branch is printed which points "
      "at the selected version/snapshot of the table")
spark.sql("SELECT * FROM default.titanic.refs;").show()

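# Drop the tag left over from a previous run so that the CREATE TAG below succeeds.
# IF EXISTS avoids an error on the first run, when the tag has not been created yet.
spark.sql("ALTER TABLE default.titanic DROP TAG IF EXISTS EOY_2024")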


# The basic way to get rid of old snapshots (and so control metadata size and storage costs) is the 'expire_snapshots' procedure.
# It also deletes data files that are no longer referenced by any remaining snapshot; snapshots still referenced by a branch or tag are kept.
spark.sql(
    "CALL system.expire_snapshots("
    "table => 'default.titanic', "
    "older_than => TIMESTAMP '2024-07-22 11:12:59.000')"
)
print("Done!")

# A more fine-grained way to manage the snapshot lifecycle is to use BRANCHES and TAGS, which expire after their specified age.
# BRANCHES are independent lineages of snapshots; each branch points to the head of its lineage.
# TAGS are named references to single snapshots and can be used to retain important historical snapshots, e.g. for auditing.
# Branching and tagging can be used for handling GDPR requirements.
# Branches can also be part of data engineering workflows, e.g. experimental branches for testing and validating new jobs.

# -- Create an end-of-year tag and retain it for 30 days (omitting the RETAIN clause keeps the tag indefinitely by default).
spark.sql("ALTER TABLE default.titanic CREATE TAG EOY_2024 RETAIN 30 DAYS")

spark.sql("SELECT * FROM default.titanic VERSION AS OF 'EOY_2024'").show(3)

# -- Create a branch that is retained for 10 years (3650 days) and keeps at least 10 snapshots, e.g. one snapshot per year (GDPR use case).
# spark.sql("ALTER TABLE default.titanic CREATE BRANCH yearly_snapshots_10y RETAIN 3650 DAYS WITH SNAPSHOT RETENTION 10 SNAPSHOTS")

spark.sql("SELECT * FROM default.titanic.refs;").show()


spark.sql(
    "CALL system.expire_snapshots("
    "table => 'default.titanic', "
    "retain_last => 10)"
)
print("retained last 10 snapshots (except for branches and tags)")

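# List the orphan files, i.e. files in the table location that no table metadata references.
# With dry_run => true the files are only listed, not deleted.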
spark.sql("CALL system.remove_orphan_files(table => 'default.titanic', dry_run => true)").show()

# Data file compaction: a large number of small data files hurts performance for two main reasons: metadata lookup and disk retrieval.
# When many small data files exist, they can be compacted into fewer, larger files with the following procedure.
# Note that different strategies (e.g. 'binpack' or 'sort') can be used to rewrite data files.
spark.sql("CALL system.rewrite_data_files(table => 'default.titanic',  strategy => 'sort', sort_order => 'zorder(PassengerId)')").show()


+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2024-07-22 12:23:...|6609739537041086161|1184812391625634863|               true|
+--------------------+-------------------+-------------------+-------------------+

Branches of the table. If none has been created: just the main branch is printed which points at the selected version/snapshot of the table
+--------------------+------+-------------------+-----------------------+---------------------+----------------------+
|                name|  type|        snapshot_id|max_reference_age_in_ms|min_snapshots_to_keep|max_snapshot_age_in_ms|
+--------------------+------+-------------------+-----------------------+---------------------+----------------------+
|                main|BRANCH|6609739537041086161|                   NULL|              