Apache Iceberg provides versioning and time travel capabilities, allowing users to query data as it existed at a specific point in time. This feature can be extremely useful for debugging, auditing, and historical analysis.

Time travel in Iceberg allows users to access historical snapshots of their table. A snapshot is created each time the table is modified, for example when data is added or deleted, and each snapshot is assigned a unique identifier. Each snapshot is a consistent and complete view of the table at a given point in time.
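
To make the timestamp lookup concrete, here is a minimal sketch in plain Python. It is not Iceberg's API: the `Snapshot` class, the `snapshot_as_of` helper, and the small IDs and timestamps are hypothetical, chosen only to illustrate that a timestamp query resolves to the most recent snapshot committed at or before the requested time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one entry of a table's snapshot history."""
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int


def snapshot_as_of(snapshots: List[Snapshot], as_of_ms: int) -> Snapshot:
    """Return the latest snapshot committed at or before as_of_ms."""
    candidates = [s for s in snapshots if s.timestamp_ms <= as_of_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(candidates, key=lambda s: s.timestamp_ms)


# Illustrative history (made-up values): three commits at t=1000, 2000, 3000 ms.
history = [
    Snapshot(snapshot_id=1, parent_id=None, timestamp_ms=1000),
    Snapshot(snapshot_id=2, parent_id=1, timestamp_ms=2000),
    Snapshot(snapshot_id=3, parent_id=2, timestamp_ms=3000),
]

resolved = snapshot_as_of(history, 2500)
print(resolved.snapshot_id)  # prints 2
```

A query for a time earlier than the first commit has nothing to resolve to, which is why the sketch raises an error in that case.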

In [12]:
from pyspark.sql import SparkSession

# Set the absolute paths to the Iceberg tables and JAR files
iceberg_tables_path = "/Users/france.cama/code/iceberg-practice/iceberg_tables"
iceberg_jars_path = "/Users/france.cama/code/iceberg-practice/jars/iceberg-spark-runtime-3.5_2.12-1.5.1.jar"

# Create a Spark session
spark = SparkSession.builder \
    .appName("Iceberg time travel feature") \
    .config("spark.driver.extraJavaOptions", "-Dderby.system.home=" + iceberg_tables_path) \
    .config("spark.jars", iceberg_jars_path) \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.sql.catalog.spark_catalog.warehouse", iceberg_tables_path) \
    .getOrCreate()


current_catalog = spark.catalog.currentCatalog()
print(f"Current Catalog: {current_catalog}")
# Show the table's snapshot history, newest first
spark.sql("SELECT * FROM default.titanic.history ORDER BY made_current_at DESC;").show()

# Time travel by timestamp: reads the snapshot that was current at that time
df = spark.sql("SELECT * FROM default.titanic TIMESTAMP AS OF '2024-07-22T11:10:00.289+00:00';")
df.show(5)

# Time travel by snapshot ID
df_id = spark.sql("SELECT * FROM default.titanic VERSION AS OF 4619981039604578990;")
df_id.show(5)


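### ROLL BACK ###
# Roll back an Iceberg table to a previous snapshot using either a timestamp or a snapshot ID.
# In Spark this is done with the stored procedures rollback_to_snapshot and rollback_to_timestamp.
# Some engines offer a ROLLBACK TABLE command instead (the syntax and privilege rule below appear to
# come from Dremio, not Spark): users with the ADMIN role, the table's owner, and users with INSERT,
# UPDATE, or DELETE privileges on the table can use it.
# ROLLBACK TABLE default.titanic TO { SNAPSHOT '6609739537041086161'};
# NOTE: CALL is only parsed when IcebergSparkSessionExtensions is loaded. spark.sql.extensions is a
# static setting, so it is ignored when getOrCreate() returns an already-running session.
SNAPSHOT = 6609739537041086161
spark.sql(f"CALL spark_catalog.system.rollback_to_snapshot('default.titanic', {SNAPSHOT})")

print("TITANIC ROLLED BACK")
spark.sql("SELECT * FROM default.titanic;").show(5)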


24/07/22 11:49:51 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Current Catalog: spark_catalog
+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2024-07-22 11:47:...|6609739537041086161|1184812391625634863|               true|
|2024-07-22 11:11:...|1759033321787871743| 813399347516847428|              false|
|2024-07-22 11:11:...| 813399347516847428|6609739537041086161|              false|
|2024-07-22 11:11:...|6609739537041086161|1184812391625634863|               true|
|2024-07-22 11:10:...|1184812391625634863|4619981039604578990|               true|
|2024-07-22 11:09:...|4619981039604578990|               NULL|               true|
+--------------------+-------------------+-------------------+-------------------+

+----+-----+--------+-------+--------------------+-----+-----------+------+------+-----+--------+----------------+-----+-----------+------

ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'CALL'.(line 1, pos 0)

== SQL ==
CALL spark_catalog.system.rollback_to_snapshot('default.titanic', 6609739537041086161)
^^^


## Time travel query lifecycle and metadata

To serve a time travel query, Iceberg first looks up the requested snapshot in the `snapshots` list of the table metadata file (a JSON file). Each snapshot object carries properties such as `snapshot-id`, `parent-snapshot-id`, `timestamp-ms`, and `manifest-list`. The `manifest-list` property points to the snapshot's manifest list file, which holds, among other information, the location of each manifest file (in the `manifest_path` field). The manifest files in turn provide column-level statistics and, more importantly, the paths of the data files, so the engine reads only the files that hold the data the query is interested in.
In conclusion, by leveraging table metadata and going straight to the relevant files, Iceberg retrieves old data while still offering good performance.
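
The lookup chain described above can be sketched in plain Python. This is a simplified model, not Iceberg's implementation: real metadata lives in a JSON file plus Avro manifest lists and manifests, and real column bounds are keyed by field ID and stored in binary form. The dictionaries, file names, the `age` column, and the `plan_files` helper are all hypothetical.

```python
# Stand-in for the table metadata JSON file (only the fields used here).
TABLE_METADATA = {
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1000, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "parent-snapshot-id": 1, "timestamp-ms": 2000,
         "manifest-list": "snap-2.avro"},
    ],
}

# Stand-in for manifest list files: each entry points to one manifest file.
MANIFEST_LISTS = {
    "snap-1.avro": [{"manifest_path": "m1.avro"}],
    "snap-2.avro": [{"manifest_path": "m1.avro"}, {"manifest_path": "m2.avro"}],
}

# Stand-in for manifest files: each entry describes one data file and its stats.
MANIFESTS = {
    "m1.avro": [{"file_path": "data/a.parquet",
                 "lower_bounds": {"age": 1}, "upper_bounds": {"age": 30}}],
    "m2.avro": [{"file_path": "data/b.parquet",
                 "lower_bounds": {"age": 31}, "upper_bounds": {"age": 80}}],
}


def plan_files(snapshot_id, column=None, value=None):
    """Return data file paths for a snapshot, optionally pruned by column bounds."""
    # Step 1: find the snapshot object in the table metadata.
    snapshot = next(s for s in TABLE_METADATA["snapshots"]
                    if s["snapshot-id"] == snapshot_id)
    paths = []
    # Step 2: read the manifest list to find the manifest files.
    for entry in MANIFEST_LISTS[snapshot["manifest-list"]]:
        # Step 3: read each manifest to find data files and their statistics.
        for data_file in MANIFESTS[entry["manifest_path"]]:
            if column is not None:
                low = data_file["lower_bounds"][column]
                high = data_file["upper_bounds"][column]
                if not low <= value <= high:
                    continue  # the value cannot be in this file, so skip it
            paths.append(data_file["file_path"])
    return paths


print(plan_files(1))             # ['data/a.parquet']
print(plan_files(2))             # ['data/a.parquet', 'data/b.parquet']
print(plan_files(2, "age", 50))  # ['data/b.parquet']
```

Note that the older snapshot reuses `m1.avro` rather than copying it, which is one reason keeping many snapshots stays cheap in this model.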
