# Iceberg Lab 
## Unit 6: Time Travel

In the previous unit, we:
1. Explored the metadata inspection tables that Iceberg provides


In this unit, we will:
1. Explore the time travel feature of Iceberg tables

### 1. Imports

In [1]:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark

24/05/13 17:04:41 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DPMS_NAME=f"dll-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]
print("HIVE_METASTORE_WAREHOUSE_DIR",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse


In [7]:
TABLE_NAME="loans_by_state_iceberg"
DB_NAME="loan_db"
#fully qualified table name
FQTN=f"{DB_NAME}.{TABLE_NAME}"
print("Fully qualified table name:", FQTN)

Fully qualified table name: loan_db.loans_by_state_iceberg


### 4. Time Travel

#### a. Time Travel with Snapshots

In [8]:
spark.sql(f"select committed_at, snapshot_id, operation from {FQTN}.snapshots").show(truncate=False)

+-----------------------+-------------------+---------+
|committed_at           |snapshot_id        |operation|
+-----------------------+-------------------+---------+
|2024-05-13 16:40:13.336|9176687385630465169|append   |
|2024-05-13 16:48:45.421|5233765456160638845|overwrite|
|2024-05-13 16:50:19.235|7272046516454809414|append   |
|2024-05-13 16:51:36.457|6697245229602575300|overwrite|
|2024-05-13 16:52:33.693|1795098683837972174|overwrite|
|2024-05-13 16:54:50.592|4282559450411521188|overwrite|
+-----------------------+-------------------+---------+



                                                                                

**Note: Replace the snapshot IDs in the statements below with values from your own output of the query above.**
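One way to avoid copying snapshot IDs by hand is to collect them from the `snapshots` metadata table and index them in commit order. The sketch below uses sample values copied from the query output above so that it runs on its own. The helper name `snapshot_ids_in_commit_order` is hypothetical, and the commented-out lines only suggest how the pairs might be collected in this notebook.

```python
def snapshot_ids_in_commit_order(rows):
    """Return snapshot IDs ordered by commit time.

    `rows` is an iterable of (committed_at, snapshot_id) pairs.
    """
    return [snapshot_id for _, snapshot_id in sorted(rows)]


# In the notebook, the pairs could be collected like this
# (assumes the `spark` session and `FQTN` variable defined earlier):
# rows = [(r.committed_at, r.snapshot_id)
#         for r in spark.sql(
#             f"select committed_at, snapshot_id from {FQTN}.snapshots").collect()]

# Sample values copied from the query output above.
rows = [
    ("2024-05-13 16:40:13.336", 9176687385630465169),
    ("2024-05-13 16:48:45.421", 5233765456160638845),
    ("2024-05-13 16:50:19.235", 7272046516454809414),
    ("2024-05-13 16:51:36.457", 6697245229602575300),
    ("2024-05-13 16:52:33.693", 1795098683837972174),
    ("2024-05-13 16:54:50.592", 4282559450411521188),
]

snapshot_ids = snapshot_ids_in_commit_order(rows)
second_snapshot_id = snapshot_ids[1]  # table state after the second commit
third_snapshot_id = snapshot_ids[2]   # table state after the third commit
print(second_snapshot_id, third_snapshot_id)
```

The resulting values could then be passed to `option("snapshot-id", ...)` in place of the hard-coded IDs.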

In [9]:
# Read the table as of a specific snapshot using the "snapshot-id" read option
print("Table state at snapshot-id '5233765456160638845'")
(spark.read.option("snapshot-id", "5233765456160638845").format("iceberg").load(FQTN)
    .filter(col('addr_state').isin('IA', 'AZ', 'CA', 'IN')).show(truncate=False))

print("Table state at snapshot-id '7272046516454809414'")
(spark.read.option("snapshot-id", "7272046516454809414").format("iceberg").load(FQTN)
    .filter(col('addr_state').isin('IA', 'AZ', 'CA', 'IN')).show(truncate=False))

# Without the option, the read returns the current (latest) snapshot
print("Table state at latest snapshot")
(spark.read.format("iceberg").load(FQTN)
    .filter(col('addr_state').isin('IA', 'AZ', 'CA', 'IN')).show(truncate=False))

Table state at snapshot-id '5233765456160638845'


                                                                                

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
+----------+----------+

Table state at snapshot-id '7272046516454809414'


                                                                                

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
|AZ        |50000     |
+----------+----------+

Table state at latest snapshot
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|AZ        |11111     |
|CA        |11111     |
|IA        |11111     |
|IN        |11111     |
+----------+----------+



#### b. Time Travel with Timestamps

In [10]:
# List every snapshot change recorded in the table's history
spark.table(f"{FQTN}.history").show(truncate=False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2024-05-13 16:40:13.336|9176687385630465169|null               |true               |
|2024-05-13 16:48:45.421|5233765456160638845|9176687385630465169|true               |
|2024-05-13 16:50:19.235|7272046516454809414|5233765456160638845|true               |
|2024-05-13 16:51:36.457|6697245229602575300|7272046516454809414|true               |
|2024-05-13 16:52:33.693|1795098683837972174|6697245229602575300|true               |
|2024-05-13 16:54:50.592|4282559450411521188|1795098683837972174|true               |
+-----------------------+-------------------+-------------------+-------------------+



**Note: Replace the timestamp values for _'dt1'_ and _'dt2'_ in the statements below with values from your own output of the query above.**
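The `as-of-timestamp` read option used in the next cell expects milliseconds since the Unix epoch, so the time zone in which a timestamp string is interpreted matters. The following is a minimal sketch of a conversion that makes the time zone explicit. The helper name `to_epoch_millis` is hypothetical, and defaulting to UTC is an assumption: use whichever zone the `history` timestamps are displayed in for your session.

```python
from datetime import datetime, timezone


def to_epoch_millis(ts, fmt="%Y-%m-%d %H:%M:%S", tz=timezone.utc):
    """Convert a timestamp string to epoch milliseconds in an explicit time zone.

    The UTC default is an assumption, not something the lab prescribes.
    """
    parsed = datetime.strptime(ts, fmt).replace(tzinfo=tz)
    return int(parsed.timestamp() * 1000)


dt1_millis_utc = to_epoch_millis("2024-05-13 16:48:45")
print(dt1_millis_utc)
```

Attaching the time zone before calling `.timestamp()` keeps the result independent of the driver machine's local settings.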

In [11]:

dt_fmt = "%Y-%m-%d %H:%M:%S"

# The "as-of-timestamp" read option takes milliseconds since the Unix epoch.
# .timestamp() interprets the naive datetime in the driver's local time zone.
dt1 = '2024-05-13 16:48:45'
dt1_millis = int(datetime.strptime(dt1, dt_fmt).timestamp()) * 1000
print("Table state at timestamp ",dt1)
(spark.read.option("as-of-timestamp", dt1_millis).format("iceberg").load(FQTN)
    .filter(col('addr_state').isin('IA', 'AZ', 'CA', 'IN')).show(truncate=False))

dt2 = '2024-05-13 16:50:19'
dt2_millis = int(datetime.strptime(dt2, dt_fmt).timestamp()) * 1000
print("Table state at timestamp ",dt2)
(spark.read.option("as-of-timestamp", dt2_millis).format("iceberg").load(FQTN)
    .filter(col('addr_state').isin('IA', 'AZ', 'CA', 'IN')).show(truncate=False))

# Without the option, the read returns the current (latest) snapshot
print("Table state at latest timestamp")
(spark.read.format("iceberg").load(FQTN)
    .filter(col('addr_state').isin('IA', 'AZ', 'CA', 'IN')).show(truncate=False))


Table state at timestamp  2024-05-13 16:48:45
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|AZ        |10318     |
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
+----------+----------+

Table state at timestamp  2024-05-13 16:50:19
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
+----------+----------+

Table state at latest timestamp
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|AZ        |11111     |
|CA        |11111     |
|IA        |11111     |
|IN        |11111     |
+----------+----------+
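
The results above may look surprising at first: the read as of `16:48:45` still shows `AZ`, even though the snapshot committed at `16:48:45.421` does not. A timestamp read resolves to the snapshot that was current at that instant, and `16:48:45.000` falls just before that commit. The sketch below mimics this resolution rule in plain Python using the `history` values shown earlier. It is an illustration of the rule as described here, not Iceberg's own code, and the helper name `snapshot_as_of` is hypothetical.

```python
from datetime import datetime

# (made_current_at, snapshot_id) pairs copied from the history output above.
HISTORY = [
    ("2024-05-13 16:40:13.336", 9176687385630465169),
    ("2024-05-13 16:48:45.421", 5233765456160638845),
    ("2024-05-13 16:50:19.235", 7272046516454809414),
    ("2024-05-13 16:51:36.457", 6697245229602575300),
    ("2024-05-13 16:52:33.693", 1795098683837972174),
    ("2024-05-13 16:54:50.592", 4282559450411521188),
]


def snapshot_as_of(history, ts):
    """Return the ID of the last snapshot made current at or before `ts`."""
    as_of = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    candidates = [
        (made_current_at, snapshot_id)
        for made_current_at, snapshot_id in (
            (datetime.strptime(t, "%Y-%m-%d %H:%M:%S.%f"), sid) for t, sid in history
        )
        if made_current_at <= as_of
    ]
    if not candidates:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return max(candidates)[1]


print(snapshot_as_of(HISTORY, "2024-05-13 16:48:45"))
print(snapshot_as_of(HISTORY, "2024-05-13 16:50:19"))
```

Under this rule, `dt1` resolves to the first snapshot and `dt2` to the second, which matches the row sets shown above.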



In [14]:
spark.sql("SELECT * FROM loan_db.loans_by_state_iceberg TIMESTAMP AS OF '2024-05-13 16:52:33'").show(truncate=False)

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|AZ        |11111     |
|SC        |5460      |
|LA        |5284      |
|MN        |8031      |
|NJ        |16367     |
|DC        |1059      |
|OR        |5258      |
|VA        |12775     |
|RI        |1968      |
|KY        |4287      |
|WY        |964       |
|NH        |2148      |
|MI        |11638     |
|NV        |6309      |
|WI        |5798      |
|ID        |522       |
|CA        |62090     |
|CT        |6767      |
|NE        |1299      |
|MT        |1220      |
+----------+----------+
only showing top 20 rows



In [15]:
spark.sql("SELECT * FROM loan_db.loans_by_state_iceberg VERSION AS OF '1795098683837972174'").show(truncate=False)

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|SC        |5460      |
|LA        |5284      |
|MN        |8031      |
|NJ        |16367     |
|DC        |1059      |
|OR        |5258      |
|VA        |12775     |
|RI        |1968      |
|KY        |4287      |
|WY        |964       |
|NH        |2148      |
|MI        |11638     |
|NV        |6309      |
|WI        |5798      |
|ID        |522       |
|CT        |6767      |
|NE        |1299      |
|MT        |1220      |
|NC        |12612     |
|VT        |931       |
+----------+----------+
only showing top 20 rows
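
The two SQL statements above hard-code the table name and the time-travel target. A small helper can build them from variables instead; the sketch below is one possible shape, with the hypothetical name `time_travel_query`. It emits the snapshot ID unquoted, which is the form used in the Iceberg Spark documentation; the quoted form used above also worked in this environment, as the output shows.

```python
def time_travel_query(table, *, snapshot_id=None, timestamp=None):
    """Build a SELECT that reads `table` at a snapshot ID or at a timestamp.

    Exactly one of `snapshot_id` and `timestamp` must be given.
    """
    if (snapshot_id is None) == (timestamp is None):
        raise ValueError("pass exactly one of snapshot_id or timestamp")
    if snapshot_id is not None:
        # int() guards against accidentally interpolating arbitrary text.
        return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


table = "loan_db.loans_by_state_iceberg"
print(time_travel_query(table, timestamp="2024-05-13 16:52:33"))
print(time_travel_query(table, snapshot_id=1795098683837972174))
```

In the notebook, the returned string could be passed to `spark.sql(...)`, with `FQTN` supplied as the table name.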



### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK