# Delta Lake Lab 
## Unit 6: Time Travel

In the previous unit, we:
1. Learned how to change the schema of tables that already contain data, and reviewed the impact on files in the data lake and the transaction log

In this unit, we will:
1. Study Delta Lake's time travel support

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import DeltaTable

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

22/10/22 23:39:19 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  885979867746


In [6]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
print(DELTA_LAKE_DIR_ROOT)

gs://dll-data-bucket-885979867746/delta-consumable


In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-0d0c8ea0-982f-4f67-ab9d-62f94e57db11-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-27993337-bc5b-4c93-9ab0-b77f48ac9160-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-293e9d10-a628-4cf0-b86c-f9f289913756-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-33f34593-184c-40b8-adfe-73facf9f043f-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-595b5ba1-408f-404d-91ee-7bc396235870-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-5c57d030-c1f7-4f7b-b5d8-2100c4426482-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-670d672a-c41d-4391-a47e-90d45e589ac2-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-8572a416-efe4-47bf-ab8b-755973ae5a7a-c000.snappy.par

### 4. History

In [8]:
spark.sql("DESCRIBE HISTORY loan_db.loans_by_state_delta").select("version","timestamp","operation","operationParameters").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

+-------+-----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------+
|version|timestamp              |operation|operationParameters                                                                                                                            |
+-------+-----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------+
|11     |2022-10-22 23:38:01.798|WRITE    |{mode -> Append, partitionBy -> []}                                                                                                            |
|10     |2022-10-22 23:37:53.837|WRITE    |{mode -> Append, partitionBy -> []}                                                                                                            |
|9      |2022-10-22 23:37:47.779|WRITE    |{mode -> Append, 
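
Time travel can also be addressed by timestamp instead of version number. As far as we understand it, a timestamp resolves to the latest version committed at or before that instant. The sketch below mimics that rule in plain Python, using only the three commit timestamps visible in the history output above. `version_as_of` is a hypothetical helper, not part of the Delta Lake API, and real Delta Lake may raise an error for an out-of-range timestamp rather than return `None`.

```python
from datetime import datetime

# (version, commit timestamp) pairs copied from the DESCRIBE HISTORY output above
history = [
    (11, datetime(2022, 10, 22, 23, 38, 1, 798000)),
    (10, datetime(2022, 10, 22, 23, 37, 53, 837000)),
    (9, datetime(2022, 10, 22, 23, 37, 47, 779000)),
]


def version_as_of(history, ts):
    """Return the latest version committed at or before ts, or None if there is none."""
    candidates = [version for version, committed in history if committed <= ts]
    return max(candidates) if candidates else None


# 23:37:55 falls between the commits of version 10 and version 11
print(version_as_of(history, datetime(2022, 10, 22, 23, 37, 55)))  # 10
```
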

### 5. Let's look at a few versions

In [9]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta VERSION AS OF 1 where addr_state='IA'").show()

                                                                                

+----------+-----+
|addr_state|count|
+----------+-----+
+----------+-----+
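
The same query shape works for any version in the history. Below is a minimal sketch of a helper that builds the query string, so that several versions can be compared in a loop. `time_travel_query` is a hypothetical name introduced for illustration, and the `TIMESTAMP AS OF` branch assumes the installed Delta Lake release accepts that SQL clause alongside `VERSION AS OF`.

```python
def time_travel_query(table, version=None, timestamp=None, where=None):
    """Build a SELECT that reads `table` as of a version or a timestamp."""
    # Exactly one of version / timestamp must be supplied
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        clause = f"VERSION AS OF {int(version)}"
    else:
        clause = f"TIMESTAMP AS OF '{timestamp}'"
    query = f"SELECT * FROM {table} {clause}"
    if where:
        query += f" WHERE {where}"
    return query


query = time_travel_query(
    "loan_db.loans_by_state_delta", version=1, where="addr_state='IA'"
)
print(query)
# In the notebook, the string would be passed on: spark.sql(query).show()
```

The helper interpolates its arguments directly into the SQL text, so it is only suitable for trusted, hand-written inputs such as the ones in this lab.
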



                                                                                

In [10]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta VERSION AS OF 10 where addr_state='IA'").show()

                                                                                

+----------+-----+------------------+
|addr_state|count|  collateral_value|
+----------+-----+------------------+
|        IA|   26| 264367.4436053467|
|        IA|  349|3490455.5807275963|
|        IA|   32| 326185.5741464369|
|        IA|  164|1649795.5660065983|
|        IA|   97| 977271.2566990124|
|        IA|   79| 793382.1492048983|
|        IA|  149|1494744.6633344109|
|        IA|    7| 79748.79175558798|
|        IA|  452| 4525824.841281812|
|        IA|  186|1866039.3042891505|
|        IA|  489| 4898122.920044726|
|        IA|   32| 326185.5741464369|
|        IA|   67| 674781.2495198158|
|        IA|   12|127582.34045748881|
|        IA|   10| 107591.4553824375|
|        IA|  132|1327091.5650634842|
|        IA|  410|4107695.0403600573|
|        IA|   33| 337041.1662030551|
|        IA|   19|193502.12244121206|
|        IA|  151|1519729.3072769297|
+----------+-----+------------------+
only showing top 20 rows
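
Version 10 returns a `collateral_value` column that the version 1 result above does not have, which is presumably the schema change made in the previous unit. A small sketch of how the column lists of two versions could be compared; `added_columns` is a hypothetical helper, and the lists are typed in from the result headers shown in this unit rather than read from Spark.

```python
def added_columns(old_columns, new_columns):
    """Return the columns present in new_columns but not in old_columns, keeping order."""
    old = set(old_columns)
    return [column for column in new_columns if column not in old]


v1_columns = ["addr_state", "count"]                       # header of the version 1 result
v10_columns = ["addr_state", "count", "collateral_value"]  # header of the version 10 result

print(added_columns(v1_columns, v10_columns))  # ['collateral_value']
# In the notebook, each list could instead come from the `.columns` attribute
# of the DataFrame returned by spark.sql(...) for that version.
```
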



In [11]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta VERSION AS OF 5 where addr_state='IA'").show()

                                                                                

+----------+-----+
|addr_state|count|
+----------+-----+
|        IA|  555|
+----------+-----+
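
The DataFrame reader offers an equivalent to SQL `VERSION AS OF` through the `versionAsOf` option. The sketch below is a minimal illustration, assuming the files of `loan_db.loans_by_state_delta` live under `DELTA_LAKE_DIR_ROOT` as listed in section 3. That location is an assumption, and `read_version` is a hypothetical helper name.

```python
def read_version(spark, path, version):
    """Load the Delta table stored at `path` as it existed at `version`."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # select the snapshot by version number
        .load(path)
    )


# In the notebook (assumes `spark` and DELTA_LAKE_DIR_ROOT from the earlier cells):
# read_version(spark, DELTA_LAKE_DIR_ROOT, 5).where("addr_state = 'IA'").show()
```

Taking the session as a parameter keeps the helper free of global state, so the same function can be reused across the lab's notebooks.
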



### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK