# Delta Lake Lab 
## Unit 5: Schema Evolution

In the previous unit, we ran the following operations on the base Delta table and, for each one, examined what happens to the underlying Parquet files and reviewed the transaction log:
1. Delete
2. Insert
3. Update
4. Merge (upsert)

In this unit, we will:
1. Study the schema evolution that Delta Lake supports


### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

22/10/22 23:12:20 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  885979867746


In [6]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-1ddad42e-603a-40c6-83b7-4ea3ae4a5128-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-407faca5-0109-4def-a655-a688e694cc6b-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9566ead4-572a-4cc0-ba48-605bdfa18778-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-efb9177f-a191-4835-928a-0d7e4b72e8e0-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-f028020a-487f-4de7-9fb5-be01dd31b7af-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumab

### 4. Existing schema

In [8]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_delta").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

+----------------------------+---------------------------------------------------+-------+
|col_name                    |data_type                                          |comment|
+----------------------------+---------------------------------------------------+-------+
|addr_state                  |string                                             |       |
|count                       |bigint                                             |       |
|                            |                                                   |       |
|# Partitioning              |                                                   |       |
|Not partitioned             |                                                   |       |
|                            |                                                   |       |
|# Detailed Table Information|                                                   |       |
|Name                        |loan_db.loans_by_state_delta                       |       |

### 5. Attempt to modify the schema

In [9]:
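# Build a DataFrame with an extra column called collateral_value
# (use spark.sql explicitly rather than relying on a kernel-provided sql alias)
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")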
schemaEvolvedDF.show(3)

                                                                                

+----------+-----+--------------------+
|addr_state|count|    collateral_value|
+----------+-----+--------------------+
|        IA|   94|   948770.9115653402|
|        CA| 9939| 9.939137216157739E7|
|        IN| 3850|3.8502319893542394E7|
+----------+-----+--------------------+



In [10]:
# Attempt to append to the table without mergeSchema; Delta rejects the write because of the extra column
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)

AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: 50aecee7-98f2-40ec-85ef-68af054ea14e).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.

Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)


Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- collateral_value: double (nullable = true)
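
The check behind this error can be pictured as a comparison of the incoming column names and types against the table's. The following is a conceptual sketch in plain Python, not Delta Lake's actual implementation; `find_schema_mismatch` is a hypothetical helper, and the schemas are written as simple dicts that mirror the two schemas printed above.

```python
def find_schema_mismatch(table_schema, data_schema):
    """Compare an incoming schema with the table schema.

    Returns (extra, changed):
      extra   - columns present in the data but not in the table
      changed - columns present in both but with different types
    """
    extra = {col: dtype for col, dtype in data_schema.items()
             if col not in table_schema}
    changed = {col: (table_schema[col], dtype)
               for col, dtype in data_schema.items()
               if col in table_schema and table_schema[col] != dtype}
    return extra, changed


# Schemas copied from the error message above, as plain dicts (an assumption
# made for illustration; Spark represents schemas as StructType objects).
table_schema = {"addr_state": "string", "count": "long"}
data_schema = {"addr_state": "string", "count": "long",
               "collateral_value": "double"}

extra, changed = find_schema_mismatch(table_schema, data_schema)
print("extra columns:", extra)
print("type changes:", changed)
```

Here the only difference is the additional `collateral_value` column, which is why the append is rejected until schema merging is enabled.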

         

### 6. Supply "mergeSchema" option

In [11]:
schemaEvolvedDF.write.format("delta").mode("append").option("mergeSchema", "true").save(DELTA_LAKE_DIR_ROOT)

                                                                                

In [12]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta").show()

+----------+-----+--------------------+
|addr_state|count|    collateral_value|
+----------+-----+--------------------+
|        IA|   94|   948770.9115653402|
|        CA| 9939| 9.939137216157739E7|
|        IN| 3850|3.8502319893542394E7|
|        IA|  555|                null|
|        CA|12345|                null|
|        IN| 6666|                null|
+----------+-----+--------------------+
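
In the output above, the rows that were written before the new column existed show `null` for `collateral_value`. A rough way to picture this, as a plain-Python sketch rather than Delta Lake's real mechanism: the merged schema is the table's columns plus any new columns from the data, and rows from older files simply have no value for the new column. `merge_schemas` and `read_row` are hypothetical helpers, and the dict-based schemas and rows are assumptions made for illustration.

```python
def merge_schemas(table_schema, data_schema):
    """Return the table schema extended with any new columns from the data."""
    merged = dict(table_schema)           # keep existing columns and their order
    for col, dtype in data_schema.items():
        merged.setdefault(col, dtype)     # append only columns not already present
    return merged


def read_row(row, schema):
    """Project a stored row onto a schema; missing columns come back as None."""
    return {col: row.get(col) for col in schema}


table_schema = {"addr_state": "string", "count": "long"}
data_schema = {"addr_state": "string", "count": "long",
               "collateral_value": "double"}
merged = merge_schemas(table_schema, data_schema)

old_row = {"addr_state": "IA", "count": 555}   # written before the schema change
new_row = {"addr_state": "IA", "count": 94,
           "collateral_value": 948770.9115653402}

print(read_row(old_row, merged))
print(read_row(new_row, merged))
```

The old Parquet files are not rewritten in this picture; the missing column is filled in at read time.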



In [13]:
# Let's look at our data lake and the changes from the above execution
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-19db8128-746a-4cbd-9a35-ce10d21dbf3a-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-1ddad42e-603a-40c6-83b7-4ea3ae4a5128-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-407faca5-0109-4def-a655-a688e694cc6b-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9566ead4-572a-4cc0-ba48-605bdfa18778-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-efb9177f-a191-4835-928a-0d7e4b72e8e0-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-f028020a-487f-4de7-9fb5-be01dd31b7af-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-buc

There is one extra Parquet file, containing the data with the new column, and one new transaction log entry.


Let's look at the log:

In [14]:
!gsutil cat "gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json"

CommandException: No URLs matched: gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json
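
Commit files in `_delta_log` are named after the table version, zero-padded to 20 digits, so a path can be built from a version number instead of being typed by hand. A minimal sketch follows; `delta_log_path` is a hypothetical helper, not part of the Delta Lake API, and the version passed to it still has to exist in the listing for `gsutil cat` to succeed.

```python
def delta_log_path(table_root, version):
    """Build the path of the JSON commit file for a given table version."""
    # Delta commit files use a 20-digit, zero-padded version number.
    return f"{table_root.rstrip('/')}/_delta_log/{version:020d}.json"


# Bucket path taken from the listing above.
table_root = "gs://dll-data-bucket-885979867746/delta-consumable"
path = delta_log_path(table_root, 6)
print(path)
```

In the notebook, `DELTA_LAKE_DIR_ROOT` could be passed as `table_root`, and the result handed to `gsutil cat` after checking the directory listing for the latest version.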


In [15]:
# Let's keep adding data until the table reaches version 10
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log


                                                                                

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json


In [16]:
# Let's keep adding data until the table reaches version 10
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000007.json


In [17]:
# Let's keep adding data until the table reaches version 10
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000008.json


In [18]:
# Let's keep adding data until the table reaches version 10
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000008.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/0000000000000000000

In [19]:
# Let's keep adding data until the table reaches version 10
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000008.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/0000000000000000000

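Note that a Parquet file has been created in the transaction log directory. This is a checkpoint: a compacted snapshot of the table state up to the commit at which it was written. By default, Delta writes a checkpoint every 10 commits.

To reconstruct the table state, Delta loads the most recent checkpoint and then replays only the JSON commits written after it, which avoids reading many small files. As a result, with the default interval it rarely needs to read more than about 10 JSON transaction logs, and it does not need to read older checkpoints.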
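
The effect of checkpointing on log reads can be sketched as follows. This is a simplified model in plain Python, not Delta Lake's implementation: it assumes a checkpoint interval of 10, assumes no checkpoint is written at version 0, ignores multi-part checkpoints and log cleanup, and starts from the most recent checkpoint only. `files_to_read` is a hypothetical helper.

```python
def files_to_read(latest_version, checkpoint_interval=10):
    """List the log files needed to rebuild the table state at latest_version.

    Simplified model: start from the most recent checkpoint (if any) and
    replay only the JSON commits written after it.
    """
    last_checkpoint = (latest_version // checkpoint_interval) * checkpoint_interval
    files = []
    if last_checkpoint > 0:
        # The checkpoint already contains the state up to and including
        # the version in its file name.
        files.append(f"{last_checkpoint:020d}.checkpoint.parquet")
        first_json = last_checkpoint + 1
    else:
        # No checkpoint yet: replay every commit from version 0.
        first_json = 0
    files += [f"{v:020d}.json" for v in range(first_json, latest_version + 1)]
    return files


for version in (6, 10, 13):
    print(version, files_to_read(version))
```

Under these assumptions, version 6 needs seven JSON files, version 10 needs only the checkpoint, and version 13 needs the checkpoint plus three JSON files.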

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK