# Delta Lake Lab 
## Unit 5: Schema Enforcement & Evolution

In the previous unit, we ran the following operations on the base Delta table and, for each one, looked at what happens to the underlying Parquet files and reviewed the transaction log:
1. Delete
2. Insert
3. Update
4. Merge (upsert)

In this unit, we will study:
1. Schema enforcement
2. Schema evolution with Delta Lake


### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

23/12/02 23:44:09 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-11002190840/delta-consumable/:
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-55d5ffb4-6b4a-4ead-8591-983dc631339f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6486345e-6c92-4216-b9ed-ffb413e41b4d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-af39e13d-1030-4881-965c-406884eb9420-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-fe2f9136-6d38-4d3e-8090-bd41da49fe18-c000.snappy.parquet

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-110021

In [8]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

6


In [9]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

6


### 4. Existing schema

In [10]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_delta").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

+----------------------------+---------------------------------------------------+-------+
|col_name                    |data_type                                          |comment|
+----------------------------+---------------------------------------------------+-------+
|addr_state                  |string                                             |null   |
|count                       |bigint                                             |null   |
|                            |                                                   |       |
|# Detailed Table Information|                                                   |       |
|Name                        |spark_catalog.loan_db.loans_by_state_delta         |       |
|Type                        |EXTERNAL                                           |       |
|Location                    |gs://dll-data-bucket-11002190840/delta-consumable  |       |
|Provider                    |delta                                              |       |

### 5. Attempt to modify the schema

In [11]:
# Add a column called collateral_value
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.show(3)

23/12/02 23:44:32 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+----------+-----+-----------------+
|addr_state|count| collateral_value|
+----------+-----+-----------------+
|        AK|    0|1709.497137955568|
|        AL|    0|8051.143958005459|
|        AR|    0|5775.925576589018|
+----------+-----+-----------------+
only showing top 3 rows



                                                                                

In [12]:
# Attempt to append to the table
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)

AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: 347aa570-3d50-4d3a-8aba-02dcf5b8bcee).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.

Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)


Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- collateral_value: double (nullable = true)
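
The rejection above can be modelled in a few lines of plain Python. This is a simplified sketch of the idea behind schema enforcement, not Delta Lake's implementation: the schemas are plain dicts, and `SchemaMismatchError` and `enforce_schema` are hypothetical names introduced only for this illustration.

```python
# Illustrative model only: this is NOT how Delta Lake is implemented.
# Schemas are plain dicts of column name -> type name.

class SchemaMismatchError(Exception):
    """Raised when incoming data does not fit the table schema."""


def enforce_schema(table_schema, data_schema):
    """Reject a write that adds columns or changes a column's type."""
    extra = [c for c in data_schema if c not in table_schema]
    mismatched = [
        c for c in data_schema
        if c in table_schema and data_schema[c] != table_schema[c]
    ]
    if extra or mismatched:
        raise SchemaMismatchError(
            f"extra columns: {extra}, type mismatches: {mismatched}"
        )


# The two schemas printed in the error message above.
table_schema = {"addr_state": "string", "count": "long"}
data_schema = {
    "addr_state": "string",
    "count": "long",
    "collateral_value": "double",
}

try:
    enforce_schema(table_schema, data_schema)
    write_allowed = True
except SchemaMismatchError as err:
    write_allowed = False
    print("Write rejected:", err)
```

The point of the check is that a write either fits the declared schema or fails as a whole, so the table never ends up with a mix of incompatible files.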

         

### 6. Supply "mergeSchema" option

In [13]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

6


In [14]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

6


In [15]:
schemaEvolvedDF.write.option("mergeSchema",True).format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)

                                                                                

In [16]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta").show(truncate=False)

                                                                                

+----------+-----+----------------+
|addr_state|count|collateral_value|
+----------+-----+----------------+
|AK        |1    |null            |
|AL        |1    |null            |
|AR        |1    |null            |
|AZ        |1    |null            |
|CA        |12345|null            |
|CO        |1    |null            |
|CT        |1    |null            |
|DC        |1    |null            |
|DE        |1    |null            |
|FL        |1    |null            |
|GA        |1    |null            |
|HI        |1    |null            |
|IA        |555  |null            |
|ID        |1    |null            |
|IL        |1    |null            |
|IN        |6666 |null            |
|KS        |1    |null            |
|KY        |1    |null            |
|LA        |1    |null            |
|MA        |1    |null            |
+----------+-----+----------------+
only showing top 20 rows



In [17]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

7


In [18]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

7


In [19]:
# Let's look at our data lake and the changes from the above execution
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-11002190840/delta-consumable/:
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-55d5ffb4-6b4a-4ead-8591-983dc631339f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6486345e-6c92-4216-b9ed-ffb413e41b4d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6dbd9c8a-28bb-48a7-b1a9-5304b8f16bc2-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-af39e13d-1030-4881-965c-406884eb9420-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-fe2f9136-6d38-4d3e-8090-bd41da49fe18-c000.snappy.parquet

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_l

There is one extra Parquet file, containing the data with the new column, and one new transaction log file.
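
Why do the existing rows show `null` for `collateral_value`? A minimal sketch in plain Python, assuming a simplified model in which the table schema becomes the union of the old and new columns and files written earlier simply lack the new column. `merge_schemas` and `read_row` are hypothetical helpers written for this illustration; they are not Delta Lake APIs.

```python
# Illustrative model only: NOT Delta Lake's implementation.

def merge_schemas(table_schema, data_schema):
    """Return the table schema extended with any new columns from the data."""
    merged = dict(table_schema)  # keeps the existing column order
    for name, dtype in data_schema.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(f"incompatible type for column {name!r}")
        merged.setdefault(name, dtype)
    return merged


def read_row(stored_row, schema):
    """Project a stored row onto the current schema; missing columns read as None."""
    return {name: stored_row.get(name) for name in schema}


table_schema = {"addr_state": "string", "count": "long"}
data_schema = {
    "addr_state": "string",
    "count": "long",
    "collateral_value": "double",
}

merged = merge_schemas(table_schema, data_schema)
print(list(merged))

# A row written before the schema change has no collateral_value on disk.
old_row = {"addr_state": "CA", "count": 12345}
print(read_row(old_row, merged))
```

The older Parquet files are not rewritten in this model; the new column is filled in as missing when they are read, which matches the `null` values in the query output above.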


Let's look at the log:

In [20]:
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000006.json
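
Each commit file in `_delta_log` is newline-delimited JSON with one action per line, such as `commitInfo`, `metaData` and `add`. Once a commit file has been fetched, it can be summarized with the standard `json` module. The sketch below runs against a hypothetical, heavily trimmed commit that was written for this illustration; it is not the content of this lab's log, and `summarize_commit` is a hypothetical helper.

```python
import json

# Hypothetical, heavily trimmed commit written for illustration only;
# it is NOT the content of this lab's transaction log.
schema = {
    "type": "struct",
    "fields": [
        {"name": "addr_state", "type": "string", "nullable": True, "metadata": {}},
        {"name": "count", "type": "long", "nullable": True, "metadata": {}},
        {"name": "collateral_value", "type": "double", "nullable": True, "metadata": {}},
    ],
}
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    # schemaString is itself a JSON document stored as a string.
    json.dumps({"metaData": {"schemaString": json.dumps(schema)}}),
    json.dumps({"add": {"path": "part-00000-example.snappy.parquet", "dataChange": True}}),
])


def summarize_commit(text):
    """Collect the operation, the column names and the added files of one commit."""
    summary = {"operation": None, "columns": None, "added_files": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            summary["operation"] = action["commitInfo"].get("operation")
        elif "metaData" in action:
            parsed = json.loads(action["metaData"]["schemaString"])
            summary["columns"] = [field["name"] for field in parsed["fields"]]
        elif "add" in action:
            summary["added_files"].append(action["add"]["path"])
    return summary


summary = summarize_commit(sample_commit)
print(summary)
```

A commit that changes the schema carries a `metaData` action, so the presence of `columns` in the summary is a quick way to spot the version at which the new column was introduced.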


In [21]:
# Let's add data again until we reach version 10 of the table
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log


                                                                                

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000007.json


In [22]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

9


In [23]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

8


In [24]:
# Let's add data again until we reach version 10 of the table
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)


                                                                                

In [25]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000008.json


In [26]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part*

gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-55d5ffb4-6b4a-4ead-8591-983dc631339f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6486345e-6c92-4216-b9ed-ffb413e41b4d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-687d9651-589c-4d6c-a8fd-df5d01770e85-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6dbd9c8a-28bb-48a7-b1a9-5304b8f16bc2-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-af39e13d-1030-4881-965c-406884eb9420-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-dc870c0d-50c0-48b3-9d4d-7c28af28813c-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-0

In [27]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

13


In [28]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

9


In [29]:
# Let's add data again until we reach version 10 of the table
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000008.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000009.json


In [30]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

21


In [31]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

10


In [32]:
# Let's add data again until we reach version 10 of the table
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000008.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000009.json
gs://

In [33]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

29


In [34]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

11


In [35]:
# Let's add data again until we reach version 10 of the table
schemaEvolvedDF = spark.sql("select addr_state, cast(rand(10)*count as bigint) as count, cast(rand(10) * 10000 * count as double) as collateral_value from loan_db.loans_by_state_delta")
schemaEvolvedDF.write.format("delta").mode("append").save(DELTA_LAKE_DIR_ROOT)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

                                                                                

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000008.json
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000009.json
gs://

In [36]:
# Data file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

37


In [37]:
# Delta log count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

12


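Note that a Parquet file has been created in `_delta_log`. This is a checkpoint: a compacted snapshot of the table state that consolidates the JSON commits written before it.<br>
Delta loads this checkpoint instead of replaying every earlier JSON commit, which avoids reading many small files.<br>
With the default checkpoint interval of 10 commits, a reader typically loads only the most recent checkpoint plus the JSON commits written after it, so it rarely needs to read more than about 10 JSON transaction logs.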
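
A minimal sketch of how a reader could choose which log files to load, assuming it uses only the most recent checkpoint and the JSON commits after it. This is a simplified model rather than Delta Lake's reader: it ignores multi-part checkpoints and the `_last_checkpoint` pointer file, `files_to_read` is a hypothetical helper, and the listing is synthetic (12 JSON commits and one checkpoint at version 10, following the log file naming convention).

```python
import re

# Log file name patterns: a 20-digit version followed by the file type.
COMMIT_RE = re.compile(r"^(\d{20})\.json$")
CHECKPOINT_RE = re.compile(r"^(\d{20})\.checkpoint\.parquet$")


def files_to_read(log_files):
    """Return the latest checkpoint (if any) plus the JSON commits after it."""
    commits = {}
    checkpoints = {}
    for name in log_files:
        commit_match = COMMIT_RE.match(name)
        checkpoint_match = CHECKPOINT_RE.match(name)
        if commit_match:
            commits[int(commit_match.group(1))] = name
        elif checkpoint_match:
            checkpoints[int(checkpoint_match.group(1))] = name

    if checkpoints:
        latest = max(checkpoints)
        selected = [checkpoints[latest]]
    else:
        latest = -1  # no checkpoint yet: replay every commit
        selected = []
    selected += [commits[v] for v in sorted(commits) if v > latest]
    return selected


# Synthetic listing for illustration: versions 0..11 plus a checkpoint at 10.
log_files = [f"{v:020d}.json" for v in range(12)]
log_files.append(f"{10:020d}.checkpoint.parquet")

to_read = files_to_read(log_files)
print(to_read)
```

In this model the reader touches two files instead of twelve, which is the benefit the note above describes.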

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK