# Iceberg Lab 
## Unit 4: Schema Enforcement & Evolution

In the previous unit, we performed the operations below and observed the corresponding changes to the **data** and **metadata** folders of the table:
1. Deleted a record
2. Inserted a record
3. Updated a record
4. Ran an Insert Overwrite from a source table
5. Merged from a source table

In this unit, we will:
1. Explore schema enforcement
2. See how Iceberg supports schema evolution


### 1. Imports

In [1]:
from pyspark.sql import SparkSession


import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark

24/05/13 16:57:02 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DPMS_NAME=f"dll-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]
print("HIVE_METASTORE_WAREHOUSE_DIR",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse


In [7]:
TABLE_NAME="loans_by_state_iceberg"
DB_NAME="loan_db"
# Fully qualified table name
FQTN=f"{DB_NAME}.{TABLE_NAME}"
print("Fully qualified table name :",FQTN)

Fully qualified table name : loan_db.loans_by_state_iceberg


In [8]:
!gsutil ls -r $HIVE_METASTORE_WAREHOUSE_DIR/loan_db.db/$TABLE_NAME/

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/:

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/:
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/00000-2-525c32e9-c437-4d4c-a7a0-c6fac91bba85-0-00001.parquet
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/00000-204-10f4d0c1-80ac-4e51-acf9-b918ef057a8d-0-00001.parquet
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/00000-409-0e90780c-f8a8-40ff-a14d-6defc2157af7-0-00001.parquet
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/00000-441-eb0fbb23-c499-40cf-9192-c8bc95b48657-0-00001.parquet
gs://gcs-buc

In [9]:
#Get base file counts from the table folder
DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

MANIFEST_LIST_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("MANIFEST_LIST_COUNT",MANIFEST_LIST_COUNT)

DATA_FILE_COUNT ['6']
METADATA_FILE_COUNT ['6']
MANIFEST_FILE_COUNT ['11']
MANIFEST_LIST_COUNT ['6']
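
The three `metadata/` patterns used above separate Iceberg's metadata layers by file name. The sketch below approximates that matching with Python's `fnmatch` (an assumption: `gsutil` wildcard matching is close to, but not identical to, `fnmatch`). Apart from the metadata JSON name, which is copied from a later cell in this notebook, the file names are hypothetical and only follow the usual Iceberg naming shape.

```python
from fnmatch import fnmatchcase

# Patterns copied from the gsutil commands in the cell above.
PATTERNS = [
    ("manifest list", "snap*.avro"),
    ("manifest file", "*m[0-9].avro"),
    ("table metadata", "*.json"),
]

def classify(file_name):
    """Return the metadata layer whose pattern matches file_name."""
    for kind, pattern in PATTERNS:
        if fnmatchcase(file_name, pattern):
            return kind
    return "unmatched"

samples = [
    # Real name, taken from a later cell in this notebook.
    "00008-2c4b131c-4a23-4df5-82c3-780ef132c29d.metadata.json",
    # Hypothetical names that follow the usual naming shape.
    "snap-1111111111111111111-1-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee.avro",
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee-m0.avro",
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee-m10.avro",
]

results = {name: classify(name) for name in samples}
for name, kind in results.items():
    print(f"{kind:15} {name}")
```

Note that `*m[0-9].avro` accepts only a single digit after the `m`, so a manifest suffixed `-m10` would not be counted by that pattern.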


### 4. Existing schema

In [10]:
spark.sql(f"DESCRIBE FORMATTED {FQTN}").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used


+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                             |comment|
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|addr_state                  |string                                                                                                                |       |
|loan_count                  |bigint                                                                                                                |       |
|                            |                                                                                                                      |       |
|# Partitioning              |                      

### 5. Schema Enforcement

#### a. Appending Extra Columns

In [11]:
# Create a new DF with more columns than our base Iceberg table
additionalColumnsDF = spark.sql(f"select addr_state,loan_count, rand(10)*loan_count as new_col from {FQTN}")
additionalColumnsDF.show(3)

+----------+----------+------------------+
|addr_state|loan_count|           new_col|
+----------+----------+------------------+
|        AK|      1092|  186.677087464748|
|        AL|      5545| 4464.359324714027|
|        AR|      3297|1904.3226626013993|
+----------+----------+------------------+
only showing top 3 rows



                                                                                

In [12]:
# Attempt to append to the table
additionalColumnsDF.writeTo(f"{FQTN}").append()

AnalysisException: 
Cannot write to 'spark_catalog.loan_db.loans_by_state_iceberg', too many data columns:
Table columns: 'addr_state', 'loan_count'
Data columns: 'addr_state', 'loan_count', 'new_col'
       

**NOTE:** Spark validates the incoming DataFrame against the Iceberg table schema and rejects the write when the DataFrame contains columns that the table does not have.

#### b. Fewer Columns

In [13]:
lessColumnsDF= spark.sql(f"SELECT addr_state FROM {FQTN}")
lessColumnsDF.show(3)

+----------+
|addr_state|
+----------+
|        AK|
|        AL|
|        AR|
+----------+
only showing top 3 rows



In [14]:
# Attempt to append to the table
lessColumnsDF.writeTo(f"{FQTN}").append()

AnalysisException: Cannot write incompatible data to table 'spark_catalog.loan_db.loans_by_state_iceberg':
- Cannot find data for output column 'loan_count'

#### c. Incompatible Data Types

In [15]:
newDatatypeDF = spark.sql(f"SELECT addr_state , STRING(loan_count)  FROM {FQTN}")
newDatatypeDF.printSchema()

root
 |-- addr_state: string (nullable = true)
 |-- loan_count: string (nullable = true)



In [16]:
newDatatypeDF.writeTo(f"{FQTN}").append()

AnalysisException: Cannot write incompatible data to table 'spark_catalog.loan_db.loans_by_state_iceberg':
- Cannot safely cast 'loan_count': string to bigint
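
The three failures above (extra columns, missing columns, unsafe type change) can be summarised as one pre-write check. The following is a minimal pure-Python sketch of that idea, not Spark's actual implementation: the function name `validate_append`, the plain-dict schema representation, and the small `SAFE_WIDENING` set are all assumptions made for illustration.

```python
# Illustrative subset of type widenings treated as safe (an assumption,
# not Spark's full rule set).
SAFE_WIDENING = {("int", "bigint"), ("float", "double")}

def validate_append(table_schema, data_schema):
    """Return a list of reasons why data_schema cannot be appended.

    Both schemas are dicts of column name -> type name.
    An empty list means the append would be accepted by this sketch.
    """
    errors = []
    # a. Columns in the data that the table does not know about.
    extra = [col for col in data_schema if col not in table_schema]
    if extra:
        errors.append(f"too many data columns: {extra}")
    for col, table_type in table_schema.items():
        if col not in data_schema:
            # b. Table column with no matching data column.
            errors.append(f"cannot find data for output column '{col}'")
        elif (data_schema[col] != table_type
              and (data_schema[col], table_type) not in SAFE_WIDENING):
            # c. Type differs and is not a known safe widening.
            errors.append(
                f"cannot safely cast '{col}': {data_schema[col]} to {table_type}")
    return errors

TABLE = {"addr_state": "string", "loan_count": "bigint"}

extra_cols = validate_append(
    TABLE, {"addr_state": "string", "loan_count": "bigint", "new_col": "double"})
fewer_cols = validate_append(TABLE, {"addr_state": "string"})
wrong_type = validate_append(
    TABLE, {"addr_state": "string", "loan_count": "string"})

for case in (extra_cols, fewer_cols, wrong_type):
    print(case)
```

Each of the three calls returns exactly one reason, mirroring cases a, b, and c in this section.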

### 6. Schema Evolution
Iceberg supports schema evolution through **ALTER TABLE** DDL commands.


In [17]:
#Get Base File counts from the table folder

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

MANIFEST_LIST_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("MANIFEST_LIST_COUNT",MANIFEST_LIST_COUNT)


DATA_FILE_COUNT ['6']
METADATA_FILE_COUNT ['6']
MANIFEST_FILE_COUNT ['11']
MANIFEST_LIST_COUNT ['6']


#### a. Add New Column

In [18]:
spark.sql(f"ALTER TABLE {FQTN} ADD COLUMN new_column BIGINT")

DataFrame[]

In [19]:
spark.table(f"{FQTN}").show(3)

+----------+----------+----------+
|addr_state|loan_count|new_column|
+----------+----------+----------+
|        AK|      1092|      null|
|        AL|      5545|      null|
|        AR|      3297|      null|
+----------+----------+----------+
only showing top 3 rows



                                                                                

#### b. Rename Column

In [20]:
spark.sql(f"ALTER TABLE {FQTN} RENAME COLUMN new_column TO renamed_column")

DataFrame[]

In [21]:
spark.table(f"{FQTN}").printSchema()

root
 |-- addr_state: string (nullable = true)
 |-- loan_count: long (nullable = true)
 |-- renamed_column: long (nullable = true)



#### c. Drop Column

In [22]:
spark.sql(f"ALTER TABLE {FQTN} DROP COLUMN renamed_column")

DataFrame[]

In [23]:
spark.table(f"{FQTN}").printSchema()

root
 |-- addr_state: string (nullable = true)
 |-- loan_count: long (nullable = true)



In [24]:
#Get New File counts from the table folder after Schema updates

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

MANIFEST_LIST_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("MANIFEST_LIST_COUNT",MANIFEST_LIST_COUNT)


DATA_FILE_COUNT ['6']
METADATA_FILE_COUNT ['9']
MANIFEST_FILE_COUNT ['11']
MANIFEST_LIST_COUNT ['6']
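
To make the before/after comparison explicit, the counts printed in this section can be diffed. This is a small sketch that hard-codes the values shown in the outputs above; the helper name `to_int` and the dictionary keys are illustrative choices, not part of the lab.

```python
def to_int(shell_output):
    """Convert the list returned by a `!cmd | wc -l` cell (e.g. ['6']) to an int."""
    return int(shell_output[0].strip())

# Counts captured before the ALTER TABLE statements.
before = {
    "data": to_int(['6']),
    "metadata_json": to_int(['6']),
    "manifest": to_int(['11']),
    "manifest_list": to_int(['6']),
}
# Counts captured after the add, rename, and drop.
after = {
    "data": to_int(['6']),
    "metadata_json": to_int(['9']),
    "manifest": to_int(['11']),
    "manifest_list": to_int(['6']),
}

delta = {kind: after[kind] - before[kind] for kind in before}
changed = {kind: diff for kind, diff in delta.items() if diff != 0}
print(changed)  # {'metadata_json': 3}
```

Only the metadata JSON count moves, by three, which matches the three `ALTER TABLE` statements that were run.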


**NOTE:**
1. Schema evolution is a metadata-only operation: the data file, manifest file, and manifest list counts are unchanged, while the metadata JSON count rose from 6 to 9 (one new file per ALTER TABLE statement).
2. The latest metadata file, shown below, keeps track of all schema versions and sets the **current-schema-id** value to the schema currently in effect.

In [25]:
#Scanning through the metadata file notice that the current-schema-id has been updated to the latest version

latest_metadata = !gsutil ls  $HIVE_METASTORE_WAREHOUSE_DIR/loan_db.db/{TABLE_NAME}/metadata/*.metadata.json | tail -1
LATEST_METADATA_FILE = latest_metadata[0]
print("LATEST_METADATA_FILE", LATEST_METADATA_FILE)

!gsutil cat {LATEST_METADATA_FILE} |head -100

LATEST_METADATA_FILE gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00008-2c4b131c-4a23-4df5-82c3-780ef132c29d.metadata.json
{
  "format-version" : 2,
  "table-uuid" : "9228ae9b-8c1d-4c94-bc1e-9205b86ffd30",
  "location" : "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg",
  "last-sequence-number" : 6,
  "last-updated-ms" : 1715619518624,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "addr_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "loan_count",
      "required" : false,
      "type" : "long"
    } ]
  }, {
    "type" : "struct",
    "schema-id" : 1,
    "fields" : [ {
      "id" : 1,
      "name" : "addr_state",
      "required" : false,
      "type" : "string"
    }, 
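
Rather than scanning the JSON by eye, the active schema can be looked up by matching `current-schema-id` against the entries in `schemas`. The sketch below uses a trimmed sample that contains only the keys it needs, with values copied from the output above; the helper name `current_schema` is hypothetical.

```python
import json

# Trimmed sample: only the keys used below, values copied from the output above.
SAMPLE_METADATA = """
{
  "format-version": 2,
  "last-column-id": 3,
  "current-schema-id": 0,
  "schemas": [
    {"type": "struct", "schema-id": 0, "fields": [
      {"id": 1, "name": "addr_state", "required": false, "type": "string"},
      {"id": 2, "name": "loan_count", "required": false, "type": "long"}
    ]}
  ]
}
"""

def current_schema(metadata):
    """Return the schema entry whose schema-id equals current-schema-id."""
    current_id = metadata["current-schema-id"]
    return next(s for s in metadata["schemas"] if s["schema-id"] == current_id)

metadata = json.loads(SAMPLE_METADATA)
schema = current_schema(metadata)
columns = [(f["id"], f["name"], f["type"]) for f in schema["fields"]]
print(columns)  # [(1, 'addr_state', 'string'), (2, 'loan_count', 'long')]
```

In the notebook, the same function could presumably be applied to the full file by capturing `!gsutil cat {LATEST_METADATA_FILE}` into a variable and passing `"\n".join(...)` of it to `json.loads`.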

**NOTE:** *(In our case, we added a column, renamed it, and then dropped it, which reverted the table schema to its original version. Hence the current-schema-id is set to **"0"** again to reflect the current schema state.)*
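
The return to schema ID 0 can be reproduced with a small model of how a schema history might be tracked. This is a simplified pure-Python sketch under stated assumptions, not Iceberg's code: the class `SchemaLog` and its methods are hypothetical, each schema is a tuple of `(field_id, name, type)`, and a committed schema that is identical to an earlier one reuses that earlier ID.

```python
class SchemaLog:
    """Toy model of a table's schema history, keyed by field ID."""

    def __init__(self, fields):
        # fields: iterable of (field_id, name, type) tuples
        self.schemas = [tuple(fields)]
        self.current_schema_id = 0
        self.last_column_id = max(field_id for field_id, _, _ in self.schemas[0])

    def current(self):
        return self.schemas[self.current_schema_id]

    def _commit(self, fields):
        fields = tuple(fields)
        if fields in self.schemas:
            # An identical schema already exists: point back to it.
            self.current_schema_id = self.schemas.index(fields)
        else:
            self.schemas.append(fields)
            self.current_schema_id = len(self.schemas) - 1

    def add_column(self, name, type_):
        # New columns always receive a fresh field ID.
        self.last_column_id += 1
        self._commit(self.current() + ((self.last_column_id, name, type_),))

    def rename_column(self, old, new):
        # The field ID is kept; only the name changes.
        self._commit((i, new if n == old else n, t) for i, n, t in self.current())

    def drop_column(self, name):
        self._commit(f for f in self.current() if f[1] != name)


log = SchemaLog([(1, "addr_state", "string"), (2, "loan_count", "long")])
history = []
log.add_column("new_column", "long")
history.append(log.current_schema_id)
log.rename_column("new_column", "renamed_column")
history.append(log.current_schema_id)
log.drop_column("renamed_column")
history.append(log.current_schema_id)

print(history)             # [1, 2, 0]
print(log.last_column_id)  # 3
```

The model ends with `current-schema-id` 0 and `last-column-id` 3, the same values that appear in the metadata file above.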

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK