# Iceberg Lab 
## Unit 8: Table Maintenance Procedures 

In the previous unit, we:
1. Learned about Snapshot Management

In this unit, we will:
1. Learn about the Spark procedures that Iceberg provides for table maintenance


### 1. Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col 

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark

24/05/13 17:15:41 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DPMS_NAME=f"dll-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]
print("HIVE_METASTORE_WAREHOUSE_DIR",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse


In [7]:
TABLE_NAME="loans_by_state_iceberg"
DB_NAME="loan_db"

#fully qualified table name
FQTN=f"{DB_NAME}.{TABLE_NAME}"

print("Fully qualified table name :",FQTN)

Fully qualified table name : loan_db.loans_by_state_iceberg


### 4. Table Maintenance

In [8]:
#Get base file counts from the table folder

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

MANIFEST_LIST_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("MANIFEST_LIST_COUNT",MANIFEST_LIST_COUNT)

DATA_FILE_COUNT ['6']
METADATA_FILE_COUNT ['11']
MANIFEST_FILE_COUNT ['11']
MANIFEST_LIST_COUNT ['6']
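The same four counts are collected after every maintenance step in this unit. As a minimal sketch, the classification done by the `gsutil` globs can also be expressed in plain Python over a list of object paths. The function name `count_table_files`, the dictionary `FILE_PATTERNS` and the sample paths are hypothetical and only illustrate the idea; the patterns assume the file naming seen in this lab.

```python
import re

# Regexes mirroring the gsutil globs used above; `[^/]*` stays within one
# folder level, like a single `*` in a gsutil wildcard.
FILE_PATTERNS = {
    "DATA_FILE_COUNT": re.compile(r"/data/[^/]*\.parquet$"),
    "METADATA_FILE_COUNT": re.compile(r"/metadata/[^/]*\.json$"),
    "MANIFEST_FILE_COUNT": re.compile(r"/metadata/[^/]*m[0-9]\.avro$"),
    "MANIFEST_LIST_COUNT": re.compile(r"/metadata/snap[^/]*\.avro$"),
}


def count_table_files(paths):
    """Count object paths per Iceberg file category."""
    counts = {name: 0 for name in FILE_PATTERNS}
    for path in paths:
        for name, pattern in FILE_PATTERNS.items():
            if pattern.search(path):
                counts[name] += 1
    return counts


# Hypothetical listing; real paths would come from `gsutil ls -r` on the table folder.
sample_listing = [
    "gs://example-bucket/hive-warehouse/loan_db.db/t/data/00000-0-aaaa.parquet",
    "gs://example-bucket/hive-warehouse/loan_db.db/t/data/00001-1-bbbb.parquet",
    "gs://example-bucket/hive-warehouse/loan_db.db/t/metadata/00000-cccc.metadata.json",
    "gs://example-bucket/hive-warehouse/loan_db.db/t/metadata/dddd-m0.avro",
    "gs://example-bucket/hive-warehouse/loan_db.db/t/metadata/snap-123-1-eeee.avro",
]
counts = count_table_files(sample_listing)
print(counts)
```

Wrapping the counting in one function would avoid repeating the four shell commands after each procedure call.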


#### a. expire_snapshots

In [9]:
#Fetch an expiration timestamp or input value manually 

#EXP_TS=<enter timestamp manually here>

from pyspark.sql.functions import col,lit

EXP_TS = spark.sql("select committed_at from (SELECT committed_at, ROW_NUMBER() OVER(ORDER BY committed_at ASC) rownum from loan_db.loans_by_state_iceberg.snapshots) a where a.rownum =3").collect()[0][0]
print("Expiration Timestamp=", EXP_TS)

24/05/13 17:16:06 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[Stage 0:>                                                          (0 + 1) / 1]

Expiration Timestamp= 2024-05-13 16:50:19.235000
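The `WindowExec` warning appears because `ROW_NUMBER()` runs without a partition. Since a snapshots table is normally small, one alternative is to collect the `committed_at` values and pick the n-th oldest on the driver. The sketch below shows only the selection step; the helper name `nth_oldest` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def nth_oldest(timestamps, n):
    """Return the n-th oldest timestamp (1-based)."""
    ordered = sorted(timestamps)
    if not 1 <= n <= len(ordered):
        raise ValueError(f"n must be between 1 and {len(ordered)}")
    return ordered[n - 1]


# Hypothetical commit times, deliberately out of order.
sample_commits = [
    datetime(2024, 5, 13, 16, 50),
    datetime(2024, 5, 13, 16, 46),
    datetime(2024, 5, 13, 16, 48),
    datetime(2024, 5, 13, 16, 52),
]
third_oldest = nth_oldest(sample_commits, 3)
print("Expiration Timestamp=", third_oldest)

# With Spark, the input could be collected from the snapshots metadata table:
# rows = spark.sql(f"SELECT committed_at FROM {FQTN}.snapshots").collect()
# EXP_TS = nth_oldest([row.committed_at for row in rows], 3)
```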


                                                                                

In [10]:
spark.sql(f"CALL spark_catalog.system.expire_snapshots('loan_db.loans_by_state_iceberg',TIMESTAMP '{EXP_TS}', 5)").show(truncate=False)

                                                                                

+------------------------+-----------------------------------+-----------------------------------+----------------------------+----------------------------+------------------------------+
|deleted_data_files_count|deleted_position_delete_files_count|deleted_equality_delete_files_count|deleted_manifest_files_count|deleted_manifest_lists_count|deleted_statistics_files_count|
+------------------------+-----------------------------------+-----------------------------------+----------------------------+----------------------------+------------------------------+
|1                       |0                                  |0                                  |1                           |1                           |0                             |
+------------------------+-----------------------------------+-----------------------------------+----------------------------+----------------------------+------------------------------+



In [11]:
#Get file counts from the table folder after expiring old snapshots

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

MANIFEST_LIST_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("MANIFEST_LIST_COUNT",MANIFEST_LIST_COUNT)

DATA_FILE_COUNT ['5']
METADATA_FILE_COUNT ['12']
MANIFEST_FILE_COUNT ['10']
MANIFEST_LIST_COUNT ['5']


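**NOTE:**
1. `expire_snapshots` expires snapshots older than the given timestamp while always retaining the last 5 snapshots (the third argument, `retain_last`). It then deletes the data files, manifests and manifest lists that are referenced only by the expired snapshots. Here one snapshot was expired, so one data file, one manifest and one manifest list were deleted.
2. A new metadata file is also written. It no longer tracks the expired snapshots, so they are no longer available for time travel queries.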
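To see how the timestamp and the retain count interact, here is a simplified model of the selection, assuming a linear snapshot history with no branches or tags. The function `snapshots_to_expire` and the commit times are hypothetical; Iceberg's real implementation walks snapshot ancestry, so treat this as an illustration only.

```python
from datetime import datetime


def snapshots_to_expire(commit_times, older_than, retain_last=1):
    """Return commit times that would expire under the simplified model.

    A snapshot expires only if it is strictly older than `older_than`
    AND it is not among the newest `retain_last` snapshots.
    """
    ordered = sorted(commit_times)
    first_protected = max(len(ordered) - retain_last, 0)
    return [
        ts for index, ts in enumerate(ordered)
        if index < first_protected and ts < older_than
    ]


# Six hypothetical snapshots, two minutes apart.
commits = [datetime(2024, 5, 13, 16, 44 + 2 * i) for i in range(6)]
cutoff = commits[2]  # timestamp of the third-oldest snapshot

expired_keep_5 = snapshots_to_expire(commits, cutoff, retain_last=5)
expired_keep_1 = snapshots_to_expire(commits, cutoff, retain_last=1)
print(len(expired_keep_5), len(expired_keep_1))
```

With six snapshots, a cutoff at the third-oldest and a retain count of 5, only the oldest snapshot expires under this model, which matches the single manifest list deleted in the output above.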



In [12]:
spark.table("loan_db.loans_by_state_iceberg.snapshots").show()

+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2024-05-13 16:48:...|5233765456160638845|9176687385630465169|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:50:...|7272046516454809414|5233765456160638845|   append|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:51:...|6697245229602575300|7272046516454809414|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:52:...|1795098683837972174|6697245229602575300|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:54:...|4282559450411521188|1795098683837972174|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+

#### b. rewrite_manifests

In [13]:
#Rewrite manifests for a table to optimize scan planning.
spark.sql("CALL spark_catalog.system.rewrite_manifests('loan_db.loans_by_state_iceberg')").show(truncate=False)

24/05/13 17:16:43 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+-------------------------+---------------------+
|rewritten_manifests_count|added_manifests_count|
+-------------------------+---------------------+
|2                        |1                    |
+-------------------------+---------------------+



In [14]:
#Get file counts from the table folder after rewriting manifests

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

SNAPSHOT_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("SNAPSHOT_FILE_COUNT",SNAPSHOT_FILE_COUNT)

DATA_FILE_COUNT ['5']
METADATA_FILE_COUNT ['13']
MANIFEST_FILE_COUNT ['10']
SNAPSHOT_FILE_COUNT ['6']


**NOTE:**
1. Rewriting manifests performs the following operations: <br>
    i. Aligns manifest files with the table partitioning <br>
    ii. Sorts the data file entries in each manifest by the partition spec fields <br>
    iii. Optimizes scan planning <br>

2. It adds a new snapshot and manifest list to record the changes to the manifests

#### c. rewrite_data_files

In [15]:
#rewrite data file using lexicographical sort order compaction strategy 
spark.sql(f"CALL spark_catalog.system.rewrite_data_files(table => '{FQTN}',strategy => 'sort', sort_order => 'addr_state ASC NULLS LAST' )").show(truncate=False)

+--------------------------+----------------------+---------------------+
|rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|
+--------------------------+----------------------+---------------------+
|0                         |0                     |0                    |
+--------------------------+----------------------+---------------------+



In [16]:
#rewrite data file using zorder sort compaction strategy 
spark.sql(f"CALL spark_catalog.system.rewrite_data_files(table => '{FQTN}',strategy => 'sort', sort_order => 'zorder(addr_state,loan_count)')").show(truncate=False)

+--------------------------+----------------------+---------------------+
|rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|
+--------------------------+----------------------+---------------------+
|0                         |0                     |0                    |
+--------------------------+----------------------+---------------------+



**NOTE:**
1. If the data files are already compacted, `rewrite_data_files` finds nothing to rewrite and leaves the files untouched, which is why all counts above are 0
2. If no strategy is specified, bin-packing (`binpack`) is used as the default compaction strategy
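The procedure also accepts an `options` map that tunes which files qualify for a rewrite. The sketch below only builds the `CALL` statement as a string, so it runs without a Spark session. The helper `rewrite_data_files_sql` is hypothetical, and the option name `min-input-files` is taken from the Iceberg documentation as I understand it; verify it against the Iceberg version in use.

```python
def rewrite_data_files_sql(table, strategy="binpack", sort_order=None,
                           options=None, catalog="spark_catalog"):
    """Build a CALL statement for rewrite_data_files using named arguments."""
    args = [f"table => '{table}'", f"strategy => '{strategy}'"]
    if sort_order:
        args.append(f"sort_order => '{sort_order}'")
    if options:
        # Spark's map() takes alternating keys and values.
        pairs = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
        args.append(f"options => map({pairs})")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Default bin-packing, but consider groups with as few as 2 input files.
binpack_sql = rewrite_data_files_sql(
    "loan_db.loans_by_state_iceberg",
    options={"min-input-files": "2"},
)
print(binpack_sql)

# In the notebook this could then be run as:
# spark.sql(binpack_sql).show(truncate=False)
```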

#### d. Clear old metadata files

In [17]:
#Set auto metadata cleanup to true
spark.sql(f'ALTER TABLE {FQTN} SET TBLPROPERTIES("write.metadata.delete-after-commit.enabled"=true)').show(truncate=False)

#Set max versions of metadata files to be retained
spark.sql(f'ALTER TABLE {FQTN} SET TBLPROPERTIES("write.metadata.previous-versions-max"=5)').show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used


++
||
++
++

++
||
++
++



In [18]:
#Get file counts from the table folder after enabling metadata cleanup

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

SNAPSHOT_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("SNAPSHOT_FILE_COUNT",SNAPSHOT_FILE_COUNT)

DATA_FILE_COUNT ['5']
METADATA_FILE_COUNT ['6']
MANIFEST_FILE_COUNT ['10']
SNAPSHOT_FILE_COUNT ['6']


In [19]:
!gsutil ls -r  {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00009-adcf98fc-d438-4d60-869b-058ed138dba6.metadata.json
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00010-6777e6cd-e522-47d3-ac68-35c307f44542.metadata.json
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00011-4b07c1d9-8a6f-4f70-b260-569027d7db9f.metadata.json
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00012-b9b6b628-ed32-47e9-95cc-5904524cd3ca.metadata.json
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00013-b31ebdc2-0199-492c-883f-ed145a2d3c95.metadata.json
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-wareho

**NOTE:**
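1. With `write.metadata.delete-after-commit.enabled` set, Iceberg deletes the oldest tracked metadata files after each commit and retains only the 5 previous versions (`write.metadata.previous-versions-max`)
2. The `ALTER TABLE` commit itself writes a new metadata file, so the table has 6 metadata files: the current one plus 5 previous versions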
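The arithmetic behind the 6 remaining files can be sketched as follows. This is a simplified model that assumes every existing metadata file is tracked in the table's metadata log; the function `retained_metadata_files` and the shortened file names are hypothetical.

```python
def retained_metadata_files(existing, new_file, previous_versions_max):
    """Return the metadata files kept after a commit writes `new_file`.

    Keeps the new (current) file plus at most `previous_versions_max`
    of the most recent earlier versions.
    """
    history = list(existing) + [new_file]
    return history[-(previous_versions_max + 1):]


# Hypothetical history: versions 00000 to 00013 exist before the commit.
existing_files = [f"{version:05d}.metadata.json" for version in range(14)]
kept = retained_metadata_files(existing_files, "00014.metadata.json", 5)
print(len(kept), kept[0], kept[-1])
```

Under this model the versions kept are 00009 through 00014, in line with the version prefixes visible in the listing above.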

#### e. remove_orphan_files

In [20]:
# Clearing any orphaned (untracked) data files from the data folder
spark.sql(f"CALL spark_catalog.system.remove_orphan_files('{FQTN}')").show(truncate=False)

                                                                                

+--------------------+
|orphan_file_location|
+--------------------+
+--------------------+



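**NOTE:** If the procedure finds any orphaned files, it deletes them and returns the location of each deleted file. By default, only files older than 3 days are considered, which protects files written by in-progress jobs.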
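Because this procedure deletes files, it can be worth previewing the result first. The sketch below builds a `CALL` statement with the `dry_run` and `older_than` named arguments, which list candidate files without deleting them. The helper `remove_orphan_files_sql` is hypothetical and only produces the SQL string; the cutoff date is an arbitrary example.

```python
from datetime import datetime


def remove_orphan_files_sql(table, older_than=None, dry_run=True,
                            catalog="spark_catalog"):
    """Build a CALL statement for remove_orphan_files using named arguments."""
    args = [f"table => '{table}'"]
    if older_than is not None:
        args.append(f"older_than => TIMESTAMP '{older_than:%Y-%m-%d %H:%M:%S}'")
    args.append(f"dry_run => {'true' if dry_run else 'false'}")
    return f"CALL {catalog}.system.remove_orphan_files({', '.join(args)})"


# Preview orphan files older than an example cutoff without deleting anything.
preview_sql = remove_orphan_files_sql(
    "loan_db.loans_by_state_iceberg",
    older_than=datetime(2024, 5, 10),
)
print(preview_sql)

# In the notebook this could then be run as:
# spark.sql(preview_sql).show(truncate=False)
```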

In [21]:
#Get file counts from the table folder after removing orphan files

DATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/data/*.parquet | wc -l
print("DATA_FILE_COUNT",DATA_FILE_COUNT)

METADATA_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*.json | wc -l
print("METADATA_FILE_COUNT",METADATA_FILE_COUNT)

MANIFEST_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/*m[0-9].avro | wc -l
print("MANIFEST_FILE_COUNT",MANIFEST_FILE_COUNT)

SNAPSHOT_FILE_COUNT= !gsutil ls -r {HIVE_METASTORE_WAREHOUSE_DIR}/loan_db.db/{TABLE_NAME}/metadata/snap*.avro | wc -l
print("SNAPSHOT_FILE_COUNT",SNAPSHOT_FILE_COUNT)

DATA_FILE_COUNT ['5']
METADATA_FILE_COUNT ['6']
MANIFEST_FILE_COUNT ['10']
SNAPSHOT_FILE_COUNT ['6']


### THIS CONCLUDES THE ICEBERG LAB. DON'T FORGET TO SHUT DOWN THE LAB RESOURCES.