# Iceberg Lab 
## Unit 5: Table Inspection

In the previous unit, we:
1. Learned how schema is enforced in Iceberg
2. Learned how to perform schema evolution and how Iceberg keeps track of it

In this unit, we will:
1. Explore the metadata inspection tables that Iceberg provides

### 1. Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark

24/05/13 17:02:45 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DPMS_NAME=f"iceberg-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]
print("HIVE_METASTORE_WAREHOUSE_DIR",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR ERROR: (gcloud.metastore.services.describe) NOT_FOUND: Resource 'projects/delta-lake-diy-lab/locations/us-central1/services/iceberg-hms-11002190840' was not found
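
The output above shows that a failed `gcloud` call leaves its error text in `HIVE_METASTORE_WAREHOUSE_DIR`. A minimal sketch of a guard follows, assuming the warehouse directory is always a `gs://` URI (as the file paths later in this unit suggest). The helper name `first_gcs_path` is hypothetical and not part of the lab.

```python
def first_gcs_path(lines):
    """Return the first captured line if it is a gs:// URI, else raise ValueError."""
    # `lines` is the list-like object IPython returns for a `!command` capture.
    if not lines:
        raise ValueError("command produced no output")
    value = lines[0].strip()
    if not value.startswith("gs://"):
        # Surface the captured text so a gcloud failure is visible immediately.
        raise ValueError(f"expected a gs:// path, got: {value!r}")
    return value


# Example with an error message like the one captured above (resource name elided):
captured = ["ERROR: (gcloud.metastore.services.describe) NOT_FOUND: Resource ... was not found"]
try:
    first_gcs_path(captured)
except ValueError as err:
    print("Warehouse lookup failed:", err)
```

In the notebook this could be used as `HIVE_METASTORE_WAREHOUSE_DIR = first_gcs_path(metastore_dir)`, so the cell fails loudly instead of storing the error string.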


In [7]:
TABLE_NAME="loans_by_state_iceberg"
DB_NAME="loan_db"

#fully qualified table name
FQTN=f"{DB_NAME}.{TABLE_NAME}"

print("Fully qualified table name :",FQTN)

Fully qualified table name : loan_db.loans_by_state_iceberg


### 4. Table Inspection
Iceberg exposes a set of metadata tables that make it easier to read the files in a table's metadata folder. The information in these tables can be used for time travel, rollback, snapshot correction, and maintenance.
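
Each metadata table is addressed by appending its name to the fully qualified table name. The sketch below builds those names from `FQTN` rather than hardcoding them. The helper `metadata_table` and the `METADATA_TABLES` tuple are hypothetical names, and the tuple covers only the tables queried in this unit.

```python
# Metadata tables queried in this unit (Iceberg exposes others as well).
METADATA_TABLES = (
    "history",
    "metadata_log_entries",
    "snapshots",
    "files",
    "all_files",
    "manifests",
    "all_manifests",
    "partitions",
)


def metadata_table(fqtn, name):
    """Build '<db>.<table>.<metadata_table>' and reject names outside the list above."""
    if name not in METADATA_TABLES:
        raise ValueError(f"unknown metadata table: {name}")
    return f"{fqtn}.{name}"


print(metadata_table("loan_db.loans_by_state_iceberg", "history"))
# prints: loan_db.loans_by_state_iceberg.history
```

With a live session, `spark.table(metadata_table(FQTN, "history")).show(truncate=False)` would be equivalent to the hardcoded calls used in the cells of this unit.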


#### a. history

In [8]:
# Shows a history of snapshot updates on the table
spark.table("loan_db.loans_by_state_iceberg.history").show(truncate=False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2024-05-13 16:40:13.336|9176687385630465169|null               |true               |
|2024-05-13 16:48:45.421|5233765456160638845|9176687385630465169|true               |
|2024-05-13 16:50:19.235|7272046516454809414|5233765456160638845|true               |
|2024-05-13 16:51:36.457|6697245229602575300|7272046516454809414|true               |
|2024-05-13 16:52:33.693|1795098683837972174|6697245229602575300|true               |
|2024-05-13 16:54:50.592|4282559450411521188|1795098683837972174|true               |
+-----------------------+-------------------+-------------------+-------------------+
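
The `snapshot_id` values above are what time travel and rollback operate on. The sketch below only builds the SQL strings, so it runs without a Spark session. It assumes Spark 3.3 or later for `VERSION AS OF`, that the Iceberg SQL extensions are enabled for `CALL`, and that the catalog is named `spark_catalog`. None of these are confirmed by this notebook, and both function names are hypothetical.

```python
def time_travel_sql(fqtn, snapshot_id):
    """SQL that reads the table as of a given snapshot (Spark 3.3+ syntax)."""
    return f"SELECT * FROM {fqtn} VERSION AS OF {int(snapshot_id)}"


def rollback_sql(catalog, fqtn, snapshot_id):
    """SQL that calls Iceberg's rollback_to_snapshot stored procedure."""
    return f"CALL {catalog}.system.rollback_to_snapshot('{fqtn}', {int(snapshot_id)})"


# Second snapshot from the history output above.
SNAPSHOT_ID = 5233765456160638845

print(time_travel_sql("loan_db.loans_by_state_iceberg", SNAPSHOT_ID))
# "spark_catalog" is an assumed catalog name; substitute the one configured for the session.
print(rollback_sql("spark_catalog", "loan_db.loans_by_state_iceberg", SNAPSHOT_ID))
```

Either string can be passed to `spark.sql(...)`. Note that a rollback changes the table's current snapshot, so it is worth running the time travel query first to confirm the snapshot holds the expected data.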



                                                                                

#### b. metadata_log_entries

In [10]:
# Tracks metadata log entries and the latest snapshot at the time of each metadata file update
spark.table("loan_db.loans_by_state_iceberg.metadata_log_entries").show(truncate=False)

+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------------+----------------------+
|timestamp              |file                                                                                                                                                                                   |latest_snapshot_id |latest_schema_id|latest_sequence_number|
+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------------+----------------------+
|2024-05-13 16:40:13.336|gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00000-ff4fdfd8-1029-48c2-ac68-d44d499b

#### c. snapshots

In [11]:
# Lists each snapshot with its commit time, the operation performed, its parent snapshot, the manifest list location, and summary statistics
# The entry with a null parent_id is the first snapshot created
spark.table("loan_db.loans_by_state_iceberg.snapshots").show()

+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2024-05-13 16:40:...|9176687385630465169|               null|   append|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:48:...|5233765456160638845|9176687385630465169|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:50:...|7272046516454809414|5233765456160638845|   append|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:51:...|6697245229602575300|7272046516454809414|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:52:...|1795098683837972174|6697245229602575300|overwrite|gs://gcs-bucket-d...|{spark.app.id -> ...|
|2024-05-13 16:54:...|4282559450411521188|1795098683837972174|overwrite|gs://gcs
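
The `parent_id` column links every snapshot to its predecessor, so the chain can be walked back to the first snapshot. The sketch below does this in plain Python using the `(snapshot_id, parent_id)` pairs shown in the output above. The `lineage` helper is a hypothetical name; in the notebook the pairs could instead come from `spark.table(...).collect()`.

```python
# (snapshot_id, parent_id) pairs copied from the output above.
SNAPSHOTS = [
    (9176687385630465169, None),
    (5233765456160638845, 9176687385630465169),
    (7272046516454809414, 5233765456160638845),
    (6697245229602575300, 7272046516454809414),
    (1795098683837972174, 6697245229602575300),
    (4282559450411521188, 1795098683837972174),
]


def lineage(rows, snapshot_id):
    """Return snapshot ids from `snapshot_id` back through its ancestors.

    The walk stops at the root (parent_id is None) or at a parent that is
    not present in `rows`.
    """
    parents = dict(rows)
    chain = []
    current = snapshot_id
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return chain


chain = lineage(SNAPSHOTS, 4282559450411521188)
print(len(chain), chain[-1])
# prints: 6 9176687385630465169
```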

#### d. files

In [12]:
# Shows the data files of the current snapshot only, with per-file metadata and statistics used for efficient querying

spark.table("loan_db.loans_by_state_iceberg.files").show(truncate=True)

+-------+--------------------+-----------+-------+------------+------------------+--------------------+------------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|        column_sizes|      value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
+-------+--------------------+-----------+-------+------------+------------------+--------------------+------------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|      0|gs://gcs-bucket-d...|    PARQUET|      0|          51|               991|{1 -> 169, 2 -> 219}|{1 -> 51, 2 -> 51}| {1 -> 0, 2 -> 0}|              {}|{1 -> AK, 2 -> \n.
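
The `lower_bounds` and `upper_bounds` columns are what make these statistics useful for querying: a file whose value range cannot contain the filter value does not need to be read. The sketch below illustrates that idea only and is not Iceberg's implementation. `"AK"` is the lower bound shown for column 1 above, while `"WY"` is a hypothetical upper bound, because the real one is cut off in the output.

```python
def may_contain(lower, upper, value):
    """True if `value` falls within [lower, upper], so the file cannot be skipped."""
    return lower <= value <= upper


LOWER = "AK"  # lower bound for column 1, from the output above
UPPER = "WY"  # hypothetical upper bound (truncated in the output above)

print(may_contain(LOWER, UPPER, "CA"))  # True: the file has to be read
print(may_contain(LOWER, UPPER, "AA"))  # False: the file can be skipped
```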

                                                                                

#### e. all_files

In [13]:
# Similar to "files" above, but lists files across all snapshots of the table, not just the current one

spark.table("loan_db.loans_by_state_iceberg.all_files").show(truncate=False)

                                                                                

+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+--------------------+------------------+-----------------+----------------+-------------------------+------------------------+------------+-------------+------------+-------------+------------------------------------------------------------+
|content|file_path                                                                                                                                                                                |file_format|spec_id|record_count|file_size_in_bytes|column_sizes        |value_counts      |null_value_counts|nan_value_counts|lower_bounds             |upper_bounds            |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics                                            |
+-------+-------------

#### f. manifests

In [14]:
# Shows details of manifest files for the current snapshot only. Reads data from the manifest Avro files in the metadata folder

spark.table("loan_db.loans_by_state_iceberg.manifests").show(truncate=False) 

+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path                                                                                                                                                                       |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+----------

#### g. all_manifests

In [15]:
# Similar to "manifests" above, but lists manifest files across all snapshots of the table, not just the current one

spark.table("loan_db.loans_by_state_iceberg.all_manifests").show(truncate=False)

+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|content|path                                                                                                                                                                       |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|reference_snapshot_id|
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------

                                                                                

#### h. partitions

In [16]:
# Below is the output for an unpartitioned table
spark.table("loan_db.loans_by_state_iceberg.partitions").show(truncate=False)

+------------+----------+-----------------------------+----------------------------+--------------------------+----------------------------+--------------------------+-----------------------+------------------------+
|record_count|file_count|total_data_file_size_in_bytes|position_delete_record_count|position_delete_file_count|equality_delete_record_count|equality_delete_file_count|last_updated_at        |last_updated_snapshot_id|
+------------+----------+-----------------------------+----------------------------+--------------------------+----------------------------+--------------------------+-----------------------+------------------------+
|51          |1         |991                          |0                           |0                         |0                           |0                         |2024-05-13 16:54:50.592|4282559450411521188     |
+------------+----------+-----------------------------+----------------------------+--------------------------+---------------------

In [17]:
# The statement below shows more descriptive output for a partitioned table: the record count and file count in each partition, along with the spec_id
# (In our case spec_id = 0 refers to the table's initial partition spec, which partitions by "addr_state")

spark.table("loan_db.loans_by_state_iceberg_partitioned.partitions").show(truncate=False)

+---------+-------+------------+----------+-----------------------------+----------------------------+--------------------------+----------------------------+--------------------------+-----------------------+------------------------+
|partition|spec_id|record_count|file_count|total_data_file_size_in_bytes|position_delete_record_count|position_delete_file_count|equality_delete_record_count|equality_delete_file_count|last_updated_at        |last_updated_snapshot_id|
+---------+-------+------------+----------+-----------------------------+----------------------------+--------------------------+----------------------------+--------------------------+-----------------------+------------------------+
|{TN}     |0      |1           |1         |692                          |0                           |0                         |0                           |0                         |2024-05-13 16:40:28.768|994591193726650744      |
|{LA}     |0      |1           |1         |692              

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK