# Apache Iceberg Lab 
## Unit 2: Create Iceberg table
In the previous unit -
1. We read parquet data in the datalake in the parquet-source directory
2. We cleansed it and persisted it as parquet to the datalake parquet-cleansed directory
3. We then optimized it and persisted it as parquet to the datalake parquet-consumable directory
4. We finally created a database called loan_db and defined an external table on the data in parquet-consumable

In this unit you will learn to -
1. Create an Iceberg table in the Hive catalog from the Parquet table created in the prior notebook, and explore the folder structure
2. Study the data and metadata

### 1. Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")

24/05/13 16:39:54 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DATA_LAKE_ROOT_PATH = f"gs://iceberg-data-bucket-{PROJECT_NUMBER}"
print("DATA_LAKE_ROOT_PATH:",DATA_LAKE_ROOT_PATH)

DATA_LAKE_ROOT_PATH: gs://iceberg-data-bucket-11002190840


In [7]:
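# Version hint keeps track of the latest metadata file of a Hadoop-catalog table.
# ICEBERG_HDP_WAREHOUSE_DIR is not set in this notebook, so the path below resolves to a local file and gsutil reports an error.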
version_hint = !gsutil cat $ICEBERG_HDP_WAREHOUSE_DIR/loan_db/loans_by_state_iceberg_hdp/metadata/version-hint.text

METADATA_VERSION = version_hint[0]
print("METADATA_VERSION =",METADATA_VERSION)

METADATA_VERSION = CommandException: "cat" command does not support "file://" URLs. Did you mean to use a gs:// URL?


In [8]:
# Fetch the Hive metastore warehouse directory
DPMS_NAME=f"dll-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]

print("HIVE_METASTORE_WAREHOUSE_DIR =",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR = gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse


### 5. Create Tables in **"Hive"** Catalog

#### a. Create Unpartitioned Iceberg table
Note: We will use these tables for the rest of the lab.

In [9]:
spark.sql("show databases").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used


+---------+
|namespace|
+---------+
|default  |
|loan_db  |
+---------+



In [10]:
spark.sql("show tables in loan_db").show(truncate=False)


+---------+----------------------+-----------+
|namespace|tableName             |isTemporary|
+---------+----------------------+-----------+
|loan_db  |loans_by_state_parquet|false      |
|loan_db  |loans_cleansed_parquet|false      |
+---------+----------------------+-----------+




In [11]:
spark.sql("drop table if exists loan_db.loans_by_state_iceberg_partitioned").show(truncate=False)

++
||
++
++



In [12]:
# Create iceberg table from the Parquet table
spark.sql("SELECT addr_state,loan_count FROM loan_db.loans_by_state_parquet") \
.writeTo("loan_db.loans_by_state_iceberg") \
.using("iceberg") \
.createOrReplace()

                                                                                

In [13]:
spark.sql("show tables from loan_db;").show(truncate=False)

+---------+----------------------+-----------+
|namespace|tableName             |isTemporary|
+---------+----------------------+-----------+
|loan_db  |loans_by_state_iceberg|false      |
|loan_db  |loans_by_state_parquet|false      |
|loan_db  |loans_cleansed_parquet|false      |
+---------+----------------------+-----------+



In [14]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_iceberg").show(truncate=False)

+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                             |comment|
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|addr_state                  |string                                                                                                                |       |
|loan_count                  |bigint                                                                                                                |       |
|                            |                                                                                                                      |       |
|# Partitioning              |                      

**NOTE:** Hive catalog tables are created in the Hive metastore warehouse directory by default.

In [15]:
spark.sql("select * from loan_db.loans_by_state_iceberg limit 2").show()

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|        AZ|     10318|
|        SC|      5460|
+----------+----------+



#### b. Create Partitioned Iceberg table

In [16]:
# Create Iceberg partitioned table from the Parquet table
spark.sql("SELECT addr_state,loan_count FROM loan_db.loans_by_state_parquet") \
.writeTo("loan_db.loans_by_state_iceberg_partitioned") \
.partitionedBy("addr_state") \
.using("iceberg") \
.createOrReplace()

                                                                                

In [17]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_iceberg_partitioned").show(truncate=False)

+----------------------------+---------------------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                                        |comment|
+----------------------------+---------------------------------------------------------------------------------------------------------------------------------+-------+
|addr_state                  |string                                                                                                                           |       |
|loan_count                  |bigint                                                                                                                           |       |
|                            |                                                                                                                             

### 6. A quick peek at the data layout in hive metastore

Note that, like the Hadoop catalog, the Hive catalog creates data and metadata folders for both partitioned and unpartitioned tables.
One noticeable difference is that the Hive catalog does not create a **version-hint.text** file, because it tracks the latest metadata version in the Hive metastore instead.
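
Since there is no version-hint.text file to read, one way to see which metadata file is current is to query Iceberg's metadata tables from Spark SQL. The sketch below only builds the query strings. The helper name `metadata_table_query` is hypothetical, and it assumes that the notebook's session catalog resolves `loan_db.<table>.<metadata_table>` identifiers and that the Iceberg runtime on the cluster is recent enough to expose `metadata_log_entries`.

```python
# Hypothetical helper (not part of the lab): builds Spark SQL that reads
# one of the Iceberg metadata tables of a given table.
ICEBERG_METADATA_TABLES = (
    "history",
    "snapshots",
    "manifests",
    "files",
    "metadata_log_entries",
)


def metadata_table_query(table: str, metadata_table: str) -> str:
    """Return a SELECT over one Iceberg metadata table of `table`."""
    if metadata_table not in ICEBERG_METADATA_TABLES:
        raise ValueError(f"unsupported metadata table: {metadata_table}")
    return f"SELECT * FROM {table}.{metadata_table}"


query = metadata_table_query("loan_db.loans_by_state_iceberg", "metadata_log_entries")
print(query)
```

In the notebook, something like `spark.sql(query).show(truncate=False)` should list the metadata files the table has gone through, if the assumptions above hold for this cluster.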

In [18]:
!echo $HIVE_METASTORE_WAREHOUSE_DIR

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse


#### **Note**: Two folders are created for the table
    1. data directory - Contains the Parquet files for the actual data
    2. metadata directory - Contains metadata associated with the Iceberg table
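
As a rough illustration of where these two folders live, the sketch below composes the default table location from the warehouse directory printed above. The helper name `iceberg_table_dirs` is hypothetical, and it assumes the default Hive layout (`<warehouse>/<database>.db/<table>`); a table created with a custom location would not follow it.

```python
def iceberg_table_dirs(warehouse_dir: str, database: str, table: str) -> dict:
    """Build the root, data/ and metadata/ URIs for a Hive-catalog table."""
    # By default Hive places each database under "<warehouse>/<database>.db".
    table_root = f"{warehouse_dir.rstrip('/')}/{database}.db/{table}"
    return {
        "root": table_root,
        "data": f"{table_root}/data",
        "metadata": f"{table_root}/metadata",
    }


# Warehouse directory as printed by the notebook.
WAREHOUSE = "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse"

dirs = iceberg_table_dirs(WAREHOUSE, "loan_db", "loans_by_state_iceberg")
for name, uri in dirs.items():
    print(f"{name}: {uri}")
```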

In [20]:
!gsutil ls -r $HIVE_METASTORE_WAREHOUSE_DIR/loan_db.db/loans_by_state_iceberg

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/:

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/:
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/00000-2-525c32e9-c437-4d4c-a7a0-c6fac91bba85-0-00001.parquet

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/:
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00000-ff4fdfd8-1029-48c2-ac68-d44d499bc9ce.metadata.json
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/9702db53-a27d-47ad-ab43-477bbf42516d-m0.avro
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/

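In the listing above -
1. Metadata file: has a .metadata.json extension
2. Manifest list: has a snap- prefix and an .avro extension
3. Manifest file: has an .avro extension and no snap- prefix (for example, a name ending in -m0.avro)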
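
The naming conventions listed above can be sketched as a small classifier. The function name `classify_metadata_file` is hypothetical, and the `snap-` file name in the sample is a made-up placeholder that only illustrates the prefix; the other two names come from the listing.

```python
import posixpath


def classify_metadata_file(path: str) -> str:
    """Classify a file in an Iceberg metadata/ folder by its name alone."""
    name = posixpath.basename(path)
    if name.endswith(".metadata.json"):
        return "metadata file"
    if name.endswith(".avro"):
        # Manifest lists and manifest files are both Avro; the prefix tells them apart.
        return "manifest list" if name.startswith("snap-") else "manifest file"
    return "other"


sample_files = [
    "00000-ff4fdfd8-1029-48c2-ac68-d44d499bc9ce.metadata.json",
    "9702db53-a27d-47ad-ab43-477bbf42516d-m0.avro",
    # Placeholder name, not taken from the listing:
    "snap-0000000000000000000-1-00000000-0000-0000-0000-000000000000.avro",
]

for file_name in sample_files:
    print(f"{classify_metadata_file(file_name):14} {file_name}")
```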

Let's get familiar with these names.

In [21]:
# Let's review the metadata file for the unpartitioned table
!gsutil cat "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00000-ff4fdfd8-1029-48c2-ac68-d44d499bc9ce.metadata.json"

{
  "format-version" : 2,
  "table-uuid" : "9228ae9b-8c1d-4c94-bc1e-9205b86ffd30",
  "location" : "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg",
  "last-sequence-number" : 1,
  "last-updated-ms" : 1715618413336,
  "last-column-id" : 2,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "addr_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "loan_count",
      "required" : false,
      "type" : "long"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "owner" : "spark",
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : 9176687385630465169,
  "re

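The `last-updated-ms` value in the metadata file is a timestamp in milliseconds since the Unix epoch. A minimal sketch for turning it into a readable UTC time follows; the helper name `epoch_ms_to_utc` is hypothetical, and the input value is copied from the output above.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(epoch_ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding of the millisecond part.
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=epoch_ms)


# "last-updated-ms" from the unpartitioned table's metadata file.
last_updated = epoch_ms_to_utc(1715618413336)
print(last_updated.isoformat())
```
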
In [22]:
!gsutil ls -r $HIVE_METASTORE_WAREHOUSE_DIR/loan_db.db/loans_by_state_iceberg_partitioned

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/:

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/:

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AK/:
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AK/00000-7-886ba725-2684-4fba-aa9a-75fa681013b6-0-00047.parquet

gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AL/:
gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AL/00000-7-886ba725-2684-4fba-aa9a-75fa681013b6-0-00031.parquet

gs://
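
The partitioned table lays its data files out in `addr_state=<value>` folders. As an illustration, the sketch below pulls the partition values out of one of the paths listed above. The helper name `partition_values` is hypothetical, and it ignores any escaping of special characters in values. Iceberg itself records partition values in its manifest metadata, so this is only a way to read the folder layout, not how the engine resolves partitions.

```python
def partition_values(data_file_uri: str) -> dict:
    """Extract key=value folder segments from a data file URI."""
    values = {}
    for segment in data_file_uri.split("/"):
        key, separator, value = segment.partition("=")
        if separator and key:
            values[key] = value
    return values


# Data file path copied from the listing above.
ak_file = (
    "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/"
    "loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AK/"
    "00000-7-886ba725-2684-4fba-aa9a-75fa681013b6-0-00047.parquet"
)

print(partition_values(ak_file))
```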

In [23]:
!gsutil cat "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/metadata/00000-83573c55-4d6c-4fd0-a96c-3eef09efb90f.metadata.json"

{
  "format-version" : 2,
  "table-uuid" : "df5acb98-87ce-4425-b01c-55fa718999c1",
  "location" : "gs://gcs-bucket-dll-hms-11002190840-e1664035-a2d9-4215-be46-45c40712/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned",
  "last-sequence-number" : 1,
  "last-updated-ms" : 1715618428768,
  "last-column-id" : 2,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "addr_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "loan_count",
      "required" : false,
      "type" : "long"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "addr_state",
      "transform" : "identity",
      "source-id" : 1,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : 

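The only structural difference between the two metadata files shown so far is the partition spec: its `fields` list is empty for the unpartitioned table and holds one `identity` transform for the partitioned one. Each partition field points at a schema column through `source-id`. The sketch below resolves that reference on an excerpt of the JSON above; the function name `describe_partitioning` is hypothetical, and the excerpt keeps only the keys needed for the lookup.

```python
import json

# Excerpt of the partitioned table's metadata file (keys copied from the output above).
METADATA_EXCERPT = """
{
  "current-schema-id": 0,
  "schemas": [ {
    "type": "struct",
    "schema-id": 0,
    "fields": [
      {"id": 1, "name": "addr_state", "required": false, "type": "string"},
      {"id": 2, "name": "loan_count", "required": false, "type": "long"}
    ]
  } ],
  "default-spec-id": 0,
  "partition-specs": [ {
    "spec-id": 0,
    "fields": [
      {"name": "addr_state", "transform": "identity", "source-id": 1, "field-id": 1000}
    ]
  } ]
}
"""


def describe_partitioning(metadata: dict) -> list:
    """Render the default partition spec as transform(column) strings."""
    schema = next(s for s in metadata["schemas"]
                  if s["schema-id"] == metadata["current-schema-id"])
    names_by_id = {field["id"]: field["name"] for field in schema["fields"]}
    spec = next(s for s in metadata["partition-specs"]
                if s["spec-id"] == metadata["default-spec-id"])
    # An empty list means the table is unpartitioned.
    return [f'{pf["transform"]}({names_by_id[pf["source-id"]]})' for pf in spec["fields"]]


partitioning = describe_partitioning(json.loads(METADATA_EXCERPT))
print(partitioning)
```
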
### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK