# Delta Lake Lab 
## Unit 12: Using OneTable to translate Delta Lake metadata and sink it to BigLake Metastore

This lab is powered by Dataproc Serverless Spark.


In this unit, we will:
1. Create a BigLake Iceberg Catalog
2. Create a BigLake Iceberg Database
3. Review data to use for the lab

The lab guide is here:
https://github.com/anagha-google/table-format-lab-delta/blob/main/OneTable-Lab.md

Once you complete the steps below, switch back to the lab guide at the link above.


### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import DeltaTable

from google.cloud.exceptions import BadRequest
from google.cloud import bigquery

import sqlparse
import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

23/12/04 03:15:55 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID --format "value(name)" 2>/dev/null
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID --format "value(projectNumber)" 2>/dev/null
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840
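
The `!` shell escapes in the cells above only work inside IPython/Jupyter. As a minimal sketch, the same three lookups can be done in plain Python with `subprocess`, assuming the `gcloud` CLI is on the `PATH` and already authenticated. The names `gcloud_value`, `lookup_project`, and the `runner` parameter are hypothetical helpers introduced here for illustration; they are not part of the lab.

```python
import subprocess


def gcloud_value(args, runner=subprocess.run):
    """Run a gcloud command and return its stdout with whitespace stripped.

    `runner` is injectable (hypothetical parameter) so the helper can be
    exercised without a real gcloud installation.
    """
    completed = runner(
        ["gcloud", *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


def lookup_project(runner=subprocess.run):
    """Return (project_id, project_name, project_number) using gcloud."""
    project_id = gcloud_value(
        ["config", "list", "--format", "value(core.project)"], runner
    )
    project_name = gcloud_value(
        ["projects", "describe", project_id, "--format", "value(name)"], runner
    )
    project_number = gcloud_value(
        ["projects", "describe", project_id, "--format", "value(projectNumber)"],
        runner,
    )
    return project_id, project_name, project_number
```

Calling `lookup_project()` outside a notebook should return the same three values that cells 3 to 5 print. Asking gcloud for a single field with `--format` avoids parsing the YAML output with `grep` and `cut`.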


In [6]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}"
UNPARTITIONED_DELTA_LAKE_DIR = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
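
Both paths depend on `PROJECT_NUMBER`, so an empty value from a failed gcloud call would silently produce a path such as `gs://dll-data-bucket-`. The sketch below guards against that. `build_lake_paths` and its dictionary keys are hypothetical names chosen for this example; the bucket naming pattern is the one used in the cell above.

```python
def build_lake_paths(project_number):
    """Derive the lab's Cloud Storage paths from the project number.

    Raises ValueError when the project number is empty or non-numeric,
    which usually means the gcloud lookup failed.
    """
    project_number = str(project_number).strip()
    if not project_number.isdigit():
        raise ValueError(
            f"Expected a numeric project number, got {project_number!r}"
        )
    root = f"gs://dll-data-bucket-{project_number}"
    return {
        "root": root,
        "unpartitioned_delta_dir": f"{root}/delta-consumable",
    }
```

In the notebook, `build_lake_paths(PROJECT_NUMBER)["unpartitioned_delta_dir"]` would yield the same value as `UNPARTITIONED_DELTA_LAKE_DIR`.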

### 4. Create BigLake Metastore (BLMS) entities


#### 4.1. Create a BLMS Iceberg catalog

In [12]:
spark.sql("DROP NAMESPACE IF EXISTS loans_iceberg_catalog").show(truncate=False)

++
||
++
++



In [13]:
spark.sql("CREATE NAMESPACE loans_iceberg_catalog").show(truncate=False)

++
||
++
++
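
DDL statements such as `DROP NAMESPACE` and `CREATE NAMESPACE` return an empty DataFrame, which is why `.show()` prints an empty grid. The drop-then-create pattern from the two cells above can be wrapped in a small helper so the notebook can be rerun safely. This is a sketch only: `recreate_namespace` and its `cascade` flag are hypothetical names, and `spark` is assumed to be the active session from step 2. Without `CASCADE`, Spark refuses to drop a namespace that still contains tables.

```python
import re

# Accept only simple, unquoted identifiers before interpolating into SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def recreate_namespace(spark, name, cascade=False):
    """Drop the namespace if it exists, then create it again.

    Returns the list of SQL statements that were executed.
    """
    if not _IDENTIFIER.match(name):
        raise ValueError(f"Not a simple SQL identifier: {name!r}")
    drop_sql = f"DROP NAMESPACE IF EXISTS {name}"
    if cascade:
        # Assumption: the caller accepts losing every table in the namespace.
        drop_sql += " CASCADE"
    create_sql = f"CREATE NAMESPACE {name}"
    for statement in (drop_sql, create_sql):
        spark.sql(statement)
    return [drop_sql, create_sql]
```

For example, `recreate_namespace(spark, "loans_iceberg_catalog")` issues the same two statements as the cells above.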



#### 4.2. Create a BLMS Iceberg database

In [24]:
spark.sql("CREATE DATABASE IF NOT EXISTS loans_iceberg_dataset").show(truncate=False)

++
||
++
++



#### 4.3. Check databases in the Spark catalog

In [20]:
spark.sql("show databases").show(truncate=False)
# Note the Iceberg entities created above: loans_iceberg_catalog and loans_iceberg_dataset

+---------------------+
|namespace            |
+---------------------+
|default              |
|loan_db              |
|loans_iceberg_catalog|
|loans_iceberg_dataset|
+---------------------+
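
Instead of reading the listing by eye, the check can be made programmatic. The sketch below assumes `spark` is the session from step 2 and that the first column of `SHOW DATABASES` holds the namespace name, as in the output above. `missing_namespaces` is a hypothetical helper name.

```python
def missing_namespaces(spark, expected):
    """Return the expected namespaces that SHOW DATABASES does not list."""
    rows = spark.sql("SHOW DATABASES").collect()
    # The namespace name is the first (and only) column of each row.
    present = {row[0] for row in rows}
    return sorted(set(expected) - present)
```

In the notebook, `missing_namespaces(spark, ["loans_iceberg_catalog", "loans_iceberg_dataset"])` should return an empty list once section 4 has run.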



### 5. Quick review of the Delta Lake table we will use for the lab

In [21]:
spark.sql("show tables in loan_db").show(truncate=False)

+---------+----------------------------------+-----------+
|namespace|tableName                         |isTemporary|
+---------+----------------------------------+-----------+
|loan_db  |loans_by_state_delta              |false      |
|loan_db  |loans_by_state_delta_clone_shallow|false      |
|loan_db  |loans_by_state_delta_partitioned  |false      |
|loan_db  |loans_by_state_delta_uniform      |false      |
|loan_db  |loans_by_state_parquet            |false      |
|loan_db  |loans_cleansed_parquet            |false      |
+---------+----------------------------------+-----------+



In [23]:
spark.sql("select * from loan_db.loans_by_state_delta limit 2").show(truncate=False)

23/12/04 04:04:27 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+----------+-----+
|addr_state|count|
+----------+-----+
|AK        |1    |
|AL        |1    |
+----------+-----+
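
Beyond previewing rows, it can help to confirm that the table really is in Delta format and to note where it is stored, on the assumption that the metadata translation step in the lab guide needs the table's storage path. The sketch below uses Delta Lake's `DESCRIBE DETAIL` command, whose result includes `format` and `location` columns. `delta_table_location` is a hypothetical helper name, and `spark` is assumed to be the session from step 2.

```python
def delta_table_location(spark, table_name):
    """Return the storage location of a Delta table.

    Raises ValueError if DESCRIBE DETAIL reports a non-Delta format.
    """
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    if detail["format"] != "delta":
        raise ValueError(
            f"{table_name} has format {detail['format']!r}, expected 'delta'"
        )
    return detail["location"]
```

In the notebook, `delta_table_location(spark, "loan_db.loans_by_state_delta")` would return the table's `gs://` path.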



### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK