# Delta Lake Lab 
## Unit 3: Delta Table Utilities

This lab is powered by Dataproc Serverless Spark.

In the previous unit, we:
1. Created a base table in Parquet

In this unit, we will:
1. Create a base Delta table from the Parquet base table loan_db.loans_by_state_parquet
2. Take a peek under the hood of the Delta table
3. Review the Delta transaction log
4. Look at Delta table details
5. Look at Delta table history
6. Create a manifest file for an unpartitioned Delta table (we will create a BigLake table with it in notebook 10)
7. Review entries in the Hive Metastore (Dataproc Metastore Service)

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

# Only DeltaTable is used in this notebook, so import it explicitly
from delta.tables import DeltaTable

from google.cloud.exceptions import BadRequest
from google.cloud import bigquery

import sqlparse
import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark



23/12/02 23:33:22 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DATA_LAKE_ROOT_PATH= f"gs://dll-data-bucket-{PROJECT_NUMBER}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

### 4. Peek under the hood of our Delta Lake table (loan_db.loans_by_state_delta)

In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-11002190840/delta-consumable/:
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet

gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/:
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/
gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json

gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/:
gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/
gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/manifest


In [8]:
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000000.json

{"commitInfo":{"timestamp":1701559198463,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numFiles":"1","numOutputRows":"51","numOutputBytes":"978"},"engineInfo":"Apache-Spark/3.4.0 Delta-Lake/2.4.0","txnId":"ef45b4d8-1b6b-4e35-8d69-658760b727e5"}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"347aa570-3d50-4d3a-8aba-02dcf5b8bcee","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"addr_state\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"count\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1701559185896}}
{"add":{"path":"part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet","partitionValues":{},"size":978,"modificationTime":1701559194480,"dataChange":true,"stats":"{\"numRecords\":51,\"minValues\
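Each line of a commit file in `_delta_log` is one JSON action (`commitInfo`, `protocol`, `metaData`, `add`, and so on). The following is a minimal sketch of reading those actions with plain Python, assuming the file content has already been fetched into a string (for example with `gsutil cat`). The helper name `summarize_commit` is hypothetical, and the sample reuses values from the output above while leaving out the `metaData` action and the truncated `stats` field.

```python
import json

# Sample commit content based on the output above. The metaData action and the
# truncated "stats" field of the add action are intentionally left out.
SAMPLE_COMMIT = "\n".join([
    '{"commitInfo":{"timestamp":1701559198463,"operation":"WRITE",'
    '"operationParameters":{"mode":"Overwrite","partitionBy":"[]"}}}',
    '{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}',
    '{"add":{"path":"part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet",'
    '"partitionValues":{},"size":978,"modificationTime":1701559194480,"dataChange":true}}',
])


def summarize_commit(commit_text):
    """Group the actions in one Delta commit file by action type (hypothetical helper)."""
    summary = {"operation": None, "protocol": None, "added_files": [], "removed_files": []}
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)  # one JSON object per line
        if "commitInfo" in action:
            summary["operation"] = action["commitInfo"].get("operation")
        elif "protocol" in action:
            summary["protocol"] = action["protocol"]
        elif "add" in action:
            summary["added_files"].append(action["add"]["path"])
        elif "remove" in action:
            summary["removed_files"].append(action["remove"]["path"])
    return summary


summary = summarize_commit(SAMPLE_COMMIT)
print(summary["operation"])
print(summary["added_files"])
```

The `add` paths are relative to the table root, which is why the single Parquet file in the listing above sits directly under `delta-consumable/`.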

### 5. Table Details
https://docs.delta.io/latest/delta-utility.html#id6

In [9]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
tableDetailsDF = deltaTable.detail()
tableDetailsDF.show(truncate=False)

23/12/02 23:33:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+------+------------------------------------+----+-----------+-------------------------------------------------+-----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name|description|location                                         |createdAt              |lastModified           |partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+----+-----------+-------------------------------------------------+-----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |347aa570-3d50-4d3a-8aba-02dcf5b8bcee|null|null       |gs://dll-data-bucket-11002190840/delta-consumable|2023-12-02 23:19:45.896|2023-12-02 23:19:58.873|[]              |1       |978    
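The `createdAt` value above lines up with the `createdTime` field of the `metaData` action in the transaction log, which is stored as milliseconds since the Unix epoch. The sketch below converts one into the other. It assumes the Spark session time zone is UTC, which the matching values suggest but the notebook does not state; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_utc(millis):
    """Convert milliseconds since the Unix epoch to a UTC datetime (integer-exact)."""
    return EPOCH + timedelta(milliseconds=millis)


def format_like_spark(moment):
    """Format a datetime with millisecond precision, as in the table output above."""
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


# createdTime taken from the metaData action in 00000000000000000000.json
created_at = format_like_spark(epoch_millis_to_utc(1701559185896))
print(created_at)  # 2023-12-02 23:19:45.896
```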

### 6. Table History

https://docs.delta.io/latest/delta-utility.html#id4

In [10]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)

#### Full history

In [11]:
fullHistoryDF = deltaTable.history()
fullHistoryDF.show(truncate=False)



+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                   |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|0      |2023-12-02 23:19:58.873|null  |null    |WRITE    |{mode -> Overwrite, partitionBy -> []}|null|null    |null     |null       |Serializable  |false        |{numFiles -> 1, nu

                                                                                

#### Last operation

In [12]:
lastOperationDF = deltaTable.history(1)
lastOperationDF.show(truncate=False)

+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                   |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|0      |2023-12-02 23:19:58.873|null  |null    |WRITE    |{mode -> Overwrite, partitionBy -> []}|null|null    |null     |null       |Serializable  |false        |{numFiles -> 1, nu
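Each `version` in the history corresponds to one commit file in `_delta_log`, named with the version number zero-padded to 20 digits (version 0 is `00000000000000000000.json` in the listing from section 4). A small sketch of that mapping, with hypothetical helper names:

```python
def commit_file_name(version):
    """Return the _delta_log file name for a table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def version_from_commit_file(path):
    """Extract the table version from a commit file path or name."""
    name = path.rsplit("/", 1)[-1]
    stem, _, extension = name.partition(".")
    # Reject anything that is not a plain <20 digits>.json commit file
    if extension != "json" or len(stem) != 20 or not stem.isdigit():
        raise ValueError(f"not a Delta commit file: {path}")
    return int(stem)


print(commit_file_name(0))
print(version_from_commit_file(
    "gs://dll-data-bucket-11002190840/delta-consumable/_delta_log/00000000000000000000.json"
))
```

With only one write so far, the table has a single version, so `history()` and `history(1)` return the same row.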

### 7. Table manifest file
https://docs.delta.io/latest/delta-utility.html#id8

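You can generate a manifest file for a Delta table that other processing engines (that is, engines other than Apache Spark) can use to read the table. Presto and Athena, for example, read Delta tables through a manifest in the symlink format, which the Python API produces with `deltaTable.generate("symlink_format_manifest")`. The manifest for this table already exists (it appears in the listing in section 4), so the cells below only locate and print it.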

In [13]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT | grep "_symlink_format_manifest/manifest"

gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/manifest


In [14]:
MANIFEST_LIST = !gsutil ls -r $DELTA_LAKE_DIR_ROOT | grep "_symlink_format_manifest/manifest"
MANIFEST_FILE = MANIFEST_LIST[0]
print(MANIFEST_FILE)

gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/manifest


In [15]:
!gsutil cat $MANIFEST_FILE

gs://dll-data-bucket-11002190840/delta-consumable/part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet


Using this manifest file, you can create an external table in BigQuery over the Delta table. The external table reflects the Delta table only as of the moment the manifest was generated; later writes are not visible until the manifest is regenerated.
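To make the point-in-time behavior concrete, here is a minimal sketch that checks whether a manifest still lists exactly the table's live data files. It assumes an unpartitioned table with a single manifest, and that the live file list (the relative paths from `add` actions that have not been removed) was obtained separately. The helper name and the second file name are hypothetical and used only for illustration.

```python
TABLE_ROOT = "gs://dll-data-bucket-11002190840/delta-consumable"

# Manifest content as printed above
MANIFEST_TEXT = (
    "gs://dll-data-bucket-11002190840/delta-consumable/"
    "part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet\n"
)

# Relative path from the single add action in the transaction log
LIVE_FILES = ["part-00000-83d6b120-d178-4ac4-8c8e-f8d590f5f050-c000.snappy.parquet"]


def manifest_is_current(manifest_text, table_root, live_files):
    """Return True if the manifest lists exactly the table's live data files."""
    root = table_root.rstrip("/") + "/"
    listed = {line.strip() for line in manifest_text.splitlines() if line.strip()}
    expected = {root + relative_path for relative_path in live_files}
    return listed == expected


print(manifest_is_current(MANIFEST_TEXT, TABLE_ROOT, LIVE_FILES))

# Simulate a later write that added a (hypothetical) second data file:
# the manifest no longer matches and would need to be regenerated.
after_write = LIVE_FILES + ["part-00001-hypothetical.snappy.parquet"]
print(manifest_is_current(MANIFEST_TEXT, TABLE_ROOT, after_write))
```

A mismatch means readers of the manifest are seeing an older snapshot of the table than Spark sees through the transaction log.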

### 8. Hive Metastore Entry

In [16]:
spark.sql("show tables in loan_db").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used


+---------+----------------------+-----------+
|namespace|tableName             |isTemporary|
+---------+----------------------+-----------+
|loan_db  |loans_by_state_delta  |false      |
|loan_db  |loans_by_state_parquet|false      |
|loan_db  |loans_cleansed_parquet|false      |
+---------+----------------------+-----------+



### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK