# Delta Lake Lab 
## Unit 3: Delta Table Utilities

This lab is powered by Dataproc Serverless Spark.

In the previous unit, we:
1. Created a base table in Parquet

In this unit, we will:
1. Create a base Delta table from the Parquet base table loan_db.loans_by_state_parquet
2. Take a peek under the hood of the Delta table
3. Review the Delta transaction log
4. Look at the Delta table details
5. Look at the Delta table history
6. Create a manifest file for an unpartitioned Delta table
7. Create a BigLake table in BigQuery for the unpartitioned table
8. Compare the BigLake output to the Delta table output
9. Create a manifest file for a partitioned Delta table
10. Create a BigLake table in BigQuery for the partitioned table
11. Compare the BigLake output to the Delta table output
12. Review entries in the Hive Metastore (Dataproc Metastore Service)

### 1. Imports

In [5]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import DeltaTable

from google.cloud.exceptions import BadRequest
from google.cloud import bigquery

import sqlparse
import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [6]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark



23/11/01 20:25:08 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [7]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-labv7


In [8]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-labv7


In [9]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  697607461278


In [10]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

### 4. Peek under the hood of our Delta Lake table (loan_db.loans_by_state_delta)

In [11]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-697607461278/delta-consumable/:
gs://dll-data-bucket-697607461278/delta-consumable/part-00000-35ab01b4-063e-4bd9-9490-0b80f3aea8dd-c000.snappy.parquet

gs://dll-data-bucket-697607461278/delta-consumable/_delta_log/:
gs://dll-data-bucket-697607461278/delta-consumable/_delta_log/
gs://dll-data-bucket-697607461278/delta-consumable/_delta_log/00000000000000000000.json


In [12]:
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000000.json

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"fce6afad-4f03-49cb-aa6f-0605d968786b","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"addr_state\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"count\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1698870109769}}
{"add":{"path":"part-00000-35ab01b4-063e-4bd9-9490-0b80f3aea8dd-c000.snappy.parquet","partitionValues":{},"size":978,"modificationTime":1698870116822,"dataChange":true,"stats":"{\"numRecords\":51,\"minValues\":{\"addr_state\":\"AK\",\"count\":1},\"maxValues\":{\"addr_state\":\"WY\",\"count\":1},\"nullCount\":{\"addr_state\":0,\"count\":0}}"}}
{"commitInfo":{"timestamp":1698870120444,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numFiles":"1","numOutp
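Each line of a commit file in `_delta_log` is a standalone JSON object holding one action (`protocol`, `metaData`, `add`, `commitInfo`), and the `stats` field of an `add` action is itself a JSON document stored as a string. The following is a minimal sketch of reading such lines with the standard library. The helper name `summarize_actions` is hypothetical, and the two sample lines are trimmed-down stand-ins for the output above, not the full commit.

```python
import json

# Trimmed-down stand-ins for two lines of 00000000000000000000.json.
# Only a few fields from the real commit are kept here.
sample_lines = [
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({
        "add": {
            "path": "part-00000-35ab01b4-063e-4bd9-9490-0b80f3aea8dd-c000.snappy.parquet",
            "size": 978,
            "dataChange": True,
            # In the real log, stats is a JSON document stored as a string.
            "stats": json.dumps({"numRecords": 51}),
        }
    }),
]


def summarize_actions(lines):
    """Return (action_type, path, num_records) per action (hypothetical helper)."""
    summary = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        for action_type, body in action.items():
            if action_type == "add":
                # Decode the nested stats string to reach numRecords.
                stats = json.loads(body.get("stats") or "{}")
                summary.append((action_type, body["path"], stats.get("numRecords")))
            else:
                summary.append((action_type, None, None))
    return summary


for row in summarize_actions(sample_lines):
    print(row)
```

To run this against the real commit file, the lines could be read from the output of `gsutil cat` instead of the hard-coded samples.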

### 5. Table Details
https://docs.delta.io/latest/delta-utility.html#id6

In [13]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
detailDF = deltaTable.detail()
detailPDF=detailDF.toPandas()
detailPDF

                                                                                

Unnamed: 0,format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
0,delta,fce6afad-4f03-49cb-aa6f-0605d968786b,,,gs://dll-data-bucket-697607461278/delta-consum...,2023-11-01 20:21:49.769,2023-11-01 20:22:00.822,[],1,978,{},1,2
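
The `createdAt` value above appears to come from the `createdTime` field of the `metaData` action in the transaction log, which is stored as milliseconds since the Unix epoch. As a small sketch (the helper name `millis_to_utc` is hypothetical, and the match with `createdAt` assumes the Spark session time zone is UTC), the conversion looks like this:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Adding a timedelta avoids the float rounding of fromtimestamp(ms / 1000).
    return EPOCH + timedelta(milliseconds=ms)


# createdTime from the metaData action in the commit file shown earlier.
created = millis_to_utc(1698870109769)
print(created.isoformat(timespec="milliseconds"))  # 2023-11-01T20:21:49.769+00:00
```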


### 6. Table History

https://docs.delta.io/latest/delta-utility.html#id4

In [14]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
fullHistoryPDF = deltaTable.history().toPandas()    # get the full history of the table
lastOperationPDF = deltaTable.history(1).toPandas() # get the last operation



                                                                                

#### Last operation

In [15]:
lastOperationPDF

Unnamed: 0,version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
0,0,2023-11-01 20:22:00.822,,,WRITE,"{'mode': 'Overwrite', 'partitionBy': '[]'}",,,,,Serializable,False,"{'numOutputRows': '51', 'numOutputBytes': '978...",,Apache-Spark/3.3.2 Delta-Lake/2.1.0


#### Full History

In [16]:
fullHistoryPDF

Unnamed: 0,version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
0,0,2023-11-01 20:22:00.822,,,WRITE,"{'mode': 'Overwrite', 'partitionBy': '[]'}",,,,,Serializable,False,"{'numOutputRows': '51', 'numOutputBytes': '978...",,Apache-Spark/3.3.2 Delta-Lake/2.1.0
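
Note that the values inside `operationMetrics` are strings (for example `'51'`), so they need converting before any arithmetic. A minimal sketch, assuming a plain dict like the one shown in the history row; the helper name `metrics_to_int` is hypothetical, and the sample holds only the two metrics whose values are fully visible in this notebook:

```python
def metrics_to_int(metrics):
    """Convert digit-only metric strings to ints; leave other values unchanged."""
    return {
        key: int(value) if isinstance(value, str) and value.isdigit() else value
        for key, value in metrics.items()
    }


# Two metrics taken from the commitInfo action and the history output.
sample_metrics = {"numFiles": "1", "numOutputRows": "51"}
print(metrics_to_int(sample_metrics))
```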


### 7. Table manifest file
https://docs.delta.io/latest/delta-utility.html#id8

You can generate a manifest file for a Delta table so that processing engines other than Apache Spark can read the table. For example, to generate a manifest file that Presto and Athena can use to read a Delta table, run the following:

In [17]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
deltaTable.generate("symlink_format_manifest")

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

In [18]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT | grep "_symlink_format_manifest/manifest"

gs://dll-data-bucket-697607461278/delta-consumable/_symlink_format_manifest/manifest


In [19]:
MANIFEST_LIST = !gsutil ls -r $DELTA_LAKE_DIR_ROOT | grep "_symlink_format_manifest/manifest"
MANIFEST_FILE = MANIFEST_LIST[0]
print(MANIFEST_FILE)

gs://dll-data-bucket-697607461278/delta-consumable/_symlink_format_manifest/manifest


In [20]:
!gsutil cat $MANIFEST_FILE

gs://dll-data-bucket-697607461278/delta-consumable/part-00000-35ab01b4-063e-4bd9-9490-0b80f3aea8dd-c000.snappy.parquet


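Using this manifest file, you can create an external table in BigQuery over the Delta table. The external table is a point-in-time view: it reflects the Delta table as of the moment the manifest was generated, and it will not see later writes unless the manifest is regenerated.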
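
Because the manifest is just a newline-delimited list of data file URIs, one way to reason about staleness is to compare it with the set of files the table currently considers active. The sketch below is illustrative only: `manifest_matches` is a hypothetical helper, the list of active paths is passed in by hand rather than read from the transaction log, and `part-00001-example.snappy.parquet` is a made-up file name standing in for a later write.

```python
def manifest_matches(manifest_text, table_root, active_paths):
    """True when the manifest lists exactly the table's active data files."""
    listed = {line.strip() for line in manifest_text.splitlines() if line.strip()}
    expected = {f"{table_root.rstrip('/')}/{path}" for path in active_paths}
    return listed == expected


table_root = "gs://dll-data-bucket-697607461278/delta-consumable"
data_file = "part-00000-35ab01b4-063e-4bd9-9490-0b80f3aea8dd-c000.snappy.parquet"

# Content of the manifest as printed by `gsutil cat` above.
manifest_text = f"{table_root}/{data_file}\n"

# The manifest matches the single active file of version 0.
print(manifest_matches(manifest_text, table_root, [data_file]))

# After a hypothetical later write adds a file, the old manifest is stale.
print(manifest_matches(
    manifest_text, table_root, [data_file, "part-00001-example.snappy.parquet"]
))
```

In this lab, rerunning `deltaTable.generate("symlink_format_manifest")` after a write brings the manifest back in line with the table.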

#### First let's look at the delta table via Spark

In [21]:
spark.sql("show tables from loan_db;").show(truncate=False)
df = deltaTable.toDF()
pdf = df.toPandas()
pdf = pdf.sort_values(by='addr_state', ascending=True)
print(pdf)


ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Runtime version 4.8 used for parser compilation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Runtime version 4.8 used for parser compilation does not match the current runtime version 4.9.3


+---------+----------------------+-----------+
|namespace|tableName             |isTemporary|
+---------+----------------------+-----------+
|loan_db  |loans_by_state_delta  |false      |
|loan_db  |loans_by_state_parquet|false      |
|loan_db  |loans_cleansed_parquet|false      |
+---------+----------------------+-----------+




   addr_state  count
46         AK      1
30         AL      1
47         AR      1
0          AZ      1
16         CA      1
45         CO      1
17         CT      1
5          DC      1
23         DE      1
44         FL      1
41         GA      1
50         HI      1
35         IA      1
15         ID      1
25         IL      1
31         IN      1
43         KS      1
9          KY      1
2          LA      1
42         MA      1
22         MD      1
26         ME      1
12         MI      1
3          MN      1
24         MO      1
29         MS      1
19         MT      1
20         NC      1
28         ND      1
18         NE      1
11         NH      1
4          NJ      1
34         NM      1
13         NV      1
38         NY      1
32         OH      1
48         OK      1
6          OR      1
36         PA      1
8          RI      1
1          SC      1
37         SD      1
33         TN      1
39         TX      1
49         UT      1
7          VA      1
21         VT

                                                                                

#### Now let's create a BigLake table using the manifest file created above

In [22]:
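# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def bq_query(sql, show_job_id=False):
    """
    Run a SQL statement in BigQuery using the global bq_client.

    sql: SQL query, as a string, to execute in BigQuery
    show_job_id: if True, print the job ID once the job has finished
    Returns the query results as a pandas DataFrame, or None (after printing
    the error) if the dry run fails
    """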

    # Try dry run before executing query to catch any errors
    try:
        job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

        bq_client.query(sql, job_config=job_config)

    except BadRequest as err:
        print(err)
        return

    job_config = bigquery.QueryJobConfig()
    client_result = bq_client.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for query/job to finish running, then get & return data frame
    df = client_result.result().to_dataframe()

    if show_job_id:
        print(f"Finished job_id: {job_id}")

    return df

#### First, let's build and print out the BigQuery command to create an external table with the delta manifest file

In [23]:

bq_client = bigquery.Client(project=PROJECT_ID)

DATASET = 'delta_dataset'
CONNECTION = 'us.biglake-connection'
TABLE_NAME = 'bq_delta_table'
URI = 'gs://dll-data-bucket-' + PROJECT_NUMBER + '/delta-consumable/_symlink_format_manifest/manifest'
BQSQL = 'CREATE EXTERNAL TABLE ' + DATASET + '.' + TABLE_NAME + ' ' + \
        'WITH CONNECTION `' +  CONNECTION + '` ' + \
        'OPTIONS (' + \
        'format="PARQUET", ' + \
        'uris=["' + URI + '"],' + \
        'file_set_spec_type = \'NEW_LINE_DELIMITED_MANIFEST\',' + \
        'max_staleness = INTERVAL 1 DAY,' + \
        'metadata_cache_mode = \'AUTOMATIC\'' + \
        ');'


print(sqlparse.format(BQSQL, reindent_aligned = True, keyword_case='upper'))




CREATE EXTERNAL TABLE delta_dataset.bq_delta_table WITH
CONNECTION `us.biglake-connection`
OPTIONS (format="PARQUET", uris=["gs://dll-data-bucket-697607461278/delta-consumable/_symlink_format_manifest/manifest"],file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',max_staleness = INTERVAL 1 DAY,metadata_cache_mode = 'AUTOMATIC');
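
The string concatenation above is easy to get wrong when quoting. As an alternative sketch, the same statement can be assembled with f-strings inside a small function. The helper name `build_biglake_ddl` is hypothetical and not part of the lab; the options are copied from the statement printed above.

```python
def build_biglake_ddl(dataset, table, connection, manifest_uri):
    """Build the CREATE EXTERNAL TABLE statement for an unpartitioned manifest."""
    return (
        f"CREATE EXTERNAL TABLE {dataset}.{table} "
        f"WITH CONNECTION `{connection}` "
        "OPTIONS ("
        'format="PARQUET", '
        f'uris=["{manifest_uri}"], '
        "file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST', "
        "max_staleness = INTERVAL 1 DAY, "
        "metadata_cache_mode = 'AUTOMATIC'"
        ");"
    )


ddl = build_biglake_ddl(
    "delta_dataset",
    "bq_delta_table",
    "us.biglake-connection",
    "gs://dll-data-bucket-697607461278/delta-consumable/_symlink_format_manifest/manifest",
)
print(ddl)
```

Keeping the statement in one function also makes it simple to reuse for other tables by changing only the arguments.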


In [24]:
#### Now, you can execute the BigQuery SQL above in the BigQuery Console as shown below or you can execute the next cells to create the table for you

In [25]:

from IPython.display import Image
from IPython.core.display import HTML
Image(url = "./images/Lab3-Image1.png")


In [26]:
df = bq_query(BQSQL)
print(df)

Empty DataFrame
Columns: []
Index: []


#### Now query the BigLake table that you created above

In [27]:
BQDELTASQL = 'select addr_state,count from '  + DATASET + '.' + TABLE_NAME + ' order by addr_state ASC;'
df = bq_query(BQDELTASQL)
print(df)

   addr_state  count
0          AK      1
1          AL      1
2          AR      1
3          AZ      1
4          CA      1
5          CO      1
6          CT      1
7          DC      1
8          DE      1
9          FL      1
10         GA      1
11         HI      1
12         IA      1
13         ID      1
14         IL      1
15         IN      1
16         KS      1
17         KY      1
18         LA      1
19         MA      1
20         MD      1
21         ME      1
22         MI      1
23         MN      1
24         MO      1
25         MS      1
26         MT      1
27         NC      1
28         ND      1
29         NE      1
30         NH      1
31         NJ      1
32         NM      1
33         NV      1
34         NY      1
35         OH      1
36         OK      1
37         OR      1
38         PA      1
39         RI      1
40         SC      1
41         SD      1
42         TN      1
43         TX      1
44         UT      1
45         VA      1
46         VT
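
To compare the BigLake output with the Delta table output programmatically rather than by eye, the two result sets can be checked for equality while ignoring row order. This is a minimal sketch using plain Python records, such as those produced by `DataFrame.to_dict("records")`; the helper name `same_rows` is hypothetical, and the sample lists hold only three of the rows shown above.

```python
from collections import Counter


def same_rows(rows_a, rows_b):
    """True when both row lists hold the same records, ignoring row order."""
    def as_multiset(rows):
        # Each row becomes a hashable, key-sorted tuple of (column, value) pairs.
        return Counter(tuple(sorted(row.items())) for row in rows)
    return as_multiset(rows_a) == as_multiset(rows_b)


# Three-row stand-ins for the Spark result and the BigQuery result.
spark_rows = [
    {"addr_state": "AZ", "count": 1},
    {"addr_state": "AK", "count": 1},
    {"addr_state": "AL", "count": 1},
]
bq_rows = [
    {"addr_state": "AK", "count": 1},
    {"addr_state": "AL", "count": 1},
    {"addr_state": "AZ", "count": 1},
]

print(same_rows(spark_rows, bq_rows))
```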

#### Create manifest file for partitioned delta table

In [28]:
DELTA_PARTLAKE_DIR_ROOT = f"gs://dll-data-bucket-{PROJECT_NUMBER}/delta-sample-partitioned"
deltaPartTable = DeltaTable.forPath(spark, DELTA_PARTLAKE_DIR_ROOT)
deltaPartTable.generate("symlink_format_manifest")

                                                                                

#### First let's look at the delta partitioned table via Spark

In [29]:
df = deltaPartTable.toDF()
pdf = df.toPandas()
pdf = pdf.sort_values(by='addr_state', ascending=True)
print(pdf)



   addr_state  count
6          AK      1
17         AL      1
3          AR      1
9          AZ      1
4          CA      1
40         CO      1
48         CT      1
22         DC      1
27         DE      1
50         FL      1
45         GA      1
10         HI      1
30         IA      1
32         ID      1
28         IL      1
39         IN      1
25         KS      1
5          KY      1
41         LA      1
34         MA      1
0          MD      1
16         ME      1
31         MI      1
36         MN      1
21         MO      1
11         MS      1
33         MT      1
20         NC      1
1          ND      1
46         NE      1
8          NH      1
26         NJ      1
35         NM      1
38         NV      1
23         NY      1
15         OH      1
12         OK      1
7          OR      1
44         PA      1
37         RI      1
24         SC      1
13         SD      1
14         TN      1
42         TX      1
47         UT      1
49         VA      1
43         VT

                                                                                

#### Now let's create a BigLake partitioned table using the manifest file created above

In [30]:
#### First, let's build and print out the BigQuery command to create an external partitioned table with the delta manifest file

In [31]:
DATASET = 'delta_dataset'
CONNECTION = 'us.biglake-connection'
TABLE_NAME = 'bq_delta_partitioned_table'
URI = 'gs://dll-data-bucket-' + PROJECT_NUMBER + '/delta-sample-partitioned/_symlink_format_manifest/*/manifest'
BQSQL = 'CREATE EXTERNAL TABLE ' + DATASET + '.' + TABLE_NAME + ' ' + \
        'WITH PARTITION COLUMNS(addr_state STRING)' + ' ' + \
        'WITH CONNECTION `' +  CONNECTION + '` ' + \
        'OPTIONS (' + \
        'format="PARQUET", ' + \
        'hive_partition_uri_prefix="' + DELTA_PARTLAKE_DIR_ROOT + '", ' + \
        'uris=["' + URI + '"],' + \
        'file_set_spec_type = \'NEW_LINE_DELIMITED_MANIFEST\',' + \
        'max_staleness = INTERVAL 1 DAY,' + \
        'metadata_cache_mode = \'AUTOMATIC\'' + \
        ');'



print(sqlparse.format(BQSQL, reindent_aligned = True, keyword_case='upper'))



CREATE EXTERNAL TABLE delta_dataset.bq_delta_partitioned_table WITH
PARTITION COLUMNS(addr_state STRING) WITH
CONNECTION `us.biglake-connection`
OPTIONS (format="PARQUET", hive_partition_uri_prefix="gs://dll-data-bucket-697607461278/delta-sample-partitioned", uris=["gs://dll-data-bucket-697607461278/delta-sample-partitioned/_symlink_format_manifest/*/manifest"],file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',max_staleness = INTERVAL 1 DAY,metadata_cache_mode = 'AUTOMATIC');
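
The `*` in the manifest URI stands in for the partition directory. For a partitioned Delta table, the manifest files are written in the same Hive-style directory layout as the table itself, one `manifest` file per partition. The sketch below lists the paths that layout implies; it assumes the table is partitioned by `addr_state`, as the `WITH PARTITION COLUMNS` clause suggests, and `partition_manifest_paths` is a hypothetical helper.

```python
def partition_manifest_paths(table_root, column, values):
    """Return the expected per-partition manifest paths for a partitioned table."""
    root = table_root.rstrip("/")
    return [
        f"{root}/_symlink_format_manifest/{column}={value}/manifest"
        for value in values
    ]


paths = partition_manifest_paths(
    "gs://dll-data-bucket-697607461278/delta-sample-partitioned",
    "addr_state",
    ["AK", "AL"],  # two of the partition values shown in the Spark output
)
for path in paths:
    print(path)
```

Listing the bucket with `gsutil ls -r` is the way to confirm the actual layout in your project.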


In [32]:
#### Now, you can execute the BigQuery SQL above in the BigQuery Console as shown below or you can execute the next cells to create the partitioned table for you

In [33]:

from IPython.display import Image
from IPython.core.display import HTML
Image(url = "./images/Lab3-Image2.png")

In [None]:
df = bq_query(BQSQL)
print(df)


#### Now query the BigLake partitioned table that you created above

In [None]:
BQDELTASQL = 'select addr_state,count from '  + DATASET + '.' + TABLE_NAME + ' order by addr_state ASC;'
df = bq_query(BQDELTASQL)
print(df)

### 8. Hive Metastore Entry

In [17]:
spark.sql("show tables in loan_db").show(truncate=False)

+---------+----------------------+-----------+
|namespace|tableName             |isTemporary|
+---------+----------------------+-----------+
|loan_db  |loans_by_state_delta  |false      |
|loan_db  |loans_by_state_parquet|false      |
|loan_db  |loans_cleansed_parquet|false      |
+---------+----------------------+-----------+

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK