# Delta Lake Lab 
## Unit 10: Querying Delta Lake in BigQuery UI with BigLake manifest support for Delta Lake

This lab is powered by Dataproc Serverless Spark.


In this unit, we will:
1. Generate a manifest file for each of the two Delta tables created in notebook 2: the unpartitioned table and the partitioned table
2. Create BigLake tables over those manifests using the Python SDK for BigQuery
3. Query the BigLake tables using the Python SDK for BigQuery

You can also explore the resulting tables in the BigQuery UI.

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import DeltaTable

from google.cloud.exceptions import BadRequest
from google.cloud import bigquery

import sqlparse
import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

23/12/03 01:13:46 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DATA_LAKE_ROOT_PATH= f"gs://dll-data-bucket-{PROJECT_NUMBER}"
UNPARTITIONED_DELTA_LAKE_DIR = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
PARTITIONED_DELTA_LAKE_DIR = f"{DATA_LAKE_ROOT_PATH}/delta-sample-partitioned"

### 4. Create a manifest file on the Delta Lake data from notebook 2
https://docs.delta.io/latest/delta-utility.html#id8

You can generate a manifest file for a Delta table so that processing engines other than Apache Spark can read the table. The cells below generate a `symlink_format_manifest`, the same manifest format that Presto and Athena use to read a Delta table.

#### 4.1. Create a manifest file on the unpartitioned loans dataset

In [7]:
unpartitionedDeltaTable = DeltaTable.forPath(spark, UNPARTITIONED_DELTA_LAKE_DIR)
unpartitionedDeltaTable.generate("symlink_format_manifest")

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
23/12/03 01:14:01 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

In [8]:
!gsutil ls -r $UNPARTITIONED_DELTA_LAKE_DIR | grep "_symlink_format_manifest/manifest"

gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/manifest


In [9]:
MANIFEST_LIST = !gsutil ls -r $UNPARTITIONED_DELTA_LAKE_DIR | grep "_symlink_format_manifest/manifest"
MANIFEST_FILE = MANIFEST_LIST[0]
print(MANIFEST_FILE)

gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/manifest


In [10]:
!gsutil cat $MANIFEST_FILE

gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
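
As the output above shows, a symlink-format manifest is plain text with one data file URI per line. The sketch below parses such text into a list of URIs. The function name `parse_symlink_manifest` and the sample bucket and file names are hypothetical, chosen only for illustration.

```python
from typing import List


def parse_symlink_manifest(manifest_text: str) -> List[str]:
    """Return the data file URIs listed in a symlink-format manifest.

    Assumes one URI per line; blank lines are ignored.
    """
    return [line.strip() for line in manifest_text.splitlines() if line.strip()]


# Hypothetical manifest content (not the lab's real file names)
sample_manifest = (
    "gs://example-bucket/delta-consumable/part-00000-aaaa.snappy.parquet\n"
    "gs://example-bucket/delta-consumable/part-00001-bbbb.snappy.parquet\n"
)

data_files = parse_symlink_manifest(sample_manifest)
print(len(data_files), "data file(s) referenced by the manifest")
```

In the notebook, the text could come from `!gsutil cat $MANIFEST_FILE`, joined with newlines, to check which Parquet files the current manifest exposes.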


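Using this manifest file, you can create an external table in BigQuery: a BigLake table over the Delta Lake table. The BigLake table reflects the Delta table only as of the moment the manifest was generated; later writes to the Delta table are not visible until the manifest is regenerated.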
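
One way to reduce manifest drift is the Delta Lake table property `delta.compatibility.symlinkFormatManifest.enabled`, which the Delta Lake documentation describes as regenerating the manifest on every write. The sketch below only builds the `ALTER TABLE` statement. The helper name `build_auto_manifest_sql` and the example path are hypothetical, and this lab does not verify the property's behavior on Dataproc Serverless.

```python
def build_auto_manifest_sql(delta_path: str) -> str:
    """Build an ALTER TABLE statement that turns on automatic manifest updates.

    Assumption: the table is addressed by path using the delta.`<path>` syntax.
    """
    return (
        f"ALTER TABLE delta.`{delta_path}` "
        "SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)"
    )


# Hypothetical path; in the notebook this would be UNPARTITIONED_DELTA_LAKE_DIR
auto_manifest_sql = build_auto_manifest_sql("gs://example-bucket/delta-consumable")
print(auto_manifest_sql)
```

In the notebook, the statement could be run with `spark.sql(build_auto_manifest_sql(UNPARTITIONED_DELTA_LAKE_DIR))`. Without it, calling `generate("symlink_format_manifest")` again after each write has the same effect.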

#### 4.2. Create a manifest file on the partitioned loans dataset

In [11]:
partitionedDeltaTable = DeltaTable.forPath(spark, PARTITIONED_DELTA_LAKE_DIR)
partitionedDeltaTable.generate("symlink_format_manifest")

                                                                                

In [12]:
!gsutil ls -r $PARTITIONED_DELTA_LAKE_DIR 

gs://dll-data-bucket-11002190840/delta-sample-partitioned/:

gs://dll-data-bucket-11002190840/delta-sample-partitioned/_delta_log/:
gs://dll-data-bucket-11002190840/delta-sample-partitioned/_delta_log/
gs://dll-data-bucket-11002190840/delta-sample-partitioned/_delta_log/00000000000000000000.json

gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/:

gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/addr_state=AK/:
gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/addr_state=AK/
gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/addr_state=AK/manifest

gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/addr_state=AL/:
gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/addr_state=AL/
gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/addr_state=AL/manifest

gs://dll
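
For a partitioned table, the listing above shows one manifest per partition directory, with the partition value encoded in the path as `addr_state=<value>`. The sketch below extracts those key/value pairs from a manifest URI. The function name `partition_values` and the example bucket are hypothetical.

```python
import re
from typing import Dict

# Captures everything between the manifest root directory and the trailing "manifest" file
PARTITION_RE = re.compile(r"/_symlink_format_manifest/(?P<parts>.+)/manifest$")


def partition_values(manifest_uri: str) -> Dict[str, str]:
    """Return the hive-style partition keys and values encoded in a manifest URI.

    Returns an empty dict for an unpartitioned table's manifest.
    """
    match = PARTITION_RE.search(manifest_uri)
    if match is None:
        return {}
    segments = match.group("parts").split("/")
    # Keep only key=value segments and split each on the first "="
    return dict(segment.split("=", 1) for segment in segments if "=" in segment)


example_uri = (
    "gs://example-bucket/delta-sample-partitioned/"
    "_symlink_format_manifest/addr_state=AK/manifest"
)
print(partition_values(example_uri))
```

This is the same path layout that the `WITH PARTITION COLUMNS` clause and the `*/manifest` wildcard rely on when the partitioned BigLake table is defined later in this unit.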

### 5. Function to send DDL/DML commands to BigQuery using its SDK

In [13]:
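# Wrapper that runs a SQL statement with the BigQuery client and returns the result as a DataFrame.
# Relies on the global bq_client, which is instantiated in the next cell.
def fn_execute_bq_statement(sql, show_job_id=False):
    """
    Input: SQL statement, as a string, to execute in BigQuery
    show_job_id: if True, print the job ID once the job finishes
    Returns the results as a pandas DataFrame, or None if the dry run
    raises a BadRequest error (the error is printed)
    """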

    # Try dry run before executing query to catch any errors
    try:
        job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

        bq_client.query(sql, job_config=job_config)

    except BadRequest as err:
        print(err)
        return

    job_config = bigquery.QueryJobConfig()
    client_result = bq_client.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for query/job to finish running, then get & return data frame
    df = client_result.result().to_dataframe()

    if show_job_id:
        print(f"Finished job_id: {job_id}")

    return df

In [14]:
# Instantiate the BigQuery client
bq_client = bigquery.Client(project=PROJECT_ID)

### 6. Create and query a BigLake table on the unpartitioned Delta Lake table manifest

#### 6.1. Review the dataset in Spark

In [15]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta LIMIT 2").show(truncate=False)

                                                                                

+----------+-----+
|addr_state|count|
+----------+-----+
|AK        |1    |
|AL        |1    |
+----------+-----+



#### 6.2. Create BigLake table DDL 

In [16]:
DATASET = 'delta_dataset'
CONNECTION = 'us.biglake-connection'
TABLE_NAME = 'loans_deltalake_unpartitioned'
URI = 'gs://dll-data-bucket-' + PROJECT_NUMBER + '/delta-consumable/_symlink_format_manifest/manifest'
BIGLAKE_DDL = 'CREATE OR REPLACE EXTERNAL TABLE ' + DATASET + '.' + TABLE_NAME + ' ' + \
        'WITH CONNECTION `' +  CONNECTION + '` ' + \
        'OPTIONS (' + \
        'format="PARQUET", ' + \
        'uris=["' + URI + '"],' + \
        'file_set_spec_type = \'NEW_LINE_DELIMITED_MANIFEST\',' + \
        'max_staleness = INTERVAL 1 DAY,' + \
        'metadata_cache_mode = \'AUTOMATIC\'' + \
        ');'


print(sqlparse.format(BIGLAKE_DDL, reindent_aligned = True, keyword_case='upper'))

CREATE OR REPLACE EXTERNAL TABLE delta_dataset.loans_deltalake_unpartitioned WITH
CONNECTION `us.biglake-connection`
OPTIONS (format="PARQUET", uris=["gs://dll-data-bucket-11002190840/delta-consumable/_symlink_format_manifest/manifest"],file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',max_staleness = INTERVAL 1 DAY,metadata_cache_mode = 'AUTOMATIC');
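
The string concatenation above can be hard to read and easy to break. As an alternative sketch, the same DDL can be assembled from a list of options. The helper `build_biglake_ddl` and its parameter names are hypothetical; the option names and values are the ones used in this notebook.

```python
from typing import Optional


def build_biglake_ddl(
    dataset: str,
    table: str,
    connection: str,
    manifest_uri: str,
    partition_columns: Optional[str] = None,
    partition_uri_prefix: Optional[str] = None,
) -> str:
    """Build a CREATE EXTERNAL TABLE statement for a manifest-based BigLake table."""
    clauses = [f"CREATE OR REPLACE EXTERNAL TABLE {dataset}.{table}"]
    options = ["format = 'PARQUET'"]

    # Partitioned tables need the partition columns and the common URI prefix
    if partition_columns:
        clauses.append(f"WITH PARTITION COLUMNS ({partition_columns})")
        options.append(f"hive_partition_uri_prefix = '{partition_uri_prefix}'")

    clauses.append(f"WITH CONNECTION `{connection}`")
    options += [
        f"uris = ['{manifest_uri}']",
        "file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'",
        "max_staleness = INTERVAL 1 DAY",
        "metadata_cache_mode = 'AUTOMATIC'",
    ]
    clauses.append("OPTIONS (\n  " + ",\n  ".join(options) + "\n);")
    return "\n".join(clauses)


# Hypothetical bucket name; the notebook derives the real one from PROJECT_NUMBER
ddl = build_biglake_ddl(
    dataset="delta_dataset",
    table="loans_deltalake_unpartitioned",
    connection="us.biglake-connection",
    manifest_uri="gs://example-bucket/delta-consumable/_symlink_format_manifest/manifest",
)
print(ddl)
```

The resulting string could be passed to `fn_execute_bq_statement` in the same way as `BIGLAKE_DDL`.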


#### 6.3. Create the BigLake unpartitioned manifest based Delta Lake table

In [17]:
# You can execute the SQL in the BQ UI; The following shows how to create the table using the BQ Python SDK; Returns a Pandas dataframe
PDF = fn_execute_bq_statement(BIGLAKE_DDL)
print(PDF)

Empty DataFrame
Columns: []
Index: []


#### 6.4. Run a query against the BigLake unpartitioned Delta Lake table using the BigQuery Python SDK

In [18]:
BIGLAKE_SELECT_SQL = 'select addr_state,count from '  + DATASET + '.' + TABLE_NAME + ' order by addr_state ASC;'
PDF = fn_execute_bq_statement(BIGLAKE_SELECT_SQL)
print(PDF)

   addr_state  count
0          AK      1
1          AL      1
2          AR      1
3          AZ      1
4          CA  12345
5          CO      1
6          CT      1
7          DC      1
8          DE      1
9          FL      1
10         GA      1
11         HI      1
12         IA    555
13         ID      1
14         IL      1
15         IN   6666
16         KS      1
17         KY      1
18         LA      1
19         MA      1
20         MD      1
21         ME      1
22         MI      1
23         MN      1
24         MO      1
25         MS      1
26         MT      1
27         NC      1
28         ND      1
29         NE      1
30         NH      1
31         NJ      1
32         NM      1
33         NV      1
34         NY      1
35         OH      1
36         OK      1
37         OR      1
38         PA      1
39         RI      1
40         SC      1
41         SD      1
42         TN      1
43         TX      1
44         UT      1
45         VA      1
46         VT

### 7. Create and query a BigLake table on the partitioned Delta Lake table manifest

#### 7.1. Create an external table in Spark on the data and explore it

In [19]:
# Define external delta table definition
spark.sql("DROP TABLE IF EXISTS loan_db.loans_by_state_delta_partitioned;").show(truncate=False)
spark.sql(f"CREATE TABLE loan_db.loans_by_state_delta_partitioned USING delta LOCATION \"{PARTITIONED_DELTA_LAKE_DIR}\"").show(truncate=False)
spark.sql("SHOW TABLES IN loan_db").show(truncate=False)

++
||
++
++



23/12/03 01:15:26 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_catalog`.`loan_db`.`loans_by_state_delta_partitioned` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
23/12/03 01:15:27 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


++
||
++
++

+---------+----------------------------------+-----------+
|namespace|tableName                         |isTemporary|
+---------+----------------------------------+-----------+
|loan_db  |loans_by_state_delta              |false      |
|loan_db  |loans_by_state_delta_clone_shallow|false      |
|loan_db  |loans_by_state_delta_partitioned  |false      |
|loan_db  |loans_by_state_parquet            |false      |
|loan_db  |loans_cleansed_parquet            |false      |
+---------+----------------------------------+-----------+



#### 7.2. Explore the table data in Spark

In [20]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta_partitioned LIMIT 5").show(truncate=False)

                                                                                

+----------+-----+
|addr_state|count|
+----------+-----+
|ME        |1    |
|KY        |1    |
|AK        |1    |
|NM        |1    |
|IN        |1    |
+----------+-----+



#### 7.3. Create a BigLake table definition

In [24]:
DATASET = 'delta_dataset'
CONNECTION = 'us.biglake-connection'
TABLE_NAME = 'loans_deltalake_partitioned'
URI = 'gs://dll-data-bucket-' + PROJECT_NUMBER + '/delta-sample-partitioned/_symlink_format_manifest/*/manifest'
BIGLAKE_PARTITIONED_TABLE_DDL = 'CREATE OR REPLACE EXTERNAL TABLE ' + DATASET + '.' + TABLE_NAME + ' ' + \
        'WITH PARTITION COLUMNS(addr_state STRING)' + ' ' + \
        'WITH CONNECTION `' +  CONNECTION + '` ' + \
        'OPTIONS (' + \
        'format="PARQUET", ' + \
        'hive_partition_uri_prefix="' + PARTITIONED_DELTA_LAKE_DIR + '", ' + \
        'uris=["' + URI + '"],' + \
        'file_set_spec_type = \'NEW_LINE_DELIMITED_MANIFEST\',' + \
        'max_staleness = INTERVAL 1 DAY,' + \
        'metadata_cache_mode = \'AUTOMATIC\'' + \
        ');'



print(sqlparse.format(BIGLAKE_PARTITIONED_TABLE_DDL, reindent_aligned = True, keyword_case='upper'))

CREATE OR REPLACE EXTERNAL TABLE delta_dataset.loans_deltalake_partitioned WITH
PARTITION COLUMNS(addr_state STRING) WITH
CONNECTION `us.biglake-connection`
OPTIONS (format="PARQUET", hive_partition_uri_prefix="gs://dll-data-bucket-11002190840/delta-sample-partitioned", uris=["gs://dll-data-bucket-11002190840/delta-sample-partitioned/_symlink_format_manifest/*/manifest"],file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',max_staleness = INTERVAL 1 DAY,metadata_cache_mode = 'AUTOMATIC');


#### 7.4. Create the BigLake partitioned manifest based Delta Lake table

In [25]:
# You can execute the SQL in the BQ UI; The following shows how to create the table using the BQ Python SDK; Returns a Pandas dataframe
PDF = fn_execute_bq_statement(BIGLAKE_PARTITIONED_TABLE_DDL)
print(PDF)

Empty DataFrame
Columns: []
Index: []


#### 7.5. Run a query against the BigLake partitioned Delta Lake table using the BigQuery Python SDK

In [26]:
BIGLAKE_SELECT_SQL = 'select addr_state,count from '  + DATASET + '.' + TABLE_NAME + ' LIMIT 5;'
print(BIGLAKE_SELECT_SQL)
PDF = fn_execute_bq_statement(BIGLAKE_SELECT_SQL)
print(PDF)

select addr_state,count from delta_dataset.loans_deltalake_partitioned LIMIT 5;
  addr_state  count
0         AR      1
1         AZ      1
2         CO      1
3         AK      1
4         CA      1


### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK