# Delta Lake Lab 
## Unit 2: Create a Delta Lake table
In the previous unit, we:
1. Read Parquet data from the data lake
2. Cleansed it, subset it, and persisted it as Parquet to the data lake's parquet-consumable directory
3. Created a database called loan_db and defined an external table on the data in parquet-consumable

In this unit, you will learn to:
1. Create a base Delta table from the Parquet table created in the prior notebook.
2. Create a partitioned Delta table from the same Parquet table.

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

22/10/22 23:30:06 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  885979867746
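
The `grep | cut | xargs` pipelines above pick single fields out of the YAML-like text that `gcloud projects describe` prints. As a rough sketch, the same extraction can be done in Python. `parse_describe_output` is a hypothetical helper (not part of gcloud), and the sample text is an assumed, abbreviated shape of the command output that uses only the values printed in this notebook.

```python
def parse_describe_output(text):
    """Return top-level `key: value` pairs from YAML-like text.

    Hypothetical helper: it handles only flat, unindented `key: value`
    lines, which is all the cells above rely on.
    """
    fields = {}
    for line in text.splitlines():
        # Skip blank lines, indented (nested) entries, and lines without a key.
        if not line.strip() or line[0].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Strip whitespace and surrounding quotes, as `xargs` does above.
        fields[key.strip()] = value.strip().strip("'\"")
    return fields


# Assumed sample, built from values printed earlier in this notebook.
sample = """\
name: delta-lake-lab
projectId: delta-lake-lab
projectNumber: '885979867746'
"""

project = parse_describe_output(sample)
print(project["name"], project["projectNumber"])
```

Matching on the exact key avoids accidental hits that a bare `grep name` could pick up from nested entries such as labels.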


In [6]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}"

### 4. Create an unpartitioned Delta table
We will use this table for the rest of the lab.

In [7]:
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

In [8]:
# Aggregate loan counts by state from the Parquet table and write the result in Delta format
(
    spark.sql("SELECT addr_state, count(*) AS count FROM loan_db.loans_by_state_parquet GROUP BY addr_state")
    .write.mode("overwrite")
    .format("delta")
    .save(DELTA_LAKE_DIR_ROOT)
)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

In [9]:
# Define an external Delta table over the dataset written in the previous cell
spark.sql("DROP TABLE IF EXISTS loan_db.loans_by_state_delta;")
spark.sql(f"CREATE TABLE loan_db.loans_by_state_delta USING delta LOCATION \"{DELTA_LAKE_DIR_ROOT}\"")

22/10/22 23:30:32 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `loan_db`.`loans_by_state_delta` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
22/10/22 23:30:33 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


DataFrame[]
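
The drop-and-create pattern in the cell above can be wrapped in a small function so it is easy to repeat for other locations. This is a minimal sketch: `register_delta_table` and `RecordingSession` are hypothetical names, the bucket name is illustrative, and the stand-in session only records SQL so the example runs without a cluster.

```python
def register_delta_table(spark, table_name, location):
    """Drop and re-create an external Delta table definition over `location`.

    Hypothetical helper wrapping the two statements from the cell above.
    `spark` is any object with a `sql(str)` method, such as a SparkSession.
    """
    statements = [
        f"DROP TABLE IF EXISTS {table_name}",
        f"CREATE TABLE {table_name} USING delta LOCATION '{location}'",
    ]
    for statement in statements:
        spark.sql(statement)
    return statements


class RecordingSession:
    """Stand-in for a SparkSession that records SQL instead of running it."""

    def __init__(self):
        self.executed = []

    def sql(self, statement):
        self.executed.append(statement)


# Illustrative bucket name; in the notebook, pass `spark` and DELTA_LAKE_DIR_ROOT.
session = RecordingSession()
register_delta_table(
    session,
    "loan_db.loans_by_state_delta",
    "gs://example-bucket/delta-consumable",
)
print("\n".join(session.executed))
```

Because the table is defined with an explicit `LOCATION`, dropping it should remove only the metastore entry and leave the Delta files in the bucket untouched.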

In [10]:
spark.sql("show tables from loan_db;").show()

+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  loan_db|loans_by_state_delta|      false|
|  loan_db|loans_by_state_pa...|      false|
|  loan_db|loans_cleansed_pa...|      false|
+---------+--------------------+-----------+



In [11]:
spark.sql("select * from loan_db.loans_by_state_delta limit 2").show()

+----------+-----+
|addr_state|count|
+----------+-----+
|        AZ|    1|
|        SC|    1|
+----------+-----+



In [12]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_delta").show()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|          addr_state|              string|       |
|               count|              bigint|       |
|                    |                    |       |
|      # Partitioning|                    |       |
|     Not partitioned|                    |       |
|                    |                    |       |
|# Detailed Table ...|                    |       |
|                Name|loan_db.loans_by_...|       |
|            Location|gs://dll-data-buc...|       |
|            Provider|               delta|       |
|               Owner|               spark|       |
|            External|                true|       |
|    Table Properties|[delta.minReaderV...|       |
+--------------------+--------------------+-------+



In [13]:
spark.sql("DESCRIBE EXTENDED loan_db.loans_by_state_delta").show()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|          addr_state|              string|       |
|               count|              bigint|       |
|                    |                    |       |
|      # Partitioning|                    |       |
|     Not partitioned|                    |       |
|                    |                    |       |
|# Detailed Table ...|                    |       |
|                Name|loan_db.loans_by_...|       |
|            Location|gs://dll-data-buc...|       |
|            Provider|               delta|       |
|               Owner|               spark|       |
|            External|                true|       |
|    Table Properties|[delta.minReaderV...|       |
+--------------------+--------------------+-------+
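
Several values in the `DESCRIBE` output end in `...` because `DataFrame.show()` shortens strings longer than 20 characters by default. The sketch below mimics that rule in plain Python to show where the cut happens; `truncate_cell` is a hypothetical name used only for illustration.

```python
def truncate_cell(value, width=20):
    """Mimic how `DataFrame.show()` shortens long strings by default."""
    if len(value) <= width:
        return value
    # Very small widths leave no room for the ellipsis.
    if width < 4:
        return value[:width]
    return value[: width - 3] + "..."


print(truncate_cell("loan_db.loans_by_state_delta"))
print(truncate_cell("addr_state"))
```

In the notebook, calling `.show(truncate=False)` on the `DESCRIBE` result should print the full table name, location, and table properties.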



### 5. Create a partitioned Delta Lake table

In [14]:
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-sample-partitioned"

In [15]:
# Aggregate loan counts by state and write the result in Delta format, partitioned by state
(
    spark.sql("SELECT addr_state, count(*) AS count FROM loan_db.loans_by_state_parquet GROUP BY addr_state")
    .write.mode("overwrite")
    .partitionBy("addr_state")
    .format("delta")
    .save(DELTA_LAKE_DIR_ROOT)
)
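
The partitioned and unpartitioned writes in this notebook differ only in the `partitionBy` call. As a sketch, one function could cover both cases. `write_delta`, `RecordingWriter`, and `FakeFrame` are hypothetical names; the stand-ins record the chained writer calls so the example runs without Spark, and the bucket name is illustrative.

```python
def write_delta(df, path, partition_cols=None):
    """Overwrite `path` with `df` in Delta format, optionally partitioned.

    Hypothetical helper; `df` is expected to behave like a Spark DataFrame.
    """
    writer = df.write.mode("overwrite").format("delta")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.save(path)


class RecordingWriter:
    """Stand-in for DataFrameWriter that records each chained call."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def method(*args):
            self.calls.append((name, args))
            return self
        return method


class FakeFrame:
    """Stand-in for a DataFrame exposing only the `write` attribute."""

    def __init__(self):
        self.write = RecordingWriter()


frame = FakeFrame()
write_delta(frame, "gs://example-bucket/delta-sample-partitioned", ["addr_state"])
print(frame.write.calls)
```

In the notebook, `df` would be the result of the `spark.sql(...)` aggregation and `path` the `DELTA_LAKE_DIR_ROOT` value.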

                                                                                

### 6. A quick peek at the data lake layout
Compare this to the last cell of the prior notebook.

In [16]:
!gsutil ls -r $DATA_LAKE_ROOT_PATH

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-595b5ba1-408f-404d-91ee-7bc396235870-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json

gs://dll-data-bucket-885979867746/delta-sample-partitioned/:

gs://dll-data-bucket-885979867746/delta-sample-partitioned/_delta_log/:
gs://dll-data-bucket-885979867746/delta-sample-partitioned/_delta_log/
gs://dll-data-bucket-885979867746/delta-sample-partitioned/_delta_log/00000000000000000000.json

gs://dll-data-bucket-885979867746/delta-sample-partitioned/addr_state=AK/:
gs://dll-data-bucket-885979867746/delta-sample-partitioned/addr_state=AK/part-00000-c641080d-5d27-4dfb-86ab-f49416353093.c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-sample-partitioned/addr_state=AL/:
gs://dll-data-buc
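
The listing shows two layout conventions: partition directories named `addr_state=<value>`, and a `_delta_log` directory whose commit files are named with a zero-padded version number. The sketch below reads both out of a few lines copied from the listing; `partition_values` and `commit_versions` are hypothetical helper names.

```python
import re

# Lines copied from the listing above.
listing = """\
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-sample-partitioned/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-sample-partitioned/addr_state=AK/part-00000-c641080d-5d27-4dfb-86ab-f49416353093.c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-sample-partitioned/addr_state=AL/:
""".splitlines()


def partition_values(lines, column):
    """Collect distinct values from `column=value` path segments."""
    pattern = re.compile(rf"/{re.escape(column)}=([^/]+)/")
    values = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            values.add(match.group(1))
    return sorted(values)


def commit_versions(lines, table_root):
    """Return the commit versions found under `<table_root>/_delta_log/`."""
    pattern = re.compile(rf"^{re.escape(table_root)}/_delta_log/(\d{{20}})\.json$")
    versions = []
    for line in lines:
        match = pattern.match(line)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


root = "gs://dll-data-bucket-885979867746/delta-sample-partitioned"
print(partition_values(listing, "addr_state"))
print(commit_versions(listing, root))
```

A single `00000000000000000000.json` file per table indicates that each table has only its initial commit (version 0) at this point in the lab.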

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK