# Delta Lake Lab 
## Unit 8: Table Clone 

In the previous unit, we:
1. Learned about Z-ordering and data skipping, both native to Delta Lake

In this unit, we will learn about:
1. Table cloning - shallow clone: creating one, and understanding what happens when a shallow clone is created and when it is updated
2. Table cloning - deep clone: creating one, and understanding what happens when a deep clone is created and when it is updated

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

23/12/02 23:56:43 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-diy-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-diy-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  11002190840


In [6]:
DATA_LAKE_ROOT_PATH= f"gs://dll-data-bucket-{PROJECT_NUMBER}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
print(DELTA_LAKE_DIR_ROOT)

gs://dll-data-bucket-11002190840/delta-consumable


### 4. File listing

In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-11002190840/delta-consumable/:
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-0a06babc-6211-4fe4-ad01-3e1f9726830c-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-55d5ffb4-6b4a-4ead-8591-983dc631339f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-5afc31da-4141-4f21-93e5-0ee8635aee9f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-5f4a2030-2b43-4229-815a-55fca555d323-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6486345e-6c92-4216-b9ed-ffb413e41b4d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-687d9651-589c-4d6c-a8fd-df5d01770e85-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6dbd9c8a-28bb-48a7-b1a9-5304b8f16bc2-c000.snappy.parquet
gs:/

In [8]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

38


In [9]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

13


### 5. Create a shallow clone

In [10]:
SHALLOW_CLONE_DIR = f"{DELTA_LAKE_DIR_ROOT}/shallow_clone/"
print(SHALLOW_CLONE_DIR)

gs://dll-data-bucket-11002190840/delta-consumable/shallow_clone/


In [11]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta WHERE addr_state='IA' LIMIT 2").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
23/12/02 23:57:06 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+----------+-----+------------------+
|addr_state|count|collateral_value  |
+----------+-----+------------------+
|IA        |164  |1641533.0867176917|
|IA        |262  |2628029.966480629 |
+----------+-----+------------------+



In [12]:
spark.sql(f"CREATE TABLE IF NOT EXISTS loan_db.loans_by_state_delta_clone_shallow SHALLOW CLONE loan_db.loans_by_state_delta LOCATION \"{SHALLOW_CLONE_DIR}\"")

23/12/02 23:57:25 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_catalog`.`loan_db`.`loans_by_state_delta_clone_shallow` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
23/12/02 23:57:25 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


DataFrame[source_table_size: bigint, source_num_of_files: bigint, num_removed_files: bigint, num_copied_files: bigint, removed_files_size: bigint, copied_files_size: bigint]

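Creating a shallow clone is a metadata-only operation: the clone's transaction log references the source table's data files, and no data is copied. Only when a write (insert, update, or delete) is run against the clone are the affected data files rewritten into the clone's own location.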
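
One way to check the metadata-only behaviour is to look at the `add` actions in the clone's `_delta_log` commit files. The sketch below assumes that a shallow clone records source files by absolute path (for example `gs://...`), while files written by the clone itself are recorded by relative path. The helper name `classify_add_paths` and the sample log lines are hypothetical and are not output from this lab.

```python
import json
from urllib.parse import urlparse


def classify_add_paths(commit_lines):
    """Count Delta 'add' actions by whether their path is absolute or relative.

    commit_lines: iterable of JSON strings, one Delta log action per line.
    """
    counts = {"absolute": 0, "relative": 0}
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # skip commitInfo, metaData, protocol, remove, ...
        # A path carrying a URI scheme (gs://, s3://, ...) is treated as absolute.
        kind = "absolute" if urlparse(add["path"]).scheme else "relative"
        counts[kind] += 1
    return counts


# Hypothetical log lines shaped like Delta actions (made up for illustration).
sample_commit = [
    '{"commitInfo": {"operation": "CLONE"}}',
    '{"add": {"path": "gs://example-bucket/source/part-00000-aaaa.snappy.parquet", "size": 1024, "dataChange": false}}',
    '{"add": {"path": "part-00000-bbbb.snappy.parquet", "size": 2048, "dataChange": true}}',
]
print(classify_add_paths(sample_commit))
```

In this notebook, the real commit lines could be captured with something like `lines = !gsutil cat {SHALLOW_CLONE_DIR}_delta_log/*.json` and passed to the helper. A freshly created shallow clone would then be expected to show only absolute paths, under the assumption stated above.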

In [13]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-11002190840/delta-consumable/:
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-0a06babc-6211-4fe4-ad01-3e1f9726830c-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-55d5ffb4-6b4a-4ead-8591-983dc631339f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-5afc31da-4141-4f21-93e5-0ee8635aee9f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-5f4a2030-2b43-4229-815a-55fca555d323-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6486345e-6c92-4216-b9ed-ffb413e41b4d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-687d9651-589c-4d6c-a8fd-df5d01770e85-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6dbd9c8a-28bb-48a7-b1a9-5304b8f16bc2-c000.snappy.parquet
gs:/

In [14]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

38


In [15]:
spark.sql("UPDATE loan_db.loans_by_state_delta_clone_shallow SET count = 11111 WHERE addr_state='IL'")

DataFrame[num_affected_rows: bigint]

In [16]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-11002190840/delta-consumable/:
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-0a06babc-6211-4fe4-ad01-3e1f9726830c-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-18ffd7b0-964d-4f1e-a648-0892a1d4373d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-55d5ffb4-6b4a-4ead-8591-983dc631339f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-5afc31da-4141-4f21-93e5-0ee8635aee9f-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-5f4a2030-2b43-4229-815a-55fca555d323-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6486345e-6c92-4216-b9ed-ffb413e41b4d-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-687d9651-589c-4d6c-a8fd-df5d01770e85-c000.snappy.parquet
gs://dll-data-bucket-11002190840/delta-consumable/part-00000-6dbd9c8a-28bb-48a7-b1a9-5304b8f16bc2-c000.snappy.parquet
gs:/

In [17]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

38


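Because of the update, the affected data was rewritten into the clone's own directory (`shallow_clone/`). The source table's part-file count is still 38, so the source table itself is unchanged.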
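
To see where an update actually lands, one option is to compare a recursive listing taken before the update with one taken after it, and group the new objects by directory. The following is a minimal sketch of that comparison. The helper names `object_paths` and `new_objects_by_dir`, and the bucket and file names in the sample listings, are hypothetical.

```python
from collections import Counter
import posixpath


def object_paths(listing_lines):
    """Keep only object URLs from `gsutil ls -r` output (drop blanks and 'dir/:' headers)."""
    paths = []
    for line in listing_lines:
        line = line.strip()
        if line.startswith("gs://") and not line.endswith((":", "/")):
            paths.append(line)
    return paths


def new_objects_by_dir(before_lines, after_lines):
    """Count objects present only in the second listing, grouped by parent directory."""
    added = set(object_paths(after_lines)) - set(object_paths(before_lines))
    return Counter(posixpath.dirname(path) for path in added)


# Hypothetical listings (bucket and file names are made up for illustration).
root = "gs://example-bucket/delta-consumable"
before = [
    f"{root}/:",
    f"{root}/part-00000-aaaa.snappy.parquet",
    "",
    f"{root}/shallow_clone/_delta_log/:",
    f"{root}/shallow_clone/_delta_log/00000000000000000000.json",
]
after = before + [
    f"{root}/shallow_clone/:",
    f"{root}/shallow_clone/part-00000-bbbb.snappy.parquet",
    f"{root}/shallow_clone/_delta_log/00000000000000000001.json",
]
for directory, count in sorted(new_objects_by_dir(before, after).items()):
    print(directory, count)
```

With these made-up listings, both new objects are reported under `shallow_clone/` and none at the table root. In the notebook, the two listings could be captured with the same IPython syntax used in section 3, for example `before = !gsutil ls -r $DELTA_LAKE_DIR_ROOT` ahead of the `UPDATE` and `after = !gsutil ls -r $DELTA_LAKE_DIR_ROOT` following it.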

### 6. Create a deep clone
A deep clone copies the source table's data files, in addition to its metadata, to the clone's location.

In [18]:
DEEP_CLONE_DIR = f"{DELTA_LAKE_DIR_ROOT}/deep_clone/"
print(DEEP_CLONE_DIR)

gs://dll-data-bucket-11002190840/delta-consumable/deep_clone/


In [None]:
# Deep clone is not yet open sourced by Databricks, so this statement may fail on open-source Delta Lake
spark.sql(f"CREATE TABLE IF NOT EXISTS loan_db.loans_by_state_delta_clone_deep CLONE loan_db.loans_by_state_delta LOCATION \"{DEEP_CLONE_DIR}\"")

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l
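
Where the deep `CLONE` statement is unavailable, a rough approximation is to read the source table and write its current snapshot to a new location as a Delta table. The sketch below is not a true deep clone: it is assumed to copy only the current data, without the source's history or any clone-specific metadata. The helper name `copy_delta_table` is hypothetical.

```python
def copy_delta_table(spark, source_table, target_path):
    """Write the current snapshot of `source_table` to `target_path` as a new Delta table.

    spark: an active SparkSession with Delta Lake configured.
    """
    (
        spark.table(source_table)   # read the current snapshot of the source
        .write.format("delta")      # write it out in Delta format
        .mode("errorifexists")      # refuse to overwrite an existing target
        .save(target_path)
    )


# Possible use in this notebook, with names taken from the cells above:
# copy_delta_table(spark, "loan_db.loans_by_state_delta", DEEP_CLONE_DIR)
```

The copy could then be registered in the metastore with a statement along the lines of `CREATE TABLE IF NOT EXISTS ... USING DELTA LOCATION ...`. Unlike the shallow clone, every data file is physically written under the target path, so listing that path should show its own part files.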

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK.