## Overview of Delta Lake and some use cases

- https://github.com/delta-io

Use Cases:

- Schema Enforcement
- Deletion
- Updates
- Merge
- Time Travel

![images/delta-lake.jpg](https://raw.githubusercontent.com/eformat/telco-churn-augmentation/develop/images/delta-lake.jpg)


In [58]:
# notebook parameters

import os

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.9"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.9"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-11.0.12.0.7-4.fc34.x86_64"

spark_master = "local[*]"
app_name = "churn-etl"
input_files = dict(
    billing="billing_events", 
    account_features="customer_account_features", 
    internet_features="customer_internet_features", 
    meta="customer_meta", 
    phone_features="customer_phone_features"
)
output_file = "churn-etl"
output_prefix = ""
output_mode = "overwrite"
output_kind = "parquet"
input_kind = "parquet"
driver_memory = '8g'
executor_memory = '8g'

In [59]:
import pyspark

session = pyspark.sql.SparkSession.builder \
    .master(spark_master) \
    .appName(app_name) \
    .config("spark.eventLog.enabled", True) \
    .config("spark.eventLog.dir", ".") \
    .config("spark.driver.memory", driver_memory) \
    .config("spark.executor.memory", executor_memory) \
    .config("spark.executor.cores", 1) \
    .config("spark.rapids.sql.concurrentGpuTasks", 1) \
    .config("spark.rapids.memory.pinnedPool.size", "2G") \
    .config("spark.locality.wait", "0s") \
    .config("spark.sql.files.maxPartitionBytes", "512m") \
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
    .config("spark.jars", "/opt/sparkRapidsPlugin/cudf-21.08.2-cuda11.jar,/opt/sparkRapidsPlugin/rapids-4-spark_2.12-21.08.0.jar") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
session

21/10/08 10:30:33 WARN GpuDeviceManager: Initial RMM allocation (3224.0244140625 MB) is larger than the adjusted maximum allocation (3018.375 MB), lowering initial allocation to the adjusted maximum allocation.
21/10/08 10:30:34 WARN SQLExecPlugin: RAPIDS Accelerator 21.08.0 using cudf 21.08.2. To disable GPU support set `spark.rapids.sql.enabled` to false
21/10/08 10:30:34 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true


### Schema Enforcement

Delta Lake follows “schema on write”: the schema of incoming data is checked against the table's schema at write time, and any mismatch raises an exception before the data is written.

The code below creates a DataFrame holding the numbers 1 to 4 (`range(1, 5)` excludes the upper bound) and writes it as a Delta table.

In [60]:
import shutil
shutil.rmtree("/tmp/data/delta_sample", ignore_errors=True)

data = session.range(1,5)
data.write.format("delta").mode("overwrite").save("/tmp/data/delta_sample")



In [68]:
import glob
print(glob.glob("/tmp/data/delta_sample/*"))

['/tmp/data/delta_sample/part-00000-fe9b4ee3-c4da-4f1d-9e84-db6f07d17f5e-c000.snappy.parquet', '/tmp/data/delta_sample/part-00003-8bde0667-6254-477e-919a-c11ea03a471e-c000.snappy.parquet', '/tmp/data/delta_sample/part-00004-3ee3c741-aac5-4dbc-b4ed-869bf4d18925-c000.snappy.parquet', '/tmp/data/delta_sample/part-00001-82579b16-a31c-4258-9805-aea030960718-c000.snappy.parquet', '/tmp/data/delta_sample/part-00006-b66d3f3d-1198-4649-8149-59a2457cba06-c000.snappy.parquet', '/tmp/data/delta_sample/part-00007-f6097749-d798-422d-ac01-f17025e93203-c000.snappy.parquet', '/tmp/data/delta_sample/part-00000-e0853e7e-3893-4b1d-9cb9-853b10894ae0-c000.snappy.parquet', '/tmp/data/delta_sample/part-00001-4dbd76f2-7dad-47ed-80dc-e3aef8a7088d-c000.snappy.parquet', '/tmp/data/delta_sample/part-00003-6ebb27ae-9c5c-4d29-8e4b-48b42ec8cb3c-c000.snappy.parquet', '/tmp/data/delta_sample/part-00005-d74d5c06-ff82-4ad0-be9e-1d415f3af4f0-c000.snappy.parquet', '/tmp/data/delta_sample/part-00007-88f53701-3b64-4a89-9b7c-

Now create a DataFrame with the numbers 5 to 9, cast its `id` column to String, and try to append it to the existing table.

In [62]:
import pyspark.sql.functions as fn
new_data = session.range(5,10)
new_data = new_data.withColumn("id",fn.col("id").cast("String"))
new_data.write.format("delta").mode("append").save("/tmp/data/delta_sample")

AnalysisException: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and StringType

The write fails with **AnalysisException: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and StringType**, which is what we want.

Delta Lake stopped the incorrectly typed data from entering our table.
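
The check can be pictured with a small pure-Python model. This is only a sketch of the idea, not Delta Lake's implementation: `check_append`, `SchemaMismatch` and the dictionary-based schemas are hypothetical names made up for illustration.

```python
# Toy model of schema-on-write; NOT Delta Lake's actual implementation.
# A "schema" here is just a dict of column name -> type name (an assumption).


class SchemaMismatch(Exception):
    """Raised when incoming data does not match the table schema."""


def check_append(table_schema, incoming_schema):
    """Reject an append whose columns or types differ from the table's."""
    for column, incoming_type in incoming_schema.items():
        expected = table_schema.get(column)
        if expected is None:
            raise SchemaMismatch(f"unexpected column {column!r}")
        if expected != incoming_type:
            raise SchemaMismatch(
                f"column {column!r}: cannot write {incoming_type} into {expected}"
            )


table_schema = {"id": "long"}

# Same type as the table: accepted, nothing is raised.
check_append(table_schema, {"id": "long"})

# Different type: rejected before any data would be written.
try:
    check_append(table_schema, {"id": "string"})
except SchemaMismatch as err:
    print("append rejected:", err)
```

The point of the sketch is the ordering: validation happens before the write, so a bad batch never becomes part of the table.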

Let's append the dataset with the correct schema.

In [63]:
new_data = session.range(5,10)
new_data.write.format("delta").mode("append").save("/tmp/data/delta_sample")



We can inspect the Delta log and see that the append added the newly written part files, along with details such as the write mode and each file's modification time.

In [64]:
os.chdir("/tmp/data/delta_sample/_delta_log")
for file in glob.glob("*.json"):
    print(file)
    print(open(file).read())


00000000000000000001.json
{"commitInfo":{"timestamp":1633653152031,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputBytes":"2611","numOutputRows":"5"}}}
{"add":{"path":"part-00000-e0853e7e-3893-4b1d-9cb9-853b10894ae0-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1633653151983,"dataChange":true}}
{"add":{"path":"part-00001-82579b16-a31c-4258-9805-aea030960718-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1633653152020,"dataChange":true}}
{"add":{"path":"part-00003-8bde0667-6254-477e-919a-c11ea03a471e-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1633653152005,"dataChange":true}}
{"add":{"path":"part-00004-3ee3c741-aac5-4dbc-b4ed-869bf4d18925-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1633653152000,"dataChange":true}}
{"add":{"path":"part-00006-b66d3f3d-1198-4649-8149-59a2
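
Because `glob` returns files in no particular order, it can help to read the commits sorted by version. The sketch below does that without changing the working directory. `list_commits` is a hypothetical helper, and the demo writes its own throwaway commit files so it runs without a real table; it relies only on the zero-padded numeric file names visible in the output above.

```python
import json
import tempfile
from pathlib import Path


def list_commits(log_dir):
    """Return (version, operation) pairs for the JSON commits in log_dir, oldest first."""
    commits = []
    # Zero-padded names sort lexicographically in version order.
    for path in sorted(Path(log_dir).glob("*.json")):
        version = int(path.stem)
        operation = None
        with path.open() as handle:
            for line in handle:
                if not line.strip():
                    continue
                action = json.loads(line)  # one JSON action per line
                if "commitInfo" in action:
                    operation = action["commitInfo"].get("operation")
        commits.append((version, operation))
    return commits


# Demo on a temporary directory with made-up commits (illustrative data only).
with tempfile.TemporaryDirectory() as demo_dir:
    for version, operation in [(1, "WRITE"), (0, "WRITE"), (2, "DELETE")]:
        commit = {"commitInfo": {"operation": operation}}
        Path(demo_dir, f"{version:020d}.json").write_text(json.dumps(commit) + "\n")
    demo_commits = list_commits(demo_dir)

print(demo_commits)
```

Against the table in this notebook, the call would be `list_commits("/tmp/data/delta_sample/_delta_log")`.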

### Deletion

Let's read the table we just wrote in Delta format.

In [69]:
from delta.tables import DeltaTable
delta_df = DeltaTable.forPath(session, "/tmp/data/delta_sample")
delta_df

<delta.tables.DeltaTable at 0x7f1e19808130>

Now we will delete the rows where id ≤ 2.

In [66]:
delta_df.delete("id<=2")



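Let’s check how the commit log records the delete operation.

The commit lists the original part file under `remove` and the rewritten part file under `add`, while `commitInfo` records the operation (DELETE) together with its predicate, id ≤ 2. The removed file is only dropped from the table's current state; it stays on disk until it is vacuumed.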

In [67]:
os.chdir("/tmp/data/delta_sample/_delta_log")
for file in glob.glob("*.json"):
    print(file)
    print(open(file).read())

00000000000000000002.json
{"commitInfo":{"timestamp":1633653191043,"operation":"DELETE","operationParameters":{"predicate":"[\"(`id` <= 2L)\"]"},"readVersion":1,"isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"0","executionTimeMs":"1559","numDeletedRows":"2","scanTimeMs":"1363","numAddedFiles":"1","rewriteTimeMs":"195"}}}
{"remove":{"path":"part-00003-6ebb27ae-9c5c-4d29-8e4b-48b42ec8cb3c-c000.snappy.parquet","deletionTimestamp":1633653191042,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":463}}
{"add":{"path":"part-00000-fe9b4ee3-c4da-4f1d-9e84-db6f07d17f5e-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1633653191038,"dataChange":true}}

00000000000000000001.json
{"commitInfo":{"timestamp":1633653152031,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputBytes":"2611","numOutputRows":"5"}}}
{"add"
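
To make the `remove`/`add` pair concrete, here is a minimal sketch that applies one commit's actions to a set of live file paths. `apply_commit` is a hypothetical helper, the two action lines are copied from the version 2 commit shown above, and the starting set of live files is an assumption chosen for the demo.

```python
import json


def apply_commit(live_files, commit_text):
    """Apply one commit's add/remove actions to a set of live file paths."""
    live = set(live_files)
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "remove" in action:
            live.discard(action["remove"]["path"])
        elif "add" in action:
            live.add(action["add"]["path"])
        # Other actions (for example commitInfo) do not change the file set.
    return live


# remove/add lines taken from 00000000000000000002.json above.
delete_commit = """
{"remove":{"path":"part-00003-6ebb27ae-9c5c-4d29-8e4b-48b42ec8cb3c-c000.snappy.parquet","deletionTimestamp":1633653191042,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":463}}
{"add":{"path":"part-00000-fe9b4ee3-c4da-4f1d-9e84-db6f07d17f5e-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1633653191038,"dataChange":true}}
"""

# Assumed state before the delete: just two of the table's files, for brevity.
before = {
    "part-00003-6ebb27ae-9c5c-4d29-8e4b-48b42ec8cb3c-c000.snappy.parquet",
    "part-00001-4dbd76f2-7dad-47ed-80dc-e3aef8a7088d-c000.snappy.parquet",
}
after = apply_commit(before, delete_commit)
for path in sorted(after):
    print(path)
```

The file that held the deleted rows leaves the live set and its rewritten replacement joins it; the untouched file is carried over as-is.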

### Updates

We will load the table again and update the value 5 to 500.

In [71]:
delta_df = DeltaTable.forPath(session, "/tmp/data/delta_sample")
delta_df.update(condition = "id = 5", set = { "id": "500" })
delta_df.toDF().show()



+---+
| id|
+---+
|  8|
|  3|
|  6|
|  7|
|  4|
|  1|
|  9|
|500|
+---+



The operation above sets id to 500 where it was 5. The `DeltaTable` reflects the change immediately, without being reloaded.

As you can see the syntax is very simple.

In [72]:
os.chdir("/tmp/data/delta_sample/_delta_log")
for file in glob.glob("*.json"):
    print(file)
    print(open(file).read())

00000000000000000004.json
{"commitInfo":{"timestamp":1633653302852,"operation":"UPDATE","operationParameters":{"predicate":"(id#12463L = 5)"},"readVersion":3,"isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"0","executionTimeMs":"2472","scanTimeMs":"2110","numAddedFiles":"1","numUpdatedRows":"1","rewriteTimeMs":"362"}}}
{"remove":{"path":"part-00001-82579b16-a31c-4258-9805-aea030960718-c000.snappy.parquet","deletionTimestamp":1633653302490,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":463}}
{"add":{"path":"part-00000-6810adf8-4a69-47a9-8d8e-3417d6056d02-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1633653302844,"dataChange":true}}

00000000000000000003.json
{"commitInfo":{"timestamp":1633653289670,"operation":"UPDATE","operationParameters":{"predicate":"(id#11855L = 5)"},"readVersion":2,"isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"0","executionTimeMs":"1271","scanTi
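
The log shows one removed file and one added file per update, which suggests a copy-on-write pattern: the file holding the matching row is rewritten and the rest are left alone. The toy model below illustrates that pattern in plain Python. It is an assumption-laden sketch, not Delta Lake's code: `update_rows`, the file names and the `.rewritten` suffix are all made up for illustration.

```python
def update_rows(files, predicate, change):
    """Toy copy-on-write update.

    `files` maps a file name to the list of values it holds. Files containing
    a matching value are replaced by a rewritten copy; others are untouched.
    Returns (new_files, removed_names, added_names).
    """
    new_files = {}
    removed, added = [], []
    for name, rows in files.items():
        if any(predicate(row) for row in rows):
            rewritten = [change(row) if predicate(row) else row for row in rows]
            new_name = name + ".rewritten"  # hypothetical naming scheme
            new_files[new_name] = rewritten
            removed.append(name)
            added.append(new_name)
        else:
            new_files[name] = rows
    return new_files, removed, added


files = {"part-a": [3, 4], "part-b": [5], "part-c": [6, 7]}
new_files, removed, added = update_rows(files, lambda v: v == 5, lambda v: 500)

print("removed:", removed)
print("added:", added)
print("new state:", new_files)
```

The original `files` mapping is never modified, which mirrors why older versions of a table remain readable after an update.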

### Merge

Now we will perform a merge on a Delta table. First, create a new dataset with country, year and temperature columns and write it as a Delta table.

In [73]:
shutil.rmtree("/tmp/data/delta_merge", ignore_errors=True)

df = session.read.csv("/home/mike/tmp/dataset", inferSchema=True, sep=',', header=True)
df.write.format("delta").save("/tmp/data/delta_merge")



In [74]:
delta_merge_df = DeltaTable.forPath(session, "/tmp/data/delta_merge")
delta_merge_df.toDF().show()

+---------+----+-----------+
|  country|year|temperature|
+---------+----+-----------+
|Australia|2019|      23.34|
| Pakistan|2021|   27.89892|
+---------+----+-----------+



In [75]:
update_df = session.read.csv("/home/mike/tmp/update-dataset", inferSchema=True, sep=',', header=True)
update_df.show()

+-----------+----+-----------+
|    country|year|temperature|
+-----------+----+-----------+
|  Australia|2021|      100.0|
|New Zealand|2019|   19.34534|
+-----------+----+-----------+



In [76]:
delta_merge_df.alias("delta_merge").merge(
    update_df.alias("updates"),
    "delta_merge.country = updates.country") \
  .whenMatchedUpdate(set = { 
        "temperature" : "updates.temperature",
        "year" : "updates.year"
  } ) \
  .whenNotMatchedInsert(values =
    {
      "country": "updates.country",
      "year": "updates.year",
      "temperature": "updates.temperature"
    }
  ) \
  .execute()



Final merged records: Australia's temperature was updated to 100.0 and its year to 2021, New Zealand was inserted as a new row, and Pakistan was left unchanged.

In [77]:
delta_merge_df.toDF().show()

+-----------+----+-----------+
|    country|year|temperature|
+-----------+----+-----------+
|New Zealand|2019|   19.34534|
|  Australia|2021|      100.0|
|   Pakistan|2021|   27.89892|
+-----------+----+-----------+
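
The matched/not-matched logic of the merge can be mimicked in plain Python, using the same rows as the tables above. This is a sketch of the semantics only: `upsert` is a hypothetical helper working on lists of dicts, not a Delta Lake API.

```python
def upsert(target, updates, key):
    """Toy equivalent of whenMatchedUpdate / whenNotMatchedInsert."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)  # matched: overwrite the columns
        else:
            merged[row[key]] = dict(row)  # not matched: insert the row
    return list(merged.values())


target = [
    {"country": "Australia", "year": 2019, "temperature": 23.34},
    {"country": "Pakistan", "year": 2021, "temperature": 27.89892},
]
updates = [
    {"country": "Australia", "year": 2021, "temperature": 100.0},
    {"country": "New Zealand", "year": 2019, "temperature": 19.34534},
]

merged_rows = upsert(target, updates, key="country")
for row in merged_rows:
    print(row)
```

Rows are matched on `country` alone, exactly like the join condition in the merge call, so Australia is updated rather than duplicated.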



### Time Travel

With Delta Lake we can keep different versions of our dataset and go back to any of them when needed.

In [78]:
delta_df.history().show(10, False)

+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|version|timestamp              |userId|userName|operation|operationParameters                   |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                                                                                                                      |userMetadata|
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|4      |2021-10

Current data looks like this:

In [48]:
delta_df = DeltaTable.forPath(session, "/tmp/data/delta_sample")
delta_df.toDF().show()

+---+
| id|
+---+
|  1|
|  2|
|500|
|  4|
|  7|
|  3|
|  9|
|  8|
|  6|
+---+



Let's get back version 1 of our data.

In [79]:
version_1 = session.read.format("delta").option("versionAsOf",1).load("/tmp/data/delta_sample")
version_1.show()



+---+
| id|
+---+
|  8|
|  6|
|  7|
|  4|
|  5|
|  3|
|  1|
|  2|
|  9|
+---+
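
Conceptually, reading an old version means replaying the commit log only up to that version. The sketch below shows the idea on made-up in-memory commits. `files_at_version`, the file names `f0` to `f2` and the commit layout are assumptions for illustration and do not reflect how Delta Lake stores or reads its log.

```python
def files_at_version(commits, version):
    """Replay commits 0..version and return the set of live files."""
    live = set()
    for number in sorted(commits):
        if number > version:
            break  # later commits are ignored when travelling back
        live -= set(commits[number].get("remove", []))
        live |= set(commits[number].get("add", []))
    return live


# Hypothetical history: two appends, then a delete that rewrote f0 as f2.
commits = {
    0: {"add": ["f0"]},
    1: {"add": ["f1"]},
    2: {"remove": ["f0"], "add": ["f2"]},
}

print(sorted(files_at_version(commits, 1)))
print(sorted(files_at_version(commits, 2)))
```

Version 1 still sees `f0`, which is why the data file must remain on disk for the older version to be readable.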



So that old versions don't blow out our storage, we can use `deltaTable.vacuum()`:

    deltaTable.vacuum()     # vacuum files not required by versions more than 7 days old
    deltaTable.vacuum(100)  # vacuum files not required by versions more than 100 hours old
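
The retention rule can be sketched as a simple cutoff on deletion time. This is a simplified model under stated assumptions, not the real vacuum logic: `vacuum_candidates` is a hypothetical helper, the input maps file paths to a `deletionTimestamp` in milliseconds (as seen in the `remove` actions earlier), and the 168-hour default simply encodes the 7 days mentioned above.

```python
HOUR_MS = 3600 * 1000


def vacuum_candidates(removed_files, now_ms, retention_hours=168):
    """Return removed files whose deletion is older than the retention window."""
    cutoff = now_ms - retention_hours * HOUR_MS
    return sorted(
        path for path, deleted_at in removed_files.items() if deleted_at < cutoff
    )


# Made-up timestamps: "now" is hour 200, measured from an arbitrary origin.
now_ms = 200 * HOUR_MS
removed_files = {
    "old.parquet": 0,                  # removed 200 hours ago
    "recent.parquet": 150 * HOUR_MS,   # removed 50 hours ago
}

print(vacuum_candidates(removed_files, now_ms))                      # 7-day window
print(vacuum_candidates(removed_files, now_ms, retention_hours=40))  # shorter window
```

Once a file has been vacuumed, any version that still depends on it can no longer be read, so the retention window effectively bounds how far back time travel can go.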

In [80]:
session.stop()