In [1]:
import $ivy.`org.apache.spark::spark-sql:3.3.1`
import $ivy.`io.delta::delta-core:2.1.0`
import $ivy.`com.lihaoyi::os-lib:0.7.1`
import $ivy.`com.github.mrpowers::jodie:0.0.3`

[32mimport [39m[36m$ivy.$                                  
[39m
[32mimport [39m[36m$ivy.$                           
[39m
[32mimport [39m[36m$ivy.$                          
[39m
[32mimport [39m[36m$ivy.$                                 [39m

## Setup

In [2]:

import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession
import io.delta.tables._
import mrpowers.jodie.{DeltaHelpers,Type2Scd}

[32mimport [39m[36morg.apache.spark.sql._
[39m
[32mimport [39m[36morg.apache.spark.sql.SparkSession
[39m
[32mimport [39m[36mio.delta.tables._
[39m
[32mimport [39m[36mmrpowers.jodie.{DeltaHelpers,Type2Scd}[39m

In [3]:
val spark = SparkSession.builder
  .master("local[*]")
  .appName("JodieDemo")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .getOrCreate()
import spark.implicits._

SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties


[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@e5046fd
[32mimport [39m[36mspark.implicits._[39m

In [4]:
// Writes df as a Delta table at tablePath, replacing any existing data.
def createTable(tablePath: String, df: DataFrame): Unit =
  df.write.format("delta").mode("overwrite").save(tablePath)

defined function createTable

# Type 2 SCDs

This library provides an opinionated, convention-over-configuration approach to Type 2 SCD management. Let's look at an example before covering the conventions your table must follow to take advantage of this functionality.

Suppose you have the following SCD table with the `pkey` primary key:

In [5]:
val scdTable = Seq(
  (1, "A", "A", true, "2019-01-01 00:00:00", "null"),
  (2, "B", "B", true, "2019-01-01 00:00:00", "null"),
  (4, "D", "D", true, "2019-01-01 00:00:00", "null")
).toDF("pkey", "attr1", "attr2", "is_current", "effective_time", "end_time")
val tablePath = f"${os.pwd}/delta-table"
createTable(tablePath,scdTable)
val scdDeltaTable = DeltaTable.forPath(tablePath)
scdDeltaTable.toDF.show(false)

+----+-----+-----+----------+-------------------+--------+
|pkey|attr1|attr2|is_current|effective_time     |end_time|
+----+-----+-----+----------+-------------------+--------+
|1   |A    |A    |true      |2019-01-01 00:00:00|null    |
|2   |B    |B    |true      |2019-01-01 00:00:00|null    |
|4   |D    |D    |true      |2019-01-01 00:00:00|null    |
+----+-----+-----+----------+-------------------+--------+



[36mscdTable[39m: [32mDataFrame[39m = [pkey: int, attr1: string ... 4 more fields]
[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/delta-table"[39m
[36mscdDeltaTable[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@53abcf4c

### You'd like to perform an upsert with this data:

In [6]:
val updatesDF = Seq(
  (2, "Z", null, "2020-01-01 00:00:00"),
  (3, "C", "C", "2020-09-15 00:00:00")
).toDF("pkey", "attr1", "attr2", "effective_time")
updatesDF.show(false)

+----+-----+-----+-------------------+
|pkey|attr1|attr2|effective_time     |
+----+-----+-----+-------------------+
|2   |Z    |null |2020-01-01 00:00:00|
|3   |C    |C    |2020-09-15 00:00:00|
+----+-----+-----+-------------------+



[36mupdatesDF[39m: [32mDataFrame[39m = [pkey: int, attr1: string ... 2 more fields]

### Here's how to perform the upsert:

In [7]:
Type2Scd.upsert(scdDeltaTable, updatesDF, "pkey", Seq("attr1", "attr2"))

### Here is the table after the upsert:

In [8]:
scdDeltaTable.toDF.show(false)

+----+-----+-----+----------+-------------------+-------------------+
|pkey|attr1|attr2|is_current|effective_time     |end_time           |
+----+-----+-----+----------+-------------------+-------------------+
|2   |Z    |null |true      |2020-01-01 00:00:00|null               |
|2   |B    |B    |false     |2019-01-01 00:00:00|2020-01-01 00:00:00|
|3   |C    |C    |true      |2020-09-15 00:00:00|null               |
|1   |A    |A    |true      |2019-01-01 00:00:00|null               |
|4   |D    |D    |true      |2019-01-01 00:00:00|null               |
+----+-----+-----+----------+-------------------+-------------------+



#### You can leverage the upsert code if your SCD table meets these requirements:

* Contains a unique primary key column
* Any change in an attribute column triggers an upsert
* SCD logic is exposed via the `effective_time`, `end_time` and `is_current` columns

`merge` logic can get really messy, so it's easiest to follow these conventions. See [this blog post](https://mungingdata.com/delta-lake/type-2-scd-upserts/) if you'd like to build an SCD with custom logic.
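
To make the conventions concrete, here is a minimal plain-Scala sketch of the row-level behavior they imply. It runs without Spark and is not how the library is implemented (the library works through a Delta `merge`); the names `ScdRow`, `Update` and `scdUpsert` are hypothetical and exist only for this illustration.

```scala
// Illustrative model only: hypothetical case classes, no Spark or Delta involved.
case class ScdRow(pkey: Int, attr1: String, attr2: String,
                  isCurrent: Boolean, effectiveTime: String, endTime: Option[String])
case class Update(pkey: Int, attr1: String, attr2: String, effectiveTime: String)

def scdUpsert(table: Seq[ScdRow], updates: Seq[Update]): Seq[ScdRow] =
  updates.foldLeft(table) { (rows, u) =>
    val current = rows.find(r => r.pkey == u.pkey && r.isCurrent)
    // A missing current row counts as a change, so new keys are inserted.
    val changed = current.forall(r => r.attr1 != u.attr1 || r.attr2 != u.attr2)
    if (!changed) rows // attributes are identical: leave the table untouched
    else {
      // Close the current version (if any) at the update's effective time...
      val closed = rows.map { r =>
        if (r.pkey == u.pkey && r.isCurrent)
          r.copy(isCurrent = false, endTime = Some(u.effectiveTime))
        else r
      }
      // ...and append the update as the new current version (isCurrent = true, no end time).
      closed :+ ScdRow(u.pkey, u.attr1, u.attr2, true, u.effectiveTime, None)
    }
  }

val table = Seq(
  ScdRow(1, "A", "A", true, "2019-01-01 00:00:00", None),
  ScdRow(2, "B", "B", true, "2019-01-01 00:00:00", None),
  ScdRow(4, "D", "D", true, "2019-01-01 00:00:00", None)
)
val updates = Seq(
  Update(2, "Z", null, "2020-01-01 00:00:00"),
  Update(3, "C", "C", "2020-09-15 00:00:00")
)
val result = scdUpsert(table, updates)
```

With the same input as the example above, `result` holds five rows: the old version of key 2 is closed at the update's effective time, and keys 2 and 3 each get a new current row.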

# Kill Duplicates
The function `killDuplicateRecords` deletes every record that has a duplicate, where duplicates are identified by a given set of columns. No copy of a duplicated record is kept.

Given the following table:

In [9]:
val inputData = Seq(
  (1, "Benito", "Jackson"),
  (2, "Maria", "Willis"),
  (3, "Jose", "Travolta"),
  (4, "Benito", "Jackson"),
  (5, "Jose", "Travolta"),
  (6, "Maria", "Pitt"),
  (9, "Benito", "Jackson")
).toDF("id", "firstname", "lastname")
val tablePath = f"${os.pwd}/people"
createTable(tablePath,inputData)
val peopleDeltaTable = DeltaTable.forPath(tablePath)
peopleDeltaTable.toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|4  |Benito   |Jackson |
|9  |Benito   |Jackson |
|1  |Benito   |Jackson |
|3  |Jose     |Travolta|
|5  |Jose     |Travolta|
|2  |Maria    |Willis  |
|6  |Maria    |Pitt    |
+---+---------+--------+



[36minputData[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]
[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/people"[39m
[36mpeopleDeltaTable[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@5de8d01

### We can run the following function to remove all duplicates:

In [10]:
DeltaHelpers.killDuplicateRecords(deltaTable = peopleDeltaTable,duplicateColumns = Seq("firstname","lastname"))

### The result of running the previous function is the following table:

In [11]:
DeltaTable.forPath(tablePath).toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|2  |Maria    |Willis  |
|6  |Maria    |Pitt    |
+---+---------+--------+



### As you can see, all the duplicated values were removed from the table, which is what we intended to do.

If your goal is instead to keep one occurrence of each duplicated record, see the function `removeDuplicateRecords`.
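
The "kill" semantics can be sketched with plain Scala collections, without Spark. This only illustrates the behavior shown above and makes no claim about how the library implements it; the `Person` case class is a hypothetical stand-in for a table row.

```scala
// Hypothetical stand-in for a table row.
case class Person(id: Int, firstname: String, lastname: String)

val people = Seq(
  Person(1, "Benito", "Jackson"),
  Person(2, "Maria", "Willis"),
  Person(3, "Jose", "Travolta"),
  Person(4, "Benito", "Jackson"),
  Person(5, "Jose", "Travolta"),
  Person(6, "Maria", "Pitt"),
  Person(9, "Benito", "Jackson")
)

// "Kill": drop every row whose (firstname, lastname) pair occurs more than once.
val killed = people
  .groupBy(p => (p.firstname, p.lastname))
  .values
  .filter(_.size == 1) // keep only groups that have a single member
  .flatten
  .toSeq
  .sortBy(_.id)
```

Only the rows with ids 2 and 6 survive, matching the table above.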

# Remove Duplicates

The function `removeDuplicateRecords` deletes duplicates but keeps one occurrence of each duplicated record. There are two versions of this function; let's look at an example of each.

Let’s see an example of how to use the first version. Given the following table:

In [9]:
val tablePath = f"${os.pwd}/students1"
val df = Seq(
        (1, "Benito", "Jackson"),
        (1, "Benito", "Jackson"),
        (1, "Benito", "Jackson"),
        (1, "Benito", "Jackson"),
        (1, "Benito", "Jackson")
      ).toDF("id", "firstname", "lastname")
createTable(tablePath,df)
val studentsDeltaTable = DeltaTable.forPath(tablePath)
studentsDeltaTable.toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
|1  |Benito   |Jackson |
|1  |Benito   |Jackson |
|1  |Benito   |Jackson |
|1  |Benito   |Jackson |
+---+---------+--------+



[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/students1"[39m
[36mdf[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]
[36mstudentsDeltaTable[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@638908cd

### We can run the following function to remove all duplicates:

In [10]:
DeltaHelpers.removeDuplicateRecords(deltaTable = studentsDeltaTable, 
                                    duplicateColumns = Seq("firstname","lastname"))

### The result of running the previous function is the following table:

In [11]:
DeltaTable.forPath(tablePath).toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
+---+---------+--------+



### Now let’s see an example of how to use the second version:
Suppose you have the following table:

In [13]:
val tablePath = f"${os.pwd}/students2"
val df = Seq(
      (2, "Maria", "Willis"),
      (3, "Jose", "Travolta"),
      (4, "Benito", "Jackson"),
      (1, "Benito", "Jackson"),
      (5, "Jose", "Travolta"),
      (6, "Maria", "Pitt"),
      (9, "Benito", "Jackson")
    ).toDF("id", "firstname", "lastname")
createTable(tablePath,df)
val studentsDeltaTable = DeltaTable.forPath(tablePath)
studentsDeltaTable.toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|9  |Benito   |Jackson |
|1  |Benito   |Jackson |
|4  |Benito   |Jackson |
|3  |Jose     |Travolta|
|5  |Jose     |Travolta|
|2  |Maria    |Willis  |
|6  |Maria    |Pitt    |
+---+---------+--------+



[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/students2"[39m
[36mdf[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]
[36mstudentsDeltaTable[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@4d6b6ad1

This time the function takes an additional input parameter: a primary key used to sort the duplicated records in ascending order. Within each group of duplicates, the record with the lowest key is kept and the rest are removed.

In [14]:
DeltaHelpers.removeDuplicateRecords(deltaTable = studentsDeltaTable, primaryKey = "id",
  duplicateColumns = Seq("firstname","lastname"))

### The result of running the previous function is the following:

In [15]:
DeltaTable.forPath(tablePath).toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
|3  |Jose     |Travolta|
|2  |Maria    |Willis  |
|6  |Maria    |Pitt    |
+---+---------+--------+
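
The keep-the-lowest-key behavior seen in this result can be sketched in plain Scala as follows. This is an illustration of the semantics under the assumption that the smallest `id` in each group wins, as the output above suggests; `Student` is a hypothetical stand-in for a table row, and the library itself operates on the Delta table.

```scala
// Hypothetical stand-in for a table row.
case class Student(id: Int, firstname: String, lastname: String)

val students = Seq(
  Student(2, "Maria", "Willis"),
  Student(3, "Jose", "Travolta"),
  Student(4, "Benito", "Jackson"),
  Student(1, "Benito", "Jackson"),
  Student(5, "Jose", "Travolta"),
  Student(6, "Maria", "Pitt"),
  Student(9, "Benito", "Jackson")
)

// Group by the duplicate columns, then keep the row with the smallest id per group.
val deduped = students
  .groupBy(s => (s.firstname, s.lastname))
  .values
  .map(_.minBy(_.id))
  .toSeq
  .sortBy(_.id)
```

The surviving ids are 1, 2, 3 and 6, the same rows as in the table above.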



# Copy Delta Table
This function takes an existing Delta table and copies all its data, properties, and partitions to a new Delta table. The new table can be created either at a specified path or under a given table name.

Copying does not include the delta log, which means that you will not be able to restore the new table to an old version of the original table.

Let's demonstrate it with an example. Given the following table:

In [16]:
val tablePath = f"${os.pwd}/students3"
val tablePathCopy = f"${os.pwd}/students4"
val df = Seq(
      (1, "Benito", "Jackson"),
      (5, "Jose", "Travolta"),
      (6, "Maria", "Willis"),
    ).toDF("id", "firstname", "lastname")
createTable(tablePath,df)
val studentsDeltaTable3 = DeltaTable.forPath(tablePath)
studentsDeltaTable3.toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
|5  |Jose     |Travolta|
|6  |Maria    |Willis  |
+---+---------+--------+



[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/students3"[39m
[36mtablePathCopy[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/students4"[39m
[36mdf[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]
[36mstudentsDeltaTable3[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@6486f667

### Here's how to perform the copy to a specific path:



In [17]:
DeltaHelpers.copyTable(deltaTable = studentsDeltaTable3, targetPath = Some(tablePathCopy))


### The result of copying the table is the following:

In [7]:
DeltaTable.forPath(tablePathCopy).toDF.show(false)


Note: when the copy targets a table name rather than a path, the location where the table is stored is determined by the Spark configuration property `spark.sql.warehouse.dir`.

# Latest Version of Delta Table
The function `latestVersion` returns the latest version number of a table, given its storage path.

Here's how to use the function:

In [18]:
DeltaHelpers.latestVersion(path = tablePath)

[36mres17[39m: [32mLong[39m = [32m9L[39m

# Insert Data Without Duplicates
The function `appendWithoutDuplicates` inserts data into an existing delta table and prevents data duplication in the process. Let's see an example of how it works.

Suppose we have the following table:

In [19]:
val tablePath = f"${os.pwd}/people2"
val df = Seq(
      (1, "Benito", "Jackson"),
      (5, "Rosalia", "Pitt"),
      (6, "Maria", "Pitt")
    ).toDF("id", "firstname", "lastname")
createTable(tablePath,df)
val peopleDeltaTable2 = DeltaTable.forPath(tablePath)
peopleDeltaTable2.toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
|5  |Rosalia  |Pitt    |
|6  |Maria    |Pitt    |
+---+---------+--------+



[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/people2"[39m
[36mdf[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]
[36mpeopleDeltaTable2[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@71cec453

### And we want to insert this new dataframe:

In [20]:
val newDF = Seq(
      (6, "Rosalia", "Pitt"),
      (2, "Maria", "Willis"),
      (3, "Jose", "Travolta"),
      (4, "Maria", "Pitt")
    ).toDF("id", "firstname", "lastname")
newDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|6  |Rosalia  |Pitt    |
|2  |Maria    |Willis  |
|3  |Jose     |Travolta|
|4  |Maria    |Pitt    |
+---+---------+--------+



[36mnewDF[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]

### We can use the following function to insert new data and avoid data duplication:

In [21]:
DeltaHelpers.appendWithoutDuplicates(deltaTable = peopleDeltaTable2, appendData = newDF, 
  primaryKeysColumns = Seq("firstname","lastname")
)

### The result table will be the following:

In [22]:
DeltaTable.forPath(tablePath).toDF.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
|5  |Rosalia  |Pitt    |
|2  |Maria    |Willis  |
|3  |Jose     |Travolta|
|6  |Maria    |Pitt    |
+---+---------+--------+
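
The result can be read as "insert only the rows whose key is not already in the table". A minimal plain-Scala sketch of that reading follows; it needs Scala 2.13 or later for `distinctBy`. It is not the library's implementation, and `PersonRow` and `key` are hypothetical names used only here.

```scala
// Hypothetical stand-in for a table row.
case class PersonRow(id: Int, firstname: String, lastname: String)

val existing = Seq(
  PersonRow(1, "Benito", "Jackson"),
  PersonRow(5, "Rosalia", "Pitt"),
  PersonRow(6, "Maria", "Pitt")
)
val incoming = Seq(
  PersonRow(6, "Rosalia", "Pitt"),
  PersonRow(2, "Maria", "Willis"),
  PersonRow(3, "Jose", "Travolta"),
  PersonRow(4, "Maria", "Pitt")
)

// The columns that identify a record, mirroring primaryKeysColumns above.
def key(r: PersonRow): (String, String) = (r.firstname, r.lastname)

val existingKeys = existing.map(key).toSet
// Keep one incoming row per key, then skip keys that are already in the table.
val toInsert = incoming.distinctBy(key).filterNot(r => existingKeys(key(r)))
val appended = existing ++ toInsert
```

Here `(6, Rosalia, Pitt)` and `(4, Maria, Pitt)` are skipped because their name pairs already exist, so only ids 2 and 3 are appended.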



# Generate MD5 from columns
The function `withMD5Columns` appends an MD5 hash of the specified columns to the DataFrame. This can be used as a unique key if the selected columns form a composite key. Here is an example.

Suppose we have the following dataframe:

In [23]:
val df = Seq(
      (1, "Benito", "Jackson"),
      (5, "Rosalia", "Pitt"),
      (6, "Maria", "Pitt")
    ).toDF("id", "firstname", "lastname")
df.show(false)

+---+---------+--------+
|id |firstname|lastname|
+---+---------+--------+
|1  |Benito   |Jackson |
|5  |Rosalia  |Pitt    |
|6  |Maria    |Pitt    |
+---+---------+--------+



[36mdf[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]

### We can use the md5 function in this way:

In [24]:
val resultDF = DeltaHelpers.withMD5Columns(dataFrame = df, 
                                           cols = List("firstname","lastname"),
                                           newColName = "unique_id")

[36mresultDF[39m: [32mDataFrame[39m = [id: int, firstname: string ... 2 more fields]

### The result dataframe will be the following:

In [25]:
resultDF.show(false)

+---+---------+--------+--------------------------------+
|id |firstname|lastname|unique_id                       |
+---+---------+--------+--------------------------------+
|1  |Benito   |Jackson |3456d6842080e8188b35f515254fece8|
|5  |Rosalia  |Pitt    |ec8d357c71914f989d704b7be0d4e708|
|6  |Maria    |Pitt    |2af7722350b26a3c7c043b8202d1d9e5|
+---+---------+--------+--------------------------------+



You can use this function with the columns identified by `findCompositeKeyCandidate` to append a unique key to the DataFrame.
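
For intuition, hashing several column values into one key can be sketched in plain Scala with the JDK's `MessageDigest`. The helper names `md5Hex` and `compositeKeyHash` and the `"||"` separator are assumptions made for this illustration; the library may join the columns differently, so these hashes are not expected to match the `unique_id` values shown above.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hex-encoded MD5 digest of a UTF-8 string.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes(StandardCharsets.UTF_8))
    .map(b => f"${b & 0xff}%02x")
    .mkString

// Assumption: "||" is an illustrative separator, not necessarily what the library uses.
def compositeKeyHash(values: Seq[String], sep: String = "||"): String =
  md5Hex(values.mkString(sep))

val benitoKey = compositeKeyHash(Seq("Benito", "Jackson"))
```

Joining with a separator keeps value pairs such as `("ab", "c")` and `("a", "bc")` from producing the same input string, and therefore the same hash.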

# Find Composite Key

The function `findCompositeKeyCandidate` helps you find a composite key that uniquely identifies the rows of your Delta table. It returns a list of columns that can be used as a composite key. For example:

Suppose we have the following table:

In [26]:
val tablePath = f"${os.pwd}/people3"
val df = Seq(
      (1, "Benito", "Jackson"),
      (5, "Rosalia", "Pitt"),
      (6, "Maria", "Pitt"),
      (7, "Maria", "Rodriguez")
    ).toDF("id", "firstname", "lastname")
createTable(tablePath,df)
val peopleDeltaTable3 = DeltaTable.forPath(tablePath)
peopleDeltaTable3.toDF.show(false)

+---+---------+---------+
|id |firstname|lastname |
+---+---------+---------+
|7  |Maria    |Rodriguez|
|1  |Benito   |Jackson  |
|5  |Rosalia  |Pitt     |
|6  |Maria    |Pitt     |
+---+---------+---------+



[36mtablePath[39m: [32mString[39m = [32m"/Users/brayan_jules/projects/people3"[39m
[36mdf[39m: [32mDataFrame[39m = [id: int, firstname: string ... 1 more field]
[36mpeopleDeltaTable3[39m: [32mDeltaTable[39m = io.delta.tables.DeltaTable@1d9d5c3f

### Now execute the function and get the result:

In [27]:
DeltaHelpers.findCompositeKeyCandidate(
  deltaTable = peopleDeltaTable3,
  excludeCols = Seq("id")
)

[36mres26[39m: [32mSeq[39m[[32mString[39m] = [33mArraySeq[39m([32m"firstname"[39m, [32m"lastname"[39m)