# POC: Incremental reads on Delta without Hive metastore

Using a local metastore with incremental features

# Use cases
Change Data Feed is not enabled by default. The following use cases should drive when you enable the change data feed.

1. Silver and Gold tables: Improve Delta performance by processing only row-level changes following initial MERGE, UPDATE, or DELETE operations to accelerate and simplify ETL and ELT operations.

2. Transmit changes: Send a change data feed to downstream systems such as Kafka or an RDBMS, which can use it to process data incrementally in later stages of data pipelines.

3. Audit trail table: Capturing the change data feed as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.
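Since the feed is off by default, it has to be switched on per table. The sketch below only builds the SQL text for an existing table; the helper name `enable_cdf_sql` is hypothetical, while the `delta.enableChangeDataFeed` table property is the one used later in this notebook.

```python
def enable_cdf_sql(table_name: str) -> str:
    """Build the ALTER TABLE statement that turns on Change Data Feed
    for an existing Delta table (hypothetical helper, not part of Delta)."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


statement = enable_cdf_sql("delta.accounts")
print(statement)
# With an active session this would be run as: spark.sql(statement)
```

Only changes committed after the property is set are recorded in the feed, so enabling it early keeps the history complete.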



# Known constraints

- The files in the _change_data folder follow the retention policy of the table. Therefore, if you run the VACUUM command, change data feed data is also deleted. Default is 7 days. [Reference](https://docs.delta.io/latest/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-table)

- With **column mapping** enabled on a Delta table, you can drop or rename columns in the table without rewriting data files for existing data. With column mapping enabled, change data feed has limitations after performing non-additive schema changes such as renaming or dropping a column, changing data type, or nullability changes. [Reference](https://docs.delta.io/latest/delta-change-data-feed.html#change-data-feed-limitations-for-tables-with-column-mapping-enabled)
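To make the retention constraint concrete, here is a rough plain-Python sketch that flags commits whose change data may already be gone. It assumes the 7-day default quoted above and that VACUUM has actually been run; `may_be_vacuumed` is a hypothetical helper, not a Delta API, and the real outcome depends on the table's configured retention.

```python
from datetime import datetime, timedelta

# Assumption: the 7-day default retention mentioned above.
DEFAULT_RETENTION = timedelta(days=7)


def may_be_vacuumed(commit_time: datetime, now: datetime,
                    retention: timedelta = DEFAULT_RETENTION) -> bool:
    """Return True when a commit is older than the retention window,
    i.e. its change data could already have been removed by VACUUM."""
    return now - commit_time > retention


now = datetime(2025, 1, 13, 15, 0, 0)
print(may_be_vacuumed(datetime(2025, 1, 10, 9, 0, 0), now))  # inside the window
print(may_be_vacuumed(datetime(2025, 1, 1, 9, 0, 0), now))   # outside the window
```

A downstream job that reads the feed less often than the retention period risks asking for versions that no longer have change files.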


## Before running

Check that Spark 3.5.3 and Hadoop are installed, along with the `delta-spark>=3.2.1` and `pyspark>=3.5.3` libraries.

Remove the `/warehouse/` folder in this directory if it exists.
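These checks can be scripted. The following is a minimal sketch using only the standard library; `installed_version` and `reset_warehouse` are hypothetical helper names, and the relative `warehouse` path assumes the notebook is run from the directory that holds that folder.

```python
import shutil
from importlib import metadata
from pathlib import Path


def installed_version(package: str):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def reset_warehouse(path: str) -> bool:
    """Delete the warehouse folder if it exists; return True if it was removed."""
    folder = Path(path)
    if folder.is_dir():
        shutil.rmtree(folder)
        return True
    return False


for package in ("pyspark", "delta-spark"):
    print(package, installed_version(package) or "not installed")

# Uncomment to wipe the local warehouse before a fresh run (destructive):
# reset_warehouse("warehouse")
```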

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

## Spark Setup

In [2]:
spark_jar_packages = ",".join([
    "io.delta:delta-spark_2.12:3.2.0",
])

In [3]:
LOCAL_WAREHOUSE_CATALOG = "file:///home/baptvit/Documents/github/lakehouse-labs/notebooks/warehouse/"

In [4]:
spark = (
    SparkSession.builder
    .appName("delta-without-playground")
    .config("spark.jars.packages", spark_jar_packages)

    # Delta Lake integration (no Hive metastore)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")   # Use Hadoop catalog
    .config("spark.sql.warehouse.dir", LOCAL_WAREHOUSE_CATALOG)   # Root folder for managed tables
    .getOrCreate()
)

25/01/13 15:07:39 WARN Utils: Your hostname, baptvit resolves to a loopback address: 127.0.1.1; using 192.168.2.129 instead (on interface wlp4s0)
25/01/13 15:07:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/baptvit/.ivy2/cache
The jars for the packages stored in: /home/baptvit/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9f04eedb-aa9f-411b-a7b6-961b7ae704e7;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.2.0 in central
	found io.delta#delta-storage;3.2.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in local-m2-cache
:: resolution report :: resolve 93ms :: artifacts dl 3ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.2.0 from central in [default]
	io.delta#delta-storage;3.2.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from local-m2-cache in [default]

## Creating a fake dataset

In [5]:
import random
from faker import Faker

In [6]:
def generate_entry(faker: Faker, country_codes: list):
    return {
        "id": faker.unique.uuid4(),
        "name":  faker.name(),
        "email": faker.email(),
        "passport": faker.passport_number(),
        "country_code": random.choice(country_codes),
        "iban": faker.iban(),
        "swift": faker.swift11(),
        "created_at": faker.past_date(start_date='-90d').strftime('%Y-%m-%d')
    }

In [7]:
def generate_dataset(num: int, seed: int):
    country_codes = ['US', 'CA', 'JP', 'KR', 'FR', 'GE', 'UK', 'BR', 'AR']
    Faker.seed(seed)
    faker = Faker()
    return [generate_entry(faker, country_codes) for _ in range(num)]

In [8]:
dataset = generate_dataset(num=100, seed=739)

In [9]:
df = spark.createDataFrame(dataset)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

In [10]:
df.count()

                                                                                

100

In [11]:
df.createOrReplaceTempView("accounts")

In [13]:
spark.sql("SELECT * FROM accounts LIMIT 1;").show(vertical=True)

-RECORD 0----------------------------
 country_code | GE                   
 created_at   | 2024-10-20           
 email        | powelljason@examp... 
 iban         | GB77AKMZ560580635... 
 id           | 5a424412-b127-4f8... 
 name         | Cody Taylor          
 passport     | 895549199            
 swift        | INSEGB5PR6S          
 year         | 2024                 
 month        | 10                   
 day          | 20                   



## Save using the default database

The table is written with Change Data Feed (CDF) enabled through the table property `delta.enableChangeDataFeed = true`.

In [14]:
spark.sql("""
    CREATE DATABASE IF NOT EXISTS delta;
""")

DataFrame[]

In [15]:
df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .option("delta.enableChangeDataFeed", "true") \
    .partitionBy("year", "month") \
    .saveAsTable("delta.accounts")

25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers
25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 84,44% for 9 writers
25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 76,00% for 10 writers
25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 63,33% for 12 writers
25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 58,46% for 13 writers
25/01/13 15:08:03 WARN MemoryManager: Total allocation exceeds 95,

## Reading directly from the local file path

In [18]:
LOCAL_ACCOUNT_TABLE = LOCAL_WAREHOUSE_CATALOG + "delta.db/accounts"
LOCAL_ACCOUNT_TABLE

'file:///home/baptvit/Documents/github/lakehouse-labs/notebooks/warehouse/delta.db/accounts'

In [19]:
# Read the current snapshot of the table from its path
df_read = spark.read.format("delta") \
  .load(LOCAL_ACCOUNT_TABLE)

In [20]:
df_read.count()

25/01/13 15:08:26 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

100

## Reading the table history from the local folder

Using the local folder path with Spark SQL.

In [21]:
spark.sql(f"DESCRIBE HISTORY '{LOCAL_ACCOUNT_TABLE}'").show(vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------------------
 version             | 0                                                                                                                                               
 timestamp           | 2025-01-13 15:08:04.344                                                                                                                         
 userId              | NULL                                                                                                                                            
 userName            | NULL                                                                                                                                            
 operation           | CREATE OR REPLACE TABLE AS SELECT                                                                                                        

## Reading the table history through the local catalog

Because `spark.sql.warehouse.dir` points at the local warehouse folder, the same table can also be addressed by name.

In [22]:
spark.sql("DESCRIBE HISTORY delta.accounts").show(vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------------------
 version             | 0                                                                                                                                               
 timestamp           | 2025-01-13 15:08:04.344                                                                                                                         
 userId              | NULL                                                                                                                                            
 userName            | NULL                                                                                                                                            
 operation           | CREATE OR REPLACE TABLE AS SELECT                                                                                                        

## Using CDF and table history to identify the increments

### Local folder with describe history

In [25]:
# Get the latest version of the Delta table
latest_version = spark.sql(f"DESCRIBE HISTORY '{LOCAL_ACCOUNT_TABLE}'") \
    .selectExpr("max(version)") \
    .collect()[0][0]

In [26]:
latest_version

0

### Reading only the changes from the latest table version using the local path

In [27]:
# Query changes for the latest version
latest_changes_df = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", latest_version) \
    .load(LOCAL_ACCOUNT_TABLE)

latest_changes_df.count()

100
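One way to turn this read into a repeatable incremental job is to remember the last version that was processed and ask only for newer ones. The sketch below is plain Python and makes several assumptions: the JSON checkpoint file and the helper names are hypothetical, and the returned pair is meant to feed the `startingVersion` and `endingVersion` reader options.

```python
import json
from pathlib import Path


def load_last_version(checkpoint: Path) -> int:
    """Return the last processed table version, or -1 if none was processed."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())["last_version"]
    return -1


def save_last_version(checkpoint: Path, version: int) -> None:
    """Persist the last processed table version."""
    checkpoint.write_text(json.dumps({"last_version": version}))


def pending_range(last_processed: int, latest: int):
    """Return (startingVersion, endingVersion) still to read, or None if up to date."""
    if latest <= last_processed:
        return None
    return last_processed + 1, latest


print(pending_range(-1, 0))  # first run, table at version 0 → (0, 0)
print(pending_range(0, 0))   # nothing new → None
print(pending_range(0, 1))   # one new commit → (1, 1)
```

After a successful run, the job would call `save_last_version` with the ending version so the next run starts right after it.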

## Creating the upsert

### Upsert Dataset

Editing 4 existing records and adding 4 new records

In [28]:
entries = [
    # Existing entries
    dataset[2], 
    dataset[4], 
    dataset[7],
    dataset[11],
    # New entries
    *generate_dataset(4, seed=1037)
]

In [29]:
for entry in entries:
    username = entry['name'].lower().replace(" ", ".")
    entry['email'] = f"{username}@domain.com"

In [30]:
upsert_df = spark.createDataFrame(entries)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

In [31]:
upsert_df.count()

8

In [32]:
upsert_df.createOrReplaceTempView("upsert_data")

# Upsert Strategy

## Slowly Changing Dimension (SCD) Type 1

In SCD Type 1, the existing records are overwritten with new data when there is a match, and new records are inserted when there is no match. This approach does not preserve historical changes; it simply updates the records with the latest data.
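The semantics can be illustrated without Spark. This is a plain-Python sketch of the overwrite-or-insert rule on dictionaries keyed by `id`; `scd1_upsert` is a hypothetical helper used only for illustration, not how Delta implements MERGE.

```python
def scd1_upsert(target: dict, updates: list, key: str = "id") -> dict:
    """Apply SCD Type 1 semantics: overwrite matching rows, insert new ones.

    `target` maps key -> row. Old values are not kept anywhere.
    """
    merged = {k: dict(row) for k, row in target.items()}
    for row in updates:
        # Matching key: new values win. Missing key: the row is inserted.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return merged


target = {"a1": {"id": "a1", "email": "old@example.com"}}
updates = [
    {"id": "a1", "email": "new@domain.com"},  # match: overwritten
    {"id": "b2", "email": "b2@domain.com"},   # no match: inserted
]
result = scd1_upsert(target, updates)
print(result["a1"]["email"])
print(sorted(result))
```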



### Upsert using SQL syntax

In [33]:
# spark.sql("""
#     MERGE INTO delta.accounts AS target
#     USING upsert_data AS source ON 
#         target.id = source.id
#     WHEN MATCHED THEN UPDATE SET
#         target.country_code = source.country_code,
#         target.email = source.email,
#         target.name = source.name,
#         target.iban = source.iban,
#         target.swift = source.swift,
#         target.passport = source.passport
#     WHEN NOT MATCHED THEN INSERT *
# """)

### Upsert using the Python API

In [34]:
from delta.tables import DeltaTable

# Load the Delta table
delta_table = DeltaTable.forName(spark, "delta.accounts")

# Perform the merge operation
delta_table.alias("target").merge(
    source=upsert_df.alias("upsert_data"),
    condition="target.id = upsert_data.id"
).whenMatchedUpdate(set={
    "country_code": "upsert_data.country_code",
    "email": "upsert_data.email",
    "name": "upsert_data.name",
    "iban": "upsert_data.iban",
    "swift": "upsert_data.swift",
    "passport": "upsert_data.passport"
}).whenNotMatchedInsertAll().execute()
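The `set` mapping in the merge above repeats the same pattern for every column, so it could also be generated from a column list. A small sketch, where `build_update_set` and `UPDATABLE_COLUMNS` are hypothetical names:

```python
def build_update_set(columns, source_alias="upsert_data"):
    """Map each target column to the matching source column expression."""
    return {column: f"{source_alias}.{column}" for column in columns}


UPDATABLE_COLUMNS = ["country_code", "email", "name", "iban", "swift", "passport"]
update_set = build_update_set(UPDATABLE_COLUMNS)
print(update_set["email"])
# The result could then be passed as: whenMatchedUpdate(set=update_set)
```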

In [35]:
spark.sql("DESCRIBE HISTORY delta.accounts").show(vertical=True, truncate=False)

-RECORD 0-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 version             | 1                                                                                                                                                                                                                                                                                                                               

## Delta metadata using `table_changes`

### Slowly Changing Dimension (SCD) Type 2

Key characteristics of SCD Type 2:

1. Preservation of history: SCD Type 2 retains historical changes by creating a new record for each change instead of overwriting the existing record.

2. Versioning: Each change is tracked with a version or timestamp, so the state of a record can be reconstructed at any point in time.

3. Pre- and post-images: The change data feed emits `update_preimage` and `update_postimage` rows that hold the state of a record before and after an update, which is the information an SCD Type 2 history needs.
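The change feed does not by itself turn the table into an SCD Type 2 dimension, but its rows can feed one. The plain-Python sketch below shows the idea on dictionaries; the `valid_from` and `valid_to` columns and the `apply_scd2` helper are assumptions made for illustration.

```python
def apply_scd2(history: list, change_rows: list, key: str = "id") -> list:
    """Build an SCD Type 2 history from change-feed-style rows.

    History rows carry `valid_from` / `valid_to` commit versions;
    `valid_to` is None for the current row of a key.
    """
    result = [dict(row) for row in history]
    for change in sorted(change_rows, key=lambda r: r["_commit_version"]):
        kind = change["_change_type"]
        if kind == "update_preimage":
            continue  # the post-image carries the new state
        version = change["_commit_version"]
        for row in result:
            if row[key] == change[key] and row["valid_to"] is None:
                row["valid_to"] = version  # close the current row
        if kind != "delete":
            new_row = {k: v for k, v in change.items() if not k.startswith("_")}
            new_row.update(valid_from=version, valid_to=None)
            result.append(new_row)
    return result


history = [{"id": "a1", "email": "old@example.com", "valid_from": 0, "valid_to": None}]
changes = [
    {"id": "a1", "email": "old@example.com",
     "_change_type": "update_preimage", "_commit_version": 1},
    {"id": "a1", "email": "new@domain.com",
     "_change_type": "update_postimage", "_commit_version": 1},
]
scd2_rows = apply_scd2(history, changes)
for row in scd2_rows:
    print(row)
```

The old row stays in the history with a closed validity range, which is what distinguishes this from the Type 1 overwrite.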

In [37]:
spark.sql("""
    SELECT 
        *
    FROM
        table_changes('delta.accounts', 1)
""").show(truncate=False)

+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+----------------+---------------+-----------------------+
|country_code|created_at|email                      |iban                  |id                                  |name            |passport |swift      |year|month|day|_change_type    |_commit_version|_commit_timestamp      |
+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+----------------+---------------+-----------------------+
|BR          |2024-12-09|derrick15@example.com      |GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-2ba767f86791|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|12   |9  |update_preimage |1              |2025-01-13 15:09:15.802|
|BR          |2024-12-09|cassidy.jones.md@domain.com|GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-

### Get only the changes in the last version

In [39]:
spark.sql("""
    SELECT 
        *
    FROM
        table_changes('delta.accounts', 1)
    WHERE
        _change_type IN ('insert', 'update_postimage')
""").show(truncate=False)

+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+----------------+---------------+-----------------------+
|country_code|created_at|email                      |iban                  |id                                  |name            |passport |swift      |year|month|day|_change_type    |_commit_version|_commit_timestamp      |
+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+----------------+---------------+-----------------------+
|BR          |2024-12-09|cassidy.jones.md@domain.com|GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-2ba767f86791|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|12   |9  |update_postimage|1              |2025-01-13 15:09:15.802|
|GE          |2024-12-10|kara.thomas@domain.com     |GB02LAAF80272115976869|4cbbf121-caae-42aa-8508-
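When the starting version is older than the latest commit, the same `id` can appear once per commit that touched it. A possible follow-up step is to keep only the most recent row per key, sketched here in plain Python on dictionaries shaped like the rows above; `latest_change_per_key` is a hypothetical helper.

```python
def latest_change_per_key(change_rows, key="id"):
    """Keep only the row with the highest _commit_version for each key."""
    latest = {}
    for row in change_rows:
        current = latest.get(row[key])
        if current is None or row["_commit_version"] > current["_commit_version"]:
            latest[row[key]] = row
    return list(latest.values())


rows = [
    {"id": "a1", "email": "first@domain.com", "_commit_version": 1},
    {"id": "a1", "email": "second@domain.com", "_commit_version": 2},
    {"id": "b2", "email": "b2@domain.com", "_commit_version": 1},
]
deduplicated = latest_change_per_key(rows)
for row in deduplicated:
    print(row["id"], row["email"], row["_commit_version"])
```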

### Version 0 of the data

In [40]:
spark.sql("""
    SELECT 
        *
    FROM
        delta.accounts version as of 0
    WHERE
        id = '0daad7bc-25b6-4469-8a2f-2ba767f86791'
""").show(truncate=False)

+------------+----------+---------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|country_code|created_at|email                |iban                  |id                                  |name            |passport |swift      |year|month|day|
+------------+----------+---------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|BR          |2024-12-09|derrick15@example.com|GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-2ba767f86791|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|12   |9  |
+------------+----------+---------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+



                                                                                

### Version 1 of the data

In [42]:
spark.sql("""
    SELECT 
        *
    FROM
        delta.accounts version as of 1
    WHERE
        id = '0daad7bc-25b6-4469-8a2f-2ba767f86791'
""").show(truncate=False)

+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|country_code|created_at|email                      |iban                  |id                                  |name            |passport |swift      |year|month|day|
+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|BR          |2024-12-09|cassidy.jones.md@domain.com|GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-2ba767f86791|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|12   |9  |
+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
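The same snapshots can be read through the DataFrame reader with the `versionAsOf` or `timestampAsOf` options. The sketch below only builds the option dictionary so it runs without a Spark session; `time_travel_options` is a hypothetical helper.

```python
def time_travel_options(version=None, timestamp=None):
    """Return DataFrameReader options for reading an older snapshot.

    Exactly one of `version` or `timestamp` must be given.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass either version or timestamp, not both or neither")
    if version is not None:
        return {"versionAsOf": str(version)}
    return {"timestampAsOf": timestamp}


print(time_travel_options(version=0))
# With an active session and the table path defined earlier (not executed here):
# spark.read.format("delta").options(**time_travel_options(version=0)).load(LOCAL_ACCOUNT_TABLE)
```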



## TODO: Optimize commands

## TODO: Liquid clustering

## TODO: Miscellaneous on delta