# POC: Incremental reads on Iceberg without Hive Metastore, integrated with S3

Using a local metastore with incremental features

# Use cases
Change Data Feed is not enabled by default. The following use cases should drive when you enable the change data feed.

1. Silver and Gold tables: Improve Delta performance by processing only row-level changes following initial MERGE, UPDATE, or DELETE operations to accelerate and simplify ETL and ELT operations.

2. Transmit changes: Send a change data feed to downstream systems such as Kafka or an RDBMS, which can use it for incremental processing in later stages of data pipelines.

3. Audit trail table: Capturing the change data feed as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.



# Known constraints

- Snapshots are subject to a retention policy. If you run maintenance commands that expire snapshots, the change data derived from those snapshots is deleted as well. The default retention is 5 days according to the documentation. [Reference](https://www.tabular.io/apache-iceberg-cookbook/data-operations-snapshot-expiration/)
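
To make the retention constraint concrete, the sketch below checks which snapshots would fall outside a 5-day window. It is plain Python and not an Iceberg API: `is_expired`, `MAX_SNAPSHOT_AGE`, and the sample timestamps are hypothetical names and values chosen for illustration, and the 5-day figure is simply the default quoted above.

```python
from datetime import datetime, timedelta

# Assumption: 5 days, the default retention quoted above.
MAX_SNAPSHOT_AGE = timedelta(days=5)


def is_expired(committed_at: datetime, now: datetime,
               max_age: timedelta = MAX_SNAPSHOT_AGE) -> bool:
    """Return True when a snapshot is older than the retention window."""
    return now - committed_at > max_age


# Hypothetical "current time" and commit timestamps.
now = datetime(2025, 1, 20, 12, 0, 0)
snapshots = {
    "old": datetime(2025, 1, 14, 15, 45, 56),    # older than 5 days
    "recent": datetime(2025, 1, 18, 9, 0, 0),    # younger than 5 days
}

expired = [name for name, ts in snapshots.items() if is_expired(ts, now)]
print(expired)
```

Any changelog that needs a snapshot from the `expired` list as its start or end point can no longer be computed once that snapshot is gone, so incremental consumers should read within the retention window.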

## Before running

Check that Spark 3.5.3 and Hadoop are installed,
along with the pyiceberg>=0.8.1 and pyspark>=3.5.3 libraries.

Remove the /warehouse/ folder in this directory if it exists.
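
The checks above can be sketched with the standard library only. The helper names `installed_version` and `remove_warehouse` are hypothetical, and the relative `warehouse` path assumes the notebook runs from the directory that holds that folder.

```python
import shutil
from importlib import metadata
from pathlib import Path


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def remove_warehouse(path: Path) -> bool:
    """Delete the warehouse folder if it exists; return True when it was removed."""
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


# Report the library versions required by this notebook.
for package in ("pyspark", "pyiceberg"):
    print(package, installed_version(package) or "not installed")

# Assumption: ./warehouse is the local folder mentioned above.
print("warehouse removed:", remove_warehouse(Path("warehouse")))
```

This only reports versions; comparing them against the minimum versions listed above is left to the reader.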

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

### Spark Setup

In [2]:
spark_jar_packages = ",".join([
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1",
    "org.apache.hadoop:hadoop-aws:3.3.4",
    "com.amazonaws:aws-java-sdk-bundle:1.12.262",
])

In [3]:
LOCAL_WAREHOUSE_CATALOG = "s3a://lakehouse-raw/"

In [4]:
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("iceberg-hive-playground")
    .config("spark.jars.packages", spark_jar_packages)

    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")

    # local catalog
    .config("spark.sql.catalog.local.type", "hadoop")   # Use Hadoop catalog
    .config("spark.sql.catalog.local.warehouse", LOCAL_WAREHOUSE_CATALOG)   # Path to store metadata
    .config("spark.sql.warehouse.dir", LOCAL_WAREHOUSE_CATALOG)   # Path to store metadata
    .config("spark.sql.defaultCatalog", "local")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")

    # S3 (MinIO Integration)
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9010")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.region", "us-east-1")
    .getOrCreate()
)

25/01/14 15:44:45 WARN Utils: Your hostname, baptvit resolves to a loopback address: 127.0.1.1; using 192.168.2.129 instead (on interface wlp4s0)
25/01/14 15:44:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /home/baptvit/.ivy2/cache
The jars for the packages stored in: /home/baptvit/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-8a390678-8f84-408d-8426-bb1c8c5e0bd4;1.0
	confs: [default]
	found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.7.1 in central


:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
:: resolution report :: resolve 127ms :: artifacts dl 4ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.4 from central in [default]
	org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.7.1 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   4   |   0   |   0   |   0   ||   4   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apach

## Creating a fake database

In [5]:
import random
from faker import Faker

In [6]:
def generate_entry(faker: Faker, country_codes: list):
    return {
        "id": faker.unique.uuid4(),
        "name":  faker.name(),
        "email": faker.email(),
        "passport": faker.passport_number(),
        "country_code": random.choice(country_codes),
        "iban": faker.iban(),
        "swift": faker.swift11(),
        "created_at": faker.past_date(start_date='-90d').strftime('%Y-%m-%d')
    }

In [7]:
def generate_dataset(num: int, seed: int):
    country_codes = ['US', 'CA', 'JP', 'KR', 'FR', 'GE', 'UK', 'BR', 'AR']
    Faker.seed(seed)
    faker = Faker()
    return [generate_entry(faker, country_codes) for _ in range(num)]

In [8]:
dataset = generate_dataset(num=100, seed=739)

In [9]:
df = spark.createDataFrame(dataset)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

25/01/14 15:44:49 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


In [10]:
df.show(1)

                                                                                

+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
|country_code|created_at|               email|                iban|                  id|       name| passport|      swift|year|month|day|
+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
|          BR|2024-10-21|powelljason@examp...|GB77AKMZ560580635...|5a424412-b127-4f8...|Cody Taylor|895549199|INSEGB5PR6S|2024|   10| 21|
+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
only showing top 1 row



In [11]:
df.count()

100

In [12]:
df.createOrReplaceTempView("accounts")

25/01/14 15:44:58 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


## Save using the default database

In [13]:
spark.sql("""
    CREATE DATABASE IF NOT EXISTS iceberg;
""")

DataFrame[]

Ideally we would specify the table schema up front to avoid relying on **overwriteSchema=true**

````
from pyspark.sql.types import DoubleType, FloatType, LongType, StructType,StructField, StringType
schema = StructType([
  StructField("vendor_id", LongType(), True),
  StructField("trip_id", LongType(), True),
  StructField("trip_distance", FloatType(), True),
  StructField("fare_amount", DoubleType(), True),
  StructField("store_and_fwd_flag", StringType(), True)
])

df = spark.createDataFrame([], schema)
df.writeTo("demo.nyc.taxis").create()
````

In [14]:
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS iceberg.accounts
        USING iceberg
    """
)

DataFrame[]

In [15]:
df.writeTo("iceberg.accounts")\
    .tableProperty("changelog.enabled", "true")\
    .tableProperty("overwriteSchema", "true")\
    .partitionedBy("year", "month")\
    .createOrReplace()  # First run: create or replace the table

## Reading directly from the table path

In [16]:
LOCAL_ACCOUNT_TABLE = LOCAL_WAREHOUSE_CATALOG + "iceberg/accounts"

In [17]:
LOCAL_ACCOUNT_TABLE

's3a://lakehouse-raw/iceberg/accounts'

In [18]:
# Load the current table state directly from its path
df_read = spark.read.format("iceberg") \
  .load(LOCAL_ACCOUNT_TABLE)

In [19]:
df_read.count()

100

## Reading the table history from local folder

Local folder and Spark SQL

In [20]:
spark.sql("""
    CREATE DATABASE IF NOT EXISTS iceberg;
""")

DataFrame[]

In [21]:
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS iceberg.accounts
        USING iceberg
        OPTIONS (
          path = '{LOCAL_ACCOUNT_TABLE}'
        );
    """
)

DataFrame[]

In [22]:
spark.sql("""
    SELECT *
    FROM iceberg.accounts.history
""").show(truncate=False)


+-----------------------+-------------------+---------+-------------------+
|made_current_at        |snapshot_id        |parent_id|is_current_ancestor|
+-----------------------+-------------------+---------+-------------------+
|2025-01-14 15:45:56.026|3427546640704181193|NULL     |true               |
+-----------------------+-------------------+---------+-------------------+



## Using CDC and table history to identify the increments

### Local folder with describe history

In [23]:
latest_snapshot_id = spark.sql("""
    SELECT snapshot_id
    FROM iceberg.accounts.snapshots
    ORDER BY committed_at DESC
    LIMIT 1;
""").collect()[0][0]

latest_snapshot_id

3427546640704181193

### Reading just the last table version using local catalog

In [24]:
latest_changes_df = spark.read.format("iceberg") \
    .option("snapshot-id", latest_snapshot_id) \
    .load("iceberg.accounts")

latest_changes_df.count()

100

In [25]:
latest_changes_df.show(5)

+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+
|country_code|created_at|               email|                iban|                  id|            name| passport|      swift|year|month|day|
+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+
|          JP|2024-12-09|brightthomas@exam...|GB56NVYS608859440...|189b84f0-9527-45e...|  Brittany Heath|H58091059|BEBQGBVOSLL|2024|   12|  9|
|          US|2024-12-16|donaldpierce@exam...|GB55LFTZ500270831...|b7e33adb-9bfe-465...|        Ann Cruz|T22953641|HMAQGBCSXE8|2024|   12| 16|
|          KR|2024-12-09|harrisondeanna@ex...|GB96EZYO163067760...|bef2df38-4a7a-4b7...| Destiny Jimenez|F75210547|WSHOGBQ55I9|2024|   12|  9|
|          JP|2024-12-10|derrick15@example...|GB14AYNQ551881503...|0daad7bc-25b6-446...|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|   12| 10|

## Creating the upsert

### Upsert Dataset

Editing 4 existing records and adding 4 new records

In [26]:
entries = [
    # Existing entries
    dataset[2], 
    dataset[4], 
    dataset[7],
    dataset[11],
    # New entries
    *generate_dataset(4, seed=1037)
]

In [27]:
for entry in entries:
    username = entry['name'].lower().replace(" ", ".")
    entry['email'] = f"{username}@domain.com"

In [28]:
upsert_df = spark.createDataFrame(entries)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

In [29]:
upsert_df.show(8, truncate=False)

+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|country_code|created_at|email                      |iban                  |id                                  |name            |passport |swift      |year|month|day|
+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|US          |2024-12-16|ann.cruz@domain.com        |GB55LFTZ50027083194346|b7e33adb-9bfe-465f-a533-1d57f8d9c9f6|Ann Cruz        |T22953641|HMAQGBCSXE8|2024|12   |16 |
|JP          |2024-12-10|cassidy.jones.md@domain.com|GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-2ba767f86791|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|12   |10 |
|AR          |2024-12-11|kara.thomas@domain.com     |GB02LAAF80272115976869|4cbbf121-caae-42aa-8508-3fd99bb2f762|Kara Thomas     |661814813|DULPGBWLTDU|2024|12 

In [31]:
upsert_df.count()

8

In [32]:
upsert_df.createOrReplaceTempView("upsert_data")

# Upsert Strategy

## Slowly Changing Dimension (SCD) Type 1

In SCD Type 1, the existing records are overwritten with new data when there is a match, and new records are inserted when there is no match. This approach does not preserve historical changes; it simply updates the records with the latest data.
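
As a minimal illustration of the SCD Type 1 rule described above, the sketch below applies it to plain Python dictionaries keyed by `id`. The function `scd1_upsert` and the sample rows are hypothetical; this is not how Iceberg executes a merge, only the overwrite-or-insert logic in miniature.

```python
def scd1_upsert(target: dict, source: list) -> dict:
    """Overwrite rows that match on id and insert the rest; no history is kept."""
    merged = dict(target)  # shallow copy so the input mapping is not modified
    for row in source:
        existing = merged.get(row["id"], {})
        merged[row["id"]] = {**existing, **row}
    return merged


# Hypothetical table state and incoming changes.
target = {
    1: {"id": 1, "email": "old@example.com"},
    2: {"id": 2, "email": "keep@example.com"},
}
source = [
    {"id": 1, "email": "new@domain.com"},     # match: overwritten
    {"id": 3, "email": "added@domain.com"},   # no match: inserted
]

result = scd1_upsert(target, source)
print(result[1]["email"])
print(len(result))
```

After the upsert the previous email for `id` 1 is no longer present in `result`, which is exactly the loss of history that distinguishes Type 1 from Type 2.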


### Using upsert with SQL-like syntax

In [33]:
spark.sql("""
    MERGE INTO iceberg.accounts AS target
    USING upsert_data AS source ON 
        target.id = source.id
    WHEN MATCHED THEN 
        UPDATE SET
            target.country_code = source.country_code,
            target.email = source.email,
            target.name = source.name,
            target.iban = source.iban,
            target.swift = source.swift,
            target.passport = source.passport
    WHEN NOT MATCHED THEN 
        INSERT *
""")

DataFrame[]

**However, this triggers the creation of a new snapshot with SCD Type 1 semantics, so the increments cannot be isolated natively.**

## Slowly Changing Dimension (SCD) Type 2

Slowly Changing Dimension Type 2 (SCD Type 2) is a method used in data warehousing to track historical changes in dimension data over time. Unlike SCD Type 1, which overwrites old data with new data, SCD Type 2 preserves the full history of changes by creating new records for each change. This allows you to analyze how data has evolved over time.
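
A minimal sketch of the SCD Type 2 idea on plain Python data follows. The column names `valid_from`, `valid_to`, and `is_current` are a common convention assumed here for illustration (they are not columns of the `accounts` table), and `scd2_apply` is a hypothetical helper.

```python
from datetime import date


def scd2_apply(history: list, change: dict, effective: date) -> list:
    """Close the current version of the changed row, then append the new version."""
    updated = []
    for row in history:
        if row["id"] == change["id"] and row["is_current"]:
            # Keep the old version, but mark it as no longer current.
            row = {**row, "valid_to": effective, "is_current": False}
        updated.append(row)
    updated.append(
        {**change, "valid_from": effective, "valid_to": None, "is_current": True}
    )
    return updated


# Hypothetical dimension history with a single current row.
history = [
    {"id": 1, "email": "old@example.com",
     "valid_from": date(2024, 12, 1), "valid_to": None, "is_current": True},
]
history = scd2_apply(history, {"id": 1, "email": "new@domain.com"}, date(2025, 1, 14))

for row in history:
    print(row["email"], row["valid_from"], row["valid_to"], row["is_current"])
```

Both versions of the row survive, so a query can reconstruct the state of the dimension at any past date by filtering on the validity interval.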



### Using append in the Python library

Not available (the cells below were not executed).

In [None]:
upsert_df.writeTo("iceberg_db.accounts")\
    .tableProperty("changelog.enabled", "true")\
    .tableProperty("overwriteSchema", "true")\
    .partitionedBy("year", "month")\
    .append()

In [None]:
spark.sql("""
    SELECT *
    FROM iceberg_db.accounts.history
""").show(vertical=True, truncate=False)

In [None]:
spark.sql("""
    SELECT *
    FROM iceberg_db.accounts.snapshots
""").show(vertical=True, truncate=False)

## Reading incrementals

Iceberg does not keep the increments isolated. Each snapshot is a consolidated view of the most recent version of the table.

In [34]:
spark.sql("""
    SELECT snapshot_id, parent_id
    FROM iceberg.accounts.snapshots
    ORDER BY committed_at DESC;
""").show()

+-------------------+-------------------+
|        snapshot_id|          parent_id|
+-------------------+-------------------+
|8762264941598402107|3427546640704181193|
|3427546640704181193|               NULL|
+-------------------+-------------------+
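
Each snapshot records its parent, so the rows above form a chain from the latest snapshot back to the first one. The sketch below walks that chain in plain Python; `lineage` is a hypothetical helper and the mapping is copied from the query output above.

```python
def lineage(snapshots: dict, start_id: int) -> list:
    """Follow parent_id links from start_id back to the first snapshot."""
    chain = []
    current = start_id
    while current is not None:
        chain.append(current)
        current = snapshots[current]
    return chain


# snapshot_id -> parent_id, as returned by the snapshots metadata table above.
snapshots = {
    8762264941598402107: 3427546640704181193,
    3427546640704181193: None,
}

print(lineage(snapshots, 8762264941598402107))
```

Consecutive pairs in the returned list are the (latest, parent) boundaries that an incremental consumer would use as end and start snapshot IDs.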



### Reading the last changes

In [35]:
latest_snapshot_id, parent_snapshot_id = spark.sql("""
    SELECT snapshot_id, parent_id
    FROM iceberg.accounts.snapshots
    ORDER BY committed_at DESC
    LIMIT 1;
""").collect()[0]

latest_snapshot_id

8762264941598402107

In [36]:
parent_snapshot_id

3427546640704181193

In [37]:
last_snapshot = spark.read.format("iceberg") \
    .option("snapshot-id", latest_snapshot_id) \
    .load("iceberg.accounts")

last_snapshot.count()

104

In [38]:
increment = spark.read.format("iceberg") \
    .option("start-snapshot-id", parent_snapshot_id) \
    .option("end-snapshot-id", latest_snapshot_id) \
    .load("iceberg.accounts")

increment.count()

0

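### Create a changelog view for the Iceberg table (SCD Type 2)

To identify the increments between different snapshot IDs, we need to create a changelog view. The incremental read above returns 0 rows: per the Iceberg documentation, reads with `start-snapshot-id` and `end-snapshot-id` only cover data added by append operations, and the MERGE is presumably committed as an overwrite.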

In [39]:
spark.sql(f"""
CALL system.create_changelog_view(
  table => 'iceberg.accounts',
  options => map('start-snapshot-id', '{parent_snapshot_id}', 'end-snapshot-id', '{latest_snapshot_id}'),
  changelog_view => 'accounts_cdc'
)""")

25/01/14 15:52:25 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


DataFrame[changelog_view: string]

In [40]:
spark.sql(f"""
    SELECT * FROM accounts_cdc
    WHERE _commit_snapshot_id == {latest_snapshot_id} 
""").show()

+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+------------+---------------+-------------------+
|country_code|created_at|               email|                iban|                  id|            name| passport|      swift|year|month|day|_change_type|_change_ordinal|_commit_snapshot_id|
+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+------------+---------------+-------------------+
|          AR|2024-12-11|kara.thomas@domai...|GB02LAAF802721159...|4cbbf121-caae-42a...|     Kara Thomas|661814813|DULPGBWLTDU|2024|   12| 11|      INSERT|              0|8762264941598402107|
|          AR|2024-12-11|  tramos@example.com|GB02LAAF802721159...|4cbbf121-caae-42a...|     Kara Thomas|661814813|DULPGBWLTDU|2024|   12| 11|      DELETE|              0|8762264941598402107|
|          AR|2025-01-05|amber.gomez@dom

In [41]:
spark.sql(f"""
    SELECT * FROM accounts_cdc
    WHERE _change_type = 'INSERT'
""").count()

8
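
The count of 8 is consistent with the upsert: each of the 4 edited rows shows up as a DELETE of the old version plus an INSERT of the new one (as with the Kara Thomas rows above), and the 4 new rows are plain INSERTs. The sketch below reproduces that arithmetic with a row-level diff in plain Python. It is a conceptual illustration only, not how Iceberg computes the changelog; `row_level_changes` and the sample rows are hypothetical.

```python
def row_level_changes(before: dict, after: dict) -> list:
    """Compare two table states keyed by id and emit (change_type, row) pairs."""
    changes = []
    for key, row in before.items():
        if key not in after or after[key] != row:
            changes.append(("DELETE", row))
    for key, row in after.items():
        if key not in before or before[key] != row:
            changes.append(("INSERT", row))
    return changes


# Hypothetical miniature of the upsert: 4 edited rows and 4 new rows.
before = {i: {"id": i, "email": f"user{i}@example.com"} for i in range(12)}
after = dict(before)
for i in (2, 4, 7, 11):
    after[i] = {"id": i, "email": f"user{i}@domain.com"}
for i in range(100, 104):
    after[i] = {"id": i, "email": f"user{i}@domain.com"}

changes = row_level_changes(before, after)
inserts = [row for kind, row in changes if kind == "INSERT"]
deletes = [row for kind, row in changes if kind == "DELETE"]
print(len(inserts), len(deletes))
```

Filtering on `INSERT` alone therefore mixes new rows with the post-update versions of edited rows; telling them apart requires matching each INSERT against a DELETE with the same key.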

### Isolate the increments 

In [43]:
spark.sql(f"""
    SELECT * FROM accounts_cdc
    WHERE _commit_snapshot_id == {latest_snapshot_id} AND _change_type = 'INSERT'
""").show()

+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+------------+---------------+-------------------+
|country_code|created_at|               email|                iban|                  id|            name| passport|      swift|year|month|day|_change_type|_change_ordinal|_commit_snapshot_id|
+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+------------+---------------+-------------------+
|          AR|2024-12-11|kara.thomas@domai...|GB02LAAF802721159...|4cbbf121-caae-42a...|     Kara Thomas|661814813|DULPGBWLTDU|2024|   12| 11|      INSERT|              0|8762264941598402107|
|          AR|2025-01-05|amber.gomez@domai...|GB14JAMN326528523...|7e1d1d6c-ac3b-478...|     Amber Gomez|V40815604|OEDJGBV2XYE|2025|    1|  5|      INSERT|              0|8762264941598402107|
|          BR|2025-01-10|joseph.arellano

### Create a changelog view for the Iceberg table (SCD Type 2) with UPDATE_BEFORE and UPDATE_AFTER

This feature can help achieve SCD Type 3 in Iceberg tables.

In [44]:
spark.sql(f"""
CALL system.create_changelog_view(
  table => 'iceberg.accounts',
  options => map('start-snapshot-id', '{parent_snapshot_id}', 'end-snapshot-id', '{latest_snapshot_id}'),
  changelog_view => 'accounts_cdc_sdc3',
  compute_updates => true,
  identifier_columns => array('id')
)""")

DataFrame[changelog_view: string]

In [45]:
spark.sql(f"""
    SELECT * FROM accounts_cdc_sdc3
    WHERE _commit_snapshot_id == {latest_snapshot_id}
""").show()

+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+-------------+---------------+-------------------+
|country_code|created_at|               email|                iban|                  id|            name| passport|      swift|year|month|day| _change_type|_change_ordinal|_commit_snapshot_id|
+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+-------------+---------------+-------------------+
|          JP|2024-12-10|derrick15@example...|GB14AYNQ551881503...|0daad7bc-25b6-446...|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|   12| 10|UPDATE_BEFORE|              0|8762264941598402107|
|          JP|2024-12-10|cassidy.jones.md@...|GB14AYNQ551881503...|0daad7bc-25b6-446...|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|   12| 10| UPDATE_AFTER|              0|8762264941598402107|
|          CA|2024-11-08|david.flyn
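
The effect of `compute_updates` with `identifier_columns` can be pictured as relabelling DELETE and INSERT rows that share the same identifier. The sketch below shows that relabelling on plain Python tuples; it is a conceptual illustration under that assumption, not Iceberg's implementation, and the helper name and sample rows are hypothetical.

```python
from collections import defaultdict


def compute_updates(changes: list, identifier: str = "id") -> list:
    """Relabel DELETE/INSERT pairs sharing an identifier as UPDATE_BEFORE/UPDATE_AFTER."""
    kinds_by_key = defaultdict(set)
    for kind, row in changes:
        kinds_by_key[row[identifier]].add(kind)

    relabel = {"DELETE": "UPDATE_BEFORE", "INSERT": "UPDATE_AFTER"}
    result = []
    for kind, row in changes:
        if kinds_by_key[row[identifier]] == {"DELETE", "INSERT"}:
            kind = relabel[kind]
        result.append((kind, row))
    return result


# Hypothetical changelog rows: one edited record and one new record.
changes = [
    ("DELETE", {"id": "k1", "email": "old@example.com"}),
    ("INSERT", {"id": "k1", "email": "new@domain.com"}),
    ("INSERT", {"id": "k2", "email": "brand.new@domain.com"}),
]

for kind, row in compute_updates(changes):
    print(kind, row["id"], row["email"])
```

Rows whose identifier appears only once keep their original change type, which is how genuinely new records stay distinguishable from edits.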

In [46]:
from pyspark.sql.functions import lit, when

cdc_data = spark.sql(f"""
    SELECT * FROM accounts_cdc_sdc3
    WHERE _commit_snapshot_id == {latest_snapshot_id}
""")

# Add previous value columns for SCD Type 3
scd3_data = cdc_data.withColumn("previous_email", lit(None)) \
                    .withColumn("previous_country_code", lit(None))

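# For UPDATE_AFTER rows, fill the previous_* columns.
# Note: this copies the post-update value from the same row; the pre-update
# value lives on the matching UPDATE_BEFORE row, which is filtered out below.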
scd3_data = scd3_data.withColumn(
    "previous_email",
    when(col("_change_type") == "UPDATE_AFTER", col("email")).otherwise(col("previous_email"))
).withColumn(
    "previous_country_code",
    when(col("_change_type") == "UPDATE_AFTER", col("country_code")).otherwise(col("previous_country_code"))
)

# Filter out UPDATE_BEFORE rows (since we only need the latest state)
scd3_data = scd3_data.filter(col("_change_type") != "UPDATE_BEFORE")

# Show the SCD Type 3 result
scd3_data.select(
    "id", "name", "email", "previous_email", "country_code", "previous_country_code",
    "created_at", "iban", "passport", "swift", "year", "month", "day"
).show()

+--------------------+----------------+--------------------+--------------------+------------+---------------------+----------+--------------------+---------+-----------+----+-----+---+
|                  id|            name|               email|      previous_email|country_code|previous_country_code|created_at|                iban| passport|      swift|year|month|day|
+--------------------+----------------+--------------------+--------------------+------------+---------------------+----------+--------------------+---------+-----------+----+-----+---+
|0daad7bc-25b6-446...|Cassidy Jones MD|cassidy.jones.md@...|cassidy.jones.md@...|          JP|                   JP|2024-12-10|GB14AYNQ551881503...|595954695|VTHYGBZMNOI|2024|   12| 10|
|3b07d43d-e15f-482...|     David Flynn|david.flynn@domai...|                NULL|          CA|                 NULL|2024-11-08|GB02WPAU755046197...|481064071|MKYOGBA7ENY|2024|   11|  8|
|4cbbf121-caae-42a...|     Kara Thomas|kara.thomas@domai...|kara.thoma
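
In the output above, `previous_email` equals `email` for updated rows, because the `when(...)` expressions read the value from the UPDATE_AFTER row itself. For SCD Type 3 the previous value has to come from the matching UPDATE_BEFORE row. The sketch below shows that pairing on plain Python data; in Spark the same idea could be expressed as a join between the UPDATE_AFTER and UPDATE_BEFORE rows on `id`. The helper `to_scd3` and the sample rows are hypothetical.

```python
def to_scd3(changes: list, tracked: tuple = ("email", "country_code")) -> list:
    """Build SCD Type 3 rows: current values plus previous_<column> for tracked columns."""
    before_by_id = {
        row["id"]: row for kind, row in changes if kind == "UPDATE_BEFORE"
    }
    result = []
    for kind, row in changes:
        if kind == "UPDATE_BEFORE":
            continue  # only the latest state is kept as a row
        previous = before_by_id.get(row["id"], {})
        out = dict(row)
        for column in tracked:
            # None when the row has no UPDATE_BEFORE counterpart (plain insert)
            out[f"previous_{column}"] = previous.get(column)
        result.append(out)
    return result


# Hypothetical changelog rows: one edited record and one new record.
changes = [
    ("UPDATE_BEFORE", {"id": "k1", "email": "old@example.com", "country_code": "AR"}),
    ("UPDATE_AFTER", {"id": "k1", "email": "new@domain.com", "country_code": "AR"}),
    ("INSERT", {"id": "k2", "email": "brand.new@domain.com", "country_code": "CA"}),
]

rows = to_scd3(changes)
for row in rows:
    print(row["id"], row["email"], row["previous_email"])
```

With this pairing the updated row carries both the new and the old email, while the inserted row has no previous value.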

## TODO: Optimize commands

## TODO: Miscellaneous on Iceberg