# POC: Incremental reads on Hudi without hive metastore

Using a local metastore with incremental features.

# Use cases
Change Data Feed is not enabled by default. The following use cases should drive when you enable the change data feed.

1. Silver and Gold tables: Improve Delta performance by processing only row-level changes following initial MERGE, UPDATE, or DELETE operations to accelerate and simplify ETL and ELT operations.

2. Transmit changes: Send a change data feed to downstream systems such as Kafka or RDBMS that can use it to incrementally process in later stages of data pipelines.

3. Audit trail table: Capturing the change data feed as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.



# Known constraints:

- On versions lower than 0.15.0, be aware of the commit retention policy. By default, Hudi keeps the last **10 commits** of a table. [reference](https://hudi.apache.org/docs/0.15.0/hoodie_cleaner/)
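
As a hedged sketch of how the retention window could be widened for this POC, the snippet below builds extra writer options from the cleaner settings `hoodie.cleaner.policy` and `hoodie.cleaner.commits.retained`. The helper name `cleaner_options` and the value 15 are illustrative assumptions and are not part of the original notebook; check the cleaner reference above for the Hudi version in use.

```python
def cleaner_options(commits_retained: int = 10) -> dict:
    """Build Hudi cleaner options as a plain dict (hypothetical helper)."""
    if commits_retained < 1:
        raise ValueError("commits_retained must be >= 1")
    return {
        # Keep files needed by the last N commits
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(commits_retained),
    }


# Illustrative value only; pick one that covers your incremental read lag.
opts = cleaner_options(15)
print(opts)
```

These options could then be merged into a write with `df.write.format("hudi").options(**opts)`. An incremental read that starts before the oldest retained commit may not find the file versions it needs, which is why the retention window matters here.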

## Before running

Check that Spark 3.5.3 and Hadoop are installed,
along with the pyspark>=3.5.3 libraries.

Remove the /warehouse/ folder in this directory if it exists.
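
The checks above could be scripted roughly as follows. This is a minimal sketch: the helper names `version_tuple`, `installed_pyspark_ok` and `remove_warehouse` are hypothetical, and the relative folder name `warehouse` is an assumption based on the note above.

```python
import re
import shutil
from importlib import metadata
from pathlib import Path


def version_tuple(version: str) -> tuple:
    """Turn '3.5.3' into (3, 5, 3), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def installed_pyspark_ok(minimum: str = "3.5.3") -> bool:
    """Return True when pyspark is installed at or above the minimum version."""
    try:
        return version_tuple(metadata.version("pyspark")) >= version_tuple(minimum)
    except metadata.PackageNotFoundError:
        return False


def remove_warehouse(path: str = "warehouse") -> bool:
    """Delete the local warehouse folder if present; return True if it was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


print("pyspark ok:", installed_pyspark_ok())
# Call remove_warehouse() explicitly before a fresh run; it is destructive.
```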

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

### Spark Setup

In [2]:
spark_jar_packages = ",".join([
    "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0",
    #"org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0",
    "org.apache.hadoop:hadoop-aws:3.3.4",
    "com.amazonaws:aws-java-sdk-bundle:1.12.262",
])

In [3]:
LOCAL_WAREHOUSE_CATALOG = "s3a://lakehouse-raw/"

In [4]:
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("hudi-local-playground")
    .config("spark.jars.packages", spark_jar_packages)

    # Hudi Integration
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")

    # Local catalog
    .config("spark.sql.catalog.local.type", "hadoop")   # Use Hadoop catalog
    .config("spark.sql.warehouse.dir", LOCAL_WAREHOUSE_CATALOG)   # Path to store metadata

    # S3 (MinIO Integration)
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9010")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.region", "us-east-1")
    .getOrCreate()
)


## Creating a fake dataset

In [5]:
import random
from faker import Faker

def generate_entry(faker: Faker, country_codes: list):
    return {
        "id": faker.unique.uuid4(),
        "name":  faker.name(),
        "email": faker.email(),
        "passport": faker.passport_number(),
        "country_code": random.choice(country_codes),
        "iban": faker.iban(),
        "swift": faker.swift11(),
        "created_at": faker.past_date(start_date='-90d').strftime('%Y-%m-%d')
    }

In [6]:
def generate_dataset(num: int, seed: int):
    country_codes = ['US', 'CA', 'JP', 'KR', 'FR', 'GE', 'UK', 'BR', 'AR']
    Faker.seed(seed)
    faker = Faker()
    return [generate_entry(faker, country_codes) for _ in range(num)]

In [7]:
dataset = generate_dataset(num=100, seed=739)

In [8]:
df = spark.createDataFrame(dataset)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

25/01/14 15:33:16 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


In [9]:
df.count()

                                                                                

100

In [10]:
df.show(1)

+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
|country_code|created_at|               email|                iban|                  id|       name| passport|      swift|year|month|day|
+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
|          JP|2024-10-21|powelljason@examp...|GB77AKMZ560580635...|5a424412-b127-4f8...|Cody Taylor|895549199|INSEGB5PR6S|2024|   10| 21|
+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
only showing top 1 row



## Save using the default database

Hudi tracks the changes automatically.

In [11]:
spark.sql("""
    CREATE DATABASE IF NOT EXISTS hudi
    LOCATION 's3a://lakehouse-raw/hudi/'
""")

DataFrame[]

25/01/14 15:33:24 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


In [12]:
df.write.format("hudi") \
    .option("hoodie.database.name", "hudi") \
    .option("hoodie.table.name", "accounts") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "created_at") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.hive_sync.partition_fields", "year,month") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .option("hoodie.datasource.write.hive_style_partitioning","true") \
    .partitionBy("year", "month") \
    .mode("append") \
    .save("s3a://lakehouse-raw/hudi/accounts") ## hudi needs the base full path

25/01/14 15:33:41 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
25/01/14 15:33:41 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
25/01/14 15:33:42 WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to hudi/accounts/.hoodie/metadata/files/.files-0000-0_00000000000000010.log.1_0-0-0. This is unsupported
25/01/14 15:33:42 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
                                                                                



25/01/14 15:33:45 WARN HoodieSparkSqlWriterInternal: Closing write client
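
Since the same writer options are reused for the later upsert, they could be collected in one place. The sketch below mirrors the option keys used in the write above; the helper name `hudi_write_options` is a hypothetical convenience, not a Hudi API.

```python
def hudi_write_options(database, table, record_key, precombine, partition_fields):
    """Return the Hudi writer options used in this notebook as a dict."""
    return {
        "hoodie.database.name": database,
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.hive_sync.partition_fields": ",".join(partition_fields),
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }


opts = hudi_write_options("hudi", "accounts", "id", "created_at", ["year", "month"])
print(opts["hoodie.datasource.hive_sync.partition_fields"])
```

With a live session this would be used as `df.write.format("hudi").options(**opts).partitionBy("year", "month").mode("append").save(path)`, which keeps the initial load and the upsert consistent.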


## Reading directly from the table files

In [15]:
LOCAL_ACCOUNT_TABLE = LOCAL_WAREHOUSE_CATALOG + "hudi/accounts"

In [16]:
# Snapshot read of the table path (no starting version is set here)
df_read = spark.read.format("hudi") \
  .load(LOCAL_ACCOUNT_TABLE)

In [17]:
df_read.show(1)

+-------------------+--------------------+--------------------+----------------------+--------------------+------------+----------+------------------+--------------------+--------------------+-------------+---------+-----------+---+----+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|country_code|created_at|             email|                iban|                  id|         name| passport|      swift|day|year|month|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------+----------+------------------+--------------------+--------------------+-------------+---------+-----------+---+----+-----+
|  20250114153341605|20250114153341605...|5674b1aa-0418-4cb...|    year=2024/month=12|86e35a4b-42eb-429...|          GE|2024-12-24|fmason@example.com|GB63XTWT081190002...|5674b1aa-0418-4cb...|Ryan Robinson|742550816|JRRTGBWUN5K| 24|2024|   12|
+-------------------+---

## Reading the table history from local folder

Local folder and Spark SQL

In [18]:
spark.sql(
    """
    CREATE DATABASE IF NOT EXISTS hudi;
    """
)

DataFrame[]

In [19]:
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS hudi.accounts
        USING hudi
        OPTIONS (
          path = '{LOCAL_ACCOUNT_TABLE}'
        );
    """
)

DataFrame[]

In [20]:
spark.sql("""
    call show_commits (
        table => 'hudi.accounts',
        from_commit => '10'
    )    
""").show(vertical=True, truncate=False)

-RECORD 0-----------------------------------------
 commit_time                  | 20250114153341605 
 state_transition_time        | 20250114153345583 
 action                       | commit            
 total_bytes_written          | 1760827           
 total_files_added            | 4                 
 total_files_updated          | 0                 
 total_partitions_written     | 4                 
 total_records_written        | 100               
 total_update_records_written | 0                 
 total_errors                 | 0                 



## Using CDC and table history to identify the increments

### Local folder with show_commits

In [21]:
last_commig_time = spark.sql("""
    call show_commits (
        table => 'hudi.accounts',
        from_commit => '0'
    )    
""").collect()[0][0]

last_commig_time

'20250114153341605'

### Reading just the last table version using local catalog

In [30]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", int(last_commig_time) - 1) \
    .load(LOCAL_ACCOUNT_TABLE).show()

+-------------------+--------------------+--------------------+----------------------+--------------------+------------+----------+--------------------+--------------------+--------------------+------------------+---------+-----------+----+-----+---+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|country_code|created_at|               email|                iban|                  id|              name| passport|      swift|year|month|day|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------+----------+--------------------+--------------------+--------------------+------------------+---------+-----------+----+-----+---+
|  20250114153341605|20250114153341605...|5674b1aa-0418-4cb...|    year=2024/month=12|86e35a4b-42eb-429...|          GE|2024-12-24|  fmason@example.com|GB63XTWT081190002...|5674b1aa-0418-4cb...|     Ryan Robinson|742550816|JRRTGBWUN5K|2024|   12| 

In [34]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", int(last_commig_time) - 1) \
    .load(LOCAL_ACCOUNT_TABLE).count()

100
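
The `int(...) - 1` arithmetic used for the begin instant can be wrapped in small helpers that keep the value as a fixed-width string. This is a sketch under two assumptions: instants are all-digit strings of equal width (as in the commit times shown in this notebook), and the begin instant is treated as exclusive, which is what subtracting 1 implies. The names `instant_before` and `latest_commit` are hypothetical.

```python
def instant_before(commit_time: str) -> str:
    """Return the instant one unit before commit_time, preserving its width."""
    if not commit_time.isdigit():
        raise ValueError(f"not a numeric instant: {commit_time!r}")
    return str(int(commit_time) - 1).zfill(len(commit_time))


def latest_commit(commit_times) -> str:
    """Pick the newest instant without relying on the row order of the result."""
    # Equal-width digit strings sort the same way lexically and numerically.
    return max(commit_times)


commits = ["20250114153341605"]
begin = instant_before(latest_commit(commits))
print(begin)  # 20250114153341604
```

Using `max` avoids depending on `collect()[0][0]` returning the newest commit in the first row.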

## Creating the upsert

### Upsert Dataset

Editing 4 existing records and adding 4 new records

In [35]:
entries = [
    # Existing entries
    dataset[2], 
    dataset[4], 
    dataset[7],
    dataset[11],
    # New entries
    *generate_dataset(4, seed=1037)
]

In [36]:
for entry in entries:
    username = entry['name'].lower().replace(" ", ".")
    entry['email'] = f"{username}@domain.com"

In [37]:
upsert_df = spark.createDataFrame(entries)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

In [38]:
upsert_df.show(8, truncate=False)

+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|country_code|created_at|email                      |iban                  |id                                  |name            |passport |swift      |year|month|day|
+------------+----------+---------------------------+----------------------+------------------------------------+----------------+---------+-----------+----+-----+---+
|AR          |2024-12-16|ann.cruz@domain.com        |GB55LFTZ50027083194346|b7e33adb-9bfe-465f-a533-1d57f8d9c9f6|Ann Cruz        |T22953641|HMAQGBCSXE8|2024|12   |16 |
|UK          |2024-12-10|cassidy.jones.md@domain.com|GB14AYNQ55188150393152|0daad7bc-25b6-4469-8a2f-2ba767f86791|Cassidy Jones MD|595954695|VTHYGBZMNOI|2024|12   |10 |
|FR          |2024-12-11|kara.thomas@domain.com     |GB02LAAF80272115976869|4cbbf121-caae-42aa-8508-3fd99bb2f762|Kara Thomas     |661814813|DULPGBWLTDU|2024|12 

In [39]:
upsert_df.createOrReplaceTempView("upsert_data")

# Upsert Strategy

## Slowly Changing Dimension (SCD) Type 1

In SCD Type 1, the existing records are overwritten with new data when there is a match, and new records are inserted when there is no match. This approach does not preserve historical changes; it simply updates the records with the latest data.


Hudi achieves SCD Type 1 upserts when the write uses **"hoodie.datasource.write.operation" set to "upsert"** together with **.mode("append")**.
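
To make the SCD Type 1 semantics concrete, here is a minimal pure-Python sketch of the overwrite-or-insert rule, keyed on `id`. It only illustrates the idea and is not how Hudi implements upserts; in particular it ignores the precombine field, and the function name `scd1_upsert` and the sample rows are made up for the example.

```python
def scd1_upsert(current: dict, incoming: list, key: str = "id") -> dict:
    """Overwrite rows whose key matches and insert the rest; no history is kept."""
    merged = dict(current)
    for row in incoming:
        merged[row[key]] = row
    return merged


table = {
    "a": {"id": "a", "email": "old@example.com"},
    "b": {"id": "b", "email": "b@example.com"},
}
changes = [
    {"id": "a", "email": "new@domain.com"},  # matches an existing key: overwritten
    {"id": "c", "email": "c@domain.com"},    # new key: inserted
]

result = scd1_upsert(table, changes)
print(len(result))           # 3
print(result["a"]["email"])  # new@domain.com
```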

In [41]:
upsert_df.write.format("hudi") \
    .option("hoodie.database.name", "hudi") \
    .option("hoodie.table.name", "accounts") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "created_at") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.hive_sync.partition_fields", "year,month") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .option("hoodie.datasource.write.hive_style_partitioning","true") \
    .partitionBy("year", "month") \
    .mode("append") \
    .save("s3a://lakehouse-raw/hudi/accounts") ## hudi needs the base full path

25/01/14 15:37:30 WARN HoodieSparkSqlWriterInternal: Closing write client


In [42]:
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS hudi.accounts
        USING hudi
        OPTIONS (
          path = '{LOCAL_ACCOUNT_TABLE}'
        );
    """
)

DataFrame[]

In [44]:
spark.sql("""
    call show_commits (
        table => 'hudi.accounts',
        from_commit => '2'
    )    
""").show(vertical=True, truncate=False)

-RECORD 0-----------------------------------------
 commit_time                  | 20250114153728648 
 state_transition_time        | 20250114153730233 
 action                       | commit            
 total_bytes_written          | 1321940           
 total_files_added            | 0                 
 total_files_updated          | 3                 
 total_partitions_written     | 3                 
 total_records_written        | 89                
 total_update_records_written | 4                 
 total_errors                 | 0                 
-RECORD 1-----------------------------------------
 commit_time                  | 20250114153341605 
 state_transition_time        | 20250114153345583 
 action                       | commit            
 total_bytes_written          | 1760827           
 total_files_added            | 4                 
 total_files_updated          | 0                 
 total_partitions_written     | 4                 
 total_records_written        |

### Reading the last changes

In [48]:
last_commig_time = spark.sql("""
    call show_commits (
        table => 'hudi.accounts',
        from_commit => '0'
    )    
""").collect()[0][0]

last_commig_time = int(last_commig_time) - 1

In [49]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_commig_time) \
    .load(LOCAL_ACCOUNT_TABLE).show()

+-------------------+--------------------+--------------------+----------------------+--------------------+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|country_code|created_at|               email|                iban|                  id|            name| passport|      swift|year|month|day|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------+----------+--------------------+--------------------+--------------------+----------------+---------+-----------+----+-----+---+
|  20250114153728648|20250114153728648...|4cbbf121-caae-42a...|    year=2024/month=12|86e35a4b-42eb-429...|          FR|2024-12-11|kara.thomas@domai...|GB02LAAF802721159...|4cbbf121-caae-42a...|     Kara Thomas|661814813|DULPGBWLTDU|2024|   12| 11|
|  2

In [50]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_commig_time) \
    .load(LOCAL_ACCOUNT_TABLE).count()

8
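
For repeated incremental runs, the last processed commit time has to be remembered between executions. One possible approach, sketched here with a small JSON file, is to store the newest commit that was read and use it as the next begin instant. The file name, the `last_commit_time` field and both helper names are assumptions for illustration; the default `"000"` follows the convention of reading from the beginning of the timeline, and the approach assumes the begin instant is exclusive.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path, default: str = "000") -> str:
    """Return the stored commit time, or the default when no checkpoint exists."""
    p = Path(path)
    if not p.exists():
        return default
    return json.loads(p.read_text())["last_commit_time"]


def save_checkpoint(path, commit_time: str) -> None:
    """Persist the newest commit time that has been fully processed."""
    Path(path).write_text(json.dumps({"last_commit_time": commit_time}))


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "accounts.checkpoint.json"
    first = load_checkpoint(ckpt)            # no file yet, falls back to "000"
    save_checkpoint(ckpt, "20250114153728648")
    second = load_checkpoint(ckpt)           # value written by the previous run

print(first, second)
```

The loaded value would be passed as `hoodie.datasource.read.begin.instanttime`, and the checkpoint saved only after the downstream processing of that batch succeeds.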

## TODO: Optimize commands

## TODO: Miscellaneous on Hudi