# POC: Incremental reads on Hudi without a Hive Metastore

Using a local metastore with Hudi's incremental features.

# Use cases
Change Data Feed is not enabled by default. The following use cases should drive when you enable the change data feed.

1. Silver and Gold tables: Improve Delta performance by processing only row-level changes following initial MERGE, UPDATE, or DELETE operations to accelerate and simplify ETL and ELT operations.

2. Transmit changes: Send a change data feed to downstream systems such as Kafka or RDBMS that can use it to incrementally process in later stages of data pipelines.

3. Audit trail table: Capturing the change data feed as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.



# Known constraints

- On versions lower than 0.15.0, be aware of the commit retention policy. By default, Hudi keeps the last **10 commits** of a table. [reference](https://hudi.apache.org/docs/0.15.0/hoodie_cleaner/)
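
The retention window can be widened at write time. Below is a minimal sketch, assuming the keys `hoodie.cleaner.commits.retained`, `hoodie.keep.min.commits`, and `hoodie.keep.max.commits` apply to the Hudi version in use. The helper name `retention_options` and the +10/+20 spacing are illustrative choices that only mirror the gap between the defaults.

```python
def retention_options(commits_retained: int) -> dict:
    """Build write options that keep `commits_retained` commits available."""
    if commits_retained < 1:
        raise ValueError("commits_retained must be >= 1")
    # Archival has to keep more commits than the cleaner retains.
    # The +10/+20 spacing is an assumption that copies the default gap (10/20/30).
    return {
        "hoodie.cleaner.commits.retained": str(commits_retained),
        "hoodie.keep.min.commits": str(commits_retained + 10),
        "hoodie.keep.max.commits": str(commits_retained + 20),
    }
```

The resulting dictionary can be passed to a Hudi write with `.options(**retention_options(50))`.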

## Before running

Check that Spark 3.5.3 and Hadoop are installed,
along with the pyspark>=3.5.3 library.

Remove the /warehouse/ folder in this directory if it exists.
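
The cleanup step can be scripted so the notebook always starts from an empty warehouse. This is a small sketch using only the standard library. The function name `reset_warehouse` is hypothetical, and it assumes the folder is named `warehouse` and sits under the directory you pass in.

```python
import shutil
from pathlib import Path


def reset_warehouse(base_dir: str = ".") -> bool:
    """Delete <base_dir>/warehouse if it exists. Return True when it was removed."""
    warehouse = Path(base_dir) / "warehouse"
    if warehouse.is_dir():
        shutil.rmtree(warehouse)  # removes the folder and everything below it
        return True
    return False
```

Calling `reset_warehouse()` from the notebook directory before creating the Spark session avoids reading stale commits from a previous run.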

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

### Spark Setup

In [2]:
spark_jar_packages = ",".join([
    #"org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0",
    "org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0",
])
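
The bundle coordinate encodes the Spark minor version, the Scala version, and the Hudi version. The sketch below builds it from those parts so that switching versions is a one-line change. The helper name `hudi_bundle_coordinate` is hypothetical, and it does not check whether a given combination is actually published.

```python
def hudi_bundle_coordinate(spark_version: str, hudi_version: str,
                           scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Hudi Spark bundle."""
    # Only the major.minor part of the Spark version appears in the artifact name.
    major, minor = spark_version.split(".")[:2]
    return (
        f"org.apache.hudi:hudi-spark{major}.{minor}-bundle_"
        f"{scala_version}:{hudi_version}"
    )
```

For example, `hudi_bundle_coordinate("3.5.3", "1.0.0")` yields the same coordinate used in the cell above.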

In [3]:
LOCAL_WAREHOUSE_CATALOG = "file:///home/baptvit/Documents/github/lakehouse-labs/notebooks/warehouse/hudi/"

In [4]:
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("hudi-local-playground")
    .config("spark.jars.packages", spark_jar_packages)

    # Hudi Spark SQL integration (serializer, session extension, catalog)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    
    .config("spark.sql.catalog.local.type", "hadoop")   # Use Hadoop catalog
    .config("spark.sql.warehouse.dir", LOCAL_WAREHOUSE_CATALOG)   # Path to store metadata
    .getOrCreate()
)


## Creating a fake database

In [5]:
import random
from faker import Faker

def generate_entry(faker: Faker, country_codes: list):
    return {
        "id": faker.unique.uuid4(),
        "name":  faker.name(),
        "email": faker.email(),
        "passport": faker.passport_number(),
        "country_code": random.choice(country_codes),
        "iban": faker.iban(),
        "swift": faker.swift11(),
        "created_at": faker.past_date(start_date='-90d').strftime('%Y-%m-%d')
    }

In [6]:
def generate_dataset(num: int, seed: int):
    country_codes = ['US', 'CA', 'JP', 'KR', 'FR', 'GE', 'UK', 'BR', 'AR']
    # Seed both Faker and the random module so country_code is reproducible too
    Faker.seed(seed)
    random.seed(seed)
    faker = Faker()
    return [generate_entry(faker, country_codes) for _ in range(num)]

In [7]:
dataset = generate_dataset(num=100, seed=739)

In [8]:
df = spark.createDataFrame(dataset)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))



In [9]:
df.count()

                                                                                

100

In [10]:
df.show(1)

+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
|country_code|created_at|               email|                iban|                  id|       name| passport|      swift|year|month|day|
+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
|          AR|2024-10-21|powelljason@examp...|GB77AKMZ560580635...|5a424412-b127-4f8...|Cody Taylor|895549199|INSEGB5PR6S|2024|   10| 21|
+------------+----------+--------------------+--------------------+--------------------+-----------+---------+-----------+----+-----+---+
only showing top 1 row





## Save using the default database

Hudi tracks changes automatically.

In [11]:
spark.sql("""
    CREATE DATABASE IF NOT EXISTS hudi_db;
""")

DataFrame[]

In [12]:
# Write to the same database, table, and path that the read cells below use
df.write.format("hudi") \
    .option("hoodie.database.name", "hudi_db") \
    .option("hoodie.table.name", "accounts") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "created_at") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.hive_sync.partition_fields", "year,month") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .option("hoodie.datasource.write.hive_style_partitioning","true") \
    .partitionBy("year", "month") \
    .mode("append") \
    .save(LOCAL_WAREHOUSE_CATALOG + "hudi_db.db/accounts")  # Hudi needs the full base path
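
The same write options are needed again for the upsert later in this notebook, so they can be collected in one place. This is a sketch only. The helper name `hudi_write_options` and its defaults are assumptions taken from the cell above, and it leaves out the `hive_sync` options because no Hive sync is configured in this POC.

```python
def hudi_write_options(database: str, table: str,
                       record_key: str = "id",
                       precombine: str = "created_at",
                       operation: str = "upsert") -> dict:
    """Collect the Hudi write options used in this notebook into one dict."""
    return {
        "hoodie.database.name": database,
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }
```

A write then becomes `df.write.format("hudi").options(**hudi_write_options(db, tbl)).partitionBy("year", "month").mode("append").save(table_path)`, where `db`, `tbl`, and `table_path` are placeholders for your own values.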





## Reading directly from the local path

In [None]:
LOCAL_ACCOUNT_TABLE = LOCAL_WAREHOUSE_CATALOG + "hudi_db.db/accounts"

In [None]:
# Snapshot read: load the latest state of the table from its base path
df_read = spark.read.format("hudi") \
  .load(LOCAL_ACCOUNT_TABLE)

In [None]:
df_read.show(1)

## Reading the table history from local folder

Register the table from the local folder and query its history with Spark SQL.

In [None]:
spark.sql(
    """
    CREATE DATABASE IF NOT EXISTS hudi_db;
    """
)

In [None]:
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS hudi_db.accounts
        USING hudi
        OPTIONS (
          path = '{LOCAL_ACCOUNT_TABLE}'
        );
    """
)

In [None]:
# show_commits takes `table` and an optional `limit` (number of commits to list)
spark.sql("""
    call show_commits (
        table => 'hudi_db.accounts',
        limit => 10
    )    
""").show(vertical=True, truncate=False)

## Using CDC and table history to identify the increments

### Local folder with show_commits

In [None]:
# commit_time is the first column of the first row returned by show_commits
last_commit_time = spark.sql("""
    call show_commits (
        table => 'hudi_db.accounts',
        limit => 1
    )    
""").collect()[0][0]

last_commit_time

### Reading just the last table version using local catalog

In [None]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time) \
    .load(LOCAL_ACCOUNT_TABLE).show()

In [None]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time) \
    .load(LOCAL_ACCOUNT_TABLE).count()
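
The incremental read options above can be wrapped in a helper that also accepts an upper bound. This is a sketch, assuming `hoodie.datasource.read.end.instanttime` is available in the Hudi version in use. The function name `incremental_read_options` is hypothetical. Whether the begin instant itself is included in the result is worth verifying for your version, for example with a `count()`.

```python
from typing import Optional


def incremental_read_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> dict:
    """Build the read options for a Hudi incremental query."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    # Only bound the range from above when an end instant is given.
    if end_instant is not None:
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options
```

It would be used as `spark.read.format("hudi").options(**incremental_read_options(begin)).load(LOCAL_ACCOUNT_TABLE)`, where `begin` is a commit time taken from `show_commits`.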

## Creating the upsert

### Upsert Dataset

Edit 4 existing records and add 4 new records.

In [None]:
entries = [
    # Existing entries
    dataset[2], 
    dataset[4], 
    dataset[7],
    dataset[11],
    # New entries
    *generate_dataset(4, seed=1037)
]

In [None]:
for entry in entries:
    username = entry['name'].lower().replace(" ", ".")
    entry['email'] = f"{username}@domain.com"

In [None]:
upsert_df = spark.createDataFrame(entries)\
        .withColumn("year", year(col("created_at")))\
        .withColumn("month", month(col("created_at")))\
        .withColumn("day", dayofmonth(col("created_at")))

In [None]:
upsert_df.show(8, truncate=False)

In [None]:
upsert_df.createOrReplaceTempView("upsert_data")

# Upsert Strategy

## Slowly Changing Dimension (SCD) Type 1

In SCD Type 1, the existing records are overwritten with new data when there is a match, and new records are inserted when there is no match. This approach does not preserve historical changes; it simply updates the records with the latest data.


Hudi achieves SCD Type 1 upserts by default when you set **"hoodie.datasource.write.operation", "upsert"** and **.mode("append")**.
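
The SCD Type 1 behaviour described above can be illustrated without Spark. The sketch below is a plain-Python model of the semantics, not how Hudi implements it. The function name `scd1_upsert` is hypothetical, and it ignores the precombine field, which Hudi uses to choose between records that share a key.

```python
def scd1_upsert(current: list, incoming: list, key: str = "id") -> list:
    """Overwrite rows whose key matches and insert the rest (no history kept)."""
    merged = {row[key]: row for row in current}
    for row in incoming:
        # A matching key replaces the old row; a new key adds a row.
        merged[row[key]] = row
    return list(merged.values())
```

With 100 existing rows and 8 incoming rows of which 4 match, the result holds 104 rows, and the 4 matched rows carry only their new values.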

In [None]:
upsert_df.write.format("hudi") \
    .option("hoodie.database.name", "hudi_db") \
    .option("hoodie.table.name", "accounts") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "created_at") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.hive_sync.partition_fields", "year,month") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .option("hoodie.datasource.write.hive_style_partitioning","true") \
    .partitionBy("year", "month") \
    .mode("append") \
    .save(LOCAL_ACCOUNT_TABLE)  # Hudi needs the full base path

In [None]:
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS hudi_db.accounts
        USING hudi
        OPTIONS (
          path = '{LOCAL_ACCOUNT_TABLE}'
        );
    """
)

In [None]:
# List the two most recent commits: the initial write and the upsert
spark.sql("""
    call show_commits (
        table => 'hudi_db.accounts',
        limit => 2
    )    
""").show(vertical=True, truncate=False)

### Reading the last changes

In [None]:
# commit_time is the first column of the first row returned by show_commits
last_commit_time = spark.sql("""
    call show_commits (
        table => 'hudi_db.accounts',
        limit => 1
    )    
""").collect()[0][0]

last_commit_time

In [None]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time) \
    .load(LOCAL_ACCOUNT_TABLE).show()

In [None]:
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time) \
    .load(LOCAL_ACCOUNT_TABLE).count()
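
Without a metastore, a consumer needs somewhere to remember the last commit it processed so the next incremental read can start from there. One possible approach is a small JSON checkpoint file, sketched below. The function names, the file layout, and the `last_commit_time` key are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def load_checkpoint(path: str) -> Optional[str]:
    """Return the last processed commit time, or None on the first run."""
    file = Path(path)
    if not file.exists():
        return None
    return json.loads(file.read_text())["last_commit_time"]


def save_checkpoint(path: str, commit_time: str) -> None:
    """Persist the commit time after a batch has been processed."""
    Path(path).write_text(json.dumps({"last_commit_time": commit_time}))
```

A run would call `load_checkpoint`, use the value as the begin instant of the incremental read, and call `save_checkpoint` with the newest commit time once the batch succeeds.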

## TODO: Optimize commands

## TODO: Miscellaneous on Hudi