<img src="https://hudi.apache.org/assets/images/hudi.png"> &nbsp; &nbsp; &nbsp;<img src="../images/logo.svg" alt="lakeFS logo" width=300/> 

## Write-Audit-Publish (WAP) pattern with Apache Hudi and lakeFS

Please see the accompanying blog series for more details: 

1. [Data Engineering Patterns: Write-Audit-Publish (WAP)](https://lakefs.io/blog/data-engineering-patterns-write-audit-publish)
1. [How to Implement Write-Audit-Publish (WAP)](https://lakefs.io/blog/how-to-implement-write-audit-publish)
1. [Putting the Write-Audit-Publish Pattern into Practice with lakeFS](https://lakefs.io/blog/write-audit-publish-with-lakefs/)

[@rmoff](https://twitter.com/rmoff/) 

# Initialisation

## Set up the connection to lakeFS

In [1]:
import lakefs_client
from lakefs_client.client import LakeFSClient

lakefs_config = lakefs_client.Configuration()
lakefs_config.username = 'AKIAIOSFODNN7EXAMPLE'
lakefs_config.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
lakefs_config.host = 'http://lakefs:8000'

lakefs = LakeFSClient(lakefs_config)
lakefs_api_client = lakefs_client.ApiClient(lakefs_config)

### Get the first repository present in lakeFS

In [2]:
repo=lakefs.repositories.list_repositories().results[0]
print(f"Using lakeFS repository '{repo.id}' with storage namespace {repo.storage_namespace}")

Using lakeFS repository 'example' with storage namespace s3://example


### Define the data storage directory based on the provided namespace

In [3]:
data_dir=repo.storage_namespace.replace('s3','s3a')
print(f"Using {data_dir} for data storage")

Using s3a://example for data storage
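One thing to watch with the `replace('s3','s3a')` call above: it rewrites every occurrence of `s3` in the string, not just the scheme. That's harmless for `s3://example`, but it would mangle a namespace whose bucket name contains `s3`. Here's a minimal sketch of a stricter conversion that only touches a leading `s3://`. The helper name `to_s3a` is our own and isn't part of lakeFS or Spark.

```python
def to_s3a(uri: str) -> str:
    """Rewrite only a leading s3:// scheme to s3a://, leaving the rest untouched."""
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


print(to_s3a("s3://example"))       # s3a://example
print(to_s3a("s3://my-s3-bucket"))  # s3a://my-s3-bucket
```

In the notebook this could stand in for the `replace` call, for example `data_dir = to_s3a(repo.storage_namespace)`.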


## Set up Spark 

The following settings were added to fix `java.lang.IllegalArgumentException: For input string: "null"` when querying a Hudi table, per Hudi issue [8061](https://github.com/apache/hudi/issues/8061):
    
* `spark.hadoop.spark.sql.legacy.parquet.nanosAsLong`
* `spark.hadoop.spark.sql.parquet.binaryAsString`
* `spark.hadoop.spark.sql.parquet.int96AsTimestamp`
* `spark.hadoop.spark.sql.caseSensitive`

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000") \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") \
        .config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") \
        .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
        .config("spark.hadoop.spark.sql.legacy.parquet.nanosAsLong", "false") \
        .config("spark.hadoop.spark.sql.parquet.binaryAsString", "false") \
        .config("spark.hadoop.spark.sql.parquet.int96AsTimestamp", "true") \
        .config("spark.hadoop.spark.sql.caseSensitive", "false") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark




## Load test data

In [5]:
df = spark.read.option("inferSchema","true").option("multiline","true").json("/home/jovyan/data/nyc_film_permits.json")

### Inspect test data

In [6]:
df.createOrReplaceTempView("permits_src")

In [7]:
%%sql
SELECT borough, count(*) permit_cnt
FROM permits_src
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Manhattan,463
Bronx,28


## Write test data to Hudi files

### Set Hudi options

_Hudi requires a primary key for the table. There's no obvious single field to use, so we're picking a composite key here._

In [8]:
hudi_options = {
    'hoodie.table.name': 'permits',
    'hoodie.datasource.write.recordkey.field': 'borough,startdatetime',
    'hoodie.datasource.write.partitionpath.field': 'borough',
    'hoodie.datasource.write.table.name': 'permits',
    'hoodie.datasource.write.operation': 'insert'
}
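Since the record key is a composite of `borough` and `startdatetime`, it can be worth checking that the combination really is unique before relying on it. Below is a rough sketch of that check in plain Python. The function `duplicate_keys` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the permits dataset.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return the composite key values that occur more than once in rows."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows for illustration only; not taken from the permits dataset.
sample_rows = [
    {"borough": "Queens", "startdatetime": "2023-01-01T08:00:00"},
    {"borough": "Queens", "startdatetime": "2023-01-01T08:00:00"},
    {"borough": "Bronx", "startdatetime": "2023-01-01T08:00:00"},
]

dupes = duplicate_keys(sample_rows, ["borough", "startdatetime"])
print(dupes)  # {('Queens', '2023-01-01T08:00:00'): 2}
```

An empty result means every key is unique in the rows that were checked.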

### Write Hudi file

In [9]:
branch='main'

In [10]:
permits=(f"{data_dir}/{branch}/nyc/permits")

df.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(permits)


### Inspect the files written

In [11]:
for f in lakefs.objects.list_objects(repo.id,'main').results:
    print(f['path'])

aggs/agg_plot/_delta_log/
aggs/agg_plot/_delta_log/00000000000000000000.json
aggs/agg_plot/_delta_log/00000000000000000001.json
aggs/agg_plot/part-00000-2ee8ce47-d6e9-4fa9-a1ff-753028a42a84-c000.snappy.parquet
aggs/agg_plot/part-00000-f9eda370-5b4d-4723-9196-4d8e40990f5d-c000.snappy.parquet
aggs/agg_variety/_delta_log/
aggs/agg_variety/_delta_log/00000000000000000000.json
aggs/agg_variety/_delta_log/00000000000000000001.json
aggs/agg_variety/part-00000-cf6566ff-3b49-499a-a8ea-0fab940e1174-c000.snappy.parquet
aggs/agg_variety/part-00000-e3f85d05-1461-4d0c-b089-ff67dec276a2-c000.snappy.parquet
nyc/permits/
nyc/permits/.hoodie/
nyc/permits/.hoodie/.aux/
nyc/permits/.hoodie/.aux/.bootstrap/.fileids/
nyc/permits/.hoodie/.aux/.bootstrap/.partitions/
nyc/permits/.hoodie/.schema/
nyc/permits/.hoodie/.temp/
nyc/permits/.hoodie/20230518111510100.commit
nyc/permits/.hoodie/20230518111510100.commit.requested
nyc/permits/.hoodie/20230518111510100.inflight
nyc/permits/.hoodie/archived/
nyc/permits/.

### Load the Hudi data as a view

In [12]:
permits=(f"{data_dir}/{branch}/nyc/permits")
print(f"Reading Hudi table from {permits} into view `permits_{branch}`")

spark.read. \
format("hudi"). \
options(**hudi_options). \
load(permits). \
createOrReplaceTempView(f"permits_{branch}")

Reading Hudi table from s3a://example/main/nyc/permits into view `permits_main`


In [13]:
%%sql
SELECT borough, count(*) permit_cnt
FROM permits_main
GROUP BY borough

borough,permit_cnt
Manhattan,463
Brooklyn,334
Queens,168
Bronx,28
Staten Island,7


## Commit the data to the `main` branch

In [14]:
api_client = lakefs_client.ApiClient(lakefs_config)

from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="First commit of NYC Permit data"
) 


api_instance.commit(repo.id, 'main', commit_creation)

{'committer': 'docker',
 'creation_date': 1684408537,
 'id': 'f310fed40883dac47878380ab6ac20568ba61dedb2ab85d7bea5adda8f9a7aa9',
 'message': 'First commit of NYC Permit data',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['00b07af2b550fdee6a19437999806b67bd2aae894220881293b790207205a6c1']}

# The Setup

lakeFS is based on branches (just like git). Branches are copy-on-write, making them 'cheap' in terms of storage. 

We're going to create a branch to write data to, audit it, and then merge it back if we're happy with the audit. 

In [15]:
branch='etl_job_42'

### Create branch

In [16]:
from lakefs_client.api import branches_api
from lakefs_client.model.branch_creation import BranchCreation

api_instance = branches_api.BranchesApi(lakefs_api_client)
branch_creation = BranchCreation(
    name=branch,
    source="main",
) 

api_instance.create_branch(repo.id, branch_creation)

'f310fed40883dac47878380ab6ac20568ba61dedb2ab85d7bea5adda8f9a7aa9'

### Check that we still see the same data

In [17]:
permits=(f"{data_dir}/{branch}/nyc/permits")
vw=(f"permits_{branch}")
print(f"Reading Hudi table from {permits} into view `{vw}`")

spark.read. \
format("hudi"). \
options(**hudi_options). \
load(permits). \
createOrReplaceTempView(f"{vw}")

spark.sql(f"SELECT borough, count(*) permit_cnt FROM {vw} GROUP BY borough").show()

Reading Hudi table from s3a://example/etl_job_42/nyc/permits into view `permits_etl_job_42`
+-------------+----------+
|      borough|permit_cnt|
+-------------+----------+
|    Manhattan|       463|
|     Brooklyn|       334|
|       Queens|       168|
|        Bronx|        28|
|Staten Island|         7|
+-------------+----------+



# Write

## Load the data into a table

In [18]:
print(f"Reading Hudi table from {permits} into table `nyc_permits`")

spark.sql("DROP TABLE IF EXISTS nyc_permits")
spark.sql("CREATE TABLE nyc_permits USING HUDI LOCATION '"+ permits + "'")

Reading Hudi table from s3a://example/etl_job_42/nyc/permits into table `nyc_permits`


DataFrame[]

In [19]:
%%sql
SHOW TABLES;

namespace,tableName,isTemporary
default,nyc_permits,False
,permits_etl_job_42,False
,permits_main,False
,permits_src,False


In [20]:
%%sql
SELECT borough, count(*) permit_cnt
FROM nyc_permits
GROUP BY borough

borough,permit_cnt
Manhattan,463
Brooklyn,334
Queens,168
Bronx,28
Staten Island,7


In [21]:
%%sql

DELETE FROM nyc_permits
WHERE borough='Manhattan'

In [22]:
%%sql
SELECT borough, count(*) permit_cnt
FROM nyc_permits
GROUP BY borough

borough,permit_cnt
Brooklyn,334
Queens,168
Bronx,28
Staten Island,7


## Inspecting the staged/unpublished data

### Staged/unpublished data

#### The changes are reflected in the table:

In [23]:
permits=(f"{data_dir}/{branch}/nyc/permits")
vw=(f"permits_{branch}")
print(f"Reading Hudi table from {permits} into view `{vw}`")

spark.read. \
format("hudi"). \
options(**hudi_options). \
load(permits). \
createOrReplaceTempView(f"{vw}")

spark.sql(f"SELECT borough, count(*) permit_cnt FROM {vw} GROUP BY borough").show()

Reading Hudi table from s3a://example/etl_job_42/nyc/permits into view `permits_etl_job_42`
+-------------+----------+
|      borough|permit_cnt|
+-------------+----------+
|     Brooklyn|       334|
|       Queens|       168|
|        Bronx|        28|
|Staten Island|         7|
+-------------+----------+



### Published data

The data on the `main` branch remains unchanged. We can validate this by running a query against the data, specifying `main` as the branch:

In [24]:
branch="main"
permits=(f"{data_dir}/{branch}/nyc/permits")
vw=(f"permits_{branch}")
print(f"Reading Hudi table from {permits} into view `{vw}`")

spark.read. \
format("hudi"). \
options(**hudi_options). \
load(permits). \
createOrReplaceTempView(f"{vw}")

spark.sql(f"SELECT borough, count(*) permit_cnt FROM {vw} GROUP BY borough").show()

Reading Hudi table from s3a://example/main/nyc/permits into view `permits_main`
+-------------+----------+
|      borough|permit_cnt|
+-------------+----------+
|    Manhattan|       463|
|     Brooklyn|       334|
|       Queens|       168|
|        Bronx|        28|
|Staten Island|         7|
+-------------+----------+
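Eyeballing two result tables works for five boroughs, but the comparison between published and staged data can also be done in code. Here's a small sketch that diffs two sets of per-borough counts. The counts are copied from the query outputs above, and `diff_counts` is a hypothetical helper of our own.

```python
def diff_counts(published, staged):
    """Return {key: (published_count, staged_count)} for keys whose counts differ.

    A key that is missing on one side is reported with None for that side.
    """
    keys = set(published) | set(staged)
    return {
        k: (published.get(k), staged.get(k))
        for k in sorted(keys)
        if published.get(k) != staged.get(k)
    }


# Counts copied from the query outputs for `main` and `etl_job_42`.
main_counts = {"Manhattan": 463, "Brooklyn": 334, "Queens": 168, "Bronx": 28, "Staten Island": 7}
staged_counts = {"Brooklyn": 334, "Queens": 168, "Bronx": 28, "Staten Island": 7}

print(diff_counts(main_counts, staged_counts))  # {'Manhattan': (463, None)}
```

The only difference reported is the Manhattan data that was deleted on the `etl_job_42` branch.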



# Audit 

At the moment the data is written to the audit branch (`etl_job_42`), but not published to `main`. 

How you audit the data is up to you. The nice thing about the data being staged is that you can do it within the same ETL job, or have another tool do it.

Here's a very simple example of doing it in Python. We're going to programmatically check that only the four expected boroughs remain in the data.

First, we define those that are expected:

In [25]:
expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}

Then we get a set of the actual boroughs in the staged data:

In [26]:
branch="etl_job_42"
permits=(f"{data_dir}/{branch}/nyc/permits")
distinct_boroughs = spark.read \
                    .format("hudi") \
                    .load(permits) \
                    .select("borough") \
                    .distinct() \
                    .toLocalIterator()
boroughs = {row[0] for row in distinct_boroughs}

Now we do two checks:

1. Compare the sizes of the expected and actual sets.
2. Check that the union of the two sets is still the same size. This is necessary because the first check alone isn't sufficient: two different sets can have the same size.

In [27]:
if (   (len(boroughs)          != len(expected_boroughs)) \
      or (len(boroughs)          != len(set.union(boroughs, expected_boroughs))) \
      or (len(expected_boroughs) != len(set.union(boroughs, expected_boroughs)))):
    raise ValueError(f"Audit failed, borough set does not match expected boroughs: {boroughs} != {expected_boroughs}")
else:
    print(f"Audit has passed 🙌🏻")

Audit has passed 🙌🏻
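The two length checks together amount to testing whether the sets are equal, which Python can do directly. As an alternative sketch, the function below compares the sets and reports which boroughs are missing or unexpected. The name `audit_boroughs` is our own.

```python
def audit_boroughs(actual, expected):
    """Raise ValueError unless the two sets are equal, reporting the differences."""
    actual, expected = set(actual), set(expected)
    missing = expected - actual      # expected but not found in the data
    unexpected = actual - expected   # found in the data but not expected
    if missing or unexpected:
        raise ValueError(
            f"Audit failed: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    return True


expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}
print(audit_boroughs({"Queens", "Brooklyn", "Bronx", "Staten Island"}, expected_boroughs))  # True
```

If Manhattan were still present in the staged data, the error message would list it under `unexpected`.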


# Publish

Publishing data in lakeFS means merging the audit branch back into `main`, making it available to anyone working with the data in that branch.

## Commit the data to the audit branch (`etl_job_42`)

We can add a commit message, as well as optional metadata.

In [28]:
from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Remove data for Manhattan from permits dataset",
    metadata={
        "etl job name": "etl_job_42",
        "author": "rmoff",
    }
) 

api_instance.commit(repo.id, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1684408564,
 'id': 'ab822f0217159c6bc952074135e3a48b5e24905ebcc4cd8bfd6bf6e7876006b4',
 'message': 'Remove data for Manhattan from permits dataset',
 'meta_range_id': '',
 'metadata': {'author': 'rmoff', 'etl job name': 'etl_job_42'},
 'parents': ['f310fed40883dac47878380ab6ac20568ba61dedb2ab85d7bea5adda8f9a7aa9']}

## Merge the branch back into `main`

In [29]:
lakefs.refs.merge_into_branch(repository=repo.id, source_ref='etl_job_42', destination_branch='main')

{'reference': 'e1906d3f2b37292b23720a80bc23512936ddbb3f7deff5f820436997c1d9c0a8',
 'summary': {'added': 0, 'changed': 0, 'conflict': 0, 'removed': 0}}

## Inspecting the published data

In [30]:
branch="main"
permits=(f"{data_dir}/{branch}/nyc/permits")
vw=(f"permits_{branch}")
print(f"Reading Hudi table from {permits} into view `{vw}`")

spark.read. \
format("hudi"). \
options(**hudi_options). \
load(permits). \
createOrReplaceTempView(f"{vw}")

spark.sql(f"SELECT borough, count(*) permit_cnt FROM {vw} GROUP BY borough").show()

Reading Hudi table from s3a://example/main/nyc/permits into view `permits_main`
+-------------+----------+
|      borough|permit_cnt|
+-------------+----------+
|     Brooklyn|       334|
|       Queens|       168|
|        Bronx|        28|
|Staten Island|         7|
+-------------+----------+



# Where Next?

* For more information about write-audit-publish see [this talk from Michelle Winters](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) and [this talk from Sam Redai](https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf).
* To try out lakeFS, check out the [hands-on Quickstart](https://docs.lakefs.io/quickstart/).