<img src="https://projectnessie.org/img/nessie.svg" alt="Nessie logo" width="200"/>

## Write-Audit-Publish (WAP) pattern with Nessie

Please see the accompanying blog series for more details: 

1. [Data Engineering Patterns: Write-Audit-Publish (WAP)](https://lakefs.io/blog/data-engineering-patterns-write-audit-publish)
1. [How to Implement Write-Audit-Publish (WAP)](https://lakefs.io/blog/how-to-implement-write-audit-publish)
1. [Putting the Write-Audit-Publish Pattern into Practice with lakeFS](https://lakefs.io/blog/write-audit-publish-with-lakefs/)

[@rmoff](https://twitter.com/rmoff/) 

# Setup & Initialisation

In [None]:
import sys
!{sys.executable} -m pip install pynessie==0.30.0

## Set up Spark 

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark import SparkConf
import pynessie

conf = SparkConf()
conf.set("spark.jars.packages","org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.58.1")
conf.set("spark.sql.execution.pyarrow.enabled", "true")
conf.set("spark.sql.catalog.rmoff", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.rmoff.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
conf.set("spark.sql.catalog.rmoff.warehouse",  "file://" + os.getcwd() + "/spark_warehouse/iceberg")
conf.set("spark.sql.catalog.rmoff.uri", "http://nessie:19120/api/v1")
conf.set("spark.sql.catalog.rmoff.ref", "main")
conf.set("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")
spark
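The catalog settings above all share the `spark.sql.catalog.<name>` prefix. As a rough sketch, they could be generated from the catalog name so that the name is typed only once. The helper `nessie_catalog_conf` is hypothetical (it is not part of pynessie or PySpark), and the warehouse path below is a placeholder.

```python
def nessie_catalog_conf(catalog, uri, warehouse, ref="main"):
    """Return the Spark settings that register `catalog` as a Nessie-backed Iceberg catalog.

    Hypothetical helper: it only builds the same keys used in the cell above.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
    }


# Placeholder warehouse path, for illustration only
settings = nessie_catalog_conf(
    "rmoff",
    uri="http://nessie:19120/api/v1",
    warehouse="file:///tmp/spark_warehouse/iceberg",
)
for key, value in settings.items():
    print(key, "=", value)
```

Each pair can then be applied with `conf.set(key, value)` before the session is created.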

## Load test data

In [None]:
spark.read.option("inferSchema","true").option("multiline","true").json("/data/nyc_film_permits.json").createOrReplaceTempView("permits_src")

In [None]:
%%sql
SELECT borough, count(*) permit_cnt
FROM permits_src
GROUP BY borough

## Write test data to Iceberg files 

In [None]:
%%sql 

CREATE TABLE rmoff.permits USING ICEBERG
AS SELECT * FROM permits_src

#### Inspect Iceberg metadata

In [None]:
%sql SELECT * FROM rmoff.permits.files

In [None]:
%sql SELECT * FROM rmoff.permits.history

In [None]:
%sql SELECT * FROM rmoff.permits.snapshots

# The Setup

## Create Nessie branch 

In [None]:
branch='etl_job_42'

In [None]:
%sql CREATE BRANCH {branch} IN rmoff FROM main

### Use the new branch for reading and writing

#### Now change the `REFERENCE`

In [None]:
%sql USE REFERENCE {branch} IN rmoff

### Show list of references in Nessie

In [None]:
%sql LIST REFERENCES IN rmoff

### Check that we still see the same data

In [None]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough

# Write

Delete the rows for one borough from the table. Because the session's reference is now the `etl_job_42` branch, the change is written to that branch only.

In [None]:
%sql DELETE FROM rmoff.permits WHERE borough='Manhattan'

## Inspecting the staged/unpublished data

### Staged/unpublished data

#### The changes are reflected in the table:

In [None]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough

### Published data

The data on the `main` branch remains unchanged. We can validate this by running a query against the data, specifying `main` as the branch using the `@<branch>` suffix:

In [None]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough
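If you compare branches often, the `@<branch>` suffix can be built programmatically. The following is a minimal sketch; `table_at_ref` and `borough_count_query` are hypothetical helper names, not part of Nessie or PySpark. It only builds the query strings, so it runs without a Spark session.

```python
def table_at_ref(catalog, table, ref):
    """Build a catalog-qualified table name pinned to a Nessie reference.

    The `table@ref` part is wrapped in backticks, as in the query above.
    """
    for part in (catalog, table, ref):
        if "`" in part:
            raise ValueError(f"backtick not allowed in identifier part: {part!r}")
    return f"{catalog}.`{table}@{ref}`"


def borough_count_query(catalog, table, ref):
    """Return the per-borough count query for the table as seen on `ref`."""
    source = table_at_ref(catalog, table, ref)
    return f"SELECT borough, count(*) permit_cnt FROM {source} GROUP BY borough"


# Same query against the published and the staged data
print(borough_count_query("rmoff", "permits", "main"))
print(borough_count_query("rmoff", "permits", "etl_job_42"))
```

The resulting strings can be passed to `spark.sql(...)` to compare the published and staged counts side by side.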

# Audit 

At the moment the data is written to the audit branch (`etl_job_42`), but not published to `main`. 

How you audit the data is up to you. The nice thing about the data being staged is that you can do it within the same ETL job, or have another tool do it.

Here's a very simple example of doing it in Python. We're going to programmatically check that only the four expected boroughs remain in the data.

First, we define those that are expected:

In [None]:
expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}

Then we get a set of the actual boroughs in the staged data:

In [None]:
distinct_boroughs = spark.sql("SELECT DISTINCT borough FROM rmoff.permits").toLocalIterator()
boroughs = {row[0] for row in distinct_boroughs}

Now we do two checks:

1. Compare the sizes of the expected and actual sets.
2. Check that the union of the two sets is still the same size. This is necessary because two sets can be the same size and still contain different members.

In [None]:
if (   (len(boroughs)          != len(expected_boroughs)) \
      or (len(boroughs)          != len(set.union(boroughs, expected_boroughs))) \
      or (len(expected_boroughs) != len(set.union(boroughs, expected_boroughs)))):
    raise ValueError(f"Audit failed, borough set does not match expected boroughs: {boroughs} != {expected_boroughs}")
else:
    print("Audit has passed 🙌🏻")
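The size comparisons above pass exactly when the two sets are equal, so the same audit can be written with set differences, which also reports what went wrong. This is a sketch of an alternative, not the notebook's original check; `audit_boroughs` is a hypothetical helper, and the `staged` values are made up to show a failing case.

```python
def audit_boroughs(actual, expected):
    """Return (passed, missing, unexpected) for two collections of borough names."""
    actual, expected = set(actual), set(expected)
    missing = expected - actual       # expected but absent from the staged data
    unexpected = actual - expected    # present in the staged data but not expected
    return (not missing and not unexpected), missing, unexpected


expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}
staged = {"Queens", "Brooklyn", "Bronx", "Manhattan"}  # illustrative values, not query output

passed, missing, unexpected = audit_boroughs(staged, expected_boroughs)
print(passed, sorted(missing), sorted(unexpected))
# prints: False ['Staten Island'] ['Manhattan']
```

Reporting the missing and unexpected values separately makes a failed audit easier to diagnose than a single mismatch message.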

# Publish

Publishing data in Nessie means merging the audit branch back into `main`, making it available to anyone working with the data in that branch.

In [None]:
%sql MERGE BRANCH {branch} INTO main IN rmoff
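In an automated job the merge would normally run only if the audit succeeded. The sketch below shows one way to gate it. `publish_if_audit_passes` is a hypothetical helper, and `run_sql` stands for any callable that executes a single SQL statement (for example `spark.sql`). A plain list is used here in its place so the example runs without Spark or Nessie.

```python
def publish_if_audit_passes(run_sql, audit, branch, catalog, target="main"):
    """Merge `branch` into `target` only when `audit()` returns True.

    Returns True if the merge statement was issued, False otherwise.
    """
    if not audit():
        return False
    run_sql(f"MERGE BRANCH {branch} INTO {target} IN {catalog}")
    return True


# Stand-in for spark.sql: record the statements instead of executing them
issued = []

# A failing audit issues nothing; a passing audit issues the merge
publish_if_audit_passes(issued.append, lambda: False, "etl_job_42", "rmoff")
publish_if_audit_passes(issued.append, lambda: True, "etl_job_42", "rmoff")
print(issued)
# prints: ['MERGE BRANCH etl_job_42 INTO main IN rmoff']
```

When the audit fails, the branch is left untouched, so the staged data stays available for inspection while `main` is unaffected.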

## Inspecting the published data

In [None]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough

You can also change the `REFERENCE` context back to `main` and query the table directly:

In [None]:
%sql USE REFERENCE main IN rmoff

In [None]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough

# Where Next?

* For more information about write-audit-publish see [this talk from Michelle Winters](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) and [this talk from Sam Redai](https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf).