<img src="https://projectnessie.org/img/nessie.svg" alt="Nessie logo" width=200/> 

## Write-Audit-Publish (WAP) pattern with Nessie

**New to Write-Audit-Publish? This [talk](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) explains it well.**

[@rmoff](https://twitter.com/rmoff/) 

# Setup & Initialisation

In [1]:
import sys
!{sys.executable} -m pip install pynessie==0.30.0

Collecting pynessie==0.30.0
  Downloading pynessie-0.30.0-py2.py3-none-any.whl (55 kB)
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Collecting confuse==1.7.0
  Downloading confuse-1.7.0-py2.py3-none-any.whl (25 kB)
Collecting desert
  Downloading desert-2022.9.22-py3-none-any.whl (10 kB)
Collecting marshmallow
  Downloading marshmallow-3.19.0-py3-none-any.whl (49 kB)
Collecting requests-aws4auth
  Downloading requests_aws4auth-1.2.3-py2.py3-none-any.whl (24 kB)
Collecting simplejson
  Downloading simplejson-3.19.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (138 kB)


## Set up Spark 

In [2]:
import os
from pyspark.sql import SparkSession
from pyspark import SparkConf
import pynessie

conf = SparkConf()
conf.set("spark.jars.packages","org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.58.1")
conf.set("spark.sql.execution.pyarrow.enabled", "true")
conf.set("spark.sql.catalog.rmoff", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.rmoff.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
conf.set("spark.sql.catalog.rmoff.warehouse",  "file://" + os.getcwd() + "/spark_warehouse/iceberg")
conf.set("spark.sql.catalog.rmoff.uri", "http://nessie:19120/api/v1")
conf.set("spark.sql.catalog.rmoff.ref", "main")
conf.set("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")
spark

Spark Running
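The catalog settings above all share the `spark.sql.catalog.<name>` prefix, where `<name>` (here `rmoff`) is the catalog name used later in SQL. As a minimal sketch of how those keys fit together, the hypothetical helper `nessie_catalog_conf` below (not part of pynessie or PySpark) builds the same five settings for any catalog name. The values are copied from the cell above; the default warehouse path is an assumption that matches this notebook only.

```python
import os


def nessie_catalog_conf(name, uri, ref="main", warehouse=None):
    """Return the Spark settings that register an Iceberg catalog backed by Nessie.

    Hypothetical helper for illustration; the keys mirror the conf.set() calls above.
    """
    if warehouse is None:
        # Same local warehouse location as used in this notebook (an assumption elsewhere)
        warehouse = "file://" + os.getcwd() + "/spark_warehouse/iceberg"
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,  # the Nessie branch the session starts on
    }


settings = nessie_catalog_conf("rmoff", "http://nessie:19120/api/v1")
for key in sorted(settings):
    print(key, "=", settings[key])
```

Each entry could then be applied with `conf.set(key, value)` before the `SparkSession` is created.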


## Load test data

In [3]:
spark.read.option("inferSchema","true").option("multiline","true").json("/home/jovyan/data/nyc_film_permits.json").createOrReplaceTempView("permits_src")

In [4]:
%%sql
SELECT borough, count(*) permit_cnt
FROM permits_src
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Manhattan,463
Bronx,28


## Write test data to Iceberg files 

In [5]:
%%sql 

CREATE TABLE rmoff.permits USING ICEBERG
AS SELECT * FROM permits_src

#### Inspect Iceberg metadata

In [6]:
%sql SELECT * FROM rmoff.permits.files

content,file_path,file_format,spec_id,record_count,file_size_in_bytes,column_sizes,value_counts,null_value_counts,nan_value_counts,lower_bounds,upper_bounds,key_metadata,split_offsets,equality_ids,sort_order_id,readable_metrics
0,file:/home/jovyan/work/spark_warehouse/iceberg/permits_58ac8dc8-9b89-4875-be83-6035d042fcdd/data/00000-3-bef8af72-2d04-459c-bbe6-010b6dc84cb7-00001.parquet,PARQUET,0,1000,51115,"{1: 483, 2: 474, 3: 1183, 4: 119, 5: 2736, 6: 5023, 7: 142, 8: 2348, 9: 343, 10: 26704, 11: 1487, 12: 2462, 13: 745, 14: 2358}","{1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000, 14: 1000}","{1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0}",{},"{1: bytearray(b'Bronx'), 2: bytearray(b'Commercial'), 3: bytearray(b'0, 2, 3'), 4: bytearray(b'United States of'), 5: bytearray(b'2022-11-04T22:00'), 6: bytearray(b'2022-11-02T13:34'), 7: bytearray(b""Mayor\'s Office o""), 8: bytearray(b'678909'), 9: bytearray(b'DCAS Prep/Shoot/'), 10: bytearray(b'1 AVENUE between'), 11: bytearray(b'0, 10'), 12: bytearray(b'2022-11-03T00:00'), 13: bytearray(b'Cable-episodic'), 14: bytearray(b'0, 10011')}","{1: bytearray(b'Staten Island'), 2: bytearray(b'WEB'), 3: bytearray(b'9'), 4: bytearray(b'United States og'), 5: bytearray(b'2023-02-20T18:01'), 6: bytearray(b'2023-01-18T14:35'), 7: bytearray(b""Mayor\'s Office p""), 8: bytearray(b'691875'), 9: bytearray(b'Theater Load in!'), 10: bytearray(b'WYTHE AVENUE beu'), 11: bytearray(b'94'), 12: bytearray(b'2023-01-20T13:01'), 13: bytearray(b'Variety'), 14: bytearray(b'11693, 11694')}",,[4],,0,"Row(borough=Row(column_size=483, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Bronx', upper_bound='Staten Island'), category=Row(column_size=474, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Commercial', upper_bound='WEB'), communityboard_s=Row(column_size=1183, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 2, 3', upper_bound='9'), country=Row(column_size=119, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='United States of', upper_bound='United States og'), 
enddatetime=Row(column_size=2736, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-04T22:00', upper_bound='2023-02-20T18:01'), enteredon=Row(column_size=5023, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-02T13:34', upper_bound='2023-01-18T14:35'), eventagency=Row(column_size=142, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound=""Mayor's Office o"", upper_bound=""Mayor's Office p""), eventid=Row(column_size=2348, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='678909', upper_bound='691875'), eventtype=Row(column_size=343, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='DCAS Prep/Shoot/', upper_bound='Theater Load in!'), parkingheld=Row(column_size=26704, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='1 AVENUE between', upper_bound='WYTHE AVENUE beu'), policeprecinct_s=Row(column_size=1487, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10', upper_bound='94'), startdatetime=Row(column_size=2462, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-03T00:00', upper_bound='2023-01-20T13:01'), subcategoryname=Row(column_size=745, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Cable-episodic', upper_bound='Variety'), zipcode_s=Row(column_size=2358, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10011', upper_bound='11693, 11694'))"


In [7]:
%sql SELECT * FROM rmoff.permits.history

made_current_at,snapshot_id,parent_id,is_current_ancestor
2023-05-18 11:39:45.960000,8814055265607444574,,True


In [8]:
%sql SELECT * FROM rmoff.permits.snapshots

committed_at,snapshot_id,parent_id,operation,manifest_list,summary
2023-05-18 11:39:45.960000,8814055265607444574,,append,file:/home/jovyan/work/spark_warehouse/iceberg/permits_58ac8dc8-9b89-4875-be83-6035d042fcdd/metadata/snap-8814055265607444574-1-64b8551b-511c-4f7d-a313-17bcb858d42c.avro,"{'spark.app.id': 'local-1684409978518', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '1000', 'total-position-deletes': '0', 'added-files-size': '51115', 'total-delete-files': '0', 'total-files-size': '51115', 'total-records': '1000', 'total-data-files': '1'}"


# The Setup

## Create Nessie branch 

In [9]:
branch='etl_job_42'

In [10]:
%sql CREATE BRANCH {branch} IN rmoff FROM main

refType,name,hash
Branch,etl_job_42,7452f561669eb46a52e5a55b431a9d6d57ac9b06632c93b00802e7404546c6bd


### Use the new branch for reading and writing

#### Now change the `REFERENCE`

In [11]:
%sql USE REFERENCE {branch} IN rmoff

refType,name,hash
Branch,etl_job_42,7452f561669eb46a52e5a55b431a9d6d57ac9b06632c93b00802e7404546c6bd


### Show list of references in Nessie

In [12]:
%sql LIST REFERENCES IN rmoff

refType,name,hash
Branch,etl_job_42,7452f561669eb46a52e5a55b431a9d6d57ac9b06632c93b00802e7404546c6bd
Branch,main,7452f561669eb46a52e5a55b431a9d6d57ac9b06632c93b00802e7404546c6bd


### Check that we still see the same data

In [13]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Manhattan,463
Brooklyn,334
Staten Island,7


# Write

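Delete the rows matching a predicate (here, every permit for Manhattan) from the table. Because the current reference is the `etl_job_42` branch, the change is written there and not to `main`.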

In [14]:
%sql DELETE FROM rmoff.permits WHERE borough='Manhattan'

## Inspecting the staged/unpublished data

### Staged/unpublished data

#### The changes are reflected in the table:

In [15]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Brooklyn,334
Staten Island,7


### Published data

The data on the `main` branch remains unchanged. We can validate this by running a query against the data, specifying `main` as the branch using the `@<branch>` suffix:

In [16]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Manhattan,463
Brooklyn,334
Staten Island,7
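The query above addresses the table as `` rmoff.`permits@main` ``: the catalog name, then the table name and branch joined by `@` and wrapped in backticks. As a small sketch, the hypothetical helper `table_at_ref` (an illustrative name, not a Nessie or Spark function) builds that identifier so the same query text can be pointed at different branches.

```python
def table_at_ref(catalog, table, ref):
    """Build a `catalog.`table@ref`` identifier in the form used by the query above.

    Hypothetical helper; the backticks keep the '@' inside a single identifier.
    """
    return f"{catalog}.`{table}@{ref}`"


def borough_count_query(ref):
    # Same aggregation as the notebook cells, pointed at an explicit reference
    source = table_at_ref("rmoff", "permits", ref)
    return f"SELECT borough, count(*) permit_cnt FROM {source} GROUP BY borough"


print(borough_count_query("main"))
print(borough_count_query("etl_job_42"))
```

The resulting strings could be passed to `spark.sql(...)` to compare the published and staged data side by side without changing the session's current reference.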


# Audit 

At the moment the data is written to the audit branch (`etl_job_42`), but not published to `main`. 

How you audit the data is up to you. The nice thing about the data being staged is that you can do it within the same ETL job, or have another tool do it.

Here's a very simple example of doing it in Python. We're going to programmatically check that only the four expected boroughs remain in the data.

First, we define those that are expected:

In [17]:
expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}

Then we get a set of the actual boroughs in the staged data:

In [18]:
distinct_boroughs = spark.sql("SELECT DISTINCT borough FROM rmoff.permits").toLocalIterator()
boroughs = {row[0] for row in distinct_boroughs}

Now we do two checks:

1. Compare the size of the actual set with the size of the expected set.
2. Check that the union of the two sets is the same size as each of them. The first check alone isn't sufficient, because two different sets can contain the same number of elements.

In [19]:
if (   (len(boroughs)          != len(expected_boroughs)) \
      or (len(boroughs)          != len(set.union(boroughs, expected_boroughs))) \
      or (len(expected_boroughs) != len(set.union(boroughs, expected_boroughs)))):
    raise ValueError(f"Audit failed, borough set does not match expected boroughs: {boroughs} != {expected_boroughs}")
else:
    print("Audit has passed 🙌🏻")

Audit has passed 🙌🏻
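Taken together, the length checks above pass only when the two sets contain exactly the same elements, so they can also be expressed as a direct set comparison. The sketch below is one possible alternative, not the notebook's own method. `audit_boroughs` is a hypothetical name, and the sample data stands in for the result of the Spark query so that the example runs on its own.

```python
def audit_boroughs(actual, expected):
    """Raise ValueError unless the actual boroughs exactly match the expected ones."""
    actual, expected = set(actual), set(expected)
    if actual != expected:
        unexpected = sorted(actual - expected)  # present in the data but not allowed
        missing = sorted(expected - actual)     # allowed but absent from the data
        raise ValueError(
            f"Audit failed: unexpected boroughs {unexpected}, missing boroughs {missing}"
        )
    return True


expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}

# Stand-in for the boroughs read from the staged branch
staged = ["Queens", "Bronx", "Brooklyn", "Staten Island"]
print(audit_boroughs(staged, expected_boroughs))

# A set of the same size but with different members fails the audit
try:
    audit_boroughs(["Queens", "Bronx", "Brooklyn", "Manhattan"], expected_boroughs)
except ValueError as err:
    print(err)
```

Reporting the unexpected and missing values separately makes a failed audit easier to diagnose than a bare inequality.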


# Publish

Publishing data in Nessie means merging the audit branch back into `main`, making it available to anyone working with the data in that branch.

In [20]:
%sql MERGE BRANCH {branch} INTO main IN rmoff

name,hash
main,d17ddb4c5ef4b321497bd41dc8e8b2a41327ea1df1290e8cdb0ee3d32da74294
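In an automated job, the merge would normally run only if the audit succeeded. The sketch below shows one way to wire that up; `publish_if_audited` is a hypothetical helper, and `run_sql` is whatever callable executes SQL in your environment (in this notebook that would presumably be `spark.sql`, assuming the Nessie SQL extensions are loaded as configured earlier). A list's `append` method stands in for it here so the example runs without Spark.

```python
def publish_if_audited(run_sql, branch, audit, catalog="rmoff", target="main"):
    """Run the audit, then merge `branch` into `target` only if it did not raise.

    Hypothetical helper: `audit` is a zero-argument callable that raises on failure.
    """
    audit()  # any exception here stops the function before the merge is issued
    statement = f"MERGE BRANCH {branch} INTO {target} IN {catalog}"
    run_sql(statement)
    return statement


def failing_audit():
    raise ValueError("Audit failed")


executed = []  # records the statements that would have been sent to Spark

# Passing audit: the merge statement is issued
publish_if_audited(executed.append, "etl_job_42", audit=lambda: None)

# Failing audit: nothing further is issued and main stays untouched
try:
    publish_if_audited(executed.append, "etl_job_42", audit=failing_audit)
except ValueError as err:
    print("Not published:", err)

print(executed)
```

Keeping the audit and the merge in one function means the publish step cannot be reached by accident when a check fails.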


## Inspecting the published data

In [21]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Manhattan,463
Brooklyn,334
Staten Island,7


You can also change the `REFERENCE` context back to `main` and query the table directly:

In [22]:
%sql USE REFERENCE main IN rmoff

refType,name,hash
Branch,main,d17ddb4c5ef4b321497bd41dc8e8b2a41327ea1df1290e8cdb0ee3d32da74294


In [23]:
%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Brooklyn,334
Staten Island,7


# Where Next?

* For more information about write-audit-publish see [this talk from Michelle Winters](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) and [this talk from Sam Redai](https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf).