<img src="https://docs.delta.io/latest/_static/delta-lake-logo.png" alt="Delta Lake logo" width=300/>  &nbsp;  &nbsp; &nbsp; <img src="../images/logo.svg" alt="lakeFS logo" width=300/>

## Write-Audit-Publish (WAP) pattern with Delta Lake and lakeFS

Please see the accompanying blog series for more details: 

1. [Data Engineering Patterns: Write-Audit-Publish (WAP)](https://lakefs.io/blog/data-engineering-patterns-write-audit-publish)
1. [How to Implement Write-Audit-Publish (WAP)](https://lakefs.io/blog/how-to-implement-write-audit-publish)
1. [Putting the Write-Audit-Publish Pattern into Practice with lakeFS](https://lakefs.io/blog/write-audit-publish-with-lakefs/)

[@rmoff](https://twitter.com/rmoff/) 

# Initialisation

## Set up the connection to lakeFS

In [1]:
import lakefs_client
from lakefs_client.client import LakeFSClient

lakefs_config = lakefs_client.Configuration()
lakefs_config.username = 'AKIAIOSFODNN7EXAMPLE'
lakefs_config.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
lakefs_config.host = 'http://lakefs:8000'

lakefs = LakeFSClient(lakefs_config)
lakefs_api_client = lakefs_client.ApiClient(lakefs_config)

### Get the first repository present in lakeFS

In [2]:
repo=lakefs.repositories.list_repositories().results[0]
print(f"Using lakeFS repository '{repo.id}' with storage namespace {repo.storage_namespace}")

Using lakeFS repository 'example' with storage namespace s3://example


### Define the data storage directory based on the provided namespace

In [3]:
# Swap only the URI scheme, so a namespace that itself contains 's3' is left intact
data_dir=repo.storage_namespace.replace('s3://','s3a://',1)
print(f"Using {data_dir} for data storage")

Using s3a://example for data storage


## Set up Spark 

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000") \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") \
        .config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark
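
The access key and secret are hardcoded in the cell above, which is fine for a local sandbox. As a minimal sketch of an alternative, the S3A settings could be read from environment variables instead. The function name and the `LAKEFS_*` variable names are assumptions made for this illustration, not something lakeFS or Spark defines.

```python
import os


def lakefs_spark_conf(env=None):
    """Return the S3A settings from the cell above, sourced from the environment.

    The LAKEFS_* variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": env["LAKEFS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["LAKEFS_SECRET_ACCESS_KEY"],
    }


# Demonstrated with an explicit mapping so the sketch runs without real credentials.
demo_conf = lakefs_spark_conf({
    "LAKEFS_ACCESS_KEY_ID": "AKIAIOSFODNN7EXAMPLE",
    "LAKEFS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
})
print(sorted(demo_conf))
```

Each key and value could then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.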




## Load test data and write it to the `main` branch as a Delta table

In [5]:
df = spark.read.option("inferSchema","true").option("multiline","true").json("/home/jovyan/data/nyc_film_permits.json")

permits_file=(f"{data_dir}/main/nyc/permits")
print(permits_file)
df.write.format("delta").mode('overwrite').save(permits_file)

s3a://example/main/nyc/permits


### Inspect the table

In [6]:
%%sql

DESCRIBE EXTENDED delta.`s3a://example/main/nyc/permits`

col_name,data_type,comment
borough,string,
category,string,
communityboard_s,string,
country,string,
enddatetime,string,
enteredon,string,
eventagency,string,
eventid,string,
eventtype,string,
parkingheld,string,


### What does the data look like?

In [7]:
%%sql
SELECT borough, count(*) permit_cnt
FROM delta.`s3a://example/main/nyc/permits`
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Manhattan,463
Bronx,28


## Commit the data to the `main` branch

In [8]:
api_client = lakefs_client.ApiClient(lakefs_config)

from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="First commit of NYC Permit data"
) 


api_instance.commit(repo.id, 'main', commit_creation)

{'committer': 'docker',
 'creation_date': 1684408818,
 'id': '779387f14f627b0bb10645fc155c712c35b5ec2d636ebbcedc882949711c78ef',
 'message': 'First commit of NYC Permit data',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['a0ea2bdaa385dadd8b8484779de8a7760423908b7399d229c27fc68ed81a9ad5']}

# The Setup

lakeFS is based on branches, just like Git. Branches are copy-on-write: creating one does not duplicate the underlying data, which makes them 'cheap' in terms of storage.

We're going to create a branch to write data to, audit it, and then merge it back if we're happy with the audit. 
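
To make the copy-on-write idea concrete, here is a toy in-memory model. It is only an illustration of the concept and makes no claim about how lakeFS is implemented; the `ToyRepo` class and its methods are hypothetical.

```python
class ToyRepo:
    """In-memory stand-in for a versioned store; NOT how lakeFS is implemented."""

    def __init__(self):
        self.objects = {}             # object id -> content, shared by all branches
        self.branches = {"main": {}}  # branch name -> {path: object id}
        self._next_id = 0

    def put(self, branch, path, content):
        # Every write stores a new object and repoints the path on one branch only.
        object_id = self._next_id
        self._next_id += 1
        self.objects[object_id] = content
        self.branches[branch][path] = object_id

    def create_branch(self, name, source):
        # Copy only the path -> id mapping, never the stored content.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]


toy = ToyRepo()
toy.put("main", "nyc/permits", "v1")
toy.create_branch("etl_job_42", "main")
objects_after_branching = len(toy.objects)  # branching stored nothing new

toy.put("etl_job_42", "nyc/permits", "v2")  # a write on the branch...
print(toy.get("main", "nyc/permits"))       # ...leaves main untouched
print(toy.get("etl_job_42", "nyc/permits"))
```

New storage is only consumed when a branch diverges from its source, which is why creating a branch per ETL job is practical.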

In [9]:
branch='etl_job_42'

### Create branch

In [10]:
from lakefs_client.api import branches_api
from lakefs_client.model.branch_creation import BranchCreation

api_instance = branches_api.BranchesApi(lakefs_api_client)
branch_creation = BranchCreation(
    name=branch,
    source="main",
) 

api_instance.create_branch(repo.id, branch_creation)

'779387f14f627b0bb10645fc155c712c35b5ec2d636ebbcedc882949711c78ef'

### Check that we still see the same data

In [11]:
%%sql
SELECT borough, count(*) permit_cnt
FROM delta.`s3a://example/etl_job_42/nyc/permits`
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Manhattan,463
Bronx,28
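
The only difference between the two queries so far is the branch name in the path. A small helper can build these paths so that the branch is a parameter rather than a hardcoded string. This is a sketch; `table_path` is a hypothetical name introduced here.

```python
def table_path(data_dir, branch, table):
    """Join the storage root, a branch name, and a table path into one URI."""
    return f"{data_dir.rstrip('/')}/{branch}/{table.strip('/')}"


main_path = table_path("s3a://example", "main", "nyc/permits")
branch_path = table_path("s3a://example", "etl_job_42", "nyc/permits")
print(main_path)
print(branch_path)
```

In a Python cell, the result could be passed to `spark.read.format("delta").load(...)` in place of a literal path.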


# Write

In [12]:
%%sql

DELETE FROM delta.`s3a://example/etl_job_42/nyc/permits`
WHERE borough='Manhattan'

num_affected_rows
463


## Inspecting the staged/unpublished data

### Staged/unpublished data

#### The changes are reflected in the table:

In [13]:
%%sql

SELECT borough, count(*) permit_cnt
FROM delta.`s3a://example/etl_job_42/nyc/permits`
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Bronx,28


### Published data

The data on the `main` branch remains unchanged. We can validate this by running a query against the data, specifying `main` as the branch:

In [14]:
%%sql
SELECT borough, count(*) permit_cnt
FROM delta.`s3a://example/main/nyc/permits`
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Manhattan,463
Bronx,28


# Audit 

At the moment the data is written to the audit branch (`etl_job_42`), but not published to `main`. 

How you audit the data is up to you. The nice thing about the data being staged is that you can do it within the same ETL job, or have another tool do it.

Here's a very simple example of doing this in Python. We're going to programmatically check that only the four expected boroughs remain in the data.

First, we define those that are expected:

In [15]:
expected_boroughs = {"Queens", "Brooklyn", "Bronx", "Staten Island"}

Then we get a set of the actual boroughs in the staged data:

In [16]:
distinct_boroughs = spark.read \
                    .format("delta") \
                    .load("s3a://example/etl_job_42/nyc/permits") \
                    .select("borough") \
                    .distinct() \
                    .toLocalIterator()
boroughs = {row[0] for row in distinct_boroughs}

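Now we check that the two sets match:

1. Compare the size of the expected set with the size of the actual set.
2. Check that the union of the two sets is still the same size as each of them. This is necessary because the first test alone isn't sufficient: two different sets can have the same number of members.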

In [17]:
union_size = len(boroughs | expected_boroughs)

if (len(boroughs) != len(expected_boroughs)
        or len(boroughs) != union_size
        or len(expected_boroughs) != union_size):
    raise ValueError(f"Audit failed, borough set does not match expected boroughs: {boroughs} != {expected_boroughs}")
else:
    print("Audit has passed 🙌🏻")

Audit has passed 🙌🏻
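
The length comparisons in the audit above amount to testing that the two sets are equal. As an alternative sketch, Python's set difference can make the same check while also reporting what went wrong. The function name `audit_boroughs` is hypothetical.

```python
def audit_boroughs(actual, expected):
    """Raise ValueError unless the two sets contain exactly the same members."""
    missing = expected - actual
    unexpected = actual - expected
    if missing or unexpected:
        raise ValueError(
            f"Audit failed: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


expected = {"Queens", "Brooklyn", "Bronx", "Staten Island"}

# Same members: passes silently.
audit_boroughs({"Queens", "Brooklyn", "Bronx", "Staten Island"}, expected)

# Same size but different members: the case a length check alone would miss.
failure = None
try:
    audit_boroughs({"Queens", "Brooklyn", "Bronx", "Manhattan"}, expected)
except ValueError as err:
    failure = str(err)
    print(failure)
```

Listing the missing and unexpected values separately makes a failed audit easier to diagnose than a dump of both sets.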


# Publish

Publishing data in lakeFS means merging the audit branch back into `main`, making it available to anyone working with the data in that branch.

## Commit the data to the audit branch (`etl_job_42`)

The commit takes a message and, optionally, key-value metadata:

In [18]:
from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Remove data for Manhattan from permits dataset",
    metadata={
        "etl job name": "etl_job_42",
        "author": "rmoff",
    }
) 

api_instance.commit(repo.id, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1684408833,
 'id': 'c54dccb3fe5f2d3845ee7a565febddbd824508a46e88fa2667fffea913095471',
 'message': 'Remove data for Manhattan from permits dataset',
 'meta_range_id': '',
 'metadata': {'author': 'rmoff', 'etl job name': 'etl_job_42'},
 'parents': ['779387f14f627b0bb10645fc155c712c35b5ec2d636ebbcedc882949711c78ef']}
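
Commit metadata is also a convenient place to record the outcome of the audit. The sketch below assumes that metadata keys and values need to be strings, as they are in the example above; the helper name and the two extra keys are hypothetical.

```python
def as_commit_metadata(fields):
    """Convert keys and values to strings (assumed requirement of the commit API)."""
    return {str(key): str(value) for key, value in fields.items()}


audit_metadata = as_commit_metadata({
    "etl job name": "etl_job_42",
    "audit passed": True,       # hypothetical key
    "distinct boroughs": 4,     # hypothetical key
})
print(audit_metadata)
```

The resulting dictionary could be passed as the `metadata` argument of `CommitCreation`.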

## Merge the branch back into `main`

In [19]:
lakefs.refs.merge_into_branch(repository=repo.id, source_ref=branch, destination_branch='main')

{'reference': '5b9402c139d7c03bb72d3b1f430f6dc65bbf03bf23e79ec51490664096393dfc',
 'summary': {'added': 0, 'changed': 0, 'conflict': 0, 'removed': 0}}

## Inspecting the published data

In [20]:
%%sql
SELECT borough, count(*) permit_cnt
FROM delta.`s3a://example/main/nyc/permits`
GROUP BY borough

borough,permit_cnt
Queens,168
Brooklyn,334
Staten Island,7
Bronx,28
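
Taken together, the three stages form a gate: data reaches `main` only if the audit raises no error. The following sketch shows that control flow in isolation, with each stage supplied as a callable. The function and parameter names are hypothetical. In this notebook, `write` would be the Delta `DELETE`, `audit` the borough check, and `publish` the commit followed by the merge; `discard` stands for whatever cleanup you choose, such as removing the branch.

```python
def write_audit_publish(write, audit, publish, discard):
    """Run the three stages in order; publish only if the audit raises nothing."""
    write()
    try:
        audit()
    except Exception:
        discard()  # the published data is left untouched
        raise
    publish()


events = []
write_audit_publish(
    write=lambda: events.append("write"),
    audit=lambda: events.append("audit"),
    publish=lambda: events.append("publish"),
    discard=lambda: events.append("discard"),
)
print(events)  # ['write', 'audit', 'publish']
```

Because the audit failure is re-raised after cleanup, a scheduler running the job still sees it as failed.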


# Where Next?

* For more information about write-audit-publish see [this talk from Michelle Winters](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) and [this talk from Sam Redai](https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf).
* To try out lakeFS, check out the [hands-on Quickstart](https://docs.lakefs.io/quickstart/).