<img src="./images/logo.svg" alt="lakeFS logo" width=300/> <img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="Apache Iceberg logo" width=300/>  

## lakeFS ❤️ Apache Iceberg - an example using the NYC Film Permits dataset

# Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
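
If you would rather not edit credentials in the notebook itself, one option is to read them from the environment and fall back to the sample values. The sketch below reuses the `LAKECTL_*` variable names that the Setup section exports; the `setting` helper and the `DEFAULTS` dictionary are hypothetical names introduced for illustration, not part of lakeFS.

```python
import os

# Sample values from this notebook, used when the environment does not override them.
DEFAULTS = {
    "LAKECTL_SERVER_ENDPOINT_URL": "http://lakefs:8000",
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID": "AKIAIOSFOLKFSSAMPLES",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}


def setting(name, environ=os.environ):
    """Return `name` from `environ`, or the sample default if it is unset or blank."""
    value = environ.get(name, "").strip()
    return value if value else DEFAULTS[name]


lakefsEndPoint = setting("LAKECTL_SERVER_ENDPOINT_URL")
lakefsAccessKey = setting("LAKECTL_CREDENTIALS_ACCESS_KEY_ID")
lakefsSecretKey = setting("LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY")
```

With no variables set, this yields the same values as the cell above, so the rest of the notebook runs unchanged.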

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

# Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "lakefs-iceberg-nyc"

### Versioning Information

In [None]:
mainBranch = "main"
devBranch = "dev"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception rather than using a bare except, and report the cause
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Iceberg / Jupyter") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.1.1,io.lakefs:lakefs-spark-extensions_2.12:0.0.3") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
        .config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
        .config("spark.sql.catalog.lakefs.uri", lakefsEndPoint) \
        .config("spark.sql.catalog.lakefs.cache-enabled", "false") \
        .config("spark.sql.defaultCatalog", "lakefs") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

---

# Main demo starts here 🚦 👇🏻

# Load some Data

For this demo, we will use the [New York City Film Permits dataset](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p), available as part of the NYC Open Data initiative. We're using a locally saved copy of a 1,000-record sample, but feel free to download the entire dataset to use in this notebook!

We'll save the sample dataset into an Iceberg table called `permits`, using lakeFS for the catalog.

In [None]:
df = spark.read.option("inferSchema","true").option("multiline","true").json("/data/nyc_film_permits.json")

In [None]:
df.write.saveAsTable("lakefs.main.nyc.permits")

In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"#### 👉🏻 Optionally, go and view the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo_name}/objects?ref=main&path=nyc%2Fpermits%2F)")
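
The endpoint-to-URL logic in the cell above is repeated at the end of the notebook, so it could be factored into a small helper. This is a sketch only: `lakefs_ui_url` is a hypothetical name, and the URL layout is simply the one used in the cell above.

```python
from urllib.parse import quote


def lakefs_ui_url(endpoint, repo, ref=None, path=None):
    """Build a link to a repository's objects view in the lakeFS web UI."""
    # Inside the sample environment the server is reached as 'lakefs';
    # from the browser it is reached on localhost.
    base = "http://localhost:8000" if endpoint == "http://lakefs:8000" else endpoint.rstrip("/")
    url = f"{base}/repositories/{quote(repo, safe='')}/objects"
    params = []
    if ref is not None:
        params.append(f"ref={quote(ref, safe='')}")
    if path is not None:
        # safe='' also encodes '/' as %2F, matching the link in the cell above
        params.append(f"path={quote(path, safe='')}")
    return url + ("?" + "&".join(params) if params else "")


print(lakefs_ui_url("http://lakefs:8000", "lakefs-iceberg-nyc", ref="main", path="nyc/permits/"))
```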

Taking a quick peek at the data, you can see that there are a number of permits for different boroughs in New York.

In [None]:
%%sql

SELECT borough, count(*) AS permit_cnt
FROM lakefs.main.nyc.permits
GROUP BY borough

### Commit the new table and its data

In [None]:
ref = branchMain.commit(
    message="Initial data load",
    metadata={'author': 'lakefs',
              'data source': 'https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p'})
print_commit(ref.get_commit())

# Create a new branch

_This is copy-on-write; we're not duplicating the data_

In [None]:
branchDev = repo.branch(devBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{devBranch} ref:", branchDev.get_commit().id)
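
To make "copy-on-write" concrete, here is a toy model in plain Python. It is a conceptual sketch, not a description of how lakeFS stores its metadata: a branch is modelled as a mapping from paths to object identifiers, and the file path and object ids are made up for illustration.

```python
# Toy model of copy-on-write branching (illustrative only, not lakeFS internals).
object_store = {"obj-1": "permits data file, version 1"}  # physical objects, stored once
branches = {"main": {"nyc/permits/data/part-0.parquet": "obj-1"}}  # path -> object id


def create_branch(name, source):
    """Create a branch by copying the source's pointers, not its objects."""
    branches[name] = dict(branches[source])


def write_object(branch, path, object_id, content):
    """Store a new object and repoint only the given branch at it."""
    object_store[object_id] = content
    branches[branch][path] = object_id


create_branch("dev", "main")
objects_after_branching = len(object_store)  # still 1: nothing was duplicated

# A write on dev adds one new object; main keeps pointing at the original.
write_object("dev", "nyc/permits/data/part-0.parquet", "obj-2", "permits data file, version 2")
```

Creating the branch costs only a copy of the pointers, and data is written only when a branch actually changes something.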

### Confirm that we can see the data on the `dev` branch

In [None]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

# Making [and reverting] changes on the dev branch

Let's go big! Let's see what happens when we delete the contents of the table with a careless `DELETE` that omits the all-important predicate.

In [None]:
%sql DELETE FROM lakefs.dev.nyc.permits

How's that data looking now?

In [None]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

But `main` is safe and unsullied 😌

In [None]:
%%sql

SELECT count(*)
FROM lakefs.main.nyc.permits;

## Reverting changes to the `dev` branch

### Uncommitted objects:

In [None]:
print_diff(branchDev.uncommitted())

### Reset the branch

In [None]:
branchDev.reset_changes(path_type='common_prefix', path="nyc/permits/")

_This resets only the uncommitted changes to the files for this table. To reset the whole branch, use:_

```python
branchDev.reset_changes(path_type='reset')
```
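
The difference between the two reset calls can be pictured with a toy model over a list of uncommitted paths. This sketch does not call lakeFS; `remaining_after_reset` is a hypothetical helper, the paths are made up, and only the two `path_type` values used in this notebook are modelled.

```python
def remaining_after_reset(uncommitted_paths, path_type="reset", path=None):
    """Return the uncommitted paths that would remain after a reset (toy model)."""
    if path_type == "reset":
        return []  # whole-branch reset: every uncommitted change is dropped
    if path_type == "common_prefix":
        # only changes under the given prefix are dropped
        return [p for p in uncommitted_paths if not p.startswith(path)]
    raise ValueError(f"path_type not modelled in this sketch: {path_type}")


# Made-up uncommitted paths for illustration
changes = ["nyc/permits/metadata/v2.metadata.json", "nyc/other_table/data/file.parquet"]

print(remaining_after_reset(changes, "common_prefix", "nyc/permits/"))
print(remaining_after_reset(changes, "reset"))
```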

### Uncommitted objects:

In [None]:
print_diff(branchDev.uncommitted())

## Our data's back!

In [None]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

# Making changes to the `dev` branch as a collection

## Delete all rows for permits in `Manhattan` from the table

In [None]:
%sql DELETE FROM lakefs.dev.nyc.permits WHERE borough='Manhattan'

## Build an aggregate of the data to show how many permits were issued by category

In [None]:
%%sql

CREATE OR REPLACE TABLE lakefs.dev.nyc.agg_permit_category AS
SELECT category, count(*) permit_cnt
FROM lakefs.dev.nyc.permits
GROUP BY category;

In [None]:
%sql SELECT * FROM lakefs.dev.nyc.agg_permit_category LIMIT 5;

# Compare `main` and `dev`

## `dev`

In [None]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.dev.nyc.permits
GROUP BY borough

## `main`

In [None]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.main.nyc.permits
GROUP BY borough

## `Data diff`

`refs_data_diff` is an SQL table-valued function (TVF). The expression:
##### `refs_data_diff(PREFIX, FROM_SCHEMA, TO_SCHEMA, TABLE)`
yields a relation that compares the "from" table `PREFIX.FROM_SCHEMA.TABLE` with the "to" table `PREFIX.TO_SCHEMA.TABLE`. Its output is the difference: a relation (like a view) that adds a single column, `lakefs_change`, to the table schema.

* Rows that appear only in the first version of the table (in the example, on branch `main`) appear in the difference with `lakefs_change == '-'`.
* Rows that appear only in the second version of the table (in the example, on branch `dev`) appear in the difference with `lakefs_change == '+'`.
* Rows that appear in both versions of the table do not appear in the difference.
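
The three rules above can be mimicked in a few lines of plain Python. This only illustrates the semantics and is not how the function is implemented: the rows are made up, they are assumed to be distinct, and placing `lakefs_change` last in each row is an assumption of this sketch.

```python
def refs_data_diff_sketch(from_rows, to_rows):
    """Return rows found in only one side, tagged '-' (from only) or '+' (to only)."""
    from_set, to_set = set(from_rows), set(to_rows)
    removed = [row + ("-",) for row in sorted(from_set - to_set)]
    added = [row + ("+",) for row in sorted(to_set - from_set)]
    return removed + added  # rows present on both sides are left out


# Made-up (borough, category) rows for illustration
main_rows = [("Manhattan", "Film"), ("Brooklyn", "Television"), ("Queens", "Film")]
dev_rows = [("Brooklyn", "Television"), ("Queens", "Film"), ("Bronx", "Commercial")]

for row in refs_data_diff_sketch(main_rows, dev_rows):
    print(row)
```

Here the Manhattan row exists only on the "from" side and is tagged `-`, the Bronx row exists only on the "to" side and is tagged `+`, and the two shared rows do not appear.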

In [None]:
%%sql

SELECT * FROM refs_data_diff('lakefs', 'main', 'dev', 'nyc.permits') LIMIT 5;

In [None]:
%%sql

SELECT lakefs_change, borough, count(*) AS permit_diffs_cnt
FROM refs_data_diff('lakefs', 'main', 'dev', 'nyc.permits')
GROUP BY lakefs_change, borough;

# Partition the data in the `dev` branch

In [None]:
%%sql

CREATE TABLE lakefs.dev.nyc.permits_partitioned
USING iceberg
PARTITIONED BY (borough)
AS SELECT * FROM lakefs.dev.nyc.permits
ORDER BY borough;

In [None]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.dev.nyc.permits_partitioned
GROUP BY borough

# Commit the changes to the `dev` branch

In [None]:
ref = branchDev.commit(
    message="Remove data for Manhattan from permits dataset, build category aggregate",
    metadata={"etl job name": "etl_job_42",
              "author": "lakefs"})
print_commit(ref.get_commit())

# Merge the branch back into `main`

In [None]:
res = branchDev.merge_into(branchMain)
print(res)

---

---

---

In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"### 👉🏻 View the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo_name}/objects)")

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack