<img src="./images/logo.svg" alt="lakeFS logo" width=300/> <img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="Apache Iceberg logo" width=300/>  

## lakeFS ❤️ Apache Iceberg - an example using the NYC Film Permits dataset

# Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
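
The hard-coded values above are the sample credentials for the bundled sandbox. If you point this notebook at a different environment, one option is to read the settings from environment variables instead. The sketch below is only an illustration: the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY` and `LAKEFS_SECRET_KEY` and the helper `load_lakefs_settings` are assumptions made for this example, not something lakeFS defines.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
# The fallbacks are the sample values used by the bundled lakeFS server.
def load_lakefs_settings(env=os.environ):
    """Return endpoint and credentials, preferring values found in `env`."""
    return {
        "endpoint": env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        "access_key": env.get("LAKEFS_ACCESS_KEY", "AKIAIOSFODNN7EXAMPLE"),
        "secret_key": env.get("LAKEFS_SECRET_KEY",
                              "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"),
    }

settings = load_lakefs_settings()
lakefsEndPoint = settings["endpoint"]
lakefsAccessKey = settings["access_key"]
lakefsSecretKey = settings["secret_key"]
```

With none of the variables set, this yields the same three values as the cell above.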

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

# Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "lakefs-iceberg-nyc"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [5]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.config.get_lake_fs_version()
except Exception as e:
    # Catch Exception rather than using a bare except, and show the cause
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version 0.104.0


### Define lakeFS Repository

In [6]:
import os  # needed for the os._exit() calls below

from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository lakefs-iceberg-nyc does not exist, so going to try and create it now.
Created new repo lakefs-iceberg-nyc using storage namespace s3://example/lakefs-iceberg-nyc


### Set up Spark

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Iceberg / Jupyter") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.0.1") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
        .config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
        .config("spark.sql.catalog.lakefs.uri", lakefsEndPoint) \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

---

# Main demo starts here 🚦 👇🏻

# Load some Data

For this demo, we will use the [New York City Film Permits dataset](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p), available as part of the NYC Open Data initiative. We're using a locally saved copy of a 1,000-record sample, but feel free to download the entire dataset to use in this notebook!

We'll save the sample dataset into an Iceberg table called `permits`, using lakeFS for the catalog.
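
Table names in this notebook have four parts, such as `lakefs.main.nyc.permits`. Judging by the Spark configuration above and the queries that follow, these read as catalog, lakeFS branch, Iceberg namespace, and table. The helper below is a hypothetical illustration of that reading; `parse_table_identifier` and `LakeFSTableRef` are names made up for this sketch and are not part of lakeFS or Iceberg.

```python
from typing import NamedTuple


class LakeFSTableRef(NamedTuple):
    catalog: str    # Spark catalog name, "lakefs" in this notebook
    ref: str        # lakeFS branch, e.g. "main" or "dev"
    namespace: str  # Iceberg namespace, e.g. "nyc"
    table: str      # table name, e.g. "permits"


def parse_table_identifier(identifier: str) -> LakeFSTableRef:
    """Split 'catalog.ref.namespace.table' into its parts (illustrative only)."""
    parts = identifier.split(".")
    if len(parts) < 4:
        raise ValueError(f"expected catalog.ref.namespace.table, got {identifier!r}")
    catalog, ref, *namespace, table = parts
    return LakeFSTableRef(catalog, ref, ".".join(namespace), table)


print(parse_table_identifier("lakefs.main.nyc.permits"))
print(parse_table_identifier("lakefs.dev.nyc.permits").ref)  # dev
```

Switching branches is then just a matter of changing the second part of the name.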

In [8]:
df = spark.read.option("inferSchema","true").option("multiline","true").json("/data/nyc_film_permits.json")

In [10]:
df.write.saveAsTable("lakefs.main.nyc.permits")

<strong style="color:red;">If the above step fails, try re-running it. See https://github.com/treeverse/lakefs-iceberg/issues/23 for more details.</strong>

In [11]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"#### 👉🏻 Optionally, go and view the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo.id}/objects?ref=main&path=nyc%2Fpermits%2F)")

#### 👉🏻 Optionally, go and view the objects in [lakeFS web UI](http://localhost:8000/repositories/lakefs-iceberg-nyc/objects?ref=main&path=nyc%2Fpermits%2F)
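
The `path=nyc%2Fpermits%2F` part of that link is the URL-encoded form of `nyc/permits/`. As a small sketch, the standard library can build the query string so the encoding doesn't have to be written by hand; `objects_url` is a hypothetical helper name used only here.

```python
from urllib.parse import urlencode


def objects_url(web_ui: str, repo_id: str, ref: str, path: str) -> str:
    """Build a link to the object browser for a given ref and path."""
    # urlencode percent-encodes "/" in values, giving nyc%2Fpermits%2F
    query = urlencode({"ref": ref, "path": path})
    return f"{web_ui}/repositories/{repo_id}/objects?{query}"


print(objects_url("http://localhost:8000", "lakefs-iceberg-nyc", "main", "nyc/permits/"))
# http://localhost:8000/repositories/lakefs-iceberg-nyc/objects?ref=main&path=nyc%2Fpermits%2F
```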

Taking a quick peek at the data, you can see that there are a number of permits for different boroughs in New York.

In [13]:
%%sql

SELECT borough, count(*) AS permit_cnt
FROM lakefs.main.nyc.permits
GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Manhattan,463
Brooklyn,334
Staten Island,7


### Commit the new table and its data

In [14]:
lakefs.commits.commit(repo.id, "main", CommitCreation(
    message="Initial data load",
    metadata={'author': 'rmoff',
              'data source': 'https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p'}
) )

{'committer': 'everything-bagel',
 'creation_date': 1687363839,
 'id': '2678267a124da3a6dc76bef3d1af2f13e99a7336530a92db9109c43bdf4f6ded',
 'message': 'Initial data load',
 'meta_range_id': '',
 'metadata': {'author': 'rmoff',
              'data source': 'https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p'},
 'parents': ['cecd7e91857ac11a2be565a0c04533b356a8f1c8ec27202422b0921bf16d0afe']}
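
The `creation_date` in the response is a plain integer. Assuming it is a Unix timestamp in seconds (an assumption of this sketch, though consistent with the value shown), it can be turned into a readable UTC time like this:

```python
from datetime import datetime, timezone

# Value copied from the commit response above;
# assumed to be seconds since the Unix epoch.
creation_date = 1687363839

committed_at = datetime.fromtimestamp(creation_date, tz=timezone.utc)
print(committed_at.isoformat())  # 2023-06-21T16:10:39+00:00
```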

# Create a new branch

_This is copy-on-write; we're not duplicating the data_

In [15]:
lakefs.branches.create_branch(repo.id, 
                              BranchCreation(name="dev",
                                             source="main"))

'2678267a124da3a6dc76bef3d1af2f13e99a7336530a92db9109c43bdf4f6ded'
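
Note that the ID returned here is the same as the ID of the "Initial data load" commit on `main`. A rough mental model, and only that, is that a branch is a name pointing at a commit, so creating one copies a pointer rather than any data. The toy code below illustrates the idea with a plain dictionary; it is not how lakeFS is implemented.

```python
# Toy model only: a branch is just a name that points at a commit ID.
INITIAL_COMMIT = "2678267a124da3a6dc76bef3d1af2f13e99a7336530a92db9109c43bdf4f6ded"

branches = {"main": INITIAL_COMMIT}


def create_branch(branches, name, source):
    """Point a new branch name at the commit its source branch points at."""
    if name in branches:
        raise ValueError(f"branch {name!r} already exists")
    branches[name] = branches[source]  # copies a reference, not the data
    return branches[name]


dev_head = create_branch(branches, "dev", "main")
print(dev_head == branches["main"])  # True
```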

### Confirm that we can see the data on the `dev` branch

In [16]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

count(1)
1000


# Making [and reverting] changes on the dev branch

Let's go big! Let's see what happens when we delete the entire contents of the table with a careless `DELETE` that omits the all-important `WHERE` predicate.

In [17]:
%sql DELETE FROM lakefs.dev.nyc.permits

How's that data looking now?

In [18]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

count(1)
0


But `main` is safe and unsullied 😌

In [19]:
%%sql

SELECT count(*)
FROM lakefs.main.nyc.permits;

count(1)
1000


## Reverting changes to the `dev` branch

### Uncommitted objects:

In [22]:
lakefs.branches.diff_branch(repo.id, "dev").results

[]

### Reset the branch

In [23]:
lakefs.branches.reset_branch(repo.id, 
                             "dev",
                             ResetCreation(type="common_prefix", 
                                           path="nyc/permits/"))

_This resets only the changes to the files for this table. To reset the whole branch, use:_

```python
lakefs.branches.reset_branch(repo.id, 
                             "dev",
                             ResetCreation(type="reset"))
```
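
To make the difference in scope concrete, the sketch below filters a list of changed paths by prefix, which is conceptually what limiting a reset to `nyc/permits/` means. The file paths are made up for illustration and `paths_under_prefix` is a hypothetical helper; this models the idea only and is not lakeFS code.

```python
def paths_under_prefix(paths, prefix):
    """Return the paths that a prefix-scoped operation would cover."""
    return [p for p in paths if p.startswith(prefix)]


# Made-up example paths, not taken from a real repository listing.
changed = [
    "nyc/permits/metadata/example.metadata.json",
    "nyc/permits/data/example.parquet",
    "nyc/other_table/data/example.parquet",
]

print(paths_under_prefix(changed, "nyc/permits/"))
# ['nyc/permits/metadata/example.metadata.json', 'nyc/permits/data/example.parquet']
print(len(paths_under_prefix(changed, "")))  # 3, an empty prefix matches everything
```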

### Uncommitted objects:

In [24]:
lakefs.branches.diff_branch(repo.id, "dev").results

[]

## Our data's back!

In [23]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

count(1)
1000


# Making changes to the `dev` branch as a collection

## Delete all rows for permits in `Manhattan` from the table

In [25]:
%sql DELETE FROM lakefs.dev.nyc.permits WHERE borough='Manhattan'

## Build an aggregate of the data to show how many permits were issued per category

In [26]:
%%sql

CREATE OR REPLACE TABLE lakefs.dev.nyc.agg_permit_category AS
SELECT category, count(*) permit_cnt
FROM lakefs.dev.nyc.permits
GROUP BY category;

In [27]:
%sql SELECT * FROM lakefs.dev.nyc.agg_permit_category LIMIT 5;

category,permit_cnt
Television,429
WEB,17
Commercial,26
Film,24
Theater,27


# Compare `main` and `dev`

## `dev`

In [28]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.dev.nyc.permits
GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Brooklyn,334
Staten Island,7


## `main`

In [29]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.main.nyc.permits
GROUP BY borough

borough,permit_cnt
Queens,168
Bronx,28
Manhattan,463
Brooklyn,334
Staten Island,7
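
Comparing the two result sets by eye works for five rows. As a small sketch, the same comparison can be done in plain Python using the counts printed above; `count_changes` is a helper written for this example.

```python
# Borough counts copied from the two query results above.
main_counts = {"Queens": 168, "Bronx": 28, "Manhattan": 463,
               "Brooklyn": 334, "Staten Island": 7}
dev_counts = {"Queens": 168, "Bronx": 28, "Brooklyn": 334, "Staten Island": 7}


def count_changes(before, after):
    """Return {key: after - before} for every key whose count differs."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if delta:
            changes[key] = delta
    return changes


print(count_changes(main_counts, dev_counts))                 # {'Manhattan': -463}
print(sum(main_counts.values()), sum(dev_counts.values()))    # 1000 537
```

The only difference between the branches is the 463 Manhattan rows removed on `dev`.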


# Commit the changes to the `dev` branch

In [30]:
lakefs.commits.commit(repo.id, "dev", 
                      CommitCreation(
                          message="Remove data for Manhattan from permits dataset, build category aggregate",
                          metadata={"etl job name": "etl_job_42",
                                    "author": "rmoff"}
                      ))

{'committer': 'everything-bagel',
 'creation_date': 1687364001,
 'id': '7fcd65559413cb106baf2448ba890617d087b8b920f2782b71ea5def8887f337',
 'message': 'Remove data for Manhattan from permits dataset, build category '
            'aggregate',
 'meta_range_id': '',
 'metadata': {'author': 'rmoff', 'etl job name': 'etl_job_42'},
 'parents': ['2678267a124da3a6dc76bef3d1af2f13e99a7336530a92db9109c43bdf4f6ded']}

# Merge the branch back into `main`

In [31]:
lakefs.refs.merge_into_branch(repository=repo.id, 
                              source_ref="dev", 
                              destination_branch="main")

{'reference': '628b75ecdf287c7a895fdd729310ce97307449830b84bd81a9f5ef4204fe76e1'}

---

---

---

In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"### 👉🏻 View the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo.id}/objects)")