<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# lakeFS and Delta Lake diff

This notebook demonstrates how to use Delta Lake with lakeFS and the Delta Lake diff plugin.

For more details see [the published blog article](https://lakefs.io/blog/lakefs-supports-delta-lake-diff/).

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"
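
One way to avoid editing the cells above for each environment is to read the values from environment variables and fall back to the sample defaults. The following is a minimal sketch only: the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, `LAKEFS_SECRET_KEY` and `STORAGE_NAMESPACE`, and the helper `load_settings`, are assumptions made for illustration and are not names defined by lakeFS.

```python
import os

# Defaults mirror the sample values used in this notebook.
# The environment variable names are hypothetical.
DEFAULTS = {
    "LAKEFS_ENDPOINT": "http://lakefs:8000",
    "LAKEFS_ACCESS_KEY": "AKIAIOSFOLKFSSAMPLES",
    "LAKEFS_SECRET_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "STORAGE_NAMESPACE": "s3://example",
}

def load_settings(env=None):
    """Return each setting from `env`, falling back to the sample default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}

settings = load_settings()
lakefsEndPoint = settings["LAKEFS_ENDPOINT"]
lakefsAccessKey = settings["LAKEFS_ACCESS_KEY"]
lakefsSecretKey = settings["LAKEFS_SECRET_KEY"]
storageNamespace = settings["STORAGE_NAMESPACE"]
```

With no variables set, the values are identical to the ones assigned in the cells above.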

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "delta-lake-diff"

### Create lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.config.get_config()
except Exception as e:
    # Catch Exception instead of using a bare except, and report the cause
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v['version_config']['latest_version']}")

### Define lakeFS Repository

In [None]:
import os  # needed for the os._exit() calls below
from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
                    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
                    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
                    .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
                    .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
                    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
                    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
                    .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

# Main demo starts here 🚦 👇🏻

## Load some data into lakeFS

Read a Parquet file from the local filesystem

In [None]:
df = spark.read.parquet("/data/userdata/userdata1.parquet")

How many rows of data?

In [None]:
display(df.count())

What does the data look like?

In [None]:
df.show(n=1, vertical=True)  # show() prints the row itself and returns None

## Write data to lakeFS (on the `main` branch) in Delta format

In [None]:
branch='main'

In [None]:
df.write.format("delta").mode('overwrite').save('s3a://'+repo.id+'/'+branch+'/demo/users')
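
The write above addresses the table as `s3a://<repository>/<branch>/<path>`, and the same layout is used again later when the table is opened on another branch. A small helper can keep that construction in one place. This is a sketch only; `lakefs_table_path` is a hypothetical name and is not part of the lakeFS client.

```python
def lakefs_table_path(repo_id, branch, path):
    """Build the s3a:// URI used in this notebook for a path on a lakeFS branch."""
    # Strip a leading slash so the result never contains a double slash
    return f"s3a://{repo_id}/{branch}/{path.lstrip('/')}"

# Illustrative values matching this notebook's repository and branch
users_path = lakefs_table_path("delta-lake-diff", "main", "demo/users")
```

The result could be passed to `save()` here, or to `DeltaTable.forPath` later, in place of the concatenated strings.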

#### 👉🏻[The data as seen from lakeFS](http://localhost:8000/repositories/delta-lake-diff/objects?ref=main&path=demo%2Fusers%2F)

### Commit the new data in `main`

In [None]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                          message="Initial user data load"
                     ))

## Create a branch

In [None]:
branch='modify_user_data'

In [None]:
lakefs.branches.create_branch(repository=repo.id, 
                              branch_creation=BranchCreation(name=branch, 
                                                                    source="main")
                             )

### List the current branches in the repository

In [None]:
for b in lakefs.branches.list_branches(repo.id).results:
    display(b.id)

## Add some new data with merge

In [None]:
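# Import only the names used below; wildcard imports from
# pyspark.sql.functions shadow Python built-ins such as sum and max
from delta.tables import DeltaTable
from pyspark.sql.functions import col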

In [None]:
new_df = spark.read.parquet("/data/userdata/userdata2.parquet")

In [None]:
users_deltaTable = DeltaTable.forPath(spark, 's3a://'+repo.id+'/'+branch+'/demo/users')

In [None]:
users_deltaTable.alias("users").merge(
    source = new_df.alias("new_users"),
    condition = "users.id = new_users.id") \
  .whenNotMatchedInsertAll() \
  .execute()
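
Because the merge above has only a `whenNotMatchedInsertAll()` clause, it acts as an insert-only merge: source rows whose `id` already exists in the table are ignored, and the rest are appended. The plain-Python sketch below illustrates that behaviour on lists of dictionaries. It is a simplified model, not how Delta Lake implements the operation, and `insert_only_merge` is a hypothetical helper.

```python
def insert_only_merge(target, source, key="id"):
    """Append source rows whose key is absent from target; keep target rows unchanged."""
    existing = {row[key] for row in target}
    merged = list(target)
    for row in source:
        if row[key] not in existing:
            merged.append(row)
    return merged

users = [{"id": 1, "country": "Portugal"}]
new_users = [{"id": 1, "country": "China"}, {"id": 2, "country": "France"}]
merged = insert_only_merge(users, new_users)
```

Here the row with `id` 1 keeps its original values and only the row with `id` 2 is added.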

### Commit in lakeFS

In [None]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                          message="Merge in new user data"
                     ))

## Update some data

In [None]:
deltaTable = DeltaTable.forPath(spark, f"s3a://{repo.id}/{branch}/demo/users")

In [None]:
deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(5)

In [None]:
deltaTable.update(
    condition = "country == 'Portugal'",
    set = { "ip_address" : "'x.x.x.x'" })

In [None]:
deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(10)

### Commit in lakeFS

In [None]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                          message="Mask all IPs for users in Portugal"
                     ))

## Delete some data

In [None]:
deltaTable.toDF().filter(col("salary") > 60000).count()

In [None]:
deltaTable.delete(col("salary") > 60000)

In [None]:
deltaTable.toDF().filter(col("salary") > 60000).count()

### Commit in lakeFS

In [None]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                            message="Delete users with salary over 60k"
                     ))
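
With the commits in place, the object-level changes between `main` and the branch can also be inspected from code. The sketch below assumes the `lakefs_client` SDK exposes `refs.diff_refs(repository, left_ref, right_ref)` returning a page whose `results` entries carry a `type` field. Check this against the SDK version you have installed. It reads only the first page of results, and it reports changed objects rather than the table-level changes that the Delta Lake diff plugin shows in the UI. `summarize_diff` is a hypothetical helper.

```python
from collections import Counter

def summarize_diff(client, repository, left_ref, right_ref):
    """Count changed objects by change type between two refs (first page only)."""
    page = client.refs.diff_refs(repository=repository,
                                 left_ref=left_ref,
                                 right_ref=right_ref)
    return Counter(entry.type for entry in page.results)

# Usage in this notebook (assumes the `lakefs` client and `repo` defined earlier):
# summarize_diff(lakefs, repo.id, "main", "modify_user_data")
```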

### Look at the data and diffs in lakeFS

In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"### 👉🏻 Go to lakeFS UI and click on [Show table changes]({lakeFSWebUI}/repositories/{repo.id}/compare?ref=main&compare=modify_user_data&prefix=demo%2F)")
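
The compare link above hard-codes the branch name and a pre-encoded prefix (`demo%2F`). As a sketch, the same URL can be assembled with `urllib.parse.urlencode`, which takes care of the percent-encoding. `compare_url` is a hypothetical helper, and the URL layout is simply copied from the link used in this notebook.

```python
from urllib.parse import urlencode

def compare_url(web_ui, repo_id, base_ref, compare_ref, prefix):
    """Build a lakeFS UI compare link in the same form as the one used above."""
    query = urlencode({"ref": base_ref, "compare": compare_ref, "prefix": prefix})
    return f"{web_ui}/repositories/{repo_id}/compare?{query}"

# Illustrative values matching this notebook
url = compare_url("http://localhost:8000", "delta-lake-diff",
                  "main", "modify_user_data", "demo/")
```

Passing `branch` as `compare_ref` would keep the link in step with the branch variable if it is ever renamed.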