<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Integration of lakeFS with Delta Lake and Python

* [📚 lakeFS Delta Integration Docs](https://docs.lakefs.io/integrations/delta.html)
* [Delta Lake](https://delta.io/)
* [delta-rs deltalake package for Python](https://delta-io.github.io/delta-rs/python/)

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
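
If you would rather not edit the values in the notebook itself, one option is to read them from environment variables and fall back to the demo defaults. The following is a minimal sketch, not part of the original demo: `load_lakefs_config` and `DEMO_DEFAULTS` are hypothetical names, and the variable names are the `LAKECTL_*` ones this notebook sets in its Setup section.

```python
import os

# Demo defaults copied from the cell above; replace them for a real deployment.
DEMO_DEFAULTS = {
    "LAKECTL_SERVER_ENDPOINT_URL": "http://lakefs:8000",
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID": "AKIAIOSFOLKFSSAMPLES",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}


def load_lakefs_config(env=None):
    """Return the lakeFS settings, preferring values found in `env`.

    `env` is any mapping; it defaults to the process environment.
    (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEMO_DEFAULTS.items()}


# Override only the endpoint; the credentials fall back to the demo defaults.
config = load_lakefs_config({"LAKECTL_SERVER_ENDPOINT_URL": "https://lakefs.example.com"})
print(config["LAKECTL_SERVER_ENDPOINT_URL"])
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to try out without touching the real environment.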

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(You shouldn't need to change anything in this section; just run it.)**

In [None]:
repo_name = "delta-lake-python-demo"

### Install and load libraries

In [None]:
! pip install deltalake

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit
import pandas as pd
import deltalake

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception instead of using a bare except, and report the cause
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch='main', exist_ok=True)
branchMain = repo.branch('main')
print(repo)

### lakeFS S3 gateway config

In [None]:
storage_options = {"AWS_ACCESS_KEY_ID": lakefsAccessKey, 
                   "AWS_SECRET_ACCESS_KEY":lakefsSecretKey,
                   "AWS_ENDPOINT": lakefsEndPoint,
                   "AWS_REGION": "us-east-1",
                   "AWS_ALLOW_HTTP": "true",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
                  }
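
The options above can also be assembled by a small function, which makes it easier to switch between the local demo server and a hosted lakeFS endpoint. This is a sketch under assumptions: `build_storage_options` is a hypothetical helper, and it assumes `AWS_ALLOW_HTTP` is only needed when the endpoint uses plain `http://`. The option keys are the ones used in the cell above.

```python
from urllib.parse import urlparse


def build_storage_options(endpoint, access_key, secret_key, region="us-east-1"):
    """Build delta-rs storage options that point at a lakeFS S3 gateway.

    Hypothetical helper; the keys mirror the dictionary in the cell above.
    """
    options = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_ENDPOINT": endpoint,
        "AWS_REGION": region,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    }
    # Assumption: plain-HTTP access only has to be allowed for http:// endpoints.
    if urlparse(endpoint).scheme == "http":
        options["AWS_ALLOW_HTTP"] = "true"
    return options


local_options = build_storage_options("http://lakefs:8000", "my-access-key", "my-secret-key")
hosted_options = build_storage_options("https://lakefs.example.com", "my-access-key", "my-secret-key")
print(sorted(local_options))
```

The access key, secret key, and `lakefs.example.com` host in the usage lines are placeholders, not real values.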

---

# Main demo starts here 🚦 👇🏻

## Load some test data

In [None]:
df = pd.read_parquet('/data/userdata/userdata1.parquet')

In [None]:
subset = df.sample(frac=0.011, random_state=42)
print(f"There are {subset.shape[0]} rows in the sample dataset")

In [None]:
subset
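
Because `random_state` is fixed, the same rows are selected every time the cell runs. The sketch below illustrates this on a synthetic DataFrame; the 1000-row size and the column names are assumptions made for the example and are not taken from the `userdata1.parquet` file.

```python
import pandas as pd

# Synthetic stand-in for the real data (assumed size and made-up columns).
df_demo = pd.DataFrame({"id": range(1000), "value": [i * 2 for i in range(1000)]})

# Same fraction and seed: both samples contain exactly the same rows.
first = df_demo.sample(frac=0.011, random_state=42)
second = df_demo.sample(frac=0.011, random_state=42)

# A different seed gives a sample of the same size, generally with different rows.
other = df_demo.sample(frac=0.011, random_state=21)

print(len(first), len(other))
print(first.equals(second))
```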

## Write the test data to the `main` branch as a Delta table

This step uses the delta-rs [`deltalake` Python library](https://delta-io.github.io/delta-rs/python/usage.html#writing-delta-tables) to write the table through the lakeFS S3 gateway.

In [None]:
storage_options

In [None]:
deltalake.write_deltalake(table_or_uri=f"s3a://{repo_name}/main/userdata/", 
                          data = subset,
                          mode='overwrite',
                          storage_options=storage_options)
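
The table URI follows the lakeFS S3 gateway addressing scheme: the repository name takes the place of the bucket, and the first path component is the branch (or another ref). A minimal sketch of building such URIs follows; `lakefs_table_uri` is a hypothetical helper, not part of the `lakefs` or `deltalake` packages.

```python
def lakefs_table_uri(repo, ref, path, scheme="s3a"):
    """Return a gateway URI of the form <scheme>://<repo>/<ref>/<path>/.

    Hypothetical helper, for illustration only.
    """
    # Drop surrounding slashes so the result never contains "//" in the path.
    clean_path = path.strip("/")
    return f"{scheme}://{repo}/{ref}/{clean_path}/"


main_uri = lakefs_table_uri("delta-lake-python-demo", "main", "userdata")
print(main_uri)
```

Changing only the `ref` argument points the same code at the table as it exists on a different branch.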

## Read the Delta table from lakeFS with Python

In [None]:
my_new_dt = deltalake.DeltaTable(f"s3a://{repo_name}/main/userdata/", storage_options=storage_options)

In [None]:
my_new_dt.history()

In [None]:
my_new_dt.version()

In [None]:
print(f"{my_new_dt.to_pandas().shape[0]} rows read in the table")

## Write some more data to the table

In [None]:
subset = df.sample(frac=0.011, random_state=21)
print(f"There are {subset.shape[0]} rows in the sample dataset")

In [None]:
subset

In [None]:
deltalake.write_deltalake(table_or_uri=f"s3a://{repo_name}/main/userdata/", 
                          data = subset,
                          mode='append',
                          storage_options=storage_options)

## Re-read the Delta table

In [None]:
my_new_dt = deltalake.DeltaTable(f"s3a://{repo_name}/main/userdata/", storage_options=storage_options)

In [None]:
my_new_dt.history()

In [None]:
my_new_dt.version()

In [None]:
my_new_dt.file_uris()

In [None]:
print(f"{my_new_dt.to_pandas().shape[0]} rows read in the table")

## Commit the data in lakeFS

In [None]:
ref = branchMain.commit(message="Initial data load",
    metadata={'using': 'python_api'})
print_commit(ref.get_commit())

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack