<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Integration of lakeFS with Spark and Python

Use Case: Isolated Testing Environment

Access lakeFS using the lakeFS Hadoop FileSystem (`lakefs://` URIs)

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you're not using the lakeFS server provided in the samples repo, change the storage namespace to a location in the bucket you’ve configured. The storage namespace is the location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket" (no trailing slash; the repository name is appended as "/<repo_name>")
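The storage namespace is later combined with the repository name to form the repository's storage location, so a stray trailing slash would produce a path with `//` in it. The following is a minimal sketch of a defensive join; `join_namespace` is a hypothetical helper for illustration and is not part of the lakeFS SDK.

```python
def join_namespace(storage_namespace: str, repo_name: str) -> str:
    """Join a storage namespace and a repository name with exactly one slash.

    Hypothetical helper for illustration; not part of the lakeFS SDK.
    """
    return f"{storage_namespace.rstrip('/')}/{repo_name.strip('/')}"


# Both spellings of the namespace yield the same location.
print(join_namespace("s3://example/", "spark-demo"))
print(join_namespace("s3://example", "spark-demo"))
```

Normalizing at the point of use means the value typed into the cell above works with or without a trailing slash.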

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "spark-demo"

## Versioning Information 

In [None]:
sourceBranch = "main"
newBranch = "experiment01"
newPath = "partitioned_data"
fileName = "userdata/userdata1.parquet"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey
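Before contacting the server, it can help to confirm that all three variables are actually set. A small sketch, assuming only the variable names used in the cell above; `missing_lakectl_vars` is a hypothetical helper, not an SDK function.

```python
import os

# Names of the variables set in the cell above.
REQUIRED_VARS = (
    "LAKECTL_SERVER_ENDPOINT_URL",
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
)


def missing_lakectl_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# An empty list means every required variable has a non-empty value.
print("Missing variables:", missing_lakectl_vars())
```

This checks only that the values are present; whether they are valid is confirmed by the version check in the next section.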

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
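    v = lakefs.client.Client().version
except Exception as e:
    # Show the underlying error so connection and credential problems can be told apart
    print(f"🛑 failed to get lakeFS version: {e}")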
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)

### Start Spark session

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .getOrCreate()

spark

## Upload a file

In [None]:
obj = branchMain.object(path=fileName)

with open(f"./data/{fileName}", mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Spark Demo'}) as writer:
    writer.write(reader.read())

## Commit changes and attach some metadata

In [None]:
ref = branchMain.commit(message='Added my first file!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

## Reading data by using lakeFS File System

In [None]:
dataPath = f"lakefs://{repo_name}/{sourceBranch}/{fileName}"
print(f"Reading Parquet file from {dataPath}")
df = spark.read.parquet(dataPath)
df.show()
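The path read above follows the pattern `lakefs://<repository>/<ref>/<path>`, where the ref is a branch name here. As a rough sketch, a hypothetical `lakefs_uri` helper (not part of the lakeFS SDK) makes that structure explicit, so the same object can be addressed on different branches:

```python
def lakefs_uri(repo: str, ref: str, path: str = "") -> str:
    """Build a lakefs://<repository>/<ref>/<path> URI.

    Hypothetical helper for illustration; not part of the lakeFS SDK.
    """
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"


# Same object path on two different branches.
print(lakefs_uri("spark-demo", "main", "userdata/userdata1.parquet"))
print(lakefs_uri("spark-demo", "experiment01", "userdata/userdata1.parquet"))
```

Only the ref component changes between branches, which is what lets the experiment below run against isolated copies of the same paths.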

# Experimentation Starts

## List the repository branches by using the lakeFS Python SDK

In [None]:
for branch in repo.branches():
    print(branch.id)

## Create a new branch

In [None]:
branchNew = repo.branch(newBranch).create(source_reference=sourceBranch)
print(f"{newBranch} ref:", branchNew.get_commit().id)

## Partition the data and write to new branch by using lakeFS File System

In [None]:
newDataPath = f"lakefs://{repo_name}/{newBranch}/{newPath}"

df.write.partitionBy("gender").parquet(newDataPath)

## Commit changes and attach some metadata

In [None]:
ref = branchNew.commit(message='Partitioned Parquet file!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

## Diff between the new branch and the source branch

In [None]:
diff = branchMain.diff(other_ref=branchNew)
print_diff(diff)
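For a large diff, a per-type summary can be easier to scan than the full listing. The sketch below assumes each diff entry exposes a `type` attribute (for example `added`); the sample entries are stand-ins built with `SimpleNamespace`, not real lakeFS output, and `count_changes` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def count_changes(changes):
    """Count diff entries by their `type` attribute (e.g. 'added')."""
    return Counter(change.type for change in changes)


# Stand-in entries for illustration only; not real lakeFS output.
sample = [
    SimpleNamespace(type="added", path="example/a"),
    SimpleNamespace(type="added", path="example/b"),
    SimpleNamespace(type="removed", path="example/c"),
]
print(dict(count_changes(sample)))
```

In the notebook, the iterable returned by `branchMain.diff(...)` could be passed in place of `sample`, provided its entries carry a `type` attribute as assumed here.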

# Experimentation Completes

## Option A: Experimentation succeeds, so merge the new branch into the main branch (atomic promotion to production)

### Do the merge

In [None]:
res = branchNew.merge_into(branchMain)
print(res)

### If you merged the new branch into the main branch, you can atomically roll back all changes

In [None]:
branchMain.revert(parent_number=1, reference=sourceBranch)

## Option B: Experimentation fails, so just delete the new branch

In [None]:
# Uncomment if you want to run this

#branchNew.delete()

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack