<img src="./images/logo.png" alt="lakeFS logo" width=300/> 

# Integration of lakeFS with Spark and Python

Use Case: Isolated Testing Environment

Access lakeFS using the S3A gateway

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you are not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you have configured. The storage namespace is the location in the underlying storage where the data for this repository will be stored.
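
The repository cell further down in this notebook appends the repository name to the storage namespace. A minimal sketch of that composition follows. The helper name `repo_storage_namespace` is hypothetical, and the check for a URI scheme and bucket is an added assumption, not something the notebook itself enforces.

```python
from urllib.parse import urlparse


def repo_storage_namespace(storage_namespace: str, repo_name: str) -> str:
    """Hypothetical helper: build '<storage namespace>/<repo name>'."""
    parsed = urlparse(storage_namespace)
    # Assumption: a usable namespace looks like '<scheme>://<bucket>[/prefix]'.
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Expected something like 's3://bucket', got {storage_namespace!r}")
    # Unlike the notebook's plain f-string, strip a trailing slash to avoid '//'.
    return f"{storage_namespace.rstrip('/')}/{repo_name}"


print(repo_storage_namespace("s3://example", "spark-demo"))
```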

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

In [None]:
repo_name = "spark-demo"

## Setup

### Set environment variables

In [None]:
import os
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey
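
Before moving on, it can help to confirm that the three variables set above are actually present. The sketch below is one way to do that; the helper name `missing_lakefs_env` is hypothetical and is not part of the lakeFS SDK.

```python
import os

# The three variables this notebook sets in the cell above.
REQUIRED_VARS = (
    "LAKECTL_SERVER_ENDPOINT_URL",
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
)


def missing_lakefs_env(environ=os.environ):
    """Hypothetical helper: return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_lakefs_env()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All lakeFS environment variables are set.")
```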

### Define lakeFS Repository

In [None]:
import lakefs

try:
    repo=lakefs.repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.properties.storage_namespace}")
except lakefs.exceptions.NotFoundException:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}")
        print(f"Created new repo {repo.id} using storage namespace {repo.properties.storage_namespace}")
    except lakefs.exceptions.LakeFSException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
except lakefs.exceptions.LakeFSException as e:
    print(f"Error getting repo {repo_name}: {e}")

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark
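
The S3A settings above point Spark at the lakeFS endpoint instead of at S3, and path-style access keeps the repository name in the URL path rather than in the hostname. As a sketch, the same settings can be collected in a dictionary and applied in a loop. The helper names `s3a_gateway_conf` and `apply_conf` are hypothetical; `apply_conf` assumes only that the builder has a `config(key, value)` method that returns a builder, as `SparkSession.builder` does.

```python
def s3a_gateway_conf(endpoint, access_key, secret_key):
    """Hypothetical helper: the S3A settings used in this notebook, as a dict."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def apply_conf(builder, conf):
    """Apply each key/value pair to a builder exposing config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Placeholder credentials for illustration only.
conf = s3a_gateway_conf("http://lakefs:8000", "<access-key>", "<secret-key>")
for key in sorted(conf):
    print(key)
```

With PySpark installed, `apply_conf(SparkSession.builder.appName("lakeFS / Jupyter"), conf).getOrCreate()` would produce a session configured like the one above, minus the `fs.s3.impl` mapping.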

## Versioning Information 

In [None]:
sourceBranch = "main"
newBranch = "experiment01"
newPath = "partitioned_data"
fileName = "userdata/userdata1.parquet"

## Upload a file

In [None]:
main = repo.branch(sourceBranch)
obj = main.object(path=fileName)

with open(f"/data/{fileName}", mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Spark Demo'}) as writer:
    writer.write(reader.read())

## Commit changes and attach some metadata

In [None]:
ref = main.commit(message='Added my first file!', metadata={'using': 'python_sdk'})
print(ref.get_commit())

## Read data using the S3A gateway

In [None]:
dataPath = f"s3a://{repo.id}/{sourceBranch}/{fileName}"
print(f"Reading Parquet file from {dataPath}")
df = spark.read.parquet(dataPath)
df.show()
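
The path used above follows the layout `s3a://<repository>/<branch or ref>/<object path>`: the repository takes the place of the bucket, and the branch is the first path segment. A small sketch of that layout, using a hypothetical helper named `lakefs_s3a_path`:

```python
def lakefs_s3a_path(repo_id: str, ref: str, path: str) -> str:
    """Hypothetical helper: build an S3A URI for an object on a lakeFS branch or ref."""
    # Strip a leading slash from the object path so the URI has no empty segment.
    return f"s3a://{repo_id}/{ref}/{path.lstrip('/')}"


print(lakefs_s3a_path("spark-demo", "main", "userdata/userdata1.parquet"))
```

Switching the `ref` argument from `main` to another branch name is all it takes to read the same object from that branch.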

# Experimentation Starts

## List the repository branches using the lakeFS Python SDK

In [None]:
for branch in repo.branches():
    print(branch.id)

## Create a new branch

In [None]:
branch1 = repo.branch(newBranch).create(source_reference=sourceBranch)
print(f"{newBranch} ref:", branch1.get_commit().id)

## Partition the data and write it to the new branch using the S3A gateway

In [None]:
newDataPath = f"s3a://{repo.id}/{newBranch}/{newPath}"

df.write.partitionBy("gender").parquet(newDataPath)
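
Spark's `partitionBy` writes one subdirectory per distinct value of the partition column, named `column=value`. The sketch below shows where each partition would land under the new branch. The helper name `partition_dir` is hypothetical, and the values `Female` and `Male` are illustrative assumptions about the data, not taken from the file.

```python
def partition_dir(base_path: str, column: str, value: str) -> str:
    """Hypothetical helper: the directory Spark would write for one partition value."""
    return f"{base_path.rstrip('/')}/{column}={value}"


base = "s3a://spark-demo/experiment01/partitioned_data"
# Illustrative values only; the real set depends on the data.
for value in ("Female", "Male"):
    print(partition_dir(base, "gender", value))
```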

## Commit changes and attach some metadata

In [None]:
ref = branch1.commit(message='Partitioned Parquet file!', metadata={'using': 'python_sdk'})
print(ref.get_commit())

## Diff between the new branch and the source branch

In [None]:
for diff in main.diff(other_ref=branch1):
    print(diff)
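
When a diff is long, a count per change type is often easier to read than the full listing. The sketch below uses a stand-in `Change` tuple so that it runs without a lakeFS server. The `type` and `path` field names are an assumption about the objects the SDK yields from `main.diff(...)`, and the sample entries are made up for illustration.

```python
from collections import Counter, namedtuple

# Stand-in for the objects yielded by main.diff(); the field names are an assumption.
Change = namedtuple("Change", ["type", "path"])


def summarize_changes(changes):
    """Hypothetical helper: count changes by their type attribute."""
    return Counter(change.type for change in changes)


# Made-up sample entries, not real output from this notebook.
sample = [
    Change("added", "partitioned_data/gender=Female/part-0.parquet"),
    Change("added", "partitioned_data/gender=Male/part-0.parquet"),
    Change("changed", "userdata/userdata1.parquet"),
]
print(dict(summarize_changes(sample)))
```

If the assumption about the field names holds, `summarize_changes(main.diff(other_ref=branch1))` would give the same kind of summary for the real diff.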

# Experimentation Completes

## Option A: Experimentation succeeds, so merge the new branch into the main branch (atomic promotion to production)

### Do the merge

In [None]:
res = branch1.merge_into(main)
print(res)

### If you merged the new branch into the main branch, you can atomically roll back all changes

In [None]:
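# Revert the merge commit at the head of main; parent_number=1 reverts relative to the merge's first parent.
main.revert(parent_number=1, reference_id=sourceBranch)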

## Option B: Experimentation fails, so just delete the new branch

In [None]:
# Uncomment if you want to run this

#branch1.delete()

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack