# Data Lineage with lakeFS

## Use Case: Understand data transformations by using commits with metadata and "Blame" functionality

## Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use lakeFS Cloud (https://lakefs.cloud), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/installing.html).

![lineage](./Images/CommitFlow.png)

## Change your lakeFS credentials

In [None]:
lakefsEndPoint = 'https://' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = ''
lakefsSecretKey = ''

## Storage Information
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = '' # e.g. "s3://username-lakefs-cloud/"

## Versioning Information

In [None]:
productionBranch = "main"
ingestionBranch1 = "ingest1"
ingestionBranch2 = "ingest2"
transformationBranch = "transformation"
newPath = "partitioned_data"
fileName = "Employees.csv"

## Working with the lakeFS Python client API

In [None]:
%xmode Minimal
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

client = LakeFSClient(configuration)

## Set the lakeFS repository name (either an existing repository or a new one to be created)

In [None]:
repo = "iddo-repo-lineage-example"

## If the repository above already exists on your lakeFS server, skip the following step; otherwise, create it:

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=productionBranch))
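The skip-or-create decision above can also be made in code. The following is a minimal sketch, not part of the lakeFS client: `repository_exists` is a hypothetical helper, and it assumes that `client.repositories.get_repository` raises an exception exposing the HTTP status code as a `status` attribute (404 when the repository is missing), as OpenAPI-generated clients typically do.

```python
def repository_exists(client, name):
    """Return True if repository `name` exists, False if the server answers 404.

    Hypothetical helper; `client` is expected to expose
    `client.repositories.get_repository(repository=...)`.
    """
    try:
        client.repositories.get_repository(repository=name)
        return True
    except Exception as exc:
        # Assumption: API errors carry the HTTP status code in `.status`.
        if getattr(exc, "status", None) == 404:
            return False
        # Anything else (bad credentials, network failure) is re-raised.
        raise
```

With this in place, the creation cell could be guarded by `if not repository_exists(client, repo):` so that rerunning the notebook does not fail on an existing repository.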

## S3A Gateway configuration

##### Note: lakeFS can be configured to work with Spark in two ways:
###### * Access lakeFS using the S3A gateway https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway.
###### * Access lakeFS using the lakeFS-specific Hadoop FileSystem https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-lakefs-specific-hadoop-filesystem.

In [None]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", lakefsAccessKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", lakefsSecretKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", lakefsEndPoint)
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
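With the S3A gateway configured this way, Spark addresses lakeFS objects as `s3a://<repository>/<branch>/<path>`, which is the form the read and write cells later in this notebook use. A small sketch of that addressing scheme follows; `lakefs_s3a_uri` is a hypothetical helper name and `example-repo` is a placeholder, neither comes from lakeFS itself.

```python
def lakefs_s3a_uri(repo, ref, path=""):
    """Build an s3a:// URI for `path` on branch (or ref) `ref` of `repo`.

    The repository takes the place of the S3 bucket and the branch is the
    first component of the object key.
    """
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"


print(lakefs_s3a_uri("example-repo", "main", "Employees.csv"))
# s3a://example-repo/main/Employees.csv
```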

## Ingest data into the first ingestion branch

In [None]:
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=ingestionBranch1,
        source=productionBranch))

In [None]:
import os

# Only a single file per upload, and it must be passed as "content".
# The `with` block closes the file handle once the upload finishes.
with open(os.path.join(os.path.expanduser('~'), fileName), 'rb') as contentToUpload:
    client.objects.upload_object(
        repository=repo,
        branch=ingestionBranch1,
        path=fileName, content=contentToUpload)

## Commit changes to first ingest branch and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=ingestionBranch1,
    commit_creation=models.CommitCreation(
        message='Ingesting employees IDs',
        metadata={'using': 'python_api',
                  'codeVersion': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Employees.csv'}))
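Lineage queries are only as useful as the metadata attached to each commit, so it helps to build that dictionary the same way every time. The sketch below is one possible approach; `lineage_metadata` is a hypothetical helper, and the keys simply mirror the ones used in the commit above. It assumes commit metadata is a flat mapping of string keys to string values.

```python
def lineage_metadata(code_version, source=None, using="python_api"):
    """Return a commit-metadata dict with the keys used in this notebook.

    `source` is optional because not every commit ingests a file.
    """
    metadata = {"using": using, "codeVersion": code_version}
    if source is not None:
        metadata["source"] = source
    return metadata
```

The result can be passed as `metadata=lineage_metadata(codeVersionUrl, source=fileName)` when constructing `models.CommitCreation`, where `codeVersionUrl` stands for whatever link identifies the code that produced the data.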

## Ingest data into the second ingestion branch

In [None]:
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=ingestionBranch2,
        source=productionBranch))

In [None]:
fileName = "Salaries.csv"

import os

# Only a single file per upload, and it must be passed as "content".
# The `with` block closes the file handle once the upload finishes.
with open(os.path.join(os.path.expanduser('~'), fileName), 'rb') as contentToUpload:
    client.objects.upload_object(
        repository=repo,
        branch=ingestionBranch2,
        path=fileName, content=contentToUpload)

## Commit changes to second ingest branch with metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=ingestionBranch2,
    commit_creation=models.CommitCreation(
        message='Ingesting Salaries',
        metadata={'using': 'python_api',
                  'codeVersion': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Salaries.csv'}))

## Merge the lists in a transformation branch

In [None]:
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=transformationBranch,
        source=productionBranch))

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=ingestionBranch1, 
    destination_branch=transformationBranch)

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=ingestionBranch2, 
    destination_branch=transformationBranch)

In [None]:
employeeFile="Employees.csv"
SalariesFile="Salaries.csv"

In [None]:
dataPath = f"s3a://{repo}/{transformationBranch}/{employeeFile}"

df1 = spark.read.option("header", "true").csv(dataPath)
df1.show()


In [None]:
dataPath = f"s3a://{repo}/{transformationBranch}/{SalariesFile}"

df2 = spark.read.option("header", "true").csv(dataPath)
df2.show()

In [None]:
mergedDataset = df1.join(df2,["id"])
mergedDataset.show()
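Calling `join` with a list of column names and no `how` argument performs an inner join: only `id` values present in both files survive, and the `id` column appears once in the result. The pure-Python sketch below illustrates that behavior without Spark. The rows are made-up illustrative values, and the column names other than `id` and `department` are assumptions about the CSV files.

```python
def inner_join_on(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`, keeping a single key column."""
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Rows whose key has no match on the right are dropped.
        for right in right_by_key.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Illustrative rows only; not taken from Employees.csv or Salaries.csv.
employees = [
    {"id": "1", "name": "Ada", "department": "Engineering"},
    {"id": "2", "name": "Grace", "department": "Research"},
]
salaries = [
    {"id": "1", "salary": "100"},
    {"id": "3", "salary": "90"},
]
print(inner_join_on(employees, salaries, "id"))
```

Here only the row with `id` 1 is kept, because `id` 2 has no salary and `id` 3 has no employee record.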

## Partition by department

In [None]:
newDataPath = f"s3a://{repo}/{transformationBranch}/{newPath}"

mergedDataset.write.partitionBy("department").csv(newDataPath)
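`partitionBy` writes one directory per distinct value, named `column=value`, under the target path. Those directory names are what the lineage query at the end of this notebook filters on. A minimal sketch of how such a prefix is formed follows; `partition_prefix` is a hypothetical helper, and it ignores the escaping Spark applies to special characters in partition values.

```python
def partition_prefix(base_path, column, value):
    """Return the object-key prefix Spark uses for one partition value."""
    return f"{base_path.rstrip('/')}/{column}={value}/"


print(partition_prefix("partitioned_data", "department", "Engineering"))
# partitioned_data/department=Engineering/
```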

## Commit with metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=transformationBranch,
    commit_creation=models.CommitCreation(
        message='Repartitioned by departments',
        metadata={'using': 'python_api',
                  'codeVersion': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb'}))

## Atomically promote data to Production

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=transformationBranch, 
    destination_branch=productionBranch)

## Where did a dataset come from?

In [None]:
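# Latest commit on the production branch that changed anything under the
# Engineering partition; limit=True stops the scan after `amount` results.
commits = client.refs.log_commits(
    repository=repo,
    ref=productionBranch,
    amount=1,
    limit=True,
    prefixes=[f'{newPath}/department=Engineering/'])
print(commits)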

In [None]:
# Latest commit on the production branch that changed Employees.csv
commits = client.refs.log_commits(
    repository=repo,
    ref=productionBranch,
    amount=1,
    objects=['Employees.csv'])
print(commits)
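Printing the raw response works, but for lineage the interesting parts are the commit message and the metadata attached earlier. The sketch below pulls those out; `summarize_commits` is a hypothetical helper, and it assumes the response exposes a `results` list whose items have `id`, `message`, `committer`, and `metadata` attributes.

```python
def summarize_commits(commit_list):
    """Reduce a commit-log response to the fields relevant for lineage."""
    summaries = []
    # Assumption: the response object has a `results` attribute (a list).
    for commit in getattr(commit_list, "results", None) or []:
        summaries.append({
            "id": getattr(commit, "id", None),
            "message": getattr(commit, "message", None),
            "committer": getattr(commit, "committer", None),
            # Metadata may be absent on commits created without it.
            "metadata": dict(getattr(commit, "metadata", None) or {}),
        })
    return summaries
```

In the notebook this could be used as `for c in summarize_commits(commits): print(c["message"], c["metadata"])`, which shows at a glance which code version and source file produced the data.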


In [None]:
lakefs_client.__version__