# [Integration of lakeFS with Delta Lake](https://docs.lakefs.io/integrations/delta.html)

## Use Cases:
### 1. Isolating ETL job and atomic promotion to production
### 2. Atomic rollback of Multi-Table Transactions

## Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use the Playground (https://demo.lakefs.io), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/installing.html).

## Setup Task: Change your lakeFS credentials

In [None]:
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://playground-name.lakefs-demo.io' or 'http://host.docker.internal:8000' (if lakeFS is running in local Docker container)
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

## Setup Task: You can change the lakeFS repo name (use an existing repo or provide a new repo name)

In [None]:
repo = "my-repo"

## Setup Task: Versioning Information

In [None]:
mainBranch = "main"
deltaLakeETLBranch = "delta-lake-etl-branch"
customersTable = "customers"
ordersTable = "orders"
orderUpdatesTable = "order_updates"

## Setup Task: Storage Information - Optional on Playground
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://<S3 Bucket Name>/'  # e.g. 's3://treeverse-demo-lakefs-storage-production/user_playground-name/my-repo' or 'local://my-bucket'

## Setup Task: Run additional [Setup](./deltaLake/deltaLakeSetup.ipynb) tasks here

In [None]:
%run ./deltaLake/deltaLakeSetup.ipynb

## Create Repository - Optional on Playground or if repository exists

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo, 
        storage_namespace=storageNamespace, 
        default_branch=mainBranch))

## For this demo, we'll use the [Orion Star - Sports and outdoors RDBMS dataset](https://www.kaggle.com/datasets/chethanp11/orion-star-sports-and-outdoors-rdbms-dataset) from [Kaggle](https://www.kaggle.com/).

## Create Customers delta table in the main branch (using [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)
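Every table path in this notebook follows the pattern `s3a://<repo>/<branch>/<table>`, so the same table on two branches differs only in the branch segment. As a minimal sketch, a small helper could build these paths; `lakefs_table_path` is a hypothetical name and is not part of lakeFS or the setup notebook.

```python
def lakefs_table_path(repo: str, branch: str, table: str) -> str:
    """Build the s3a path of a table on a lakeFS branch (hypothetical helper)."""
    return f"s3a://{repo}/{branch}/{table}"


# Same table on two branches: only the branch segment of the path changes.
print(lakefs_table_path("my-repo", "main", "customers"))
print(lakefs_table_path("my-repo", "delta-lake-etl-branch", "customers"))
```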

In [None]:
customersTablePath = f"s3a://{repo}/{mainBranch}/{customersTable}"
df = spark.read.csv('./data/samples/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").save(customersTablePath)
df.show(10)

## Create Orders delta table in the main branch (using [ORDER_FACT.csv](./data/samples/OrionStar/ORDER_FACT.csv) file)

In [None]:
ordersTablePath = f"s3a://{repo}/{mainBranch}/{ordersTable}"
df = spark.read.csv('./data/samples/OrionStar/ORDER_FACT.csv',header=True,schema=ordersSchema)
df.write.format("delta").mode("overwrite").save(ordersTablePath)
df.show(10)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=mainBranch,
    commit_creation=models.CommitCreation(
        message='Added customers and orders Delta tables!', 
        metadata={'using': 'python_api'}))

# ETL Job Starts

## Create a new branch

In [None]:
client.branches.create_branch(
    repository=repo, 
    branch_creation=models.BranchCreation(
        name=deltaLakeETLBranch, source=mainBranch))

## List the repository branches using the lakeFS Python client API

In [None]:
lakefs_demo.print_branches(
    client.branches.list_branches(
        repository=repo))

## Apply POS (Point of Sale) Transactions to Delta Lake: delete data for a customer on the new branch

In [None]:
from delta.tables import DeltaTable

ordersTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{ordersTable}"
deltaTable = DeltaTable.forPath(spark, ordersTablePath)
deltaTable.delete("Customer_ID = 19444")

In [None]:
customersTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{customersTable}"
deltaTable = DeltaTable.forPath(spark, customersTablePath)
deltaTable.delete("Customer_ID = 19444")

## Apply POS Transactions to Delta Lake: update data for a customer on the new branch

In [None]:
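from pyspark.sql.functions import expr  # needed for the update condition

customersTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{customersTable}"
deltaTable = DeltaTable.forPath(spark, customersTablePath)
# Values in `set` are SQL expressions, hence the nested single quotes around string literals.
deltaTable.update(
    condition=expr("Customer_ID = 63"),
    set={"Customer_FirstName": "'Jim'",
         "Customer_Name": "'Jim Klisurich'"})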

## Apply POS Transactions to Delta Lake: batch upsert (5 updated and 10 new orders in [ORDER_FACT_UPDATES.csv](./data/samples/OrionStar/ORDER_FACT_UPDATES.csv) file)
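In the merge that follows, the update and insert clauses both map every column to the same-named column of the `orderUpdates` alias. As a sketch, that mapping could be generated from a column list rather than written out twice; `order_columns` and `column_mapping` are illustrative names, and the list simply repeats the columns used in the merge clauses.

```python
# Columns of the orders table, in the order they appear in the merge clauses.
order_columns = [
    "Customer_ID", "Employee_ID", "Street_ID", "Order_Date",
    "Delivery_Date", "Order_ID", "Order_Type", "Product_ID",
    "Quantity", "Total_Retail_Price", "CostPrice_Per_Unit", "Discount",
]

# Map each target column to the same-named column of the source alias.
column_mapping = {col: f"orderUpdates.{col}" for col in order_columns}

print(column_mapping["Quantity"])
```

The resulting dictionary could then be passed as `set=column_mapping` to `whenMatchedUpdate` and as `values=column_mapping` to `whenNotMatchedInsert`.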

In [None]:
ordersTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{ordersTable}"
deltaTableOrders = DeltaTable.forPath(spark, ordersTablePath)

orderUpdatesTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{orderUpdatesTable}"
dfOrderUpdates = spark.read.csv('./data/samples/OrionStar/ORDER_FACT_UPDATES.csv',header=True,schema=ordersSchema)
dfOrderUpdates.write.format("delta").mode("overwrite").save(orderUpdatesTablePath)

deltaTableOrders.alias('orders') \
  .merge(
    dfOrderUpdates.alias('orderUpdates'),
    'orders.Order_ID = orderUpdates.Order_ID AND orders.Product_ID = orderUpdates.Product_ID'
  ) \
  .whenMatchedUpdate(set =
    {
      "Customer_ID": "orderUpdates.Customer_ID",
      "Employee_ID": "orderUpdates.Employee_ID",
      "Street_ID": "orderUpdates.Street_ID",
      "Order_Date": "orderUpdates.Order_Date",
      "Delivery_Date": "orderUpdates.Delivery_Date",
      "Order_ID": "orderUpdates.Order_ID",
      "Order_Type": "orderUpdates.Order_Type",
      "Product_ID": "orderUpdates.Product_ID",
      "Quantity": "orderUpdates.Quantity",
      "Total_Retail_Price": "orderUpdates.Total_Retail_Price",
      "CostPrice_Per_Unit": "orderUpdates.CostPrice_Per_Unit",
      "Discount": "orderUpdates.Discount"
    }
  ) \
  .whenNotMatchedInsert(values =
    {
      "Customer_ID": "orderUpdates.Customer_ID",
      "Employee_ID": "orderUpdates.Employee_ID",
      "Street_ID": "orderUpdates.Street_ID",
      "Order_Date": "orderUpdates.Order_Date",
      "Delivery_Date": "orderUpdates.Delivery_Date",
      "Order_ID": "orderUpdates.Order_ID",
      "Order_Type": "orderUpdates.Order_Type",
      "Product_ID": "orderUpdates.Product_ID",
      "Quantity": "orderUpdates.Quantity",
      "Total_Retail_Price": "orderUpdates.Total_Retail_Price",
      "CostPrice_Per_Unit": "orderUpdates.CostPrice_Per_Unit",
      "Discount": "orderUpdates.Discount"
    }
  ) \
  .execute()

## Data Validation: Compare Customers delta table in the main and new branch

In [None]:
customersTablePath = f"s3a://{repo}/{mainBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

In [None]:
customersTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

## Data Validation: Compare Customers count in the main and new branch

In [None]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(customersTable, refs)

## Data Validation: Compare Orders delta table in the main and new branch

In [None]:
ordersTablePath = f"s3a://{repo}/{mainBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

In [None]:
ordersTablePath = f"s3a://{repo}/{deltaLakeETLBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

## Data Validation: Compare Orders count in the main and new branch

In [None]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(ordersTable, refs)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=deltaLakeETLBranch,
    commit_creation=models.CommitCreation(
        message='Deleted and updated customers. Deleted and upserted orders.', 
        metadata={'using': 'python_api'}))

## Diff between the new branch and the source branch

In [None]:
lakefs_demo.print_diff_refs(
    client.refs.diff_refs(
        repository=repo,
        left_ref=mainBranch,
        right_ref=deltaLakeETLBranch))

# ETL Job Completes

## Delete the new branch if the ETL job fails, or merge it into the main branch if the ETL job succeeds
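The two outcomes could also be wrapped in one function so that an automated job picks the right call. This is only a sketch: `finish_etl` and its `succeeded` flag are hypothetical, while the two client calls are the ones used in the cells that follow.

```python
def finish_etl(client, repo, etl_branch, main_branch, succeeded):
    """Merge the ETL branch on success, delete it on failure (hypothetical helper)."""
    if succeeded:
        # Atomic promotion: every table changed on the branch reaches main in one merge.
        return client.refs.merge_into_branch(
            repository=repo,
            source_ref=etl_branch,
            destination_branch=main_branch)
    # Failure: discard the branch; the main branch was never modified.
    return client.branches.delete_branch(
        repository=repo,
        branch=etl_branch)
```

With the variables defined earlier, a call might look like `finish_etl(client, repo, deltaLakeETLBranch, mainBranch, succeeded=True)`.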

## Delete new branch if ETL job fails

In [None]:
client.branches.delete_branch(
    repository=repo,
    branch=deltaLakeETLBranch)

## Or merge new branch to the main branch if ETL job succeeds (atomic promotion to production)

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=deltaLakeETLBranch, 
    destination_branch=mainBranch)

## Data Validation: Read data from the main branch

In [None]:
customersTablePath = f"s3a://{repo}/{mainBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

In [None]:
ordersTablePath = f"s3a://{repo}/{mainBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

## Data Validation: Compare Customers count in the main and new branch

In [None]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(customersTable, refs)

## Data Validation: Compare Orders count in the main and new branch

In [None]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(ordersTable, refs)

## If you merged the new branch into the main branch, you can atomically roll back the multi-table transaction

### Go to the lakeFS UI and get the commit ID, or copy the 'reference' from the output of the previous merge statement
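Instead of copying the ID by hand, the value could be read from the merge result if that result was kept in a variable. This sketch assumes, as the heading above suggests, that the result exposes a `reference` field; `merge_commit_id` is a hypothetical helper that accepts either an object or a plain dict.

```python
def merge_commit_id(merge_result):
    """Return the 'reference' (commit ID) of a merge result (hypothetical helper)."""
    if isinstance(merge_result, dict):
        return merge_result["reference"]
    return merge_result.reference


# Stand-in for a real merge result, used only to show the call.
example_result = {"reference": "<lakeFS Commit Id>"}
print(merge_commit_id(example_result))
```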

In [None]:
commit_id = "<lakeFS Commit Id>"
client.branches.revert_branch(
    repository=repo,
    branch=mainBranch, 
    revert_creation=models.RevertCreation(
        ref=commit_id, parent_number=1))

## Data Validation: Read data again from the main branch

In [None]:
customersTablePath = f"s3a://{repo}/{mainBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

In [None]:
ordersTablePath = f"s3a://{repo}/{mainBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

## Data Validation: Compare Customers count in the main and new branch

In [None]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(customersTable, refs)

## Data Validation: Compare Orders count in the main and new branch

In [None]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(ordersTable, refs)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack