<img src="./images/logo.svg" alt="lakeFS logo" width="300"/>

# Integration of lakeFS with Delta Lake

[📚 Docs](https://docs.lakefs.io/integrations/delta.html)

## Use Cases:

1. Isolating an ETL job and atomically promoting its results to production
2. Atomic rollback of Multi-Table Transactions

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
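
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables and fall back to the sample values above. This is only a sketch: the helper `load_lakefs_config` and the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY_ID` and `LAKEFS_SECRET_ACCESS_KEY` are assumptions made for illustration, not names the notebook or lakeFS requires.

```python
import os

def load_lakefs_config(env=os.environ):
    """Return (endpoint, access_key, secret_key), falling back to the sample values.

    The environment variable names are hypothetical; rename them to suit your setup.
    """
    return (
        env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        env.get("LAKEFS_ACCESS_KEY_ID", "AKIAIOSFOLKFSSAMPLES"),
        env.get("LAKEFS_SECRET_ACCESS_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"),
    )

# Same variable names as the cell above, so later cells keep working unchanged.
lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_config()
```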

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section; just run it)**

In [3]:
repo_name = "delta-lake-demo"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

### Define lakeFS Repository

In [5]:
import os

from lakefs_client.exceptions import NotFoundException

# Reuse the repository if it already exists; otherwise create it.
try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        # Stop the kernel so that later cells do not run without a repository
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    # Stop the kernel so that later cells do not run without a repository
    os._exit(00)

Repository delta-lake-demo does not exist, so going to try and create it now.
Created new repo delta-lake-demo using storage namespace s3://example/delta-lake-demo


### Set up Spark

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

### Import some libraries

In [7]:
from pyspark.sql.types import ByteType, IntegerType, LongType, StringType, StructType, StructField
from pyspark.sql.functions import *

### Versioning Information

In [8]:
mainBranch = "main"
deltaLakeETLBranch = "delta-lake-etl-branch"
customersTable = "customers"
ordersTable = "orders"
orderUpdatesTable = "order_updates"

### Define a function to count records in a Delta table in different branches

In [9]:
def delta_table_compare_branches(table, refs):
  spark.createDataFrame(
    data=zip(
      refs,
      map(lambda r: spark.read.format('delta').load(f's3a://{repo.id}/{r}/{table}').count(), refs)
    ), 
    schema=StructType([ 
      StructField("Branch", StringType(), True),
      StructField("Count", IntegerType(), True)
    ])
  ).show(truncate=False)

### Define some helper functions

In [10]:
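def print_diff_refs(diff_refs):
    # Despite its name, this helper does not print anything: it returns a lazy
    # iterator of [path, path_type, size_bytes, type] rows for the caller to loop over.
    results = map(
        lambda n:[n.path,n.path_type,n.size_bytes,n.type],
        diff_refs.results)
    return results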
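
Because the helper returns a lazy `map` object, its rows only appear once something iterates over it. The sketch below exercises the same logic against a stand-in object, so it runs without a lakeFS server. `fake_diff` is a hypothetical placeholder built with `SimpleNamespace`; the real input comes from `lakefs.refs.diff_refs`.

```python
from types import SimpleNamespace

def print_diff_refs(diff_refs):
    # Same logic as the helper above: one [path, path_type, size_bytes, type] row per change.
    return map(lambda n: [n.path, n.path_type, n.size_bytes, n.type], diff_refs.results)

# Hypothetical stand-in for a diff response (only the attributes the helper reads).
fake_diff = SimpleNamespace(results=[
    SimpleNamespace(path="customers/_delta_log/00000000000000000001.json",
                    path_type="object", size_bytes=24358, type="added"),
])

# Materialise the iterator so the rows can be printed and reused.
rows = list(print_diff_refs(fake_diff))
for row in rows:
    print(row)
```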

### Define CUSTOMER.csv data file schema

In [11]:
customersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False),
  StructField("Country", StringType(), False),
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

### Define ORDER_FACT.csv data file schema

In [12]:
ordersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False),
  StructField("Employee_ID", IntegerType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Order_Date", StringType(), False),
  StructField("Delivery_Date", StringType(), False),
  StructField("Order_ID", LongType(), True),
  StructField("Order_Type", ByteType(), False),
  StructField("Product_ID", LongType(), False),
  StructField("Quantity", ByteType(), False),
  StructField("Total_Retail_Price", StringType(), False),
  StructField("CostPrice_Per_Unit", StringType(), False),
  StructField("Discount", LongType(), False)
])

---

# Main demo starts here 🚦 👇🏻

This demo uses the [Orion Star - Sports and outdoors RDBMS dataset](https://www.kaggle.com/datasets/chethanp11/orion-star-sports-and-outdoors-rdbms-dataset) from [Kaggle](https://www.kaggle.com/).

## Create Customers delta table in the main branch (using [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)

In [13]:
customersTablePath = f"s3a://{repo.id}/{mainBranch}/{customersTable}"
print(customersTablePath)

s3a://delta-lake-demo/main/customers
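
Every table path in this notebook follows the same pattern: the repository takes the place of the bucket, the branch (or any other ref) is the first path segment, and the table name follows. A small helper can make that pattern explicit. `table_path` is a hypothetical name introduced here for illustration; the notebook itself builds each path inline with an f-string.

```python
def table_path(repo_id, ref, table):
    """Build the s3a:// URI of a table on a given lakeFS ref (hypothetical helper)."""
    return f"s3a://{repo_id}/{ref}/{table}"

# The same table addressed on two different branches.
print(table_path("delta-lake-demo", "main", "customers"))
print(table_path("delta-lake-demo", "delta-lake-etl-branch", "customers"))
```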


In [14]:
df = spark.read.csv('/data/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").save(customersTablePath)
df.show(10)

+-----------+-------+------+-----------+-----------------+------------------+-----------------+----------+--------------------+----------+-------------+----------------+
|Customer_ID|Country|Gender|Personal_ID|    Customer_Name|Customer_FirstName|Customer_LastName|Birth_Date|    Customer_Address| Street_ID|Street_Number|Customer_Type_ID|
+-----------+-------+------+-----------+-----------------+------------------+-----------------+----------+--------------------+----------+-------------+----------------+
|          4|     US|     M|       null|    James Kvarniq|             James|          Kvarniq| 27JUN1974|      4382 Gralyn Rd|9260106519|         4382|            1020|
|          5|     US|     F|       null|Sandrina Stephano|          Sandrina|         Stephano| 09JUL1979|    6468 Cog Hill Ct|9260114570|         6468|            2020|
|          9|     DE|     F|       null|   Cornelia Krahl|          Cornelia|            Krahl| 27FEB1974|   Kallstadterstr. 9|3940106659|            

## Create Orders delta table in the main branch (using [ORDER_FACT.csv](./data/samples/OrionStar/ORDER_FACT.csv) file)

In [15]:
ordersTablePath = f"s3a://{repo.id}/{mainBranch}/{ordersTable}"
df = spark.read.csv('/data/OrionStar/ORDER_FACT.csv',header=True,schema=ordersSchema)
df.write.format("delta").mode("overwrite").save(ordersTablePath)
df.show(10)

+-----------+-----------+----------+----------+-------------+----------+----------+------------+--------+------------------+------------------+--------+
|Customer_ID|Employee_ID| Street_ID|Order_Date|Delivery_Date|  Order_ID|Order_Type|  Product_ID|Quantity|Total_Retail_Price|CostPrice_Per_Unit|Discount|
+-----------+-----------+----------+----------+-------------+----------+----------+------------+--------+------------------+------------------+--------+
|         63|     121039|9260125492| 11JAN2003|    11JAN2003|1230058123|         1|220101300017|       1|            $16.50|             $7.45|    null|
|          5|   99999999|9260114570| 15JAN2003|    19JAN2003|1230080101|         2|230100500026|       1|           $247.50|           $109.55|    null|
|         45|   99999999|9260104847| 20JAN2003|    22JAN2003|1230106883|         2|240600100080|       1|            $28.30|             $8.55|    null|
|         41|     120174|1600101527| 28JAN2003|    28JAN2003|1230147441|         1

## Commit changes and attach some metadata

In [16]:
lakefs.commits.commit(
    repository=repo.id,
    branch=mainBranch,
    commit_creation=CommitCreation(
        message='Added customers and orders Delta tables!', 
        metadata={'using': 'python_api'}))

{'committer': 'everything-bagel',
 'creation_date': 1689579974,
 'id': '504a707d66bcfafae6ae8af25b68e2dba46b60a23f1b60e7de03a76b65c854cc',
 'message': 'Added customers and orders Delta tables!',
 'meta_range_id': '',
 'metadata': {'using': 'python_api'},
 'parents': ['bc4c25af7ebcdfa29bf8e5200c6c689360aca3608acbbbf3d112d97cb856dca0']}

# 🟢 ETL Job Starts

## Create a new branch

In [17]:
lakefs.branches.create_branch(
    repository=repo.id, 
    branch_creation=BranchCreation(
        name=deltaLakeETLBranch, source=mainBranch))

'504a707d66bcfafae6ae8af25b68e2dba46b60a23f1b60e7de03a76b65c854cc'

## List the repository branches using the lakeFS Python client API

In [18]:
lakefs.branches.list_branches(repository=repo.id).results

[{'commit_id': '504a707d66bcfafae6ae8af25b68e2dba46b60a23f1b60e7de03a76b65c854cc',
  'id': 'delta-lake-etl-branch'},
 {'commit_id': '504a707d66bcfafae6ae8af25b68e2dba46b60a23f1b60e7de03a76b65c854cc',
  'id': 'main'}]

## Apply POS (Point of Sale) Transactions to Delta Lake: delete data for a customer on the new branch

In [19]:
from delta.tables import *

ordersTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{ordersTable}"
deltaTable = DeltaTable.forPath(spark, ordersTablePath)
deltaTable.delete("Customer_ID = 19444")

In [20]:
customersTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{customersTable}"
deltaTable = DeltaTable.forPath(spark, customersTablePath)
deltaTable.delete("Customer_ID = 19444")

## Apply POS Transactions to Delta Lake: update data for a customer on the new branch

In [21]:
customersTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{customersTable}"
deltaTable = DeltaTable.forPath(spark, customersTablePath)
deltaTable.update(
  condition = expr("Customer_ID = 63"),
  set = { "Customer_FirstName": "'Jim'",
          "Customer_Name": "'Jim Klisurich'"})

## Apply POS Transactions to Delta Lake: batch upsert (5 updated and 10 new orders in [ORDER_FACT_UPDATES.csv](/data/samples/OrionStar/ORDER_FACT_UPDATES.csv) file)

In [22]:
ordersTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{ordersTable}"
deltaTableOrders = DeltaTable.forPath(spark, ordersTablePath)

orderUpdatesTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{orderUpdatesTable}"
dfOrderUpdates = spark.read.csv('/data/OrionStar/ORDER_FACT_UPDATES.csv',header=True,schema=ordersSchema)
dfOrderUpdates.write.format("delta").mode("overwrite").save(orderUpdatesTablePath)

deltaTableOrders.alias('orders') \
  .merge(
    dfOrderUpdates.alias('orderUpdates'),
    'orders.Order_ID = orderUpdates.Order_ID AND orders.Product_ID = orderUpdates.Product_ID'
  ) \
  .whenMatchedUpdate(set =
    {
      "Customer_ID": "orderUpdates.Customer_ID",
      "Employee_ID": "orderUpdates.Employee_ID",
      "Street_ID": "orderUpdates.Street_ID",
      "Order_Date": "orderUpdates.Order_Date",
      "Delivery_Date": "orderUpdates.Delivery_Date",
      "Order_ID": "orderUpdates.Order_ID",
      "Order_Type": "orderUpdates.Order_Type",
      "Product_ID": "orderUpdates.Product_ID",
      "Quantity": "orderUpdates.Quantity",
      "Total_Retail_Price": "orderUpdates.Total_Retail_Price",
      "CostPrice_Per_Unit": "orderUpdates.CostPrice_Per_Unit",
      "Discount": "orderUpdates.Discount"
    }
  ) \
  .whenNotMatchedInsert(values =
    {
      "Customer_ID": "orderUpdates.Customer_ID",
      "Employee_ID": "orderUpdates.Employee_ID",
      "Street_ID": "orderUpdates.Street_ID",
      "Order_Date": "orderUpdates.Order_Date",
      "Delivery_Date": "orderUpdates.Delivery_Date",
      "Order_ID": "orderUpdates.Order_ID",
      "Order_Type": "orderUpdates.Order_Type",
      "Product_ID": "orderUpdates.Product_ID",
      "Quantity": "orderUpdates.Quantity",
      "Total_Retail_Price": "orderUpdates.Total_Retail_Price",
      "CostPrice_Per_Unit": "orderUpdates.CostPrice_Per_Unit",
      "Discount": "orderUpdates.Discount"
    }
  ) \
  .execute()
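
The two column maps in the merge above are identical and list every column by hand. As a sketch of an alternative, the map can be generated from the column names. `column_map` and `order_columns` are hypothetical names introduced for this example; the column list is copied from `ordersSchema`.

```python
# Column names copied from ordersSchema, in the same order.
order_columns = [
    "Customer_ID", "Employee_ID", "Street_ID", "Order_Date", "Delivery_Date", "Order_ID",
    "Order_Type", "Product_ID", "Quantity", "Total_Retail_Price", "CostPrice_Per_Unit", "Discount",
]

def column_map(columns, source_alias):
    """Map each target column to the same-named column of the aliased source."""
    return {c: f"{source_alias}.{c}" for c in columns}

order_update_map = column_map(order_columns, "orderUpdates")
print(order_update_map["Customer_ID"])
```

In the notebook, the list could presumably come from `ordersSchema.fieldNames()`, and the resulting dictionary could be passed as both `set=` and `values=`. When every column is copied unchanged, the Delta Lake merge builder's `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()` may be a shorter option.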

## Data Validation: Compare Customers delta table in the main and new branch

In [23]:
customersTablePath = f"s3a://{repo.id}/{mainBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

+-----------+-------+------+-----------+---------------+------------------+-----------------+----------+-------------------+----------+-------------+----------------+
|Customer_ID|Country|Gender|Personal_ID|  Customer_Name|Customer_FirstName|Customer_LastName|Birth_Date|   Customer_Address| Street_ID|Street_Number|Customer_Type_ID|
+-----------+-------+------+-----------+---------------+------------------+-----------------+----------+-------------------+----------+-------------+----------------+
|         63|     US|     M|       null|James Klisurich|             James|        Klisurich| 25DEC1969|  25 Briarforest Pl|9260125492|           25|            2020|
|      19444|     IL|     M|       null|  Avinoam Zweig|           Avinoam|            Zweig| 28SEP1959|Mivtza Kadesh St 61|4750100001|           61|            1040|
+-----------+-------+------+-----------+---------------+------------------+-----------------+----------+-------------------+----------+-------------+----------------

In [24]:
customersTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

+-----------+-------+------+-----------+-------------+------------------+-----------------+----------+-----------------+----------+-------------+----------------+
|Customer_ID|Country|Gender|Personal_ID|Customer_Name|Customer_FirstName|Customer_LastName|Birth_Date| Customer_Address| Street_ID|Street_Number|Customer_Type_ID|
+-----------+-------+------+-----------+-------------+------------------+-----------------+----------+-----------------+----------+-------------+----------------+
|         63|     US|     M|       null|Jim Klisurich|               Jim|        Klisurich| 25DEC1969|25 Briarforest Pl|9260125492|           25|            2020|
+-----------+-------+------+-----------+-------------+------------------+-----------------+----------+-----------------+----------+-------------+----------------+



## Data Validation: Compare Customers count in the main and new branch

In [25]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(customersTable, refs)

+---------------------+-----+
|Branch               |Count|
+---------------------+-----+
|main                 |77   |
|delta-lake-etl-branch|76   |
+---------------------+-----+



## Data Validation: Compare Orders delta table in the main and new branch

In [26]:
ordersTablePath = f"s3a://{repo.id}/{mainBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

+-----------+-----------+----------+----------+-------------+----------+----------+------------+--------+------------------+------------------+--------+
|Customer_ID|Employee_ID| Street_ID|Order_Date|Delivery_Date|  Order_ID|Order_Type|  Product_ID|Quantity|Total_Retail_Price|CostPrice_Per_Unit|Discount|
+-----------+-----------+----------+----------+-------------+----------+----------+------------+--------+------------------+------------------+--------+
|      19444|   99999999|4750100001| 16MAY2003|    21MAY2003|1230744524|         2|220100700024|       1|            $99.70|            $47.45|    null|
|      19444|   99999999|4750100001| 16MAY2003|    21MAY2003|1230744524|         2|220101000002|       1|            $17.70|             $8.00|    null|
|      19444|   99999999|4750100001| 01OCT2003|    06OCT2003|1231500373|         2|220101200006|       1|            $52.20|            $20.95|    null|
|      19444|   99999999|4750100001| 06JUL2004|    11JUL2004|1233248920|         2

In [27]:
ordersTablePath = f"s3a://{repo.id}/{deltaLakeETLBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

+-----------+-----------+---------+----------+-------------+--------+----------+----------+--------+------------------+------------------+--------+
|Customer_ID|Employee_ID|Street_ID|Order_Date|Delivery_Date|Order_ID|Order_Type|Product_ID|Quantity|Total_Retail_Price|CostPrice_Per_Unit|Discount|
+-----------+-----------+---------+----------+-------------+--------+----------+----------+--------+------------------+------------------+--------+
+-----------+-----------+---------+----------+-------------+--------+----------+----------+--------+------------------+------------------+--------+



## Data Validation: Compare Orders count in the main and new branch

In [28]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(ordersTable, refs)

+---------------------+-----+
|Branch               |Count|
+---------------------+-----+
|main                 |607  |
|delta-lake-etl-branch|612  |
+---------------------+-----+



## Commit changes and attach some metadata

In [29]:
lakefs.commits.commit(
    repository=repo.id,
    branch=deltaLakeETLBranch,
    commit_creation=CommitCreation(
        message='Deleted and updated customers. Deleted and upserted orders.', 
        metadata={'using': 'python_api'}))

{'committer': 'everything-bagel',
 'creation_date': 1689580041,
 'id': '894fc48f55fd5cb133b15f37638434d4ee49a6898fe39cbcee2717b7c1be93f2',
 'message': 'Deleted and updated customers. Deleted and upserted orders.',
 'meta_range_id': '',
 'metadata': {'using': 'python_api'},
 'parents': ['504a707d66bcfafae6ae8af25b68e2dba46b60a23f1b60e7de03a76b65c854cc']}

## Diff between the new branch and the source branch

In [30]:
for a in print_diff_refs(
    lakefs.refs.diff_refs(
        repository=repo.id,
        left_ref=mainBranch,
        right_ref=deltaLakeETLBranch)):
    print(a)

['customers/_delta_log/00000000000000000001.json', 'object', 24358, 'added']
['customers/_delta_log/00000000000000000002.json', 'object', 24358, 'added']
['customers/part-00000-01fd87f0-e5c6-492f-9de1-1f9798fdfe92-c000.snappy.parquet', 'object', 24358, 'added']
['customers/part-00000-6936124f-8ea7-4945-a7d5-aab365610f15-c000.snappy.parquet', 'object', 24358, 'added']
['order_updates/_delta_log/', 'object', 24358, 'added']
['order_updates/_delta_log/00000000000000000000.json', 'object', 24358, 'added']
['order_updates/part-00000-6fd48561-233e-4768-a717-3d0e4d2b4bf1-c000.snappy.parquet', 'object', 24358, 'added']
['orders/_delta_log/00000000000000000001.json', 'object', 24358, 'added']
['orders/_delta_log/00000000000000000002.json', 'object', 24358, 'added']
['orders/part-00000-01ece2fb-9740-4cee-9850-ea7cb52a6919-c000.snappy.parquet', 'object', 24358, 'added']
['orders/part-00000-7953994d-f41f-4a4d-9c07-be047cfa0cab-c000.snappy.parquet', 'object', 24358, 'added']
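
To see at a glance which tables the ETL branch touched, the diff paths can be grouped by their first path segment. The sketch below does this for the paths printed above. It works on a plain list of strings, so it makes no assumption about the client beyond the `path` values already shown.

```python
from collections import Counter

# Paths copied from the diff output above.
paths = [
    "customers/_delta_log/00000000000000000001.json",
    "customers/_delta_log/00000000000000000002.json",
    "customers/part-00000-01fd87f0-e5c6-492f-9de1-1f9798fdfe92-c000.snappy.parquet",
    "customers/part-00000-6936124f-8ea7-4945-a7d5-aab365610f15-c000.snappy.parquet",
    "order_updates/_delta_log/",
    "order_updates/_delta_log/00000000000000000000.json",
    "order_updates/part-00000-6fd48561-233e-4768-a717-3d0e4d2b4bf1-c000.snappy.parquet",
    "orders/_delta_log/00000000000000000001.json",
    "orders/_delta_log/00000000000000000002.json",
    "orders/part-00000-01ece2fb-9740-4cee-9850-ea7cb52a6919-c000.snappy.parquet",
    "orders/part-00000-7953994d-f41f-4a4d-9c07-be047cfa0cab-c000.snappy.parquet",
]

# The first path segment is the table directory.
changes_per_table = Counter(p.split("/", 1)[0] for p in paths)
for table, count in sorted(changes_per_table.items()):
    print(table, count)
```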


# ETL Job Completes

## Delete the new branch if the ETL job fails, or merge it into the main branch if the job succeeds

## Delete new branch if ETL job fails

In [31]:
#lakefs.branches.delete_branch(
#    repository=repo.id,
#    branch=deltaLakeETLBranch)

## Or merge new branch to the main branch if ETL job succeeds (atomic promotion to production)

In [32]:
merge_result=lakefs.refs.merge_into_branch(
            repository=repo.id,
            source_ref=deltaLakeETLBranch, 
            destination_branch=mainBranch)
commit_id=merge_result.reference
print(merge_result)

{'reference': '457399b7ebd81f9ce2bcc2754fc8240eb9e7e4221667ebd9a6a8c01ace0235f4'}
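
The delete-or-merge decision can also be expressed as a single piece of control flow rather than two cells run by hand. The following is a minimal sketch, not the notebook's own method: `run_isolated`, `job`, `promote` and `discard` are hypothetical names, and the stub callables stand in for the ETL cells, `lakefs.refs.merge_into_branch` and `lakefs.branches.delete_branch`.

```python
def run_isolated(job, promote, discard):
    """Run job(); call promote() if it succeeds, or discard() if it raises."""
    try:
        job()
    except Exception as exc:
        # The job failed, so drop the branch and leave production untouched.
        discard()
        return f"discarded: {exc}"
    # The job succeeded, so promote the branch in one step.
    promote()
    return "promoted"

# Stub callables that only record which step ran.
events = []
outcome = run_isolated(
    job=lambda: events.append("etl"),
    promote=lambda: events.append("merge"),
    discard=lambda: events.append("delete"),
)
print(outcome, events)
```

In the notebook, `promote` would wrap the `merge_into_branch` call shown above and `discard` would wrap the commented-out `delete_branch` call.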


## Data Validation: Read data from the main branch

In [33]:
customersTablePath = f"s3a://{repo.id}/{mainBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

+-----------+-------+------+-----------+-------------+------------------+-----------------+----------+-----------------+----------+-------------+----------------+
|Customer_ID|Country|Gender|Personal_ID|Customer_Name|Customer_FirstName|Customer_LastName|Birth_Date| Customer_Address| Street_ID|Street_Number|Customer_Type_ID|
+-----------+-------+------+-----------+-------------+------------------+-----------------+----------+-----------------+----------+-------------+----------------+
|         63|     US|     M|       null|Jim Klisurich|               Jim|        Klisurich| 25DEC1969|25 Briarforest Pl|9260125492|           25|            2020|
+-----------+-------+------+-----------+-------------+------------------+-----------------+----------+-----------------+----------+-------------+----------------+



In [34]:
ordersTablePath = f"s3a://{repo.id}/{mainBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

+-----------+-----------+---------+----------+-------------+--------+----------+----------+--------+------------------+------------------+--------+
|Customer_ID|Employee_ID|Street_ID|Order_Date|Delivery_Date|Order_ID|Order_Type|Product_ID|Quantity|Total_Retail_Price|CostPrice_Per_Unit|Discount|
+-----------+-----------+---------+----------+-------------+--------+----------+----------+--------+------------------+------------------+--------+
+-----------+-----------+---------+----------+-------------+--------+----------+----------+--------+------------------+------------------+--------+



## Data Validation: Compare Customers count in the main and new branch

In [35]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(customersTable, refs)

+---------------------+-----+
|Branch               |Count|
+---------------------+-----+
|main                 |76   |
|delta-lake-etl-branch|76   |
+---------------------+-----+



## Data Validation: Compare Orders count in the main and new branch

In [36]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(ordersTable, refs)

+---------------------+-----+
|Branch               |Count|
+---------------------+-----+
|main                 |612  |
|delta-lake-etl-branch|612  |
+---------------------+-----+



## If you merged the new branch into the main branch, you can atomically roll back Multi-Table Transactions

In [37]:
lakefs.branches.revert_branch(
    repository=repo.id,
    branch=mainBranch, 
    revert_creation=RevertCreation(
        ref=commit_id, parent_number=1))

## Data Validation: Read data again from the main branch

In [38]:
customersTablePath = f"s3a://{repo.id}/{mainBranch}/{customersTable}"
spark.read.format("delta").load(customersTablePath).where("Customer_ID = 19444 OR Customer_ID = 63").show()

+-----------+-------+------+-----------+---------------+------------------+-----------------+----------+-------------------+----------+-------------+----------------+
|Customer_ID|Country|Gender|Personal_ID|  Customer_Name|Customer_FirstName|Customer_LastName|Birth_Date|   Customer_Address| Street_ID|Street_Number|Customer_Type_ID|
+-----------+-------+------+-----------+---------------+------------------+-----------------+----------+-------------------+----------+-------------+----------------+
|         63|     US|     M|       null|James Klisurich|             James|        Klisurich| 25DEC1969|  25 Briarforest Pl|9260125492|           25|            2020|
|      19444|     IL|     M|       null|  Avinoam Zweig|           Avinoam|            Zweig| 28SEP1959|Mivtza Kadesh St 61|4750100001|           61|            1040|
+-----------+-------+------+-----------+---------------+------------------+-----------------+----------+-------------------+----------+-------------+----------------

In [39]:
ordersTablePath = f"s3a://{repo.id}/{mainBranch}/{ordersTable}"
spark.read.format("delta").load(ordersTablePath).where("Customer_ID = 19444").show()

+-----------+-----------+----------+----------+-------------+----------+----------+------------+--------+------------------+------------------+--------+
|Customer_ID|Employee_ID| Street_ID|Order_Date|Delivery_Date|  Order_ID|Order_Type|  Product_ID|Quantity|Total_Retail_Price|CostPrice_Per_Unit|Discount|
+-----------+-----------+----------+----------+-------------+----------+----------+------------+--------+------------------+------------------+--------+
|      19444|   99999999|4750100001| 16MAY2003|    21MAY2003|1230744524|         2|220100700024|       1|            $99.70|            $47.45|    null|
|      19444|   99999999|4750100001| 16MAY2003|    21MAY2003|1230744524|         2|220101000002|       1|            $17.70|             $8.00|    null|
|      19444|   99999999|4750100001| 01OCT2003|    06OCT2003|1231500373|         2|220101200006|       1|            $52.20|            $20.95|    null|
|      19444|   99999999|4750100001| 06JUL2004|    11JUL2004|1233248920|         2

## Data Validation: Compare Customers count in the main and new branch

In [40]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(customersTable, refs)

+---------------------+-----+
|Branch               |Count|
+---------------------+-----+
|main                 |77   |
|delta-lake-etl-branch|76   |
+---------------------+-----+



## Data Validation: Compare Orders count in the main and new branch

In [41]:
refs = [mainBranch, deltaLakeETLBranch]

delta_table_compare_branches(ordersTable, refs)

+---------------------+-----+
|Branch               |Count|
+---------------------+-----+
|main                 |607  |
|delta-lake-etl-branch|612  |
+---------------------+-----+



## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack