<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Using [Lua Hooks](https://docs.lakefs.io/howto/hooks/lua.html) in lakeFS (similar to GitHub Actions)

This notebook demonstrates how to create a pre-merge hook in lakeFS that validates the schema before changes are merged into the production branch.

1. Define a hook configuration file and a Lua script for schema validation. 
2. Run an ETL process: create an ingestion branch, define the table schema, create the table, and atomically promote the data to the production branch through a merge.
3. Attempt to change the schema of the table and promote it to production again.
4. The pre-merge hook prevents the promotion because the schema changed, and the merge fails with a Precondition Failed error.


![Schema validation flow](./images/LuaHooks/schemaValidationFlow.png)

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "schema-validation-example-repo"

### Versioning Information

In [None]:
mainBranch = "main"
ingestionBranch = "ingestion_branch"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff
from pyspark.sql.types import ByteType, IntegerType, LongType, StringType, StructType, StructField

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Report the underlying error instead of swallowing it silently
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

# Main demo starts here 🚦 👇🏻

## Setup and Configure Hook

### Configure hooks in the repository

* Upload the [hooks config YAML file](./hooks/pre-merge-schema-validation.yaml), which triggers a check for schema changes before data is merged into the main branch
* The hooks config file must be uploaded under the `_lakefs_actions` prefix
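
For orientation, a lakeFS action file generally names the event that triggers it and the hooks to run. The sketch below holds an illustrative configuration as a Python string. The action name, hook `id`, and script path are assumptions made for this example and are not copied from the file linked above, and `action_object_path` is a hypothetical helper.

```python
# Illustrative only: the real configuration lives in
# ./hooks/pre-merge-schema-validation.yaml.
# The action name, hook id, and script path below are assumed values.
EXAMPLE_HOOK_CONFIG = """\
name: pre-merge schema validation
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: check_schema_changes
    type: lua
    properties:
      script_path: scripts/parquet_schema_change.lua
"""

ACTIONS_PREFIX = "_lakefs_actions"  # lakeFS reads action files from this prefix


def action_object_path(file_name: str) -> str:
    """Return the repository path under which an action file is stored."""
    return f"{ACTIONS_PREFIX}/{file_name}"


print(action_object_path("pre-merge-schema-validation.yaml"))
print(EXAMPLE_HOOK_CONFIG)
```

The `branches` entry limits the hook to merges whose destination is `main`, which is why the first merge of an unchanged schema and the later merge of a changed schema both pass through the same check.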

In [None]:
hooks_config_yaml = "pre-merge-schema-validation.yaml"
hooks_prefix = "_lakefs_actions"

with open(f'./hooks/{hooks_config_yaml}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{hooks_prefix}/{hooks_config_yaml}').upload(data=contentToUpload, mode='wb', pre_sign=False))

### Upload script

##### The script [parquet_schema_change.lua](./hooks/parquet_schema_change.lua) checks for any schema changes
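
The Lua script itself is not reproduced in this notebook. As a rough Python analogue of the kind of comparison such a check might perform, the sketch below diffs two column-to-type mappings. The function names and the dictionary representation of a schema are assumptions for illustration and do not mirror the Lua implementation.

```python
from typing import Dict, List


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List]:
    """Compare two {column_name: type_name} mappings and list the differences."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    # Columns present in both schemas whose type differs
    retyped = sorted(
        (name, old[name], new[name])
        for name in set(old) & set(new)
        if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def schema_changed(old: Dict[str, str], new: Dict[str, str]) -> bool:
    """Return True if any column was added, removed, or retyped."""
    return any(diff_schemas(old, new).values())


# Hypothetical two-column excerpt of a production and an incoming schema
production = {"User_ID": "int", "Country": "string"}
incoming = {"User_ID": "int", "Country_Name": "string"}

print(diff_schemas(production, incoming))
print(schema_changed(production, incoming))
```

Under this comparison a renamed column shows up as one removed and one added column, so a rename alone is enough to count as a schema change.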

In [None]:
lua_script_file_name = "parquet_schema_change.lua"
lua_scripts_path = "scripts"

with open(f'./hooks/{lua_script_file_name}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{lua_scripts_path}/{lua_script_file_name}').upload(data=contentToUpload, mode='wb', pre_sign=False))

### Commit changes to the lakeFS repo

In [None]:
ref = branchMain.commit(message='Added hooks config file and schema validation script')
print_commit(ref.get_commit())

# ETL Job Starts

## Create a new branch which will be used to ingest data

In [None]:
branchIngestion = repo.branch(ingestionBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{ingestionBranch} ref:", branchIngestion.get_commit().id)

## This demo uses the [Orion Star - Sports and outdoors RDBMS dataset](https://www.kaggle.com/datasets/chethanp11/orion-star-sports-and-outdoors-rdbms-dataset) from [Kaggle](https://www.kaggle.com/)

## Define [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) data file schema

In [None]:
customersSchema = StructType([
  StructField("User_ID", IntegerType(), False), 
  StructField("Country", StringType(), False),
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

## Create the Customers Delta table in the new branch (using the [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)
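
Spark reaches lakeFS through its S3-compatible endpoint, where the repository takes the place of the bucket and the branch is the first path component. A minimal sketch of that addressing scheme follows; `lakefs_s3a_uri` is a hypothetical helper introduced only for this illustration.

```python
def lakefs_s3a_uri(repository: str, branch: str, path: str) -> str:
    """Build an s3a:// URI of the form s3a://<repository>/<branch>/<path>."""
    return f"s3a://{repository}/{branch}/{path.lstrip('/')}"


print(lakefs_s3a_uri("schema-validation-example-repo", "ingestion_branch", "tables/customers"))
```

Because the branch is part of the path, the same table can be read from `main` or from the ingestion branch by changing only that component.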

In [None]:
customersTablePath = f"s3a://{repo.id}/{ingestionBranch}/tables/customers"
df = spark.read.csv('/data/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").save(customersTablePath)
df.show(10)

## Commit changes and attach some metadata

In [None]:
ref = branchIngestion.commit(message='Added customers Delta table', 
        metadata={'using': 'python_api'})
print_commit(ref.get_commit())

## Promote the Data into production

#### Merge the ingestion branch, which still has the current schema, into the production branch

In [None]:
res = branchIngestion.merge_into(branchMain)
print(res)

# On the next ETL Cycle - Change the schema and try to promote new data

## Rename the "Country" column to "Country_Name" in the schema

In [None]:
customersSchema = StructType([
  StructField("User_ID", IntegerType(), False),
  StructField("Country_Name", StringType(), False), # Column renamed from Country to Country_Name
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

## Overwrite the Customers Delta table in the ingestion branch with the changed schema (using the [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)

In [None]:
customersTablePath = f"s3a://{repo.id}/{ingestionBranch}/tables/customers"
df = spark.read.csv('/data/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(customersTablePath)
df.show(10)

## Commit changes and attach some metadata

In [None]:
ref = branchIngestion.commit(message='Added customers Delta tables with schema changes!', 
        metadata={'using': 'python_api'})
print_commit(ref.get_commit())

## Merge the ingestion branch into the main branch

The merge will fail because the pre-merge hook detects the schema change.

Note the error message: `(412) Reason: Precondition Failed`
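
In a pipeline it may be preferable to catch the rejected merge rather than let the exception stop the run. The sketch below is one hedged way to do that. It assumes the raised exception exposes the HTTP status as a `status_code` attribute, which is an assumption about the client and not something this notebook confirms. `attempt_merge` and `FakePreconditionFailed` are hypothetical names used only here.

```python
def attempt_merge(merge_fn):
    """Run merge_fn and report whether a pre-merge hook rejected the merge.

    Returns (True, result) on success, or (False, message) when the call
    fails with HTTP status 412. Any other error is re-raised.
    """
    try:
        return True, merge_fn()
    except Exception as exc:
        # Assumption: the client exception carries the HTTP status as `status_code`.
        if getattr(exc, "status_code", None) == 412:
            return False, f"Merge blocked by pre-merge hook: {exc}"
        raise


# Stand-in for a rejected merge, so the sketch runs without a lakeFS server.
class FakePreconditionFailed(Exception):
    status_code = 412


def failing_merge():
    raise FakePreconditionFailed("(412) Reason: Precondition Failed")


ok, detail = attempt_merge(failing_merge)
print(ok, detail)
```

In this notebook the callable could be `lambda: branchIngestion.merge_into(branchMain)`, provided the assumption about `status_code` holds for the installed client version.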

In [None]:
res = branchIngestion.merge_into(branchMain)
print(res)

## You can also review all Actions in the lakeFS UI

http://localhost:8000/repositories/schema-validation-example-repo/actions


![Actions UI](./images/LuaHooks/SchemaValidation.gif)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack