# [Using Hooks or Git-like Actions](https://docs.lakefs.io/hooks/)

## Use Cases:
### 1. Don't allow PII data
### 2. Don't allow schema changes

## Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use the Playground (https://demo.lakefs.io), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/installing.html).

## Setup Task: Change your lakeFS credentials

In [None]:
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://playground-name.lakefs-demo.io' or 'http://host.docker.internal:8000' (if lakeFS is running in local Docker container)
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
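
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The following is a minimal sketch; the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, and `LAKEFS_SECRET_KEY` and the helper `load_lakefs_credentials` are hypothetical and not defined by lakeFS or by this notebook.

```python
import os


def load_lakefs_credentials(env=os.environ):
    """Return (endpoint, access_key, secret_key) read from environment variables.

    The variable names are hypothetical; substitute the names your environment defines.
    """
    names = ("LAKEFS_ENDPOINT", "LAKEFS_ACCESS_KEY", "LAKEFS_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in names)


# Usage (commented out so the sketch does not fail when the variables are unset):
# lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_credentials()
```

Failing early with the list of missing variables is intended to surface configuration problems before the client is created.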

## Setup Task: You can change the lakeFS repo name (use an existing repo or provide a new repo name)

In [None]:
repo = "my-repo"

## Setup Task: Versioning Information

In [None]:
mainBranch = "main"
schemaValidationBranch1stAttempt = "schema_validation_branch_1st_attempt"
schemaValidationBranch2ndAttempt = "schema_validation_branch_2nd_attempt"
schemaChangeBranch = "schema_change_branch"

## Setup Task: Storage Information - Optional on Playground
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://<S3 Bucket Name>/' # e.g. "s3://username-lakefs-cloud/"

## Setup Task: Import Python packages

In [None]:
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
import lakefs_demo

import os
from pyspark.sql.types import ByteType, IntegerType, LongType, StringType, StructType, StructField

## Setup Task: Working with the lakeFS Python client API

###### Note: To learn more about lakeFS Python integration visit https://docs.lakefs.io/integrations/python.html

In [None]:
%xmode Minimal
if 'client' not in locals():
    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = lakefsAccessKey
    configuration.password = lakefsSecretKey
    configuration.host = lakefsEndPoint

    client = LakeFSClient(configuration)
    print("Created lakeFS client.")

## Setup Task: Run PySpark with the Delta Lake package and additional configurations

In [None]:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.delta:delta-core_2.12:2.0.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" pyspark-shell'
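
The submit-arguments string above packs three things into one line: the Delta Lake package, two Spark configuration entries, and the trailing `pyspark-shell` token. As a sketch of how the same string could be assembled from its parts (the helper `build_submit_args` and the variable names are hypothetical), consider:

```python
def build_submit_args(package, conf):
    """Assemble a PYSPARK_SUBMIT_ARGS string from a package coordinate and conf entries."""
    parts = ["--packages", package]
    for key, value in conf.items():
        # Each entry is quoted the same way as in the hand-written string.
        parts += ["--conf", f'"{key}={value}"']
    parts.append("pyspark-shell")
    return " ".join(parts)


delta_package = "io.delta:delta-core_2.12:2.0.0"
spark_conf = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

print(build_submit_args(delta_package, spark_conf))
```

Keeping the entries in a dictionary makes it easier to see which setting enables the Delta SQL extension and which one registers the Delta catalog.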

## Setup Task: S3A Gateway configuration

##### Note: lakeFS can be configured to work with Spark in two ways (this notebook uses the first):
###### * Access lakeFS using the S3A gateway https://docs.lakefs.io/integrations/spark.html#use-the-s3-gateway.
###### * Access lakeFS using the lakeFS-specific Hadoop FileSystem https://docs.lakefs.io/integrations/spark.html#use-the-lakefs-hadoop-filesystem.

In [None]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", lakefsAccessKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", lakefsSecretKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", lakefsEndPoint)
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
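
With this configuration, Spark addresses lakeFS objects through `s3a://` URIs in which the repository takes the place of the bucket and the branch is the first path component. A small sketch of that convention follows; the helper name `lakefs_s3a_path` is hypothetical.

```python
def lakefs_s3a_path(repo, branch, path):
    """Build a URI of the form s3a://<repo>/<branch>/<path>."""
    return f"s3a://{repo}/{branch}/{path.lstrip('/')}"


print(lakefs_s3a_path("my-repo", "main", "tables/customers"))
# s3a://my-repo/main/tables/customers
```

The table paths used later in this notebook follow the same pattern, with the branch name switched per attempt.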

## Setup Task: Create lakeFS Repository - Optional on Playground or if repository exists

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=mainBranch))

## Setup Task: Upload the [Hooks config YAML file](./LuaHooks/pre-merge-schema-validation.yaml), which checks for blocked PII columns and for schema changes before data is merged into the main branch

### The hooks config file must be uploaded under the "_lakefs_actions" prefix

In [None]:
hooks_config_yaml = "pre-merge-schema-validation.yaml"
hooks_prefix = "_lakefs_actions"

In [None]:
with open(f'./LuaHooks/{hooks_config_yaml}', 'rb') as f:
    client.objects.upload_object(repository=repo, 
                                 branch=mainBranch, 
                                 path=f'{hooks_prefix}/{hooks_config_yaml}', 
                                 content=f
                                )
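
The contents of the uploaded file are not shown in this notebook. As a rough sketch of the shape a lakeFS actions file with Lua hooks usually takes, the string below holds an illustrative example. Only the action name, the script paths, and the `user_id` column come from this notebook; the hook ids, the `column_block_list` argument name, and the variable names are assumptions, so refer to the real file in `./LuaHooks` for the actual settings.

```python
# Illustrative only: the real file is ./LuaHooks/pre-merge-schema-validation.yaml.
# Hook ids and argument names below are assumptions, not copied from that file.
example_hooks_yaml = """\
name: pre merge checks on main branch
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: check_blocked_pii_columns
    type: lua
    properties:
      script_path: scripts/parquet_schema_validator.lua
      args:
        column_block_list: ["user_id"]
  - id: check_schema_changes
    type: lua
    properties:
      script_path: scripts/parquet_schema_change.lua
"""

# The destination path combines the required prefix with the file name.
example_destination = "_lakefs_actions/pre-merge-schema-validation.yaml"
print(example_destination)
```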

## Setup Task: Upload [Schema Validator script](./LuaHooks/parquet_schema_validator.lua) to check for any blocked PII columns

In [None]:
lua_script_file_name = "parquet_schema_validator.lua"
lua_scripts_path = "scripts"

In [None]:
with open(f'./LuaHooks/{lua_script_file_name}', 'rb') as f:
    client.objects.upload_object(repository=repo, 
                                 branch=mainBranch, 
                                 path=f'{lua_scripts_path}/{lua_script_file_name}', 
                                 content=f
                                )

## Setup Task: Upload [Schema Change script](./LuaHooks/parquet_schema_change.lua) to check for any schema changes

In [None]:
lua_script_file_name = "parquet_schema_change.lua"

In [None]:
with open(f'./LuaHooks/{lua_script_file_name}', 'rb') as f:
    client.objects.upload_object(repository=repo, 
                                 branch=mainBranch, 
                                 path=f'{lua_scripts_path}/{lua_script_file_name}', 
                                 content=f
                                )
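
The three uploads above differ only in file name and destination prefix, so they could be folded into one loop. The sketch below assumes a hypothetical helper, `upload_files`, that takes the client as a parameter and makes the same `client.objects.upload_object` call used in the cells above.

```python
import os


def upload_files(client, repo, branch, local_dir, uploads):
    """Upload each (file_name, destination_prefix) pair; return the lakeFS paths written."""
    uploaded = []
    for file_name, prefix in uploads:
        destination = f"{prefix}/{file_name}"
        with open(os.path.join(local_dir, file_name), "rb") as f:
            client.objects.upload_object(
                repository=repo, branch=branch, path=destination, content=f)
        uploaded.append(destination)
    return uploaded


# Usage in this notebook (commented out so the sketch has no side effects):
# upload_files(client, repo, mainBranch, "./LuaHooks", [
#     ("pre-merge-schema-validation.yaml", "_lakefs_actions"),
#     ("parquet_schema_validator.lua", "scripts"),
#     ("parquet_schema_change.lua", "scripts"),
# ])
```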

## Setup Task: Commit changes

In [None]:
client.commits.commit(
    repository=repo,
    branch=mainBranch,
    commit_creation=models.CommitCreation(
        message='Added hooks config file and schema validation scripts'))

# ETL Job Starts

## Create a new branch which will be used to ingest data

In [None]:
client.branches.create_branch(
    repository=repo, 
    branch_creation=models.BranchCreation(
        name=schemaValidationBranch1stAttempt, source=mainBranch))

## This demo uses the [Orion Star - Sports and outdoors RDBMS dataset](https://www.kaggle.com/datasets/chethanp11/orion-star-sports-and-outdoors-rdbms-dataset) from [Kaggle](https://www.kaggle.com/).

## Define the [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) data file schema

#### Notice that the first column, "user_id", is a blocked PII column and is not allowed

In [None]:
customersSchema = StructType([
  StructField("user_id", IntegerType(), False), # "user_id" is a blocked PII column and is not allowed.
  StructField("Country", StringType(), False),
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

## Define the [ORDER_FACT.csv](./data/samples/OrionStar/ORDER_FACT.csv) data file schema

#### Notice that the first column, "user_id", is a blocked PII column and is not allowed

In [None]:
ordersSchema = StructType([
  StructField("user_id", IntegerType(), False), # "user_id" is a blocked PII column and is not allowed.
  StructField("Employee_ID", IntegerType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Order_Date", StringType(), False),
  StructField("Delivery_Date", StringType(), False),
  StructField("Order_ID", LongType(), True),
  StructField("Order_Type", ByteType(), False),
  StructField("Product_ID", LongType(), False),
  StructField("Quantity", ByteType(), False),
  StructField("Total_Retail_Price", StringType(), False),
  StructField("CostPrice_Per_Unit", StringType(), False),
  StructField("Discount", LongType(), False)
])
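
The pre-merge hook performs the blocked-column check in Lua on the server side. To illustrate the idea only, here is a plain-Python analogue that compares column names against a block list. The helper `find_blocked_columns` is hypothetical, and the case-insensitive comparison is an assumption rather than a description of the Lua script.

```python
def find_blocked_columns(column_names, blocked):
    """Return the column names that appear in the block list (case-insensitive)."""
    blocked_lower = {name.lower() for name in blocked}
    return [name for name in column_names if name.lower() in blocked_lower]


customer_columns = ["user_id", "Country", "Gender", "Personal_ID"]
print(find_blocked_columns(customer_columns, ["user_id"]))
# ['user_id']
```

A non-empty result corresponds to the situation in which the merge into the main branch is rejected.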

## Create the Customers Delta table in the new branch (using the [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)

In [None]:
customersTablePath = f"s3a://{repo}/{schemaValidationBranch1stAttempt}/tables/customers"
df = spark.read.csv('./data/samples/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").save(customersTablePath)
df.show(10)

## Create the Orders Delta table in the new branch (using the [ORDER_FACT.csv](./data/samples/OrionStar/ORDER_FACT.csv) file)

In [None]:
ordersTablePath = f"s3a://{repo}/{schemaValidationBranch1stAttempt}/tables/orders"
df = spark.read.csv('./data/samples/OrionStar/ORDER_FACT.csv',header=True,schema=ordersSchema)
df.write.format("delta").mode("overwrite").save(ordersTablePath)
df.show(10)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=schemaValidationBranch1stAttempt,
    commit_creation=models.CommitCreation(
        message='Added customers and orders Delta tables!', 
        metadata={'using': 'python_api'}))

## Merge new branch to the main branch

#### The merge will fail because the Delta tables contain the blocked column "user_id". Review the error message.

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=schemaValidationBranch1stAttempt, 
    destination_branch=mainBranch)
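
Because the failed merge raises an exception, running all cells at once stops here. One way to let the notebook continue is to wrap the call, as in the sketch below. The helper `attempt_merge` is hypothetical, and it deliberately catches the generic `Exception` because the concrete exception class depends on the client library version.

```python
def attempt_merge(merge_fn):
    """Call merge_fn() and return (succeeded, detail) instead of raising."""
    try:
        return True, merge_fn()
    except Exception as exc:  # the concrete exception class depends on the client library
        return False, str(exc)


# Usage in this notebook (commented out; requires the client defined earlier):
# ok, detail = attempt_merge(lambda: client.refs.merge_into_branch(
#     repository=repo,
#     source_ref=schemaValidationBranch1stAttempt,
#     destination_branch=mainBranch))
# print("Merged" if ok else f"Merge blocked: {detail}")
```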

## Let's attempt to ingest data again without any PII columns

#### Create a new branch for 2nd attempt

In [None]:
client.branches.create_branch(
    repository=repo, 
    branch_creation=models.BranchCreation(
        name=schemaValidationBranch2ndAttempt, source=mainBranch))

## Change "user_id" column to "Customer_ID" in the schema

In [None]:
customersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False), # Change "user_id" column to "Customer_ID"
  StructField("Country", StringType(), False),
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

In [None]:
ordersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False), # Change "user_id" column to "Customer_ID"
  StructField("Employee_ID", IntegerType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Order_Date", StringType(), False),
  StructField("Delivery_Date", StringType(), False),
  StructField("Order_ID", LongType(), True),
  StructField("Order_Type", ByteType(), False),
  StructField("Product_ID", LongType(), False),
  StructField("Quantity", ByteType(), False),
  StructField("Total_Retail_Price", StringType(), False),
  StructField("CostPrice_Per_Unit", StringType(), False),
  StructField("Discount", LongType(), False)
])

## Create the Customers Delta table in the new branch (using the [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)

In [None]:
customersTablePath = f"s3a://{repo}/{schemaValidationBranch2ndAttempt}/tables/customers"
df = spark.read.csv('./data/samples/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").save(customersTablePath)
df.show(10)

## Create the Orders Delta table in the new branch (using the [ORDER_FACT.csv](./data/samples/OrionStar/ORDER_FACT.csv) file)

In [None]:
ordersTablePath = f"s3a://{repo}/{schemaValidationBranch2ndAttempt}/tables/orders"
df = spark.read.csv('./data/samples/OrionStar/ORDER_FACT.csv',header=True,schema=ordersSchema)
df.write.format("delta").mode("overwrite").save(ordersTablePath)
df.show(10)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=schemaValidationBranch2ndAttempt,
    commit_creation=models.CommitCreation(
        message='Added customers and orders Delta tables without any PII columns!', 
        metadata={'using': 'python_api'}))

## Merge new branch to the main branch

#### Merge will succeed this time because there are no PII columns in the Delta tables

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=schemaValidationBranch2ndAttempt, 
    destination_branch=mainBranch)

# Check for any schema changes next

## Create a new branch which will be used to ingest data

In [None]:
client.branches.create_branch(
    repository=repo, 
    branch_creation=models.BranchCreation(
        name=schemaChangeBranch, source=mainBranch))

## Change "Country" column to "Country_Name" in the schema

In [None]:
customersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False),
  StructField("Country_Name", StringType(), False), # Column name changes from Country to Country_Name
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

## Change data type for column "Quantity" from ByteType to LongType

In [None]:
ordersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False),
  StructField("Employee_ID", IntegerType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Order_Date", StringType(), False),
  StructField("Delivery_Date", StringType(), False),
  StructField("Order_ID", LongType(), True), 
  StructField("Order_Type", ByteType(), False),
  StructField("Product_ID", LongType(), False),
  StructField("Quantity", LongType(), False), # Data type changes from ByteType() to LongType()
  StructField("Total_Retail_Price", StringType(), False),
  StructField("CostPrice_Per_Unit", StringType(), False),
  StructField("Discount", LongType(), False)
])
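
The two changes made above, a renamed column and a changed data type, are the kind of difference the schema-change hook is meant to catch. As a plain-Python illustration of such a comparison (not the Lua script itself), the hypothetical helper `diff_schemas` below compares two column-to-type mappings. The type strings follow Spark's short names such as `tinyint` and `bigint`; with PySpark, a mapping like this could be built from `schema.fields`.

```python
def diff_schemas(old, new):
    """Compare two {column: type} mappings; return added, removed, and retyped columns."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(col for col in set(old) & set(new) if old[col] != new[col])
    return {"added": added, "removed": removed, "retyped": retyped}


# A rename shows up as one removed column plus one added column.
old_customers = {"Customer_ID": "int", "Country": "string"}
new_customers = {"Customer_ID": "int", "Country_Name": "string"}
print(diff_schemas(old_customers, new_customers))
# {'added': ['Country_Name'], 'removed': ['Country'], 'retyped': []}

# A type change keeps the column name but alters its type.
old_orders = {"Quantity": "tinyint", "Discount": "bigint"}
new_orders = {"Quantity": "bigint", "Discount": "bigint"}
print(diff_schemas(old_orders, new_orders))
# {'added': [], 'removed': [], 'retyped': ['Quantity']}
```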

## Create the Customers Delta table in the new branch (using the [CUSTOMER.csv](./data/samples/OrionStar/CUSTOMER.csv) file)

In [None]:
customersTablePath = f"s3a://{repo}/{schemaChangeBranch}/tables/customers"
df = spark.read.csv('./data/samples/OrionStar/CUSTOMER.csv',header=True,schema=customersSchema)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(customersTablePath)
df.show(10)

## Create the Orders Delta table in the new branch (using the [ORDER_FACT.csv](./data/samples/OrionStar/ORDER_FACT.csv) file)

In [None]:
ordersTablePath = f"s3a://{repo}/{schemaChangeBranch}/tables/orders"
df = spark.read.csv('./data/samples/OrionStar/ORDER_FACT.csv',header=True,schema=ordersSchema)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(ordersTablePath)
df.show(10)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=schemaChangeBranch,
    commit_creation=models.CommitCreation(
        message='Added customers and orders Delta tables with schema changes!', 
        metadata={'using': 'python_api'}))

## Merge new branch to the main branch

#### The merge will fail because the schema changed. Review the error message.

In [None]:
client.refs.merge_into_branch(
    repository=repo,
    source_ref=schemaChangeBranch, 
    destination_branch=mainBranch)

## You can also review all Actions in the lakeFS UI

![Actions UI](./Images/LuaHooks/Actions.png)

## Click on any Run ID to review Action details in the lakeFS UI

#### Click on the "pre merge checks on main branch" Action in the left panel. Expand the sections in the right panel to see logs and error messages.

![Action Details UI](./Images/LuaHooks/ActionDetails.png)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack