<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Running data quality checks with lakeFS webhooks

_This notebook shows how to use [lakeFS Hooks](https://docs.lakefs.io/hooks/overview.html)_.

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "hooks-demo-01"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [5]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.config.get_lake_fs_version()
except Exception as e:
    # Report the cause instead of silently swallowing every error
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅ lakeFS credentials verified\n\nℹ️ lakeFS version {v.version}")

Verifying lakeFS credentials…
…✅ lakeFS credentials verified

ℹ️ lakeFS version 0.104.0


### Define lakeFS Repository

In [6]:
import os  # needed for the os._exit() calls below

from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository hooks-demo-01 does not exist, so going to try and create it now.
Created new repo hooks-demo-01 using storage namespace s3://example/hooks-demo-01


### Set up Spark

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

In [8]:
ingest_branch = "ingest-landing-area"
staging_branch = "staging-area"
prod_branch = "main"

In [9]:
from datetime import date, time

---

# Main demo starts here 🚦 👇🏻

## Creating Ingest and Staging branches

In [10]:
lakefs.branches.list_branches(repo_name)


{'pagination': {'has_more': False,
                'max_per_page': 1000,
                'next_offset': '',
                'results': 1},
 'results': [{'commit_id': '12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176',
              'id': 'main'}]}

In [11]:
lakefs.branches.create_branch(repository=repo_name, 
                              branch_creation=BranchCreation(name=ingest_branch, 
                                                                    source=prod_branch)
                             )

'12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176'

In [12]:
lakefs.branches.create_branch(repository=repo_name, 
                              branch_creation=BranchCreation(name=staging_branch, 
                                                                    source=prod_branch)
                             )

'12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176'

In [13]:
lakefs.branches.list_branches(repo_name)


{'pagination': {'has_more': False,
                'max_per_page': 1000,
                'next_offset': '',
                'results': 3},
 'results': [{'commit_id': '12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176',
              'id': 'ingest-landing-area'},
             {'commit_id': '12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176',
              'id': 'main'},
             {'commit_id': '12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176',
              'id': 'staging-area'}]}

## Uploading movies data to ingest branch

In [14]:
ingest_data = "movies.csv"

ingest_path = f'dt={str(date.today())}/{ingest_data}'
ingest_path


'dt=2023-07-17/movies.csv'

In [15]:
with open(f'/data/{ingest_data}', 'rb') as f:
    lakefs.objects.upload_object(repository=repo_name, 
                                 branch=ingest_branch, 
                                 path=ingest_path, 
                                 content=f
                                )


In [16]:
lakefs.branches.diff_branch(repository=repo_name, 
                            branch=ingest_branch).results


[{'path': 'dt=2023-07-17/movies.csv',
  'path_type': 'object',
  'size_bytes': 1071619,
  'type': 'added'}]

In [17]:
lakefs.commits.commit(repository=repo_name,
                      branch=ingest_branch,
                      commit_creation=CommitCreation(
                          message="netflix movie data arrived at landing area (today's partition)")
                     )

{'committer': 'everything-bagel',
 'creation_date': 1689579977,
 'id': '38ffd4c01f2b0be5329135a10fb9dcba54dca28d47eac1164c5068c2b55390c8',
 'message': "netflix movie data arrived at landing area (today's partition)",
 'meta_range_id': '',
 'metadata': {},
 'parents': ['12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176']}

## Uploading actions.yaml config file to staging branch

* We want to run data quality checks on the staging branch before merging its data into production. lakeFS reads hook definitions from the `_lakefs_actions/` prefix of a branch, so the config file `actions.yaml` must be present on the branch being checked.
* So we add `_lakefs_actions/actions.yaml` to the staging branch.
* `actions.yaml` defines a pre-merge hook that validates file formats.
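
For orientation, here is a rough sketch of what such an `actions.yaml` might contain, held in a Python string so it can be inspected from the notebook. Only the action name (`ParquetOnlyInProduction`), the hook id (`production_format_validator`), the `webhook` type and the pre-merge trigger on `main` come from the error output later in this notebook. The description texts, the webhook URL and the `query_params` names are assumptions, and the real file in `./hooks/` may differ.

```python
# Hypothetical contents for _lakefs_actions/actions.yaml.
# Taken from this notebook's output: action name, hook id, hook type, pre-merge on main.
# Assumed for illustration: descriptions, webhook URL, query parameter names.
ACTIONS_YAML = """\
name: ParquetOnlyInProduction
description: Allow only Parquet files to reach production
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: production_format_validator
    type: webhook
    description: Validate file formats
    properties:
      url: "http://lakefs-hooks:5001/webhooks/format"  # assumed webhook address
      query_params:
        allow: ["parquet"]    # assumed parameter name
        prefix: analytics/    # assumed parameter name
"""


def top_level_keys(yaml_text):
    """Return the top-level keys of a simple YAML document without a YAML library."""
    return [
        line.split(":", 1)[0]
        for line in yaml_text.splitlines()
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line
    ]


print(top_level_keys(ACTIONS_YAML))
```

The `on` section selects the event and branch that trigger the action, and each entry under `hooks` names one check to run for that event.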

In [18]:
hooks_config_yaml = "actions.yaml"
hooks_prefix = "_lakefs_actions"


In [19]:
with open(f'./hooks/{hooks_config_yaml}', 'rb') as f:
    lakefs.objects.upload_object(repository=repo_name, 
                                 branch=staging_branch, 
                                 path=f'{hooks_prefix}/{hooks_config_yaml}', 
                                 content=f
                                )


In [20]:
lakefs.branches.diff_branch(repository=repo_name, 
                            branch=staging_branch).results


[{'path': '_lakefs_actions/actions.yaml',
  'path_type': 'object',
  'size_bytes': 420,
  'type': 'added'}]

In [21]:
lakefs.commits.commit(repository=repo_name,
                      branch=staging_branch,
                      commit_creation=CommitCreation(
                          message='Added hooks config file - actions.yaml to staging area')
                     )


{'committer': 'everything-bagel',
 'creation_date': 1689579978,
 'id': 'efd25c6c816d6aa20dcd8afee361fa104050fb4da1598eac14f595ccc491e5d3',
 'message': 'Added hooks config file - actions.yaml to staging area',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['12d0159dfe4602c890b466b53374d7b5243a7a5547c8d823cd6a7a30c5c82176']}

## Extracting data from ingest branch for transformation

In [22]:
ingest_long_path = f"s3a://{repo_name}/{ingest_branch}/{ingest_path}"
ingest_long_path


's3a://hooks-demo-01/ingest-landing-area/dt=2023-07-17/movies.csv'
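
The path above follows the pattern `s3a://<repository>/<branch>/<object path>`, because Spark talks to lakeFS through its S3-compatible endpoint. A small helper can make that pattern explicit. The function name `lakefs_s3a_uri` is hypothetical and not part of any lakeFS library.

```python
def lakefs_s3a_uri(repo, ref, path=""):
    """Build an s3a:// URI of the form s3a://<repo>/<ref>/<path>.

    Hypothetical helper for illustration; `ref` is typically a branch name.
    """
    parts = [repo.strip("/"), ref.strip("/")]
    if path:
        parts.append(path.lstrip("/"))
    return "s3a://" + "/".join(parts)


print(lakefs_s3a_uri("hooks-demo-01", "ingest-landing-area", "dt=2023-07-17/movies.csv"))
```

Omitting `path` yields the branch root, which is the form used for the staging branch later in this notebook.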

In [23]:
movies_df = spark.read.option("header","true").csv(ingest_long_path)
print(movies_df.count())
movies_df.printSchema()  # prints the schema itself and returns None, so no print() wrapper


8791
root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)


In [24]:
movies_df.show(10)

+-------+-------+--------------------+-------------------+--------------+----------+------------+------+---------+--------------------+
|show_id|   type|               title|           director|       country|date_added|release_year|rating| duration|           listed_in|
+-------+-------+--------------------+-------------------+--------------+----------+------------+------+---------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|    Kirsten Johnson| United States| 9/25/2021|        2020| PG-13|   90 min|       Documentaries|
|     s3|TV Show|           Ganglands|    Julien Leclercq|        France| 9/24/2021|        2021| TV-MA| 1 Season|Crime TV Shows, I...|
|     s6|TV Show|       Midnight Mass|      Mike Flanagan| United States| 9/24/2021|        2021| TV-MA| 1 Season|TV Dramas, TV Hor...|
|    s14|  Movie|Confessions of an...|      Bruno Garotti|        Brazil| 9/22/2021|        2021| TV-PG|   91 min|Children & Family...|
|     s8|  Movie|             Sankofa|       Hai

In [25]:
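# Keep a roughly 10% random sample (without replacement, fixed seed for repeatability)
movies_df = movies_df.sample(withReplacement=False, fraction=0.1, seed=0)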


## Loading transformed data into Staging Area/Branch

In [26]:
staging_long_path = f"s3a://{repo_name}/{staging_branch}"
staging_long_path


's3a://hooks-demo-01/staging-area'

## Scenario #1

### Writing parquet files to staging area

In [27]:
movies_df.write.option("header",True)\
        .partitionBy("type")\
        .mode("append")\
        .parquet(f"{staging_long_path}/analytics/movies-by-type-parquet")

In [28]:
lakefs.commits.commit(repository=repo_name,
                      branch=staging_branch,
                      commit_creation=CommitCreation(
                          message='loaded paritioned movies parquet to staging area'))


{'committer': 'everything-bagel',
 'creation_date': 1689580015,
 'id': 'f545bddbb91832c2ec2f11a4f98bc14ef539aec39bd9db31a5334f25f6dda544',
 'message': 'loaded paritioned movies parquet to staging area',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['efd25c6c816d6aa20dcd8afee361fa104050fb4da1598eac14f595ccc491e5d3']}

### Pushing parquet files to Prod

In [29]:
lakefs.refs.merge_into_branch(repository=repo_name, 
                              source_ref=staging_branch, 
                              destination_branch=prod_branch)


{'reference': '6637b8c64326cfc8fba848cb5da243aced82721a0191621025e53841fd418a97'}

## Scenario #2

### Writing csv files to staging area

In [30]:
movies_df.write.option("header",True)\
        .partitionBy("type")\
        .mode("append")\
        .csv(f"{staging_long_path}/analytics/movies-by-type-csv")
    

In [31]:
lakefs.commits.commit(repository=repo_name,
                      branch=staging_branch,
                      commit_creation=CommitCreation(
                          message='loaded paritioned movies csv to staging area'))


{'committer': 'everything-bagel',
 'creation_date': 1689580019,
 'id': 'af391b975a6c92732ff6ba7210729c6eec2bc4bf66c8509ec9847f8fbd8490a7',
 'message': 'loaded paritioned movies csv to staging area',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['f545bddbb91832c2ec2f11a4f98bc14ef539aec39bd9db31a5334f25f6dda544']}

### Pushing csv files to Prod

In [32]:
lakefs.refs.merge_into_branch(repository=repo_name, 
                              source_ref=staging_branch, 
                              destination_branch=prod_branch)


ApiException: (412)
Reason: Precondition Failed
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Request-Id': 'e5d20b2f-217f-4836-bd58-c84b6196f1a8', 'Date': 'Mon, 17 Jul 2023 07:46:59 GMT', 'Content-Length': '261'})
HTTP response body: {"message":"update branch main: pre-merge hook aborted, run id '5hvqk3a2tk3c76sjukh0': 1 error occurred:\n\t* hook run id '0000_0000' failed on action 'ParquetOnlyInProduction' hook 'production_format_validator': webhook request failed (status code: 400)\n\n"}
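
The response body above is JSON, and its `message` field names the run, action and hook that aborted the merge. The sketch below pulls those fields out with the standard library only. The function name `summarize_hook_failure` is hypothetical, and the regular expressions assume the message wording shown in this output, which may differ in other lakeFS versions. In the notebook, the body would presumably come from the caught `ApiException` (its `body` attribute).

```python
import json
import re

# Response body copied from the failed merge above (raw string keeps the JSON escapes intact).
BODY = r'''{"message":"update branch main: pre-merge hook aborted, run id '5hvqk3a2tk3c76sjukh0': 1 error occurred:\n\t* hook run id '0000_0000' failed on action 'ParquetOnlyInProduction' hook 'production_format_validator': webhook request failed (status code: 400)\n\n"}'''


def summarize_hook_failure(body):
    """Extract run id, action, hook and webhook status from a hook-failure message."""
    message = json.loads(body)["message"]

    def first(pattern):
        match = re.search(pattern, message)
        return match.group(1) if match else None

    return {
        "run_id": first(r"run id '([^']+)'"),
        "action": first(r"action '([^']+)'"),
        "hook": first(r"hook '([^']+)'"),
        "webhook_status": first(r"status code: (\d+)"),
    }


print(summarize_hook_failure(BODY))
```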



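### Why did the merge operation fail?

The merge request failed with HTTP status 412 (Precondition Failed), and the response body explains why. The pre-merge hook `production_format_validator`, defined by the action `ParquetOnlyInProduction` in `actions.yaml`, called its webhook, which responded with status code 400. lakeFS therefore aborted the merge, so the commit containing CSV files never reached `main`.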
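
To make the check concrete, here is a minimal sketch of the kind of logic a format-validation webhook could apply. It is not the actual webhook used in this demo. The function name `check_formats`, the `analytics/` prefix, the rule of skipping marker files that start with `_`, and the example file names are all assumptions for illustration.

```python
import posixpath


def check_formats(paths, allowed=("parquet",), prefix="analytics/"):
    """Return (status_code, offending_paths) for the object paths changed by a merge.

    Hypothetical logic: only paths under `prefix` are checked, marker files
    such as `_SUCCESS` are skipped, and the file extension must be allowed.
    """
    offending = []
    for path in paths:
        name = posixpath.basename(path)
        if not path.startswith(prefix) or name.startswith("_"):
            continue
        extension = name.rsplit(".", 1)[-1] if "." in name else ""
        if extension not in allowed:
            offending.append(path)
    return (400 if offending else 200), offending


# Illustrative paths; the real part-file names written by Spark will differ.
changed = [
    "_lakefs_actions/actions.yaml",
    "analytics/movies-by-type-parquet/type=Movie/part-00000.snappy.parquet",
    "analytics/movies-by-type-csv/_SUCCESS",
    "analytics/movies-by-type-csv/type=Movie/part-00000.csv",
]
print(check_formats(changed))
```

A non-2xx response from the webhook is what causes lakeFS to abort the merge, which matches the 400 reported in the error above.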

## Action execution history

👉🏻 You can see previous actions run [here](http://localhost:8000/repositories/hooks-demo-01/actions)