# Integration of lakeFS with Airflow via Hooks

## Use Case: Isolated Ingestion & ETL Environment

## Prerequisites

###### This Notebook requires connecting to a lakeFS Server. 
###### Run lakeFS locally with Docker (https://docs.lakefs.io/quickstart/run.html).

##### Also, make sure that the lakeFS server can reach the Airflow server, either directly or over a Virtual Private Network (VPN).

## Setup Task: Change your lakeFS credentials (Access Key and Secret Key)

In [None]:
lakefsEndPoint = 'http://host.docker.internal:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
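
If you prefer not to keep credentials in the notebook, the cell above could read them from environment variables instead. The following is a minimal sketch: `load_lakefs_credentials` is a hypothetical helper, the variable names follow the lakectl convention and are an assumption about your environment, and the fallback values are the sample values from this notebook.

```python
import os

# Hypothetical helper: prefer environment variables over hardcoded credentials.
# The variable names follow the lakectl convention (an assumption about your setup);
# the defaults are the sample values used in this notebook.
def load_lakefs_credentials(env=os.environ):
    return {
        "endpoint": env.get("LAKECTL_SERVER_ENDPOINT_URL",
                            "http://host.docker.internal:8000"),
        "access_key": env.get("LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
                              "AKIAIOSFOLKFSSAMPLES"),
        "secret_key": env.get("LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
                              "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"),
    }

creds = load_lakefs_credentials()
lakefsEndPoint = creds["endpoint"]
lakefsAccessKey = creds["access_key"]
lakefsSecretKey = creds["secret_key"]
```

When none of the variables are set, the three notebook variables keep the sample values shown above.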

## Setup Task: You can change the lakeFS repo name (use an existing repo or provide a new repo name)

In [None]:
repo = "airflow-hooks-repo"

## Setup Task: Versioning Information

In [None]:
sourceBranch = "main"
newBranch = "ingest"
airflowBranch = "etl_airflow"
newPath = "partitioned_data"
successFileName = "success.txt"

## Setup Task: Storage Information
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://example/' + repo # e.g. "s3://bucket"

## Setup Task: Run additional [Setup](./airflow/Hooks/HooksSetup.ipynb) tasks here

In [None]:
%run ./airflow/Hooks/HooksSetup.ipynb

### You will run the following steps in this notebook (refer to the image below):

##### - Create a repository with the Main branch
##### - Create the Ingest branch from the Main branch, add a data file to the Ingest branch, and commit the changes
##### - The Post-Commit hook triggers the Airflow Transformation DAG
##### - The Airflow Transformation DAG creates the ETL branch from the Ingest branch
##### - The Airflow Transformation DAG runs the transformation task and creates the Success file if the transformation succeeds
##### - The Airflow Transformation DAG commits the changes and merges the ETL branch into the Ingest branch
##### - Merge the Ingest branch into the Main branch
##### - The Pre-Merge hook triggers another Airflow DAG, which looks for the Success file in the Ingest branch to confirm that the ETL job completed; if the file is found, the merge succeeds
##### - If the Pre-Merge hook DAG fails, the merge also fails

![Step 1](./Images/AirflowHooks/1-AllSteps.png)

## If the repo already exists on your lakeFS server, you can skip the following step; otherwise, create a new repo

![Step 1](./Images/AirflowHooks/15.png)

In [None]:
repository = lakefs.Repository(repo).create(storage_namespace=storageNamespace, default_branch=sourceBranch, exist_ok=True)
main = repository.branch(sourceBranch)
print(repository)

# Ingest and ETL Process Starts

## Create ingest branch

![Step 1](./Images/AirflowHooks/14.png)

In [None]:
branch = repository.branch(newBranch).create(source_reference=sourceBranch)
print(branch)

## Upload bad data file

![Step 1](./Images/AirflowHooks/13.png)

In [None]:
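import os

# fileName is not defined in this notebook; it is presumably set by HooksSetup.ipynb.
# Read the file as bytes so the content matches the binary upload mode.
with open(os.path.expanduser('~')+'/airflow/Hooks/data/bad_data_file/'+fileName, 'rb') as f:
    contentToUpload = f.read()
branch.object(fileName).upload(data=contentToUpload, mode='wb', pre_sign=False)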

## Upload the [Post-Commit Actions](./airflow/Hooks/actions_post_commit.yaml) file. This action invokes the Post-Commit DAG.

#### You can review the [Post-Commit DAG](./airflow/Hooks/lakefs_hooks_post_commit_dag.py) program.

In [None]:
# Action files must be stored under the _lakefs_actions/ prefix
with open(os.path.expanduser('~')+'/airflow/Hooks/actions_post_commit.yaml', 'rb') as f:
    contentToUpload = f.read()
branch.object('_lakefs_actions/actions_post_commit.yaml').upload(data=contentToUpload, mode='wb', pre_sign=False)

## Upload the [Pre-Merge Actions](./airflow/Hooks/actions_pre_merge.yaml) file. This action invokes the Pre-Merge DAG, which verifies whether the Post-Commit DAG succeeded.

#### You can review the [Pre-Merge DAG](./airflow/Hooks/lakefs_hooks_pre_merge_dag.py) program. The DAG checks for the success.txt file, which the Post-Commit DAG creates.

In [None]:
# Action files must be stored under the _lakefs_actions/ prefix
with open(os.path.expanduser('~')+'/airflow/Hooks/actions_pre_merge.yaml', 'rb') as f:
    contentToUpload = f.read()
branch.object('_lakefs_actions/actions_pre_merge.yaml').upload(data=contentToUpload, mode='wb', pre_sign=False)
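
The contents of the two action files are not shown in this notebook. As a rough illustration of what a lakeFS action that triggers an Airflow DAG might look like, the sketch below renders such a file as text. `render_airflow_action` is a hypothetical helper, the Airflow URL is a placeholder, the event-to-branch mapping is an assumption, and the `wait_for_dag` property should be checked against the lakeFS hooks documentation for your version. The real files in `./airflow/Hooks/` may differ.

```python
def render_airflow_action(name, event, branch, dag_id, airflow_url,
                          username="airflow", password="airflow",
                          wait_for_dag=False):
    """Return lakeFS action YAML text that triggers one Airflow DAG on one event."""
    lines = [
        f"name: {name}",
        "on:",
        f"  {event}:",
        "    branches:",
        f"      - {branch}",
        "hooks:",
        f"  - id: trigger_{dag_id}",
        "    type: airflow",
        "    properties:",
        f'      url: "{airflow_url}"',
        f'      dag_id: "{dag_id}"',
        f'      username: "{username}"',
        f'      password: "{password}"',
    ]
    if wait_for_dag:
        # Assumed property: make the hook wait for the DAG run and fail with it
        lines.append("      wait_for_dag: true")
    return "\n".join(lines) + "\n"

# Placeholder URL: replace with an address the lakeFS server can reach
post_commit_yaml = render_airflow_action(
    name="Post-Commit Airflow trigger", event="post-commit", branch="ingest",
    dag_id="lakefs_hooks_post_commit_dag", airflow_url="http://airflow.example:8080")

pre_merge_yaml = render_airflow_action(
    name="Pre-Merge Airflow check", event="pre-merge", branch="main",
    dag_id="lakefs_hooks_pre_merge_dag", airflow_url="http://airflow.example:8080",
    wait_for_dag=True)

print(post_commit_yaml)
```

A pre-merge hook can only block the merge if lakeFS waits for the DAG result, which is why the second call sets `wait_for_dag=True` in this sketch.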

## Review uncommitted changes on the ingest branch

In [None]:
from tabulate import tabulate

# Collect the path, path type, size, and change type of each uncommitted change
results = [
    [n.path, n.path_type, n.size_bytes, n.type]
    for n in branch.uncommitted()]

print(tabulate(
    results,
    headers=['Path','Path Type','Size(Bytes)','Type']))

## Commit changes and attach some metadata

![Step 1](./Images/AirflowHooks/12.png)

In [None]:
ref = branch.commit(message='Uploaded bad data file!',
        metadata={'airflow dag url': 'http://127.0.0.1:8080/dags/lakefs_hooks_post_commit_dag/grid',
                  'ml model version': 'v1.0'})
print(ref.get_commit())

## Post-Commit DAG will get triggered

![Step 1](./Images/AirflowHooks/11.png)

### Visualize the [Post-Commit DAG Graph](http://127.0.0.1:8080/dags/lakefs_hooks_post_commit_dag/graph) in the Airflow UI. Log in with username "airflow" and password "airflow".

##### Toggle the Auto Refresh switch in the DAG Graph to follow the progress of the workflow.
##### Click any lakeFS-related task box, then click the "lakeFS UI" button (this URL takes you to the applicable branch, commit, or data file in lakeFS). You can also find this URL in the Airflow log: click the Log button and search for "lakeFS URL".

![Step 1](./Images/AirflowHooks/10.png)

## DAG will create ETL branch (with timestamp)

![Step 1](./Images/AirflowHooks/9.png)
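
The exact naming scheme lives in the Post-Commit DAG program. As a minimal sketch of how a timestamped branch name could be derived from the `airflowBranch` prefix, assuming a UTC timestamp suffix (the helper name and the format are illustrative, not taken from the DAG):

```python
from datetime import datetime, timezone

def timestamped_branch_name(prefix, now=None):
    """Append a UTC timestamp to the prefix so each DAG run gets its own branch."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%dT%H%M%S')}"

# Fixed time used here only to make the example reproducible
example = timestamped_branch_name("etl_airflow", datetime(2024, 1, 2, 3, 4, 5))
print(example)  # etl_airflow_20240102T030405
```

A unique branch per run keeps repeated DAG runs isolated from each other.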

## Transformation job fails due to bad data

### The "transformation" task will fail in this case. Click the "transformation" task box, then click the Log button and search for "Exception". You will see the following exception:
### "Partition column _c4 not found in schema struct<_c0:string,_c1:string,_c2:string,_c3:string>"

### This exception occurs because column "_c4" (the 5th column) is missing from the latest file.

![Step 1](./Images/AirflowHooks/8.png)
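
The failure above comes from a file with only four columns, while the transformation partitions by the fifth (`_c4`). A quick local check before uploading could catch this earlier. The sketch below uses only the standard library; `rows_missing_columns` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io

def rows_missing_columns(csv_text, required_columns=5):
    """Return 1-based row numbers that have fewer than `required_columns` fields."""
    reader = csv.reader(io.StringIO(csv_text))
    # Blank lines parse as empty rows and are skipped
    return [i for i, row in enumerate(reader, start=1)
            if row and len(row) < required_columns]

# Made-up sample data: the first has five fields per row, the second only four
good_sample = "a,b,c,d,e\n1,2,3,4,5\n"
bad_sample = "a,b,c,d\n1,2,3,4\n"

print(rows_missing_columns(good_sample))  # []
print(rows_missing_columns(bad_sample))   # [1, 2]
```

In this notebook the bad file is uploaded on purpose to demonstrate the hooks, so a check like this would belong in a real ingestion pipeline rather than here.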

## Try to merge the ingest branch into the main branch. The merge will fail if the Post-Commit DAG has failed or is still running.

In [None]:
res = branch.merge_into(main)
print(res)
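
Because this merge is expected to fail, the cell above will most likely raise an error and stop a "Run All" execution. One way to keep the notebook running is to wrap the call, as sketched below. `try_merge` is a hypothetical helper, and it catches a generic `Exception` because the specific exception type raised by the SDK is not assumed here.

```python
def try_merge(merge_fn):
    """Run merge_fn(); return (True, result) on success or (False, error) on failure."""
    try:
        return True, merge_fn()
    except Exception as err:  # the SDK's specific exception type is not assumed
        return False, err

# Stand-in for a merge blocked by a hook, used only to demonstrate the helper
def blocked_merge():
    raise RuntimeError("pre-merge hook failed")

merged, outcome = try_merge(blocked_merge)
print("Merge succeeded:" if merged else "Merge blocked:", outcome)
```

In this notebook the call would be `try_merge(lambda: branch.merge_into(main))`, using the `branch` and `main` objects created earlier.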

### Visualize [Pre-Merge DAG Graph](http://127.0.0.1:8080/dags/lakefs_hooks_pre_merge_dag/graph) in Airflow UI

### The "sense_success_file" task will fail in this case. Click the "sense_success_file" task box, then click the Log button. You will see the following message in the log:
### File 'success.txt' not found on branch 'ingest'
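
The task name suggests a sensor-style check: poll for the file until it appears or a timeout expires. The actual implementation is in the Pre-Merge DAG program; the sketch below only illustrates the general polling pattern with the standard library. `wait_for` and its parameters are hypothetical names, and the predicate stands in for a real "does success.txt exist on the branch" lookup.

```python
import time

def wait_for(predicate, timeout_s=10.0, poke_interval_s=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poke_interval_s)

# Stand-in predicate: the "file" appears on the third check
checks = {"count": 0}
def file_exists():
    checks["count"] += 1
    return checks["count"] >= 3

found = wait_for(file_exists, timeout_s=10.0, sleep=lambda s: None)
print(found, checks["count"])
```

When the predicate never becomes true, the function returns `False`, which corresponds to the failed task and the blocked merge described above.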

## Upload correct data file

![Step 1](./Images/AirflowHooks/7.png)

In [None]:
# Read the corrected file as bytes so the content matches the binary upload mode
with open(os.path.expanduser('~')+'/airflow/Hooks/data/correct_data_file/'+fileName, 'rb') as f:
    contentToUpload = f.read()
branch.object(fileName).upload(contentToUpload, mode='wb', pre_sign=False)

## Add Release Notes

In [None]:
with open(os.path.expanduser('~')+'/airflow/Hooks/data/ReleaseNotes.txt', 'rb') as f:
    contentToUpload = f.read()
branch.object('ReleaseNotes.txt').upload(contentToUpload, mode='wb', pre_sign=False)

## Commit changes and attach some metadata. The Post-Commit DAG will be triggered again.

![Step 1](./Images/AirflowHooks/6.png)

In [None]:
ref = branch.commit(message='Uploaded correct data file!',
        metadata={'airflow dag url': 'http://127.0.0.1:8080/dags/lakefs_hooks_post_commit_dag/grid',
                  'ml model version': 'v1.0'})
print(ref.get_commit())

### Visualize [Post-Commit DAG Graph](http://127.0.0.1:8080/dags/lakefs_hooks_post_commit_dag/graph) in Airflow UI

![Step 1](./Images/AirflowHooks/5.png)

## DAG will create ETL branch (with timestamp)

![Step 1](./Images/AirflowHooks/4.png)

## Transformation job will succeed and will create Success file

![Step 1](./Images/AirflowHooks/3.png)

## Merge ETL branch into Ingest branch

![Step 1](./Images/AirflowHooks/2.png)

## Add tag for future use

In [None]:
tag = 'v1.0'
lakefs.Tag(repo, tag).create(newBranch, exist_ok=True)

## Merge the Ingest branch into the Main branch. The merge will succeed this time because the Post-Commit DAG succeeded.

![Step 1](./Images/AirflowHooks/1.png)

In [None]:
res = branch.merge_into(main)
print(res)

### Visualize [Pre-Merge DAG Graph](http://127.0.0.1:8080/dags/lakefs_hooks_pre_merge_dag/graph) in Airflow UI

## Read data by using tag

In [None]:
tag = 'v1.0'
dataPath = f"s3a://{repo}/{tag}/{fileName}"

df = spark.read.csv(dataPath)
df.show()

## If you want, you can atomically roll back all changes

In [None]:
main.revert(parent_number=1, reference_id=sourceBranch)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack