# Integration of lakeFS with Airflow via Hooks

## Use Case: Isolated Ingestion & ETL Environment

## Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use the Playground (https://demo.lakefs.io), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/run.html).

## Change your lakeFS credentials

In [None]:
lakefsEndPoint = 'http://host.docker.internal:8000'
lakefsUIEndPoint = 'http://127.0.0.1:8000'
lakefsAccessKey = 'AKIAJ64AEHEF4JPRG24Q'
lakefsSecretKey = '5OeObqYul3zPYIW31lQj4x/aEF22oc8FE/dUVqZ8'
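
The cell above hard-codes the endpoint and keys. As an alternative, here is a minimal sketch that reads the same settings from environment variables. The variable names (`LAKEFS_ENDPOINT`, `LAKEFS_UI_ENDPOINT`, `LAKEFS_ACCESS_KEY_ID`, `LAKEFS_SECRET_ACCESS_KEY`) and the helper `load_lakefs_settings` are hypothetical and not part of this notebook or of lakeFS.

```python
import os

# Hypothetical environment variable names; the notebook itself hard-codes these values.
DEFAULTS = {
    "LAKEFS_ENDPOINT": "http://host.docker.internal:8000",
    "LAKEFS_UI_ENDPOINT": "http://127.0.0.1:8000",
}


def load_lakefs_settings(env=os.environ):
    """Return the notebook's lakeFS settings, preferring environment variables."""
    settings = {
        "lakefsEndPoint": env.get("LAKEFS_ENDPOINT", DEFAULTS["LAKEFS_ENDPOINT"]),
        "lakefsUIEndPoint": env.get("LAKEFS_UI_ENDPOINT", DEFAULTS["LAKEFS_UI_ENDPOINT"]),
        "lakefsAccessKey": env.get("LAKEFS_ACCESS_KEY_ID"),
        "lakefsSecretKey": env.get("LAKEFS_SECRET_ACCESS_KEY"),
    }
    # Fail early with a clear message instead of sending empty credentials.
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError("Missing lakeFS settings: " + ", ".join(missing))
    return settings
```

The returned dictionary uses the same names as the notebook variables, so its values could be assigned to `lakefsEndPoint`, `lakefsAccessKey`, and so on before running the remaining cells.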

## Storage Information
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 'local://ml-data-repo' # e.g. "s3://username-lakefs-cloud/"

## Versioning Information

In [1]:
sourceBranch = "main"
newBranch = "ingest"
airflowBranch = "etl_airflow"
fileName = "lakefs_test.csv"
newPath = "partitioned_data"
successFileName = "success.txt"

## You can change the lakeFS repo name (it can be an existing repo or a new repo name)

In [None]:
repo = "ml-data-repo"

## Working with the lakeFS Python client API

###### Note: To learn more about lakeFS Python integration visit https://docs.lakefs.io/integrations/python.html

In [None]:
%xmode Minimal
if 'client' not in locals():
    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = lakefsAccessKey
    configuration.password = lakefsSecretKey
    configuration.host = lakefsEndPoint

    client = LakeFSClient(configuration)
    print("Created lakeFS client.")

## Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials")
client.config.get_lake_fs_version()
print("lakeFS credentials verified")

## S3A Gateway configuration

##### Note: lakeFS can be configured to work with Spark in two ways:
###### * Access lakeFS using the S3A gateway https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway.
###### * Access lakeFS using the lakeFS-specific Hadoop FileSystem https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-lakefs-specific-hadoop-filesystem.

In [None]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", lakefsAccessKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", lakefsSecretKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", lakefsEndPoint)
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
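
With the S3A gateway configured as above, objects are addressed as `s3a://<repo>/<branch>/<path>`: the repository takes the place of the bucket and the branch is the first path segment. A minimal sketch of building such a URI follows; the helper name `lakefs_s3a_uri` is hypothetical.

```python
def lakefs_s3a_uri(repo, branch, path=""):
    """Build an s3a:// URI of the form s3a://<repo>/<branch>/<path>."""
    if not repo or not branch:
        raise ValueError("repo and branch must be non-empty")
    # Strip a leading slash so the URI never contains an empty path segment.
    return f"s3a://{repo}/{branch}/{path.lstrip('/')}"


# Values taken from the notebook's variables above.
print(lakefs_s3a_uri("ml-data-repo", "ingest", "lakefs_test.csv"))
# prints: s3a://ml-data-repo/ingest/lakefs_test.csv
```

A URI built this way could then be passed to a Spark reader such as `spark.read.csv`, assuming the file already exists on that branch.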

## Start Airflow

In [None]:
print("Starting Airflow")

In [None]:
! pkill airflow
! pkill airflow
! pkill airflow
! pkill airflow

In [None]:
%env AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth

In [None]:
%%script bash --bg --out script_out --err script_error
airflow standalone

In [None]:
# Wait for Airflow to start
! sleep 20
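
A fixed 20-second sleep can be too short on a slow machine and longer than needed on a fast one. One possible alternative is to poll an HTTP endpoint until it responds. The sketch below uses only the standard library; the helper name `wait_for_http` is hypothetical, and the URL in the usage note assumes the Airflow webserver listens on its default port 8080 and serves a `/health` endpoint.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=2.0):
    """Poll url until it answers with HTTP 200 or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # Server is not accepting connections yet, or returned an error.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Under those assumptions, `wait_for_http("http://localhost:8080/health")` would return `True` once the webserver is up and `False` if it has not responded within a minute.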

In [None]:
#%%script bash --bg --out script_out --err script_error
#airflow scheduler

In [None]:
print("Airflow Started")

## Create Airflow connections for lakeFS and Spark

In [None]:
! airflow connections delete conn_lakefs
lakeFSConnectionCommand = 'airflow connections add conn_lakefs --conn-type=http --conn-host=' + lakefsEndPoint + ' --conn-extra=\'{"access_key_id":"' + lakefsAccessKey + '","secret_access_key":"' + lakefsSecretKey + '"}\''
! $lakeFSConnectionCommand

! airflow connections delete conn_spark
sparkConnectionCommand = 'airflow connections add conn_spark --conn-type=spark --conn-host=local[*]'
! $sparkConnectionCommand
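
The lakeFS connection command above is assembled by string concatenation with hand-escaped quotes, which is easy to break if a key contains a quote character. A minimal sketch of an alternative builds the same `airflow connections add` arguments as a list and lets `json.dumps` produce the `--conn-extra` payload. The helper name `lakefs_connection_args` and the placeholder key values are hypothetical.

```python
import json
import shlex


def lakefs_connection_args(conn_id, host, access_key, secret_key):
    """Return the argument list for `airflow connections add` with lakeFS keys in the extra field."""
    extra = json.dumps({"access_key_id": access_key, "secret_access_key": secret_key})
    return [
        "airflow", "connections", "add", conn_id,
        "--conn-type=http",
        f"--conn-host={host}",
        f"--conn-extra={extra}",
    ]


args = lakefs_connection_args(
    "conn_lakefs", "http://host.docker.internal:8000", "<access-key>", "<secret-key>"
)
# shlex.join quotes each argument so the command can be pasted into a shell.
print(shlex.join(args))
```

The list form can also be handed directly to `subprocess.run(args, check=True)`, which avoids shell quoting altogether.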

## Set Airflow variables which are used by the demo workflow

In [None]:
! airflow variables set lakefsAccessKey $lakefsAccessKey
! airflow variables set lakefsSecretKey $lakefsSecretKey
! airflow variables set lakefsEndPoint $lakefsEndPoint
! airflow variables set lakefsUIEndPoint $lakefsUIEndPoint
! airflow variables set repo $repo
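# The Airflow variable sourceBranch receives the notebook's newBranch value,
# and the Airflow variable newBranch receives the notebook's airflowBranch value.
! airflow variables set sourceBranch $newBranch
! airflow variables set newBranch $airflowBranch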
! airflow variables set fileName $fileName
! airflow variables set newPath $newPath
! airflow variables set successFileName $successFileName
! airflow variables set conn_lakefs 'conn_lakefs'

import os
spark_home = os.getenv('SPARK_HOME')
! airflow variables set spark_home $spark_home
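
The cell above launches the Airflow CLI once per variable. A possible alternative is to write all variables to one JSON file and load it with a single `airflow variables import <file>` call. The sketch below only prepares the file; the helper name `write_airflow_variables` and the file name are hypothetical, and the credentials, endpoints, and `spark_home` are left out here and would need to be added to the dictionary.

```python
import json
import tempfile
from pathlib import Path


def write_airflow_variables(variables, path):
    """Write variables as a JSON object and return the file path."""
    path = Path(path)
    path.write_text(json.dumps(variables, indent=2))
    return path


# Non-secret values from the notebook; note the branch mapping used by the cell above.
demo_variables = {
    "repo": "ml-data-repo",
    "sourceBranch": "ingest",     # notebook variable newBranch
    "newBranch": "etl_airflow",   # notebook variable airflowBranch
    "fileName": "lakefs_test.csv",
    "newPath": "partitioned_data",
    "successFileName": "success.txt",
    "conn_lakefs": "conn_lakefs",
}

variables_file = write_airflow_variables(
    demo_variables, Path(tempfile.gettempdir()) / "lakefs_demo_variables.json"
)
print(f"airflow variables import {variables_file}")
```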

## Copy DAG programs to Airflow DAGs directory and sync to Airflow database

In [1]:
! cp ./airflow/Hooks/lakefs_hooks_post_commit_dag.py ./airflow/dags
! cp ./airflow/Hooks/lakefs_hooks_pre_merge_dag.py ./airflow/dags

from airflow.models import DagBag
dagbag = DagBag(include_examples=False)
dagbag.sync_to_db()

cp: cannot stat './airflow/Hooks/lakefs_hooks_post_commit_dag.py': No such file or directory
cp: cannot stat './airflow/Hooks/lakefs_hooks_pre_merge_dag.py': No such file or directory
[2022-12-07 23:16:51,676] {dagbag.py:508} INFO - Filling up the DagBag from /home/jovyan/airflow/dags
[2022-12-07 23:16:51,976] {dagbag.py:321} ERROR - Failed to import: /home/jovyan/airflow/dags/.ipynb_checkpoints/lakefs_hooks_post_commit_dag-checkpoint.py
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/airflow/models/dagbag.py", line 318, in parse
    loader.exec_module(new_module)
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/jovyan/airflow/dags/.ipynb_checkpoints/lakefs_hooks_post_commit_dag-checkpoint.py", line 100, in <module>
    post_commit_dag = lakefs_hooks_post_commit_dag()
  File "/opt/con
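
The output above shows two separate problems: `cp` could not find the source files, and the DagBag tried to import a Jupyter checkpoint copy under `.ipynb_checkpoints`. A minimal sketch of a more defensive copy step follows. The helper name `copy_dags` is hypothetical, and the `.airflowignore` entry is an assumption about how to keep checkpoint copies out of DAG parsing, not something this notebook does.

```python
import shutil
from pathlib import Path


def copy_dags(source_dir, dags_dir, file_names):
    """Copy DAG files into dags_dir and return (copied, missing) lists of file names."""
    source_dir, dags_dir = Path(source_dir), Path(dags_dir)
    dags_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for name in file_names:
        source = source_dir / name
        if source.is_file():
            shutil.copy2(source, dags_dir / name)
            copied.append(name)
        else:
            missing.append(name)  # Report instead of failing silently like `! cp`.
    # Assumption: an .airflowignore entry excludes Jupyter checkpoint copies from parsing.
    ignore_file = dags_dir / ".airflowignore"
    if not ignore_file.exists():
        ignore_file.write_text(".ipynb_checkpoints\n")
    return copied, missing
```

For this notebook the call might look like `copy_dags("./airflow/Hooks", "./airflow/dags", ["lakefs_hooks_post_commit_dag.py", "lakefs_hooks_pre_merge_dag.py"])`, and a non-empty `missing` list would indicate that the working directory is not the one the relative paths expect.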

## Unpause Airflow DAGs

In [2]:
! airflow dags unpause lakefs_hooks_post_commit_dag
! airflow dags unpause lakefs_hooks_pre_merge_dag

Dag: lakefs_hooks_post_commit_dag, paused: False
Dag: lakefs_hooks_pre_merge_dag, paused: False
