# [Integration of lakeFS with Airflow](https://docs.lakefs.io/integrations/airflow.html)

## Use Case: Troubleshooting production issues

## Change your lakeFS credentials

In [None]:
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io'

## Optionally change the lakeFS repository name (use an existing repository or provide a new name)

In [None]:
repo = "my-repo"

## Versioning Information

In [None]:
sourceBranch = "main"
newBranch = "airflow_demo"
newPath = "partitioned_data"

## Working with the lakeFS Python client API

In [None]:
%xmode Minimal
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

client = LakeFSClient(configuration)

## Verify the lakeFS credentials by getting the lakeFS version

In [None]:
client.config.get_lake_fs_version()

## Storage Information - Optional on Playground
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://<S3 Bucket Name>/' # e.g. "s3://username-lakefs-cloud/"
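A quick sanity check can catch a storage namespace that still contains the placeholder text before it reaches lakeFS. The sketch below is only an illustration: `check_storage_namespace` is a hypothetical helper, not part of the lakeFS client, and it assumes an S3 namespace like the example above.

```python
from urllib.parse import urlparse


def check_storage_namespace(namespace: str) -> str:
    """Return the namespace unchanged if it looks like s3://bucket[/prefix].

    Hypothetical helper for this notebook; assumes S3-backed storage.
    """
    # Reject the unedited template value such as 's3://<S3 Bucket Name>/'.
    if "<" in namespace or ">" in namespace:
        raise ValueError(f"placeholder not replaced: {namespace!r}")
    parsed = urlparse(namespace)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {namespace!r}")
    if not parsed.netloc:
        raise ValueError(f"bucket name is missing: {namespace!r}")
    return namespace
```

In the notebook this could be called as `check_storage_namespace(storageNamespace)` right after the assignment.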

## Create Repository - Optional on Playground or if the repository already exists

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=sourceBranch))
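Because this step is optional when the repository already exists, it can help to check first and create only when needed. The following is a minimal sketch, not the notebook's own method: `ensure_repository` is a hypothetical helper, and it assumes that the repositories API exposes `get_repository(repository=...)` and that a missing repository surfaces as an exception carrying `status == 404`.

```python
def ensure_repository(repositories_api, name, make_creation):
    """Create the repository only if a lookup reports it as missing.

    repositories_api: object with get_repository/create_repository methods
                      (assumed to be `client.repositories` in this notebook).
    make_creation:    zero-argument callable returning the creation payload.
    Returns True if a repository was created, False if it already existed.
    """
    try:
        repositories_api.get_repository(repository=name)
        return False  # lookup succeeded, nothing to create
    except Exception as exc:
        # Assumption: a "not found" error exposes HTTP status 404.
        if getattr(exc, "status", None) != 404:
            raise
    repositories_api.create_repository(repository_creation=make_creation())
    return True

# Possible use in this notebook (names taken from the cells above):
# ensure_repository(
#     client.repositories, repo,
#     lambda: models.RepositoryCreation(
#         name=repo, storage_namespace=storageNamespace,
#         default_branch=sourceBranch))
```

Errors other than "not found" (for example bad credentials) are re-raised, so they are not mistaken for a missing repository.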

## Review the [demo workflow](./Airflow/lakefs-dag.py) program and copy it to the Airflow DAGs directory. If you change the program, run this command again.

In [None]:
! sudo cp ./Airflow/lakefs-dag.py /root/airflow/dags

## Start Airflow

In [None]:
%%script bash --bg --out script_out --err script_error
sudo pkill airflow
sudo airflow standalone
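The cell above starts Airflow in the background, so later cells need some way to know the webserver is ready. A polling loop is one option. The sketch below is illustrative only: both function names are hypothetical, and the `/health` URL on port 8080 is an assumption about the local Airflow webserver that may differ between Airflow versions.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout=120.0, interval=2.0):
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def airflow_webserver_up(url="http://127.0.0.1:8080/health"):
    """Return True if the assumed health URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or the request failed

# Possible use in this notebook:
# if not wait_until_ready(airflow_webserver_up):
#     raise RuntimeError("Airflow did not become ready in time")
```

Passing the probe as a parameter keeps the waiting logic independent of how readiness is detected.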

## Create Airflow connections for lakeFS and Spark

In [None]:
# Wait for Airflow to start
! sleep 10

! sudo airflow connections delete conn_lakefs
lakeFSConnectionCommand = 'airflow connections add conn_lakefs --conn-type=http --conn-host=' + lakefsEndPoint + ' --conn-extra=\'{"access_key_id":"' + lakefsAccessKey + '","secret_access_key":"' + lakefsSecretKey + '"}\''
! sudo $lakeFSConnectionCommand

! sudo airflow connections delete conn_spark
sparkConnectionCommand = 'airflow connections add conn_spark --conn-type=spark --conn-host=local[*]'
! sudo $sparkConnectionCommand
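Building the command by string concatenation works for simple keys, but a secret containing a quote or other shell metacharacter could break the quoting. As a hedged alternative, the sketch below assembles the same `airflow connections add` arguments as a list and lets `json.dumps` produce the extra field. The helper name and the example credential values are made up for illustration.

```python
import json
import shlex


def lakefs_connection_args(conn_id, endpoint, access_key, secret_key):
    """Return the argument list for `airflow connections add` (no shell involved)."""
    extra = json.dumps(
        {"access_key_id": access_key, "secret_access_key": secret_key}
    )
    return [
        "airflow", "connections", "add", conn_id,
        "--conn-type=http",
        f"--conn-host={endpoint}",
        f"--conn-extra={extra}",
    ]


# Example values only; substitute the notebook's own variables.
args = lakefs_connection_args(
    "conn_lakefs", "https://example.lakefscloud.io",
    "EXAMPLE-ACCESS-KEY", "secret/with'quote",
)
# shlex.join(args) yields a copy-pastable, safely quoted shell command.
# To run it directly: subprocess.run(["sudo", *args], check=True)
```

Passing a list to `subprocess.run` bypasses the shell entirely, so no manual escaping of the JSON is needed.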

## Set the Airflow variables used by the demo workflow

In [None]:
! sudo airflow variables set lakefsAccessKey $lakefsAccessKey
! sudo airflow variables set lakefsSecretKey $lakefsSecretKey
! sudo airflow variables set lakefsEndPoint $lakefsEndPoint
! sudo airflow variables set repo $repo
! sudo airflow variables set sourceBranch $sourceBranch
! sudo airflow variables set newBranch $newBranch
! sudo airflow variables set newPath $newPath
! sudo airflow variables set conn_lakefs 'conn_lakefs'

import os
spark_home = os.getenv('SPARK_HOME')
! sudo airflow variables set spark_home $spark_home
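`os.getenv('SPARK_HOME')` returns `None` when the environment variable is unset, in which case the cell above would most likely store an unusable value. The sketch below shows one way to guard against that by validating every value before building the `airflow variables set` commands. `variable_commands` is a hypothetical helper and the sample values are placeholders.

```python
def variable_commands(variables):
    """Return one `airflow variables set KEY VALUE` argument list per entry.

    Raises ValueError if any value is None or empty, so a missing setting
    is reported before anything is written to Airflow.
    """
    commands = []
    for key, value in variables.items():
        if value is None or value == "":
            raise ValueError(f"Airflow variable {key!r} has no value")
        commands.append(["airflow", "variables", "set", key, str(value)])
    return commands


# Placeholder values for illustration; use the notebook's variables instead.
example = variable_commands({"repo": "my-repo", "sourceBranch": "main"})
# Each entry can be run with: subprocess.run(["sudo", *command], check=True)
```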

## Set the fileName Airflow variable. This file is used by the demo workflow.

In [None]:
fileName = "lakefs_test.csv"
! sudo airflow variables set fileName $fileName

## Find the Airflow admin password and copy it

In [None]:
! sudo cat /root/airflow/standalone_admin_password.txt

## Visualize the [demo workflow DAG Graph](http://127.0.0.1:8080/dags/lakeFS_workflow/graph) in the Airflow UI. Log in with the username admin and the password printed by the previous command.

## Trigger the demo workflow

In [None]:
! sudo airflow dags unpause lakeFS_workflow
! sudo airflow dags trigger lakeFS_workflow

## Visualize the [demo workflow DAG Graph](http://127.0.0.1:8080/dags/lakeFS_workflow/graph).
### Toggle the Auto Refresh switch in the DAG Graph to follow the workflow's progress.
### Click any task box, then click the Log button and search for "lakeFS URL" (this URL takes you to the applicable branch, commit, or data file).

## Once the demo workflow finishes (in about 5 minutes), switch to the latest file. This file contains bad data and will cause the workflow to fail.

In [None]:
fileName = "lakefs_test_latest_file.csv"
! sudo airflow variables set fileName $fileName

## Trigger the demo workflow again, this time with the latest file

In [None]:
! sudo airflow dags trigger lakeFS_workflow

## Visualize the [demo workflow DAG Graph](http://127.0.0.1:8080/dags/lakeFS_workflow/graph) for the new run with the latest file.

### Task "etl_task3" will fail in this run. Click the "etl_task3" task box, then click the Log button and search for "Exception". You will see the following exception:
### "Partition column _c4 not found in schema struct<_c0:string,_c1:string,_c2:string,_c3:string>"

### This exception occurs because column "_c4" (the fifth column) is missing from the latest file.
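A missing column like this can also be spotted before the workflow runs by counting the fields in each row of the file. The following is a minimal sketch using only the standard library. `rows_missing_columns` is a hypothetical helper, and the sample rows are invented for illustration rather than taken from the demo files.

```python
import csv
import io


def rows_missing_columns(lines, expected_columns=5):
    """Return (line_number, field_count) for every row with too few fields."""
    short = []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) < expected_columns:
            short.append((line_number, len(row)))
    return short


# Invented sample: the second row has only four fields, so a fifth
# column (which Spark would name _c4) cannot be read from it.
sample = io.StringIO("a,b,c,d,e\n1,2,3,4\n")
print(rows_missing_columns(sample))  # prints [(2, 4)]
```

For a real file, an open file handle can be passed in place of the `StringIO` object, for example `rows_missing_columns(open(path, newline=""))`.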