# Integration of lakeFS with Dagster

## Use Case: Troubleshooting production issues

## Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### You can either install lakeFS locally (https://docs.lakefs.io/quickstart.html) or spin up a free instance on lakeFS Cloud (https://lakefs.cloud).

## Setup Task: Change your lakeFS credentials

In [None]:
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

## Setup Task: Optionally change the lakeFS repo name (use an existing repo or provide another repo name)

In [None]:
repo_name = "dagster-new-dag-repo"

## Setup Task: Storage Information
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://<S3 Bucket Name>/' # e.g. "s3://username-lakefs-cloud/"

## Setup Task: Run additional [Setup](./jobs/New_DAG/NewDAGSetup.ipynb) tasks here

In [None]:
%run ./jobs/New_DAG/NewDAGSetup.ipynb

## You can review the [lakeFS New DAG](./jobs/New_DAG/lakefs_new_dag.py) program

## Launch and run the lakeFS New DAG (the DAG takes around 30 seconds to run)
#### Click any of the URLs generated in the output log. They take you to the applicable branch, commit, or data file in lakeFS. The same URLs also appear in the Dagster logs.

In [None]:
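import datetime

# newBranch, sourceBranch, newPath and fileName are expected to be defined by the setup notebook run above.
# A timestamp suffix gives every run its own branch name, which is also used as the Dagster run ID.
date_time = datetime.datetime.now()
newBranchTS = newBranch + date_time.strftime("_%Y%m%dT%H%M%S")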

launchRunCommand = 'dagster job launch -f ./jobs/New_DAG/lakefs_new_dag.py --job lakefs_new_dag \
--config-json \'{"resources": {"variables": {"config": {"repo": "' + repo_name + '", "sourceBranch": "' + \
sourceBranch + '", "newBranch": "' + newBranchTS + '", "newPath": "' + newPath + '", "fileName": "' + fileName + '"}}}}\' \
--run-id ' + newBranchTS

!! $launchRunCommand
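
The launch command above is assembled by concatenating strings, so a quote or space in any value would break the JSON or the shell quoting. A minimal sketch of an alternative, assuming the same CLI flags and config keys used in the cell above: the helper `build_launch_command` is a hypothetical name, not part of Dagster or lakeFS. It serializes the run config with `json.dumps` and quotes each argument with `shlex.quote`.

```python
import json
import shlex


def build_launch_command(repo, source_branch, new_branch, new_path, file_name):
    """Return a shell-safe `dagster job launch` command string (hypothetical helper)."""
    # Same nested run config as the hand-built JSON string in the cell above.
    run_config = {
        "resources": {
            "variables": {
                "config": {
                    "repo": repo,
                    "sourceBranch": source_branch,
                    "newBranch": new_branch,
                    "newPath": new_path,
                    "fileName": file_name,
                }
            }
        }
    }
    args = [
        "dagster", "job", "launch",
        "-f", "./jobs/New_DAG/lakefs_new_dag.py",
        "--job", "lakefs_new_dag",
        "--config-json", json.dumps(run_config),
        "--run-id", new_branch,
    ]
    # Quote every argument so the JSON survives the shell unchanged.
    return " ".join(shlex.quote(arg) for arg in args)
```

In the notebook, the result could be assigned to `launchRunCommand` (for example `build_launch_command(repo_name, sourceBranch, newBranchTS, newPath, fileName)`) and run with `!! $launchRunCommand` as before.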

## Run the following cell to generate the Dagit UI URL

In [None]:
from IPython.display import Markdown as md

md("Click [here to visualize lakeFS New DAG Graph](http://127.0.0.1:3000/runs/%s) in Dagit UI"%(newBranchTS))

## Now use the latest file. This file contains bad data and will cause the DAG to fail.

In [None]:
fileName = "lakefs_test_latest_file.csv"

## Launch and run the lakeFS New DAG again, using the latest file

#### The "etl_task3" operation will fail in this case. You will see the following exception:
#### "Column _c4 not found in schema struct<_c0:string,_c1:string,_c2:string,_c3:string>"

#### This exception occurs because column "_c4" (the 5th column) is missing from the latest file.
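
A missing column like this can be spotted before the DAG runs. The following is a minimal sketch, not part of the lakeFS New DAG program: `rows_missing_columns` is a hypothetical helper, and the expected count of 5 is an assumption taken from the exception above (columns `_c0` through `_c4`). It reports which rows of a CSV have fewer fields than expected.

```python
import csv
import io


def rows_missing_columns(csv_text, expected_columns=5):
    """Return the 1-based numbers of rows with fewer than expected_columns fields."""
    reader = csv.reader(io.StringIO(csv_text))
    short_rows = []
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # csv.reader yields [] for blank lines; ignore them
        if len(row) < expected_columns:
            short_rows.append(row_number)
    return short_rows


# Illustrative data only, not the contents of the real test files.
good_sample = "a,b,c,d,e\n1,2,3,4,5\n"
bad_sample = "a,b,c,d\n1,2,3,4\n"
print(rows_missing_columns(good_sample))
print(rows_missing_columns(bad_sample))
```

An empty list means every row has at least the expected number of fields. Any row numbers returned point at data that would be missing column "_c4".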

In [None]:
# Use a fresh timestamped branch name (and run ID) for the run against the latest file.
date_time = datetime.datetime.now()
newBranchTSLatest = newBranch + date_time.strftime("_%Y%m%dT%H%M%S")

launchRunCommand = 'dagster job launch -f ./jobs/New_DAG/lakefs_new_dag.py --job lakefs_new_dag \
--config-json \'{"resources": {"variables": {"config": {"repo": "' + repo_name + '", "sourceBranch": "' + \
sourceBranch + '", "newBranch": "' + newBranchTSLatest + '", "newPath": "' + newPath + '", "fileName": "' + fileName + '"}}}}\' \
--run-id ' + newBranchTSLatest

!! $launchRunCommand

## Run the following cell to generate the Dagit UI URL

In [None]:
from IPython.display import Markdown as md

md("Click [here to visualize lakeFS New DAG Graph](http://127.0.0.1:3000/runs/%s) in Dagit UI"%(newBranchTSLatest))

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack