# Use Case: Reprocess and Backfill Data with new ETL logic

### You will run the following steps in this notebook (refer to the image below):

#### Step 1: Create a repository with the Main branch
#### Step 2: Create an ingestion branch from the Main branch, ingest the data file, run the ETL job, commit the changes, and merge the ingestion branch into the Main branch
#### Step 3: Create a new-logic branch from the Main branch, fix the ETL logic, and commit the changes
#### Step 4: Repeat Step 2
#### Step 5: Create a backfill-and-deploy branch from the Main branch, run the new ETL logic, overwrite the processed data, and commit the changes
#### Step 6: Merge the backfill-and-deploy branch into the Main branch

![Reprocess](./Images/ReprocessData/Reprocess.png)
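
The flow in the image can be rehearsed without a server. The sketch below is a toy, in-memory model of branching and merging; it is not the lakeFS API. `ToyRepo`, `run_etl`, `reprocess`, `old_logic`, `new_logic`, the file names, and the sample rows are all made up for illustration, and the merge is simplified to "source overwrites destination".

```python
import copy


class ToyRepo:
    """In-memory stand-in for a versioned repository (NOT the lakeFS API)."""

    def __init__(self, default_branch="main"):
        self.branches = {default_branch: {}}

    def create_branch(self, name, source):
        # A new branch starts as a snapshot of its source branch.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, path, content):
        self.branches[branch][path] = content

    def merge(self, source, destination):
        # Simplified merge: objects on the source overwrite the destination.
        self.branches[destination].update(copy.deepcopy(self.branches[source]))


def old_logic(value):
    return f"old:{value}"  # hypothetical original (faulty) transformation


def new_logic(value):
    return f"new:{value}"  # hypothetical fixed transformation


def run_etl(repo, branch, file_name, rows, transform):
    # Ingest one raw file and append its processed rows.
    repo.write(branch, f"raw/{file_name}", rows)
    processed = repo.branches[branch].get("processed", [])
    repo.write(branch, "processed", processed + [transform(r) for r in rows])


def reprocess(repo, branch, transform):
    # Re-run the transformation over every raw file and overwrite the output.
    raw = [
        row
        for path, rows in sorted(repo.branches[branch].items())
        if path.startswith("raw/")
        for row in rows
    ]
    repo.write(branch, "processed", [transform(r) for r in raw])


repo = ToyRepo()                                    # Step 1
repo.create_branch("ingest", "main")                # Step 2
run_etl(repo, "ingest", "file1", ["a", "b"], old_logic)
repo.merge("ingest", "main")
repo.create_branch("new-logic", "main")             # Step 3
reprocess(repo, "new-logic", new_logic)
run_etl(repo, "ingest", "file2", ["c"], old_logic)  # Step 4
repo.merge("ingest", "main")
repo.create_branch("backfill-and-deploy", "main")   # Step 5
reprocess(repo, "backfill-and-deploy", new_logic)
repo.merge("backfill-and-deploy", "main")           # Step 6

print(repo.branches["new-logic"]["processed"])
print(repo.branches["main"]["processed"])
```

In this toy model the `new-logic` branch never sees the second file, which is why the backfill branch is cut from Main after Step 4 rather than reusing `new-logic`.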

## Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use the Playground (https://demo.lakefs.io), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/installing.html).

## Setup Task: Change your lakeFS credentials

In [None]:
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
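
If you would rather not paste keys into the notebook, one option is to read them from environment variables. This is a minimal sketch, not part of the original notebook: the helper `load_lakefs_credentials` and the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, and `LAKEFS_SECRET_KEY` are hypothetical, so pick whatever names suit your environment.

```python
import os


def load_lakefs_credentials(env=os.environ):
    """Read lakeFS connection settings from environment variables.

    The variable names below are hypothetical, chosen for this sketch.
    """
    names = {
        "lakefsEndPoint": "LAKEFS_ENDPOINT",
        "lakefsAccessKey": "LAKEFS_ACCESS_KEY",
        "lakefsSecretKey": "LAKEFS_SECRET_KEY",
    }
    # Fail early with a clear message instead of connecting with empty values.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary uses the same keys as the variables in the cell above, so its values can be assigned to `lakefsEndPoint`, `lakefsAccessKey`, and `lakefsSecretKey`.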

## Setup Task: Storage Information
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://<S3 Bucket Name>/' # e.g. "s3://username-lakefs-cloud/"
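
A quick sanity check can catch a placeholder that was left unedited. The sketch below assumes an S3 namespace, as in the example above; `check_storage_namespace` is a hypothetical helper, not a lakeFS function, and the scheme check would need adjusting for other storage backends.

```python
from urllib.parse import urlparse


def check_storage_namespace(namespace):
    """Return the namespace unchanged if it looks like a filled-in S3 URI."""
    # The notebook's template value contains angle brackets.
    if "<" in namespace or ">" in namespace:
        raise ValueError("Storage namespace still contains a placeholder")
    parsed = urlparse(namespace)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected an s3://bucket/... URI, got: {namespace!r}")
    return namespace
```

For example, `check_storage_namespace(storageNamespace)` raises a `ValueError` while the value is still `'s3://<S3 Bucket Name>/'`.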

## Setup Task: Versioning Information

In [None]:
mainBranch = "main"
ingestBranch = "ingest"
fileName = "lakefs_test.csv"
processedFileName = "lakefs_test_processed.csv"

## Run additional [Setup](./ReprocessData/Setup.ipynb) tasks here

In [None]:
%run ./ReprocessData/Setup.ipynb

## You can change the lakeFS repo name (use an existing repo or provide a new repo name)

In [None]:
repo = "my-repo"

## Step 1: Create a repository with the Main branch

### (If the repo named above already exists on your lakeFS server, you can skip this step.)

![Step 1](./Images/ReprocessData/Step1.png)

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=mainBranch))
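
To make the notebook safe to re-run, the create call can be wrapped so that an "already exists" error is ignored. This is a sketch under an assumption: that the client raises an exception carrying a `status` attribute equal to 409 when the repository already exists. Verify this against your client version before relying on it. `call_ignoring_conflict` is a hypothetical helper.

```python
def call_ignoring_conflict(create_fn, conflict_status=409):
    """Run create_fn(); return None if it fails with the conflict status.

    Assumes the raised exception exposes an HTTP status code as `.status`.
    Any other error is re-raised unchanged.
    """
    try:
        return create_fn()
    except Exception as exc:
        if getattr(exc, "status", None) == conflict_status:
            print("Repository already exists; skipping creation")
            return None
        raise


# Possible usage in this notebook (not executed here):
# call_ignoring_conflict(lambda: client.repositories.create_repository(
#     repository_creation=models.RepositoryCreation(
#         name=repo,
#         storage_namespace=storageNamespace,
#         default_branch=mainBranch)))
```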

## Step 2: Create an ingestion branch from the Main branch, ingest the data file, run the ETL job, commit the changes, and merge the ingestion branch into the Main branch

### (The [ETL](./ReprocessData/ETL.ipynb) job normally runs as a batch job, but here you run it manually for the demo. This step takes around a minute.)

![Step 2](./Images/ReprocessData/Step2.png)

In [None]:
%run ./ReprocessData/ETL.ipynb

# Reprocessing Starts

## Step 3: Create a new-logic branch from the Main branch, fix the ETL logic, and commit the changes
### (you can change the name of the reprocessing branch and run the [Reprocessing](./ReprocessData/Reprocessing.ipynb) job here)

![Step 3](./Images/ReprocessData/Step3.png)

In [None]:
reprocessBranch = "new-logic"
%run ./ReprocessData/Reprocessing.ipynb

## While the ETL logic is being fixed, the old ETL job is still running in parallel.

## A new data file is received

In [None]:
fileName = "lakefs_test_new.csv"

## Step 4: Repeat Step 2

### (run the [ETL](./ReprocessData/ETL.ipynb) job again)

![Step 4](./Images/ReprocessData/Step4.png)

In [None]:
%run ./ReprocessData/ETL.ipynb

## Now the reprocessing branch is behind the Main branch in terms of data

In [None]:
print("Processed data on " + reprocessBranch + " branch")
dataPath = f"s3a://{repo}/{reprocessBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

In [None]:
print("Processed data on main branch")
dataPath = f"s3a://{repo}/{mainBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()
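
To make "behind" concrete, the rows of the two outputs can be compared directly instead of by eye. The sketch below works on plain tuples; in the notebook these could come from `df.collect()`, since Spark `Row` objects are tuples. `diff_rows` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def diff_rows(left_rows, right_rows):
    """Return rows found on only one side, respecting duplicate rows."""
    left = Counter(map(tuple, left_rows))
    right = Counter(map(tuple, right_rows))
    return {
        "only_left": sorted((left - right).elements()),
        "only_right": sorted((right - left).elements()),
    }


# Made-up rows standing in for the reprocessing branch and the Main branch.
reprocess_rows = [(1, "a"), (2, "b")]
main_rows = [(1, "a"), (2, "b"), (3, "c")]
print(diff_rows(reprocess_rows, main_rows))
```

Note that on real data the two branches may also differ because the ETL logic changed, so the diff shows both missing rows and changed rows.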

## Once the ETL logic is fixed, pause the old ETL job to deploy the new ETL logic

## Step 5: Create a backfill-and-deploy branch from the Main branch, run the new ETL logic, overwrite the processed data, and commit the changes
### (you can change the name of the "Backfill and Deploy" branch and run the [Reprocessing](./ReprocessData/Reprocessing.ipynb) job again on the "Backfill and Deploy" branch)

![Step 5](./Images/ReprocessData/Step5.png)

In [None]:
backfillAndDeployBranch = "backfill-and-deploy"
reprocessBranch = backfillAndDeployBranch
%run ./ReprocessData/Reprocessing.ipynb

## Now the "Backfill and Deploy" branch has the same data as the Main branch and the correct ETL logic

In [None]:
print("Processed data on " + backfillAndDeployBranch + " branch")
dataPath = f"s3a://{repo}/{backfillAndDeployBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

In [None]:
print("Processed data on main branch")
dataPath = f"s3a://{repo}/{mainBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()
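
The read cells in this notebook build the same `s3a://{repo}/{branch}/{file}` path each time. A small helper keeps that pattern in one place; `lakefs_path` is a hypothetical name introduced for this sketch.

```python
def lakefs_path(repo, ref, path):
    """Build the s3a URI for an object on a given branch (or other ref)."""
    # Strip leading slashes so the result has exactly one separator.
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"


print(lakefs_path("my-repo", "main", "lakefs_test_processed.csv"))
# → s3a://my-repo/main/lakefs_test_processed.csv
```

With it, a read becomes `spark.read.format("csv").schema(processedDataFileSchema).load(lakefs_path(repo, mainBranch, processedFileName))`.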

## Step 6: Merge the backfill-and-deploy branch into the Main branch

![Step 6](./Images/ReprocessData/Step6.png)

In [None]:
client.refs.merge_into_branch(
    repository=repo, source_ref=backfillAndDeployBranch, 
    destination_branch=mainBranch)

# Reprocessing and Backfill Complete

## Verify the data on the Main branch

In [None]:
print("Processed data on main branch")
dataPath = f"s3a://{repo}/{mainBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

## Now you can schedule the new ETL job

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack