<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Reprocess and Backfill Data with new ETL logic

_Note that whilst this works, it's a bit of a hack!_

You will run the following steps in this notebook (refer to the image below):

1. Create a repository with the Main branch
2. Create an ingestion branch from the Main branch, ingest a data file, run the ETL job, commit the changes and merge the ingestion branch to the Main branch
3. Create a new-logic branch from the Main branch, fix the ETL logic and commit the changes
4. Repeat step 2 with a new data file
5. Create a backfill-and-deploy branch from the Main branch, run the new ETL logic, overwrite the processed data and commit the changes
6. Merge the backfill-and-deploy branch to the Main branch

![Reprocess](./images/reprocess-data/Reprocess.png)
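
The six steps can be modelled with a toy in-memory sketch before running them against lakeFS. This is not the lakeFS API: the `branches` dictionary, the helper names (`create_branch`, `merge`, `etl`) and the branch names `ingest-1` and `ingest-2` are hypothetical, and the merge is simplified to "the source replaces the destination".

```python
import copy

# Toy in-memory stand-in for a repository: branch name -> {object name -> rows}.
# This is NOT the lakeFS API, only a model of the branching flow.
branches = {"main": {"raw": [], "processed": []}}  # Step 1


def create_branch(name, source="main"):
    """Start a new branch as a snapshot of the source branch."""
    branches[name] = copy.deepcopy(branches[source])


def merge(source, dest="main"):
    """Simplified merge: objects on the source replace those on the destination."""
    branches[dest].update(copy.deepcopy(branches[source]))


def process(rows, divisor):
    """Append Total and Average columns; divisor 4 is the bug, 5 is the fix."""
    return [row + [sum(row), sum(row) / divisor] for row in rows]


def etl(branch, new_rows, divisor):
    """Ingest new rows on a branch and (re)compute the processed data."""
    data = branches[branch]
    data["raw"] = data["raw"] + new_rows
    data["processed"] = process(data["raw"], divisor)


# Step 2: ingest the first file with the buggy logic, then merge
create_branch("ingest-1")
etl("ingest-1", [[1.0, 2.0, 3.0, 4.0, 5.0]], divisor=4)
merge("ingest-1")

# Step 3: fix the logic on its own branch (not merged)
create_branch("new-logic")
etl("new-logic", [], divisor=5)

# Step 4: the old job keeps running and ingests a second file
create_branch("ingest-2")
etl("ingest-2", [[10.0, 20.0, 30.0, 40.0, 50.0]], divisor=4)
merge("ingest-2")

# Step 5: backfill everything now on main with the fixed logic
create_branch("backfill-and-deploy")
etl("backfill-and-deploy", [], divisor=5)

# Step 6: publish the corrected data
merge("backfill-and-deploy")

print(len(branches["new-logic"]["processed"]))              # 1 (behind main)
print([row[-1] for row in branches["main"]["processed"]])   # [3.0, 30.0]
```

The sketch shows why step 5 branches from Main again instead of merging `new-logic`: by then Main holds a second file that `new-logic` never saw.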

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "reprocess-backfill-data"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [5]:
print("Verifying lakeFS credentials…")
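try:
    v=lakefs.config.get_config()
except Exception as e:
    # Catch Exception rather than using a bare except, and show why the call failed
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v['version_config']['version']}")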

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version 0.104.0


### Set up Spark

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

### Versioning Information

In [7]:
mainBranch = "main"
ingestBranch = "ingest"
fileName = "lakefs_test.csv"
processedFileName = "lakefs_test_processed.csv"

### Define data file schema + python libs

In [8]:
from pyspark.sql.types import DoubleType, StructType, StructField
import datetime
import os

dataFileSchema = StructType([
  StructField("Apparel_Sales", DoubleType(), False),
  StructField("Books_Sales", DoubleType(), False),
  StructField("Electronics_Sales", DoubleType(), False),
  StructField("Furniture_Sales", DoubleType(), False),
  StructField("Toys_Sales", DoubleType(), False)
])
processedDataFileSchema = StructType([
  StructField("Apparel_Sales", DoubleType(), False),
  StructField("Books_Sales", DoubleType(), False),
  StructField("Electronics_Sales", DoubleType(), False),
  StructField("Furniture_Sales", DoubleType(), False),
  StructField("Toys_Sales", DoubleType(), False),
  StructField("Total_Sales", DoubleType(), False),
  StructField("Average_Sales_per_Product_Category", DoubleType(), False)
])

---

## Step 1: Create repository with the Main branch

### (if the above-mentioned repo already exists on your lakeFS server, you can skip this operation)

![Step 1](./images/reprocess-data/Step1.png)

In [9]:
from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(
            repository_creation=RepositoryCreation(
                name=repo_name,
                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository reprocess-backfill-data does not exist, so going to try and create it now.
Created new repo reprocess-backfill-data using storage namespace s3://example/reprocess-backfill-data


## Step 2: Create ingestion branch from the Main branch, ingest data file, run the ETL job, commit the changes and merge ingestion branch to the Main branch

### (The [ETL](./ReprocessData/ETL.ipynb) job normally runs as a batch job, but here you run it manually for the demo. This step takes around a minute to run.)

![Step 2](./images/reprocess-data/Step2.png)

In [10]:
%run ./reprocess-data/etl.ipynb

🟩 Created ingestion branch: ingest_2023-07-17_07-55-23
🟩 Ingested data file: lakefs_test.csv

🟩 Reading data from ingestion branch
+-------------+-----------+-----------------+---------------+----------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|
+-------------+-----------+-----------------+---------------+----------+
|          1.0|        2.0|              3.0|            4.0|       5.0|
+-------------+-----------+-----------------+---------------+----------+

🟩 Processed data with wrong value for Average field. Average value is Total divided 4 instead of dividing by 5
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|          1.0|        2.0|     
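
The ingestion branch name in the output above carries a timestamp. The ETL notebook itself is not shown here, so the following is only a sketch of how such a name could be built; the function name `ingest_branch_name` and the exact format string are assumptions inferred from the printed name.

```python
from datetime import datetime


def ingest_branch_name(prefix: str, now: datetime) -> str:
    """Build a name like 'ingest_2023-07-17_07-55-23' (format assumed from the output)."""
    return f"{prefix}_{now.strftime('%Y-%m-%d_%H-%M-%S')}"


print(ingest_branch_name("ingest", datetime(2023, 7, 17, 7, 55, 23)))
# prints: ingest_2023-07-17_07-55-23
```

A per-run timestamp gives every ETL run its own branch, so a failed run can be discarded without touching Main.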

# Reprocessing Starts

## Step 3: Create new-logic branch from the Main branch, fix ETL logic and commit the changes
### (you can change the name of the reprocessing branch and run the [Reprocessing](./ReprocessData/Reprocessing.ipynb) job here)

![Step 3](./images/reprocess-data/Step3.png)

In [11]:
reprocessBranch = "new-logic"
%run ./reprocess-data/reprocessing.ipynb

🟩 Created new-logic branch from main branch

🟩 Reading data from new-logic branch
+-------------+-----------+-----------------+---------------+----------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|
+-------------+-----------+-----------------+---------------+----------+
|          1.0|        2.0|              3.0|            4.0|       5.0|
+-------------+-----------+-----------------+---------------+----------+

🟩 Processed data with correct value for Average field
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|          1.0|        2.0|              3.0|            4.0|       5.0|       15.0|                               3.0|
+-------------+---

## While the ETL logic is being fixed, the old ETL job keeps running in parallel

## Received new data file

In [12]:
fileName = "lakefs_test_new.csv"

## Step 4: Repeat step 2

### (run the [ETL](./ReprocessData/ETL.ipynb) job again)

![Step 4](./images/reprocess-data/Step4.png)

In [13]:
%run ./reprocess-data/etl.ipynb

🟩 Created ingestion branch: ingest_2023-07-17_07-55-41
🟩 Ingested data file: lakefs_test_new.csv

🟩 Reading data from ingestion branch
+-------------+-----------+-----------------+---------------+----------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|
+-------------+-----------+-----------------+---------------+----------+
|         10.0|       20.0|             30.0|           40.0|      50.0|
+-------------+-----------+-----------------+---------------+----------+

🟩 Processed data with wrong value for Average field. Average value is Total divided 4 instead of dividing by 5
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|         10.0|       20.0| 

## Now the reprocessing branch is behind the Main branch in terms of data

In [14]:
print("Processed data on " + reprocessBranch + " branch")
dataPath = f"s3a://{repo.id}/{reprocessBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

Processed data on new-logic branch
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|          1.0|        2.0|              3.0|            4.0|       5.0|       15.0|                               3.0|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
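
The read cells in this notebook build the same `s3a://<repo>/<branch>/<file>` path each time. A small helper could keep that in one place; `lakefs_s3a_path` is a hypothetical name and not part of the notebook or of any lakeFS library.

```python
def lakefs_s3a_path(repo_id: str, ref: str, object_path: str) -> str:
    """Return the s3a:// URI for an object on a given branch (or other ref)."""
    # Strip a leading slash so the result never contains '//' after the ref
    return f"s3a://{repo_id}/{ref}/{object_path.lstrip('/')}"


print(lakefs_s3a_path("reprocess-backfill-data", "new-logic", "lakefs_test_processed.csv"))
# prints: s3a://reprocess-backfill-data/new-logic/lakefs_test_processed.csv
```

In the cells here it would be called as `lakefs_s3a_path(repo.id, reprocessBranch, processedFileName)`.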



In [15]:
print("Processed data on main branch")
dataPath = f"s3a://{repo.id}/{mainBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

Processed data on main branch
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|         10.0|       20.0|             30.0|           40.0|      50.0|      150.0|                              37.5|
|          1.0|        2.0|              3.0|            4.0|       5.0|       15.0|                              3.75|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
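
The averages on Main (37.5 and 3.75) come from dividing the total by 4 rather than by the 5 product categories. A plain-Python sketch of that arithmetic follows, independent of Spark; the function name `averages` is hypothetical.

```python
def averages(row):
    """Return (total, buggy average, correct average) for one row of sales."""
    total = sum(row)
    buggy = total / 4           # the old ETL logic divides by 4
    correct = total / len(row)  # one divisor per category, 5 here
    return total, buggy, correct


print(averages([10.0, 20.0, 30.0, 40.0, 50.0]))  # (150.0, 37.5, 30.0)
print(averages([1.0, 2.0, 3.0, 4.0, 5.0]))       # (15.0, 3.75, 3.0)
```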



## Once the ETL logic is fixed, pause the old ETL job to deploy the new ETL logic

## Step 5: Create backfill-and-deploy branch from the Main branch, run new ETL logic, overwrite processed data and commit the changes
### (you can change the name of the "Backfill and Deploy" branch and run the [Reprocessing](./ReprocessData/Reprocessing.ipynb) job again on that branch)

![Step 5](./images/reprocess-data/Step5.png)

In [16]:
backfillAndDeployBranch = "backfill-and-deploy"
reprocessBranch = backfillAndDeployBranch
%run ./reprocess-data/reprocessing.ipynb

🟩 Created backfill-and-deploy branch from main branch

🟩 Reading data from backfill-and-deploy branch
+-------------+-----------+-----------------+---------------+----------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|
+-------------+-----------+-----------------+---------------+----------+
|         10.0|       20.0|             30.0|           40.0|      50.0|
|          1.0|        2.0|              3.0|            4.0|       5.0|
+-------------+-----------+-----------------+---------------+----------+

🟩 Processed data with correct value for Average field
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|         10.0|       20.0|             30.0|

## Now the "Backfill and Deploy" branch has the same data as the Main branch and the correct ETL logic

In [17]:
print("Processed data on " + backfillAndDeployBranch + " branch")
dataPath = f"s3a://{repo.id}/{backfillAndDeployBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

Processed data on backfill-and-deploy branch
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|         10.0|       20.0|             30.0|           40.0|      50.0|      150.0|                              30.0|
|          1.0|        2.0|              3.0|            4.0|       5.0|       15.0|                               3.0|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+



In [18]:
print("Processed data on main branch")
dataPath = f"s3a://{repo.id}/{mainBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

Processed data on main branch
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|         10.0|       20.0|             30.0|           40.0|      50.0|      150.0|                              37.5|
|          1.0|        2.0|              3.0|            4.0|       5.0|       15.0|                              3.75|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+



## Step 6: Merge backfill-and-deploy branch to the Main branch

![Step 6](./images/reprocess-data/Step6.png)

In [19]:
lakefs.refs.merge_into_branch(
    repository=repo.id, source_ref=backfillAndDeployBranch, 
    destination_branch=mainBranch)

{'reference': '1681878bdebe8e4b665ec5e7ae837ebac1e441336b7410eabf3732cc0b0b2a82'}

# Reprocessing and Backfill Complete

## Verify data on Main branch

In [20]:
print("Processed data on main branch")
dataPath = f"s3a://{repo.id}/{mainBranch}/{processedFileName}"

df = spark.read.format("csv").schema(processedDataFileSchema).load(dataPath)
df.show()

Processed data on main branch
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|Apparel_Sales|Books_Sales|Electronics_Sales|Furniture_Sales|Toys_Sales|Total_Sales|Average_Sales_per_Product_Category|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
|         10.0|       20.0|             30.0|           40.0|      50.0|      150.0|                              30.0|
|          1.0|        2.0|              3.0|            4.0|       5.0|       15.0|                               3.0|
+-------------+-----------+-----------------+---------------+----------+-----------+----------------------------------+
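
Beyond eyeballing the table, the merged data could be checked programmatically. The sketch below works on plain dictionaries so it runs without Spark; `rows_with_bad_average` and `CATEGORY_COLUMNS` are hypothetical names, and the sample rows are copied from the output above.

```python
import math

CATEGORY_COLUMNS = [
    "Apparel_Sales", "Books_Sales", "Electronics_Sales",
    "Furniture_Sales", "Toys_Sales",
]


def rows_with_bad_average(rows):
    """Return rows whose stored average differs from total / number of categories."""
    bad = []
    for row in rows:
        expected = sum(row[c] for c in CATEGORY_COLUMNS) / len(CATEGORY_COLUMNS)
        if not math.isclose(row["Average_Sales_per_Product_Category"], expected):
            bad.append(row)
    return bad


# Rows as shown on the main branch after the merge
main_rows = [
    {"Apparel_Sales": 10.0, "Books_Sales": 20.0, "Electronics_Sales": 30.0,
     "Furniture_Sales": 40.0, "Toys_Sales": 50.0, "Total_Sales": 150.0,
     "Average_Sales_per_Product_Category": 30.0},
    {"Apparel_Sales": 1.0, "Books_Sales": 2.0, "Electronics_Sales": 3.0,
     "Furniture_Sales": 4.0, "Toys_Sales": 5.0, "Total_Sales": 15.0,
     "Average_Sales_per_Product_Category": 3.0},
]

print(rows_with_bad_average(main_rows))  # [] means every average is correct
```

With the Spark DataFrame from the cell above, the input could be produced with `[r.asDict() for r in df.collect()]`, which is fine for a demo-sized table.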



## Now you can schedule the new ETL job

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack