<a href="https://colab.research.google.com/github/treeverse/lakeFS-samples/blob/main/Colab/dev_test_environments_with_lakefs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Isolated dev/test environments with lakeFS

--- 

<small>Expected time to complete: **5-10 minutes**</small>

--- 

In this quick tutorial, we will use lakeFS to create a test/dev environment for data.

## What we'll cover:

1. Setting up a [lakeFS playground](https://demo.lakefs.io/) environment (preloaded with sample data)
1. Using [lakeFS Branches](https://docs.lakefs.io/understand/model.html#branches) to create an isolated, zero-copy environment to experiment on
1. Manipulating data on a branch without affecting other branches
1. Committing changes as well as undoing changes we didn't like

## ▶️ Let's get started! 

Let's start by setting up a lakeFS playground environment! 

This environment lives for 24 hours and is then automatically destroyed. 
Please use it only for educational purposes 😊

In [None]:
#@title 🛠️ Create a new lakeFS playground instance { vertical-output: true }

# @markdown 👈 Enter your email address and hit the **Run Cell** button

import time
import sys
from google.colab import data_table

print('⏳ Installing dependencies... ') 

!pip install lakefs-playground-utils > /dev/null
!wget -qO- https://github.com/treeverse/lakeFS/releases/download/v0.86.0/lakeFS_0.86.0_Linux_x86_64.tar.gz | tar xvz -C /usr/local/bin/ >/dev/null

import playground
import pandas as pd

data_table.enable_dataframe_formatter()
pd.set_option('display.max_rows', 1000)

print('✅ Dependencies installed!\n')


email = "" #@param {type:"string"}

# @markdown <small>By submitting this form I agree to receive communications from lakeFS</small>

if not email or not playground.check_email(email):
  print('❌ Invalid e-mail provided')
else:
  print('⏳ setting up a lakeFS playground environment...')
  # Creates a lakeFS playground environment, or returns an existing one
  conn = playground.get_or_create(email) 
  # Sets up a `lakefs://` protocol handler for pandas, 
  #  pre-configured to read+write from our playground environment
  playground.mount(conn) 
  # Returns an instance of the lakeFS Python SDK client. We'll use this later to mess around with branches
  client = playground.client(conn)  


## 🔎 Let's explore our new environment

1. Log into your playground environment using the credentials given above
1. You'll notice there's a repository already set up for you, called `my-repo`
1. Navigate into `my-repo` - you'll be greeted by an object browser, defaulting to the **`main`** branch.
1. In a production environment, this branch will likely be used by all data consumers. This is assumed to be our "production" data environment. Let's be careful not to modify anything here!
1. For now, let's get a taste of what's stored here, by reading from it!
1. In this example, we'll use [pandas](https://pandas.pydata.org/):


In [None]:
# lakeFS URIs adhere to this scheme: lakefs://<REPOSITORY NAME>/<REFERENCE>/<PATH>
# where <REFERENCE> could be a branch name, commit ID or tag name!
# Let's read a set of parquet files that exist on our main branch:
# (data used in this example is a parquet version of the public World Cities Database Population available on Kaggle: https://www.kaggle.com/datasets/arslanali4343/world-cities-database-population-oct2022)
df = pd.read_parquet('lakefs://my-repo/main/world-cities-database-population/raw/')
_ = df.groupby('region')['population'].sum().plot.pie(figsize=(8,8), legend=True)
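
To make the URI scheme from the cell above concrete, here is a minimal sketch that splits a `lakefs://` URI into its repository, reference and path parts using only the Python standard library. The names `LakeFSUri`, `parse_lakefs_uri` and `with_reference` are hypothetical helpers made up for this illustration; they are not part of the lakeFS SDK or of `lakefs-playground-utils`.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class LakeFSUri(NamedTuple):
    """Hypothetical container for the three parts of a lakeFS URI."""
    repository: str
    reference: str
    path: str


def parse_lakefs_uri(uri: str) -> LakeFSUri:
    """Split lakefs://<REPOSITORY NAME>/<REFERENCE>/<PATH> into its parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "lakefs" or not parsed.netloc:
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    # The first path segment is the reference (branch, commit ID or tag);
    # everything after it is the object path inside the repository.
    reference, _, path = parsed.path.lstrip("/").partition("/")
    if not reference:
        raise ValueError(f"missing reference in {uri!r}")
    return LakeFSUri(parsed.netloc, reference, path)


def with_reference(uri: str, reference: str) -> str:
    """Return the same URI, pointed at a different reference."""
    parts = parse_lakefs_uri(uri)
    return f"lakefs://{parts.repository}/{reference}/{parts.path}"


main_uri = "lakefs://my-repo/main/world-cities-database-population/raw/"
print(parse_lakefs_uri(main_uri))
print(with_reference(main_uri, "my-dev-branch"))
```

Swapping only the reference segment is all it takes to read the same path from another branch, which is exactly what we'll do later in this tutorial.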

## 🪄 Branching

Branches in lakeFS are [zero-copy](https://docs.lakefs.io/understand/performance-best-practices.html#zero-copy). 

Even if we have petabytes of data on the main branch, creating a new branch takes only milliseconds, since it's a metadata-only operation.

We can create a branch from the lakeFS UI, the [lakectl cli tool](https://docs.lakefs.io/reference/commands.html), the [lakeFS API](https://docs.lakefs.io/reference/api.html) or using [one of the existing SDKs](https://github.com/treeverse/lakeFS/tree/master/clients).

---

For the sake of this example, let's see how this is done through the `lakectl` CLI as well as the [Python SDK](https://docs.lakefs.io/integrations/python.html):



In [None]:
# Using lakectl
!lakectl branch list lakefs://my-repo/

# Using the Python SDK
client.branches.list_branches(repository='my-repo').get('results')

Cool, so we can see that we have a `main` branch that points to a commit.
Let's create a new branch and examine this again:

In [None]:
!lakectl branch create lakefs://my-repo/my-dev-branch --source lakefs://my-repo/main

# 🧐 What do you think this will return?
!lakectl branch list lakefs://my-repo/

If we look closely, we'll see that our newly created `my-dev-branch` branch has the exact same commit ID as the `main` branch it was derived from! 

💡 *Branching in lakeFS is a metadata operation, no data was actually copied!*
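
To build some intuition for why this is cheap, here is a toy model in plain Python. It is only a sketch of the general idea (a branch as a named pointer to a commit) and makes no claim about how lakeFS stores its metadata internally. The commit ID `c1`, the object names and the `create_branch` helper are all made up for this illustration.

```python
# Toy model: a commit maps object paths to (hypothetical) stored blobs.
commits = {
    "c1": {
        "raw/part-0.parquet": "blob-a",
        "raw/part-1.parquet": "blob-b",
    },
}

# A branch is just a name that points at a commit ID.
branches = {"main": "c1"}


def create_branch(name: str, source: str) -> None:
    """Create a branch by copying the source branch's pointer, not its data."""
    if name in branches:
        raise ValueError(f"branch {name!r} already exists")
    branches[name] = branches[source]


create_branch("my-dev-branch", "main")

# Both branches point at the same commit, and no new commit or blob was created.
print(branches["my-dev-branch"] == branches["main"])
print(len(commits))
```

In this model the cost of creating a branch does not depend on how many objects the commit references, which mirrors the "metadata only" behaviour described above.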

## Let's play around with our dev branch!

In [None]:
# Let's read the exact same data, only this time we'll replace `main` with the name of our new branch:
df = pd.read_parquet('lakefs://my-repo/my-dev-branch/world-cities-database-population/raw/')
_ = df.groupby('region')['population'].sum().plot.pie(figsize=(8,8), legend=True)

Now, let's mess around with the data! Let's say, for the sake of the exercise, that we only care about [Megacities](https://en.wikipedia.org/wiki/Megacity) - those with 10,000,000 residents or more. These will naturally make up only a fraction of the cities.

Let's see what this looks like:

In [None]:
# Filter dataframe by population
df = df.query('population >= 10_000_000')
_ = df.groupby('region')['population'].sum().plot.pie(figsize=(8,8), legend=True)

In [None]:
# Let's remove the old data
!lakectl fs rm --recursive lakefs://my-repo/my-dev-branch/world-cities-database-population/raw/

# ...and write our clean version
df.to_parquet('lakefs://my-repo/my-dev-branch/world-cities-database-population/raw/clean.parquet')

## 🧐 Let's see what changed!

We can use the [lakeFS Diff API](https://docs.lakefs.io/understand/model.html#version-control) to see what changes were made to a branch:

In [None]:
!lakectl diff lakefs://my-repo/my-dev-branch

Or, for those who prefer a more Pythonic approach:

In [None]:
diff = client.branches.diff_branch('my-repo', 'my-dev-branch').get('results')
pd.DataFrame([c.to_dict() for c in diff])[['type', 'path']]
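
If you're curious what a diff like this boils down to, here is a minimal sketch that compares two object listings in plain Python. It is a simplified illustration, not the lakeFS implementation: the `diff_listings` helper, the object paths and the checksum values are all hypothetical, and the `added` / `removed` / `changed` labels are simply chosen to resemble the change types shown in the output above.

```python
from typing import Dict, List, Tuple


def diff_listings(base: Dict[str, str], changed: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two {path: checksum} listings and report what differs."""
    results = []
    for path in sorted(base.keys() | changed.keys()):
        if path not in base:
            results.append(("added", path))
        elif path not in changed:
            results.append(("removed", path))
        elif base[path] != changed[path]:
            results.append(("changed", path))
    return results


# Hypothetical listings: the committed state vs. the state after our rewrite.
committed = {"raw/part-0.parquet": "sum-a", "raw/part-1.parquet": "sum-b"}
rewritten = {"raw/clean.parquet": "sum-c"}

for change_type, path in diff_listings(committed, rewritten):
    print(change_type, path)
```

Deleting the old files and writing one new file shows up as a set of removals plus a single addition, which is the shape of result we'd expect from the steps we just ran.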

## 😱 Comparing and validating our results

Let's compare the table we modified with the one on `main` - we want to make sure that even though we deleted files and rewrote the table on `my-dev-branch`, the data on `main` is unharmed:


In [None]:
main_df = pd.read_parquet('lakefs://my-repo/main/world-cities-database-population/raw/')
dev_df  = pd.read_parquet('lakefs://my-repo/my-dev-branch/world-cities-database-population/raw/')

print(f'original data size: {len(main_df.index):,} rows. cleaned data size: {len(dev_df.index):,} rows. 😮‍💨')

## 💾 Committing

We can, at this point, decide if we want to commit this change, which would allow us to return to this current state of the data, or simply revert all uncommitted changes we've made. Let's see committing in action:

In [None]:
commit_message = 'I deleted A LOT of rows!' # @param {type:"string"}
!lakectl commit --no-color lakefs://my-repo/my-dev-branch -m "{commit_message}"

# Let's also see a log of commits for this branch:
commit_log = client.refs.log_commits('my-repo', 'my-dev-branch').get('results')
pd.DataFrame([c.to_dict() for c in commit_log])[['message', 'id']]
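
The difference between committing and discarding uncommitted changes can be sketched with a small toy class. This is an illustration of the concept only, assuming a simple "staging area plus history" model; the `ToyBranch` class and its methods are hypothetical and do not correspond to any lakeFS API.

```python
from typing import Dict, List


class ToyBranch:
    """Toy model of a branch: a list of committed snapshots plus a staging area."""

    def __init__(self, committed: Dict[str, str]) -> None:
        self.history: List[Dict[str, str]] = [dict(committed)]
        self.staging: Dict[str, str] = dict(committed)

    def write(self, path: str, content: str) -> None:
        self.staging[path] = content

    def delete(self, path: str) -> None:
        self.staging.pop(path, None)

    def commit(self) -> None:
        """Record the current staging area as a new snapshot we can return to."""
        self.history.append(dict(self.staging))

    def reset(self) -> None:
        """Throw away uncommitted changes and go back to the last snapshot."""
        self.staging = dict(self.history[-1])


branch = ToyBranch({"raw/part-0.parquet": "all cities"})

# Uncommitted change, then a change of heart: nothing is recorded in history.
branch.delete("raw/part-0.parquet")
branch.reset()
print(branch.staging)

# The same change, committed this time: history gains a second snapshot.
branch.delete("raw/part-0.parquet")
branch.write("raw/clean.parquet", "megacities only")
branch.commit()
print(len(branch.history))
```

Once a change is committed it becomes a point in history that we can always get back to, whereas uncommitted changes simply disappear when discarded.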

## ↩️ Reverting commits

Sometimes we want to go back to the last known good state of the system. This is very useful when we accidentally introduce (and even commit!) a change we're unhappy with. Let's try that:

In [None]:
# Using the branch revert command to undo the commit
!lakectl branch revert -y lakefs://my-repo/my-dev-branch "my-dev-branch~0"

# Let's see that log again:
commit_log = client.refs.log_commits('my-repo', 'my-dev-branch').get('results')
pd.DataFrame([c.to_dict() for c in commit_log])[['message', 'id']]

🔎 Let's see a Diff of that Revert commit:

In [None]:
# Now, let's see what this new revert commit actually did!
# We do this by diffing the current commit with the previous one
!lakectl diff lakefs://my-repo/my-dev-branch~1 lakefs://my-repo/my-dev-branch

As you can see, that `Revert` operation creates a new commit that performs the inverse of the commit we asked to revert: added files are removed, removed files are restored. 
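
The "inverse of a commit" idea can be sketched in a few lines of plain Python. This is a toy model under simplifying assumptions (a commit is a set of per-path changes, where `None` means deletion); the `apply_changes` and `invert` helpers and the object names are hypothetical and are not how lakeFS implements reverts.

```python
from typing import Dict, Optional

Snapshot = Dict[str, str]
Changes = Dict[str, Optional[str]]  # path -> new content, or None for a deletion


def apply_changes(snapshot: Snapshot, changes: Changes) -> Snapshot:
    """Return a new snapshot with the given changes applied."""
    result = dict(snapshot)
    for path, content in changes.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def invert(parent: Snapshot, changes: Changes) -> Changes:
    """Build the inverse change set: every touched path goes back to its parent state."""
    return {path: parent.get(path) for path in changes}


before = {"raw/part-0.parquet": "blob-a", "raw/part-1.parquet": "blob-b"}
our_commit = {
    "raw/part-0.parquet": None,      # removed
    "raw/part-1.parquet": None,      # removed
    "raw/clean.parquet": "blob-c",   # added
}

after = apply_changes(before, our_commit)
reverted = apply_changes(after, invert(before, our_commit))
print(reverted == before)
```

Note that the original commit is not erased in this model either: the revert is one more set of changes applied on top, so the history still shows both steps.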

Let's make sure the data now looks the way we expect it to:

In [None]:
# And we're now back to square one!
main_df = pd.read_parquet('lakefs://my-repo/main/world-cities-database-population/raw/')
dev_df  = pd.read_parquet('lakefs://my-repo/my-dev-branch/world-cities-database-population/raw/')

print(f'main row count {len(main_df.index):,} == my-dev-branch row count {len(dev_df.index):,}')

## 🤗 Only one thing left to do - clean up!

So now that we've seen how to create new branches and (ab)use them, let's imagine we're unhappy with the change. Not all experiments are successful, so we don't want to apply these changes to our `main` branch.

We can, at this point, drop this branch and pretend none of these changes ever happened:


In [None]:
!lakectl branch delete --yes lakefs://my-repo/my-dev-branch 
!lakectl branch list lakefs://my-repo/

## 💪🎉 You've completed this tutorial!

We're done! We've successfully read from a lakeFS repository, created a branch, modified data in place, examined the results and deleted the branch!

A real-world example might run an entire ETL pipeline using Apache Spark jobs orchestrated by Airflow to do the same on much larger, more complex datasets - but the idea is the same: we use branches to isolate changes without affecting production. No need to copy large amounts of data, maintain these copies or worry about the complexity involved with creating staging environments for data operations.

### Next Steps

➡️ Read more about [building a Dev/Test isolated environment for data](https://docs.lakefs.io/use_cases/iso_env.html) on the lakeFS docs, including more examples, illustrations and case studies

➡️ Run lakeFS locally to get started on your own environment

➡️ Get a free trial of [lakeFS Cloud](https://lakefs.cloud/) to start with a secure, serverless, fully managed lakeFS environment

➡️ Try the [📖 CI/CD for data interactive notebook](https://docs.lakefs.io/use_cases/cicd_for_data.html) to learn about using lakeFS to its full extent, ensuring that changes to production are checked for quality, follow best practices and are fully consistent.
