# Introduction

In this notebook, we will complete a small end-to-end data science tutorial that employs `lakefs-spec` for data versioning. We will use weather data to train a random forest classifier that predicts whether it will be raining one day from now, given the current weather.

We will do the following:
* Environment setup
* lakeFS setup
* Data ingestion
    * PUT a file
    * Transactions
* Data transformation
* Model training
* Updating data and retraining a model
* Accessing data versions and reproducing experiments
* Using a tag instead of a commit SHA for semantic versioning

To execute the code in this tutorial as a Jupyter notebook, download this `.ipynb` file to a convenient location on your machine. You can also clone the whole `lakefs-spec` repository. While the tutorial runs, a folder named `data` will be created in the same directory as the notebook. We will also download a file `.lakectl.yaml` to the current user's home directory.

Prerequisites before we start:
* Python 3.9 or higher
* Docker Desktop installed - [see guidance](https://www.docker.com/get-started/)
* git installed
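
The prerequisites above can be checked from any Python interpreter. The following is a minimal sketch using only the standard library; the helper name `check_prerequisites` is made up for this illustration. Note that it only tests whether the `docker` and `git` executables are on your `PATH`, not whether the Docker daemon is actually running.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9), tools=("docker", "git")):
    """Return a dict mapping each prerequisite to True (found) or False."""
    report = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```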

# Environment Setup
To set up the environment, run the following commands in your console:

Create a virtual environment:

`python -m venv .venv`

Activate the environment.

On macOS and Linux:

`source .venv/bin/activate`

On Windows:

`.venv\Scripts\activate`

Install the libraries necessary for this notebook in the environment you have just created:

`pip install notebook lakefs-spec numpy pandas scikit-learn`

From a terminal, start a Jupyter notebook server if one is not already running, by executing `jupyter notebook`. Double click the notebook to autostart a kernel using the created virtual environment.

# lakeFS Setup

Ensure you have Docker Desktop or a similar runtime available and running.

Set up lakeFS. You can do this by executing the following `docker run` command (the lakeFS quickstart) in your console:

`docker run --name lakefs --pull always --rm --publish 8000:8000 treeverse/lakefs:latest run --quickstart`

Open a browser and navigate to the lakeFS instance - by default: http://localhost:8000/. 

Authenticate with the following credentials:

    Access Key ID    : AKIAIOSFOLQUICKSTART
    Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

You can enter anything as the email address; we won't need it in this example.

Before we instantiate the filesystem connector `LakeFSFileSystem`, we need to configure authentication.

There are multiple ways to authenticate to lakeFS from Python code - in this tutorial, we choose the convenient YAML file configuration. Execute the code below to download the YAML file including the lakeFS quickstart credentials and server URL to your user directory.

In [None]:
import urllib.request
import os

destination = os.path.expanduser("~/.lakectl.yaml")
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/appliedAI-Initiative/lakefs-spec/main/docs/tutorials/.lakectl.yaml", destination)

Now we can instantiate the `LakeFSFileSystem`, and it will use the credentials we just downloaded for authentication. Alternatively, we could have passed the credentials in the code. It is important that the credentials are available at the time of filesystem instantiation.

In [None]:
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

REPO_NAME = 'weather'

We will create a repository using a helper function provided by `lakefs-spec`. If you already created a repository in the UI, set the `REPO_NAME` variable in the cell above accordingly and re-execute that cell if necessary.
Otherwise, execute the next cell.

In [None]:
from lakefs_spec.client_helpers import create_repository

repo = create_repository(client=fs.client, name=REPO_NAME, storage_namespace=f"local://{REPO_NAME}")

# Data Ingestion

Now it's time to get some data. We will use the [Open-Meteo API](https://open-meteo.com/), which provides weather data free of charge for non-commercial use and does not require an API token.

First, create the folder `data` in the directory where your notebook is located:

In [None]:
os.makedirs("data", exist_ok=True)

Then, for the purpose of training, get the full data of the 2010s for Berlin (the coordinates used in the request, latitude 52.52 and longitude 13.41, point to Berlin):

In [None]:
destination = "data/weather-2010s.json"
urllib.request.urlretrieve("https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date=2010-01-01&end_date=2019-12-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m", destination)

## PUT a file

The data is in JSON format. Therefore, we need to wrangle the data a bit to make it usable. But first we will save it into our lakeFS instance.
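
To get a feeling for the wrangling ahead, here is a hand-written miniature payload. Its layout is an assumption inferred from the transformation code later in this notebook, which reads the `"hourly"` entry as one list per variable, all aligned with `"time"`; the values themselves are made up.

```python
import json

# Hand-written miniature payload; the values are made up for illustration.
sample_payload = """
{
  "latitude": 52.52,
  "longitude": 13.41,
  "hourly": {
    "time": ["2010-01-01T00:00", "2010-01-01T01:00", "2010-01-01T02:00"],
    "temperature_2m": [-2.1, -2.3, -2.4],
    "rain": [0.0, 0.2, 0.0]
  }
}
"""

data = json.loads(sample_payload)
hourly = data["hourly"]

# Every variable is a list with one entry per timestamp.
lengths = {name: len(values) for name, values in hourly.items()}
print(lengths)

# Turn the column-oriented layout into one record per hour.
rows = [dict(zip(hourly, row)) for row in zip(*hourly.values())]
print(rows[1])
```

A column-oriented layout like this maps directly onto a `DataFrame`, which is why the transformation step can stay short.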

lakeFS works similarly to `git` as a versioning system. You can create *commits* that contain specific changes to the data. You can also work with *branches* to fork your own copy of the data such that you don't interfere with your colleagues. Every commit (on any branch) is identified by a commit SHA. This SHA can be used to programmatically interact with specific states of your data and enables logging of the specific data versions used to create a certain model. We will cover all of this in this demo.

For now, we will `put` the file we have on a new branch, `transform-raw-data`, specifically created for our data.

In [None]:
from lakefs_spec import LakeFSFileSystem
LakeFSFileSystem.clear_instance_cache()

NEW_BRANCH_NAME = 'transform-raw-data'

fs = LakeFSFileSystem()
fs.put('./data/weather-2010s.json', f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010.json')

Going to the lakeFS UI in your browser, you can change the branch view to `transform-raw-data` and see the saved file. However, the change is not yet committed. While you can do that manually via the uncommitted changes tab in the UI, we will commit the file in a different way.

## Transactions

To easily carry out versioning operations while uploading files, you can use a *transaction*. A transaction is a context manager that keeps track of all files that were uploaded in its scope, as well as all versioning operations happening in between file uploads. All operations are deferred to the end of the transaction, and are executed sequentially on completion.
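
The deferral idea can be pictured with a toy context manager. This is only a conceptual sketch and not how `lakefs-spec` implements transactions; the class `ToyTransaction` and its `defer` method are invented for the illustration, and discarding the queue on an exception is a choice made for this toy only.

```python
class ToyTransaction:
    """Collects operations and runs them in order when the block exits."""

    def __init__(self):
        self.pending = []  # queued (description, callable) pairs
        self.log = []      # descriptions of operations that were executed

    def defer(self, description, operation):
        self.pending.append((description, operation))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Run the queue only if the block finished without an exception.
        if exc_type is None:
            for description, operation in self.pending:
                operation()
                self.log.append(description)
        self.pending.clear()
        return False  # never swallow exceptions


store = {}
with ToyTransaction() as tx:
    tx.defer("upload", lambda: store.update({"weather.json": "raw bytes"}))
    tx.defer("commit", lambda: store.update({"committed": True}))
    state_inside_block = dict(store)  # still empty, nothing has run yet

print(state_inside_block)
print(tx.log)
print(store)
```

Inside the `with` block the store is still empty; both operations run, in the order they were queued, only when the block exits.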

To create a commit after a file upload, you can run the following transaction:

In [None]:
with fs.transaction as tx:
    fs.put('data/weather-2010s.json', f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010.json', precheck=True, autocommit=False)
    tx.commit(repository=REPO_NAME, branch=NEW_BRANCH_NAME, message="Add weather data")

This transaction, if successful, will create a commit. Since we already uploaded the file, lakeFS will skip the upload as the checksums of the local and remote file match.

If we want to execute the upload even for an unchanged file, we can do so by passing `precheck=False` to the `fs.put()` operation.
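
The checksum comparison behind `precheck` can be sketched with the standard library. This is an illustration of the general idea, not the code `lakefs-spec` runs: the helper names are hypothetical, the remote checksum is simulated locally, and SHA-256 is an arbitrary choice for the sketch.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local_path, remote_checksum, precheck=True):
    """Decide whether to upload; without precheck we always upload."""
    if not precheck:
        return True
    return checksum_of_file(local_path) != remote_checksum


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "weather.json"
    local.write_bytes(b'{"hourly": {}}')
    remote_checksum = checksum_of_file(local)  # pretend the remote copy is identical

    unchanged = needs_upload(local, remote_checksum)
    forced = needs_upload(local, remote_checksum, precheck=False)

    local.write_bytes(b'{"hourly": {"rain": [0.1]}}')  # local file changes
    changed = needs_upload(local, remote_checksum)

print(unchanged, forced, changed)
```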

# Data Transformation
Now let's transform the data for our use case. We put the transformation into a function to be able to reuse it later.

In this notebook, we follow a simple toy example to predict whether it is raining at the same time tomorrow given weather data from right now.

We will skip a lot of possible feature engineering etc. in order to focus on the application of lakeFS and the `LakeFSFileSystem`.

In [None]:
import json

import pandas as pd


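def transform_json_weather_data(filepath):
    with open(filepath, "r") as f:
        data = json.load(f)

    # The "hourly" entry holds one list per variable, all aligned with "time".
    df = pd.DataFrame.from_dict(data["hourly"])
    df.time = pd.to_datetime(df.time)
    df['is_raining'] = df.rain > 0
    # Shift by -24 rows (hours) so that each row carries the rain label from 24 hours later.
    df['is_raining_in_1_day'] = df.is_raining.shift(-24)
    # Drop rows with missing values, including the last 24 rows, which have no label.
    df = df.dropna()
    return df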
    
df = transform_json_weather_data('data/weather-2010s.json')
df.head(5)
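
The direction of `Series.shift` is easy to mix up, so here is a toy example with made-up values and a shift of one row instead of 24. A negative shift pulls future values up into the current row, while a positive shift pushes past values down.

```python
import pandas as pd

rain = pd.Series([False, True, False, False], name="is_raining")

ahead = rain.shift(-1)  # row i now holds the value of row i + 1 (the future)
behind = rain.shift(1)  # row i now holds the value of row i - 1 (the past)

print(pd.DataFrame({"now": rain, "shift(-1)": ahead, "shift(1)": behind}))
```

The row that has no counterpart (the last one for `shift(-1)`, the first one for `shift(1)`) is filled with a missing value, which is why the transformation ends with `dropna()`.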

Next, we save this data as a CSV file into the `main` branch. When the transaction commit helper is called, the newly put CSV file is committed. You can verify that the file was saved in the lakeFS UI in your browser by switching to the commits tab of the `main` branch.

In [None]:
with fs.transaction as tx:
    df.to_csv(f'lakefs://{REPO_NAME}/main/weather_2010s.csv')
    tx.commit(repository=REPO_NAME, branch="main", message="Update weather data")
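
The `lakefs://` URIs used in this notebook follow the pattern `lakefs://<repository>/<ref>/<path>`. As a small sketch of that structure, the hypothetical helper below (not part of `lakefs-spec`) splits such a URI into its three parts using plain string handling.

```python
from typing import NamedTuple


class LakeFSLocation(NamedTuple):
    repository: str
    ref: str
    path: str


def split_lakefs_uri(uri):
    """Split 'lakefs://<repository>/<ref>/<path>' into its three parts."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    # Split at most twice so that slashes inside the object path are kept.
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected lakefs://<repository>/<ref>/<path>, got {uri!r}")
    return LakeFSLocation(*parts)


loc = split_lakefs_uri("lakefs://weather/main/weather_2010s.csv")
print(loc)
```

Later in this notebook, the middle part is a branch name, a commit SHA, or a tag, while the repository and path stay the same.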

# Model Training
First we will do a train-test split:

In [None]:
import sklearn.model_selection

model_data = df.drop('time', axis=1)

train, test = sklearn.model_selection.train_test_split(model_data, random_state=7)

We save these train and test datasets into a new `training` branch. If the branch does not exist yet, as in this case, it is implicitly created by default. You can control this behaviour with the `create_branch_ok` flag when initializing the `LakeFSFileSystem`. Because `create_branch_ok` defaults to `True`, instantiating `fs = LakeFSFileSystem()` without extra arguments is enough to enable implicit branch creation.

In [None]:
TRAINING_BRANCH = 'training'

with fs.transaction as tx:
    train.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv')
    test.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv')
    tx.commit(repository=REPO_NAME, branch=TRAINING_BRANCH, message="Add train-test split of 2010s weather data")

Implicit branch creation is a convenient way to create new branches programmatically. However, one drawback is that a typo in a branch name might accidentally create a new branch. If you want to avoid this implicit behavior and raise errors instead, you can disable implicit branch creation by setting `fs.create_branch_ok=False`.
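
Independently of that flag, branch names can also be guarded on the client side. The sketch below is one possible approach and not a `lakefs-spec` feature: the helper `checked_branch` and the `KNOWN_BRANCHES` set are hypothetical and only use the standard library.

```python
import difflib

# Branch names used in this notebook; maintained by hand for this sketch.
KNOWN_BRANCHES = {"main", "transform-raw-data", "training"}


def checked_branch(name, known=KNOWN_BRANCHES):
    """Return the name if it is known, otherwise raise with a suggestion."""
    if name in known:
        return name
    suggestion = difflib.get_close_matches(name, sorted(known), n=1)
    hint = f" Did you mean {suggestion[0]!r}?" if suggestion else ""
    raise ValueError(f"Unknown branch {name!r}.{hint}")


print(checked_branch("training"))

message = ""
try:
    checked_branch("trainning")  # typo on purpose
except ValueError as err:
    message = str(err)
print(message)
```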

We can now read the train and test files directly from the remote lakeFS instance. (You can verify that neither the train nor the test file is saved in the local `data` directory.)

In [None]:
train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)

train.head()

Let's check the shape of the train and test data. Later on, we will try to get back to this data version and reproduce the results of the experiment.

In [None]:
print(f'Initial train data shape: {train.shape}')
print(f'Initial test data shape: {test.shape}')

We now proceed to train a random forest classifier and evaluate it on the test set:

In [None]:
from sklearn.ensemble import RandomForestClassifier

dependent_variable = 'is_raining_in_1_day'

model = RandomForestClassifier(random_state=7)
x_train, y_train = train.drop(dependent_variable, axis=1), train[dependent_variable].astype(bool)
x_test, y_test = test.drop(dependent_variable, axis=1), test[dependent_variable].astype(bool)

model.fit(x_train, y_train)

test_acc = model.score(x_test, y_test)

print(f"Test accuracy: {test_acc:.2%}")
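
For a classifier, `model.score` reports the mean accuracy, that is, the fraction of test rows whose prediction matches the label. The sketch below recomputes that quantity on made-up labels and compares it with a majority-class baseline, which is a useful sanity check when rainy hours are much rarer than dry ones. The helper names and the example labels are illustrative only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Made-up labels and predictions, not results from the model above.
y_true = [False, False, False, True, False, True, False, False]
y_pred = [False, False, True, True, False, False, False, False]

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = majority_baseline(y_true)
print(f"Toy accuracy: {model_accuracy:.2%}")
print(f"Majority baseline: {baseline_accuracy:.2%}")
```

In this toy case both numbers coincide, which shows that an accuracy value only becomes meaningful next to such a baseline.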

# Updating Data and Model
Until now, we have only used data from the 2010s. Let's download additional 2020s data, transform it, and save it to lakeFS.

In [None]:
destination = "data/weather-2020s.json"
urllib.request.urlretrieve("https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date=2020-01-01&end_date=2023-08-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m", destination)

new_data = transform_json_weather_data('data/weather-2020s.json')

with fs.transaction as tx:
    new_data.to_csv(f'lakefs://{REPO_NAME}/main/weather_2020s.csv')
    tx.commit(repository=REPO_NAME, branch="main", message="Add 2020s weather data")

In [None]:
new_data = new_data.drop('time', axis=1)

Let's concatenate the old data and the new data, create a new train-test split, and overwrite the files on lakeFS:

In [None]:
df_train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
df_test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)

full_data = pd.concat([new_data, df_train, df_test])

train_df, test_df = sklearn.model_selection.train_test_split(full_data, random_state=7)

with fs.transaction as tx:
    train_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv')
    test_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv')
    tx.commit(repository=REPO_NAME, branch=TRAINING_BRANCH, message="Add train-test split of full 2010-2020s data")

We may now read the updated data directly from lakeFS and check its shape to ensure that the initial files `train_weather.csv` and `test_weather.csv` have been overwritten successfully (the number of rows should be significantly higher, since the 2020s data was added):

In [None]:

train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)

print(f'Updated train data shape: {train.shape}')
print(f'Updated test data shape: {test.shape}')

Now we may train the model based on the new train data and validate based on the new test data:

In [None]:
x_train, y_train = train.drop(dependent_variable, axis=1), train[dependent_variable].astype(bool)
x_test, y_test = test.drop(dependent_variable, axis=1), test[dependent_variable].astype(bool)

model.fit(x_train, y_train)

test_acc = model.score(x_test, y_test)

print(f"Test accuracy: {test_acc:.2%}")

# Accessing Data Version and Reproducing Experiment

Let's assume we need to go back to our initial data and reproduce the first experiment (the model trained on the 2010s data with its initial accuracy). This might be tricky, as we have overwritten the initial train and test data on lakeFS.

To enable data versioning, we should save the `ref` of the specific datasets. A `ref` can be the name of the branch from which we read a file in lakeFS. A `ref` can also be a commit ID, which lets you access different data versions within the same branch and not only the version from the latest commit. Therefore, we will use explicit versioning and get the actual commit SHA. There are multiple ways to do this. Manually, we could go into the lakeFS UI, select the training branch, and navigate to the "Commits" tab. There, we could find the second-latest commit, titled `Add train-test split of 2010s weather data`, and copy its ID.

However, we want to automate as much as possible and therefore use a helper function. You can find pre-written helper functions in the `lakefs_spec.client_helpers` module:

In [None]:
from lakefs_spec.client_helpers import rev_parse

# `parent` counts commits back from the branch head: 0 is the latest commit, 1 is the one before it
previous_commit = rev_parse(fs.client, REPO_NAME, TRAINING_BRANCH, parent=1)
fixed_commit_id = previous_commit.id
print(fixed_commit_id)

With our transaction setup, both `DataFrame.to_csv()` operations are kept in a single commit. To get other commits with the `rev_parse` function, you can change the `repository` and `branch` parameters. To go back in the chosen branch's commit history, you can increase the `parent` parameter. In our case, the initial data was committed one commit before the latest one. Since we count the latest commit on a branch as 0, we use `parent=1`.
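
The counting convention for `parent` can be pictured with a plain list. This is a toy model only: `toy_rev_parse` is a made-up stand-in and not the `rev_parse` helper, and the commit IDs are invented.

```python
# Commit history of a branch, newest first. IDs are invented for the sketch.
history = [
    {"id": "c3", "message": "Add train-test split of full 2010-2020s data"},
    {"id": "c2", "message": "Add train-test split of 2010s weather data"},
    {"id": "c1", "message": "(older commit)"},
]


def toy_rev_parse(history, parent=0):
    """Return the commit that lies `parent` steps behind the branch head."""
    if not 0 <= parent < len(history):
        raise IndexError(f"history has no commit {parent} steps back")
    return history[parent]


latest = toy_rev_parse(history)
previous = toy_rev_parse(history, parent=1)
print(latest["id"], "-", latest["message"])
print(previous["id"], "-", previous["message"])
```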

Let's check whether we manage to get the initial train and test data with this commit SHA, comparing the shape to the initial data shape:

In [None]:
train = pd.read_csv(f"lakefs://{REPO_NAME}/{fixed_commit_id}/train_weather.csv", index_col=0)
test = pd.read_csv(f"lakefs://{REPO_NAME}/{fixed_commit_id}/test_weather.csv", index_col=0)

print(f'train data shape: {train.shape}')
print(f'test data shape: {test.shape}')

Let's train and validate the model based on the re-fetched data and see whether we manage to reproduce the initial accuracy:

In [None]:
x_train, y_train = train.drop(dependent_variable, axis=1), train[dependent_variable].astype(bool)
x_test, y_test = test.drop(dependent_variable, axis=1), test[dependent_variable].astype(bool)

model.fit(x_train, y_train)

test_acc = model.score(x_test, y_test)

print(f"Test accuracy: {test_acc:.2%}")

# Using a tag instead of a commit SHA for semantic versioning
The above method for data versioning works great when you have experiment tracking tools to store and retrieve the commit SHA in automated pipelines. However, a commit SHA is hard to remember and tedious to retrieve during manual prototyping. We can make selected versions of the dataset more accessible with semantic versioning by attaching a human-interpretable tag that points to a specific commit SHA.

Creating a tag is easiest when done inside a transaction, just like the file uploads and commits we performed before. To do this, simply call `tx.tag` on the transaction and supply the repository name, the commit SHA to tag, and the intended tag name.

In [None]:
with fs.transaction as tx:
    # the `tag` result is simply the tag name, in this case 'train-test-split-2010'.
    tag = tx.tag(repository=REPO_NAME, ref=fixed_commit_id, tag='train-test-split-2010')

Now we can access the specific files with the semantic tag. Both the `fixed_commit_id` and `tag` reference the same version `ref` in lakeFS, whereas a branch name always points to the latest version on that respective branch.

In [None]:
train_from_branch_head = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
train_from_commit_sha = pd.read_csv(f'lakefs://{REPO_NAME}/{fixed_commit_id}/train_weather.csv', index_col=0)
train_from_semantic_tag = pd.read_csv(f'lakefs://{REPO_NAME}/{tag}/train_weather.csv', index_col=0)

We can verify this by looking at the lengths of the `DataFrame`s. We see that the lengths of `train_from_commit_sha` and `train_from_semantic_tag` are equal, while `train_from_branch_head` is longer because it contains the additional 2020s data.

In [None]:
print(len(train_from_branch_head))
print(len(train_from_commit_sha))
print(len(train_from_semantic_tag))
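
Why the three reads behave this way can be summarized in a toy model of refs. The dictionaries and the `resolve` function below are a conceptual sketch with invented commit IDs and row counts, not lakeFS internals: a branch is a pointer that moves with every new commit, while a tag and a commit ID stay fixed.

```python
# Row counts per file for each commit; all numbers and IDs are invented.
snapshots = {
    "c1": {"train_weather.csv": 3},
    "c2": {"train_weather.csv": 5},
}
branches = {"training": "c1"}  # branch name -> commit at the branch head
tags = {}                      # tag name -> commit the tag is pinned to


def resolve(ref):
    """Map a branch name, tag name, or commit ID to a commit ID."""
    if ref in branches:
        return branches[ref]
    if ref in tags:
        return tags[ref]
    if ref in snapshots:
        return ref
    raise KeyError(f"unknown ref: {ref!r}")


branches["training"] = "c2"           # a new commit moves the branch head
tags["train-test-split-2010"] = "c1"  # the tag pins the earlier commit

for ref in ("training", "c1", "train-test-split-2010"):
    commit = resolve(ref)
    print(f"{ref} -> {commit}: {snapshots[commit]['train_weather.csv']} rows")
```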