In this notebook, we will work through a small end-to-end data science tutorial that uses lakeFS-spec for data versioning.

The necessary dependencies are listed in `demo-requirements.txt`.

Let us first get the environment up and running. Create a virtual environment by running `python -m venv .demo-venv` on the command line, activate it, and run `pip install -r demo-requirements.txt` to install the dependencies.

Next, let us set up lakeFS. You can do this by running the `docker run` command from the [lakeFS quickstart](https://docs.lakefs.io/quickstart/launch.html) in a terminal of your choice. Then create an empty repository; you may call it `weather`, for example. Leave the default settings as they are and create the repository.


We will also install the lakeFS CLI, `lakectl`, so that lakeFS-spec can handle authentication automatically. Open a terminal of your choice and run `brew install lakefs`, then run `lakectl config`. You can find the authentication information in the terminal window where you started the lakeFS Docker container. Note: for this to work, you need the `pyyaml` package, which is not a default dependency of lakeFS-spec. Here it is installed via `demo-requirements.txt`, but in your own projects you need to add the dependency manually if you want to use the `lakectl` authentication.

In [2]:
REPO_NAME = 'weather'

Now it's time to get some data. We will use the [Open-Meteo API](https://open-meteo.com/), which provides weather data for free (for non-commercial use) and without an API token.

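For training, we fetch the full hourly data for the 2010s. I am writing this in Munich right now ;), but note that the coordinates in the request below (latitude 52.52, longitude 13.41) correspond to Berlin.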

In [2]:
!curl -o './data/weather-2010s.json' 'https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date=2010-01-01&end_date=2019-12-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6297k    0 6297k    0     0  1683k      0 --:--:--  0:00:03 --:--:-- 1685k
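
The long query string in the `curl` command above can also be assembled programmatically, which makes the location, date range, and variables easier to change. The following is a minimal sketch using only the standard library; the helper name `build_archive_url` is hypothetical, and the parameter names are simply the ones that appear in the URL above.

```python
from urllib.parse import urlencode

ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date, hourly):
    """Assemble an archive request URL (hypothetical helper, not part of any library)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        # The hourly variables are passed as one comma-separated list.
        "hourly": ",".join(hourly),
    }
    # safe="," keeps the commas unescaped, matching the URL used with curl.
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params, safe=',')}"


url = build_archive_url(
    52.52,
    13.41,
    "2010-01-01",
    "2019-12-31",
    ["temperature_2m", "relativehumidity_2m", "rain"],
)
print(url)
```

The resulting string can be passed to `curl` or any HTTP client in place of the hand-written URL.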


The data is in JSON format, so we need to wrangle it a bit to make it usable. But first, we will save it to our lakeFS instance on a new branch, `transform-raw-data`.

In [16]:
from lakefs_spec import LakeFSFileSystem

NEW_BRANCH_NAME = 'transform-raw-data'


fs = LakeFSFileSystem(host="localhost:8000")
fs.put('./data/weather-2010s.json',  f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010.json')



✅ Add file weather-2010s_copy.json
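
The target path passed to `fs.put` above has the form `repository/branch/path-in-repository`. As a minimal sketch of that convention, the hypothetical helpers below (they are not part of lakeFS-spec) build and split such paths with plain string operations:

```python
LAKEFS_SCHEME = "lakefs://"


def build_lakefs_path(repository, ref, resource):
    """Join repository, branch (ref), and resource into one path (hypothetical helper)."""
    return f"{repository}/{ref}/{resource.lstrip('/')}"


def split_lakefs_path(path):
    """Split a path or lakefs:// URI into (repository, ref, resource) (hypothetical helper)."""
    if path.startswith(LAKEFS_SCHEME):
        path = path[len(LAKEFS_SCHEME):]
    # Split only twice so that slashes inside the resource part are preserved.
    repository, ref, resource = path.split("/", 2)
    return repository, ref, resource


path = build_lakefs_path("weather", "transform-raw-data", "weather-2010.json")
print(path)
print(split_lakefs_path("lakefs://weather/main/raw/weather_2010s.csv"))
```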


Now, in the lakeFS UI in your browser, you can switch the branch to `transform-raw-data` and see the saved file. However, the change is not yet committed. To change that, we set the `postcommit` attribute of our filesystem to `True`. We also add a commit hook that is executed upon commit and allows us to, for example, set the commit message.

In [9]:
from lakefs_client.models import CommitCreation
from lakefs_spec.commithook import FSEvent, HookContext

def create_commit_message(event: FSEvent, ctx: HookContext) -> CommitCreation:
    if event == FSEvent.RM:
        message = f"❌ Remove file {ctx.resource}"
    else:
        message = f"✅ Add file {ctx.resource}"
    print(message)

    return CommitCreation(message=message)

fs.postcommit = True
fs.commithook = create_commit_message

Now let's transform the data for our use case. 

In [5]:
import pandas as pd
import numpy as np
import json 


# Use a context manager so the file is closed even if parsing fails.
with open('./data/weather-2010s.json') as f:
    data = json.load(f)

df = pd.DataFrame.from_dict(data["hourly"])
df.time = pd.to_datetime(df.time)
df['is_raining'] = df.rain > 0
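# Caution: shift(24) gives each row the value from 24 hours earlier.
# To label each row with the value 24 hours ahead, shift(-24) would be needed.
df['is_raining_in_1_day'] = df.is_raining.shift(24)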
df = df.dropna()
df.head(26)


Unnamed: 0,time,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
24,2010-01-02 00:00:00,-3.0,88,0.0,1004.0,999.2,100,54,100,70,8.3,17.3,18,31,False,False
25,2010-01-02 01:00:00,-3.2,89,0.0,1004.5,999.7,100,37,98,92,9.1,18.6,9,22,False,False
26,2010-01-02 02:00:00,-3.4,89,0.0,1005.2,1000.4,100,24,98,77,10.1,20.3,2,13,False,False
27,2010-01-02 03:00:00,-3.5,89,0.0,1005.6,1000.8,93,8,98,89,11.2,21.7,4,12,False,False
28,2010-01-02 04:00:00,-3.7,90,0.0,1006.2,1001.4,94,6,100,95,11.2,21.8,358,9,False,False
29,2010-01-02 05:00:00,-3.8,90,0.0,1007.2,1002.4,99,12,100,93,10.9,21.6,351,2,False,False
30,2010-01-02 06:00:00,-3.8,91,0.0,1008.0,1003.2,97,13,100,85,10.0,20.6,345,356,False,False
31,2010-01-02 07:00:00,-3.6,93,0.0,1008.9,1004.1,72,1,100,36,10.0,20.4,339,351,False,False
32,2010-01-02 08:00:00,-3.6,93,0.0,1009.8,1005.0,62,2,100,0,10.5,21.0,333,344,False,False
33,2010-01-02 09:00:00,-3.2,92,0.0,1010.6,1005.8,65,5,100,0,11.0,20.9,328,339,False,False
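
Because the direction of `shift` is easy to mix up, here is a small toy example (with made-up values and a shift of 2 rows instead of 24) that contrasts a positive and a negative shift. With a positive shift, each row receives the value from an earlier row; with a negative shift, it receives the value from a later row.

```python
import pandas as pd

# Made-up rain amounts for six consecutive time steps.
rain = pd.Series([0.0, 0.0, 1.2, 0.0, 0.5, 0.0], name="rain")
is_raining = rain > 0

lagged = is_raining.shift(2)   # value from 2 rows earlier; first 2 rows become NaN
lead = is_raining.shift(-2)    # value from 2 rows later; last 2 rows become NaN

demo = pd.DataFrame({"is_raining": is_raining, "lagged": lagged, "lead": lead})
print(demo)
```

In the `lagged` column the missing values sit at the top of the frame, which matches the output above starting at index 24 after `dropna()`.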


Let us now save this data as a CSV file to the `main` branch. You can verify that the upload worked in the lakeFS UI in your browser.

In [15]:
df.to_csv(f'lakefs://{REPO_NAME}/main/weather_2010s.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.


In [10]:
import sklearn.model_selection

# Drop the timestamp column; the remaining columns hold the features and the target.
model_data = df.drop('time', axis=1)

train, test = sklearn.model_selection.train_test_split(model_data)
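
`train_test_split` shuffles the rows randomly, so each run of the cell above yields a different split. Since this tutorial is about versioning data, it may help to pin the split with `random_state` so that the versioned files can be regenerated exactly. A minimal sketch on a toy frame (the values, the explicit `test_size`, and the seed 42 are arbitrary assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the weather frame.
toy = pd.DataFrame(
    {"feature": range(100), "target": [i % 2 == 0 for i in range(100)]}
)

# The same seed produces the same shuffled split on every call.
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))
print(train_a.equals(train_b))
```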

We save the train and test datasets to a new `training` branch.

In [31]:
TRAINING_BRANCH = 'training'
train.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather_2010s.csv')
test.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather_2010s.csv')


Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Created new branch 'training' from branch 'main'.
Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.


In [None]:
# TODO: The `to_csv` method does not work with postcommit. What should I do here?
# For now, this step is a manual commit in the lakeFS UI or programmatically using the lakeFS client, which is what we want to avoid with lakeFS-spec.

In [32]:
import lakefs_client
from lakefs_client.client import LakeFSClient

# Default credentials of the lakeFS quickstart; replace them with your own for any other setup.
configuration = lakefs_client.Configuration()
configuration.username = 'AKIAIOSFOLQUICKSTART'
configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
configuration.host = 'http://localhost:8000'

client = LakeFSClient(configuration)

client.commits_api.commit(
    repository=REPO_NAME,
    branch=TRAINING_BRANCH,
    commit_creation=lakefs_client.models.CommitCreation(message='Train and Test Split'))


{'committer': 'quickstart',
 'creation_date': 1694525438,
 'id': 'd2d65f05dbacae5f9f7350039802202d655a080abdc42f9546296769b9753dfe',
 'message': 'Train and Test Split',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['474cf6fb6bb7bec92b56b7199072991bfe6508e3a357fa238708f2054e6c31fd']}
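
The `creation_date` in the response above is a Unix timestamp in seconds. A short sketch for turning it into a readable UTC datetime with the standard library:

```python
from datetime import datetime, timezone

creation_date = 1694525438  # value copied from the commit response above

# Interpret the timestamp as seconds since the epoch, in UTC.
committed_at = datetime.fromtimestamp(creation_date, tz=timezone.utc)
print(committed_at.isoformat())  # → 2023-09-12T13:30:38+00:00
```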

In [None]:
from sklearn.ensemble import RandomForestClassifier

dependent_variable = 'is_raining_in_1_day'

model = RandomForestClassifier()
x_train, y_train = train.drop(dependent_variable, axis=1), train[dependent_variable].astype(bool)
x_test, y_test = test.drop(dependent_variable, axis=1), test[dependent_variable].astype(bool)

model.fit(x_train, y_train)



Test accuracy: 89.0 %


In [None]:
test_acc = model.score(x_test, y_test)

# Format with two decimals to avoid floating-point noise from round(...) * 100.
print(f"Test accuracy: {test_acc * 100:.2f} %")

Test accuracy: 88.52 %
