In this notebook, we will complete a small end-to-end data science tutorial that uses lakeFS-spec for data versioning.
We will use weather data to train a random forest classifier that aims to predict whether it will be raining one day from now, given the current weather.

Let's set up the environment. You can create it with `python -m venv .demo-venv` on the command line. Then activate it (`source .demo-venv/bin/activate`) and run `pip install -r demo-requirements.txt` to install the dependencies.

Next, let us set up lakeFS. You can do this by executing the `docker run` command given in the [lakeFS quickstart](https://docs.lakefs.io/quickstart/launch.html) in a terminal of your choice. Open a browser and navigate to the lakeFS instance (by default `localhost:8000`). Authenticate with the credentials printed in the terminal where you started the Docker container. As an email, you can enter anything; we won't need it in this example. Proceed to create an empty repository. You may call it `weather`.


We will also install the lakeFS CLI, `lakectl`, so that lakeFS-spec can handle authentication automatically. Open a terminal of your choice and run `brew install lakefs`. Then run `lakectl config`. You will find the authentication information in the terminal window where you started the lakeFS Docker container.

Note: for this to work, you need the `pyyaml` package, which is not a default dependency of lakeFS-spec. Here it was installed via `demo-requirements.txt`. In your own projects, add the dependency manually if you want to use the `lakectl` authentication.

In [1]:
REPO_NAME = 'weather'

Now it's time to get some data. We will use the [Open-Meteo API](https://open-meteo.com/), from which we can pull weather data for free (as long as our use is non-commercial) and without an API token.

For training, we get the full data of the 2010s. (I am writing this in Munich right now ;), but note that the coordinates in the request, latitude 52.52 and longitude 13.41, are those of Berlin.)

In [2]:
!curl -o './data/weather-2010s.json' 'https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date=2010-01-01&end_date=2019-12-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6297k    0 6297k    0     0  1274k      0 --:--:--  0:00:04 --:--:-- 1275k
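
The request URL used above is long and easy to mistype. As a minimal sketch, it can also be assembled from its query parameters with the standard library. The helper name `build_archive_url` and the constant names are illustrative assumptions; the endpoint, coordinates, dates, and variable names are taken from the `curl` call.

```python
from urllib.parse import urlencode

# Endpoint and variable names copied from the curl call in this notebook
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"
HOURLY_VARIABLES = [
    "temperature_2m", "relativehumidity_2m", "rain", "pressure_msl",
    "surface_pressure", "cloudcover", "cloudcover_low", "cloudcover_mid",
    "cloudcover_high", "windspeed_10m", "windspeed_100m",
    "winddirection_10m", "winddirection_100m",
]


def build_archive_url(latitude, longitude, start_date, end_date, hourly=HOURLY_VARIABLES):
    """Assemble the archive request URL (hypothetical helper, not part of any library)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "hourly": ",".join(hourly),
    }
    # safe="," keeps the commas of the variable list unescaped
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params, safe=',')}"


url_2010s = build_archive_url(52.52, 13.41, "2010-01-01", "2019-12-31")
print(url_2010s)
```

Changing only the date arguments yields the request for another period, which avoids copying the long variable list by hand.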


The data is in JSON format, so we need to wrangle it a bit to make it usable. But first, we will save it to our lakeFS instance on a new branch, `transform-raw-data`.

In [3]:
from lakefs_spec import LakeFSFileSystem

NEW_BRANCH_NAME = 'transform-raw-data'


fs = LakeFSFileSystem()
fs.put('./data/weather-2010s.json',  f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010.json')



Created new branch 'transform-raw-data' from branch 'main'.


Now, in the lakeFS UI in your browser, you can switch to the `transform-raw-data` branch and see the saved file. However, the change is not yet committed.
While you can commit manually via the uncommitted changes tab in lakeFS, we will take another route.

We can also commit changes programmatically. To achieve this, we register a hook. The hook must have the signature `(client, context) -> None`, where `client` is the respective lakeFS client and `context` is an object containing information about the corresponding resource.



In [4]:
from lakefs_client.client import LakeFSClient
from lakefs_spec.client_helpers import commit
from lakefs_spec.hooks import FSEvent, HookContext

# Define the commit hook
def commit_on_put(client: LakeFSClient, ctx: HookContext) -> None:
    commit_message = f"Add file {ctx.resource}"
    print(f"Attempting Commit: {commit_message}")
    commit(client, repository=ctx.repository, branch=ctx.ref, message=commit_message)


# Register the commit hook to be executed after a PUT_FILE event
fs.register_hook(FSEvent.PUT_FILE, commit_on_put)
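
To make the hook pattern concrete without a running lakeFS instance, here is a toy sketch that uses only the standard library. It is not how lakeFS-spec is implemented; `ToyEvent`, `ToyContext`, `ToyFileSystem`, and the `skip_upload` parameter are hypothetical names invented for illustration. The sketch only shows a callback with the `(client, context) -> None` shape being invoked after a put that is actually executed.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Callable, Dict


class ToyEvent(Enum):
    """Hypothetical stand-in for an event type such as PUT_FILE."""
    PUT_FILE = auto()


@dataclass
class ToyContext:
    """Hypothetical stand-in for the context object passed to a hook."""
    repository: str
    ref: str
    resource: str


ToyHook = Callable[[Any, ToyContext], None]


class ToyFileSystem:
    """Toy file system that calls a registered hook after each executed put."""

    def __init__(self, client: Any = None) -> None:
        self.client = client
        self._hooks: Dict[ToyEvent, ToyHook] = {}

    def register_hook(self, event: ToyEvent, hook: ToyHook) -> None:
        self._hooks[event] = hook

    def put(self, repository: str, ref: str, resource: str, skip_upload: bool = False) -> None:
        if skip_upload:
            # Nothing was uploaded, so no hook is invoked
            return
        # (a real implementation would transfer the file here)
        hook = self._hooks.get(ToyEvent.PUT_FILE)
        if hook is not None:
            hook(self.client, ToyContext(repository, ref, resource))


messages = []


def record_commit_message(client: Any, ctx: ToyContext) -> None:
    # Same shape as a commit hook: (client, context) -> None
    messages.append(f"Add file {ctx.resource} on {ctx.repository}/{ctx.ref}")


toy_fs = ToyFileSystem()
toy_fs.register_hook(ToyEvent.PUT_FILE, record_commit_message)
toy_fs.put("weather", "transform-raw-data", "weather-2010.json")
toy_fs.put("weather", "transform-raw-data", "weather-2010.json", skip_upload=True)
print(messages)
```

Only the first `put` call triggers the hook in this toy; the second one returns before any hook runs.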


When you execute the next cell, however, you will see a message indicating that the upload has been skipped because the file already exists in lakeFS and the checksums match. This is useful for large files because it reduces network traffic. In this specific situation, though, the PUT is not executed, and neither is our commit hook.


In [5]:
fs.put('./data/weather-2010s.json',  f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010.json')

Skipping upload of resource '/Users/maxmynter/Desktop/appliedAI/lakefs/spec/demos/data/weather-2010s.json' to remote path 'weather/transform-raw-data/weather-2010.json': Resource 'weather/transform-raw-data/weather-2010.json' exists and checksums match.
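
The idea behind such a checksum precheck can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the lakeFS-spec implementation: the helper names `file_checksum` and `should_upload` are hypothetical, and SHA-256 is chosen here only for the example (the hash actually compared by lakeFS may differ).

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files need not fit into memory."""
    digest = hashlib.sha256()  # assumption: hash algorithm picked for illustration
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_upload(local_path, remote_checksum):
    """Return True if the remote copy is missing or differs from the local file."""
    if remote_checksum is None:
        return True
    return file_checksum(local_path) != remote_checksum


with tempfile.TemporaryDirectory() as tmp:
    local_file = Path(tmp) / "weather.json"
    local_file.write_bytes(b'{"hourly": {}}')

    first_upload = should_upload(local_file, None)       # no remote copy yet
    stored_checksum = file_checksum(local_file)           # checksum kept by the "remote"
    repeated_upload = should_upload(local_file, stored_checksum)

    local_file.write_bytes(b'{"hourly": {"rain": [0.0]}}')
    changed_upload = should_upload(local_file, stored_checksum)

print(first_upload, repeated_upload, changed_upload)
```

An unchanged file is skipped, while a missing or modified file is transferred.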


We can circumvent this by disabling the checksum check for a specific put operation.

In [6]:
fs.put('./data/weather-2010s.json',  f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010.json', precheck=False)

Attempting Commit: Add file weather-2010.json


Now let's transform the data for our use case. We put the transformation into a function so that we can reuse it later.

In this notebook, we follow a simple toy example: predicting whether it is raining at the same time tomorrow, given the weather data from right now.

We will skip a lot of possible feature engineering and the like in order to focus on the application of lakeFS and lakeFS-spec.

In [7]:
import pandas as pd
import numpy as np
import json 

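def transform_json_weather_data(filepath):
    # Load the raw Open-Meteo response
    with open(filepath) as f:
        data = json.load(f)

    # The "hourly" entry holds one list per variable
    df = pd.DataFrame.from_dict(data["hourly"])
    df.time = pd.to_datetime(df.time)
    df['is_raining'] = df.rain > 0
    # NOTE: shift(24) gives each row the value from 24 hours earlier;
    # a label taken 24 hours later would require shift(-24).
    df['is_raining_in_1_day'] = df.is_raining.shift(24)
    # Drop the rows that have no shifted value
    df = df.dropna()
    return df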
    
df = transform_json_weather_data('./data/weather-2010s.json')
df.head(5)


Unnamed: 0,time,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
24,2010-01-02 00:00:00,-3.0,88,0.0,1004.0,999.2,100,54,100,70,8.3,17.3,18,31,False,False
25,2010-01-02 01:00:00,-3.2,89,0.0,1004.5,999.7,100,37,98,92,9.1,18.6,9,22,False,False
26,2010-01-02 02:00:00,-3.4,89,0.0,1005.2,1000.4,100,24,98,77,10.1,20.3,2,13,False,False
27,2010-01-02 03:00:00,-3.5,89,0.0,1005.6,1000.8,93,8,98,89,11.2,21.7,4,12,False,False
28,2010-01-02 04:00:00,-3.7,90,0.0,1006.2,1001.4,94,6,100,95,11.2,21.8,358,9,False,False
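
The direction of the shift decides whether a row is paired with a past or a future observation. The following pure-Python sketch shows both pairings on a tiny series; the function names are hypothetical. In pandas terms, pairing with the value `horizon` steps later corresponds to `Series.shift(-horizon)`, while `Series.shift(horizon)`, as used in the cell above, pairs each row with the value `horizon` steps earlier.

```python
def label_with_future_value(values, horizon):
    """Pair each value with the value `horizon` steps later.

    Rows whose future value lies beyond the end of the series are dropped.
    """
    return [(values[i], values[i + horizon]) for i in range(len(values) - horizon)]


def label_with_past_value(values, horizon):
    """Pair each value with the value `horizon` steps earlier.

    Rows whose past value lies before the start of the series are dropped.
    """
    return [(values[i], values[i - horizon]) for i in range(horizon, len(values))]


is_raining = [False, True, False, False, True, False]

future_pairs = label_with_future_value(is_raining, horizon=2)
past_pairs = label_with_past_value(is_raining, horizon=2)

print(future_pairs)
print(past_pairs)
```

For a forecasting target, the future pairing is the one that matches the column name `is_raining_in_1_day`.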


Let us now save this data as a CSV file to the `main` branch. The `.to_csv` method calls a `put` operation behind the scenes, so our commit hook is called and the file is committed. You can verify that the save worked in the lakeFS UI in your browser by looking at the Commits and Uncommitted Changes tabs of the `main` branch.

In [8]:
df.to_csv(f'lakefs://{REPO_NAME}/main/weather_2010s.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file weather_2010s.csv


We will now do a train-test split.

In [9]:
import sklearn.model_selection

model_data=df.drop('time', axis=1)

train, test = sklearn.model_selection.train_test_split(model_data)
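
Since this tutorial is about reproducibility, note that a random split differs between runs unless the shuffle is seeded (scikit-learn's `train_test_split` accepts a `random_state` argument for this purpose). The sketch below illustrates the principle with the standard library only. `seeded_split` is a hypothetical helper and does not claim to reproduce scikit-learn's exact splitting logic.

```python
import random


def seeded_split(rows, test_fraction=0.25, seed=0):
    """Shuffle `rows` with a fixed seed and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))

train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)

# The same seed gives the same split on every run
print(train_a == train_b, test_a == test_b)
print(len(train_a), len(test_a))
```

A fixed seed together with a versioned dataset makes the split itself repeatable.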

We save these train and test datasets to a new `training` branch. If, as in this case, the branch does not yet exist, it is created implicitly. You can control this behaviour via the `create_branch_ok` flag when initializing the `LakeFSFileSystem`. The flag defaults to `True`.

In [10]:
TRAINING_BRANCH = 'training'
train.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv')
test.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv')


Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Created new branch 'training' from branch 'main'.
Attempting Commit: Add file train_weather.csv
Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file test_weather.csv


The implicit branch creation is a convenient way to create new branches programmatically. However, one drawback is that a typo in a branch name silently creates a new branch. If you want such typos to raise errors instead, disable implicit branch creation.
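
Independent of the library setting, a typo can also be caught on the client side before any write happens. The following sketch is an assumption-laden illustration: `check_branch` and `KNOWN_BRANCHES` are hypothetical names, and the list of branches is maintained by hand here rather than queried from lakeFS.

```python
from difflib import get_close_matches

# Hypothetical allow-list of the branches used in this notebook
KNOWN_BRANCHES = ["main", "transform-raw-data", "training"]


def check_branch(branch, known_branches):
    """Return `branch` if it is known, otherwise raise with a spelling suggestion."""
    if branch in known_branches:
        return branch
    suggestions = get_close_matches(branch, known_branches, n=1)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(f"Unknown branch '{branch}'.{hint}")


checked = check_branch("training", KNOWN_BRANCHES)

try:
    check_branch("trainig", KNOWN_BRANCHES)
except ValueError as err:
    error_message = str(err)

print(checked)
print(error_message)
```

The checked name can then be interpolated into the `lakefs://` path in place of the raw string.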

We can now read these files directly from the remote lakeFS instance. (You can verify that neither the train nor the test file is saved in the `/data` directory.)

In [11]:
train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)

train.head()

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
34392,-2.2,95,0.0,1024.2,1019.3,34,7,0,93,7.5,18.2,235,252,False,False
1100,-3.8,89,0.0,1010.2,1005.3,81,90,0,0,5.4,8.0,138,162,False,False
50750,7.9,93,0.0,1016.4,1011.7,100,100,94,0,3.3,4.5,193,194,False,False
44851,-0.4,97,0.0,1023.1,1018.2,4,4,0,0,6.4,11.1,196,216,False,False
67062,15.4,87,0.0,1018.1,1013.5,9,0,5,19,7.9,15.8,90,107,False,False


We now proceed to train a random forest classifier and evaluate it on the test set. 

In [12]:
from sklearn.ensemble import RandomForestClassifier

dependent_variable = 'is_raining_in_1_day'

model = RandomForestClassifier()
x_train, y_train = train.drop(dependent_variable, axis=1), train[dependent_variable].astype(bool)
x_test, y_test = test.drop(dependent_variable, axis=1), test[dependent_variable].astype(bool)

model.fit(x_train, y_train)

test_acc = model.score(x_test, y_test)

print(f"Test accuracy: {test_acc * 100:.2f} %")

Test accuracy: 88.57 %


Until now, we have only used data from the 2010s. Let's get additional data from the 2020s, transform it, and save it to lakeFS.

In [13]:
!curl -o './data/weather-2020s.json' 'https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date=2020-01-01&end_date=2023-08-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m'

new_data = transform_json_weather_data('./data/weather-2020s.json')
new_data.to_csv(f'lakefs://{REPO_NAME}/main/weather_2020s.csv')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2308k    0 2308k    0     0  1187k      0 --:--:--  0:00:01 --:--:-- 1188k
Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file weather_2020s.csv


Let's test how well our model performs on 2020s data.

First, we drop the `time` column so that the data has the same variables as during the fit.

In [14]:
new_data = new_data.drop('time', axis=1)

In [15]:
acc_2020s = model.score(new_data.drop(dependent_variable, axis=1), new_data[dependent_variable].astype(bool))

print(f"Test accuracy: {acc_2020s * 100:.2f} %")

Test accuracy: 84.66 %


The accuracy is similar to the one we had on the 2010s data. Still, it makes sense to use this data for training as well. We will create a concatenated dataframe and perform a new train-test split.

However, this gives rise to a problem: we now have multiple models that will perform differently on different datasets. For example, if someone takes the model we are about to train and evaluates it on the 2020s data, the accuracy will probably be higher. But it will be higher because of data leakage, since some of the data points in the 2020s data will have been used for training.
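
Leakage of this kind can be detected by checking whether any evaluation row was also part of the training set. Below is a minimal sketch, assuming each row carries a unique identifier; the `(dataset, row number)` tuples and the helper name `leaked_rows` are made up for illustration.

```python
def leaked_rows(train_ids, eval_ids):
    """Return the row identifiers that occur in both the training and the evaluation set."""
    return sorted(set(train_ids) & set(eval_ids))


# Hypothetical identifiers: (source dataset, row number)
train_ids = [("2010s", 24), ("2010s", 25), ("2020s", 24), ("2020s", 30)]
eval_ids = [("2020s", 24), ("2020s", 25), ("2020s", 30)]

overlap = leaked_rows(train_ids, eval_ids)
leak_fraction = len(overlap) / len(eval_ids)

print(overlap)
print(f"{leak_fraction:.0%} of the evaluation rows were seen during training")
```

An empty overlap is a necessary condition for an evaluation score to be comparable across models.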

To circumvent this issue (or at least enable traceability and reproducibility), we should save the `ref` of the specific datasets.
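
One way to keep track of such refs is to store them next to the reported metric. The sketch below is only a suggestion of what such a record could look like: the class `EvaluationRecord`, its field names, the model name, and the commit ID placeholders are hypothetical. The accuracy value is the one printed earlier in this notebook.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvaluationRecord:
    """Hypothetical record tying a metric to the exact data versions behind it."""
    model_name: str
    repository: str
    train_ref: str   # commit ID (or other ref) of the training data
    test_ref: str    # commit ID (or other ref) of the evaluation data
    test_path: str
    accuracy: float

    def test_uri(self) -> str:
        # A commit ID can take the place of a branch name in the path
        return f"lakefs://{self.repository}/{self.test_ref}/{self.test_path}"


record = EvaluationRecord(
    model_name="random-forest-2010s",          # made-up name
    repository="weather",
    train_ref="<train-data-commit-id>",        # placeholder, fill in the real commit ID
    test_ref="<test-data-commit-id>",          # placeholder, fill in the real commit ID
    test_path="test_weather.csv",
    accuracy=0.8857,
)

serialized = json.dumps(asdict(record), indent=2)
restored = EvaluationRecord(**json.loads(serialized))

print(serialized)
print(restored.test_uri())
```

Because a commit ID is immutable, a record like this still points at the same data after the branch has moved on.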


We are going to do this now.

First, we create the new train-test split and save it in the training branch.


In [None]:
# Scratch work below: exploratory cells, executed out of order

In [16]:
df_train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
df_test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)


In [17]:
new_data

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
24,-2.0,88,0.0,1030.3,1025.4,0,0,0,0,9.6,15.9,236,257,False,False
25,-2.6,89,0.0,1030.1,1025.2,0,0,0,0,8.2,15.2,218,248,False,False
26,-3.0,90,0.0,1030.2,1025.3,0,0,0,0,8.2,16.4,218,241,False,False
27,-3.6,91,0.0,1030.0,1025.1,0,0,0,0,8.0,18.8,216,234,False,False
28,-3.7,90,0.0,1028.9,1024.0,0,0,0,0,9.4,21.4,212,228,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32131,14.3,81,0.0,1010.6,1006.0,58,10,58,46,16.3,28.2,242,247,False,False
32132,14.0,81,0.0,1011.1,1006.5,40,15,30,27,15.6,28.4,244,248,False,False
32133,13.8,82,0.0,1011.5,1006.9,37,20,20,23,15.2,27.7,248,252,False,False
32134,13.4,84,0.0,1012.0,1007.4,40,8,16,76,13.8,26.9,231,240,False,False


In [18]:
df_train

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
34392,-2.2,95,0.0,1024.2,1019.3,34,7,0,93,7.5,18.2,235,252,False,False
1100,-3.8,89,0.0,1010.2,1005.3,81,90,0,0,5.4,8.0,138,162,False,False
50750,7.9,93,0.0,1016.4,1011.7,100,100,94,0,3.3,4.5,193,194,False,False
44851,-0.4,97,0.0,1023.1,1018.2,4,4,0,0,6.4,11.1,196,216,False,False
67062,15.4,87,0.0,1018.1,1013.5,9,0,5,19,7.9,15.8,90,107,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61980,2.0,76,0.0,1031.9,1027.0,90,100,0,0,14.0,19.6,136,137,False,False
39969,19.8,80,0.3,1016.3,1011.8,42,2,24,87,13.8,18.5,28,29,True,False
63651,5.7,77,0.0,1017.7,1013.0,42,47,0,0,21.1,34.1,295,298,False,False
15110,16.6,66,0.0,1015.9,1011.4,60,22,67,0,22.0,33.1,269,270,False,False


In [19]:
df_test

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
60309,8.9,90,0.0,1003.3,998.7,88,8,84,100,20.7,36.7,200,204,False,False
32173,14.8,83,0.5,1015.2,1010.6,100,100,98,74,26.6,42.5,257,259,True,True
53242,7.3,81,0.0,1017.7,1013.0,20,22,0,0,23.0,37.3,284,285,False,False
8207,-2.5,94,0.0,1003.7,998.9,100,91,100,100,15.6,26.8,320,327,False,False
21504,10.3,89,0.0,1020.2,1015.5,75,26,86,0,5.7,8.9,198,238,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8797,0.6,90,0.0,1016.4,1011.6,54,60,0,0,12.9,20.9,288,291,False,False
82146,10.6,96,0.0,1014.4,1009.8,100,100,71,1,6.5,12.6,183,182,False,False
19685,5.9,86,0.5,999.9,995.3,100,97,48,11,31.7,50.2,279,281,True,False
33985,3.1,97,0.0,1026.0,1021.2,92,100,0,6,11.3,21.5,239,252,False,False


In [20]:
df

Unnamed: 0,time,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
24,2010-01-02 00:00:00,-3.0,88,0.0,1004.0,999.2,100,54,100,70,8.3,17.3,18,31,False,False
25,2010-01-02 01:00:00,-3.2,89,0.0,1004.5,999.7,100,37,98,92,9.1,18.6,9,22,False,False
26,2010-01-02 02:00:00,-3.4,89,0.0,1005.2,1000.4,100,24,98,77,10.1,20.3,2,13,False,False
27,2010-01-02 03:00:00,-3.5,89,0.0,1005.6,1000.8,93,8,98,89,11.2,21.7,4,12,False,False
28,2010-01-02 04:00:00,-3.7,90,0.0,1006.2,1001.4,94,6,100,95,11.2,21.8,358,9,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87643,2019-12-31 19:00:00,4.4,89,0.0,1032.3,1027.5,16,18,0,0,16.0,29.6,293,296,False,False
87644,2019-12-31 20:00:00,3.8,91,0.0,1032.5,1027.7,14,15,0,0,14.9,27.9,290,294,False,False
87645,2019-12-31 21:00:00,3.3,92,0.0,1032.9,1028.1,10,11,0,0,13.7,26.9,293,298,False,False
87646,2019-12-31 22:00:00,2.7,94,0.0,1033.2,1028.4,8,9,0,0,10.0,22.6,283,295,False,False


In [21]:
df = pd.concat([df, df_train, df_test])


In [22]:
df

Unnamed: 0,time,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
24,2010-01-02 00:00:00,-3.0,88,0.0,1004.0,999.2,100,54,100,70,8.3,17.3,18,31,False,False
25,2010-01-02 01:00:00,-3.2,89,0.0,1004.5,999.7,100,37,98,92,9.1,18.6,9,22,False,False
26,2010-01-02 02:00:00,-3.4,89,0.0,1005.2,1000.4,100,24,98,77,10.1,20.3,2,13,False,False
27,2010-01-02 03:00:00,-3.5,89,0.0,1005.6,1000.8,93,8,98,89,11.2,21.7,4,12,False,False
28,2010-01-02 04:00:00,-3.7,90,0.0,1006.2,1001.4,94,6,100,95,11.2,21.8,358,9,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8797,NaT,0.6,90,0.0,1016.4,1011.6,54,60,0,0,12.9,20.9,288,291,False,False
82146,NaT,10.6,96,0.0,1014.4,1009.8,100,100,71,1,6.5,12.6,183,182,False,False
19685,NaT,5.9,86,0.5,999.9,995.3,100,97,48,11,31.7,50.2,279,281,True,False
33985,NaT,3.1,97,0.0,1026.0,1021.2,92,100,0,6,11.3,21.5,239,252,False,False


In [23]:
train_df, test_df = sklearn.model_selection.train_test_split(df, test_size=0.9)

In [24]:
test_df

Unnamed: 0,time,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
72958,2018-04-28 22:00:00,12.1,83,0.0,1011.7,1007.1,74,0,78,91,7.2,18.2,297,297,False,False
29901,2013-05-30 21:00:00,14.8,90,1.2,1000.5,996.0,100,50,94,100,21.9,37.2,23,27,True,False
80798,2019-03-21 14:00:00,13.6,67,0.0,1032.9,1028.2,96,100,10,0,15.2,21.3,266,266,False,False
79504,2019-01-26 16:00:00,1.2,100,0.1,1001.6,996.9,100,100,100,99,7.0,15.0,201,224,True,False
25043,NaT,9.8,79,0.0,1017.6,1012.9,74,57,38,0,14.0,19.8,258,260,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20551,2012-05-06 07:00:00,6.8,78,0.0,1010.9,1006.2,100,100,16,84,10.1,13.8,17,15,False,True
1410,2010-02-28 18:00:00,7.8,73,0.0,987.3,982.8,43,2,67,2,25.5,45.0,209,212,False,True
52764,NaT,3.4,75,0.0,998.7,994.0,79,16,94,26,26.0,43.6,239,242,False,False
70265,NaT,4.8,91,0.0,1012.0,1007.3,100,64,95,44,5.4,11.5,262,290,False,False


In [25]:
test_df.isna().sum()

time                   78886
temperature_2m             0
relativehumidity_2m        0
rain                       0
pressure_msl               0
surface_pressure           0
cloudcover                 0
cloudcover_low             0
cloudcover_mid             0
cloudcover_high            0
windspeed_10m              0
windspeed_100m             0
winddirection_10m          0
winddirection_100m         0
is_raining                 0
is_raining_in_1_day        0
dtype: int64

In [26]:
train_df.to_csv(f'lakefs://{REPO_NAME}/new/train_weather.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Created new branch 'new' from branch 'main'.
Attempting Commit: Add file train_weather.csv


In [27]:
test_df.to_csv(f'lakefs://{REPO_NAME}/new/test_weather.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file test_weather.csv


In [28]:
df_train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
df_test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)

full_data = pd.concat([new_data, df_train, df_test])

train_df, test_df = sklearn.model_selection.train_test_split(full_data, test_size=0.1)


In [34]:
len(full_data)

119736

In [35]:
len(train_df), len(test_df)

(11973, 107763)

In [30]:
train_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train.csv')
test_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file train.csv
Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file test.csv


In [31]:
pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train.csv', index_col=0)

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
7687,2.4,95,0.3,1016.3,1011.5,100,100,90,24,16.0,26.3,44,48,True,True
4385,30.2,32,0.0,1016.7,1012.4,4,0,7,0,7.6,10.0,59,60,False,False
9053,2.5,97,0.0,1008.0,1003.3,100,100,4,93,7.6,18.2,251,261,False,True
29170,12.9,54,0.0,1022.9,1018.3,69,12,51,92,5.0,6.1,270,270,False,False
43962,-0.5,83,0.0,1025.6,1020.7,71,46,0,100,18.4,30.8,132,135,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80461,14.6,54,0.0,994.7,990.2,32,2,41,17,29.4,44.4,234,235,False,False
62439,3.6,55,0.0,1035.9,1031.1,0,0,0,0,4.4,9.9,99,109,False,False
26360,5.5,93,0.1,1020.2,1015.5,100,100,89,0,24.0,40.4,244,248,True,True
26201,3.3,84,0.0,1023.4,1018.6,21,10,7,25,15.3,29.1,243,248,False,True


In [32]:
pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test.csv', index_col=0)

Unnamed: 0_level_0,6.5,87,0.0,1016.3,1011.6,100,98,6,68,8.7,13.8,38,41,False,True
2951,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
11018,-0.4,80,0.0,1024.2,1019.3,1,0,0,3,7.9,9.8,267,262,False,False
11293,14.1,40,0.0,1022.6,1018.0,59,0,98,0,6.2,8.0,291,288,False,False
32076,23.3,39,0.0,1020.9,1016.4,45,7,21,87,5.8,7.1,300,300,False,False
47356,9.1,76,0.0,1017.0,1012.3,100,27,82,91,9.7,19.5,195,210,False,False
83550,16.9,83,0.0,1015.3,1010.8,73,61,18,23,14.5,20.1,333,333,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8255,0.3,91,0.0,1012.4,1007.6,100,90,4,75,11.6,19.4,85,89,False,False
24208,17.4,67,0.0,1015.3,1010.8,12,0,0,41,16.1,29.7,243,241,False,False
35962,7.4,66,0.0,995.7,991.1,83,2,85,99,17.0,31.0,174,183,False,False
80593,6.2,50,0.2,1000.2,995.6,100,49,95,99,22.0,37.9,201,204,True,False


In [37]:
len(pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train.csv', index_col=0)), len(pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test.csv', index_col=0))

(11973, 33986)

In [47]:
test_df

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
8659,1.4,84,0.0,1016.8,1012.0,49,4,35,81,18.6,33.0,212,216,False,False
12137,22.1,41,0.0,1020.9,1016.4,13,0,5,32,10.5,15.0,63,63,False,True
11028,8.5,59,0.0,1020.3,1015.6,74,82,0,0,15.5,20.6,258,258,False,True
47672,17.2,58,0.0,1027.8,1023.2,59,24,30,65,4.3,5.0,85,90,False,False
5609,24.5,70,0.0,1011.5,1007.1,41,3,15,97,8.5,17.2,258,259,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19347,3.2,89,0.0,1028.8,1024.0,4,0,0,13,12.7,27.1,335,327,False,False
10402,8.2,69,0.0,1012.7,1008.0,29,0,1,93,18.3,26.0,212,214,False,False
14720,15.5,81,0.0,1023.4,1018.8,37,37,3,6,11.3,15.6,107,109,False,False
5091,13.3,81,0.0,1019.5,1014.9,18,0,1,57,5.4,13.6,290,302,False,False


In [40]:
test_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file test_weather.csv
No changes to commit on branch 'training', aborting commit.


In [None]:
######

In [None]:
df_train = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv', index_col=0)
df_test = pd.read_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv', index_col=0)

full_data = pd.concat([new_data, df_train, df_test])

train_df, test_df = sklearn.model_selection.train_test_split(full_data, test_size=0.9)

train_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/train_weather.csv')
test_df.to_csv(f'lakefs://{REPO_NAME}/{TRAINING_BRANCH}/test_weather.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file train_weather.csv
Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file test_weather.csv


In [None]:
df_train

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,...,44,1,71,0,1.1,1.5,18,14,False,False.1
4908,19.0,76.0,0.3,1012.6,1008.1,91.0,1.0,100.0,100.0,20.8,...,,,,,,,,,,
1240,7.0,59.0,0.0,1023.5,1018.8,86.0,71.0,0.0,72.0,18.4,...,,,,,,,,,,
4713,17.6,77.0,0.3,1012.1,1007.6,100.0,27.0,98.0,91.0,11.2,...,,,,,,,,,,
21247,9.9,86.0,0.0,1010.5,1005.9,100.0,36.0,91.0,100.0,9.4,...,,,,,,,,,,
54155,2.5,98.0,0.6,995.9,991.2,100.0,100.0,100.0,97.0,18.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1664,,,,,,,,,,,...,70.0,78.0,0.0,0.0,11.3,15.3,27.0,27.0,False,False
10836,8.1,53.0,0.0,1012.4,1007.7,23.0,24.0,3.0,0.0,16.5,...,,,,,,,,,,
28161,10.1,84.0,0.9,1015.3,1010.7,54.0,7.0,29.0,100.0,7.5,...,,,,,,,,,,
70608,-3.0,95.0,0.0,1003.9,999.1,81.0,0.0,87.0,95.0,6.6,...,,,,,,,,,,


In [None]:
df_train.isna().sum()

temperature_2m          3327
relativehumidity_2m     3327
rain                    3327
pressure_msl            3327
surface_pressure        3327
cloudcover              3327
cloudcover_low          3327
cloudcover_mid          3327
cloudcover_high         3327
windspeed_10m           3327
windspeed_100m          3327
winddirection_10m       3327
winddirection_100m      3327
is_raining              3327
is_raining_in_1_day     3327
23.8                   13241
39                     13241
0.0                    13241
1025.4                 13241
1020.9                 13241
44                     13241
1                      13241
71                     13241
0                      13241
1.1                    13241
1.5                    13241
18                     13241
14                     13241
False                  13241
False.1                13241
dtype: int64

In [None]:
train_df

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,...,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
57644,,,,,,,,,,,...,,,,,,,,,,
29373,15.5,94.0,0.0,1012.6,1008.1,35.0,0.0,22.0,71.0,7.1,...,,,,,,,,,,
74334,14.3,84.0,0.0,1019.9,1015.3,86.0,96.0,0.0,0.0,11.1,...,,,,,,,,,,
46257,10.2,55.0,0.0,1030.0,1025.3,78.0,0.0,80.0,100.0,17.2,...,,,,,,,,,,
202,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10354,3.3,64.0,0.0,1022.6,1017.8,90.0,100.0,0.0,0.0,23.9,...,,,,,,,,,,
77091,,,,,,,,,,,...,,,,,,,,,,
79573,3.0,72.0,0.0,1004.2,999.5,64.0,66.0,7.0,0.0,15.8,...,,,,,,,,,,
40397,18.5,83.0,0.0,1008.3,1003.8,24.0,2.0,11.0,51.0,15.5,...,,,,,,,,,,


In [None]:
test_df.to_csv(f'lakefs://{REPO_NAME}/weird2/test.csv')
fs.get(f'{REPO_NAME}/weird2/test.csv', './data/tts_weird.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file test.csv
No changes to commit on branch 'weird2', aborting commit.
Skipping download of resource 'weather/weird2/test.csv' to local path '/Users/maxmynter/Desktop/appliedAI/lakefs/spec/demos/data/tts_weird.csv': Resource '/Users/maxmynter/Desktop/appliedAI/lakefs/spec/demos/data/tts_weird.csv' exists and checksums match.


In [None]:
pd.read_csv(f'lakefs://{REPO_NAME}/weird2/test.csv', index_col=0)

Unnamed: 0_level_0,-10.9,76.0,0.0,1028.6,1023.5,0.0.1,0.0.2,0.0.3,1.0,12.6,...,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57
28271,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
73791,,,,,,,,,,,...,,,,,,,,,,
2197,8.1,48.0,0.0,1018.6,1013.9,25.0,26.0,2.0,0.0,16.6,...,,,,,,,,,,
42146,11.2,93.0,0.0,1015.1,1010.5,100.0,95.0,31.0,0.0,11.2,...,,,,,,,,,,
47494,,,,,,,,,,,...,,,,,,,,,,
68528,11.6,88.0,0.0,1022.5,1017.9,100.0,33.0,95.0,93.0,15.3,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14152,26.5,35.0,0.0,1020.0,1015.6,7.0,0.0,12.0,0.0,6.6,...,,,,,,,,,,
63178,9.5,77.0,0.0,1010.4,1005.8,100.0,95.0,85.0,84.0,15.5,...,,,,,,,,,,
28096,,,,,,,,,,,...,,,,,,,,,,
47082,12.6,67.0,0.0,1018.5,1013.9,100.0,100.0,88.0,0.0,18.0,...,,,,,,,,,,


In [None]:
#del
train_df.to_csv(f'./data/weird/tts_train.csv')
train_df.to_csv(f'lakefs://{REPO_NAME}/weird/tts_train.csv')
tts_weird_from_lakefs = pd.read_csv(f'lakefs://{REPO_NAME}/weird/tts_train.csv', index_col=0)
tts_weird_from_local =  pd.read_csv(f'./data/weird/tts_train.csv', index_col=0)

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.
Attempting Commit: Add file tts_train.csv


In [None]:
tts_weird_from_lakefs

Unnamed: 0_level_0,23.8,39,0.0,1025.4,1020.9,44,1,71,0,1.1,1.5,18,14,False,False.1
4787,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
49125,20.1,52,0.0,1019.5,1015.0,0,0,0,0,13.6,23.7,37,42,False,False
72473,16.1,62,0.0,1009.5,1005.0,0,0,0,0,6.4,13.6,128,140,False,False
12801,21.0,64,0.0,1003.0,998.6,24,18,13,0,18.9,25.9,231,231,False,False
17223,2.6,81,0.0,1015.0,1010.2,75,83,0,0,13.2,23.4,235,239,False,True
35026,3.2,86,0.0,1027.3,1022.5,0,0,0,0,11.1,19.9,205,212,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42855,6.6,73,0.0,1025.8,1021.1,47,52,1,0,2.4,1.8,153,169,False,False
26258,4.9,72,0.0,1011.8,1007.1,46,0,31,91,17.7,34.5,193,200,False,False
83658,23.6,43,0.0,1011.0,1006.6,42,0,63,14,5.4,12.6,98,103,False,False
16541,11.1,90,0.0,1022.8,1018.1,100,100,39,0,18.4,30.9,274,277,False,False


In [None]:
tts_weird_from_local

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
14508,15.7,81,0.4,1010.6,1006.1,91,63,45,24,10.7,14.8,40,43,True,True
81470,19.6,31,0.0,1029.5,1024.9,0,0,0,0,15.5,20.9,89,89,False,False
11749,13.7,58,0.0,999.3,994.8,86,9,88,85,28.7,44.2,219,220,False,False
15894,6.2,91,0.0,1016.6,1011.9,91,100,1,0,20.0,33.8,104,107,False,False
37633,7.7,59,0.0,1008.2,1003.6,93,3,100,100,13.8,26.2,237,243,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26056,-2.8,73,0.0,1018.1,1013.2,90,100,0,1,13.9,20.2,100,101,False,False
35031,3.1,84,0.0,1025.9,1021.1,3,3,0,0,11.3,23.4,173,181,False,True
26416,7.2,75,0.0,1017.0,1012.3,60,66,1,0,20.9,34.2,299,300,False,True
62027,-1.2,70,0.0,1021.1,1016.2,5,6,0,0,12.5,17.4,139,140,False,False


In [None]:
#del
tts_train = pd.read_csv(f'lakefs://{REPO_NAME}/weird/tts_train.csv')
tts_train

Unnamed: 0,68368,14.9,85,0.0,1014.4,1009.8,0,0.1,0.2,0.3,7.6,17.7,115,123,False,False.1
0,68127,12.2,66,0.0,1015.2,1010.6,54,20,38,43,16.9,27.5,281,284,False,False
1,28356,1.0,56,0.0,1016.8,1012.0,34,0,9,95,18.6,26.6,80,81,False,False
2,2839,14.4,67,0.0,1017.1,1012.5,36,0,11,99,10.0,15.6,201,206,False,False
3,79586,-3.9,93,0.0,1002.9,998.1,32,3,0,97,8.4,20.0,130,141,False,False
4,6992,8.5,84,0.0,1016.5,1011.8,100,100,96,72,20.0,33.4,252,256,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33981,23028,30.2,35,0.0,1011.1,1006.8,34,0,51,12,5.1,6.1,8,3,False,False
33982,41851,15.2,84,0.0,1016.7,1012.1,88,0,99,94,5.4,6.4,184,218,False,False
33983,51784,4.1,92,0.0,1009.6,1004.9,67,41,13,73,18.0,32.4,267,271,False,False
33984,46006,2.3,80,0.0,1010.1,1005.4,52,43,22,1,21.0,35.3,263,266,False,False


The steps above concatenated the old data with the new data, created a new train-test split, and overwrote the files. This is a problem for strict versioning: when we fetch the data using only the branch name and filename, we always get the latest commit.

Let's use explicit versioning and pin the exact commits instead. To do so, open the lakeFS GUI, select the training branch, and choose the "Commits" tab.

You should see the two latest commits, "Add file test_weather.csv" and "Add file train_weather.csv".

Copy the ID of each commit to your clipboard and paste it below.

In [None]:
test_commit_id  = '984e23fd19349d2722a27445c00e45d11a137c7ab213bf2655967897c5144b60'
train_commit_id = '14187cd6c0efd48349ead20d901ba2be753d8093a3ef828f0a57684e3b67b3bd'

In [None]:
train = pd.read_csv(f"lakefs://{REPO_NAME}/{train_commit_id}/train_weather.csv", index_col=0)
test = pd.read_csv(f"lakefs://{REPO_NAME}/{test_commit_id}/test_weather.csv", index_col=0)
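
Pasting commit IDs by hand is easy to get wrong (a stray space, a truncated ID, or a branch name by mistake). The following is a minimal sketch of a check that runs before the URI is built. It assumes that full lakeFS commit IDs are 64-character lowercase hexadecimal strings, as the two IDs above are; `pinned_uri` is a hypothetical helper for this notebook, not part of lakeFS-spec.

```python
import re

# Assumption: a full commit ID is 64 lowercase hex characters, like the IDs pasted above.
COMMIT_ID_PATTERN = re.compile(r"[0-9a-f]{64}")


def pinned_uri(repo, commit_id, path):
    """Build a lakefs:// URI that points at one specific commit (hypothetical helper)."""
    commit_id = commit_id.strip()  # tolerate whitespace picked up from the clipboard
    if not COMMIT_ID_PATTERN.fullmatch(commit_id):
        raise ValueError(f"not a full commit ID: {commit_id!r}")
    return f"lakefs://{repo}/{commit_id}/{path}"


train_commit_id = "14187cd6c0efd48349ead20d901ba2be753d8093a3ef828f0a57684e3b67b3bd"
print(pinned_uri("weather", train_commit_id, "train_weather.csv"))

# A branch name is rejected, so an unpinned read cannot slip through unnoticed.
try:
    pinned_uri("weather", "main", "train_weather.csv")
except ValueError as err:
    print(err)
```

In the cell above, one could then write `pd.read_csv(pinned_uri(REPO_NAME, train_commit_id, "train_weather.csv"), index_col=0)`.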

In [None]:
#del
test

Unnamed: 0,temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,is_raining,is_raining_in_1_day
18947,9.4,92,0.0,1022.0,1017.3,100,100,41,4,12.8,18.7,302,304,False,True
6502,10.8,84,0.0,1009.1,1004.5,59,0,77,42,9.2,20.5,201,211,False,False
63643,8.1,65,0.0,1018.2,1013.5,15,11,8,0,17.7,31.4,294,297,False,False
85317,13.0,96,0.0,1007.4,1002.8,43,10,15,84,8.3,18.0,252,258,False,False
38104,16.7,52,0.0,1008.4,1003.9,17,11,12,1,14.0,20.2,271,272,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13634,12.4,86,0.5,1002.9,998.4,100,100,100,99,24.9,40.6,260,263,True,True
34803,1.6,90,0.0,1028.3,1023.5,12,0,0,39,15.6,30.8,218,226,False,False
23551,15.9,84,0.9,1009.8,1005.3,51,3,31,100,10.5,19.1,354,344,True,False
74856,18.6,81,0.0,1012.1,1007.6,45,1,31,86,9.1,19.6,18,26,False,False


In [None]:
train

Unnamed: 0_level_0,14.9,85,0.0,1014.4,1009.8,0,0.1,0.2,0.3,7.6,17.7,115,123,False,False.1
68368,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
68127,12.2,66,0.0,1015.2,1010.6,54,20,38,43,16.9,27.5,281,284,False,False
28356,1.0,56,0.0,1016.8,1012.0,34,0,9,95,18.6,26.6,80,81,False,False
2839,14.4,67,0.0,1017.1,1012.5,36,0,11,99,10.0,15.6,201,206,False,False
79586,-3.9,93,0.0,1002.9,998.1,32,3,0,97,8.4,20.0,130,141,False,False
6992,8.5,84,0.0,1016.5,1011.8,100,100,96,72,20.0,33.4,252,256,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23028,30.2,35,0.0,1011.1,1006.8,34,0,51,12,5.1,6.1,8,3,False,False
41851,15.2,84,0.0,1016.7,1012.1,88,0,99,94,5.4,6.4,184,218,False,False
51784,4.1,92,0.0,1009.6,1004.9,67,41,13,73,18.0,32.4,267,271,False,False
46006,2.3,80,0.0,1010.1,1005.4,52,43,22,1,21.0,35.3,263,266,False,False
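
The column labels of `train` look like data values (`14.9`, `85`, ..., `0.1`, `False.1`). One way to get exactly this picture is a CSV file whose header line is missing: pandas then promotes the first data row to column names and appends `.1`, `.2`, ... to duplicates. The sketch below only reproduces that symptom in memory; it does not establish why the header would be missing in the file stored in lakeFS. The first row is reconstructed from the labels shown above under that assumption, and `row_id` is a made-up name for the index column.

```python
import io

import pandas as pd

COLUMNS = [
    "temperature_2m", "relativehumidity_2m", "rain", "pressure_msl",
    "surface_pressure", "cloudcover", "cloudcover_low", "cloudcover_mid",
    "cloudcover_high", "windspeed_10m", "windspeed_100m",
    "winddirection_10m", "winddirection_100m", "is_raining",
    "is_raining_in_1_day",
]

# Two data rows and no header line (assumed file content, see the lead-in).
raw = (
    "68368,14.9,85,0.0,1014.4,1009.8,0,0,0,0,7.6,17.7,115,123,False,False\n"
    "68127,12.2,66,0.0,1015.2,1010.6,54,20,38,43,16.9,27.5,281,284,False,False\n"
)

# Default behaviour: the first line is taken as the header, so one data row is lost.
mangled = pd.read_csv(io.StringIO(raw), index_col=0)
print(list(mangled.columns)[5:9])  # ['0', '0.1', '0.2', '0.3']
print("is_raining_in_1_day" in mangled.columns)  # False

# Supplying the names explicitly keeps both rows and restores the labels.
fixed = pd.read_csv(
    io.StringIO(raw), header=None, names=["row_id"] + COLUMNS, index_col="row_id"
)
print(len(fixed), "is_raining_in_1_day" in fixed.columns)  # 2 True
```

If the stored file really lacks its header, rewriting it with the header is the cleaner fix; passing `names=` is only a workaround on the reading side.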


In [None]:
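# Separate the label first, then fit on the feature columns only.
y_train = train[dependent_variable].astype(bool)
model.fit(train.drop(dependent_variable, axis=1), y_train)

test_acc = model.score(test.drop(dependent_variable, axis=1), test[dependent_variable].astype(bool))

print(f"Test accuracy: {round(test_acc * 100, 2)} %")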

KeyError: 'is_raining_in_1_day'
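
The `KeyError` appears because `train` has no column called `is_raining_in_1_day`. A small check right after loading makes this kind of problem fail earlier and with a clearer message. This is a minimal sketch; `require_columns` is a hypothetical helper, and the label lists are copied from the `test` and `train` outputs above.

```python
def require_columns(columns, expected):
    """Raise a descriptive error if any expected column label is absent."""
    present = {str(label) for label in columns}
    missing = [label for label in expected if label not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")


# Column labels as displayed for `train` above.
train_labels = [
    "14.9", "85", "0.0", "1014.4", "1009.8", "0", "0.1", "0.2", "0.3",
    "7.6", "17.7", "115", "123", "False", "False.1",
]
# A few of the labels displayed for `test` above.
test_labels = ["temperature_2m", "rain", "is_raining", "is_raining_in_1_day"]

require_columns(test_labels, ["is_raining_in_1_day"])  # passes silently

try:
    require_columns(train_labels, ["is_raining_in_1_day"])
except ValueError as err:
    print(err)
```

In the notebook, the call would be `require_columns(train.columns, [dependent_variable])` directly after `pd.read_csv`.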