In this notebook, we will work through a small end-to-end data science tutorial that uses lakeFS-spec for data versioning.

The necessary dependencies are listed in `demo-requirements.txt`.

Let us first get the environment up and running. You can create it by running `python -m venv .demo-venv` on the command line. Then activate it and run `pip install -r demo-requirements.txt` to install the dependencies.

Next, let us set up lakeFS. You can do this by running the `docker run` command from the [lakeFS quickstart](https://docs.lakefs.io/quickstart/launch.html) in a terminal of your choice. Then create an empty repository. You may call it `weather` or something similar (if you pick a different name, adjust `REPO_NAME` in the first code cell accordingly). Leave the default settings as they are and create the repository.


We will also install the lakeFS CLI, `lakectl`, so that lakeFS-spec can handle authentication automatically. Open a terminal of your choice and run `brew install lakefs`, then run `lakectl config`. You will find the authentication information in the terminal window where you started the lakeFS Docker container. Note: for this to work, you need the `pyyaml` package, which is not a default dependency of lakeFS-spec. Here it is installed via `demo-requirements.txt`, but in your own projects you need to add the dependency manually if you want to use the `lakectl` authentication.
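To make the role of the `lakectl` configuration more concrete, here is a minimal sketch of how such a configuration could be mapped to the `host`, `username`, and `password` options used later in this notebook. The function name `lakectl_config_to_storage_options` is hypothetical, and the nested layout of the configuration (`credentials` and `server` sections) is an assumption about the file written by `lakectl config`. This is not lakeFS-spec's actual implementation.

```python
# Sketch only: the config layout and the function name are assumptions.

def lakectl_config_to_storage_options(config: dict) -> dict:
    """Translate a parsed lakectl config into host/username/password options."""
    credentials = config["credentials"]
    return {
        "host": config["server"]["endpoint_url"],
        "username": credentials["access_key_id"],
        "password": credentials["secret_access_key"],
    }


# Illustrative values, using the credentials that appear later in this notebook.
# With pyyaml installed, the dict could instead be obtained by passing the
# contents of the lakectl config file to yaml.safe_load.
example_config = {
    "credentials": {
        "access_key_id": "AKIAIOSFOLQUICKSTART",
        "secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
    "server": {"endpoint_url": "http://localhost:8000"},
}

storage_options_from_lakectl = lakectl_config_to_storage_options(example_config)
print(storage_options_from_lakectl["host"])
```

Parsing the YAML file is the step that requires `pyyaml`, which is why the package has to be present for the `lakectl`-based authentication.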

In [1]:
REPO_NAME = 'weather'

Now it's time to get some data. We will use the [Open-Meteo API](https://open-meteo.com/), which provides weather data for free (for non-commercial use) and without an API token.

For training, we fetch the full hourly data for the 2010s. (I am writing this in Munich right now ;), but note that the coordinates in the request below, latitude 52.52 and longitude 13.41, point to Berlin.)

In [2]:
!curl -o './data/weather-2010s.json' 'https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date=2010-01-01&end_date=2019-12-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6297k    0 6297k    0     0  1397k      0 --:--:--  0:00:04 --:--:-- 1417k
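The long query string in the `curl` call is easier to read and modify when it is assembled from its parts. The following is a minimal sketch that rebuilds the same URL in Python; the helper name `build_archive_url` and the constant names are hypothetical, while the endpoint, coordinates, dates, and variable names are taken from the request above.

```python
from urllib.parse import urlencode

ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"

# Hourly variables requested in the curl call above.
HOURLY_VARIABLES = [
    "temperature_2m", "relativehumidity_2m", "rain", "pressure_msl",
    "surface_pressure", "cloudcover", "cloudcover_low", "cloudcover_mid",
    "cloudcover_high", "windspeed_10m", "windspeed_100m",
    "winddirection_10m", "winddirection_100m",
]


def build_archive_url(latitude, longitude, start_date, end_date,
                      hourly=HOURLY_VARIABLES):
    """Assemble the archive request URL from its query parameters."""
    query = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "hourly": ",".join(hourly),
    }
    # safe="," keeps the commas in the variable list unescaped, as in the curl call.
    return f"{ARCHIVE_ENDPOINT}?{urlencode(query, safe=',')}"


url = build_archive_url(52.52, 13.41, "2010-01-01", "2019-12-31")
print(url)
```

The resulting `url` could then be downloaded with `urllib.request.urlretrieve(url, './data/weather-2010s.json')`, provided the `./data` directory already exists.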


The data is in JSON format, so we need to wrangle it a bit to make it usable. But first, we will save it to our lakeFS instance. We will create a new branch, `transform-raw-data`, and upload the raw file to it.

In [3]:
from lakefs_spec import LakeFSFileSystem

NEW_BRANCH_NAME = 'transform-raw-data'

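# Credentials of the local lakeFS instance. They are defined here for reference
# but are not passed to the file system below, which relies on the
# `lakectl` configuration for authentication.
storage_options = {
    "host": "localhost:8000",
    "username": "AKIAIOSFOLQUICKSTART",
    "password": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}

fs = LakeFSFileSystem(host="localhost:8000")
# Upload the raw JSON file to the new branch.
fs.put('./data/weather-2010s.json', f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010s.json')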



Now, in the lakeFS web UI in your browser, you can switch to the `transform-raw-data` branch and see the saved file.
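The upload can also be checked programmatically instead of in the browser. The sketch below compares the size of the local file with the size reported for the remote object. It relies only on the generic fsspec file system methods `exists` and `info` and assumes that `info` returns a dictionary with a `size` entry, as is the fsspec convention. The helper name `upload_matches_local` is hypothetical.

```python
import os


def upload_matches_local(fs, local_path: str, remote_path: str) -> bool:
    """Return True if the remote object exists and has the same size as the local file."""
    if not fs.exists(remote_path):
        return False
    # Assumption: fs.info() returns a dict with a "size" entry (fsspec convention).
    return fs.info(remote_path)["size"] == os.path.getsize(local_path)


# Possible usage with the objects defined in the cell above:
# upload_matches_local(
#     fs,
#     './data/weather-2010s.json',
#     f'{REPO_NAME}/{NEW_BRANCH_NAME}/weather-2010s.json',
# )
```

A size comparison is only a coarse check; it confirms that something of the right length arrived, not that the contents are identical.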

Now let's transform the data for our use case. 

In [4]:
import json

import pandas as pd

# Load the raw JSON response; the context manager closes the file automatically.
with open('./data/weather-2010s.json') as f:
    data = json.load(f)

df = pd.DataFrame.from_dict(data["hourly"])
df.head()


| | time | temperature_2m | relativehumidity_2m | rain | pressure_msl | surface_pressure | cloudcover | cloudcover_low | cloudcover_mid | cloudcover_high | windspeed_10m | windspeed_100m | winddirection_10m | winddirection_100m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-01T00:00 | -2.6 | 88 | 0.0 | 996.9 | 992.1 | 100 | 100 | 97 | 75 | 16.0 | 27.2 | 54 | 58 |
| 1 | 2010-01-01T01:00 | -2.7 | 88 | 0.0 | 996.4 | 991.6 | 100 | 99 | 96 | 49 | 16.3 | 28.0 | 55 | 58 |
| 2 | 2010-01-01T02:00 | -2.7 | 88 | 0.0 | 996.2 | 991.4 | 100 | 96 | 94 | 60 | 16.3 | 27.5 | 55 | 58 |
| 3 | 2010-01-01T03:00 | -2.7 | 88 | 0.0 | 996.1 | 991.3 | 100 | 97 | 96 | 83 | 15.4 | 26.6 | 53 | 57 |
| 4 | 2010-01-01T04:00 | -2.7 | 88 | 0.0 | 996.0 | 991.2 | 100 | 92 | 98 | 82 | 14.8 | 25.6 | 47 | 52 |
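The `hourly` object in the response is column-oriented: each key maps to a list with one value per timestamp, which is why it can be turned into a DataFrame directly. Here is a minimal sketch on a two-row excerpt of the values shown above. Parsing the `time` strings into timestamps is an optional extra step that is not part of the original notebook; the names `hourly_sample` and `sample_df` are only illustrative.

```python
import pandas as pd

# Two-row excerpt in the same column-oriented layout as data["hourly"].
hourly_sample = {
    "time": ["2010-01-01T00:00", "2010-01-01T01:00"],
    "temperature_2m": [-2.6, -2.7],
    "relativehumidity_2m": [88, 88],
}

sample_df = pd.DataFrame.from_dict(hourly_sample)

# Optional: convert the ISO 8601 strings into proper timestamps.
sample_df["time"] = pd.to_datetime(sample_df["time"])

print(sample_df.dtypes)
```

With a timestamp column, time-based operations such as resampling or selecting date ranges become available, should the later modeling steps need them.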


Let us now save this data as a CSV file to the `main` branch. You can verify that the upload worked in the lakeFS web UI in your browser.

In [5]:
df.to_csv(f'lakefs://{REPO_NAME}/main/weather_2010s.csv')

Calling open() in write mode results in unbuffered file uploads, because the lakeFS Python client does not support multipart uploads. Note that uploading large files unbuffered can have performance implications.


In [9]:
import sklearn.model_selection
train, test = sklearn.model_selection.train_test_split(df)