# Using lakeFS-spec to interact with lakeFS

### lakeFS-spec makes versioned data available via a filesystem interface

In [1]:
import pandas as pd

df = pd.read_parquet("lakefs://pydata-hn/main/lakes.parquet")
df.head()

Unnamed: 0,Hylak_id,Lake_name,Country,Depth_m
0,1,Caspian Sea,Russia,1025.0
1,2,Great Bear,Canada,446.0
2,3,Great Slave,Canada,614.0
3,4,Winnipeg,Canada,36.0
4,5,Superior,United States of America,406.0
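
A lakeFS URI names the repository, then a ref (a branch, tag, or commit ID, all of which are used in this notebook), then the object path. The sketch below splits such a URI with the standard library only. `split_lakefs_uri` is a hypothetical helper written for illustration and is not part of lakeFS-spec.

```python
from urllib.parse import urlparse


def split_lakefs_uri(uri: str) -> tuple[str, str, str]:
    """Split lakefs://<repository>/<ref>/<path> into its three parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "lakefs":
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    # The repository sits in the netloc position; the remainder is "<ref>/<path>".
    ref, _, path = parsed.path.lstrip("/").partition("/")
    return parsed.netloc, ref, path


repository, ref, path = split_lakefs_uri("lakefs://pydata-hn/main/lakes.parquet")
print(repository, ref, path)
```

Swapping `main` for a tag name or a commit ID in the URI addresses the same path at a different repository state.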

### The same way, we can write files

In [2]:
german_lakes = df[df['Country'] == "Germany"]
german_lakes.head()

Unnamed: 0,Hylak_id,Lake_name,Country,Depth_m
1186,1187,Muritz,Germany,18.461577
13367,13368,,Germany,99.121739
13382,13383,,Germany,29.004134
13399,13400,,Germany,16.637063
13423,13424,,Germany,79.615976


In [3]:
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

In [4]:
with fs.transaction("pydata-hn", "main") as tx:
    german_lakes.to_parquet(f"lakefs://{tx.repository}/{tx.branch.id}/german_lakes.parquet")
    tx.commit(message="Extract German lakes")

### We can access arbitrary files with `open()`

In [5]:
import json
from pathlib import Path

# Parse the local file first so that a JSON object, not a quoted string, is written.
data = json.loads(Path("experiment.json").read_text())

with fs.transaction("pydata-hn", "main") as tx:
    with fs.open(f"lakefs://{tx.repository}/{tx.branch.id}/experiment.json", "w") as f:
        json.dump(data, f)
    tx.commit(message="Add experiment json")
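
A note on `json.dump`: it serializes a Python object, so text read from a JSON file has to be parsed before it is dumped again. The sketch below shows the difference with `io.StringIO` standing in for the handle returned by `fs.open()`. The payload is made up for illustration.

```python
import io
import json

# Hypothetical content of a local experiment.json file.
local_text = '{"model": "baseline", "seed": 42}'

# Parse first, then dump: the target holds a JSON object.
parsed_target = io.StringIO()
json.dump(json.loads(local_text), parsed_target)

# Dumping the raw text stores a single quoted JSON string instead.
raw_target = io.StringIO()
json.dump(local_text, raw_target)

print(type(json.loads(parsed_target.getvalue())).__name__)
print(type(json.loads(raw_target.getvalue())).__name__)
```

If the file only needs to be copied byte for byte, writing the text with `f.write()` avoids the parse step altogether.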

### With the transaction API, we can perform complex versioning operations

In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df)

In [7]:
with fs.transaction(
    "pydata-hn",
    base_branch="main",
    branch_name="demo-experiment",
    automerge=False,
    delete="never",
) as tx:
    train.to_csv(f"lakefs://{tx.repository}/{tx.branch.id}/train.csv")
    test.to_csv(f"lakefs://{tx.repository}/{tx.branch.id}/test.csv")
    
    commit = tx.commit(message="Create train test split")
print(commit)

Reference(repository="pydata-hn", id="3036efd02f0e5feb5aa3f4ac4f257682dfc07bb7ef60bde9484d9d08f238bff6")


### We can also merge branches and reference repository states using tags

In [9]:
with fs.transaction("pydata-hn", "main") as tx:
    tx.merge(source_ref="main", into="demo-experiment")
    tag = tx.tag(ref=commit.id, name="PyDataDemo")
print(tag)

Tag(repository="pydata-hn", id="PyDataDemo")


In [10]:
test_df = pd.read_csv("lakefs://pydata-hn/PyDataDemo/test.csv", index_col=0)
test_df

Unnamed: 0,Hylak_id,Lake_name,Country,Depth_m
12289,12290,,Finland,27.749067
22651,22652,,Canada,15.471857
42077,42078,,Canada,12.280825
78372,78373,,Canada,11.237657
53567,53568,,Canada,19.290865
...,...,...,...,...
8736,8737,,Canada,27.715888
25266,25267,,Canada,14.689580
87150,87151,,Canada,15.636295
64261,64262,,Canada,19.456921


### We can use unique identifiers for automated versioning

In [11]:
print(commit)

Reference(repository="pydata-hn", id="3036efd02f0e5feb5aa3f4ac4f257682dfc07bb7ef60bde9484d9d08f238bff6")


In [12]:
df = pd.read_parquet(f"lakefs://pydata-hn/{commit.id}/lakes.parquet")
df

Unnamed: 0,Hylak_id,Lake_name,Country,Depth_m
0,1,Caspian Sea,Russia,1025.000000
1,2,Great Bear,Canada,446.000000
2,3,Great Slave,Canada,614.000000
3,4,Winnipeg,Canada,36.000000
4,5,Superior,United States of America,406.000000
...,...,...,...,...
99995,99996,,Canada,6.580317
99996,99997,,Canada,9.083196
99997,99998,,Canada,20.822592
99998,99999,,Canada,10.418225
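
Since a commit ID pins one specific repository state, it can be stored next to experiment results so that the same input data can be reloaded later. A minimal sketch using only the standard library follows. The `record_data_version` helper and the record layout are assumptions for illustration; the commit ID is the one printed above.

```python
import json


def record_data_version(repository: str, commit_id: str, path: str) -> str:
    """Return a JSON record that pins an input file to one commit."""
    record = {
        "repository": repository,
        "commit_id": commit_id,
        # The commit ID takes the place of the branch name in the URI.
        "uri": f"lakefs://{repository}/{commit_id}/{path}",
    }
    return json.dumps(record, indent=2)


commit_id = "3036efd02f0e5feb5aa3f4ac4f257682dfc07bb7ef60bde9484d9d08f238bff6"
print(record_data_version("pydata-hn", commit_id, "lakes.parquet"))
```

The stored `uri` can be passed to `pd.read_parquet` in the same way as the branch-based URIs earlier in this notebook.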


### Summary

lakeFS & lakeFS-spec
- Easy read and write operations using lakeFS URIs through a filesystem interface
- Git-style versioning and collaboration features
- Transactions as a safeguarded way to programmatically conduct versioning operations

Niceties
- Automatic authentication discovery
- Caching for uploads and downloads

# Questions?
![Our GitHub Repo](lakefs-spec-github-qrcode.png)

`pip install lakefs-spec`