# Using lakeFS-spec to interact with lakeFS

### lakeFS-spec makes versioned data available via a filesystem interface

After installing lakeFS-spec (`pip install lakefs-spec`), you can reference objects in lakeFS storage through URIs with the `lakefs://` scheme. The installation registers this scheme with fsspec.
That way, without even importing lakefs-spec, any library that uses fsspec under the hood, such as pandas or DuckDB, can work with file paths like the one in the next cell to fetch data from remote storage locations.

In [None]:
import pandas as pd

df = pd.read_parquet("lakefs://pydata-hn/main/lakes.parquet")
df.head()

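The URI above follows the pattern `lakefs://<repository>/<ref>/<path>`, where the ref can be a branch, a tag, or a commit ID. As a minimal sketch of that layout, the hypothetical helper below (`split_lakefs_uri` is our own illustration, not part of lakeFS-spec) splits a URI into its three parts using plain string operations.

```python
from typing import NamedTuple


class LakeFSLocation(NamedTuple):
    """Hypothetical container for the parts of a lakefs:// URI."""

    repository: str
    ref: str
    path: str


def split_lakefs_uri(uri: str) -> LakeFSLocation:
    """Split 'lakefs://<repository>/<ref>/<path>' into its components."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs URI: {uri!r}")
    # The first segment is the repository, the second the ref,
    # and everything after that is the object path.
    repository, _, rest = uri[len(prefix):].partition("/")
    ref, _, path = rest.partition("/")
    if not repository or not ref:
        raise ValueError(f"URI needs a repository and a ref: {uri!r}")
    return LakeFSLocation(repository, ref, path)


location = split_lakefs_uri("lakefs://pydata-hn/main/lakes.parquet")
print(location.repository, location.ref, location.path)
```

In practice you rarely need to parse URIs yourself; the sketch is only meant to show which part of the string selects the repository and which selects the version.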

### The same way, we can write files

Files can be written in the same way. However, to keep the remote lakeFS repository in a clean state even if a versioning operation fails, we recommend performing versioning operations inside the transaction context manager.

To start a transaction, we first instantiate a `LakeFSFileSystem` object and then call its `.transaction()` method, which returns a context manager.
Under the hood, the context manager collects the versioning operations as placeholder actions. Once you exit the transaction, the operations are sent to lakeFS in a batch. That way, possible errors are caught before any instructions reach lakeFS, which avoids leaving the repository in a dangling state.

Additionally, each transaction creates a temporary branch (which can optionally be persisted) on which the operations are performed. This safeguard prevents corrupting the repository state should errors occur while the versioning operations are executed on the remote.

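To make the "collect first, send on exit" idea concrete, here is a toy sketch using only the standard library. `ToyTransaction` and `toy_transaction` are made-up names for illustration; this is not how lakeFS-spec is implemented, only a small model of the behavior described above.

```python
from contextlib import contextmanager


class ToyTransaction:
    """Illustrative stand-in that queues operations instead of running them."""

    def __init__(self):
        self.pending = []  # placeholder actions collected inside the with-block
        self.sent = []     # operations that were "sent to the server"

    def commit(self, message):
        self.pending.append(("commit", {"message": message}))

    def tag(self, ref, name):
        self.pending.append(("tag", {"ref": ref, "name": name}))


@contextmanager
def toy_transaction():
    tx = ToyTransaction()
    yield tx
    # This point is reached only if the with-block finished without raising,
    # so a failing block never sends its queued operations.
    tx.sent.extend(tx.pending)
    tx.pending.clear()


with toy_transaction() as ok:
    ok.commit(message="Extract German lakes")

failed = None
try:
    with toy_transaction() as failed:
        failed.commit(message="never sent")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(ok.sent)      # one queued commit was flushed on exit
print(failed.sent)  # nothing was flushed because the block raised
```

The design choice to defer work until the block exits is what lets an error inside the block cancel the whole batch.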
In [None]:
german_lakes = df[df['Country'] == "Germany"]
german_lakes.head()

In [None]:
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

In [None]:
with fs.transaction("pydata-hn", "main") as tx:
    german_lakes.to_parquet(f"lakefs://{tx.repository}/{tx.branch.id}/german_lakes.parquet")
    tx.commit(message="Extract German lakes")

### We can access arbitrary files with `open()`

To access arbitrary files without relying on a library's fsspec integration, we can use the file system's `open()` method, which works like Python's built-in `open()`.

In [None]:
import json
from pathlib import Path

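with fs.transaction("pydata-hn", "main") as tx:
    # Parse the local file first so that json.dump writes the JSON document
    # itself rather than a JSON-encoded string of its text.
    data = json.loads(Path("experiment.json").read_text())
    with fs.open(f"lakefs://{tx.repository}/{tx.branch.id}/experiment.json", "w") as f:
        json.dump(data, f)
    tx.commit(message="Add experiment json")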

### With the transaction API, we can perform complex versioning operations

The transaction API supports the complex versioning operations available in lakeFS, which you may know from tools like Git.
Namely, on top of reading and writing files, within an `fs.transaction` you can `.commit`, `.revert`, `.merge`, and `.tag` a repository state reference (e.g. a commit or branch), and use `.rev_parse` to parse reference expressions.

You can also access the current `.branch`, `.base_branch`, `.repository`, and the files on the branch with `.files`.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df)

In [None]:
with fs.transaction(
    "pydata-hn",
    base_branch="main",
    branch_name="demo-experiment",
    automerge=False,
    delete="never",
) as tx:
    train.to_csv(f"lakefs://{tx.repository}/{tx.branch.id}/train.csv")
    test.to_csv(f"lakefs://{tx.repository}/{tx.branch.id}/test.csv")
    
    commit = tx.commit(message="Create train test split")
print(commit)

### We can also merge branches and reference repository states using tags

As seen above, the `commit` object holds a unique SHA that identifies a specific state of the data repository. Tags provide human-readable references to such states, while the unique identifiers themselves are well suited for automated versioning and experiment tracking.

Tags are immutable, so you cannot accidentally break code elsewhere by reassigning a tag.

In [None]:
with fs.transaction("pydata-hn", "main") as tx:
    tx.merge(source_ref="main", into="demo-experiment")
    tag = tx.tag(ref=commit.id, name="PyDataDemo")
print(tag)

In [None]:
test_df = pd.read_csv("lakefs://pydata-hn/PyDataDemo/test.csv", index_col=0)
test_df

### We can use unique identifiers for automated versioning

In [None]:
print(commit)

In [None]:
df = pd.read_parquet(f"lakefs://pydata-hn/{commit.id}/lakes.parquet")
df
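
One way to use these identifiers for experiment tracking is to store the pinned data URI next to the run's parameters. The sketch below uses only the standard library; `pinned_uri`, `run_record`, the commit ID `0123abcd`, and the parameter values are all made up for illustration and are not part of lakeFS-spec.

```python
import json


def pinned_uri(repository: str, ref: str, path: str) -> str:
    """Build a lakefs:// URI that points at one fixed repository state."""
    return f"lakefs://{repository}/{ref}/{path.lstrip('/')}"


def run_record(repository: str, commit_id: str, path: str, params: dict) -> dict:
    """Bundle the exact data version with the parameters of a run."""
    return {
        "data_uri": pinned_uri(repository, commit_id, path),
        "data_version": commit_id,
        "params": params,
    }


# "0123abcd" is a placeholder; in the notebook you would pass commit.id instead.
record = run_record("pydata-hn", "0123abcd", "lakes.parquet", {"test_size": 0.25})
print(json.dumps(record, indent=2))
```

Because a commit ID never changes, reading `record["data_uri"]` later should return the same data that the run originally saw.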

### Summary

lakeFS & lakeFS-spec
- Easy read and write operations by adding lakeFS URIs to your filesystem
- Git-style versioning and collaboration features
- Transactions as a safeguarded way to programmatically conduct versioning operations

Niceties
- Automatic authentication discovery
- Caching for uploads and downloads

You can visit the lakeFS-spec [GitHub Repository](https://github.com/aai-institute/lakefs-spec) and our [documentation](https://lakefs-spec.org/latest/).
You can start using lakeFS-spec now with:
`pip install lakefs-spec`