# User story

In [24]:
%load_ext autoreload
%autoreload 2

import elliptio as eio
from pprint import pprint
from datetime import datetime, UTC, timedelta
import pandas as pd
from pathlib import Path
import git

pd.set_option('display.max_colwidth', None)


In [20]:
class DummyModel:
    def train(self, train_path: Path):
        # Placeholder: a real model would read the training data here.
        del train_path

    def save(self, url: Path):
        Path(url).write_text("dummy_model")


In [26]:
repo_path = Path(git.Repo(search_parent_directories=True).working_dir)

## Save files

Andy is an ambitious data scientist who analyzes data and trains models. He has just prepared new `train.txt` and `test.txt` files using a special stratification method. He saves them with the following API call:

In [31]:
h = eio.Handler(remote_file_cls=eio.LocalFile)

with eio.mock_username("andy"):
    artifact = h.upload(
        local_paths=[
            repo_path / "docs/train.txt",
            repo_path / "docs/test.txt",
        ],
        labels=eio.Labels(
            ticket="AB-123",
            project="awesome-project",
            datatype="annotation_file",
            infos={"stratification": "new special method"},
        ),
    )

Note that he doesn't need to specify a destination: the destination path is generated automatically, and the S3 bucket is defined by an environment variable. The labels are optional.
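To make the automatic destination more concrete, here is a minimal sketch. The layout `<root>/<YYYY>/<MM>/<DD>/<username>/<HHMMSS>_<artifact_id>` is inferred from the `remote_root` values in the outputs on this page. The function name `build_remote_root` and the environment variable `ELLIPTIO_ROOT` are hypothetical and not part of the elliptio API.

```python
import os
from datetime import datetime


def build_remote_root(
    root: str, username: str, artifact_id: str, creation_time: datetime
) -> str:
    """Compose <root>/YYYY/MM/DD/<username>/<HHMMSS>_<artifact_id>."""
    date_part = creation_time.strftime("%Y/%m/%d")
    time_part = creation_time.strftime("%H%M%S")
    # Plain string joining keeps URL-style roots such as "s3://bucket" intact.
    return "/".join(
        [root.rstrip("/"), date_part, username, f"{time_part}_{artifact_id}"]
    )


# ELLIPTIO_ROOT is a made-up variable name used only for this illustration.
root = os.environ.get("ELLIPTIO_ROOT", "/home/cgebbe/tmp/elliptio")
print(
    build_remote_root(
        root=root,
        username="andy",
        artifact_id="artifact_b4c31f3a-c25d-4dda-aa74-cab7245116f8",
        creation_time=datetime(2023, 12, 3, 13, 42, 55),
    )
)
```

Encoding the date and username in the path keeps artifacts browsable even without the metadata search.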

## Searching files by metadata

Brenda wants to use Andy's files. However, Andy started his Antarctica crossing last week and hasn't told anybody where he stored his work. So Brenda searches for artifacts that:

- were generated by Andy
- are labeled with `datatype=annotation_file`
- were generated within the last 7 days
- contain `*.txt` files

In [13]:
df = h.find(
    query={
        "username": "andy",
        "labels.datatype": "annotation_file",
        "creation_time": {"$gte": datetime.now(tz=UTC) - timedelta(days=7)},
        # Raw string, so the backslash reaches the regex engine unchanged.
        "file_relpaths": {"$regex": r"\.txt"},
    },
    max_docs=1,
)
df

artifact_id,creation_time,username,file_relpaths,argv,based_on,hostname,labels,local_root,log_relpaths,remote_root,run_id,version
artifact_b4c31f3a-c25d-4dda-aa74-cab7245116f8,2023-12-03 13:42:55.338,andy,"[train.txt, test.txt]",/home/cgebbe/venvs/elliptio/bin/python -m ipykernel_launcher --f=/home/cgebbe/.local/share/jupyter/runtime/kernel-v2-3296HzpU9aTlH5J3.json,[],cgebbe,"{'datatype': 'annotation_file', 'ticket': 'AB-123', 'project': 'awesome-project', 'dataset': '', 'description': '', 'infos': {'stratification': 'new cool method'}}",,[],/home/cgebbe/tmp/elliptio/2023/12/03/andy/134255_artifact_b4c31f3a-c25d-4dda-aa74-cab7245116f8,run_2ea13cb6-61b9-41ef-82f7-b58acf119180,1


Based on the displayed results, Brenda is pretty sure she found the correct files. She notes down the unique `artifact_id`.
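The query keys used in the search (`$gte`, `$regex`, and dotted paths such as `labels.datatype`) follow MongoDB's filter syntax. Whether elliptio forwards the query to MongoDB unchanged is an assumption here. As a rough illustration of how such a filter selects a metadata document, the sketch below implements a tiny pure-Python matcher; `get_nested`, `matches_condition`, and `matches` are hypothetical helpers that support only the operators used above.

```python
import re
from datetime import datetime, timedelta, timezone


def get_nested(doc: dict, dotted_key: str):
    """Resolve a dotted path such as 'labels.datatype'; return None if missing."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches_condition(value, condition) -> bool:
    """Check one field; list fields match if any element matches."""
    values = value if isinstance(value, list) else [value]
    if not isinstance(condition, dict):
        return any(v == condition for v in values)
    for op, operand in condition.items():
        if op == "$gte":
            ok = any(v is not None and v >= operand for v in values)
        elif op == "$regex":
            ok = any(isinstance(v, str) and re.search(operand, v) for v in values)
        else:
            raise ValueError(f"unsupported operator in this sketch: {op}")
        if not ok:
            return False
    return True


def matches(doc: dict, query: dict) -> bool:
    """A document matches when every key in the query matches."""
    return all(matches_condition(get_nested(doc, k), c) for k, c in query.items())


# Fixed timestamps keep the example deterministic.
now = datetime(2023, 12, 4, tzinfo=timezone.utc)
example_doc = {
    "username": "andy",
    "labels": {"datatype": "annotation_file"},
    "creation_time": datetime(2023, 12, 3, 13, 42, 55, tzinfo=timezone.utc),
    "file_relpaths": ["train.txt", "test.txt"],
}
example_query = {
    "username": "andy",
    "labels.datatype": "annotation_file",
    "creation_time": {"$gte": now - timedelta(days=7)},
    "file_relpaths": {"$regex": r"\.txt"},
}
print(matches(example_doc, example_query))
```

All conditions are combined with a logical AND, so changing any single field (for example, the username) makes the document drop out of the result.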

## Using files

To use the files, Brenda fetches the artifact by its unique ID:

In [17]:
artifact_id = df.index[0]
artifact = h.get(artifact_id)
pprint(artifact)
pd.read_csv(artifact.files["test.txt"].remote_url)

# She could have also downloaded the file
# artifact.files["test.txt"].download()
# artifact.files["test.txt"].download_string()

Artifact(metadata=Metadata(username='andy',
                           hostname='cgebbe',
                           argv='/home/cgebbe/venvs/elliptio/bin/python -m '
                                'ipykernel_launcher '
                                '--f=/home/cgebbe/.local/share/jupyter/runtime/kernel-v2-3296HzpU9aTlH5J3.json',
                           artifact_id='artifact_b4c31f3a-c25d-4dda-aa74-cab7245116f8',
                           run_id='run_2ea13cb6-61b9-41ef-82f7-b58acf119180',
                           creation_time=datetime.datetime(2023, 12, 3, 13, 42, 55, 338000),
                           python_packages={'GitPython': '3.1.40',
                                            'PyYAML': '6.0.1',
                                            'Pygments': '2.16.1',
                                            'assertpy': '1.1',
                                            'asttokens': '2.4.1',
                                            'black': '23.10.1',
                  

Unnamed: 0,path,class
0,/path/to/image0.png,0
1,/path/to/image1.png,1


# Reproducibility

Let's assume that she got an error upon opening the file. Could it be that Andy used another pandas version?

In [9]:
artifact.metadata.python_packages["pandas"]

'2.1.3'

Yes, Andy used `2.1.3`, whereas Brenda is using `1.4.0`. To be on the safe side, she quickly matches all her Python libraries to Andy's. She could also check out the same git commit and apply any uncommitted local changes from `git diff` (not yet implemented).
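Matching the libraries by hand is tedious, so a small helper can do the comparison. The sketch below assumes that `artifact.metadata.python_packages` is a plain `{name: version}` dict, as the output above suggests. The helpers `installed_packages`, `version_mismatches`, and `as_requirements` are hypothetical and not part of elliptio.

```python
from importlib import metadata


def installed_packages() -> dict[str, str]:
    """Map each installed distribution name to its version."""
    return {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip broken installs without a name
    }


def version_mismatches(
    recorded: dict[str, str], installed: dict[str, str]
) -> dict[str, tuple]:
    """Return {name: (recorded_version, installed_version_or_None)} for differences."""
    return {
        name: (version, installed.get(name))
        for name, version in recorded.items()
        if installed.get(name) != version
    }


def as_requirements(recorded: dict[str, str]) -> list[str]:
    """Render pinned requirement lines such as 'pandas==2.1.3'."""
    return [f"{name}=={version}" for name, version in sorted(recorded.items())]


# Illustrative values taken from the story; not read from a real environment.
recorded = {"pandas": "2.1.3", "PyYAML": "6.0.1"}
brenda = {"pandas": "1.4.0", "PyYAML": "6.0.1"}
print(version_mismatches(recorded, brenda))
print("\n".join(as_requirements(recorded)))
```

In practice, Brenda would pass `installed_packages()` as the second argument and write the pinned lines to a requirements file for `pip install -r`. Note that distribution names may differ in case or in `-` versus `_`, which this sketch does not normalize.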

After she successfully opened the file, she trains a model and saves it:

In [10]:
with eio.mock_username("brenda"):
    model = DummyModel()
    model.train(train_path=artifact.files["train.txt"].remote_url)
    with h.new(["model.pt"]) as new:
        model.save(new.file.remote_url)

## Searching files by lineage

Once Andy comes back from his Antarctica crossing, he wants to know whether anybody actually used the newly stratified dataset he created:

In [18]:
h.find(query={"based_on": artifact_id})

artifact_id,creation_time,username,file_relpaths,argv,based_on,hostname,labels,local_root,log_relpaths,remote_root,run_id,version
artifact_04312c0b-26a7-4f6d-8790-9c03bb7912b1,2023-12-03 13:43:08.551,brenda,[model.pt],/home/cgebbe/venvs/elliptio/bin/python -m ipykernel_launcher --f=/home/cgebbe/.local/share/jupyter/runtime/kernel-v2-3296HzpU9aTlH5J3.json,[artifact_b4c31f3a-c25d-4dda-aa74-cab7245116f8],cgebbe,"{'datatype': '', 'ticket': '', 'project': '', 'dataset': '', 'description': '', 'infos': {}}",,[],/home/cgebbe/tmp/elliptio/2023/12/03/brenda/134308_artifact_04312c0b-26a7-4f6d-8790-9c03bb7912b1,run_2ea13cb6-61b9-41ef-82f7-b58acf119180,1


He's happy to see that Brenda continued his work and checks whether his method led to a better model (for example, by following the data lineage to the evaluation metrics generated by Charlie).
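Following the lineage over several hops amounts to repeating the `based_on` search for each result. The sketch below shows this as a breadth-first walk. The function `descendants` and the callback `find_children` are hypothetical, and the toy graph with its artifact names stands in for real search results.

```python
from collections import deque
from typing import Callable, Iterable


def descendants(
    start_id: str, find_children: Callable[[str], Iterable[str]]
) -> list[str]:
    """Breadth-first walk over all artifacts derived from start_id."""
    seen = {start_id}
    order = []
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for child in find_children(current):
            if child not in seen:  # guard against visiting an artifact twice
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Toy lineage: dataset -> model -> metrics (made-up IDs for illustration).
toy_graph = {
    "artifact_dataset": ["artifact_model"],
    "artifact_model": ["artifact_metrics"],
    "artifact_metrics": [],
}
lineage = descendants("artifact_dataset", lambda aid: toy_graph.get(aid, []))
print(lineage)
```

With elliptio, `find_children` could plausibly be `lambda aid: h.find(query={"based_on": aid}).index`, since `h.find` returns a DataFrame indexed by `artifact_id`; this wiring is untested and should be treated as an assumption.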

Note that the lineage between `train.txt` and `model.pt` in Brenda's run was tracked automatically. Every time `h.get(artifact_id)` is called, that artifact ID is added to the `h.based_on` list.
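The idea behind this automatic tracking can be sketched with a toy handler. This is not elliptio's implementation: `ToyHandler` and `ToyArtifact` are hypothetical classes that only mimic the behavior described above, and skipping duplicate IDs is a choice made for this sketch.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyArtifact:
    artifact_id: str
    based_on: list[str] = field(default_factory=list)


class ToyHandler:
    """Illustrative stand-in for the handler; not elliptio's implementation."""

    def __init__(self) -> None:
        self.store: dict[str, ToyArtifact] = {}
        self.based_on: list[str] = []

    def get(self, artifact_id: str) -> ToyArtifact:
        # Reading an artifact marks it as an input of whatever is created next.
        if artifact_id not in self.based_on:
            self.based_on.append(artifact_id)
        return self.store[artifact_id]

    def new(self) -> ToyArtifact:
        # A new artifact receives a copy of every ID read so far.
        artifact = ToyArtifact(f"artifact_{uuid.uuid4()}", list(self.based_on))
        self.store[artifact.artifact_id] = artifact
        return artifact


toy = ToyHandler()
dataset = toy.new()           # Andy's upload: nothing was read beforehand
toy.get(dataset.artifact_id)  # Brenda reads the dataset
model = toy.new()             # Brenda's model inherits the lineage
print(model.based_on == [dataset.artifact_id])
```

Because the list lives on the handler, the user never has to declare inputs explicitly; the trade-off is that anything read through the same handler counts as an input, whether or not it influenced the result.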
