# Remote Data Lakes

Sharing and versioning a dataset is impractical when it contains large binary data, so
pipelime lets you store the data in a remote data lake and transparently download it
when an item is accessed. You can then dump just the remote data addresses, i.e., JSON
text files that are easy to version and share.
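
The idea of downloading on access can be illustrated with a minimal, standard-library-only sketch. The `LazyRemoteItem` class below is hypothetical and is not pipelime's implementation: it only holds a `file://` address and reads the payload the first time it is called, caching the result afterwards.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


class LazyRemoteItem:
    """Hypothetical stand-in for a lazily loaded item (not a pipelime class)."""

    def __init__(self, url: str):
        self.url = url  # the only thing that would need to be versioned
        self._data = None  # nothing is read until the first access
        self.fetch_count = 0

    def __call__(self) -> bytes:
        if self._data is None:
            # Resolve the file:// address to a local path and read it once.
            path = Path(url2pathname(urlparse(self.url).path))
            self._data = path.read_bytes()
            self.fetch_count += 1
        return self._data


with tempfile.TemporaryDirectory() as lake:
    # A temporary folder plays the role of the data lake.
    blob = Path(lake) / "blob.bin"
    blob.write_bytes(b"large binary payload")

    item = LazyRemoteItem(blob.resolve().as_uri())
    first = item()  # triggers the read
    second = item()  # served from the cache
```

Only the address string has to be shared; the payload stays in the data lake until someone actually asks for it.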

## File Remote Example

For this example we'll use a filesystem-based remote data lake, i.e., a local folder where
the data will be stored. First, let's create that folder:

In [None]:
from pathlib import Path

Path("local-remote").mkdir(exist_ok=True)

Now upload a dataset to the remote and save the resulting remote data files:

In [None]:
!pipelime remote-add +input.folder ../../tests/sample_data/datasets/underfolder_minimnist +output.folder uploaded_dataset +output.exists_ok +remotes.url "file://localhost/local-remote"
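
To see the effect of the upload, one can compare the two folders. Assuming the command above succeeded, `uploaded_dataset` is expected to hold mostly small address files, while `local-remote` holds the binary payloads. The `folder_summary` helper is introduced here for illustration only and is not part of pipelime.

```python
from pathlib import Path


def folder_summary(folder):
    """Return (file count, total size in bytes) for all files under `folder`."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


for name in ("uploaded_dataset", "local-remote"):
    if Path(name).is_dir():  # skip quietly if the upload cell was not run
        count, size = folder_summary(name)
        print(f"{name}: {count} files, {size} bytes")
```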

We can iterate through the dataset as usual:

In [None]:
from pipelime.sequences import SamplesSequence
from PIL import Image
from IPython.display import display

for idx, s in enumerate(
    SamplesSequence.from_underfolder("uploaded_dataset").enumerate()[:10:2]  # type: ignore
):
    print(f"{idx:>02d}: Sample #{int(s['~idx']()):>02d}", flush=True)
    display(Image.fromarray(s["image"]()), Image.fromarray(s["mask"]()))


Next, we download the data and save it locally. By default, serialization writes the
remote addresses when they are available, so we must force the creation of new
files:

In [None]:
!pipelime clone +input.folder uploaded_dataset +output.folder downloaded_dataset +output.exists_ok +output.serialization.override.CREATE_NEW_FILE null

Finally, check that the original, uploaded, and downloaded datasets match:

In [None]:
import numpy as np


def _check_numpy(s1, s2, s3, k):
    """Assert that item `k` holds the same array in all three samples."""
    assert np.array_equal(s1[k](), s2[k](), equal_nan=True)
    assert np.array_equal(s2[k](), s3[k](), equal_nan=True)


for s1, s2, s3 in zip(
    SamplesSequence.from_underfolder(  # type: ignore
        "../../tests/sample_data/datasets/underfolder_minimnist"
    ),
    SamplesSequence.from_underfolder("uploaded_dataset"),  # type: ignore
    SamplesSequence.from_underfolder("downloaded_dataset"),  # type: ignore
):
    assert list(s1.keys()) == list(s2.keys()) == list(s3.keys())
    _check_numpy(s1, s2, s3, "image")
    _check_numpy(s1, s2, s3, "mask")
    _check_numpy(s1, s2, s3, "label")
    _check_numpy(s1, s2, s3, "points")
    assert s1["metadata"]() == s2["metadata"]() == s3["metadata"]()
    assert s1["cfg"]() == s2["cfg"]() == s3["cfg"]()
    _check_numpy(s1, s2, s3, "numbers")
    _check_numpy(s1, s2, s3, "pose")
    for k in s1:
        assert s1[k].is_shared == s2[k].is_shared == s3[k].is_shared

print("Everything is OK!")
