# Pipelime Samples And Sequences

At the core of pipelime's dataset management is the concept of a sequence of samples.
All the functionality you need is packed into the `SamplesSequence` class, whose
operational methods, e.g., `shuffle`, `sort`, etc., are **dynamically** generated from
internal and external definitions (more on this later). Therefore, **you won't find them
by just looking at the code**. Instead, you can list them from the command line using
`pipelime list` or by explicitly calling the printer from an interactive session:

In [None]:
from pipelime.cli.utils import print_sequence_operators_list

print_sequence_operators_list()

As you can see above, we have two kinds of sequence operations:
1. **generators**: class-methods that generate a sequence of samples
1. **pipes**: instance-methods that append an operation to the sequence

Note that *piped operations* follow the general pipelime approach of *returning* a new
object and leaving the original unchanged. This usually adds no significant overhead.
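
To make the two ideas above concrete, here is a minimal, self-contained sketch of how methods can be attached to a class at runtime and how a piped operation can return a new object instead of mutating the original. `ToySequence`, `register_generator`, `register_pipe` and `from_range` are hypothetical names invented for this illustration. They are not part of pipelime, and the library's actual registration mechanism may well differ.

```python
import random
from typing import Callable, List


class ToySequence:
    """Hypothetical stand-in for a samples sequence (NOT pipelime's class)."""

    def __init__(self, samples: List[int]):
        self._samples = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> int:
        return self._samples[idx]

    def to_list(self) -> List[int]:
        return list(self._samples)


def register_generator(name: str):
    """Attach `fn` to ToySequence as a class-method that creates a sequence."""

    def decorator(fn: Callable[..., List[int]]):
        def generator(cls, *args, **kwargs):
            return cls(fn(*args, **kwargs))

        setattr(ToySequence, name, classmethod(generator))
        return fn

    return decorator


def register_pipe(name: str):
    """Attach `fn` to ToySequence as an instance-method returning a NEW sequence."""

    def decorator(fn: Callable[..., List[int]]):
        def pipe(self, *args, **kwargs):
            # work on a copy, so the sequence we were called on is never modified
            return type(self)(fn(self.to_list(), *args, **kwargs))

        setattr(ToySequence, name, pipe)
        return fn

    return decorator


@register_generator("from_range")
def _from_range(n: int) -> List[int]:
    return list(range(n))


@register_pipe("shuffle")
def _shuffle(samples: List[int], seed: int = 0) -> List[int]:
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


original = ToySequence.from_range(5)
shuffled = original.shuffle(seed=42)
print("original:", original.to_list())  # still in the initial order
print("shuffled:", shuffled.to_list())
```

Neither `from_range` nor `shuffle` appears in the body of `ToySequence`, which mirrors why such methods cannot be found by reading the class definition alone.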

To get the list of arguments required for a given operation, use the usual
`pipelime help <operation>` from the command line or the
`pipelime.cli.utils.print_command_op_stage_info` Python function. For example, let's look
at the arguments needed to load an underfolder dataset and shuffle it:

In [None]:
from pipelime.cli.utils import print_command_op_stage_info

print_command_op_stage_info("from_underfolder")
print_command_op_stage_info("shuffle")

## A Simple Data Pipe

In [None]:
from pipelime.sequences import SamplesSequence
from PIL import Image
from IPython.display import display

seq = SamplesSequence.from_underfolder(  # type: ignore
    "../../tests/sample_data/datasets/underfolder_minimnist",
    # merge_root_items=False,  # see below about toggling this comment
)
print("Before shuffling:", flush=True)
display(Image.fromarray(seq[0]['image']()))

seq = seq.shuffle(seed=42)[2:6]
print("After shuffling and slicing:", flush=True)
display(Image.fromarray(seq[0]['image']()))

In the code above we have also accessed a sample by its index and an item by its
name. Note that what you get from `seq[0]['image']` is a `pipelime.items.Item` object,
so to get its value you have to call it, i.e., `seq[0]['image']()`. Where the actual data
comes from may vary:
1. first, if the data has already been loaded and cached, the cached value is returned
1. then, all file sources are checked
1. finally, remote data lakes are accessed
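
The lookup order above can be sketched with a small, self-contained toy class. `ToyItem` and its loader callables are hypothetical and exist only to illustrate the fallback chain. This is not how `pipelime.items.Item` is implemented, and the `cache_data` flag here only mimics the property mentioned in this document.

```python
from typing import Any, Callable, List, Optional

Loader = Callable[[], Optional[Any]]


class ToyItem:
    """Hypothetical lazy item: cached value, then file sources, then remotes."""

    def __init__(
        self,
        file_sources: List[Loader],
        remote_sources: List[Loader],
        cache_data: bool = True,
    ):
        self._file_sources = file_sources
        self._remote_sources = remote_sources
        self._cache_data = cache_data
        self._cached: Optional[Any] = None

    def __call__(self) -> Any:
        # 1. a previously loaded value wins
        if self._cached is not None:
            return self._cached
        # 2. local file sources first, 3. remote sources last
        for loader in [*self._file_sources, *self._remote_sources]:
            value = loader()
            if value is not None:
                if self._cache_data:
                    self._cached = value
                return value
        raise RuntimeError("no source could provide the data")


calls: List[str] = []


def missing_file() -> Optional[str]:
    calls.append("file")
    return None  # simulate a file source that is not available


def remote() -> Optional[str]:
    calls.append("remote")
    return "payload"


item = ToyItem([missing_file], [remote])
first = item()   # falls through the file source and hits the remote
second = item()  # served from the cached value, no loader is called
print(first, second, calls)
```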

The usual approach is to create a new object every time the data changes,
so **you should not really care where the data comes from**. However, to reduce the memory
footprint, you can disable item data caching by setting the `Item.cache_data` property
or by using the `pipelime.items.no_data_cache` context manager:

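```
from pipelime.items import BinaryItem, ImageItem, NumpyItem, no_data_cache

# disable data cache for all items
with no_data_cache():
    ...

# disable only for BinaryItem and NumpyItem
with no_data_cache(BinaryItem, NumpyItem):
    ...

# apply as a decorator: caching is disabled for ImageItem while my_fn runs
@no_data_cache(ImageItem)
def my_fn():
    ...
```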

Now let's see what happens when we write the above sequence to disk. The writer is
just another operation, so we append it to the sequence and then iterate over the
result to write the samples to disk:

In [None]:
writer = seq.to_underfolder("./writer_output", exists_ok=True)
for _ in writer:
    pass

As you can see, the items know how to serialize themselves, since they represent specific
data formats. To see all the supported formats, just query the item factory:

In [None]:
from pipelime.items import Item
from rich.pretty import pprint

pprint(Item.ITEM_CLASSES)


What if we want to distribute the computation over multiple cores? Just use a **Grabber**!

**NB**: *Multiprocessing does not work in Jupyter notebooks, so we have packed the logic
in grabber_example.py*

In [None]:
!python grabber_example.py

An interesting thing to note is that when we write an item to disk, the written file
becomes a new *file source* for the item. Since no changes are made to the actual value,
this data source is added to the same item instance we initially loaded<sup>1</sup>:

<sup>[1] *The attentive reader will notice that if we had set `merge_root_items=False`
when loading the underfolder dataset above, even the Sample object would have been the
SAME instance.*</sup>

In [None]:
print("Original Sample:", flush=True)
print(seq[0])

print("Written Sample:", flush=True)
print(writer[0])

for org_sample, wrt_sample in zip(seq, writer):
    for v1, v2 in zip(org_sample.values(), wrt_sample.values()):
        assert v1 is v2
    
    ## This is true if we set `merge_root_items=False`
    # assert org_sample is wrt_sample

## The `pipe` Command

Once we have a sequence of operations, we can serialize it and replay it later through
the `pipe` command. To this end, first we ask the samples sequence to serialize itself:

In [None]:
import yaml
print(yaml.safe_dump(writer.to_pipe(), indent=2))

Then, we can create a config file following the `pipe` syntax by copy-pasting the
serialized sequence above, but removing the `from_underfolder` and `to_underfolder`
steps, since `pipe` already takes care of them:

In [None]:
!pipelime pipe --config pipe_cfg.yaml

In [None]:
import numpy as np

seq1 = SamplesSequence.from_underfolder("writer_output")  # type: ignore
seq2 = SamplesSequence.from_underfolder("writer_output_piped")  # type: ignore
assert len(seq1) == len(seq2)
for s1, s2 in zip(seq1, seq2):
    assert list(s1.keys()) == list(s2.keys())
    for k in s1.keys():
        if isinstance(s1[k](), np.ndarray):
            assert np.array_equal(s1[k](), s2[k](), equal_nan=True)
        else:
            assert s1[k]() == s2[k]()
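
The serialize-and-replay idea behind this section can be illustrated with a small, self-contained sketch. It assumes that a pipeline is stored as a list of single-key mappings of the form `{operation_name: {parameter: value}}`. That layout, the `OPERATIONS` registry and the `replay` helper are all assumptions made for this example, so check the real output of `to_pipe()` rather than relying on them.

```python
import json
import random
from typing import Any, Callable, Dict, List


def _shuffle(samples: List[Any], seed: int = 0) -> List[Any]:
    out = list(samples)
    random.Random(seed).shuffle(out)
    return out


def _slice(samples: List[Any], start: int = 0, stop: int = 0) -> List[Any]:
    return list(samples)[start:stop]


# hypothetical registry mapping operation names to their implementation
OPERATIONS: Dict[str, Callable[..., List[Any]]] = {
    "shuffle": _shuffle,
    "slice": _slice,
}


def replay(samples: List[Any], steps: List[Dict[str, Dict[str, Any]]]) -> List[Any]:
    """Apply the serialized steps in order to a list of samples."""
    for step in steps:
        for name, params in step.items():
            samples = OPERATIONS[name](samples, **params)
    return samples


data = list(range(10))

# the pipeline built directly in code...
direct = _slice(_shuffle(data, seed=42), start=2, stop=6)

# ...and the same pipeline described as plain data, round-tripped through text
steps = [{"shuffle": {"seed": 42}}, {"slice": {"start": 2, "stop": 6}}]
serialized = json.dumps(steps)
replayed = replay(data, json.loads(serialized))

print("same result:", direct == replayed)
```

Since the steps are plain data, they can live in a configuration file and be replayed on any input, which is the role the `pipe` command plays above.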

## Sample Data Caching

Pipelime provides a special operator, `cache`, that serializes a whole sample to disk the
first time it is accessed and loads it from disk on later accesses, instead of
triggering the whole source data pipeline again. This is really useful when:
* you loop over the data multiple times
* the data processing is time-consuming but fixed, i.e., for each index you always get the
same sample

To show how it works, we will deliberately add some randomness to the pipeline, so that we
can clearly see the difference between *caching* and *no caching*:

In [None]:
from pipelime.sequences import SamplesSequence, Sample
from pipelime.stages import SampleStage
import numpy as np
import shutil


def _print_label(data_seq, idx=0, times=3):
    for i in range(times):
        print(f"Reading label #{idx} ({i})", data_seq[idx]["label"](), flush=True)


class RandomNoiseStage(SampleStage):
    def __call__(self, x: Sample):
        return x.set_value("label", x["label"]() + np.random.normal(0, .1))  # type: ignore

seq = SamplesSequence.from_underfolder(  # type: ignore
    "../../tests/sample_data/datasets/underfolder_minimnist"
).map(RandomNoiseStage())

print("Every time we read a sample, the label is modified by a random noise:", flush=True)
_print_label(seq)

print("\nInstead, a cached sequence always returns the first value we get:", flush=True)
shutil.rmtree("local_cache", ignore_errors=True)
cached_seq = seq.cache("local_cache")
_print_label(cached_seq)

print("\nWe can even re-use the same cached data between different runs:", flush=True)
another_cached_seq = seq.cache("local_cache", reuse_cache=True)
_print_label(another_cached_seq)
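
For readers curious about the mechanism, the behavior shown above can be approximated with a minimal, self-contained sketch that pickles each sample the first time it is requested. `ToyCachedSequence` and `noisy_label` are hypothetical names, and the on-disk layout (one pickle file per index) is an assumption of this example, not a description of how pipelime's `cache` operator stores data.

```python
import pickle
import random
import tempfile
from pathlib import Path
from typing import Any, Callable


class ToyCachedSequence:
    """Hypothetical wrapper: pickles a sample on first access, then reads it back."""

    def __init__(self, getter: Callable[[int], Any], length: int, cache_dir: Path):
        self._getter = getter
        self._length = length
        self._cache_dir = Path(cache_dir)
        self._cache_dir.mkdir(parents=True, exist_ok=True)

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int) -> Any:
        cache_file = self._cache_dir / f"{idx}.pkl"
        if cache_file.exists():
            # cache hit: skip the source pipeline entirely
            with cache_file.open("rb") as f:
                return pickle.load(f)
        # cache miss: run the (possibly expensive) source pipeline and store the result
        sample = self._getter(idx)
        with cache_file.open("wb") as f:
            pickle.dump(sample, f)
        return sample


def noisy_label(idx: int) -> float:
    """Source pipeline with randomness: a different value on every call."""
    return idx + random.gauss(0.0, 0.1)


with tempfile.TemporaryDirectory() as tmp:
    cached = ToyCachedSequence(noisy_label, length=3, cache_dir=Path(tmp))
    reads = [cached[0] for _ in range(3)]

    # a second wrapper pointing at the same folder reuses the files written above
    reused = ToyCachedSequence(noisy_label, length=3, cache_dir=Path(tmp))
    reused_read = reused[0]

print("repeated reads:", reads)
print("read through a second wrapper:", reused_read)
```

As with the real operator, this only makes sense when the pipeline is meant to be deterministic per index. Here the noise is frozen at the first value that was produced.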