# Pipelime Samples And Sequences

At the core of pipelime's dataset management is the concept of sample sequences.
All the functionality you need is packed into the `SamplesSequence` class, whose
operational methods, e.g., `shuffle`, `slice`, etc., are **dynamically** generated from
internal and external definitions (more on this later). Therefore, **you won't find them
by just looking at the code**. Instead, you can list them from the command line using
`pipelime list` or by explicitly calling the printer from an interactive session:

In [None]:
from pipelime.cli.utils import print_sequence_operators_list

print_sequence_operators_list()

As you can see above, there are two kinds of sequence operations:
1. **generators**: class-methods that generate a sequence of samples
1. **pipes**: instance-methods that append an operation to the sequence

Note that *piped operations* follow the general pipelime approach of *returning* a new
object and leaving the original unchanged. This usually adds no significant overhead.
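
The distinction between generators and pipes, and the "return a new object" convention, can be illustrated with a toy class. This is only a sketch: `ToySequence` and `from_list` are hypothetical names made up for this example and are not part of pipelime, whose real operations are generated dynamically.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySequence:
    """Hypothetical stand-in for a samples sequence (not pipelime's class)."""

    samples: tuple

    @classmethod
    def from_list(cls, items) -> "ToySequence":
        # "Generator": a class-method that creates a sequence from scratch.
        return cls(tuple(items))

    def shuffle(self, seed: int) -> "ToySequence":
        # "Pipe": an instance-method that returns a NEW sequence
        # and never modifies the one it was called on.
        shuffled = list(self.samples)
        random.Random(seed).shuffle(shuffled)
        return ToySequence(tuple(shuffled))


original = ToySequence.from_list(range(5))
shuffled = original.shuffle(seed=42)

# The original sequence still holds the samples in their initial order.
print(original.samples)
print(shuffled is original)
```

Because every pipe returns a fresh object, an intermediate sequence can be kept and reused safely while further operations are chained on top of it.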

To get the list of arguments required for a given operation, just use the usual
`pipelime help <operation>` from the command line or the
`pipelime.cli.utils.print_command_op_stage_info` Python function. For example, let's see
how to load an underfolder dataset and shuffle it:

In [None]:
from pipelime.cli.utils import print_command_op_stage_info

print_command_op_stage_info("from_underfolder")
print_command_op_stage_info("shuffle")

## A Simple Data Pipe

In [None]:
from pipelime.sequences import SamplesSequence
from PIL import Image
from IPython.display import display

seq = SamplesSequence.from_underfolder(  # type: ignore
    "../../tests/sample_data/datasets/underfolder_minimnist"
)
print("Before shuffling:", flush=True)
display(Image.fromarray(seq[0]['image']()))

seq = seq.shuffle(seed=42)
print("After shuffling:", flush=True)
display(Image.fromarray(seq[0]['image']()))

In the previous code we also accessed a sample by its index and an item by its
name. Note that what you get from `seq[0]['image']` is a `pipelime.items.Item` object,
so to get its value you have to call it (`__call__()`). Where the actual data comes from
may vary:
1. first, if the data has already been loaded and cached, the cached value is returned
1. then, all file sources are checked
1. finally, remote data lakes are accessed
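
The lookup order above can be sketched with a small lazy item. This is a simplified illustration, not pipelime's implementation: `ToyItem` and its loader callables are hypothetical, and each loader is assumed to return `None` when its source is unavailable.

```python
class ToyItem:
    """Hypothetical lazy item that only mimics the lookup order above."""

    def __init__(self, file_loaders=(), remote_loaders=(), cache_data=True):
        self._file_loaders = list(file_loaders)      # callables: data or None
        self._remote_loaders = list(remote_loaders)  # callables: data or None
        self._cache_data = cache_data
        self._cached = None

    def __call__(self):
        # 1. Return the cached value, if there is one.
        if self._cached is not None:
            return self._cached
        # 2. Try the file sources, 3. then fall back to the remote ones.
        for loader in (*self._file_loaders, *self._remote_loaders):
            value = loader()
            if value is not None:
                if self._cache_data:
                    self._cached = value
                return value
        return None


calls = []  # records which sources were actually queried


def missing_file():
    calls.append("file")
    return None


def remote():
    calls.append("remote")
    return "pixels"


item = ToyItem(file_loaders=[missing_file], remote_loaders=[remote])
first = item()   # tries the file source, then falls back to the remote one
second = item()  # served from the cache: no loader is called again
```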

Note that the usual approach is to create a new object every time the data is changed,
so **you should not really care where the data comes from**. However, to reduce the
memory footprint, you can disable item data caching by setting the `Item.cache_data`
property or by using the `pipelime.items.no_data_cache` context manager (NB: it can also
be used as a function decorator!).
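
For readers unfamiliar with objects that work both as a context manager and as a decorator, here is one generic way to build such a switch with the standard library's `contextlib.ContextDecorator`. It is a sketch of the pattern only: `ToySettings` and `toy_no_data_cache` are hypothetical, and pipelime's `no_data_cache` may be implemented differently.

```python
from contextlib import ContextDecorator


class ToySettings:
    cache_data = True  # hypothetical global caching flag


class toy_no_data_cache(ContextDecorator):
    """Turn caching off temporarily and restore the previous value on exit."""

    def __enter__(self):
        self._previous = ToySettings.cache_data
        ToySettings.cache_data = False
        return self

    def __exit__(self, *exc):
        ToySettings.cache_data = self._previous
        return False  # do not swallow exceptions


# Used as a context manager.
with toy_no_data_cache():
    inside_block = ToySettings.cache_data


# Used as a function decorator.
@toy_no_data_cache()
def read_flag():
    return ToySettings.cache_data


inside_function = read_flag()
after = ToySettings.cache_data  # the flag is restored once we are outside
```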

Now let's see what happens when we write the above sequence to disk. The writer is
just another operation, so we append it to the sequence and then simply iterate over
the result to write the samples to disk:

In [None]:
writer = seq.to_underfolder("./writer_output", exists_ok=True)
for _ in writer:
    pass
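
To see why an empty loop is enough, consider a toy writing stage that saves each sample as a side effect of being accessed. This is an assumption-laden sketch: `ToyWriter` is hypothetical, it writes plain text files into a temporary folder, and it does not reproduce the underfolder format.

```python
import tempfile
from pathlib import Path


class ToyWriter:
    """Hypothetical writing stage: a sample is written when it is accessed."""

    def __init__(self, source, folder):
        self._source = source
        self._folder = Path(folder)

    def __len__(self):
        return len(self._source)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self._source):
            raise IndexError(idx)  # also tells the for-loop where to stop
        sample = self._source[idx]
        self._folder.mkdir(parents=True, exist_ok=True)
        (self._folder / f"{idx:03d}.txt").write_text(str(sample))
        return sample


out_dir = Path(tempfile.mkdtemp()) / "toy_output"
toy_writer = ToyWriter(["a", "b", "c"], out_dir)
exists_before = out_dir.exists()  # nothing is written at construction time

for _ in toy_writer:  # iterating is what drives the writes
    pass

written = sorted(p.name for p in out_dir.iterdir())
```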

What if we want to distribute the computation over multiple cores? Just use a Grabber!

**NB**: *Multiprocessing does not work in Jupyter notebooks, so we have packed the logic
in `grabber_example.py`*

In [None]:
!python grabber_example.py

An interesting thing to note is that when we write an item to disk, the written file
becomes a new *file source* for that item. Since the actual value does not change,
this data source is added to the same item instance we initially loaded:

In [None]:
print("Original Sample:", flush=True)
print(seq[0])

print("Written Sample:", flush=True)
print(writer[0])

for org_sample, wrt_sample in zip(seq, writer):
    assert org_sample is wrt_sample