# Data Streams

A data stream is a façade that brings together an input sequence and an output processing
pipe that are not directly connected. Instead, you get samples from the input and pass
them to the output only when you are ready. Though unusual, this pattern is useful, e.g.,
when the user has to interact asynchronously with the dataset.
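
To make the pattern concrete before turning to the real API, here is a minimal plain-Python sketch. It is not how pipelime implements `DataStream`: the `ToyStream` class, its `sink` callback and the dict-based samples are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]


class ToyStream:
    """Hypothetical façade: an input sequence plus an output pipe, not linked."""

    def __init__(self, source: List[Sample], sink: Callable[[int, Sample], None]):
        self._source = source
        self._sink = sink

    def __len__(self) -> int:
        return len(self._source)

    def __getitem__(self, idx: int) -> Sample:
        # Return a copy so that edits never touch the input directly.
        return dict(self._source[idx])

    def process_sample(self, idx: int, sample: Sample) -> None:
        # Nothing reaches the output until the caller decides to send it.
        self._sink(idx, sample)


# The "output pipe" here is just a dict that records what was written.
written: Dict[int, Sample] = {}
stream = ToyStream([{"label": 0.0}, {"label": 1.0}], written.__setitem__)

sample = stream[1]
sample["label"] += 0.5            # edit at leisure, e.g. after user input
assert written == {}              # the output has not seen anything yet
stream.process_sample(1, sample)  # explicit hand-off to the output
```

The point of the design is the explicit hand-off: reading and writing are separate calls, so any amount of time (or user interaction) can pass between them.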

First, we need a toy dataset to play with:

In [None]:
from pipelime.sequences import SamplesSequence
import shutil


shutil.rmtree("toy_dataset", ignore_errors=True)
for _ in SamplesSequence.toy_dataset(3).to_underfolder("toy_dataset"):  # type: ignore
    pass


Now create a DataStream object and read/write some samples:

In [None]:
from pipelime.sequences import SamplesSequence, DataStream
from pipelime.items import JsonMetadataItem
import numpy as np

# getting data
# NB: cache is disabled on these items
data_stream = DataStream.rw_underfolder("toy_dataset")
s0 = data_stream[0]
print("original label:", s0["label"]())

# changing an item value does not propagate to the original data
new_s0 = s0.set_value("label", s0["label"]() + np.random.normal(0, .1))  # type: ignore
print("noisy label:", new_s0["label"]())
assert new_s0["label"]() != s0["label"]()

# create a new item and save all changes
new_s0 = new_s0.set_item("new_label", JsonMetadataItem([np.random.randint(100, 110)]))
print("new label:", new_s0["new_label"]())
data_stream.process_sample(0, new_s0, ["label", "new_label"])

# s0 now returns the updated value as well: with the data
# cache disabled, the item is always read from disk
assert new_s0["label"]() == s0["label"]()
assert "new_label" not in s0

# however, to read new keys we first need to reload the DataStream
data_stream = DataStream.rw_underfolder("toy_dataset")
assert new_s0["new_label"]() == data_stream[0]["new_label"]()
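
The behavior shown by the last assertions (updated values are visible immediately, new keys only after a reload) can be reproduced with a small stand-alone sketch. This is only an assumption about the general mechanism, not pipelime's actual code: `LazyItem` and `load_sample` are hypothetical helpers, and plain JSON files stand in for the underfolder items.

```python
import json
import tempfile
from pathlib import Path


class LazyItem:
    """Hypothetical uncached item: every call re-reads the file."""

    def __init__(self, path: Path):
        self._path = path

    def __call__(self):
        return json.loads(self._path.read_text())


def load_sample(folder: Path) -> dict:
    # The set of keys is fixed here, when the folder is scanned.
    return {p.stem: LazyItem(p) for p in sorted(folder.glob("*.json"))}


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "label.json").write_text(json.dumps(1))

    sample = load_sample(folder)
    before = sample["label"]()

    # Simulate the output pipe: overwrite one item, add a new one.
    (folder / "label.json").write_text(json.dumps(2))
    (folder / "new_label.json").write_text(json.dumps([105]))

    after = sample["label"]()                 # re-read from disk: new value
    has_new_key = "new_label" in sample       # key set is stale
    reloaded_new = load_sample(folder)["new_label"]()  # visible after reload
```

In this sketch the value of an existing key changes without reloading because nothing is cached, while the new key appears only once the folder is scanned again.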