# Quick start

## Starting the server

### Install `lavender-data`

You can install lavender-data using pip.

```sh
pip install lavender-data
```

### Run the server

You can start the lavender-data server with the following command:
```sh
lavender-data server start
```

The server listens on http://0.0.0.0:8000.
The command blocks while the server is running, so we'll spawn it as a subprocess and move on.

In [1]:
!lavender-data server start

[2025-04-22 15:48:49,732] INFO - lavender_data.server.settings: Loading settings
lavender-data is running on http://0.0.0.0:8000
UI is running on http://localhost:8000
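
Outside a notebook, a script could launch the same command in the background and wait until the port accepts connections before using the client. The sketch below uses only the standard library. `wait_for_port` and `start_server_in_background` are hypothetical helpers, not part of lavender-data, and the host, port, and timeout are assumptions based on the log output above.

```python
import socket
import subprocess
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # server not ready yet, retry shortly
    return False


def start_server_in_background() -> subprocess.Popen:
    """Launch the CLI command shown above without blocking the caller."""
    return subprocess.Popen(["lavender-data", "server", "start"])
```

A caller would run `proc = start_server_in_background()`, then check `wait_for_port("localhost", 8000)` before initializing the client, and call `proc.terminate()` when done.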


## Using the client

### Initialize the client

Use `lavender_data.client.init` to initialize the client.

In [6]:
import lavender_data.client as lavender

lavender.init(api_url="http://localhost:8000", api_key="la-...")

<lavender_data.client.api.LavenderDataClient at 0x1337a6fd0>

Let's check if we're connected by listing the datasets with `get_datasets`.

In [104]:
lavender.api.get_datasets()

[]

### Define the schema

Make a new dataset with `create_dataset`. `uid_column_name` is the name of the column that will be used as the unique identifier for each sample.

In [105]:
lavender.api.create_dataset(name="test-dataset", uid_column_name="uid")

DatasetPublic(name='test-dataset', created_at=datetime.datetime(2025, 4, 22, 7, 13, 1), id='ds-m9s649wvmtj2nsjububv', uid_column_name='uid', additional_properties={})

In [106]:
dataset = lavender.api.get_dataset(name="test-dataset")
dataset

GetDatasetResponse(name='test-dataset', created_at=datetime.datetime(2025, 4, 22, 7, 13, 1), columns=[], shardsets=[], id='ds-m9s649wvmtj2nsjububv', uid_column_name='uid', additional_properties={})

Add a shardset to the dataset with `create_shardset`.
Let's add 2 columns, `uid` and `text`.

In [107]:
import os

shardset_dir = os.path.expanduser("~/.lavender-data/.cache/test_shards")

shardset = lavender.api.create_shardset(
    dataset_id=dataset.id,
    location=f"file://{shardset_dir}",
    columns=[
        lavender.api.DatasetColumnOptions(
            name="uid",
            description="Unique identifier",
            type_="int",
        ),
        lavender.api.DatasetColumnOptions(
            name="text",
            description="A text field",
            type_="str",
        ),
    ],
)
shardset

CreateShardsetResponse(dataset_id='ds-m9s649wvmtj2nsjububv', location='file:///Users/hanch/.lavender-data/.cache/test_shards', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), columns=[DatasetColumnPublic(dataset_id='ds-m9s649wvmtj2nsjububv', shardset_id='ss-m9s64axf90r8u80wgwbs', name='uid', type_='int', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), id='dc-m9s64axi8hjzklpqw240', description='Unique identifier', additional_properties={}), DatasetColumnPublic(dataset_id='ds-m9s649wvmtj2nsjububv', shardset_id='ss-m9s64axf90r8u80wgwbs', name='text', type_='str', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), id='dc-m9s64axif4c0hhcot3xm', description='A text field', additional_properties={})], id='ss-m9s64axf90r8u80wgwbs', shard_count=0, total_samples=0, additional_properties={})

Now the dataset has 2 columns, `uid` and `text`.

In [108]:
[f"{col.name} ({col.type_}): {col.description}" for col in lavender.api.get_dataset(dataset.id).columns]

['text (str): A text field', 'uid (int): Unique identifier']

### Add data

Let's create example CSV files in the shardset location.

In [133]:
import os
import csv

shard_count = 10
samples_per_shard = 10

os.makedirs(shardset_dir, exist_ok=True)
for i in range(shard_count):
    # newline="" is what the csv module recommends for files passed to csv.writer
    with open(f"{shardset_dir}/shard.{i:05d}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["uid", "text"])
        for j in range(samples_per_shard):
            writer.writerow(
                [
                    i * samples_per_shard + j,
                    f"Sample {i * samples_per_shard + j}",
                ]
            )


To make the server aware of the new files, call `sync_shardset`.

In [110]:
lavender.api.sync_shardset(dataset.id, shardset.id, overwrite=True)

GetShardsetResponse(dataset_id='ds-m9s649wvmtj2nsjububv', location='file:///Users/hanch/.lavender-data/.cache/test_shards', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), shards=[ShardPublic(shardset_id='ss-m9s64axf90r8u80wgwbs', location='file:///Users/hanch/.lavender-data/.cache/test_shards/shard.00000.csv', format_='csv', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), id='sd-m9s64ayvoszz99yjfu8t', filesize=130, samples=10, index=1, additional_properties={}), ShardPublic(shardset_id='ss-m9s64axf90r8u80wgwbs', location='file:///Users/hanch/.lavender-data/.cache/test_shards/shard.00001.csv', format_='csv', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), id='sd-m9s64ayv3g7tsi1zhyaf', filesize=150, samples=10, index=2, additional_properties={}), ShardPublic(shardset_id='ss-m9s64axf90r8u80wgwbs', location='file:///Users/hanch/.lavender-data/.cache/test_shards/shard.00002.csv', format_='csv', created_at=datetime.datetime(2025, 4, 22, 7, 13, 2), id='sd-m9s64ayw1nj17w0h3hd5'

Now the shardset has 100 samples.

In [111]:
shardset = lavender.api.get_dataset(dataset.id).shardsets[0]
print(f"Shard count: {shardset.shard_count}, Total samples: {shardset.total_samples}")

Shard count: 10, Total samples: 100
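
As a local cross-check on the numbers the server reports, the CSV files can be counted directly. This is a minimal sketch that assumes the `shard.*.csv` naming and the single header row used in the example above. `count_samples` is a hypothetical helper, not a lavender-data function.

```python
import csv
import glob
import os


def count_samples(shard_dir: str) -> tuple[int, int]:
    """Return (shard_count, total_samples) for the CSV shards in shard_dir."""
    shard_paths = sorted(glob.glob(os.path.join(shard_dir, "shard.*.csv")))
    total = 0
    for path in shard_paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            total += sum(1 for _ in reader)
    return len(shard_paths), total
```

Calling `count_samples(shardset_dir)` on the files written earlier should agree with the shard count and total samples printed above.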


### Add a new column

You might want to add a new feature to the dataset. To do so, add a new column by creating a new shardset.

Be aware that every shardset must contain the `uid_column_name` column.

In [112]:
new_shardset_dir = os.path.expanduser("~/.lavender-data/.cache/test_shards_new")


new_shardset = lavender.api.create_shardset(
    dataset_id=dataset.id,
    location=f"file://{new_shardset_dir}",
    columns=[
        lavender.api.DatasetColumnOptions(
            name="uid",
            description="Unique identifier",
            type_="int",
        ),
        lavender.api.DatasetColumnOptions(
            name="new_text",
            description="A new text field",
            type_="str",
        ),
    ],
)

In [117]:
[f"{col.name} ({col.type_}): {col.description}" for col in lavender.api.get_dataset(dataset.id).columns]

['new_text (str): A new text field',
 'text (str): A text field',
 'uid (int): Unique identifier']

We'll add only 8 samples per shard this time, to demonstrate what happens when shardsets in the same dataset have different numbers of samples.


> For each sample, the shard index MUST be the same across all the shardsets.
> Otherwise, it is hard to determine which shard the sample belongs to.
>
> For example, say shardset A has 10 samples per shard.
> Then the 11th sample in shardset A belongs to the 2nd shard.
> Now say you derive a new shardset B from A and have to drop the 9th and 10th samples of A.
> Even though 2 samples were dropped, the sample that was 11th in A must still belong to the 2nd shard of shardset B.
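
The rule above can be sketched in plain Python: when deriving shardset B, group the surviving samples by the shard they occupied in shardset A instead of re-packing them. The helper names are hypothetical and positions are 0-based, so the 9th and 10th samples are positions 8 and 9.

```python
from collections import defaultdict

samples_per_shard = 10  # shard size of the original shardset A


def original_shard_index(position: int, samples_per_shard: int) -> int:
    """0-based shard index of the sample at a 0-based position in shardset A."""
    return position // samples_per_shard


def group_by_original_shard(kept_positions, samples_per_shard):
    """Group the surviving samples by the shard they occupied in shardset A."""
    shards = defaultdict(list)
    for pos in kept_positions:
        shards[original_shard_index(pos, samples_per_shard)].append(pos)
    return dict(shards)


# Drop the 9th and 10th samples (0-based positions 8 and 9) out of 20 samples.
kept = [p for p in range(20) if p not in (8, 9)]
shards_b = group_by_original_shard(kept, samples_per_shard)
print(shards_b[0])     # [0, 1, 2, 3, 4, 5, 6, 7]
print(shards_b[1][0])  # 10
```

The first shard of B ends up with 8 samples, and the sample at position 10 (the 11th sample of A) is still the first sample of the 2nd shard.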

In [134]:
import os
import csv

os.makedirs(new_shardset_dir, exist_ok=True)
for i in range(shard_count):
    # newline="" is what the csv module recommends for files passed to csv.writer
    with open(f"{new_shardset_dir}/shard.{i:05d}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["uid", "new_text"])
        for j in range(samples_per_shard - 2):
            writer.writerow(
                [
                    i * samples_per_shard + j,
                    f"Sample {i * samples_per_shard + j}",
                ]
            )

In [118]:
lavender.api.sync_shardset(dataset.id, new_shardset.id, overwrite=True)

GetShardsetResponse(dataset_id='ds-m9s649wvmtj2nsjububv', location='file:///Users/hanch/.lavender-data/.cache/test_shards_new', created_at=datetime.datetime(2025, 4, 22, 7, 13, 9), shards=[ShardPublic(shardset_id='ss-m9s64g065q1f05rvjkht', location='file:///Users/hanch/.lavender-data/.cache/test_shards_new/shard.00000.csv', format_='csv', created_at=datetime.datetime(2025, 4, 22, 7, 13, 41), id='sd-m9s655c87o4eh0vc2u1u', filesize=110, samples=8, index=1, additional_properties={}), ShardPublic(shardset_id='ss-m9s64g065q1f05rvjkht', location='file:///Users/hanch/.lavender-data/.cache/test_shards_new/shard.00001.csv', format_='csv', created_at=datetime.datetime(2025, 4, 22, 7, 13, 41), id='sd-m9s655c9y656v01vitef', filesize=122, samples=8, index=2, additional_properties={}), ShardPublic(shardset_id='ss-m9s64g065q1f05rvjkht', location='file:///Users/hanch/.lavender-data/.cache/test_shards_new/shard.00002.csv', format_='csv', created_at=datetime.datetime(2025, 4, 22, 7, 13, 41), id='sd-m9s6

In [119]:
new_shardset = lavender.api.get_dataset(dataset.id).shardsets[1]
print(f"Shard count: {new_shardset.shard_count}, Total samples: {new_shardset.total_samples}")

Shard count: 10, Total samples: 80


### Iterate over the dataset

Use `LavenderDataLoader` to iterate over the dataset. Specify the dataset ID and the shardsets you want to iterate over.

Shardsets that are not selected will not be loaded, which can greatly reduce overhead.

As a best practice, select only the shardsets you need. For example, say you have an image dataset and have preprocessed the images into embeddings. Store the embeddings in a new shardset, and leave that shardset out of the iteration when you don't need it.

In [121]:
iteration = lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
)
iteration

<lavender_data.client.iteration.LavenderDataLoader at 0x13143f4d0>

In [122]:
for sample in iteration:
    print(sample)


{'uid': 0, 'text': 'Sample 0', '_lavender_data_indices': [0], '_lavender_data_current': 1}
{'uid': 1, 'text': 'Sample 1', '_lavender_data_indices': [1], '_lavender_data_current': 2}
{'uid': 2, 'text': 'Sample 2', '_lavender_data_indices': [2], '_lavender_data_current': 3}
{'uid': 3, 'text': 'Sample 3', '_lavender_data_indices': [3], '_lavender_data_current': 4}
{'uid': 4, 'text': 'Sample 4', '_lavender_data_indices': [4], '_lavender_data_current': 5}
{'uid': 5, 'text': 'Sample 5', '_lavender_data_indices': [5], '_lavender_data_current': 6}
{'uid': 6, 'text': 'Sample 6', '_lavender_data_indices': [6], '_lavender_data_current': 7}
{'uid': 7, 'text': 'Sample 7', '_lavender_data_indices': [7], '_lavender_data_current': 8}
{'uid': 8, 'text': 'Sample 8', '_lavender_data_indices': [8], '_lavender_data_current': 9}
{'uid': 9, 'text': 'Sample 9', '_lavender_data_indices': [9], '_lavender_data_current': 10}
{'uid': 10, 'text': 'Sample 10', '_lavender_data_indices': [10], '_lavender_data_current'

The samples will be shuffled if `shuffle` is set to `True`. You can fix the shuffled order by setting `shuffle_seed` to a fixed value.

`shuffle_block_size` is the number of shards to shuffle at a time. A larger value uses more disk space but gives more randomness.

In [123]:
for sample in lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    shuffle=True,
    shuffle_seed=42,
    shuffle_block_size=3,
):
    print(sample)

{'uid': 57, 'text': 'Sample 57', '_lavender_data_indices': [57], '_lavender_data_current': 1}
{'uid': 15, 'text': 'Sample 15', '_lavender_data_indices': [15], '_lavender_data_current': 2}
{'uid': 53, 'text': 'Sample 53', '_lavender_data_indices': [53], '_lavender_data_current': 3}
{'uid': 17, 'text': 'Sample 17', '_lavender_data_indices': [17], '_lavender_data_current': 4}
{'uid': 88, 'text': 'Sample 88', '_lavender_data_indices': [88], '_lavender_data_current': 5}
{'uid': 89, 'text': 'Sample 89', '_lavender_data_indices': [89], '_lavender_data_current': 6}
{'uid': 58, 'text': 'Sample 58', '_lavender_data_indices': [58], '_lavender_data_current': 7}
{'uid': 54, 'text': 'Sample 54', '_lavender_data_indices': [54], '_lavender_data_current': 8}
{'uid': 12, 'text': 'Sample 12', '_lavender_data_indices': [12], '_lavender_data_current': 9}
{'uid': 80, 'text': 'Sample 80', '_lavender_data_indices': [80], '_lavender_data_current': 10}
{'uid': 84, 'text': 'Sample 84', '_lavender_data_indices': 
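
In the output above, the first samples all have uids in the 50s, 10s, and 80s, which is consistent with three shards being shuffled together. The sketch below is a conceptual model of that kind of block shuffling, not lavender-data's actual implementation. `block_shuffle` is a hypothetical function and will not reproduce the exact order shown above.

```python
import random


def block_shuffle(shards, block_size, seed):
    """Shuffle shard order, then shuffle samples within blocks of `block_size` shards."""
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)
    result = []
    for start in range(0, len(order), block_size):
        # Gather the samples of the next `block_size` shards into one block.
        block = [s for idx in order[start:start + block_size] for s in shards[idx]]
        rng.shuffle(block)  # samples only mix within this block
        result.extend(block)
    return result


# 10 shards of 10 samples each, as in the example dataset.
shards = [list(range(i * 10, (i + 1) * 10)) for i in range(10)]
shuffled = block_shuffle(shards, block_size=3, seed=42)
first_block_shards = {uid // 10 for uid in shuffled[:30]}
print(len(first_block_shards))  # 3
```

In this model, a larger block means more shards have to be available at once, in exchange for samples mixing across more shards.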

The samples will be batched if `batch_size` is set.

In [124]:
for sample in lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    batch_size=10,
):
    print(sample)

{'uid': tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'text': ['Sample 0', 'Sample 1', 'Sample 2', 'Sample 3', 'Sample 4', 'Sample 5', 'Sample 6', 'Sample 7', 'Sample 8', 'Sample 9'], '_lavender_data_indices': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], '_lavender_data_current': 1}
{'uid': tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), 'text': ['Sample 10', 'Sample 11', 'Sample 12', 'Sample 13', 'Sample 14', 'Sample 15', 'Sample 16', 'Sample 17', 'Sample 18', 'Sample 19'], '_lavender_data_indices': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], '_lavender_data_current': 2}
{'uid': tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]), 'text': ['Sample 20', 'Sample 21', 'Sample 22', 'Sample 23', 'Sample 24', 'Sample 25', 'Sample 26', 'Sample 27', 'Sample 28', 'Sample 29'], '_lavender_data_indices': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], '_lavender_data_current': 3}
{'uid': tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39]), 'text': ['Sample 30', 'Sample 31', 'Sample 32', 'Sample 33', 'Sample 34', 'Sample 35', 

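### What happens if shardsets have different numbers of samples?

If shardsets have different numbers of samples, only the samples that have all the columns will be loaded. In the output below, uids 8 and 9 are skipped because they are missing from the new shardset.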


In [135]:
for sample in lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id, new_shardset.id],
    batch_size=10,
):
    print(sample)

{'uid': tensor([ 0,  1,  2,  3,  4,  5,  6,  7, 10, 11]), 'new_text': ['Sample 0', 'Sample 1', 'Sample 2', 'Sample 3', 'Sample 4', 'Sample 5', 'Sample 6', 'Sample 7', 'Sample 10', 'Sample 11'], 'text': ['Sample 0', 'Sample 1', 'Sample 2', 'Sample 3', 'Sample 4', 'Sample 5', 'Sample 6', 'Sample 7', 'Sample 10', 'Sample 11'], '_lavender_data_indices': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], '_lavender_data_current': 1}
{'uid': tensor([12, 13, 14, 15, 16, 17, 20, 21, 22, 23]), 'new_text': ['Sample 12', 'Sample 13', 'Sample 14', 'Sample 15', 'Sample 16', 'Sample 17', 'Sample 20', 'Sample 21', 'Sample 22', 'Sample 23'], 'text': ['Sample 12', 'Sample 13', 'Sample 14', 'Sample 15', 'Sample 16', 'Sample 17', 'Sample 20', 'Sample 21', 'Sample 22', 'Sample 23'], '_lavender_data_indices': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], '_lavender_data_current': 2}
{'uid': tensor([24, 25, 26, 27, 30, 31, 32, 33, 34, 35]), 'new_text': ['Sample 24', 'Sample 25', 'Sample 26', 'Sample 27', 'Sample 30', 'Sample 31',
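
The behaviour seen above resembles an inner join on the uid column. The following sketch models it with plain dictionaries keyed by uid. `join_on_uid` is a hypothetical helper, and this is an illustration of the observable result, not a description of how the server computes it.

```python
def join_on_uid(*shardsets: dict) -> list[dict]:
    """Keep only uids present in every shardset and merge their columns."""
    common = set(shardsets[0])
    for shardset in shardsets[1:]:
        common &= set(shardset)
    merged = []
    for uid in sorted(common):
        row = {"uid": uid}
        for shardset in shardsets:
            row.update(shardset[uid])
        merged.append(row)
    return merged


# Mirror the example: `new_text` is missing for the last 2 samples of each shard.
text = {uid: {"text": f"Sample {uid}"} for uid in range(100)}
new_text = {uid: {"new_text": f"Sample {uid}"} for uid in range(100) if uid % 10 < 8}

rows = join_on_uid(text, new_text)
print(len(rows))                      # 80
print([r["uid"] for r in rows[:10]])  # [0, 1, 2, 3, 4, 5, 6, 7, 10, 11]
```

The first ten uids match the first batch printed above.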

## Custom modules

### Define module directory

Specify the directory containing the module files with the `LAVENDER_DATA_MODULE_DIR` environment variable. Use a `.env` file or set the variable with `export`.

```sh
export LAVENDER_DATA_MODULE_DIR=/path/to/module/dir
```

The example directory already contains a module directory named `modules`.

```sh
export LAVENDER_DATA_MODULE_DIR=$PWD/modules
```
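
Before starting the server, it can help to confirm that the variable is set and points at a real directory. The sketch below does this with the standard library. `list_module_files` is a hypothetical helper, and it assumes the modules are `.py` files placed directly in that directory, which the text above does not state.

```python
import os
from pathlib import Path


def list_module_files(env_var: str = "LAVENDER_DATA_MODULE_DIR") -> list[str]:
    """Return the names of the .py files in the directory named by env_var."""
    module_dir = os.environ.get(env_var)
    if not module_dir:
        raise RuntimeError(f"{env_var} is not set")
    path = Path(module_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"{module_dir} is not a directory")
    # Assumption: modules are plain .py files at the top level of the directory.
    return sorted(p.name for p in path.glob("*.py"))
```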

This example module contains a custom filter, collater, and preprocessor. Let's take a look at them one by one.


### Online Filters

To filter the dataset online, define a filter class that inherits from `Filter`. Its `filter` method takes a single sample as an argument and returns a boolean value. If it returns `True`, the sample is included in the iteration. For example, the filter below keeps only samples whose `uid` is divisible by `mod`, which means even `uid`s with the default `mod=2`.

In [22]:
from lavender_data.server import Filter

class UidModFilter(Filter, name="uid_mod"):
    def filter(self, sample: dict, *, mod: int = 2) -> bool:
        return sample["uid"] % mod == 0

On iteration, specify the filter name to use it.

In [127]:
for sample in lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    filters=[("uid_mod", {"mod": 2})],
    batch_size=10,
):
    print(sample)

{'uid': tensor([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18]), 'text': ['Sample 0', 'Sample 2', 'Sample 4', 'Sample 6', 'Sample 8', 'Sample 10', 'Sample 12', 'Sample 14', 'Sample 16', 'Sample 18'], '_lavender_data_indices': [0, 2, 4, 6, 8, 10, 12, 14, 16, 18], '_lavender_data_current': 1}
{'uid': tensor([20, 22, 24, 26, 28, 30, 32, 34, 36, 38]), 'text': ['Sample 20', 'Sample 22', 'Sample 24', 'Sample 26', 'Sample 28', 'Sample 30', 'Sample 32', 'Sample 34', 'Sample 36', 'Sample 38'], '_lavender_data_indices': [20, 22, 24, 26, 28, 30, 32, 34, 36, 38], '_lavender_data_current': 2}
{'uid': tensor([40, 42, 44, 46, 48, 50, 52, 54, 56, 58]), 'text': ['Sample 40', 'Sample 42', 'Sample 44', 'Sample 46', 'Sample 48', 'Sample 50', 'Sample 52', 'Sample 54', 'Sample 56', 'Sample 58'], '_lavender_data_indices': [40, 42, 44, 46, 48, 50, 52, 54, 56, 58], '_lavender_data_current': 3}
{'uid': tensor([60, 62, 64, 66, 68, 70, 72, 74, 76, 78]), 'text': ['Sample 60', 'Sample 62', 'Sample 64', 'Sample 66', 'Sampl
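
The filter logic can be exercised locally on plain dictionaries before registering it on the server. This sketch uses a stand-in class with the same `filter` signature. `UidModFilterLogic` is a hypothetical name and deliberately does not inherit from `Filter`, so that it runs without lavender-data installed.

```python
class UidModFilterLogic:
    """Plain-Python stand-in with the same `filter` signature as the class above."""

    def filter(self, sample: dict, *, mod: int = 2) -> bool:
        return sample["uid"] % mod == 0


samples = [{"uid": i, "text": f"Sample {i}"} for i in range(100)]
uid_filter = UidModFilterLogic()
kept = [s for s in samples if uid_filter.filter(s, mod=2)]
print([s["uid"] for s in kept[:10]])  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The first ten kept uids match the first batch printed above.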

### Collater

To collate the samples, define a collater class that inherits from `Collater`. It takes a list of samples as an argument and returns a dictionary of batched samples.

If `torch` is installed, the default collater is `torch.utils.data.default_collate`. Otherwise, it is a simple function that gathers the samples into lists, like the one below.

In [24]:
from lavender_data.server import Collater

class PyListCollater(Collater, name="pylist"):
    def collate(self, samples: list[dict]) -> dict:
        return {
            "uid": [sample["uid"] for sample in samples],
            "text": [sample["text"] for sample in samples],
        }

In [128]:
for sample in lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    collater="pylist",
    batch_size=10,
):
    print(sample)

{'uid': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'text': ['Sample 0', 'Sample 1', 'Sample 2', 'Sample 3', 'Sample 4', 'Sample 5', 'Sample 6', 'Sample 7', 'Sample 8', 'Sample 9'], '_lavender_data_indices': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], '_lavender_data_current': 1}
{'uid': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 'text': ['Sample 10', 'Sample 11', 'Sample 12', 'Sample 13', 'Sample 14', 'Sample 15', 'Sample 16', 'Sample 17', 'Sample 18', 'Sample 19'], '_lavender_data_indices': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], '_lavender_data_current': 2}
{'uid': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'text': ['Sample 20', 'Sample 21', 'Sample 22', 'Sample 23', 'Sample 24', 'Sample 25', 'Sample 26', 'Sample 27', 'Sample 28', 'Sample 29'], '_lavender_data_indices': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], '_lavender_data_current': 3}
{'uid': [30, 31, 32, 33, 34, 35, 36, 37, 38, 39], 'text': ['Sample 30', 'Sample 31', 'Sample 32', 'Sample 33', 'Sample 34', 'Sample 35', 'Sample 36', 'Sample 37', 'Sampl
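
`PyListCollater` hard-codes the `uid` and `text` columns. A variant that handles whatever columns the samples carry could look like the sketch below. It is written as a plain function so it runs without lavender-data, `collate_to_lists` is a hypothetical name, and it assumes every sample has the same keys as the first one.

```python
def collate_to_lists(samples: list[dict]) -> dict:
    """Turn a list of sample dicts into a dict of lists, one list per column."""
    if not samples:
        return {}
    # Assumption: all samples share the keys of the first sample.
    return {key: [sample[key] for sample in samples] for key in samples[0]}


batch = collate_to_lists([{"uid": i, "text": f"Sample {i}"} for i in range(3)])
print(batch)  # {'uid': [0, 1, 2], 'text': ['Sample 0', 'Sample 1', 'Sample 2']}
```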

### Remote Preprocessor

To preprocess the samples remotely, define a preprocessor class that inherits from `Preprocessor`. It takes a collated batch as an argument and returns a preprocessed batch.

In [26]:
from lavender_data.server import Preprocessor

class AppendNewColumn(Preprocessor, name="append_new_column"):
    def process(self, batch: dict) -> dict:
        batch["new_column"] = []
        for uid in batch["uid"]:
            batch["new_column"].append(f"{uid}_processed")
        return batch

In [131]:
for sample in lavender.LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    preprocessors=["append_new_column"],
    batch_size=10,
):
    print(sample)

{'uid': tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'text': ['Sample 0', 'Sample 1', 'Sample 2', 'Sample 3', 'Sample 4', 'Sample 5', 'Sample 6', 'Sample 7', 'Sample 8', 'Sample 9'], 'new_column': ['0_processed', '1_processed', '2_processed', '3_processed', '4_processed', '5_processed', '6_processed', '7_processed', '8_processed', '9_processed'], '_lavender_data_indices': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], '_lavender_data_current': 1}
{'uid': tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), 'text': ['Sample 10', 'Sample 11', 'Sample 12', 'Sample 13', 'Sample 14', 'Sample 15', 'Sample 16', 'Sample 17', 'Sample 18', 'Sample 19'], 'new_column': ['10_processed', '11_processed', '12_processed', '13_processed', '14_processed', '15_processed', '16_processed', '17_processed', '18_processed', '19_processed'], '_lavender_data_indices': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], '_lavender_data_current': 2}
{'uid': tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]), 'text': ['Sample 20', 'Sample 21', 'Samp
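
The preprocessing step can also be checked on a hand-built batch before running it through the server. The function below repeats the logic of `AppendNewColumn.process` as a plain function. `append_new_column` is a hypothetical name, and the batch uses Python lists instead of tensors.

```python
def append_new_column(batch: dict) -> dict:
    """Same logic as `AppendNewColumn.process`, as a plain function."""
    batch["new_column"] = [f"{uid}_processed" for uid in batch["uid"]]
    return batch


batch = append_new_column({"uid": [0, 1, 2], "text": ["Sample 0", "Sample 1", "Sample 2"]})
print(batch["new_column"])  # ['0_processed', '1_processed', '2_processed']
```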