In [1]:
import sys
import numpy as np

# The following line is not required if Dataset is installed as a Python package.
sys.path.append("../..")
from dataset import Dataset, DatasetIndex, Batch

In [2]:
# number of items in the dataset
NUM_ITEMS = 10
# number of items in a batch when iterating
BATCH_SIZE = 3

# Create a dataset

A dataset is defined by an index (a sequence of item ids) and a batch class (see [the documentation for details](https://analysiscenter.github.io/dataset/intro/dataset.html)).

In the simplest case, an index is the natural sequence 0, 1, 2, 3, ...

So the number of items in the dataset is all you need to define the index.

In [3]:
dataset = Dataset(index=NUM_ITEMS, batch_class=Batch)

# The index

See [the documentation](https://analysiscenter.github.io/dataset/intro/index.html) for more info about how to create an index which fits your needs.

Here are the most frequent use cases:

    client_index = DatasetIndex(my_client_ids)

    images_index = FilesIndex(path="/path/to/images/*.jpg", no_ext=True)

# Iterate with gen_batch(...)

In [4]:
for i, batch in enumerate(dataset.gen_batch(BATCH_SIZE, n_epochs=1)):
    print("batch", i, " contains items", batch.indices)

batch 0  contains items [0 1 2]
batch 1  contains items [3 4 5]
batch 2  contains items [6 7 8]
batch 3  contains items [9]


### drop_last=True skips the last batch if it contains fewer than BATCH_SIZE items

In [5]:
for i, batch in enumerate(dataset.gen_batch(BATCH_SIZE, n_epochs=1, drop_last=True)):
    print("batch", i, " contains items", batch.indices)

batch 0  contains items [0 1 2]
batch 1  contains items [3 4 5]
batch 2  contains items [6 7 8]
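The effect of `drop_last` within a single epoch can be sketched in plain NumPy. This is only an illustration of the behaviour shown above, not the library's implementation; `iter_batches` is a hypothetical helper introduced here.

```python
import numpy as np

def iter_batches(item_ids, batch_size, drop_last=False):
    """Yield consecutive chunks of item_ids (hypothetical helper, not the library API)."""
    for start in range(0, len(item_ids), batch_size):
        chunk = item_ids[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # the incomplete tail batch is skipped
        yield chunk

item_ids = np.arange(10)
all_batches = [b.tolist() for b in iter_batches(item_ids, 3)]
full_batches = [b.tolist() for b in iter_batches(item_ids, 3, drop_last=True)]
print(all_batches)   # four batches, the last one holds a single item
print(full_batches)  # three batches of exactly three items
```

The sketch covers one pass over the data only; how the library treats the tail across several epochs is described in its API notes.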


### shuffle permutes items across batches

In [6]:
for i, batch in enumerate(dataset.gen_batch(BATCH_SIZE, n_epochs=1, drop_last=True, shuffle=True)):
    print("batch", i, " contains items", batch.indices)

batch 0  contains items [8 0 1]
batch 1  contains items [9 6 7]
batch 2  contains items [5 3 4]


Run the cell above multiple times to see how batches change.

### Shuffle can be bool, int (seed number) or a RandomState object

In [7]:
for i, batch in enumerate(dataset.gen_batch(BATCH_SIZE, n_epochs=1, drop_last=True, shuffle=123)):
    print("batch", i, " contains items", batch.indices)

batch 0  contains items [4 0 7]
batch 1  contains items [5 8 3]
batch 2  contains items [1 6 9]


Run the cell above multiple times to see that batches stay the same across runs.
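One plausible way to interpret the three kinds of `shuffle` values is sketched below. `make_order` is a hypothetical helper written for this note; it is an assumption about the semantics, not the library's code, and its permutations need not match the output above.

```python
import numpy as np

def make_order(n_items, shuffle=False):
    """Return an ordering of 0..n_items-1 (hypothetical helper, not the library's code)."""
    if isinstance(shuffle, np.random.RandomState):
        return shuffle.permutation(n_items)  # caller-supplied generator
    if isinstance(shuffle, bool):            # checked before int: bool is a subclass of int
        return np.random.permutation(n_items) if shuffle else np.arange(n_items)
    if isinstance(shuffle, int):
        return np.random.RandomState(shuffle).permutation(n_items)  # seeded, reproducible
    raise TypeError("shuffle must be bool, int or RandomState")

first = make_order(10, shuffle=123)
second = make_order(10, shuffle=123)
print(np.array_equal(first, second))  # the same seed gives the same order
```

An integer seed builds a fresh generator on every call, which is why the order repeats across runs, whereas `shuffle=True` draws from the global random state.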

# Iterate with next_batch(...)

While `gen_batch` is a Python generator, `next_batch` is an ordinary function.
Most of the time you will use `gen_batch`, but when you need finer control over training or more sophisticated fine-tuning, `next_batch` might be more convenient.

`next_batch` raises `StopIteration` once the requested number of epochs is exhausted.

The loop below attempts `NUM_ITEMS * 3` iterations, far more than `n_epochs=2` allows: with `drop_last=True` each epoch yields `NUM_ITEMS // BATCH_SIZE = 3` full batches, so the exception arrives after 6 batches.

In [8]:
for i in range(NUM_ITEMS * 3):
    try:
        batch = dataset.next_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True)
        print("batch", i + 1, "contains items", batch.indices)
    except StopIteration:
        print("got StopIteration")
        break

batch 1 contains items [8 0 4]
batch 2 contains items [9 6 1]
batch 3 contains items [3 2 5]
batch 4 contains items [7 8 0]
batch 5 contains items [9 2 3]
batch 6 contains items [5 6 4]
got StopIteration
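The difference from a generator is that the position has to be stored between calls. The toy class below sketches that idea under simplifying assumptions (no shuffling, `drop_last=True` behaviour only); `TinyIndex` is hypothetical and is not the library class.

```python
import numpy as np

class TinyIndex:
    """Toy stand-in with a stateful next_batch (a sketch, not the library class)."""

    def __init__(self, n_items):
        self.ids = np.arange(n_items)
        self.reset_iter()

    def reset_iter(self):
        # Forget the current position and the number of finished epochs
        self._start = 0
        self._epochs_done = 0

    def next_batch(self, batch_size, n_epochs=None):
        # An incomplete tail is discarded and a new epoch begins
        if self._start + batch_size > len(self.ids):
            self._epochs_done += 1
            self._start = 0
        if n_epochs is not None and self._epochs_done >= n_epochs:
            raise StopIteration("all epochs are exhausted")
        batch = self.ids[self._start:self._start + batch_size]
        self._start += batch_size
        return batch

toy = TinyIndex(10)
count = 0
try:
    while True:
        toy.next_batch(3, n_epochs=2)
        count += 1
except StopIteration:
    pass
print(count)  # 6 full batches over two epochs
```

Because the state lives on the object, it must be reset explicitly before iterating again from the beginning.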


### And finally with shuffle=True, n_epochs=None and a variable batch size

Do not forget to reset the iterator to start calling `next_batch` from scratch.

In [9]:
dataset.reset_iter()

`n_epochs=None` allows for infinite iterations.

In [10]:
for i in range(int(NUM_ITEMS * 1.3)):
    batch = dataset.next_batch(BATCH_SIZE + (-1)**i * i % 3, shuffle=True, n_epochs=None, drop_last=True)
    print("batch", i + 1, "contains items", batch.indices)

batch 1 contains items [0 3 4]
batch 2 contains items [2 6 1 8 9]
batch 3 contains items [2 1 0 4 6]
batch 4 contains items [5 7 9]
batch 5 contains items [6 3 7 0]
batch 6 contains items [8 5 2 4]
batch 7 contains items [0 7 1]
batch 8 contains items [4 2 8 5 6]
batch 9 contains items [5 6 0 9 7]
batch 10 contains items [4 1 8]
batch 11 contains items [6 3 2 8]
batch 12 contains items [0 5 9 1]
batch 13 contains items [9 5 0]
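The batch size expression in the cell above is easier to follow once operator precedence is spelled out. The snippet below uses plain Python and reproduces the sizes of the 13 batches printed above.

```python
BATCH_SIZE = 3

# `**` binds tighter than `*` and `%`, which share the same precedence and
# group left to right, so the offset is ((-1) ** i * i) % 3.
# Python's `%` with a positive modulus never returns a negative number.
sizes = [BATCH_SIZE + (-1) ** i * i % 3 for i in range(13)]
print(sizes)  # [3, 5, 5, 3, 4, 4, 3, 5, 5, 3, 4, 4, 3]
```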


To get a deeper understanding of `drop_last` read [very important notes in the API](https://analysiscenter.github.io/dataset/api/dataset.index.html#dataset.DatasetIndex.next_batch).

# Working with data

For illustrative purposes let's create a small array which will serve as a raw data source.

In [11]:
data = (100 + np.arange(NUM_ITEMS * 3)).reshape(NUM_ITEMS, -1)
data

array([[100, 101, 102],
       [103, 104, 105],
       [106, 107, 108],
       [109, 110, 111],
       [112, 113, 114],
       [115, 116, 117],
       [118, 119, 120],
       [121, 122, 123],
       [124, 125, 126],
       [127, 128, 129]])

## Load data into a batch

After loading, the data is available as `batch.data`.

In [12]:
for batch in dataset.gen_batch(BATCH_SIZE, n_epochs=1):
    batch = batch.load(src=data)
    print("batch contains items with indices", batch.indices)
    print('and batch data is')
    print(batch.data)
    print()

batch contains items with indices [0 1 2]
and batch data is
[[100 101 102]
 [103 104 105]
 [106 107 108]]

batch contains items with indices [3 4 5]
and batch data is
[[109 110 111]
 [112 113 114]
 [115 116 117]]

batch contains items with indices [6 7 8]
and batch data is
[[118 119 120]
 [121 122 123]
 [124 125 126]]

batch contains items with indices [9]
and batch data is
[[127 128 129]]
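For an in-memory array and integer item ids, the output above is consistent with simple row selection. The sketch below shows that idea in plain NumPy; it is an assumption about what loading amounts to in this case, not a description of the library internals.

```python
import numpy as np

data = (100 + np.arange(30)).reshape(10, -1)
batch_indices = np.array([3, 4, 5])

# Assumption: loading from an array selects the rows named by the batch indices
batch_data = data[batch_indices]
print(batch_data)
```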



### You can easily iterate over batch items too

In [13]:
for batch in dataset.gen_batch(BATCH_SIZE, n_epochs=1):
    batch = batch.load(src=data)
    print("batch contains")
    for item in batch:
        print(item)
    print()

batch contains
[100 101 102]
[103 104 105]
[106 107 108]

batch contains
[109 110 111]
[112 113 114]
[115 116 117]

batch contains
[118 119 120]
[121 122 123]
[124 125 126]

batch contains
[127 128 129]



## Data components

Quite often a batch stores more complex data structures, e.g. features and labels, or images, masks, bounding boxes and labels. To work with these, you can use data components. Just define a `components` attribute as follows:

In [14]:
class MyBatch(Batch):
    components = 'features', 'labels'

Let's generate some random data:

In [15]:
features_array = (200 + np.arange(NUM_ITEMS * 3)).reshape(NUM_ITEMS, -1)
labels_array = np.random.choice(10, size=NUM_ITEMS)
data = features_array, labels_array

Now create a dataset (`preloaded` makes the dataset load batches from data already stored in memory).

In [16]:
dataset = Dataset(index=NUM_ITEMS, batch_class=MyBatch, preloaded=data)

Since components are defined, you can address them as batch and even item attributes (they are created and loaded automatically).

In [17]:
for i, batch in enumerate(dataset.gen_batch(BATCH_SIZE, n_epochs=1)):
    print("batch", i, " contains items", batch.indices)
    print("and batch data consists of features:")
    print(batch.features)
    print("and labels:", batch.labels)
    print()

batch 0  contains items [0 1 2]
and batch data consists of features:
[[200 201 202]
 [203 204 205]
 [206 207 208]]
and labels: [6 4 5]

batch 1  contains items [3 4 5]
and batch data consists of features:
[[209 210 211]
 [212 213 214]
 [215 216 217]]
and labels: [2 5 1]

batch 2  contains items [6 7 8]
and batch data consists of features:
[[218 219 220]
 [221 222 223]
 [224 225 226]]
and labels: [6 3 8]

batch 3  contains items [9]
and batch data consists of features:
[[227 228 229]]
and labels: [1]
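The output above suggests that each component name is paired with the array at the same position in the `preloaded` tuple. The sketch below mimics that pairing with a plain dictionary; the positional mapping is an assumption, and the deterministic labels are stand-ins for the random ones used in the notebook.

```python
import numpy as np

components = ('features', 'labels')
features_array = (200 + np.arange(30)).reshape(10, -1)
labels_array = np.arange(10) % 4  # deterministic stand-in labels
data = (features_array, labels_array)

batch_indices = np.array([0, 1, 2])
# Assumption: component i is filled from data[i], restricted to the batch items
batch = {name: source[batch_indices] for name, source in zip(components, data)}
print(batch['features'])
print(batch['labels'])
```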



### You can iterate over batch items and change them on the fly

In [18]:
for i, batch in enumerate(dataset.gen_batch(BATCH_SIZE, n_epochs=1)):
    print("Batch", i)
    for item in batch:
        print("item features:", item.features, "    item label:", item.labels)

    print()
    print("You can change batch data, even scalars.")
    for item in batch:
        item.features = item.features + 1000
        item.labels = item.labels + 100
    print("New batch features:\n", batch.features)
    print("and labels:", batch.labels)
    print()

Batch 0
item features: [200 201 202]     item label: 6
item features: [203 204 205]     item label: 4
item features: [206 207 208]     item label: 5

You can change batch data, even scalars.
New batch features:
 [[1200 1201 1202]
 [1203 1204 1205]
 [1206 1207 1208]]
and labels: [106 104 105]

Batch 1
item features: [209 210 211]     item label: 2
item features: [212 213 214]     item label: 5
item features: [215 216 217]     item label: 1

You can change batch data, even scalars.
New batch features:
 [[1209 1210 1211]
 [1212 1213 1214]
 [1215 1216 1217]]
and labels: [102 105 101]

Batch 2
item features: [218 219 220]     item label: 6
item features: [221 222 223]     item label: 3
item features: [224 225 226]     item label: 8

You can change batch data, even scalars.
New batch features:
 [[1218 1219 1220]
 [1221 1222 1223]
 [1224 1225 1226]]
and labels: [106 103 108]

Batch 3
item features: [227 228 229]     item label: 1

You can change batch data, even scalars.
New batch features:
 [[1227 1228 1229]]
and labels: [101]

# Splitting a dataset

For machine learning tasks you might need to split a dataset into train, test and validation parts.

In [19]:
dataset.cv_split(0.8)

Now the dataset is split into train / test in 80/20 ratio.

In [20]:
len(dataset.train), len(dataset.test)

(8, 2)

In [21]:
dataset.cv_split([.6, .2, .2])

In [22]:
len(dataset.train), len(dataset.test), len(dataset.validation)

(6, 2, 2)

The dataset may be shuffled before splitting.

In [23]:
dataset.cv_split(0.7, shuffle=True)

In [24]:
dataset.train.indices, dataset.test.indices

(array([4, 9, 2, 6, 5, 7, 8]), array([0, 3, 1]))
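Splitting by shares can be sketched as cutting the (optionally shuffled) ids at cumulative positions. `split_ids` is a hypothetical helper written for illustration; it reproduces the part sizes shown above but is not `cv_split` itself, and its shuffled order will differ from the library's.

```python
import numpy as np

def split_ids(item_ids, shares, seed=None):
    """Split ids into consecutive parts by share (hypothetical helper, not cv_split)."""
    item_ids = np.asarray(item_ids)
    if np.isscalar(shares):
        shares = [shares, 1 - shares]  # a single number means train / test
    if seed is not None:
        item_ids = np.random.RandomState(seed).permutation(item_ids)
    # Cumulative shares give the cut positions; the final bound equals len(item_ids)
    bounds = np.round(np.cumsum(shares) * len(item_ids)).astype(int)
    return np.split(item_ids, bounds[:-1])

train, test = split_ids(np.arange(10), 0.8)
print(len(train), len(test))  # 8 2
train, test, validation = split_ids(np.arange(10), [.6, .2, .2])
print(len(train), len(test), len(validation))  # 6 2 2
```

Rounding the cumulative bounds, rather than each share separately, keeps the parts from overlapping or leaving items out.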

As always, shuffle can be bool, int (seed number) or a RandomState object.

`dataset.train` and `dataset.test` are also datasets, so you can do anything you want with them, including splitting them further into `dataset.train.train`, etc.

Most of the time, though, you will work with pipelines, not datasets.

See [the next tutorial](./02_pipeline_basic_operations.ipynb) for details.