# Tutorial

Before going through the code examples below, please read through the [package concepts](https://h5torch.readthedocs.io/en/latest/index.html) to understand the terminology used here.

Make sure you have a working installation of `torch` and have installed `h5torch` via `pip install h5torch`.

For a more detailed overview of this package's functionality, see the [API reference](https://h5torch.readthedocs.io/en/latest/h5torch.html).

## Quick start / simple use-cases

`h5torch` works by first instantiating an HDF5 file using `h5torch.File` and filling it with data using `File.register()`.
Then, `h5torch.Dataset` can be used to read data directly from the HDF5 file, ready to be used with PyTorch's `DataLoader`.

Note that the first registered object should always be the `central` object.

The simplest use case is a machine learning setting with a 2-D matrix `X` as the central object and corresponding labels `y` aligned to its first axis (axis 0).

In [1]:
import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
f.register(X, "central")
f.register(y, 0, name = "y")
f.close()

dataset = h5torch.Dataset("example.h5t")
print(dataset[5])
print(len(dataset))

dataset.close()

{'central': array([ 0.98979173, -3.41435395, -0.78360513, -1.26418759, -0.93384509,
       -0.22411679, -1.31606085,  0.4299904 ,  1.32580397,  0.22207813,
       -0.69006483,  0.42695502, -0.22582408,  0.11693748,  0.58534766]), '0/y': 0.9617572347542511}
100


You will note that `h5torch.Dataset` returns a dictionary of objects.

Note that the labels `y` can also play the role of central object, with `X` aligned to axis 0. Both layouts are equivalent in this simple case.
For example:

In [2]:
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
f.register(y, "central")
f.register(X, 0, name = "X")
f.close()

dataset = h5torch.Dataset("example.h5t")
print(dataset[5])
print(len(dataset))

dataset.close()

{'central': 0.6420444587143879, '0/X': array([ 0.93766063,  0.32896113, -1.46207565,  0.04000909, -0.59668085,
        1.06243662,  0.70166576, -0.85328211, -0.42410189,  0.51222133,
       -0.06248377,  0.44139297, -1.182032  ,  0.77243425,  0.2729682 ])}
100



Note that data is saved and loaded with the data type it had when it was registered.
To control this behavior, users can pass the arguments `dtype_save` and `dtype_load` to `File.register()`. Some examples where controlling this behavior is useful are:
- Converting from NumPy's default (`float64`) to PyTorch's default (`float32`).
- Saving disk space (e.g. binary labels can be saved as booleans and converted back to integers upon loading).
- Circumventing the fact that `h5py` doesn't work with string data types. These should be converted to `"bytes"` and back.
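
The effect of these conversions can be previewed with plain NumPy before touching any HDF5 file. The snippet below is only a sketch of the three cases listed above; it does not call `h5torch`, and the variable names are illustrative.

```python
import numpy as np

# Case 1: float64 (NumPy default) -> float32 (PyTorch default) halves the size.
X = np.random.randn(100, 15)
X32 = X.astype(np.float32)
print(X.nbytes, X32.nbytes)

# Case 2: binary labels stored as booleans, restored as integers on loading.
y = (np.random.rand(100) > 0.5).astype(np.int64)
y_saved = y.astype(bool)              # 1 byte per label instead of 8
y_loaded = y_saved.astype(np.int64)   # lossless round trip for 0/1 labels
print(y.nbytes, y_saved.nbytes)

# Case 3: unicode strings converted to bytes and back (ASCII content assumed).
labels = np.array(["train", "test"])
labels_bytes = labels.astype("S")
labels_back = labels_bytes.astype(str)
print(labels_bytes.dtype, labels_back.dtype)
```

The boolean round trip is exact only because the labels are 0 or 1; casting arbitrary integers to `bool` would lose information.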


The following example converts labels from integers to booleans when saving to the HDF5 file, and converts them back to integers upon loading in `h5torch.Dataset`.

In [3]:
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = (np.random.rand(100) > 0.5).astype(int)
f.register(X, "central")
f.register(y, 0, name = "y", dtype_save="bool", dtype_load="int64")
f.close()

dataset = h5torch.Dataset("example.h5t")

print(dataset[5])
print(len(dataset))

dataset.close()

{'central': array([-0.01736951, -1.43416815,  0.75396736,  0.17290202,  0.07317204,
       -0.46644905, -1.30649431,  0.44177047, -0.12409271,  0.98836047,
       -1.52254716, -0.62410142,  1.27216924,  2.02733532, -0.43260042]), '0/y': 1}
100


## Multi-dimensional cases

Consider the previous example, but where we also have metadata on the features that we want to save.
In this case, we want to align said metadata to axis 1 (the feature axis) of `X`, while the labels stay aligned to axis 0. Because objects are now aligned to two different axes of `X`, `X` has to be the central object.

In [4]:
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = (np.random.rand(100) > 0.5).astype(int)
metadata = np.random.randn(15, 5)

f.register(X, "central")
f.register(metadata, 1, name = "metadata")
f.register(y, 0, name = "y", dtype_save="bool", dtype_load="int64")
f.close()

dataset = h5torch.Dataset("example.h5t")

print(dataset[5])
print(len(dataset))

dataset.close()

{'central': array([ 0.99690642, -2.33629046,  1.94552392, -0.07709176, -0.83370512,
        0.86024048, -1.17020468, -0.21787612,  0.67440204, -0.10539647,
        1.49601214, -0.50616417,  0.87227492, -2.18742138, -1.21128324]), '0/y': 0}
100
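
In the output above only `central` and `0/y` appear, since the dataset samples along axis 0 and the metadata is aligned to axis 1. The alignment idea itself can be sketched with NumPy alone. The code below is a conceptual illustration, not `h5torch`'s implementation: the helper `take` and the `aligned` dictionary are hypothetical, and the rule that an aligned object's first dimension must equal the length of the central axis is an assumption drawn from the examples.

```python
import numpy as np

X = np.random.randn(100, 15)
y = np.random.rand(100)
metadata = np.random.randn(15, 5)

# Objects grouped by the axis of X they are aligned to (hypothetical layout).
aligned = {0: {"y": y}, 1: {"metadata": metadata}}

# Assumed rule: an object aligned to axis k has as many entries as X has along k.
for axis, objects in aligned.items():
    for name, obj in objects.items():
        assert len(obj) == X.shape[axis], f"{name} does not match axis {axis}"

def take(axis, index):
    """Return one sample along `axis`, with the objects aligned to that axis."""
    sample = {"central": np.take(X, index, axis=axis)}
    for name, obj in aligned[axis].items():
        sample[f"{axis}/{name}"] = obj[index]
    return sample

row_sample = take(0, 5)   # one row of X plus its label
col_sample = take(1, 5)   # one column of X plus its feature metadata
print({k: np.shape(v) for k, v in row_sample.items()})
print({k: np.shape(v) for k, v in col_sample.items()})
```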


## Data "modes"

`h5torch` supports a variety of modes for saving data, as not all data comes neatly arrayed in NumPy arrays (our default `N-D` mode).

For more intuition on when to use different modes of objects, see our package concepts.

For details on what format of data each mode expects, see the [API reference](https://h5torch.readthedocs.io/en/latest/h5torch.html).

Consider, for illustration, a dataset of histological images of tissues with paired transcriptomic counts. The transcriptomic counts present themselves as rows of gene activities per tissue. Suppose furthermore that a (variable-length) textual description of each gene is present.

In `h5torch`, such a dataset could be saved and loaded by registering the transcriptomic count matrix as the central object in the default `N-D` mode, with images aligned to axis 0 in `separate` mode and text aligned to axis 1 in `vlen` mode.

The following code shows how some fake data in this format could be saved:

In [5]:
f = h5torch.File("example.h5t", "w")

# make a fictional 100x50 counts matrix by simulating random integers
counts = np.random.randint(low=0, high=20, size = (100, 50))

# simulate 100 (3xHxW) images, where H and W varies per image.
images = [
    np.random.randn(
        3,
        np.random.randint(low=20, high=256),
        np.random.randint(low=20, high=256)
        ) for _ in range(100)
    ]

# simulate 50 variable-length text descriptions, each description contains integer-tokenized text
text = [
    np.random.randint(
    low=0, high=10_000, 
    size = (np.random.randint(low=5, high=256),)) 
    for _ in range(50)
    ]


f.register(counts, "central")
f.register(images, axis = 0, name = "images", mode = "separate")
f.register(text, axis = 1, name = "text", mode = "vlen")
f.close()

dataset = h5torch.Dataset("example.h5t")

print(dataset[5])
print(len(dataset))

dataset.close()

{'central': array([ 0,  4,  2,  5, 12,  8,  6,  4,  8,  5, 18, 18, 13,  4, 14, 10, 16,
        2, 14,  7, 13,  7, 15, 11,  6, 10,  9,  2, 13,  5,  9, 17,  9,  7,
       19,  3, 15,  4, 12,  9, 15, 14, 15, 19, 19,  3, 12,  0, 11, 18]), '0/images': array([[[-3.56557988e-01,  1.79789641e+00,  7.41479427e-01, ...,
          2.29167273e-01,  4.84247601e-01,  2.44997787e-01],
        [-1.16602272e-02,  8.43172945e-01, -7.83499156e-01, ...,
         -7.77611868e-01,  4.06864525e-01,  1.98301294e-01],
        [ 1.04960846e+00, -1.26902843e+00,  1.18916929e-01, ...,
         -6.91018963e-01,  1.16213463e+00, -7.70060465e-01],
        ...,
        [ 1.63488471e+00, -1.81512802e+00, -1.38004870e+00, ...,
          2.55002079e-01,  1.98616012e-01, -8.26304432e-01],
        [-1.25587495e-01, -1.77245215e+00, -1.54150459e+00, ...,
         -1.96460166e+00,  9.25512291e-01,  7.97873750e-01],
        [ 2.49485366e+00, -1.84269406e+00, -1.33106057e+00, ...,
          7.23374747e-02, -8.49435816e-01, -1

The dataset object in the previous example samples rows of counts with their corresponding image. If you want to sample columns of counts with their textual data instead, the sampling axis of the dataset can be changed.

In [6]:
dataset = h5torch.Dataset("example.h5t", sampling=1)

print(dataset[5])
print(len(dataset))

dataset.close()

{'central': array([ 1, 14, 14, 15,  9,  8,  8,  2,  5, 18,  7, 11, 12,  5,  3, 17, 11,
       10, 13, 13,  7,  5,  9, 10,  3, 17, 19, 15,  2, 12, 12,  5, 19, 13,
       19, 15, 19,  0,  5, 12, 13, 11, 11,  8, 16, 18,  9,  2,  0,  2,  6,
        5,  0, 11,  5,  3,  5, 16,  7,  5,  3, 13,  0,  6,  0,  8, 19, 10,
       10, 10,  2, 11, 11, 14,  0, 15,  4,  2, 15,  4, 10, 11, 12, 11,  3,
        9, 12, 10,  8,  5, 15, 12,  0,  2,  2,  7,  2, 19, 15, 16]), '1/text': array([9513, 3200, 2204, 9789, 5096, 3807,  489, 9505, 7206, 4841,  402,
       5838, 6297,  368, 7025, 7158, 9164,  357, 2911, 2706, 7805, 4245,
         73, 2484, 4196, 1703, 2675, 6824, 9249, 9300, 6728, 4718, 2179,
       7885, 2876, 5094, 2075, 9043,  940, 7841,  753, 9734, 9842, 3809,
       8733, 5920, 6658, 2512,  317, 2904, 8991, 3735, 9564, 5556, 6164,
       7865, 6403, 8233, 7191, 5923, 4336, 9092, 8438, 2229, 2280, 8952,
       7648, 6086, 8924,  463, 4755, 7247, 5577, 7004])}
50
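
Samples holding variable-length objects such as `1/text` generally cannot be stacked into a batch as they are, so a `DataLoader` would typically need a custom `collate_fn`. Below is a minimal sketch of one possible approach that pads the text to the longest sequence in the batch. The function name `pad_collate`, the `pad_value` argument and the extra `lengths` key are all hypothetical and not part of `h5torch`; the fake samples only mimic the dictionary layout shown above.

```python
import numpy as np

def pad_collate(samples, key="1/text", pad_value=0):
    """Stack fixed-size entries and right-pad the variable-length entry `key`."""
    lengths = np.array([len(s[key]) for s in samples])
    padded = np.full((len(samples), lengths.max()), pad_value,
                     dtype=samples[0][key].dtype)
    for i, s in enumerate(samples):
        padded[i, : lengths[i]] = s[key]

    # Every other entry is assumed to have the same shape in each sample.
    batch = {k: np.stack([s[k] for s in samples]) for k in samples[0] if k != key}
    batch[key] = padded
    batch["lengths"] = lengths  # keep the true lengths so padding can be masked
    return batch

# Fake samples in the same layout as `dataset[i]` with `sampling=1`.
samples = [
    {"central": np.random.randint(0, 20, size=100),
     "1/text": np.random.randint(0, 10_000, size=n)}
    for n in (7, 12, 5)
]
batch = pad_collate(samples)
print({k: v.shape for k, v in batch.items()})
```

The function returns NumPy arrays; if tensors are needed, each value could be converted afterwards, for instance with `torch.from_numpy`.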


If you instead want a sample to constitute an image-text pair (a row-column pair) with its corresponding count, you can sample in `"coo"` mode:

In [7]:
dataset = h5torch.Dataset("example.h5t", sampling="coo")

print(dataset[5])
print(len(dataset))

dataset.close()

{'central': 1, '0/images': array([[[-0.43810267, -0.11849598,  1.33702541, ...,  0.72592681,
         -0.5234584 ,  0.10405255],
        [ 1.67701011,  1.23007339,  1.75710564, ..., -0.77413316,
         -0.48173892,  0.43799925],
        [-1.1750171 ,  0.42391199, -0.71871375, ...,  0.17442484,
         -0.98827085, -0.17298069],
        ...,
        [ 1.52855698, -1.78897902, -0.21855661, ..., -0.61193698,
          0.49896934,  0.78627897],
        [ 1.56965465, -0.33384022,  1.56938974, ..., -0.19517717,
         -1.79049889, -1.35649419],
        [ 0.18089748, -0.59007415,  0.76988018, ..., -1.35494275,
         -0.19199976, -1.26714873]],

       [[-0.59587307, -1.14308934, -1.16665446, ...,  0.35755592,
         -1.08198708,  0.26279554],
        [-0.41433322, -0.26345253, -1.52905785, ...,  0.76088384,
          0.74151497,  0.00777517],
        [ 1.48439867, -0.68861787,  0.33277467, ...,  0.26786888,
          1.06074661,  0.66279619],
        ...,
        [-2.17334197,  0.55

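Note that the length of the dataset differs in this case: with `"coo"` sampling, each sample corresponds to a single element of the central object rather than to a whole row or column.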

If you have a sparse (`mode = "coo"`) central object and you want to use `"coo"` sampling, the default behavior is to only use the nonzero elements as samples (i.e. the dataset size will be equal to the number of nonzero elements).
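
To build some intuition for `"coo"` sampling, the following NumPy sketch maps a flat sample index to a (row, column) pair of a dense matrix, and counts the samples a sparse central object would yield. It is a conceptual illustration only: the row-major ordering of samples is an assumption, and none of this is taken from `h5torch`'s internals.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(0, 20, size=(100, 50))

# Dense case: every element of the matrix is one sample.
n_dense = counts.size

# Assumed row-major order: flat index 5 maps to row 0, column 5.
row, col = np.unravel_index(5, counts.shape)
sample = {"central": counts[row, col], "row": int(row), "col": int(col)}

# Sparse case: only the nonzero elements are kept as samples.
rows_nz, cols_nz = np.nonzero(counts)
n_sparse = len(rows_nz)

print(n_dense, n_sparse, sample)
```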

## Filling the HDF5 in batches

Notice how, in the previous examples, we registered all the data for each object in the HDF5 file at once.
For large datasets this might not be possible, as the whole object might not fit in memory. For this purpose, `h5torch` allows pre-specifying the length of an object when it is first registered, and appending to the object with subsequent calls to `File.append()`.

In [8]:
f = h5torch.File("example.h5t", "w")

# generate the first 10 000 data points
X = np.random.randn(10000, 15)

# specify that the length should be 100 000
f.register(X, "central", length = 100_000)

# generate the other 90 000 in a for loop and append
for i in range(1, 10):
    X = np.random.randn(10000, 15)
    f.append(X, "central")

f.close()

dataset = h5torch.Dataset("example.h5t")
print(dataset[5])
print(len(dataset))

dataset.close()

{'central': array([ 0.24442064,  0.5016083 , -0.69151674,  0.30407638,  0.56601632,
        1.87827124, -0.16616496,  0.26501234,  0.39776121,  0.21226936,
       -0.53594993,  1.75798295,  0.32011685,  0.23094861, -0.10523167])}
100000
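
The pattern of reserving the full length up front and then filling it batch by batch can be sketched in plain NumPy. The class `PreallocatedArray` below is hypothetical and only mirrors the idea; how `h5torch` tracks the fill position internally is not described here.

```python
import numpy as np

class PreallocatedArray:
    """Reserve `length` rows up front, then fill them batch by batch."""

    def __init__(self, first_batch, length):
        shape = (length,) + first_batch.shape[1:]
        self.data = np.empty(shape, dtype=first_batch.dtype)
        self.filled = 0  # number of rows written so far
        self.append(first_batch)

    def append(self, batch):
        end = self.filled + len(batch)
        if end > len(self.data):
            raise ValueError("batch does not fit in the reserved length")
        self.data[self.filled:end] = batch
        self.filled = end

# The first batch fixes the trailing shape and dtype, the rest is appended.
store = PreallocatedArray(np.random.randn(10_000, 15), length=100_000)
for _ in range(9):
    store.append(np.random.randn(10_000, 15))

print(store.filled, store.data.shape)
```

Only one batch needs to be held in memory by the caller at a time, which is the point of the batched approach. Here the reserved array is still in memory, whereas an HDF5 dataset would live on disk.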


## Subsetting data

Users can either create different HDF5 files for their different data splits, or specify subsets of the data to use in the dataset via the `subset` argument to `h5torch.Dataset`.

Keeping the data in one file will often save disk space, as e.g. feature metadata that is shared between splits does not have to be saved more than once.

In [9]:
import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
f.register(X, "central")
f.register(y, 0, name = "y")
f.close()


train_indices = np.arange(90)
test_indices = np.arange(90,100)

train_dataset = h5torch.Dataset("example.h5t", subset = train_indices)
test_dataset = h5torch.Dataset("example.h5t", subset = test_indices)
print(len(train_dataset))
print(len(test_dataset))

train_dataset.close()
test_dataset.close()

90
10


Alternatively, users can store an object encoding the data split in the HDF5 file itself. This makes it possible to ship a whole dataset together with its splits as one file, which eases benchmarking between methods.

In this case, users can pass a tuple of `(dataset_key, regex pattern)` as `subset`:

In [10]:
import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
split = np.array(["train"] * 90 + ["test"] * 10)
f.register(X, "central")
f.register(y, 0, name = "y")
f.register(split, 0, name = "split", dtype_save="bytes")
f.close()


train_dataset = h5torch.Dataset("example.h5t", subset = ("0/split", "train"))
test_dataset = h5torch.Dataset("example.h5t", subset = ("0/split", "test"))
print(len(train_dataset), len(test_dataset))

train_dataset.close()
test_dataset.close()

90 10
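
Since the second element of the tuple is a regex pattern, it is worth checking which labels a pattern selects before relying on it. The sketch below does this with Python's `re` module. It assumes `re.match` semantics (matching from the start of the string); which matching function `h5torch` uses is not stated here, and the helper `matching_indices` is hypothetical.

```python
import re
import numpy as np

def matching_indices(labels, pattern):
    """Indices of all labels that match `pattern` from their first character."""
    regex = re.compile(pattern)
    return np.array(
        [i for i, label in enumerate(labels) if regex.match(str(label))],
        dtype=int,
    )

# A split with a label that shares its prefix with "train".
split = np.array(["train"] * 80 + ["train_val"] * 10 + ["test"] * 10)

loose = matching_indices(split, "train")    # also picks up "train_val"
strict = matching_indices(split, "train$")  # anchored: "train" only
test_idx = matching_indices(split, "test")

print(len(loose), len(strict), len(test_idx))
```

When split labels share prefixes, anchoring the pattern is a simple way to avoid selecting more samples than intended.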


## Sample processing

Users can manipulate how the final data sample is presented to them via the `sample_processor` argument. The input to this argument is expected to be a callable with the `h5py` handle to the HDF5 file as first input and the sample provided by `h5torch.Dataset` as second input.

An example is given using the previously created dataset, where the labels are thresholded to create a binary label and only the first 5 features of `X` are kept:

In [11]:
dataset = h5torch.Dataset("example.h5t")

dataset[5]

{'central': array([ 1.15874599,  0.75935053, -0.46822231,  0.09256801,  0.98858088,
        -0.95437084,  0.93028141,  0.39072497,  1.12033718, -0.84810734,
         0.18166637, -0.67633619,  0.66342818, -0.18298203,  0.33764982]),
 '0/split': 'train',
 '0/y': 0.9674837449509108}

In [12]:
def sample_processor(f, sample):
    y = (sample["0/y"] > 0.5).astype(int)
    X = sample["central"][:5]
    return X, y

dataset = h5torch.Dataset("example.h5t", sample_processor=sample_processor)

dataset[5]

(array([ 1.15874599,  0.75935053, -0.46822231,  0.09256801,  0.98858088]), 1)

Note how we used `sample_processor` here to turn the dict-based sample into a tuple one.

`sample_processor` takes only `f` and `sample` as input arguments. If more arguments are needed to control how samples are post-processed, the processor can be written as a callable class:

In [13]:
class SampleProcessor(object):
    def __init__(self, threshold = 0.5):
        self.threshold = threshold
    def __call__(self, f, sample):
        y = (sample["0/y"] > self.threshold).astype(int)
        X = sample["central"][:5]
        return X, y


dataset = h5torch.Dataset("example.h5t", sample_processor=SampleProcessor(threshold = 0.9))

dataset[5]

(array([ 1.15874599,  0.75935053, -0.46822231,  0.09256801,  0.98858088]), 1)
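
Since the argument is described as any callable taking the file handle and the sample, `functools.partial` may be a lighter alternative to a class for binding extra parameters. The sketch below is tested on a hand-made sample dictionary rather than on a real `h5torch.Dataset`; the function `threshold_processor` and its `n_features` argument are illustrative names.

```python
from functools import partial
import numpy as np

def threshold_processor(f, sample, threshold=0.5, n_features=5):
    """Binarize the label and keep the first `n_features` features."""
    y = int(sample["0/y"] > threshold)
    X = sample["central"][:n_features]
    return X, y

# Bind the extra argument; the result still has the (f, sample) signature.
processor = partial(threshold_processor, threshold=0.9)

# Hand-made sample in the layout returned by the dataset; `f` is unused here.
fake_sample = {"central": np.arange(15, dtype=float), "0/y": 0.95}
X, y = processor(None, fake_sample)
print(X, y)
```

Under that assumption, the resulting object would be passed as `sample_processor=processor` in the same way as the class instance above.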

## Slicing-type datasets

The default behavior is to let a sample constitute a single index along an axis of the dataset. For some applications, however, a slice of data constitutes a sample. Examples are time series and sequence labeling tasks.

Which slices to take is controlled by `window_size` and `overlap`.

In [14]:
dataset = h5torch.SliceDataset("example.h5t", window_size = 10, overlap = 0)

sample = dataset[5]
print({k: v.shape for k, v in sample.items()})
print(len(dataset))

{'central': (10, 15), '0/split': (10,), '0/y': (10,)}
10


The behavior of `window_size` and `overlap` can be overridden by `window_indices`:

In [15]:
window_indices = np.array([
    [15, 20],
    [45, 50],
    [75, 80],
])

dataset = h5torch.SliceDataset("example.h5t", window_indices=window_indices)

sample = dataset[1]
print({k: v.shape for k, v in sample.items()})
print(len(dataset))

{'central': (5, 15), '0/split': (5,), '0/y': (5,)}
3
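
Custom windows in the `[start, end]` format used above (end exclusive, judging by the shapes in the output) can be generated with NumPy and then passed as `window_indices`. The helper below is a sketch under two assumptions that are not taken from `h5torch`: consecutive windows are shifted by `window_size - overlap`, and an incomplete trailing window is dropped.

```python
import numpy as np

def make_windows(length, window_size, overlap=0):
    """Return an (n_windows, 2) array of [start, end) pairs."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")
    stride = window_size - overlap  # assumed relation between the two settings
    starts = np.arange(0, length - window_size + 1, stride)
    return np.stack([starts, starts + window_size], axis=1)

windows = make_windows(100, window_size=10, overlap=0)
overlapping = make_windows(100, window_size=10, overlap=5)

print(windows.shape, windows[5])
print(overlapping.shape, overlapping[:3])
```

Building the windows explicitly also makes it easy to drop windows that cross a boundary, for example between two recordings stored back to back.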
