This notebook derives a smaller, more specific dataset from the original Ra dataset: it selects a fixed subset of samples and adds training and validation indices.

## Step 1 - Preparations

First, we import the packages we need.

In [1]:
import h5py
import os

Next, we define the function `view_dataset`, which prints the structure of an HDF5 file: every group, and every dataset together with its shape and dtype.

In [2]:
def _print_hdf5(name, obj):
    indent = "  " * name.count("/")
    if isinstance(obj, h5py.Dataset):
        print(f"{indent}[Dataset] {name} shape={obj.shape} dtype={obj.dtype}")
    elif isinstance(obj, h5py.Group):
        print(f"{indent}[Group]   {name}")

def view_dataset(dataset_path):
    with h5py.File(dataset_path, "r") as f:
        f.visititems(_print_hdf5)
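
When the structure is needed programmatically rather than printed, the same `visititems` traversal can collect it into a dictionary. The following is only a sketch: the helper name `collect_structure` and the throwaway demo file are assumptions for illustration and are not part of the original notebook.

```python
import os
import tempfile

import h5py
import numpy as np


def collect_structure(dataset_path):
    """Return {dataset name: (shape, dtype string)} for every dataset in the file."""
    structure = {}

    def _visit(name, obj):
        # Only datasets carry a shape and dtype; groups are skipped.
        if isinstance(obj, h5py.Dataset):
            structure[name] = (obj.shape, str(obj.dtype))

    with h5py.File(dataset_path, "r") as f:
        f.visititems(_visit)
    return structure


# Throwaway demo file (contents made up for illustration).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.h5")
    with h5py.File(demo_path, "w") as f:
        f.create_dataset("images", data=np.zeros((4, 8, 8, 3), dtype=np.uint8))
        f.create_dataset("meta/labels", data=np.arange(4, dtype=np.float64))
    demo_structure = collect_structure(demo_path)

print(demo_structure)
```

Returning a dictionary makes it straightforward to compare the structure of two files or to assert on expected shapes.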


## Step 2 - Load and visualize the dataset

Set `dataset_path` and `new_dataset_path` to the paths of the original and the new dataset, then display the structure of the original dataset.

In [3]:
# Paths
dataset_path = "/home/ubuntu/Desktop/Ra/datasets/Ra_128.h5"
new_dataset_path = "/home/ubuntu/Desktop/Ra/datasets/Ra_128_indexed_binned.h5"

# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(new_dataset_path), exist_ok=True)

# View the original dataset
view_dataset(dataset_path)

[Dataset] images shape=(9192, 128, 128, 3) dtype=uint8
[Dataset] index_train shape=(0,) dtype=float64
[Dataset] index_valid shape=(0,) dtype=float64
[Dataset] labels shape=(9192,) dtype=float64
[Dataset] types shape=(9192,) dtype=int32


From the output above, we can see that the file contains several datasets, which `h5py` returns as `numpy` arrays when read. The important ones are:

1. `images`: the image data
2. `labels`: the label of each image
3. `types`: the type of each image

We need to read these three arrays into memory.

In [4]:
with h5py.File(dataset_path, "r") as f:
    images = f["images"][:]
    labels = f["labels"][:]
    types = f["types"][:]
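
Because `images`, `labels`, and `types` are indexed together later on, it can help to confirm that they describe the same number of samples. Below is a minimal sketch of such a check, using small made-up arrays in place of the real data; the helper name `check_aligned` is hypothetical and not part of the original notebook.

```python
import numpy as np


def check_aligned(images, labels, types):
    """Return the shared sample count, or raise if the first dimensions differ."""
    counts = {len(images), len(labels), len(types)}
    if len(counts) != 1:
        raise ValueError(
            f"Sample counts differ: images={len(images)}, "
            f"labels={len(labels)}, types={len(types)}"
        )
    return counts.pop()


# Made-up stand-ins with the same layout as the real arrays.
demo_images = np.zeros((5, 128, 128, 3), dtype=np.uint8)
demo_labels = np.zeros(5, dtype=np.float64)
demo_types = np.zeros(5, dtype=np.int32)

n_samples = check_aligned(demo_images, demo_labels, demo_types)
print(n_samples)
```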

## Step 3 - Process the data

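We select a fixed subset of the data. `bins` holds 10 bins of 30 sample indices each (300 in total). The arrays are reduced to the selected samples, in the order of the flattened bin list, and the selected samples are then assigned alternately to the training set (even positions) and the validation set (odd positions).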

In [5]:
bins = [
    [
        2334,
        920,
        1385,
        2636,
        1846,
        541,
        1478,
        715,
        727,
        830,
        430,
        1824,
        2084,
        398,
        1018,
        1599,
        2586,
        1039,
        2461,
        867,
        1413,
        1503,
        2153,
        331,
        1409,
        976,
        863,
        1828,
        2208,
        1625,
    ],
    [
        3341,
        2996,
        3297,
        3363,
        3145,
        2876,
        2769,
        3008,
        3240,
        2820,
        3340,
        2884,
        2871,
        2937,
        3433,
        3028,
        3173,
        3129,
        3276,
        2885,
        3309,
        3216,
        2758,
        2880,
        2953,
        2881,
        3253,
        3312,
        3000,
        3103,
    ],
    [
        4229,
        4345,
        3730,
        4481,
        3773,
        3550,
        3967,
        3497,
        4193,
        3681,
        4358,
        3527,
        4172,
        4213,
        4479,
        4360,
        3886,
        4265,
        4054,
        3870,
        3542,
        4158,
        4412,
        3597,
        3599,
        3485,
        3865,
        3731,
        3716,
        3537,
    ],
    [
        5180,
        5782,
        5209,
        6223,
        5844,
        5512,
        4604,
        5425,
        5545,
        5345,
        6064,
        5753,
        4925,
        4994,
        5075,
        4879,
        5110,
        6232,
        5771,
        6229,
        5129,
        4952,
        5921,
        5317,
        4835,
        5410,
        5060,
        4878,
        5774,
        5183,
    ],
    [
        6808,
        6555,
        6863,
        6524,
        6860,
        6817,
        6916,
        6388,
        6842,
        6456,
        6579,
        6342,
        6366,
        6401,
        6725,
        6387,
        6313,
        6790,
        6499,
        6924,
        6442,
        6459,
        6360,
        6937,
        6788,
        6421,
        6469,
        6904,
        6936,
        6714,
    ],
    [
        7178,
        7057,
        7054,
        6983,
        7067,
        7216,
        7123,
        7120,
        7281,
        6966,
        7065,
        7164,
        7236,
        7209,
        7050,
        7093,
        7059,
        7088,
        7098,
        6950,
        7144,
        6996,
        7244,
        7060,
        7086,
        7031,
        7049,
        7113,
        7080,
        7258,
    ],
    [
        7550,
        7545,
        7549,
        7323,
        7317,
        7647,
        7339,
        7546,
        7595,
        7452,
        7518,
        7569,
        7425,
        7534,
        7400,
        7642,
        7526,
        7340,
        7516,
        7374,
        7299,
        7530,
        7658,
        7284,
        7362,
        7542,
        7309,
        7330,
        7293,
        7360,
    ],
    [
        7958,
        7920,
        8277,
        8021,
        7912,
        7828,
        8163,
        7748,
        7718,
        8076,
        8215,
        7895,
        7750,
        8268,
        7795,
        8177,
        7729,
        7799,
        7898,
        7727,
        7992,
        7812,
        8092,
        8189,
        7899,
        7784,
        8094,
        7990,
        8109,
        7761,
    ],
    [
        8880,
        8443,
        8504,
        8773,
        8914,
        8735,
        8561,
        8402,
        8454,
        8623,
        8833,
        8648,
        8905,
        8654,
        8901,
        8456,
        8634,
        8538,
        8474,
        8650,
        8446,
        8804,
        8769,
        8835,
        8637,
        8449,
        8910,
        8929,
        8611,
        8445,
    ],
    [
        9097,
        9036,
        9159,
        9160,
        8958,
        9047,
        9114,
        9151,
        9018,
        9084,
        9040,
        8973,
        9132,
        8952,
        9010,
        9121,
        9125,
        8961,
        9099,
        9009,
        9155,
        9038,
        9063,
        8968,
        9157,
        9091,
        9108,
        8959,
        8988,
        9096,
    ],
]
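# Flatten the 10 bins into a single list of 300 sample indices
bins_flatten = [item for sublist in bins for item in sublist]

# Keep only the selected samples, in bin order
images = images[bins_flatten]
labels = labels[bins_flatten]
types = types[bins_flatten]

# Alternate the selected samples between training (even) and validation (odd).
# These indices refer to positions in the new 300-sample arrays.
N = len(bins_flatten)
index_train = list(range(0, N, 2))
index_valid = list(range(1, N, 2))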
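
The selection and split can be illustrated on a toy example. The sketch below uses made-up values and hypothetical `toy_*` names; it mirrors the same steps (flatten the bins, select by integer index list, split alternately) and adds two sanity checks that the original cell does not perform.

```python
import numpy as np

# Toy stand-ins for the real data (values made up for illustration).
toy_labels = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
toy_bins = [[5, 1], [6, 2], [7, 3]]  # 3 bins, 2 indices each
bin_size = 2

# Flatten the bins, keeping bin order.
toy_flat = [i for sub in toy_bins for i in sub]

# Sanity checks: no sample selected twice, and every index is in range.
assert len(set(toy_flat)) == len(toy_flat)
assert max(toy_flat) < len(toy_labels)

# NumPy integer-list indexing returns the samples in the order given.
toy_selected = toy_labels[toy_flat]

# Alternate positions in the selected array between train and valid.
n = len(toy_flat)
toy_train = list(range(0, n, 2))
toy_valid = list(range(1, n, 2))

# Which bin each training position came from.
train_bins = [p // bin_size for p in toy_train]

print(toy_selected.tolist())
print(toy_train, toy_valid, train_bins)
```

Because every bin has an even number of entries (2 here, 30 in the notebook), the alternating split gives each bin the same number of training and validation samples.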

## Step 4 - Construct and output the new dataset

With all the pieces in place, the last step is to write them to the new file.

In [6]:
with h5py.File(new_dataset_path, "w") as f:
    f.create_dataset("images", data=images)
    f.create_dataset("labels", data=labels)
    f.create_dataset("types", data=types)
    f.create_dataset("index_train", data=index_train)
    f.create_dataset("index_valid", data=index_valid)

We can also view the structure of the new dataset.

In [7]:
view_dataset(new_dataset_path)

[Dataset] images shape=(300, 128, 128, 3) dtype=uint8
[Dataset] index_train shape=(150,) dtype=int64
[Dataset] index_valid shape=(150,) dtype=int64
[Dataset] labels shape=(300,) dtype=float64
[Dataset] types shape=(300,) dtype=int32
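
Beyond inspecting shapes, the written file can be read back and compared with the in-memory arrays. The following is a sketch of such a round-trip check on a made-up miniature dataset written to a temporary file; all `demo_*` and `read_*` names are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up miniature dataset (4 samples) standing in for the real arrays.
demo_images = np.arange(4 * 2 * 2 * 3, dtype=np.uint8).reshape(4, 2, 2, 3)
demo_labels = np.array([0.1, 0.2, 0.3, 0.4])
demo_index_train = list(range(0, 4, 2))
demo_index_valid = list(range(1, 4, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_out.h5")

    # Write the file with the same create_dataset calls as the notebook uses.
    with h5py.File(demo_path, "w") as f:
        f.create_dataset("images", data=demo_images)
        f.create_dataset("labels", data=demo_labels)
        f.create_dataset("index_train", data=demo_index_train)
        f.create_dataset("index_valid", data=demo_index_valid)

    # Read everything into memory before the temporary directory is removed.
    with h5py.File(demo_path, "r") as f:
        read_images = f["images"][:]
        read_labels = f["labels"][:]
        read_train = f["index_train"][:]
        read_valid = f["index_valid"][:]

# The stored arrays should match what was written.
images_match = np.array_equal(read_images, demo_images)
labels_match = np.array_equal(read_labels, demo_labels)

# Train and valid indices together should cover every sample exactly once.
all_indices = sorted(np.concatenate([read_train, read_valid]).tolist())
covers_all = all_indices == list(range(len(read_labels)))

print(images_match, labels_match, covers_all)
```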
