# NilmTorch - Map Dataset vs Iterable Dataset

- Authors: Khalid OUBLAL, Nikolaos Virtsionis, Christoforos Nalmpantis

There are two main types of dataset objects in PyTorch: Dataset and IterableDataset. The choice between them depends mainly on the size of the dataset. An IterableDataset suits large datasets, typically in the range of hundreds of gigabytes, because it is lazy: samples are read only when they are needed, so processing can start without loading everything first. A Dataset is more appropriate for smaller datasets.

One key difference between the two is how the data is accessed. With a regular Dataset, you can access a specific row by index, for example `hdf5[i]`. This random-access capability is what "map-style" refers to. In contrast, an IterableDataset provides streaming-like access: the data is read sequentially, one sample after another.
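The two access patterns can be sketched in plain Python. This is only an illustration under simplifying assumptions: `SquaresMapStyle` and `SquaresIterable` are hypothetical names, and the PyTorch base classes (`torch.utils.data.Dataset` and `torch.utils.data.IterableDataset`) are left out so the sketch runs without torch installed.

```python
class SquaresMapStyle:
    """Map-style access: supports len() and random access by index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i


class SquaresIterable:
    """Iterable-style access: items are produced one by one, in order."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i * i


map_ds = SquaresMapStyle(5)
iter_ds = SquaresIterable(5)

third = map_ds[3]          # random access: jump straight to row 3
streamed = list(iter_ds)   # sequential access: rows arrive in order
```

The map-style object needs to know its length and how to fetch any row, while the iterable object only needs to know how to produce the next row.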

To illustrate, consider the ImageNet-1k dataset. With a regular Dataset, you download the whole dataset first and can then access any specific row as needed. With an IterableDataset, the data is streamed sequentially, so it can be processed without loading the entire dataset into memory at once.

Overall, the choice between Dataset and IterableDataset depends on the size and nature of the dataset, with IterableDataset being more suitable for large datasets requiring streaming-like access, and Dataset being a versatile option for smaller datasets.

#### 1) Load libraries

In [4]:
import torch
import sys
from constants.constants import *
import h5py
import numpy as np
import random

#### 2) NilmTorch_iterableDataset 

In [None]:

class NilmTorch_iterableDataset(torch.utils.data.IterableDataset):
    def __init__(self,type_files="hdf5",
                 data_files = None,
                 sequence_length = 256,
                 stride = 1,
                 FILE_HDF5_list = ['ID_101'], 
                 streaming = True):
        """
        Stream windows from an HDF5 file without loading it into memory.
        Unlike a map-style dataset, which maps the whole data at once and then indexes into it.
        """
        super().__init__()
        # Multi-correlation: the information rate
        self.sequence_length = sequence_length
        self.stride = stride
        self.FILE_HDF5_list = FILE_HDF5_list
        if type_files in ["hdf5", "HDF5"]:
            # Read-only mode is enough for streaming and avoids accidental writes
            self.FILE_HDF5 = h5py.File(data_files, 'r')
        else:
            raise ValueError(f'Not supported {type_files}')

    def __iter__(self):
        for item in self.FILE_HDF5_list:
            # h5py datasets are sliced lazily, so only one window is read at a time.
            # The labels stored under 'label' are not yielded here.
            X = self.FILE_HDF5[item]['data']
            # do any preprocessing or data augmentation here
            for i in range(0, len(X) - self.sequence_length + 1, self.stride):
                sequence = self.padding_seqs(X[i:i + self.sequence_length])
                # (sequence_length, n_features) -> (n_features, sequence_length)
                yield sequence.transpose(1, 0)
            
    def padding_seqs(self, in_array):
        if len(in_array) == self.sequence_length:
            return in_array
        # Zero-pad short windows, keeping the feature dimension when there is one
        if in_array.ndim > 1:
            out_array = np.zeros((self.sequence_length, in_array.shape[1]))
        else:
            out_array = np.zeros(self.sequence_length)
        out_array[:len(in_array)] = in_array
        return out_array
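The windowing logic in `__iter__` can be tried out without an HDF5 file. The following is a minimal sketch in plain Python; `sliding_windows` and `count_windows` are hypothetical helper names that are not part of NilmTorch. It uses the same `range(0, len(X) - sequence_length + 1, stride)` bounds as the class.

```python
def sliding_windows(series, sequence_length, stride):
    """Yield every full window of `sequence_length` items, `stride` apart."""
    for start in range(0, len(series) - sequence_length + 1, stride):
        yield series[start:start + sequence_length]


def count_windows(n_samples, sequence_length, stride):
    """Number of full windows produced by sliding_windows."""
    if n_samples < sequence_length:
        return 0
    return (n_samples - sequence_length) // stride + 1


series = list(range(10))
windows = list(sliding_windows(series, sequence_length=4, stride=3))
# windows start at positions 0, 3 and 6
```

Because the loop bounds only produce full windows, every slice already has `sequence_length` rows, so with these bounds `padding_seqs` returns its input unchanged. Samples at the end of the series that do not fill a complete window are skipped.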
    



#### 3) ShuffleDataset 

In [None]:
class ShuffleDataset(torch.utils.data.IterableDataset):
  def __init__(self, dataset, buffer_size):
    super().__init__()
    self.dataset = dataset
    self.buffer_size = buffer_size

  def __iter__(self):
    dataset_iter = iter(self.dataset)

    # Fill the buffer with up to buffer_size items (fewer if the dataset is short)
    shufbuf = []
    for _ in range(self.buffer_size):
      try:
        shufbuf.append(next(dataset_iter))
      except StopIteration:
        break

    # For each new item, yield a random buffered item and store the new one in its slot
    for item in dataset_iter:
      evict_idx = random.randrange(len(shufbuf))
      yield shufbuf[evict_idx]
      shufbuf[evict_idx] = item

    # Drain what is left in random order
    random.shuffle(shufbuf)
    yield from shufbuf
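The buffer-based shuffling used by ShuffleDataset can be sketched as a plain generator, which makes it easy to check its behavior on a small range. This is a sketch under assumptions, not the class above: `buffered_shuffle` and its `seed` parameter are hypothetical, and shuffling the remaining buffer before draining it is a choice made for this example.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    iterator = iter(iterable)

    # Fill the buffer with the first buffer_size items (or fewer)
    buffer = []
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break

    # Swap each incoming item with a randomly chosen buffered item
    for item in iterator:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item

    # Drain the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
```

Every item is emitted exactly once, but an item can only move as far forward as the buffer allows, so the result is a local shuffle. A larger `buffer_size` gives a better approximation of a full shuffle at the cost of more memory; with `buffer_size=1` the stream comes out in its original order.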

### Differences in Speed:

Regular Dataset objects and IterableDataset objects also differ in speed. As noted above, an IterableDataset is generally the better fit for large datasets, on the order of hundreds of gigabytes, because it reads data lazily, while a Dataset is recommended for smaller datasets.

With a regular Dataset, you can access a specific row by index, for example `hdf5[0]`, which is random access. This is why it is commonly called a "map-style" dataset: once the data is available locally, any row can be read directly.

The points below describe the Dataset class of the Hugging Face `datasets` library rather than `torch.utils.data.Dataset`. There, regular (map-style) Dataset objects are backed by Arrow, which provides efficient random access through memory mapping and an in-memory format. This minimizes costly system calls and deserialization when reading data from disk. Iterating over the dataset with a for loop is also fast because it reads contiguous Arrow record batches.

However, if the Dataset carries an indices mapping, for example after shuffling with Dataset.shuffle(), access can become up to 10 times slower. There are two reasons: each read needs an extra step to look up the row index in the indices mapping, and the data is no longer read in contiguous chunks. To restore the original speed, you need to rewrite the dataset on disk with Dataset.flatten_indices(), which removes the indices mapping. Keep in mind that this rewrite can be time-consuming, especially for larger datasets.
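The effect of an indices mapping can be illustrated with a small conceptual sketch in plain Python. It is not the Hugging Face implementation: `rows`, `indices`, `read_shuffled`, and `flattened` are hypothetical names, and a Python list stands in for rows stored contiguously on disk.

```python
import random

# Stand-in for rows stored contiguously on disk
rows = [f"row-{i}" for i in range(8)]

# Shuffling keeps the stored rows untouched and only records a new order
rng = random.Random(0)
indices = list(range(len(rows)))
rng.shuffle(indices)


def read_shuffled(i):
    """Read logical row i: one extra lookup, then a jump to a scattered position."""
    return rows[indices[i]]


# Count reads that do not directly follow the previously read position
jumps = sum(1 for a, b in zip(indices, indices[1:]) if b != a + 1)

# "Flattening" rewrites the rows in the shuffled order, so the mapping is no
# longer needed and sequential reads are contiguous again
flattened = [rows[j] for j in indices]
```

After flattening, `flattened[i]` returns the same row as `read_shuffled(i)` without the extra lookup, at the price of copying every row once.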

In summary, regular Dataset objects excel at random access and efficient data loading, but the presence of an indices mapping can significantly impact speed. Consider the dataset structure and the need for indices mapping when optimizing performance.