# Neural Networks: Part 2 - Dataloader

## Making a dataloader

So far, we have iterated through our datasets manually to train our models. That gets clunky as soon as you want to train on batches of data or shuffle the data. PyTorch's `DataLoader` is a class that wraps a dataset and handles both. Today, we will write our own version of it, so we can streamline testing different training strategies.

In [1]:
import torch

### Minimal version

Let's first just make a minimal version. It is a class that will accept a *training/val/test dataset* and will simply pass us entries from that dataset on demand, one at a time.

We will make use of an iterator, which we get by calling `iter(...)` on a collection. Each call to `next(...)` on the iterator returns the next item. Iterators are "lazy": an item is only produced when we ask for it. When the items are generated or read on demand, the full dataset never has to sit in memory at once, which matters for large datasets.

In [2]:
# an example of constructing an iterator
random_objects = ['chair', 'doorknob', 'stereo', 'paper airplane']
iterator = iter(random_objects)

In [3]:
# call next on the iterator repeatedly
next(iterator)

'chair'
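
Two details are worth seeing before we build on this: what happens when an iterator runs out, and what "lazy" looks like in practice. The cell below is a small sketch in plain Python; `lazy_rows` is a hypothetical helper made up for this illustration and is not part of the course code.

```python
random_objects = ['chair', 'doorknob', 'stereo', 'paper airplane']
iterator = iter(random_objects)

# Drain the iterator; each call to next() advances it by one item.
seen = [next(iterator) for _ in range(len(random_objects))]

# Once exhausted, next() raises StopIteration instead of wrapping around.
try:
    next(iterator)
    exhausted = False
except StopIteration:
    exhausted = True

# A generator function (hypothetical helper) builds each item only on request.
def lazy_rows(objects):
    for idx, obj in enumerate(objects):
        yield {'idx': idx, 'obj': obj}

rows = lazy_rows(random_objects)
first = next(rows)  # only the first dict has been built at this point
```

An exhausted iterator does not restart on its own, so a dataloader that should loop over the data has to catch `StopIteration` and build a fresh iterator.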

I've started the code here. Fill in the missing `generate_iter` method.

In [4]:
class SimpleDataloader():
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.iter = self.generate_iter()

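    def generate_iter(self):
        # Build one {'x', 'y'} dict per row, then wrap the list in an iterator.
        batches = []
        for idx in range(self.x.shape[0]):
            batches.append(
                {
                    'x': self.x[idx],
                    'y': self.y[idx],
                }
            )
        return iter(batches)

    def fetch_row(self):
        # Raises StopIteration once every row has been returned.
        return next(self.iter)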

Load in our processed data, and try fetching rows from it using our dataloader.

In [5]:
data_dict = torch.load('data/wages_processed.pt')
train_dataloader = SimpleDataloader(data_dict['x_train'],data_dict['y_train'])

In [6]:
train_dataloader.fetch_row()

{'x': tensor([-0.3905,  0.4840, -1.3102,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,
          1.0000]),
 'y': tensor([0.])}

To make the dataloader really useful, we should be able to set a batch size, so that each call to the iterator returns a whole batch of data.

Take a look in `model/dataloader.py`. I have included just the framework of the class there, with descriptions of what each function should do. Fill it in yourself, and keep in mind these constraints:

- If the iterator reaches the end, it should restart
- Data should be randomized each time the iterator resets
- Any batch size should be valid
- All data needs to be iterated through before resetting
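
If you want something to compare your attempt against, here is one possible way to satisfy the four constraints above. It is a sketch on plain Python lists, not the contents of `model/dataloader.py`: the class name `SketchDataloader`, the `batch_size` and `seed` arguments, and the choice to let the last batch be smaller are all assumptions made for this illustration. The output keys mirror the ones used elsewhere in this notebook.

```python
import random


class SketchDataloader:
    """Hypothetical batching loader, written only to illustrate the constraints."""

    def __init__(self, x, y, batch_size=1, seed=None):
        assert len(x) == len(y), "x and y need the same number of rows"
        assert batch_size >= 1, "batch_size must be a positive integer"
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.rng = random.Random(seed)  # seed is only here to make runs repeatable
        self.iter = self.generate_iter()

    def generate_iter(self):
        # Draw a new random row order at the start of every pass.
        order = list(range(len(self.x)))
        self.rng.shuffle(order)
        # Walk the shuffled order in steps of batch_size, so every row is
        # served exactly once per pass; the final batch may be smaller.
        for batch_idx, start in enumerate(range(0, len(order), self.batch_size)):
            rows = order[start:start + self.batch_size]
            yield {
                'x_batch': [self.x[i] for i in rows],
                'y_batch': [self.y[i] for i in rows],
                'batch_idx': batch_idx,
            }

    def fetch_batch(self):
        try:
            return next(self.iter)
        except StopIteration:
            # The pass is finished: restart with a fresh shuffle.
            self.iter = self.generate_iter()
            return next(self.iter)


# Toy data: 5 rows, so batch_size=2 gives batches of 2, 2 and 1 rows.
x = [[float(i), float(i) * 2] for i in range(5)]
y = [[float(i % 2)] for i in range(5)]
loader = SketchDataloader(x, y, batch_size=2, seed=0)
epoch = [loader.fetch_batch() for _ in range(3)]
restart = loader.fetch_batch()  # first batch of the next pass
```

With tensors instead of lists, the two list comprehensions could be replaced by indexing directly, for example `self.x[rows]`, which returns the selected rows as a single tensor.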

Use the cell below to test your new class.

In [7]:
from model.dataloader_Solution import CustomDataloader

data_dict = torch.load('data/wages_processed.pt')
train_dataloader = CustomDataloader(data_dict['x_train'],data_dict['y_train'])
train_dataloader.fetch_batch()

{'x_batch': tensor([[-0.3905,  0.4840, -1.3102,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,
           0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,
           1.0000]]),
 'y_batch': tensor([[0.]]),
 'batch_idx': 0}