# DataLoader
In the previous two notebooks you have implemented two different datasets that we can now use to access our data. However, in machine learning, we often need to perform a few additional data preparation steps before we can start training models.

An important additional class for data preparation is the **DataLoader**. By wrapping a dataset in a dataloader, we will be able to load small subsets of the dataset at a time, instead of having to load each sample separately. In machine learning, the small subsets are referred to as **mini-batches**, which will play an important role later in the lecture.

In this notebook, you will implement your own dataloader, which you can then use to load mini-batches from the datasets you implemented previously.

First, we need to import libraries and code, as always.

In [1]:
import numpy as np

from exercise_code.data import DataLoader, DummyDataset
from exercise_code.networks import DummyNetwork
from exercise_code.tests import test_dataloader, save_pickle

%load_ext autoreload
%autoreload 2

## Iterating over a Dataset
Throughout this notebook we will use a dummy dataset that contains all even numbers from 2 to 100. Similar to the datasets you have implemented before, the dummy dataset has a `__len__()` method that allows us to call `len(dataset)`, as well as a `__getitem__()` method, which allows us to call `dataset[i]` and returns a dict `{"data": val}` where `val` is the i-th even number. If you would like to see the code, have a look at `DummyDataset` in `exercise_code/data/base_dataset.py`.
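
To make the interface described above concrete, here is a minimal sketch of a dataset with the same two methods. The class name `EvenNumbersDataset` and its internals are assumptions made for illustration; the actual `DummyDataset` in `exercise_code/data/base_dataset.py` may be implemented differently.

```python
class EvenNumbersDataset:
    """Hypothetical stand-in for DummyDataset (not the course implementation)."""

    def __init__(self, divisor=2, limit=100):
        # Keep every number in [1, limit] that is divisible by `divisor`.
        self.data = [i for i in range(1, limit + 1) if i % divisor == 0]

    def __len__(self):
        # Enables len(dataset).
        return len(self.data)

    def __getitem__(self, index):
        # Enables dataset[index]; each sample is a dict with a "data" key.
        return {"data": self.data[index]}


toy_dataset = EvenNumbersDataset(divisor=2, limit=100)
print(len(toy_dataset), toy_dataset[0], toy_dataset[-1])
# 50 {'data': 2} {'data': 100}
```

Because `__getitem__` forwards the index to a Python list, negative indices such as `toy_dataset[-1]` work as well.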

Let's start by defining the dataset and calling its methods to get a better feel for it.

In [2]:
dataset = DummyDataset(
    root=None,
    divisor=2,
    limit=100
)
print(
    "Dataset Length:\t", len(dataset),
    "\nFirst Element:\t", dataset[0],
    "\nLast Element:\t", dataset[-1],
)

Dataset Length:	 50 
First Element:	 {'data': 2} 
Last Element:	 {'data': 100}


In the following, we will write some code to iterate over the dataset in mini-batches, similar to what we would like our dataloader to do. The number of samples to load per mini-batch is referred to as **batch size**. For the remainder of this notebook, let's use a batch size of 3.

In [3]:
batch_size = 3

Let us now define a simple function that iterates over the dataset and groups samples into mini-batches:

In [4]:
def build_batches(dataset, batch_size):
    batches = []  # list of all mini-batches
    batch = []  # current mini-batch
    for i in range(len(dataset)):
        batch.append(dataset[i])  # add the i-th sample to the current mini-batch
        if len(batch) == batch_size:  # if the current mini-batch is full,
            batches.append(batch)  # add it to the list of mini-batches,
            batch = []  # and start a new mini-batch
    return batches

batches = build_batches(
    dataset=dataset,
    batch_size=batch_size
)

Let's have a look at our mini-batches:

In [5]:
def print_batches(batches):  
    for i, batch in enumerate(batches):
        print("mini-batch %d:" % i, str(batch))

print_batches(batches)

mini-batch 0: [{'data': 2}, {'data': 4}, {'data': 6}]
mini-batch 1: [{'data': 8}, {'data': 10}, {'data': 12}]
mini-batch 2: [{'data': 14}, {'data': 16}, {'data': 18}]
mini-batch 3: [{'data': 20}, {'data': 22}, {'data': 24}]
mini-batch 4: [{'data': 26}, {'data': 28}, {'data': 30}]
mini-batch 5: [{'data': 32}, {'data': 34}, {'data': 36}]
mini-batch 6: [{'data': 38}, {'data': 40}, {'data': 42}]
mini-batch 7: [{'data': 44}, {'data': 46}, {'data': 48}]
mini-batch 8: [{'data': 50}, {'data': 52}, {'data': 54}]
mini-batch 9: [{'data': 56}, {'data': 58}, {'data': 60}]
mini-batch 10: [{'data': 62}, {'data': 64}, {'data': 66}]
mini-batch 11: [{'data': 68}, {'data': 70}, {'data': 72}]
mini-batch 12: [{'data': 74}, {'data': 76}, {'data': 78}]
mini-batch 13: [{'data': 80}, {'data': 82}, {'data': 84}]
mini-batch 14: [{'data': 86}, {'data': 88}, {'data': 90}]
mini-batch 15: [{'data': 92}, {'data': 94}, {'data': 96}]


As we see, the iteration works, but the output is not very pretty. Let us now write a simple function that combines the dictionaries of samples in a mini-batch.

In [6]:
def combine_batch_dicts(batch):
    batch_dict = {}
    for data_dict in batch:
        for key, value in data_dict.items():
            if key not in batch_dict:
                batch_dict[key] = []
            batch_dict[key].append(value)
    return batch_dict

combined_batches = [combine_batch_dicts(batch) for batch in batches]
print_batches(combined_batches)

mini-batch 0: {'data': [2, 4, 6]}
mini-batch 1: {'data': [8, 10, 12]}
mini-batch 2: {'data': [14, 16, 18]}
mini-batch 3: {'data': [20, 22, 24]}
mini-batch 4: {'data': [26, 28, 30]}
mini-batch 5: {'data': [32, 34, 36]}
mini-batch 6: {'data': [38, 40, 42]}
mini-batch 7: {'data': [44, 46, 48]}
mini-batch 8: {'data': [50, 52, 54]}
mini-batch 9: {'data': [56, 58, 60]}
mini-batch 10: {'data': [62, 64, 66]}
mini-batch 11: {'data': [68, 70, 72]}
mini-batch 12: {'data': [74, 76, 78]}
mini-batch 13: {'data': [80, 82, 84]}
mini-batch 14: {'data': [86, 88, 90]}
mini-batch 15: {'data': [92, 94, 96]}


This looks much more organized.

To perform operations more efficiently later, we would also like the values of the mini-batches to be contained in a numpy array instead of a simple list. Let's briefly write a function for that:

In [7]:
def batch_to_numpy(batch):
    numpy_batch = {}
    for key, value in batch.items():
        numpy_batch[key] = np.array(value)
    return numpy_batch

numpy_batches = [batch_to_numpy(batch) for batch in combined_batches]
print_batches(numpy_batches)

mini-batch 0: {'data': array([2, 4, 6])}
mini-batch 1: {'data': array([ 8, 10, 12])}
mini-batch 2: {'data': array([14, 16, 18])}
mini-batch 3: {'data': array([20, 22, 24])}
mini-batch 4: {'data': array([26, 28, 30])}
mini-batch 5: {'data': array([32, 34, 36])}
mini-batch 6: {'data': array([38, 40, 42])}
mini-batch 7: {'data': array([44, 46, 48])}
mini-batch 8: {'data': array([50, 52, 54])}
mini-batch 9: {'data': array([56, 58, 60])}
mini-batch 10: {'data': array([62, 64, 66])}
mini-batch 11: {'data': array([68, 70, 72])}
mini-batch 12: {'data': array([74, 76, 78])}
mini-batch 13: {'data': array([80, 82, 84])}
mini-batch 14: {'data': array([86, 88, 90])}
mini-batch 15: {'data': array([92, 94, 96])}
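
The efficiency argument for NumPy arrays can be illustrated with a small sketch. The variable names below are chosen for this example only and do not appear elsewhere in the notebook.

```python
import numpy as np

values_list = [2, 4, 6]
values_array = np.array(values_list)

# Arithmetic on an array is applied element-wise in a single call.
print(values_array * 2)     # [ 4  8 12]
print(values_array.mean())  # 4.0

# The same operator on a plain list repeats the list instead.
print(values_list * 2)      # [2, 4, 6, 2, 4, 6]
```

Element-wise operations like these are what a model will later apply to a whole mini-batch at once.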


Lastly, we would like to make the loading a bit more memory efficient. Instead of loading the entire dataset into memory at once, let us only load samples when they are needed. We can do so by building a Python generator, using the `yield` keyword. See https://wiki.python.org/moin/Generators for more information on generators.
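
As a minimal sketch of what "only when needed" means, the toy generator below records which values it has actually produced. The names `count_up_to` and `produced` are made up for this illustration.

```python
produced = []  # records every value the generator has computed so far


def count_up_to(n):
    """Yield 1..n one value at a time."""
    for i in range(1, n + 1):
        produced.append(i)
        yield i


gen = count_up_to(3)
print(produced)   # [] (the function body has not started yet)
print(next(gen))  # 1
print(produced)   # [1] (only the requested value was computed)
print(list(gen))  # [2, 3]
```

Calling a generator function only creates the generator object; the body runs step by step as values are requested.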

In [8]:
def build_batch_iterator(dataset, batch_size, shuffle):
    if shuffle:
        index_iterator = iter(np.random.permutation(len(dataset)))  # define indices as iterator
    else:
        index_iterator = iter(range(len(dataset)))  # define indices as iterator

    batch = []
    for index in index_iterator:  # iterate over indices using the iterator
        batch.append(dataset[index])
        if len(batch) == batch_size:
            yield batch  # the yield keyword turns this function into a generator
            batch = []
            
batch_iterator = build_batch_iterator(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True
)
batches = []
for batch in batch_iterator:
    batches.append(batch)

print_batches(
    [batch_to_numpy(combine_batch_dicts(batch)) for batch in batches]
)

mini-batch 0: {'data': array([ 4, 80, 68])}
mini-batch 1: {'data': array([100,   2,  58])}
mini-batch 2: {'data': array([82, 44, 30])}
mini-batch 3: {'data': array([64, 22, 10])}
mini-batch 4: {'data': array([54, 12, 26])}
mini-batch 5: {'data': array([94, 72, 16])}
mini-batch 6: {'data': array([74, 76, 18])}
mini-batch 7: {'data': array([84, 14, 38])}
mini-batch 8: {'data': array([ 6,  8, 28])}
mini-batch 9: {'data': array([88, 42, 70])}
mini-batch 10: {'data': array([66, 78, 48])}
mini-batch 11: {'data': array([40, 62, 32])}
mini-batch 12: {'data': array([52, 36, 46])}
mini-batch 13: {'data': array([50, 86, 34])}
mini-batch 14: {'data': array([60, 24, 92])}
mini-batch 15: {'data': array([20, 56, 96])}


The functionality of the cell above is now pretty close to what we would like our dataloader to do. However, there are still two remaining issues:
1. The last two samples of our dataset are not contained in any mini-batch. This is because the number of samples in our dataset is not divisible by the batch size, so there are a few left-over samples which are implicitly discarded. Ideally, we would like to have an option that allows us to decide how to handle these last samples.
2. With `shuffle=False`, the order of mini-batches, as well as which samples are grouped together, is always the same: samples appear in increasing index order. Ideally, we would like to have an option that allows us to randomize which samples are grouped together. The randomization can be implemented by randomly permuting the indices of the dataset before iterating over it, e.g. using `indices = np.random.permutation(len(dataset))`, as `build_batch_iterator()` does when `shuffle=True`.
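
One possible way to handle the left-over samples is sketched below on a plain list of numbers. The helper `iterate_in_batches` and its `drop_last` flag are assumptions made for illustration; this is not the reference solution, and it leaves out the dict combination and NumPy conversion shown earlier.

```python
import numpy as np


def iterate_in_batches(samples, batch_size, shuffle=False, drop_last=False):
    """Hypothetical helper: yield lists of at most `batch_size` samples."""
    if shuffle:
        indices = np.random.permutation(len(samples))
    else:
        indices = range(len(samples))

    batch = []
    for index in indices:
        batch.append(samples[index])
        if len(batch) == batch_size:
            yield batch
            batch = []

    # After the loop, `batch` holds the left-over samples (if any).
    if batch and not drop_last:
        yield batch


samples = list(range(2, 22, 2))  # ten even numbers: 2, 4, ..., 20
kept = list(iterate_in_batches(samples, batch_size=3, drop_last=False))
dropped = list(iterate_in_batches(samples, batch_size=3, drop_last=True))
print(kept[-1])      # [20]
print(len(dropped))  # 3
```

The only change compared to the loop above is the final check after the `for` loop, which decides whether the incomplete mini-batch is yielded or discarded.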

## TODO: DataLoader Class Implementation
Now it is your turn to put everything together and implement the DataLoader as a proper class.
We provide you with a basic skeleton for this, which you can find in `exercise_code/data/dataloader.py`. Open the file and have a look at the class. Note that the `__init__` method receives four arguments:
* **dataset** is the dataset that the dataloader should load.
* **batch_size** is the mini-batch size, i.e. the number of samples you want to load at the same time.
* **shuffle** is a boolean that defines whether the dataset should be randomly shuffled or not.
* **drop_last** is a boolean that defines how to handle the last mini-batch in your dataset. Specifically, if the number of samples in your dataset is not divisible by the batch size, there will be some samples left over at the end. If `drop_last=True`, we simply discard those samples; otherwise, we return them together as a smaller mini-batch.


Your task is now to implement the two core methods of the dataloader:
* `__len__(self)` should return the length of the dataloader, which corresponds to the number of mini-batches that you can load from the dataset. Similar to your datasets, this will allow you to call `len(dataloader)` later.
* `__iter__(self)` should define how to iterate over the dataloader. For this, you should define an iterable again, similar to what we have done in the `build_batch_iterator()` function above. The mini-batches loaded by `__iter__()` should each be a dict consisting of numpy arrays.
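
The number of mini-batches depends on `drop_last`. The arithmetic can be sketched as follows, where `expected_num_batches` is a hypothetical helper name and not part of the provided skeleton:

```python
def expected_num_batches(num_samples, batch_size, drop_last):
    """Hypothetical helper: how many mini-batches a dataset yields."""
    if drop_last:
        # Only full mini-batches are counted.
        return num_samples // batch_size
    # Ceiling division: a smaller final mini-batch counts as well.
    return (num_samples + batch_size - 1) // batch_size


print(expected_num_batches(50, 3, drop_last=True))   # 16
print(expected_num_batches(50, 3, drop_last=False))  # 17
```

Integer division keeps the result an `int`, which matters because `len()` raises a `TypeError` if `__len__()` returns a float.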

If you're done, run the cells below to check if your dataloader works as intended.

**Hint:** Make use of the code above when implementing your `__iter__()` method. 

In [9]:
dataloader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False,
)

In [10]:
for batch in dataloader:
    print(batch)

{'data': [72, 16, 6]}
{'data': [68, 86, 20]}
{'data': [30, 34, 66]}
{'data': [96, 10, 2]}
{'data': [24, 78, 46]}
{'data': [28, 92, 64]}
{'data': [76, 40, 100]}
{'data': [4, 54, 22]}
{'data': [88, 50, 62]}
{'data': [8, 74, 90]}
{'data': [36, 48, 80]}
{'data': [32, 44, 94]}
{'data': [60, 56, 12]}
{'data': [38, 18, 98]}
{'data': [14, 84, 70]}
{'data': [58, 42, 26]}
{'data': [82, 52]}


In [11]:
dataloader1 = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
)
test_dataloader(
    dataset=dataset,
    dataloader=dataloader1,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
)

LenTestInt passed.
LenTestCorrect passed.
Method __len__() correctly implemented. Tests passed: 2/2
IterTestIterable passed.
IterTestItemType passed.
IterTestBatchSize passed.
IterTestNumBatches passed.
IterTestValuesUnique passed.
IterTestValueRange passed.
IterTestShuffled passed.
IterTestNonDeterministic passed.
Method __iter__() correctly implemented. Tests passed: 8/8
Class DataLoader correctly implemented. Tests passed: 10/10
Score: 100/100


100

In [12]:
dataloader2 = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False
)
test_dataloader(
    dataset=dataset,
    dataloader=dataloader2,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False
)

LenTestInt passed.
LenTestCorrect passed.
Method __len__() correctly implemented. Tests passed: 2/2
IterTestIterable passed.
IterTestItemType passed.
IterTestBatchSize passed.
IterTestNumBatches passed.
IterTestValuesUnique passed.
IterTestValueRange passed.
IterTestShuffled passed.
IterTestNonDeterministic passed.
Method __iter__() correctly implemented. Tests passed: 8/8
Class DataLoader correctly implemented. Tests passed: 10/10
Score: 100/100


100

### Save your DataLoaders for Submission
Simply save your dataloaders using the following cell. This will save them to a pickle file `models/dataloader.p`.

In [13]:
save_pickle(
    data_dict={
        "dataloader1": dataloader1,
        "dataloader2": dataloader2,
    },
    file_name="dataloader.p"
)

## Key Takeaways
1. In machine learning, we often need to load data in **mini-batches**, which are small subsets of the training dataset. How many samples to load per mini-batch is called the **batch size**.
2. In addition to the Dataset class, we use a **DataLoader** class that takes care of mini-batch construction, data shuffling, and more.
3. The dataloader is iterable and only loads those samples of the dataset that are needed for the current mini-batch.

# Submission Instructions

Now that you have completed the necessary parts in the notebook, you can go on and submit your files.

1. Go to [our submission page](https://dvl.in.tum.de/teaching/submission/), register for an account, and log in. We use your matriculation number and send an email with the login details to the associated mail account. When in doubt, log in to TUMonline and check your mails there. You will get an ID, which we need in the next step.
2. Navigate to the `exercise_code` directory and run the `create_submission.sh` file to create the zip file of your model. This will create a single `zip` file that you need to upload. Alternatively, you can zip it manually if you don't want to use the bash script.
3. Log into [our submission page](https://dvl.in.tum.de/teaching/submission/) with your account details and upload the `zip` file. Once successfully uploaded, you should be able to see the submitted "dummy_model.p" file selectable on the top.
4. Click on this file and run the submission script. You will get an email with your score as well as a message if you have surpassed the threshold.

<img src="./images/i2dlsubmission.png">

# Submission Goals

- Goal: Implement a DataLoader that loads mini-batches from a given dataset and supports batch_size, shuffle, and drop_last args.
- Test cases:
  1. Does `__len__()` return the correct data type?
  2. Does `__len__()` return the correct value?
  3. Does `__iter__()` work at all, i.e. is it possible to iterate over the dataloader?
  4. Does `__iter__()` load the correct data type?
  5. Does `__iter__()` load data with correct batch size?
  6. Does `__iter__()` load the correct number of batches?
  7. Does `__iter__()` load every sample only once?
  8. Does `__iter__()` load the smallest and largest sample from the dataset?
  9. Does `__iter__()` shuffle the data correctly (if necessary)?
  10. Does `__iter__()` return non-deterministic values when shuffling?
- Reachable points [0, 100]: 0 if not implemented, 100 if all tests passed, 10 per passed test
- Threshold to clear exercise: 80
- Submission start: __May 11, 2020 23.59__
- Submission deadline: __May 17, 2020 23.59__
- You can make multiple submissions until the deadline. Your __best submission__ will be considered for the bonus.

## Outlook
You have now implemented everything you need to use the CIFAR and House Prices datasets for deep learning model training. Using your dataset and dataloader, your model training will later look something like the following:

In [14]:
dataset = DummyDataset(
    root=None,
    divisor=2,
    limit=200,
)
dataloader = DataLoader(
    dataset=dataset,
    batch_size=3,
    shuffle=True,
    drop_last=True
)
model = DummyNetwork(model_name="dummy")
for minibatch in dataloader:
    model_output = model.forward(minibatch)
    # do more stuff... (soon)

TypeError: bad operand type for unary -: 'dict'
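
The `TypeError` above is what Python raises when a dict is negated, which would happen if `forward()` applies an operation such as `-x` to the whole mini-batch dict rather than to the array stored under `"data"`. The sketch below reproduces this under that assumption; `TinySigmoidNetwork` is a hypothetical stand-in, and the actual `DummyNetwork` may behave differently.

```python
import numpy as np


class TinySigmoidNetwork:
    """Hypothetical stand-in; the real DummyNetwork may differ."""

    def forward(self, x):
        # Element-wise sigmoid; `-x` only works for numbers and arrays.
        return 1.0 / (1.0 + np.exp(-x))


toy_model = TinySigmoidNetwork()
toy_minibatch = {"data": np.array([2, 4, 6])}

try:
    toy_model.forward(toy_minibatch)  # passes the whole dict
except TypeError as error:
    error_message = str(error)
    print(error_message)

toy_output = toy_model.forward(toy_minibatch["data"])  # passes the array
print(toy_output.shape)  # (3,)
```

If the same holds for `DummyNetwork`, passing `minibatch["data"]` instead of `minibatch` would be one way to avoid the error.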