# Parallelized Batch Data Loading

In this notebook we practice using a PyTorch API that prepares _batch data_ for training machine learning algorithms.

## Goals
* Learn what `DataLoader` is and how to create one
* Learn how to stream batch data using `DataLoader`

First, we download a Python package for the workshop and make sure it's up to date.

In [1]:
![ -d kmi ] || git clone https://github.com/drinkingkazu/kmi
! cd kmi && git pull

Already up to date.


... followed by importing some libraries

In [2]:
import matplotlib                 # for plotting in 2D
import matplotlib.pyplot as plt   # pyplot interface, used in the exercise below
%matplotlib inline
import torch                      # only to set the random seed here
import numpy as np                # only to set the random seed here
# Set random seed (used by underlying python packages)
SEED = 123
np.random.seed(SEED)
torch.manual_seed(SEED)
# Add the path for the python package downloaded
import sys
sys.path.insert(0, 'kmi')

## `DataLoader`
So what is it? When we train a machine learning algorithm using stochastic gradient descent, we want to form a random subset of data, called a _batch_, to compute the error and weight updates. Recall how our `Dataset` works: it tells you the total data size and gives intuitive access to each entry by an integer index. A PyTorch `DataLoader` takes a `Dataset` instance with those features and automates the packaging of batch data.

We also want this process to be fast, so that as much compute time as possible goes to the algorithm's forward pass and gradient calculations. To achieve this, `DataLoader` can run multiple parallel workers that prepare batches concurrently.
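To make this concrete, here is a minimal sketch that uses the plain PyTorch API rather than the workshop's `kmi` package. `ToyDataset` is a hypothetical stand-in, not the workshop `Dataset`: it only provides a length and index-based access, which is all `DataLoader` needs.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in for the workshop Dataset: 25 entries, each a dict."""

    def __len__(self):
        return 25

    def __getitem__(self, idx):
        # Each entry is a fake 4x4 "image" filled with its own index
        return dict(data=torch.full((4, 4), float(idx)), index=idx)


toy_loader = DataLoader(
    ToyDataset(),
    batch_size=10,
    shuffle=False,   # sequential access
    num_workers=0,   # 0 = load in the main process; N > 0 starts N worker processes
)

batches = list(toy_loader)
for batch in batches:
    print(batch['index'].tolist(), tuple(batch['data'].shape))
```

With 25 entries and a batch size of 10, the last batch holds only the 5 remaining entries. `num_workers=0` is used here only to keep the sketch portable; raising it is what enables the parallel batch preparation described above.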

### Instantiation of `DataLoader`
Instantiation is trivial through the PyTorch API, but here we will use another factory function from the `kmi` package. The function takes configuration parameters in the form of a Python dictionary, just like `Dataset`. In fact, you give both the `Dataset` and `DataLoader` configurations in a single dictionary.

In [3]:
from kmi.iotools import loader_factory

# Let's start with Dataset configuration
dataset_config = dict(data_files = ['/scratch/workshop_2020_01/train/proton.h5','/scratch/workshop_2020_01/train/muon.h5'],
                      dataset = 'DenseImage2D',
                     )

# Next, DataLoader configuration
loader_config = dict( batch_size = 10,
                      collate = 'DenseCollate',
                      shuffle = False,
                      num_workers = 4,
                    )

# Combine two configurations (or you could make one dict to begin with)
loader_config.update(dataset_config)

# Create a loader
loader = loader_factory(**loader_config)

So we introduced the following configuration parameters:
* `batch_size` ... the number of entries in each batch
* `collate` ... a function specifier (via string) that defines how multiple data entries should be combined
* `shuffle` ... `True` means a batch consists of randomly chosen entries, while `False` means sequential data access
* `num_workers` ... an integer specifying the number of parallel workers that form batch data
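As a rough illustration of what `batch_size` and `shuffle` control, the sketch below builds the index batches for one pass over a small dataset using only the standard library. `batch_indices` is a hypothetical helper written for this note, not part of PyTorch or `kmi`; it mimics the usual behavior in which shuffling draws a random order once per pass and every index is used exactly once.

```python
import random


def batch_indices(num_entries, batch_size, shuffle, seed=123):
    """Return the list of index batches for one pass over the data."""
    indices = list(range(num_entries))
    if shuffle:
        # Random order, but each index still appears exactly once per pass
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_entries, batch_size)]


sequential = batch_indices(20, batch_size=5, shuffle=False)
shuffled = batch_indices(20, batch_size=5, shuffle=True)
print(sequential[0])  # [0, 1, 2, 3, 4]
print(shuffled[0])    # five indices in random order
```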

A bit more about `collate`: recall that our `Dataset` returns a dict of data elements for the entry X that you (the user) specify. When we prepare batch data, we don't want an array of dictionaries but rather a dictionary of arrays. This restructuring is implemented by the `DenseCollate` and `SparseCollate` functions in this workshop, corresponding to `DenseImage2D` and `SparseImage` respectively.
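The idea of turning a list of dictionaries into a dictionary of arrays can be sketched in a few lines of NumPy. This is not the `kmi` implementation of `DenseCollate`; `dense_collate_sketch` and the fake entries below are assumptions made purely for illustration.

```python
import numpy as np


def dense_collate_sketch(samples):
    """Turn a list of per-entry dicts into one dict of stacked arrays."""
    return {key: np.stack([sample[key] for sample in samples])
            for key in samples[0]}


# Three fake entries, each shaped like what a Dataset might return
samples = [dict(data=np.zeros((192, 192), dtype=np.float32), label=i % 2, index=i)
           for i in range(3)]

batch = dense_collate_sketch(samples)
print(batch['data'].shape)   # (3, 192, 192)
print(batch['index'])        # [0 1 2]
```

Stacking adds a new leading axis, which is why a batch of images has shape `(batch_size, height, width)`.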

## Data streaming with `DataLoader`
So let's play with it! First of all, it has the concept of "length".

In [4]:
print('length of DataLoader:',len(loader))
print('By the way, batch size * length =', 10 * len(loader))

length of DataLoader: 20000
By the way, batch size * length = 200000
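The two numbers above are related by the batch size: `len(loader)` counts batches, not entries. Below is a minimal sketch of that arithmetic, assuming the PyTorch default of keeping a final incomplete batch (`drop_last=False`); whether `loader_factory` changes this setting is not shown in this notebook, and `num_batches` is a hypothetical helper name.

```python
import math


def num_batches(num_entries, batch_size, drop_last=False):
    """Number of batches in one pass over num_entries entries."""
    if drop_last:
        return num_entries // batch_size        # incomplete last batch is discarded
    return math.ceil(num_entries / batch_size)  # incomplete last batch is kept


print(num_batches(200_000, 10))                  # 20000
print(num_batches(200_005, 10))                  # 20001
print(num_batches(200_005, 10, drop_last=True))  # 20000
```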


We know that, using 2 input files, the input data contains 200,000 entries in total. This coincides with the length of the `DataLoader` instance multiplied by the batch size, since the length counts batches. **Yep, as you guessed**, `DataLoader` is iterable:

In [5]:
# Create an iterator for playing in this notebook
from itertools import cycle
batch_iter = cycle(loader)  # avoid the name `iter`, which would shadow the Python built-in

for i in range(10):
    data = next(batch_iter)
    print('Iteration',i,'batch data "index":',data['index'])

Iteration 0 batch data "index": [0 1 2 3 4 5 6 7 8 9]
Iteration 1 batch data "index": [10 11 12 13 14 15 16 17 18 19]
Iteration 2 batch data "index": [20 21 22 23 24 25 26 27 28 29]
Iteration 3 batch data "index": [30 31 32 33 34 35 36 37 38 39]
Iteration 4 batch data "index": [40 41 42 43 44 45 46 47 48 49]
Iteration 5 batch data "index": [50 51 52 53 54 55 56 57 58 59]
Iteration 6 batch data "index": [60 61 62 63 64 65 66 67 68 69]
Iteration 7 batch data "index": [70 71 72 73 74 75 76 77 78 79]
Iteration 8 batch data "index": [80 81 82 83 84 85 86 87 88 89]
Iteration 9 batch data "index": [90 91 92 93 94 95 96 97 98 99]
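One caveat about `itertools.cycle`: it keeps a copy of every item from the first pass and replays those copies afterwards. That is harmless for the 10 iterations above, but over a full pass it would hold every batch in memory and would never reshuffle. A possible alternative, sketched here with a hypothetical `endless` generator and a toy stand-in for the loader, restarts the iterable each time it runs out:

```python
def endless(loader):
    """Yield batches forever, starting a fresh pass each time the loader runs out."""
    while True:
        for batch in loader:
            yield batch


class CountingLoader:
    """Hypothetical stand-in for a DataLoader that counts how often it is restarted."""

    def __init__(self):
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        yield [0, 1]
        yield [2, 3]


toy = CountingLoader()
stream = endless(toy)
first_five = [next(stream) for _ in range(5)]
print(first_five)   # [[0, 1], [2, 3], [0, 1], [2, 3], [0, 1]]
print(toy.passes)   # 3
```

Because each pass calls the loader's own iteration again, a real `DataLoader` with `shuffle=True` would draw a new random order on every pass.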


... and this is what "data" looks like:

In [6]:
print('Shape of an image batch data',data['data'].shape)

Shape of an image batch data (10, 192, 192)


which is, quite naturally, 10 images of 192x192 pixels.

## Exercise
Let's end this notebook by:
* plotting a histogram of the pixel values in the whole batch
* visualizing each individual image in a batch

In [None]:
# Plot a histogram of pixel values
fig,ax = plt.subplots(figsize=(12,8),facecolor='w')
YOUR_CODE
plt.show()

In [None]:
# Loop over images in a batch and visualize using matplotlib
for i in range(len(data['data'])):
    image = YOUR_CODE   # named `image` so the batch dict `data` is not overwritten
    label = YOUR_CODE
    pdg   = YOUR_CODE
    index = YOUR_CODE
    # report + visualize
    print('Data at index', index, 'PDG', pdg, 'label', label)
    fig,ax = plt.subplots(figsize=(12,8),facecolor='w')
    plt.imshow(image,origin='lower')
    plt.show()