# Parallelized Batch Data Loading

In this notebook we practice using a PyTorch API to prepare _batch data_ for training a machine learning algorithm.

## Goals
* Learn what `DataLoader` is and how to create one
* Learn how to stream batch data using `DataLoader`

First, we download a Python package for the workshop and make sure it's up to date.

In [1]:
![ -d kmi ] || git clone https://github.com/drinkingkazu/kmi
! cd kmi && git pull

Already up to date.


... followed by importing some libraries

In [2]:
import matplotlib   # for plotting in 2D
%matplotlib inline  
import torch, numpy # imported only to set the random seeds (both are used by the underlying Python packages)
import numpy as np  # short alias for the same numpy module
# Set random seed
SEED=123
_=numpy.random.seed(SEED)
_=torch.manual_seed(SEED)
# Add the path for the python package downloaded
import sys
sys.path.insert(0, 'kmi')

## `DataLoader`
So what is it? When we train a machine learning algorithm with stochastic gradient descent, we want to form a random subset of the data, called a _batch_, to compute the error and the weight updates. Recall how our `Dataset` works: it reports the total data size and gives intuitive access to each data sample by an integer index. A PyTorch `DataLoader` takes a `Dataset` instance with those features and automates the packaging of batch data.

We also want this process to be fast, so that as much of the compute time as possible goes to the algorithm's forward and gradient calculations. To achieve this, `DataLoader` implements a scheme in which multiple parallel workers prepare several batches at the same time.
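
To make concrete what a `DataLoader` automates, here is a minimal pure-Python sketch of batching by hand. `ToyDataset` and `iterate_batches` are hypothetical names invented for illustration; they are not part of PyTorch or `kmi`. The sketch covers only the shuffling and grouping of indices, not the parallel workers.

```python
import random

class ToyDataset:
    """Hypothetical stand-in for a Dataset: knows its size, returns one sample per index."""
    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        # One sample is a (data, label) pair; the values here are arbitrary toy numbers
        return index * index, index % 2

def iterate_batches(dataset, batch_size, shuffle=True, seed=123):
    """Yield (data, labels) lists, one pair per batch."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Randomize the sample order so that each batch is a random subset
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        samples = [dataset[i] for i in chunk]
        data = [sample[0] for sample in samples]
        labels = [sample[1] for sample in samples]
        yield data, labels

toy = ToyDataset(10)
batches = list(iterate_batches(toy, batch_size=4))
print(len(batches))                          # number of batches
print([len(data) for data, _ in batches])    # size of each batch
```

With 10 samples and a batch size of 4, this yields three batches of sizes 4, 4 and 2; the last batch is smaller because the dataset size is not a multiple of the batch size.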

### Instantiation of `DataLoader`
Instantiating one directly through the PyTorch API is straightforward, but here we will use another factory function from the `kmi` package. The function takes configuration parameters in the form of a Python dictionary, just like `Dataset`. In fact, you give both the `Dataset` and `DataLoader` configurations in a single dictionary.

In [8]:
from kmi.iotools import loader_factory

BATCH_SIZE=100

# Let's start with Dataset configuration
dataset_config = dict(data_files = ['/scratch/workshop_2020_01/train/proton.h5'],
                      dataset = 'DenseImage2D',
                      angles  = True
                     )

# Next, DataLoader configuration
loader_config = dict( batch_size = BATCH_SIZE,
                      collate = 'DenseCollate',
                      shuffle = False,
                      num_workers = 4,
                    )

# Combine two configurations (or you could make one dict to begin with)
loader_config.update(dataset_config)

# Create a loader
loader = loader_factory(**loader_config)
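
For comparison, direct instantiation through the PyTorch API might look like the sketch below. It assumes a toy in-memory dataset: `ToyImageDataset` and `toy_loader` are hypothetical names for illustration, not part of `kmi`. It also relies on PyTorch's default collate function rather than the `DenseCollate` used above.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyImageDataset(Dataset):
    """Hypothetical dataset of random 8x8 'images' with binary labels."""
    def __init__(self, size=10, seed=123):
        generator = torch.Generator().manual_seed(seed)
        self.images = torch.rand(size, 8, 8, generator=generator)
        self.labels = torch.arange(size) % 2

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

# num_workers=0 loads batches in the main process;
# a positive value starts that many parallel worker processes instead.
toy_loader = DataLoader(ToyImageDataset(),
                        batch_size=4,
                        shuffle=False,
                        num_workers=0)

# Stream the batches: the default collate function stacks samples along a new first axis
for images, labels in toy_loader:
    print(images.shape, labels.shape)
```

The keyword arguments `batch_size`, `shuffle` and `num_workers` have the same names as the entries in `loader_config` above.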

In [None]:
import tables
OUT_FNAME='aho.h5'

FILTERS = tables.Filters(complib='zlib', complevel=5)
fout = tables.open_file(OUT_FNAME,mode='w', filters=FILTERS)
data_storage = fout.create_carray(fout.root,'data',tables.Float32Atom(),shape=[100000,192,192])
label_storage = fout.create_carray(fout.root,'pdg',tables.Int32Atom(),shape=[100000])

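for i in range(100000 // BATCH_SIZE):  # integer division: range() requires an int
    pass  # TODO: per-batch processing is not filled in yet

# NOTE: `label` is not defined in this cell; it must be set before this line runs
label_storage[:] = label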
