# DLKit Generators

DLKit is a lightweight framework for managing a large number of Keras models. It provides a wrapper for models, tools to efficiently represent and read data, and analysis functions. This notebook gives an overview of how to use the DLKit generators to rapidly read and process data using a large number of processes/threads.




## DLGenerator

For most HEP applications, the data is far too big to keep in memory, so during fitting it must be loaded on the fly. Keras uses Python generators to read data as it trains. Since reading data can take some time, training in this way usually takes significantly longer than loading the data into memory first. To accelerate reading, Keras lets you read the data using multiple parallel generators, but unfortunately its implementation has several issues that make it inefficient. `DLKit` therefore provides generators that not only run much faster but also make it easy to read data.
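To make the idea of reading in the background concrete, here is a minimal sketch of a prefetching generator using only the standard library. It is not DLKit's or Keras's implementation: `slow_batches` and `prefetch` are hypothetical names, and the "file read" is simulated with a short sleep.

```python
import threading
import time
import queue  # this module is named Queue on Python 2


def slow_batches(n_batches, batch_size):
    """Hypothetical stand-in for a slow file reader."""
    for i in range(n_batches):
        time.sleep(0.01)  # pretend to wait on disk
        yield [i] * batch_size


def prefetch(generator, max_queued=4):
    """Run `generator` in a background thread and yield its items in order."""
    q = queue.Queue(maxsize=max_queued)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for item in generator:
            q.put(item)  # blocks while the queue is full
        q.put(done)

    t = threading.Thread(target=worker)
    t.daemon = True  # do not keep the process alive if the consumer stops early
    t.start()

    while True:
        item = q.get()
        if item is done:
            break
        yield item


batches = list(prefetch(slow_batches(n_batches=5, batch_size=8)))
print(len(batches))
```

The bounded queue keeps at most `max_queued` batches in memory, so the reader can run ahead of training without loading the whole dataset.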

Let's try to read the LCD data using a `DLGenerator`. `DLKit` provides a generator that can read files from various directories and correctly mix examples in the training data. But we can do the mixing beforehand to save some time during training. A "premixed" file is available at `/data/LCD/LCD-Merged-All.h5`.

First, let's open the file by hand to see what is inside:

In [1]:
import h5py
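f = h5py.File('/data/LCD/LCD-Merged-All.h5', 'r')  # open read-only

for k in f.keys():
    try:
        print("{} {}".format(k, f[k].shape))
    except AttributeError:  # e.g. a group, which has no shape
        print("{} Not a tensor".format(k))
f.close()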

ECAL (3211264, 25, 25, 25)
HCAL (3211264, 5, 5, 60)
OneHot (3211264, 4)
index (3211264,)
target (3211264, 1, 5)


We see 3,211,264 events, which would require hundreds of gigabytes of RAM to load into memory. ECAL and HCAL are as described above. "OneHot" and "index" encode the true class of each example. "target" holds the energy of the particle.
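The memory estimate can be checked with a little arithmetic on the shapes printed above. The sketch below assumes every value is stored as a 4-byte float; the file's actual dtypes are not shown in this notebook, and 8-byte values would double the total.

```python
n_events = 3211264

# Number of values stored per event in each dataset, from the shapes above.
values_per_event = {
    "ECAL": 25 * 25 * 25,
    "HCAL": 5 * 5 * 60,
    "OneHot": 4,
    "index": 1,
    "target": 1 * 5,
}

bytes_per_value = 4  # assumption: float32; the real dtypes may differ

total_values = n_events * sum(values_per_event.values())
total_bytes = total_values * bytes_per_value

print("%.1f GB" % (total_bytes / 1e9))
```

Under this assumption the ECAL dataset alone accounts for over 90% of the total, since it holds 15,625 of the 17,135 values per event.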

Now we can build a DLGenerator to read this file on the fly:

In [2]:
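from DLTools.ThreadedGenerator import DLh5FileGenerator

# Build a function that divides the i-th dataset in place by Norms[i].
def ConstantNormalization(Norms):
    def NormalizationFunction(Ds):
        out = []
        for i,Norm in enumerate(Norms):
            Ds[i]/=Norm
            out.append(Ds[i])
        return out
    return NormalizationFunction

# Build a function that regroups [ECAL, HCAL, OneHot] into ([ECAL, HCAL], OneHot),
# the (inputs, targets) layout Keras expects for a two-input model.
def MergeInputs():
    def f(X):
        return [X[0],X[1]],X[2]
    return f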

def MakePreMixGenerator(InputFile,BatchSize,Norms=(150.,1.), Max=3e6,Skip=0,
                        ECAL=True, HCAL=True, Energy=False, **kwargs):
    datasets=[]

    if ECAL:
        datasets.append("ECAL")
    if HCAL:
        datasets.append("HCAL")

    datasets.append("OneHot")

    if Energy:
        datasets.append("target")
    
    if ECAL and HCAL:
        post_f=MergeInputs()
    else:
        post_f=False
        
    pre_f=ConstantNormalization(Norms)
    
    G=DLh5FileGenerator(files=[InputFile], datasets=datasets,
                        batchsize=BatchSize,
                        max=Max, skip=Skip, 
                        postprocessfunction=post_f,
                        preprocessfunction=pre_f,
                        **kwargs)
    
    return G

MyGen=MakePreMixGenerator("/data/LCD/LCD-Merged-All.h5",1024,[150.,150.,1.])

Using Theano backend.


Couldn't import dot_parser, loading of dot files will not be possible.


`DLh5FileGenerator` takes a list of files and the keys of the objects to read, and delivers `BatchSize` examples per request. Note that we not only read the data, but also use a `preprocessfunction` to normalize it and a `postprocessfunction` to format the output as Keras needs to train an ECAL and HCAL model simultaneously.
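To see what the two hooks do to a batch, the sketch below applies them by hand to a small synthetic batch. The two functions are copied from the cell above. The synthetic arrays and the order of application (normalize first, then regroup) are assumptions based on the parameter names, not taken from DLKit's source.

```python
import numpy as np


def ConstantNormalization(Norms):
    def NormalizationFunction(Ds):
        out = []
        for i, Norm in enumerate(Norms):
            Ds[i] /= Norm  # in-place division, so the arrays must be floats
            out.append(Ds[i])
        return out
    return NormalizationFunction


def MergeInputs():
    def f(X):
        return [X[0], X[1]], X[2]
    return f


# Synthetic batch of 2 events in the order [ECAL, HCAL, OneHot].
batch = [
    np.full((2, 25, 25, 25), 150.0, dtype="float32"),
    np.full((2, 5, 5, 60), 300.0, dtype="float32"),
    np.array([[1, 0, 0, 0], [0, 0, 1, 0]], dtype="float32"),
]

pre_f = ConstantNormalization([150., 150., 1.])
post_f = MergeInputs()

inputs, labels = post_f(pre_f(batch))

print(inputs[0].shape)
print(inputs[1].shape)
print(labels.shape)
```

After both steps, `inputs` is a two-element list holding the scaled ECAL and HCAL arrays, and `labels` is the unchanged one-hot array, which matches the structure printed by the generator in this notebook.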

Let's get some events:

In [3]:
TheGen=MyGen.Generator()
Data=next(TheGen)

print(Data[0][0].shape)
print(Data[0][1].shape)
print(Data[1].shape)


(1024, 25, 25, 25)
(1024, 5, 5, 60)
(1024, 4)


## Mixing Generator

DLKit's mixing generator takes data separated into files, mixes the examples appropriately, and labels them for classification tasks.
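The general idea of mixing and labeling can be illustrated in memory with NumPy. This is a minimal sketch under stated assumptions, not DLKit's mixing generator: `mix_classes` is a hypothetical helper, each input array stands in for the contents of one per-class file, and everything is assumed to fit in memory.

```python
import numpy as np


def mix_classes(class_arrays, seed=0):
    """Concatenate per-class arrays, attach one-hot labels, and shuffle."""
    n_classes = len(class_arrays)
    features, labels = [], []
    for class_index, X in enumerate(class_arrays):
        one_hot = np.zeros((len(X), n_classes), dtype="float32")
        one_hot[:, class_index] = 1.0  # label every example with its source class
        features.append(X)
        labels.append(one_hot)

    X_all = np.concatenate(features)
    y_all = np.concatenate(labels)

    # Apply the same random permutation to features and labels.
    order = np.random.RandomState(seed).permutation(len(X_all))
    return X_all[order], y_all[order]


# Two hypothetical classes with 3 and 5 examples of 2 features each.
class_a = np.zeros((3, 2), dtype="float32")
class_b = np.ones((5, 2), dtype="float32")

X, y = mix_classes([class_a, class_b])
print(X.shape)
print(y.shape)
```

Shuffling features and labels with one shared permutation keeps each example paired with its label, which is the property a mixing step has to preserve.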