
A concrete example for using data generator for large datasets such as ImageNet #1627

Closed
parag2489 opened this issue Feb 3, 2016 · 28 comments


@parag2489
Contributor

I am already aware of some discussions on how to use Keras for very large datasets (>1,000,000 images) such as this and this. However, for my scenario, I can't figure out the appropriate way to use the ImageDataGenerator or write my own dataGenerator.

Specifically, I have the following four questions:

  1. From this link: when we do datagen.fit(X_sample), do we assume that X_sample is a big enough chunk of data to compute the mean on and to perform feature centering/normalization and whitening?
  2. From the same link: X_sample obviously cannot be the entire data, so will the augmentation (i.e. flipping, width/height shifts) happen only on partial data? For example, say X_sample = 10,000 out of 1,000,000 total pictures. After augmentation, suppose we get 2 * 10,000 more pictures. Note that we are not running datagen.fit() again, so will our augmented data contain only 1,000,000 + 2 * 10,000 samples? How do we augment the entire data set (i.e. get 1,000,000 + 2 * 1,000,000 samples)?
  3. This is about fetching data from a manually written generator. Since the data is so large, it won't fit into one big HDF5 file, so we split it into 8 files. Now, the data generator has to run in an infinite loop. This and this answer mention data generators, but a concrete example would be more helpful.

My approach for building a data generator (for a very large dataset) which loops indefinitely is as follows (and it fails):

def myGenerator():  # this will give chunks of 10K pictures; 100 such chunks form the entire dataset
    fileIndex = 0
    while 1:
        # the following loads data from the HDF5 file numbered fileIndex
        (X_train, y_train) = LOAD_HDF5_OF_10K_SAMPLES(fileIndex)
        fileIndex = fileIndex + 1
        if fileIndex == numOfHDF_files:
            fileIndex = 0  # so that fileIndex wraps back and the loop goes on indefinitely

The above code doesn't work, in the sense that once execution enters the above function from fit_generator(), it just stays in the while 1 loop forever. A detailed example would help a lot.

  4. Also, if we are to use ImageDataGenerator as in this link (which is preferable to writing our own), should we put (X_train, y_train), (X_test, y_test) = LOAD_10K_SAMPLES_OF_BIG_DATA() in a for loop and call datagen.fit(X_train) and model.fit_generator(datagen.flow(...)) inside that loop?
@wongjingping

Hi there, my answers to your questions below:

  1. Yes, that assumption is implicit.
  2. datagen.fit() doesn't continuously generate batches of examples like a generator; it does some one-off computations (mean, std, etc.) on the X matrix you supply. To continuously generate data with random on-the-fly augmentations, you need to pass the output of the ImageDataGenerator's .flow() method to the fit_generator function. That way you will be generating augmented batches forever :) (a minimal sketch follows this list)
  3. How big is your data? HDF5 has no size limit afaik, so you don't need to worry about storing your data in one huge HDF5 file. Your only issue would be loading it all into memory at one go, which I would advise against, since that is what batch generators are for: avoiding a huge memory hog. Your example looks alright (looping forever is what you want), though you might want to add some lines for augmenting the data you just loaded.
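
For reference, a minimal sketch of the .fit() / .flow() / fit_generator() wiring described in point 2, assuming X_sample, X_train, y_train and the model are already defined (the augmentation options here are just examples):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# one-off statistics (mean, std) computed from the representative sample
datagen.fit(X_sample)

# .flow() yields augmented batches forever; samples_per_epoch tells
# fit_generator when one epoch is over
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
                    samples_per_epoch=X_train.shape[0],
                    nb_epoch=10)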

@parag2489
Contributor Author

I have some follow-up questions, @wongjingping:

Response to 3: I should have been clearer. I am not concerned about one "big" HDF5 file; the question is that the entire data can't be loaded at once, as you say. You say my example looks fine, but I think it's wrong. I illustrate that with the following data generator snippet (let's leave data augmentation for later):

def myGenerator():
    #loading data
    (X_train, y_train), (X_test, y_test) = mnist.load_data()

    #some preprocessing
    y_train = np_utils.to_categorical(y_train,10)
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    while 1:
        for i in range(1875):
            if i%125==0:
                print "i = " + str(i)
            yield X_train[i*32:(i+1)*32], y_train[i*32:(i+1)*32]

The above function is called by fit_generator(). Now, it should print up to i = 1750 and then train the model. However, it just keeps printing i = 0 to i = 1750 and then starts again from i = 0 without training the model.

If I comment out the while 1 line, it runs perfectly, but then it violates the infinite-loop assumption, doesn't it? Can you clear up my confusion by providing a concrete example or by explaining in terms of this one?

If you want a self-contained code snippet, it is as follows. You can just run it.

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.convolutional import Convolution1D, Convolution2D, MaxPooling2D
from keras.utils import np_utils


def myGenerator():
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    y_train = np_utils.to_categorical(y_train,10)
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    while 1:
        for i in range(1875): # 1875 * 32 = 60000 -> # of training samples
            if i%125==0:
                print "i = " + str(i)
            yield X_train[i*32:(i+1)*32], y_train[i*32:(i+1)*32]

batch_size = 128
nb_classes = 10
nb_epoch = 12

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3
model = Sequential()

model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='valid',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta')

model.fit_generator(myGenerator(), samples_per_epoch = 60000, nb_epoch = 2, verbose=2, show_accuracy=True, callbacks=[], validation_data=None, class_weight=None, nb_worker=1)

@wongjingping

Hi there,

I'm afraid I'm having some problems running your code with the debugger, but from what I can see I think you need to assign the generator instance to a new variable before passing it to the fit_generator() method as below:

my_generator = myGenerator()
model.fit_generator(my_generator, samples_per_epoch = 60000, nb_epoch = 2, verbose=2, show_accuracy=True, callbacks=[], validation_data=None, class_weight=None, nb_worker=1)

Let me know if it doesn't work! (I'm not too sure myself)

@parag2489
Contributor Author

I can run it in the PyCharm debugger after putting a breakpoint inside the generator function. However, I don't think running with a debugger is a good idea in the case of generators; that's why I am printing the value of i. With the code I had, you should see i going from 0 to 1750, then immediately wrapping back and printing 0 to 1750 again, repeating indefinitely. Ideally, it should print up to 1750, train, then again print up to 1750, and repeat this NUMBER_OF_EPOCHS times.

Anyway, with your suggestion the same thing happens: it just keeps repeating indefinitely. However, if I remove while 1, it goes up to 1750, trains, goes back to 0 and up to 1750 again, trains, and then terminates (as desired).

@wongjingping

My apologies, I'm having some difficulty with the ipdb debugger in Spyder and resorted to another workaround.

Your model is actually training. You can add this snippet of code to verify (print out) the progress of your model using a callback:

from keras.callbacks import Callback

class printbatch(Callback):
    def on_batch_end(self, batch, logs={}):
        print(logs)

# ...

pb = printbatch()
# modify the fit_generator call to include the callback pb
model.fit_generator(myGenerator(), samples_per_epoch=60000, nb_epoch=2,
                    verbose=2, show_accuracy=True, callbacks=[pb],
                    validation_data=None, class_weight=None, nb_worker=1)

By i ~ 500 you should observe that the training accuracy printed out is at least 90%, with 100% accuracy appearing more frequently.
Without the callback, it is as you observed, i goes from 0 to 1750 and wraps back for the next epoch.
Hope this clarifies your doubts :)

@parag2489
Contributor Author

@wongjingping Thanks, it works. Just one small doubt (or rather, an observation): when the logs are printed, the numbers shown after "Batch: " are actually sample numbers. So with 60000 samples per epoch, the logs printed through the callback showed "Batch: 59999".

Anyway, I am closing this issue now. I still haven't had success running the data generators with more than one worker, but I have asked another question for that in issue #1638. It would be great if you could take a look when time permits.

@wongjingping

Hi @parag2489,

I suspect there is a bug in how fit_generator determines the batch_size and have raised it as a separate issue, #1639. Feel free to chip in!

@sunshineatnoon

@wongjingping @parag2489 Hi~ May I ask how to specify the batch size when writing my own data generator? fit_generator doesn't have a batch_size parameter, and in my generator I only yield one sample at a time.

@wongjingping

@sunshineatnoon You can pass the batch_size as an argument to the generator:


def my_generator(X, y, batch_size, epoch_size):
    # yields batches of batch_size examples from X, y indefinitely
    i = 0
    while i < epoch_size:
        # add in image reading/augmenting code here
        yield X[i:i + batch_size, ...], y[i:i + batch_size, ...]
        if i + batch_size > epoch_size:
            i = 0
        else:
            i += batch_size

You might want to check out this introduction to generators: https://wiki.python.org/moin/Generators

@sunshineatnoon

@wongjingping Thanks! I will look into this. BTW, what does samples_per_epoch mean in fit_generator exactly? Say if I use a batch size of 64. Does this mean a total of 64*samples_per_epoch is seen every epoch?

@wongjingping

@sunshineatnoon sorry for the poor formatting; the samples_per_epoch is the number of examples you expect to see in an epoch, not batch_size * samples_per_epoch :)

@sunshineatnoon

@wongjingping So it means that if I use a batch size of 64, I will have samples_per_epoch / 64 batches per epoch? But when I specify batch_size and generate a batch, my network's training time slows down; it seems as if it trains on more samples per epoch when I increase the batch_size. Here is my generator:

def generate_batch_data(vocPath,imageNameFile,batch_size):
    sample_number = 5000 
    class_num = 20

    while 1:

        for i in range(0,sample_number,batch_size):
            #Read a batch of images from files
            imageList = prepareBatch(i,i+batch_size,imageNameFile,vocPath)
            #process imageList to np arrays images and boxes
            yield np.asarray(images),np.asarray(boxes)

@parag2489
Contributor Author

@sunshineatnoon samples_per_epoch tells fit_generator() when to stop requesting samples from the data generator for the current epoch. This is necessary since the data generator loops infinitely and has to be stopped somewhere. In other words, samples_per_epoch = batch_size * number_of_batches_per_epoch.

Regarding why your training slows down, it's best to profile your code; there is a feature in Theano for that (I think mode=Profile). You can increase speed if you call prepareBatch() for a large number of samples at once (large meaning they fit in your CPU RAM, even if not on the GPU). Also, convert images and boxes to numpy arrays only once, then just yield batches of 32. In short, prepareBatch and the two np.asarray calls should go outside the for loop, as sketched below.
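
A rough sketch of that restructuring, assuming the 5000 samples fit in CPU RAM; the way prepareBatch's output is unpacked into images and boxes below is hypothetical, so adapt it to your actual data structures:

import numpy as np

def generate_batch_data(vocPath, imageNameFile, batch_size):
    sample_number = 5000  # assumed to fit in CPU RAM

    # read, decode and convert everything once, not once per batch
    imageList = prepareBatch(0, sample_number, imageNameFile, vocPath)
    # hypothetical unpacking -- adapt to however imageList is structured
    images = np.asarray([item[0] for item in imageList])
    boxes = np.asarray([item[1] for item in imageList])

    while 1:
        for i in range(0, sample_number, batch_size):
            yield images[i:i + batch_size], boxes[i:i + batch_size]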

@sunshineatnoon

@parag2489 Thanks! It's very nice of you to give such a detailed explanation, I will try to change my code.

@zzqboy

zzqboy commented May 1, 2016

This page helps, thanks! By the way, it seems new versions come out quickly.

@raymondjplante

Just to be clear, can someone confirm:

  • In model.fit() you specify the batch_size so it knows how to break a finite data set (the corresponding x and y numpy arrays) into chunks for gradient calculation. 100% of the data set gets consumed each epoch.
  • With model.fit_generator() the generator you provide should loop infinitely, as samples_per_epoch basically bounds the total number of samples to run through per epoch. The batch_size isn't specified because each tuple returned from the generator is a single batch; you control the batch size via the generator, so if you return one sample per yield, it's like setting a batch size of 1 (see the sketch below).
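
For what it's worth, a minimal side-by-side sketch of that distinction, using the Keras 1.x call signatures from earlier in this thread (names are illustrative):

# finite arrays: Keras slices X_train/y_train into batches of batch_size itself
model.fit(X_train, y_train, batch_size=32, nb_epoch=10)

# infinite generator: every yield is one batch; samples_per_epoch bounds how
# many samples are drawn before the epoch is considered finished
model.fit_generator(my_generator,  # yields (X_batch, y_batch) tuples forever
                    samples_per_epoch=60000,
                    nb_epoch=10)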

Question:
What the heck is the use of max_q_size? If the generator is handling the batching, why do you need another queue?

@parag2489
Contributor Author

@raymondjplante

Q1. Your understanding of model.fit() is correct.
Q2. Correct.

Even I am not sure what max_q_size does. I think this answer has a mention of the queue. So the queue is used to ensure that the generator is consumed in a thread-safe way.

You can also look at #1638 to see how to make the data generator thread-safe; the kind of wrapper discussed there is sketched below.
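
For reference, a rough sketch of that wrapper pattern (written for Python 2, like the rest of this thread):

import threading

class ThreadSafeIter(object):
    """Wraps a generator so that calls to next() are serialized by a lock."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def next(self):  # use __next__ on Python 3
        with self.lock:
            return self.it.next()

def threadsafe_generator(f):
    """Decorator that makes a generator function thread-safe."""
    def g(*args, **kwargs):
        return ThreadSafeIter(f(*args, **kwargs))
    return g

# usage: decorate the data generator before passing it to fit_generator
# @threadsafe_generator
# def myGenerator():
#     ...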

@raymondjplante

@parag2489 Someone on SO provides a good explanation of the purpose of the generator queue: http://stackoverflow.com/questions/36986815/in-keras-model-fit-generator-method-what-is-the-generator-queue-controlled-pa

@yanranwang

yanranwang commented Nov 8, 2016

@wongjingping @parag2489 For the case of multiple inputs, e.g. when we have two pathways in the network, each corresponding to a different input, can we still use a data generator to generate image regions in parallel with the training process?

@tanayz

tanayz commented Dec 4, 2016

The problem I'm facing: Keras' fit_generator is good for processing images whose collective size exceeds RAM, but what if those files are not actually in an image format? For example, I've taken a huge number of images (500k) and run them through a pre-trained Inception v3 model to extract features. Each of those files is now nothing but a (1, 384, 8, 8) array stored as an npy file. Any idea how I can use fit_generator to read them in batches? Collectively they won't fit in my RAM, and the generators apparently don't recognize anything other than image files.

@patyork
Contributor

patyork commented Dec 4, 2016

@tanayz It would be exactly the same as if they were images instead of pickled/numpy data files (a sketch is given after this list):

  • Get a list of all of the files, and pass this list into the generator
  • In the generator:
    • Infinite loop
    • Shuffle the list of files
    • For each slice of the shuffled files, where len(slice) == batch_size
      • Open the files, read them into a single array with shape[0] == batch_size; yield the data
      • Handle the edge case where the number of files is not a multiple of batch_size, so that the generator always yields batch_size examples
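
A minimal sketch of that recipe, assuming each .npy file holds a single (1, 384, 8, 8) feature array; file_paths and labels are illustrative parallel lists:

import random
import numpy as np

def npy_batch_generator(file_paths, labels, batch_size):
    while True:
        # shuffle the file order (and the labels with it) at the start of each pass
        order = list(range(len(file_paths)))
        random.shuffle(order)
        # drop the last, incomplete slice so every yield has exactly batch_size
        # examples (wrap around instead if no example may be skipped)
        for start in range(0, len(order) - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            X = np.concatenate([np.load(file_paths[i]) for i in idx], axis=0)
            y = np.asarray([labels[i] for i in idx])
            yield X, y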

@wongjingping

@seasonwang I'm afraid I haven't tried that out before - sorry for the late reply!

@behnamprime

Is there a way to use train_on_batch with a generator?

@Dref360
Contributor

Dref360 commented Jun 8, 2017

You mean something like this?

for x_batch, y_batch in generator:
    model.train_on_batch(x_batch, y_batch)
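
Since such a generator loops forever, the loop needs its own stopping condition, e.g. a fixed number of updates (a minimal sketch; n_updates is illustrative):

n_updates = 1000  # illustrative
for step, (x_batch, y_batch) in enumerate(generator):
    loss = model.train_on_batch(x_batch, y_batch)
    if step + 1 >= n_updates:
        break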

@sxs4337

sxs4337 commented Jun 14, 2017

Hi,
I have a question about using predict_generator: how do I ensure that prediction is done on all test samples exactly once?

For example-
predictions = model.predict_generator(
    test_generator,
    steps=int(test_generator.samples / float(batch_size)),  # all samples once
    verbose=1,
    workers=2,
    max_q_size=10,
    pickle_safe=True
)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes

So the dimensions of predicted_classes and true_classes are different, since the total number of samples is not divisible by the batch size.

The size of my test set is not constant, so the number of steps in predict_generator changes each time depending on the batch size. I am using flow_from_directory and cannot use predict_on_batch since my data is organized in a directory structure.

One solution is running with a batch size of 1, but that makes it very slow.

I hope my question is clear. Thanks in advance.

@timehaven

The comments and suggestions in this issue and its cousin #1638 were very helpful for me to efficiently process large numbers of images. I wrote it all up in a tutorial fashion that I hope can help others.

https://techblog.appnexus.com/a-keras-multithreaded-dataframe-generator-for-millions-of-image-files-84d3027f6f43

@ghost

ghost commented Feb 21, 2018

Hello,
I am trying to use model.fit_generator with a custom Callback that accesses the validation data. However, whatever I do, the validation data accessed from within the Callback always equates to None.

import numpy as np
from keras.callbacks import Callback
from sklearn.metrics import f1_score, precision_score, recall_score

class RecallMetrics(Callback):
    def on_train_begin(self, logs=None):
        print('RecallMetrics ... validating')
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []


    def on_epoch_end(self, epoch, logs=None):
        x=(self.validation_data[0])
        if x is None :
            print ('Error: validation_data is None')
            return
        else:
            val_predict = (np.asarray(self.model.predict(self.validation_data[0]))).round()
            val_targ = self.validation_data[1]
            _val_f1 = f1_score(val_targ, val_predict)
            _val_recall = recall_score(val_targ, val_predict)
            _val_precision = precision_score(val_targ, val_predict)
            self.val_f1s.append(_val_f1)
            self.val_recalls.append(_val_recall)
            self.val_precisions.append(_val_precision)
            print (" — val_f1: % f — val_precision: % f — val_recall % f" % (_val_f1, _val_precision, _val_recall))
            return

history = model.fit_generator(generator=train_gen,
                                  validation_data=validate_gen,
                                  # validation_data=None,
                                  steps_per_epoch=len(train_file_list),
                                  validation_steps=len(val_file_list) * 3,
                                  verbose=2,
                                  epochs=int(tc.config["LUNA16"]["epochs"]),
                                  callbacks=callbacks,
                                  workers=multiprocessing.cpu_count(),
                                  use_multiprocessing=True)

How can I access validation data from a custom Callback when using fit_generator?

Best,

@lawrencekiba

lawrencekiba commented Aug 26, 2019

You mean something like this?

for x_batch, y_batch in generator:
    model.train_on_batch(x_batch, y_batch)

Hi,

Tried using this but got the following error:

dloss_real = disc.train_on_batch(dataBatch, valid)
  File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
    exception_prefix='input')
  File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 92, in standardize_input_data
    data = [standardize_single_array(x) for x in data]
  File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 92, in <listcomp>
    data = [standardize_single_array(x) for x in data]
  File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 27, in standardize_single_array
    elif x.ndim == 1:
AttributeError: 'tuple' object has no attribute 'ndim'
