A concrete example for using data generator for large datasets such as ImageNet #1627
Hi there, my answers to your questions are below:
I have some follow-up questions @wongjingping. Regarding your response to 3: I should have been clearer. I am not concerned about one "big" HDF5 file; the question is that the entire data can't be loaded, as you say. You say that my example looks fine, but I think it's wrong. I illustrate that with the following snippet of a data generator (let's leave the data augmentation for later):
The above function is called by fit_generator(). If you want a self-contained code snippet, it is as follows; you can just run it.
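(A minimal sketch, with a toy model and random data standing in for the real dataset; the names and shapes are illustrative, not the original code:)

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy data standing in for a dataset that is too large to load at once.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000, 1))

def batch_generator(X, y, batch_size=32):
    # Loops forever, as fit_generator expects; fit_generator itself
    # decides when an epoch ends via samples_per_epoch.
    while 1:
        for i in range(0, len(X), batch_size):
            yield X[i:i + batch_size], y[i:i + batch_size]

model = Sequential()
model.add(Dense(16, input_dim=20, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Keras 1.x-era signature; newer versions use steps_per_epoch/epochs instead.
model.fit_generator(batch_generator(X, y, batch_size=32),
                    samples_per_epoch=len(X), nb_epoch=2)
```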
Hi there, I'm afraid I'm having some problems running your code with the debugger, but from what I can see I think you need to assign the generator instance to a new variable before passing it to the fit_generator call.
Let me know if it doesn't work! (I'm not too sure myself.)
I can run it in the PyCharm debugger after putting a breakpoint inside the generator. Anyway, with your suggestion the same thing happens: it just keeps repeating indefinitely.
My apologies, I'm having some difficulty with the ipdb debugger in Spyder, so I resorted to another workaround. Your model is actually training. You can add this snippet of code to verify (print out) the progress of your model using a callback:
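Something along these lines (a minimal sketch; the class name and format string are illustrative):

```python
from keras.callbacks import Callback

class BatchPrinter(Callback):
    """Print the batch counter and the running training accuracy after every batch."""
    def on_batch_end(self, batch, logs={}):
        # 'acc' is present when the model is compiled to report accuracy.
        print('Batch: %d, acc: %.4f' % (batch, logs.get('acc', 0.)))

# Pass it to fit_generator, e.g.:
# model.fit_generator(my_generator(...), samples_per_epoch=60000, nb_epoch=10,
#                     callbacks=[BatchPrinter()])
```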
After about 500 batches you should observe that the training accuracy printed out is at least 90%, with 100% accuracy appearing more frequently.
@wongjingping Thanks, it works. Just one small doubt (rather, an observation): when the logs are printed, the numbers printed in front of "Batch: " are actually sample numbers. So if there are 60000 samples per epoch, then in the logs printed through the callback we saw "Batch: 59999". Anyway, I am closing this issue now. I still haven't had success in running the data generators with >1 workers, but I have asked another question for that, Issue #1638. You may take a look at it if time permits; that would be great.
Hi @parag2489, I suspect there is a bug in fit_generator's handling of the batch_size; I have raised this under a separate issue, #1639. Feel free to chip in!
@wongjingping @parag2489 Hi~ May I ask you guys how to specify the batch size if I write my own data generator, since fit_generator doesn't take a batch_size argument?
@sunshineatnoon You can pass the batch_size as an argument to the generator and step through your data inside it, starting from i = 0 and looping while i < epoch_size, yielding one batch and advancing i by batch_size each time:
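Fleshed out, a sketch of such a generator (the in-memory X/y and the names are illustrative; substitute your own loading) might be:

```python
def my_generator(X, y, batch_size=64):
    epoch_size = len(X)
    while 1:                      # loop forever; fit_generator decides when an epoch ends
        i = 0
        while i < epoch_size:
            yield X[i:i + batch_size], y[i:i + batch_size]
            i += batch_size

# The batch size is fixed when you instantiate the generator:
# model.fit_generator(my_generator(X_train, y_train, batch_size=64),
#                     samples_per_epoch=len(X_train), nb_epoch=10)
```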
@wongjingping Thanks! I will look into this. BTW, what does samples_per_epoch mean?
@sunshineatnoon Sorry for the poor formatting; samples_per_epoch is the number of examples you expect to see in an epoch, not batch_size * samples_per_epoch :)
@wongjingping So it means that if I use a batch size of 64, the model will make samples_per_epoch / 64 weight updates per epoch?
@sunshineatnoon Regarding why your training slows down, it's best to profile your code. There is a feature in Theano for that (I think setting profile=True via THEANO_FLAGS, but check the Theano documentation for your version).
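A small sketch of one way to turn that on, assuming the Theano backend (the flag has to be set before Theano is imported):

```python
import os

# Equivalent to running with THEANO_FLAGS=profile=True on the command line;
# must be set before theano (and therefore keras) is imported.
os.environ['THEANO_FLAGS'] = 'profile=True'

import keras  # imported after the flag so compiled functions get profiled
```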
@parag2489 Thanks! It's very nice of you to give such a detailed explanation; I will try to change my code.
This page helps, thanks! BTW, it seems that new versions come quickly.
Just to be clear, can someone confirm the following:
Question:
@parag2489 Someone on SO provides a good explanation of the purpose of the generator queue: http://stackoverflow.com/questions/36986815/in-keras-model-fit-generator-method-what-is-the-generator-queue-controlled-pa
@wongjingping @parag2489 For the case of multiple inputs, e.g. a network with two pathways, each corresponding to a different input, can we still use a data generator to generate image regions in parallel with the training process?
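For what it's worth, a Keras generator can yield a list of input arrays, one per pathway, so a sketch along these lines should work with a multi-input model (the arrays and names here are illustrative):

```python
def two_input_generator(X1, X2, y, batch_size=32):
    """Yield ([input_a, input_b], targets); each pathway of the model
    receives its own array. X1, X2 and y are illustrative in-memory
    arrays; swap in your own region extraction or loading."""
    while 1:
        for i in range(0, len(y), batch_size):
            yield [X1[i:i + batch_size], X2[i:i + batch_size]], y[i:i + batch_size]
```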
The problem I'm facing is that keras fit_generator is good for processing images whose collective size is larger than RAM, but what if those files are actually not in image format? For example, I've taken a huge number of images (500k) and run them through a pre-trained Inception v3 model to get features out of them. Now each of those files is nothing but a (1, 384, 8, 8) array stored as an .npy file. Any idea how I can use fit_generator to read them in batches? Collectively they won't fit in my RAM, and the built-in generators apparently don't recognize anything other than image files.
@tanayz It would be exactly the same as if they were images instead of pickled/numpy data files:
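For instance, a generator over the saved .npy feature files could look like this (a sketch; file_list and labels are assumed to be a list of paths plus a matching label array, not something Keras builds for you):

```python
import numpy as np

def npy_generator(file_list, labels, batch_size=32):
    """Yield batches of pre-extracted (1, 384, 8, 8) features stored as .npy files."""
    while 1:
        for start in range(0, len(file_list), batch_size):
            batch_paths = file_list[start:start + batch_size]
            # Each file holds a (1, 384, 8, 8) array; stacking gives (b, 384, 8, 8).
            X = np.concatenate([np.load(p) for p in batch_paths], axis=0)
            y = labels[start:start + batch_size]
            yield X, y

# model.fit_generator(npy_generator(train_files, train_labels, 32),
#                     samples_per_epoch=len(train_files), nb_epoch=10)
```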
@seasonwang I'm afraid I haven't tried that out before - sorry for the late reply!
Is there a way to use train_on_batch with a generator? |
You mean something like this?

    for X_batch, y_batch in generator:
        model.train_on_batch(X_batch, y_batch)
Hi. The dimensions of predicted_classes and true_classes are different, since the total number of samples is not divisible by the batch size. The size of my test set is not consistent, so the number of steps in predict_generator would change each time depending on the batch size. I am using flow_from_directory and cannot use predict_on_batch since my data is organized in a directory structure. One solution is running with a batch size of 1, but that makes it very slow. I hope my question is clear. Thanks in advance.
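One pattern that should work here (a sketch; test_datagen, test_dir, batch_size and model are placeholders for your own objects): choose steps so that every sample is seen exactly once, and keep shuffle=False so predictions line up with the true classes.

```python
import math

test_gen = test_datagen.flow_from_directory(
    test_dir, target_size=(224, 224), batch_size=batch_size,
    class_mode='categorical', shuffle=False)

# Covers all samples once; the last batch is simply smaller, so the number of
# predictions equals test_gen.samples regardless of divisibility.
steps = int(math.ceil(test_gen.samples / float(batch_size)))
probs = model.predict_generator(test_gen, steps=steps)

predicted_classes = probs.argmax(axis=-1)
true_classes = test_gen.classes  # aligned with the predictions because shuffle=False
```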
The comments and suggestions in this issue and its cousin #1638 were very helpful for me to efficiently process large numbers of images. I wrote it all up in a tutorial fashion that I hope can help others. |
Hello,
How can I access validation data from a custom Callback when using fit_generator? Best, |
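One common workaround (a sketch, not the only way; the names are illustrative) is to hand the validation data to the callback yourself, since with fit_generator the callback's validation_data attribute may not be populated:

```python
from keras.callbacks import Callback

class ValMonitor(Callback):
    """Keeps its own reference to the validation data instead of relying on
    self.validation_data being set by fit_generator."""
    def __init__(self, X_val, y_val):
        super(ValMonitor, self).__init__()
        self.X_val = X_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs={}):
        results = self.model.evaluate(self.X_val, self.y_val, verbose=0)
        print('Epoch %d: validation metrics %s' % (epoch, results))

# model.fit_generator(train_gen, ..., callbacks=[ValMonitor(X_val, y_val)])
```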
Hi, I tried using this but got the following error:
I am already aware of some discussions on how to use Keras for very large datasets (>1,000,000 images), such as this and this. However, for my scenario I can't figure out the appropriate way to use the ImageDataGenerator or to write my own dataGenerator. Specifically, I have the following four questions:

1. For datagen.fit(X_sample), do we assume that X_sample is a big enough chunk of data to calculate the mean and perform feature centering/normalization and whitening on?
2. X_sample obviously cannot be the entire data, so will the augmentation (i.e. flipping, width/height shift) happen on partial data? For example, say X_sample = 10000 out of a total of 1,000,000 pictures. After augmentation, suppose we get 2 * 10,000 more pictures. Note that we are not running datagen.fit() again, so will our augmented data contain only 1,000,000 + 2 * 10,000 samples? How do we augment the entire data (i.e. get 1,000,000 + 2 * 1,000,000 samples)?
3. My approach for building a data generator (for very large data) which loops indefinitely is as follows (and it fails): the code doesn't work in the sense that once it enters the generator function from fit_generator(), it just stays in the while 1 loop forever. A detailed example will help a lot.
4. If we use the ImageDataGenerator as in this link (which is preferable to writing our own), should we put (X_train, y_train), (X_test, y_test) = LOAD_10K_SAMPLES_OF_BIG_DATA() in a for loop and write datagen.fit(X_train) and model.fit_generator(datagen.flow(...)) inside that loop?
in that loop?The text was updated successfully, but these errors were encountered: