Continuously load data from disk in separate thread #20
Yeah, that is a little tricky to do if you don't have any index on the file. You could scan through the file once to determine the number of lines. If you don't do any shuffling or resampling of the data (but only batching), the vast majority of the reads will be sequential, and so the data loading will actually be very efficient.
Thanks for your answer. I would be more inclined toward an architecture in which the data thread loads into memory a maximum number K of batches at a time, refilling its queue each time the main thread retrieves one batch from it. I would also need this operation to be asynchronous with respect to the main thread. At the moment, I tried something similar to the MNIST example:
However, the call to get_it()() is blocking, which is not what I need.
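The architecture described above (a data thread holding at most K batches, refilled as the main thread consumes them) is the classic bounded producer-consumer pattern. A hedged Python sketch of it, with all names invented for illustration:

```python
import threading
import queue

def start_prefetcher(batch_source, k):
    """Spawn a daemon thread that keeps up to k batches queued.
    batch_source is any iterator of batches; None marks the end."""
    q = queue.Queue(maxsize=k)  # put() blocks once k batches are waiting

    def worker():
        for batch in batch_source:
            q.put(batch)        # refills as the consumer drains the queue
        q.put(None)             # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()
    return q

# usage: the main thread consumes while the worker reads ahead
# q = start_prefetcher(iter_batches("train.txt"), k=8)
# while (batch := q.get()) is not None:
#     train_step(batch)
```

The bounded queue gives exactly the asked-for behavior: the loader runs ahead asynchronously but never holds more than K batches in memory.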
I did something like what Octavian suggests in the dcgan.torch code, keeping the queue full:
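The dcgan.torch approach uses several worker threads ("donkeys") that each read data independently and keep one shared queue topped up. A rough Python analogue (names are illustrative; batches may arrive out of order across workers):

```python
import threading
import queue

def start_donkeys(make_reader, nworkers, depth):
    """Start nworkers threads, each running its own batch reader and
    feeding one shared queue of at most `depth` pending batches.
    make_reader() must return an independent batch iterator per worker."""
    q = queue.Queue(maxsize=depth)

    def donkey(reader):
        for batch in reader:
            q.put(batch)  # blocks when the queue is full

    for _ in range(nworkers):
        threading.Thread(target=donkey, args=(make_reader(),),
                         daemon=True).start()
    return q
```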
@octavian-ganea Maybe I am misunderstanding what you are saying, but this is exactly what ParallelDatasetIterator does. The K you mention corresponds to its nthread parameter.
I am not sure. As I said, the call to get_it()() in the code snippet I posted above is blocking instead of asynchronous. I need one separate thread that loads data mini-batches in an infinite loop while my main thread processes these mini-batches with the neural network, without waiting for the data thread. If I set nthread higher as you suggest, it is not clear how the threads can read chunks of a file in parallel. I am fine with just one thread doing this job, but again, I need it to be asynchronous from the main thread. I would appreciate it if you could show a small code snippet on how to do this properly. Thanks a lot!
I would set it up something like this (this is untested, but I hope you get the gist of it):
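The gist of such a setup, sketched here in Python with invented names since the original Lua snippet is not shown, is an iterator object that starts a background reader thread on construction and is then consumed like any ordinary iterator:

```python
import threading
import queue

class AsyncBatchIterator:
    """Iterate (once) over batches produced by a background reader thread.
    make_batches must return a fresh batch iterator, e.g. one pass over
    a file on disk. All names here are illustrative."""

    def __init__(self, make_batches, prefetch=4):
        self.q = queue.Queue(maxsize=prefetch)  # bounds read-ahead
        self.make_batches = make_batches
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        for batch in self.make_batches():
            self.q.put(batch)
        self.q.put(StopIteration)  # end-of-pass marker

    def __iter__(self):
        while True:
            item = self.q.get()
            if item is StopIteration:
                return
            yield item
```

The training loop then simply does `for batch in iterator: ...` while reading happens concurrently behind the scenes.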
Wow. Really helpful! Will try this out! Many thanks to both of you!
I've tried to run your example, but I keep getting the error:
The problem appears even if I only have the following code:
I've tried importing torchnet.env like in the datasets folder, without success. Sorry if this is really simple and I am missing it, but I've spent a bit of time trying to figure it out. Thanks!
Maybe you have an old version installed?
It actually works only if I declare tnt as global (removing local), or if I keep it local but replace the second line by:
I've tried this on a completely fresh installation of Torch, tds, argcheck, and torchnet. But, using this code, MyDataset seems not to be visible (it is a nil value) inside the closure function of the ParallelDatasetIterator. I've also tried moving the tnt definition into an init function, and still got the same error. Thanks for helping!
Hi. Any idea how to solve the above issue? Here is the code. Thanks!
Note that the closure runs in a separate thread, where upvalues from the main thread (such as your MyDataset) are not available. The best solution is to move your definition of MyDataset inside the closure itself. This is a minimal working example:
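The reason the definition must live inside the closure is that the closure is handed to a separate thread, and values captured from the enclosing scope do not reliably come along. Python's pickle module exhibits an analogous limitation, which this sketch (illustrative names only) demonstrates:

```python
import pickle

def make_closure():
    local_dataset = [1, 2, 3]     # an upvalue, like MyDataset in the thread
    return lambda: local_dataset  # closure capturing it

closure = make_closure()

# A closure capturing a local cannot be serialized for another worker...
try:
    pickle.dumps(closure)
    captures_survive = True
except Exception:
    captures_survive = False
# ...whereas anything built *inside* the worker's own function needs no
# serialization at all, which is why moving the definition in fixes it.
```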
Works very nicely now! Thanks a lot!
Hi,
After playing a bit with torchnet, it is still unclear to me how to properly tackle the following problem: suppose my training data lies in a very big file on disk (not loadable into memory at once), each line of the file being one example point. I would like to build a data iterator that runs on a separate thread (or several threads) and that can provide mini-batches to the main thread performing the training of the network. I would also like to do multiple epochs over the training data, so I require that the training file is reopened once it is finished.
I tried using ParallelDatasetIterator, but as far as I understand, the closure is run once per thread and the returned dataset is expected to have a finite size. Can someone please explain this or give an example? Thanks a lot.