Out of memory when "Training your own model" with COCO data set. #14
Comments
I happen to be facing the same problem. The problem occurs during the conversion of the dataset from JSON to t7, but I'm not sure why. Your code has a second problem, though. You should run
As a workaround, you can try doing the initial COCO annotations JSON-to-t7 conversion in an interactive Torch shell: https://gist.github.com/ryanfb/13bd5cf3d89d6b5e8acbd553256507f2#out-of-memory-loading-annotations-during-th-trainlua After that, it shouldn't need to redo the conversion on subsequent runs.
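For reference, the gist's approach can be sketched roughly like this. This is a minimal sketch, not a verified recipe: the annotation path and the Lua COCO API's t7 caching behavior are assumptions based on the linked gist.

```shell
# Sketch of the workaround from the linked gist (paths are assumptions;
# adjust to wherever your COCO annotations live).
cd /path/to/coco/data
th -e "local coco = require 'coco'; coco.CocoApi('annotations/instances_train2014.json')"
# On first load, the Lua COCO API parses the JSON and caches a .t7 file
# next to it, so later runs of train.lua can skip the expensive conversion.
```

Doing the conversion in a standalone `th` process keeps the one-off JSON parse out of the training run, which is the whole point of the workaround.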
Yeah, we've noticed the JSON->T7 conversion can run out of memory on some systems... It works fine on my Mac, so I did the conversion there, then copied the t7 file to wherever I needed it. This is an issue with the JSON parsers for Torch and the memory restrictions of LuaJIT. At some point I was working on a JSON loader that would parse directly into tds format (https://github.com/torch/tds), which would alleviate all the memory issues, but I decided it wasn't worth the effort since the JSON->T7 conversion can be done once elsewhere (e.g. on a Mac) and then used thereafter.
In case the interactive shell option doesn't work and using a Mac isn't an option, another solution is to rebuild Torch using Lua 5.1 instead of LuaJIT, according to this.
@PavlosMelissinos: yes , this is a good solution as well. LuaJIT has memory limitations that Lua does not as you point out. |
Running into the same problem on an AWS p2 instance. Using the interactive Torch shell workaround suggested by ryanfb worked for the validation file (because it's smaller, I guess), but not for the training instances. Will try rebuilding Torch with Lua instead of LuaJIT, but it would be much appreciated if someone could upload the t7 files directly. Thanks!

Edit: Rebuilt Torch with Lua and converted the JSON to t7 via the Torch shell workaround suggested by ryanfb, and that worked for the JSON -> t7 conversion. However, for training, I ran into issue #41. Reinstalling all modules, incl. luaffifb (apparently needed for the installation of tds when Torch is used with Lua instead of LuaJIT) and tds, didn't work for me.
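For anyone retracing these steps, the reinstall described above can be sketched as follows. This is an assumption-laden sketch: it presumes a plain-Lua (non-LuaJIT) Torch build and that the luaffifb rock is available in your configured LuaRocks repositories.

```shell
# Under plain Lua there is no built-in FFI, so tds needs luaffifb
# (per the comment above; availability of the rock is assumed).
luarocks install luaffifb
luarocks install tds
```

As noted, this alone did not resolve issue #41 for the commenter, so treat it as a prerequisite step rather than a fix.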
@ChrisCKR I've just now uploaded my t7 files to Figshare here: https://figshare.com/articles/MSCOCO_Annotations_in_Torch_T7_Format/4595332 Hope they help! |
When I run the command "th train.lua" I get the error below.
I tried to trace the error back and found that the "callstatus" value is "false" in the function "dojob", line 169 of "torch/install/share/lua/5.1/threads/threads.lua".
I can't figure out how to get past this error. I tried lower values for batch, maxload, testmaxload and maxepoch, but without success.
My computer's specs are:
RAM: 32 GB
Processor: Intel Core i7-4790K CPU @ 4.0 GHz x 8
Graphics: GeForce GTX TITAN X/PCIe/SSE2
Free disk space: 230 GB