
Out of memory when "Training your own model" with COCO data set. #14

Closed
chickensoups opened this issue Aug 31, 2016 · 7 comments

Comments

@chickensoups

When I run the command th train.lua, I get the error below:

| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
/home/ai02/torch/install/bin/luajit: /home/ai02/torch/install/share/lua/5.1/threads/threads.lua:184: [thread 1 callback] not enough memory
stack traceback:
[C]: in function 'error'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:184: in function 'dojob'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:265: in function 'synchronize'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
...-86a8-25a46f4534ea/projects/fbai/deepmask/DataLoader.lua:40: in function '__init'
/home/ai02/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/ai02/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'DataLoader'
...-86a8-25a46f4534ea/projects/fbai/deepmask/DataLoader.lua:21: in function 'create'
train.lua:101: in main chunk
[C]: in function 'dofile'
...ai02/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

I tried to trace the error back and found that the "callstatus" value is false in the "dojob" function, around line 169 of torch/install/share/lua/5.1/threads/threads.lua:

local callstatus, args, endcallbackid, threadid = self.mainqueue:dojob()

I can't figure out how to get past this error. I tried lower values for batch, maxload, testmaxload and maxepoch, but without success.
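(For reference, these are train.lua command-line options; a reduced run looks something like the line below, with the numbers being purely illustrative, not recommended settings.)

th train.lua -batch 8 -maxload 500 -testmaxload 100 -maxepoch 50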

My machine's specs are:
RAM: 32 GB
Processor: Intel Core i7-4790K CPU @ 4.0GHz x 8
Graphics: GeForce GTX TITAN X/PCIe/SSE2
Free disk space: 230 GB

@PavlosMelissinos

PavlosMelissinos commented Aug 31, 2016

I'm facing the same problem. It occurs during the conversion of the dataset from JSON to t7, but I'm not sure why.

Your code has a second problem, though. You should run th train.lua -nthreads 1 the first time (see #11 for details).

@ryanfb

ryanfb commented Aug 31, 2016

As a workaround, you can try doing the initial COCO annotations JSON to t7 conversion process in an interactive torch shell: https://gist.github.com/ryanfb/13bd5cf3d89d6b5e8acbd553256507f2#out-of-memory-loading-annotations-during-th-trainlua

After that, it shouldn't need to do the conversion on subsequent runs.
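Roughly, the idea is to trigger the conversion by hand from a th prompt, e.g. via the Lua COCO API that DeepMask's DataLoader uses. The gist has the exact steps; the lines below are only an approximate sketch from memory:

coco = require 'coco'
-- the first CocoApi call on a JSON annotation file converts it and caches
-- the result as a .t7 file, so later train.lua runs can skip this step
coco.CocoApi('data/annotations/instances_train2014.json')
coco.CocoApi('data/annotations/instances_val2014.json')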

@pdollar
Member

pdollar commented Aug 31, 2016

Yeah, we've noticed the JSON->T7 conversion can run out of memory on some systems... It works fine on my Mac, so I did the conversion there, then copied the t7 file to wherever I needed it. This is an issue with the JSON parsers for Torch and the memory restrictions of LuaJIT. At some point I was working on a JSON loader that would parse directly into tds format (https://github.com/torch/tds), which would alleviate the memory issues, but I decided it wasn't worth the effort since the JSON->T7 conversion can be done once elsewhere (e.g. on a Mac) and the result used thereafter.
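If you copy the t7 over, a quick sanity check is to load it once in a th session. The path below assumes DeepMask's default data layout; adjust as needed:

-- verify the copied annotation file deserializes without error
anns = torch.load('data/annotations/instances_train2014.t7')
print(anns)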

@pdollar pdollar closed this as completed Aug 31, 2016
@PavlosMelissinos

In case the interactive shell option doesn't work and using a Mac isn't an option, another solution is to rebuild Torch using Lua 5.1 instead of LuaJIT, according to this.
I just tried it and it works; apparently Lua 5.1 doesn't have the same memory limitation.
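For reference, a minimal sketch of that rebuild, assuming the standard torch/distro installer (which selects the Lua version via the TORCH_LUA_VERSION variable):

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA51 ./install.sh

Note that clean.sh wipes the existing install, so any previously installed rocks will need to be reinstalled against the new Lua.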

@tylin tylin mentioned this issue Sep 3, 2016
@pdollar
Member

pdollar commented Sep 3, 2016

@PavlosMelissinos: yes, this is a good solution as well. As you point out, LuaJIT has memory limitations that Lua does not.

@chrieke

chrieke commented Jan 29, 2017

Running into the same problem on an AWS p2 instance. The interactive Torch shell workaround suggested by ryanfb worked for the validation file (because it's smaller, I guess), but not for the training instances. I'll try rebuilding Torch with Lua instead of LuaJIT, but it would be much appreciated if someone could upload the t7 files directly. Thanks!

Edit: I rebuilt Torch with Lua and converted the JSON to t7 via the Torch shell workaround suggested by ryanfb, and that worked for the JSON -> t7 conversion. However, for training I ran into issue #41. Reinstalling all modules, including luaffifb (apparently needed to install tds when Torch is used with Lua instead of LuaJIT) and tds, didn't work for me.
So I just copied over the converted t7 files (or, if you are in the same situation, just use ryanfb's uploaded files from the post below, thank you Ryan!), reinstalled Torch with LuaJIT, and now it works.

@ryanfb

ryanfb commented Jan 29, 2017

@ChrisCKR I've just now uploaded my t7 files to Figshare here: https://figshare.com/articles/MSCOCO_Annotations_in_Torch_T7_Format/4595332

Hope they help!
