
Out of memory when "Training your own model" with COCO data set. #14

Closed
chickensoups opened this issue Aug 31, 2016 · 7 comments

Comments

@chickensoups

When I run the command th train.lua, I get the error below:

| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
/home/ai02/torch/install/bin/luajit: /home/ai02/torch/install/share/lua/5.1/threads/threads.lua:184: [thread 1 callback] not enough memory
stack traceback:
[C]: in function 'error'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:184: in function 'dojob'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:265: in function 'synchronize'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
/home/ai02/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
...-86a8-25a46f4534ea/projects/fbai/deepmask/DataLoader.lua:40: in function '__init'
/home/ai02/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/ai02/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'DataLoader'
...-86a8-25a46f4534ea/projects/fbai/deepmask/DataLoader.lua:21: in function 'create'
train.lua:101: in main chunk
[C]: in function 'dofile'
...ai02/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

I tried to trace the error back and found that the "callstatus" value is false in the "dojob" function, around line 169 of torch/install/share/lua/5.1/threads/threads.lua:

local callstatus, args, endcallbackid, threadid = self.mainqueue:dojob()

I can't figure out how to get past this error. I tried lower values for batch, maxload, testmaxload and maxepoch, but without success.
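(For reference, these are train.lua command-line options; a reduced run looks something like the line below, with the numbers being purely illustrative, not recommended settings.)

th train.lua -batch 8 -maxload 500 -testmaxload 100 -maxepoch 50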

My machine's specs are:
RAM: 32 GB
Processor: Intel Core i7-4790K CPU @ 4.0GHz x 8
Graphics: GeForce GTX TITAN X/PCIe/SSE2
Free disk space: 230 GB

@PavlosMelissinos

PavlosMelissinos commented Aug 31, 2016

I'm facing the same problem. It occurs during the conversion of the dataset from JSON to t7, but I'm not sure why.

Your code has a second problem, though. You should run th train.lua -nthreads 1 the first time (see #11 for details).

@ryanfb

ryanfb commented Aug 31, 2016

As a workaround, you can try doing the initial COCO annotations JSON to t7 conversion process in an interactive torch shell: https://gist.github.com/ryanfb/13bd5cf3d89d6b5e8acbd553256507f2#out-of-memory-loading-annotations-during-th-trainlua

After that, it shouldn't need to do the conversion on subsequent runs.
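Roughly, the idea is to trigger the conversion by hand from a th prompt, e.g. via the Lua COCO API that DeepMask's DataLoader uses. The gist has the exact steps; the lines below are only an approximate sketch from memory:

coco = require 'coco'
-- the first CocoApi call on a JSON annotation file converts it and caches
-- the result as a .t7 file, so later train.lua runs can skip this step
coco.CocoApi('data/annotations/instances_train2014.json')
coco.CocoApi('data/annotations/instances_val2014.json')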

@pdollar
Member

pdollar commented Aug 31, 2016

Yeah, we've noticed the JSON->T7 conversion can run out of memory on some systems... It works fine on my Mac, so I did the conversion there, then copied the t7 file to wherever I needed it. This is an issue with the JSON parsers for Torch and the memory restrictions of LuaJIT. At some point I was working on a JSON loader that would parse directly into tds format (https://github.com/torch/tds), which would alleviate the memory issues, but I decided it wasn't worth the effort since the JSON->T7 conversion can be done once elsewhere (e.g. on a Mac) and the result used thereafter.
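If you copy the t7 over, a quick sanity check is to load it once in a th session. The path below assumes DeepMask's default data layout; adjust as needed:

-- verify the copied annotation file deserializes without error
anns = torch.load('data/annotations/instances_train2014.t7')
print(anns)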

@pdollar pdollar closed this as completed Aug 31, 2016
@PavlosMelissinos

In case the interactive shell option doesn't work and using a Mac isn't an option, another solution is to rebuild Torch using Lua 5.1 instead of LuaJIT, according to this.
I just tried it and it works; apparently Lua 5.1 doesn't have the same memory limitation.
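For reference, a minimal sketch of that rebuild, assuming the standard torch/distro installer (which selects the Lua version via the TORCH_LUA_VERSION variable):

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA51 ./install.sh

Note that clean.sh wipes the existing install, so any previously installed rocks will need to be reinstalled against the new Lua.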

@tylin tylin mentioned this issue Sep 3, 2016
@pdollar
Member

pdollar commented Sep 3, 2016

@PavlosMelissinos: yes, this is a good solution as well. As you point out, LuaJIT has memory limitations that Lua does not.

@chrieke

chrieke commented Jan 29, 2017

Running into the same problem on an AWS p2 instance. The interactive Torch shell workaround suggested by ryanfb worked for the validation file (because it's smaller, I guess), but not for the training instances. I'll try rebuilding Torch with Lua instead of LuaJIT, but it would be much appreciated if someone could upload the t7 files directly. Thanks!

Edit: I rebuilt Torch with Lua and converted the JSON to t7 via the Torch shell workaround suggested by ryanfb, and that worked for the JSON -> t7 conversion. However, for training I ran into issue #41. Reinstalling all modules, including luaffifb (apparently needed to install tds when Torch is used with Lua instead of LuaJIT) and tds, didn't work for me.
So I just copied over the converted t7 files (or, if you are in the same situation, just use ryanfb's uploaded files from the post below, thank you Ryan!), reinstalled Torch with LuaJIT, and now it works.

@ryanfb

ryanfb commented Jan 29, 2017

@ChrisCKR I've just now uploaded my t7 files to Figshare here: https://figshare.com/articles/MSCOCO_Annotations_in_Torch_T7_Format/4595332

Hope they help!
