unable to start train #27

grypes · 2017-07-20T08:51:20Z

Hi,
I can run the standalone backend game_MC successfully, but when I try to run the command below

# --num_games 1024 --batchsize 128: run 1024 games with a batch size of 128.
# --freq_update 50: update the behavior policy after every 50 updates of the model.
# --fs_opponent 20: the opponent makes a decision every 20 ticks.
# --latest_start 500 --latest_start_decay 0.99: use the rule-based AI for the first 500 ticks, then the trained AI takes over; latest_start decays with rate latest_start_decay.
# --opponent_type AI_SIMPLE: use AI_SIMPLE as the rule-based AI.
# --tqdm: show a progress bar.
# --gpu 0: use the first GPU.
# --T 20: 20-step actor-critic.
game=./rts/game_MC/game model=actor_critic model_file=./rts/game_MC/model \
python3 run.py \
    --num_games 1024 --batchsize 128 \
    --freq_update 50 \
    --fs_opponent 20 \
    --latest_start 500 --latest_start_decay 0.99 \
    --opponent_type AI_SIMPLE \
    --tqdm \
    --gpu 0 \
    --T 20

I get this error message:

Namespace(T=20, actor_only=False, additional_labels=None, ai_type='AI_NN', batchsize=128, discount=0.99, entropy_ratio=0.01, epsilon=0.0, eval=False, freq_update=50, fs_ai=50, fs_opponent=20, game_multi=None, gpu=0, grad_clip_norm=None, greedy=False, handicap_level=0, latest_start=500, latest_start_decay=0.99, load=None, max_tick=30000, mcts_threads=64, min_prob=1e-06, num_episode=10000, num_games=1024, num_minibatch=5000, opponent_type='AI_SIMPLE', ratio_change=0, record_dir='./record', sample_node='pi', sample_policy='epsilon-greedy', save_dir=None, save_prefix='save', seed=0, simple_ratio=-1, tqdm=True, verbose_collector=False, verbose_comm=False, wait_per_group=False)
Segmentation fault (core dumped)

The program just terminates with a segmentation fault.


yuandong-tian · 2017-07-20T14:39:08Z

@git-hcLee It works on my side. What are your OS and gcc versions?

qiqiguaitm · 2017-07-20T15:34:32Z

This looks like the problem I had in #14. You can try my workaround from there, which avoids the random seed.

EasyHard · 2017-07-20T16:06:22Z

Could you post the backtrace of the dump? In my case, rebuilding pytorch from source with gcc 5.4.0-1 made it work.

grypes · 2017-07-21T03:55:20Z

@yuandong-tian
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

grypes · 2017-07-21T03:56:39Z

@EasyHard Thanks, I'll try it.

LinZichuan · 2017-07-22T09:30:26Z

I have the same problem: I also get the segmentation fault and cannot start training with run.py. I use gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0.

EasyHard · 2017-07-22T20:34:15Z

Could any of you post a backtrace of the dump? Just for more information.
gdb python
r run.py
bt
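
As a complement to the gdb session above, Python's standard `faulthandler` module can print the Python-level traceback when the interpreter receives SIGSEGV, which shows which line of run.py was executing at the time of the crash. This is only a sketch: the helper name `enable_crash_traceback` is hypothetical and is not part of ELF.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Hypothetical helper: dump Python tracebacks on fatal signals.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr. Returns True once the handler is active.
    """
    target = stream if stream is not None else sys.stderr
    # all_threads=True prints the traceback of every Python thread,
    # not only the one that received the signal.
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traceback()` at the top of run.py would be one way to use it. Without editing any file, `python3 -X faulthandler run.py` enables the same handler. It reports Python frames only, so the gdb backtrace is still needed for the C++ side.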

Liujiachen · 2017-07-31T02:34:17Z

Hi, I can run the standalone backend game_MC successfully, but when I try to train, I get the message below:
Traceback (most recent call last):
  File "run.py", line 142, in <module>
    game = load_module(os.environ["game"]).Loader()
  File "/home/myubuntu/ELF-master/rlpytorch/utils.py", line 510, in load_module
    module = __import__(os.path.basename(mod))
  File "./rts/game_MC/game.py", line 8, in <module>
    import minirts
ImportError: /home/myubuntu/anaconda3/lib/python3.5/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./rts/game_MC/minirts.so)

yuandong-tian · 2017-07-31T03:04:33Z

@Liujiachen: Could you check your gcc and libstdc++ versions?
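
The path in the ImportError resolves to the libstdc++.so.6 inside the Anaconda lib directory, so one thing worth checking is which GLIBCXX version tags that particular file provides. The sketch below scans the raw bytes of a library for those tags, similar in spirit to `strings libstdc++.so.6 | grep GLIBCXX`. The function names `glibcxx_versions` and `provides` are hypothetical helpers, not part of any library.

```python
import re
import sys


def glibcxx_versions(data: bytes):
    """Return the sorted GLIBCXX version tags found in raw library bytes."""
    tags = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    # Compare versions numerically, so that 3.4.19 sorts after 3.4.2.
    return sorted(tuple(int(p) for p in t.decode().split(".")) for t in tags)


def provides(data: bytes, wanted: str) -> bool:
    """True if the library bytes contain the tag GLIBCXX_<wanted>."""
    return tuple(int(p) for p in wanted.split(".")) in glibcxx_versions(data)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 this_script.py /path/to/libstdc++.so.6
    with open(sys.argv[1], "rb") as f:
        blob = f.read()
    print("provides GLIBCXX_3.4.20:", provides(blob, "3.4.20"))
```

If the Anaconda copy lacks `GLIBCXX_3.4.20` while the system copy has it, that would suggest minirts.so was built against a newer libstdc++ than the one being loaded at import time. This thread does not confirm that diagnosis.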

gchlodzinski · 2017-08-04T16:01:29Z

Hi, I am also getting the segmentation fault.
Here is what I am using:
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

And here is the backtrace EasyHard asked for:
(gdb) r run.py
Starting program: /usr/bin/python3 run.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3a17700 (LWP 11218)]
[New Thread 0x7ffff1216700 (LWP 11219)]
[New Thread 0x7ffff0a15700 (LWP 11220)]
[Thread 0x7ffff0a15700 (LWP 11220) exited]
[Thread 0x7ffff1216700 (LWP 11219) exited]
[Thread 0x7ffff3a17700 (LWP 11218) exited]
Namespace(T=6, actor_only=False, additional_labels=None, ai_type='AI_NN', batchsize=128, discount=0.99, entropy_ratio=0.01, epsilon=0.0, eval=False, freq_update=1, fs_ai=50, fs_opponent=50, game_multi=None, gpu=None, grad_clip_norm=None, greedy=False, handicap_level=0, latest_start=1000, latest_start_decay=0.7, load=None, max_tick=30000, mcts_threads=64, min_prob=1e-06, num_episode=10000, num_games=1024, num_minibatch=5000, opponent_type='AI_SIMPLE', ratio_change=0, record_dir='./record', sample_node='pi', sample_policy='epsilon-greedy', save_dir=None, save_prefix='save', seed=0, simple_ratio=-1, tqdm=False, verbose_collector=False, verbose_comm=False, wait_per_group=False)

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0) at iofread.c:37
37 iofread.c: No such file or directory.
(gdb) bt
#0 __GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0)
at iofread.c:37
#1 0x00007fffd103ea4e in std::random_device::_M_getval() ()
from /usr/local/lib/python3.5/dist-packages/torch/lib/libTHC.so.1
#2 0x00007fffbac01ffb in GameContext::GameContext(ContextOptions const&, PythonOptions const&) () from ./rts/game_MC/minirts.so
#3 0x00007fffbac03b6f in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class, , 0>(pybind11::class&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class&&, void ()(GameContext, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call) const () from ./rts/game_MC/minirts.so
#4 0x00007fffbac03c9e in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class, , 0>(pybind11::class&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class_&&, void ()(GameContext, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from ./rts/game_MC/minirts.so
#5 0x00007fffbabe9f7d in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from ./rts/game_MC/minirts.so
#6 0x00000000004e9bc7 in PyCFunction_Call ()
#7 0x00000000005b7167 in PyObject_Call ()
#8 0x00000000004f413e in ?? ()
#9 0x00000000005b7167 in PyObject_Call ()
#10 0x000000000054d359 in ?? ()
#11 0x000000000055d17c in ?? ()
#12 0x00000000005b7167 in PyObject_Call ()
#13 0x0000000000528d06 in PyEval_EvalFrameEx ()
#14 0x0000000000528814 in PyEval_EvalFrameEx ()
#15 0x000000000052d2e3 in ?? ()
#16 0x000000000052dfdf in PyEval_EvalCode ()
#17 0x00000000005fd2c2 in ?? ()
#18 0x00000000005ff76a in PyRun_FileExFlags ()
#19 0x00000000005ff95c in PyRun_SimpleFileExFlags ()
#20 0x000000000063e7d6 in Py_Main ()
#21 0x00000000004cfe41 in main ()

EasyHard · 2017-08-04T19:51:05Z

@gchlodzinski Your stack looks similar to what I've encountered. Compiling pytorch from source with gcc-5.4 helped me on this. Haven't got a chance to really figure out why this happens though.
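
One detail of the backtrace above stands out: frame #0 is an fread on a null FILE pointer (fp=0x0), and frame #1, the std::random_device code that calls it, is resolved from PyTorch's libTHC.so.1 rather than from the system libstdc++. That is consistent with a C++ runtime mismatch between the prebuilt PyTorch and the locally compiled minirts.so, although the thread does not establish the root cause. The sketch below lists which shared object each gdb frame was resolved from. `frame_libraries` is a hypothetical helper and `SAMPLE` is abbreviated from the backtrace above.

```python
import re


def frame_libraries(backtrace: str):
    """Map frame number -> shared object named after 'from' in gdb output."""
    libs = {}
    # gdb wraps long frames; a new frame starts at a line beginning '#<n> '.
    for frame in re.split(r"\n(?=#\d+\s)", backtrace.strip()):
        text = " ".join(frame.split())  # undo the line wrapping
        number = re.match(r"#(\d+)\s", text)
        library = re.search(r"\sfrom (\S+)$", text)
        if number and library:
            libs[int(number.group(1))] = library.group(1)
    return libs


SAMPLE = """\
#0 __GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0)
at iofread.c:37
#1 0x00007fffd103ea4e in std::random_device::_M_getval() ()
from /usr/local/lib/python3.5/dist-packages/torch/lib/libTHC.so.1
#2 0x00007fffbac01ffb in GameContext::GameContext(ContextOptions const&, PythonOptions const&) () from ./rts/game_MC/minirts.so
#6 0x00000000004e9bc7 in PyCFunction_Call ()
"""

if __name__ == "__main__":
    for number, library in sorted(frame_libraries(SAMPLE).items()):
        print(number, library)
```

Frames without a trailing "from ..." part (here #0 and #6) are skipped, since gdb prints the library only for frames it resolved inside a shared object without source information.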

gchlodzinski · 2017-08-05T11:54:45Z

@EasyHard Thanks, that helped to get things started.
But now the sample training only gets to step 147 and then fails with this error (the end of the traceback):

RuntimeError: input and target have different number of elements: input[128 x 1] has 128 elements, while target[128 x 128] has 16384 elements at /home/grzegorz/pytorch/torch/lib/THCUNN/generic/SmoothL1Criterion.cu:12

Edit: I get the same result even after reinstalling the whole system from scratch, this time using conda for Python and the packages. It still crashes when I change the batch size to other values (always a power of 2), just at a different iteration number.
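
The element counts in the RuntimeError follow directly from the two shapes: 128 x 1 gives 128 elements, while 128 x 128 gives 16384. A small sketch of that arithmetic is shown here, with hypothetical helper names `numel` and `same_numel`. It only reproduces the check the error message describes and does not explain why the target ends up with shape 128 x 128.

```python
from math import prod  # Python 3.8+


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    return prod(shape)


def same_numel(input_shape, target_shape):
    """True if both shapes hold the same number of elements."""
    return numel(input_shape) == numel(target_shape)


if __name__ == "__main__":
    print(numel((128, 1)), numel((128, 128)), same_numel((128, 1), (128, 128)))
```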

LinZichuan · 2017-08-15T09:25:41Z

@gchlodzinski Hi, have you solved the above problem?

LinZichuan · 2017-08-15T09:49:33Z

@yuandong-tian
I updated the repo to the latest version and recompiled everything, but training still does not start.

Version: 99b9e219b9e23bdc7c5e710c0aec531219d5e9e0_
Num Actions: 9
Num unittype: 6
#recv_thread = 4
0%| | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 194, in <module>
    runner.run()
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 179, in run
    self.GC.Run()
  File "/home/ziclin/ELF/elf/utils_elf.py", line 254, in Run
    res = self._call(self.infos)
  File "/home/ziclin/ELF/elf/utils_elf.py", line 245, in _call
    reply = self._cb[infos.gid](sel, sel_gpu)
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 109, in actor
    self.stats.feed_batch(sel)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 188, in feed_batch
    return self.collector.feed_batch(batch, hist_idx=hist_idx)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 68, in feed_batch
    ids = batch["id"][hist_idx]
  File "/home/ziclin/ELF/elf/utils_elf.py", line 84, in __getitem__
    raise KeyError("Batch(): specified key: %s or %s not found!" % (key, key_with_last))
KeyError: 'Batch(): specified key: id or last_id not found!'

./script.sh: line 1: 18981 Segmentation fault (core dumped) game=./rts/game_MC/game model=actor_critic model_file=./rts/game_MC/model python3 run.py --num_games 1024 --batchsize 128 --freq_update 50 --fs_opponent 20 --latest_start 500 --latest_start_decay 0.99 --opponent_type AI_SIMPLE --tqdm --gpu 0 --T 20

gchlodzinski · 2017-08-15T18:55:04Z

@LinZichuan, I was not able to find a solution to my runtime error. I also tried to run ELF on Mac OS, but it failed there as well (with a strange CUDA error message).
Edit:
@LinZichuan, after pulling the new sources I am now hitting the same problem as you.
But it is solved by changing the command line.

yuandong-tian · 2017-08-18T20:58:59Z

@LinZichuan See #45

yuandong-tian · 2017-08-24T01:20:47Z

@LinZichuan @gchlodzinski @git-hcLee This commit f268feb might address your issue.

qucheng mentioned this issue Aug 8, 2017

Segmentation fault error #39

Closed

yuandong-tian closed this as completed Aug 23, 2017
