This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

unable to start train #27

Closed
grypes opened this issue Jul 20, 2017 · 17 comments

@grypes

grypes commented Jul 20, 2017

Hi, I can run the standalone game_MC backend successfully, but when I try to run the command below

game=./rts/game_MC/game model=actor_critic model_file=./rts/game_MC/model \ 
python3 run.py 
    --num_games 1024 --batchsize 128              # Set number of games to be 1024 and batchsize to be 128.  
    --freq_update 50                              # Update behavior policy after 50 updates of the model.
    --fs_opponent 20                              # How often your opponent makes a decision (every 20 ticks)
    --latest_start 500  --latest_start_decay 0.99 # Use rule-based AI for the first 500 ticks, then trained AI takes over. latest_start decays with rate latest_start_decay. 
    --opponent_type AI_SIMPLE                     # Use AI_SIMPLE as rule-based AI
    --tqdm                                        # Show progress bar.
    --gpu 0                                       # Use first gpu. 
    --T 20                                        # 20 step actor-critic

I get this error message:

Namespace(T=20, actor_only=False, additional_labels=None, ai_type='AI_NN', batchsize=128, discount=0.99, entropy_ratio=0.01, epsilon=0.0, eval=False, freq_update=50, fs_ai=50, fs_opponent=20, game_multi=None, gpu=0, grad_clip_norm=None, greedy=False, handicap_level=0, latest_start=500, latest_start_decay=0.99, load=None, max_tick=30000, mcts_threads=64, min_prob=1e-06, num_episode=10000, num_games=1024, num_minibatch=5000, opponent_type='AI_SIMPLE', ratio_change=0, record_dir='./record', sample_node='pi', sample_policy='epsilon-greedy', save_dir=None, save_prefix='save', seed=0, simple_ratio=-1, tqdm=True, verbose_collector=False, verbose_comm=False, wait_per_group=False)
Segmentation fault (core dumped)

The program just terminates with a segmentation fault.

@yuandong-tian
Contributor

@git-hcLee It works on my side. What are your OS version and gcc version?

@qiqiguaitm

This looks like my situation in #14. You can try my workaround from there, which avoids the random seed.

@EasyHard
Contributor

Could you post the backtrace of the dump? In my case I rebuilt PyTorch from source using gcc 5.4.0-1, and after that it works fine.

@grypes
Author

grypes commented Jul 21, 2017

@yuandong-tian
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

@grypes
Author

grypes commented Jul 21, 2017

@EasyHard Thanks, I'll try it.

@LinZichuan

I hit the same problem and also get the segmentation fault. I use gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 and cannot start training with run.py.

@EasyHard
Contributor

Could any of you post a backtrace of the dump? Just for more information.
gdb python
r run.py
bt
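
As a complement to the gdb session above, Python's standard `faulthandler` module can print the Python-level traceback when the interpreter receives SIGSEGV, which shows the Python call that entered the crashing native code. A minimal sketch, assuming you can add a few lines near the top of run.py; the helper name `enable_crash_traces` is hypothetical and not part of ELF:

```python
import faulthandler
import sys


def enable_crash_traces(stream=sys.stderr):
    """Ask the interpreter to dump Python tracebacks for all threads
    when a fatal signal such as SIGSEGV arrives.

    `stream` must be a file object with a real file descriptor.
    Returns True once the handler is installed.
    """
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traces()` before the game module is loaded should be enough. Launching with `python3 -X faulthandler run.py` enables the same handler without editing the script.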

@Liujiachen

Hi, I can run the standalone game_MC backend successfully, but when I try to train I get the message below:
Traceback (most recent call last):
  File "run.py", line 142, in <module>
    game = load_module(os.environ["game"]).Loader()
  File "/home/myubuntu/ELF-master/rlpytorch/utils.py", line 510, in load_module
    module = __import__(os.path.basename(mod))
  File "./rts/game_MC/game.py", line 8, in <module>
    import minirts
ImportError: /home/myubuntu/anaconda3/lib/python3.5/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./rts/game_MC/minirts.so)
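
The ImportError above says that the libstdc++.so.6 picked up at import time (here the copy under anaconda3) does not advertise the GLIBCXX_3.4.20 symbol version that minirts.so requires. A minimal sketch for listing the GLIBCXX versions a library file contains, roughly what `strings libstdc++.so.6 | grep GLIBCXX` does; the function names are illustrative assumptions, not part of ELF:

```python
import re

# Matches version tags such as GLIBCXX_3.4 or GLIBCXX_3.4.20 in raw bytes.
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(blob):
    """Return the sorted, de-duplicated GLIBCXX versions found in `blob`,
    each as a tuple of ints, e.g. (3, 4, 20)."""
    found = set()
    for match in _GLIBCXX.finditer(blob):
        parts = match.group(1).decode("ascii").split(".")
        found.add(tuple(int(p) for p in parts))
    return sorted(found)


def provides(path, wanted="3.4.20"):
    """Check whether the library file at `path` advertises GLIBCXX_<wanted>."""
    with open(path, "rb") as handle:
        versions = glibcxx_versions(handle.read())
    return tuple(int(p) for p in wanted.split(".")) in versions
```

Running `provides(...)` on the path from the error message and on the system copy of libstdc++.so.6 shows which of the two is too old for the compiler that built minirts.so.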

@yuandong-tian
Contributor

@Liujiachen: Could you check your gcc and libstdc++ versions?

@gchlodzinski

Hi, I am also hitting the segmentation fault.
Here is what I am using:
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

And here is what EasyHard was asking for:
(gdb) r run.py
Starting program: /usr/bin/python3 run.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3a17700 (LWP 11218)]
[New Thread 0x7ffff1216700 (LWP 11219)]
[New Thread 0x7ffff0a15700 (LWP 11220)]
[Thread 0x7ffff0a15700 (LWP 11220) exited]
[Thread 0x7ffff1216700 (LWP 11219) exited]
[Thread 0x7ffff3a17700 (LWP 11218) exited]
Namespace(T=6, actor_only=False, additional_labels=None, ai_type='AI_NN', batchsize=128, discount=0.99, entropy_ratio=0.01, epsilon=0.0, eval=False, freq_update=1, fs_ai=50, fs_opponent=50, game_multi=None, gpu=None, grad_clip_norm=None, greedy=False, handicap_level=0, latest_start=1000, latest_start_decay=0.7, load=None, max_tick=30000, mcts_threads=64, min_prob=1e-06, num_episode=10000, num_games=1024, num_minibatch=5000, opponent_type='AI_SIMPLE', ratio_change=0, record_dir='./record', sample_node='pi', sample_policy='epsilon-greedy', save_dir=None, save_prefix='save', seed=0, simple_ratio=-1, tqdm=False, verbose_collector=False, verbose_comm=False, wait_per_group=False)

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0) at iofread.c:37
37 iofread.c: No such file or directory.
(gdb) bt
#0 __GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0) at iofread.c:37
#1 0x00007fffd103ea4e in std::random_device::_M_getval() () from /usr/local/lib/python3.5/dist-packages/torch/lib/libTHC.so.1
#2 0x00007fffbac01ffb in GameContext::GameContext(ContextOptions const&, PythonOptions const&) () from ./rts/game_MC/minirts.so
#3 0x00007fffbac03b6f in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class, , 0>(pybind11::class&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class&&, void ()(GameContext, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call) const () from ./rts/game_MC/minirts.so
#4 0x00007fffbac03c9e in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class, , 0>(pybind11::class&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class_&&, void ()(GameContext, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from ./rts/game_MC/minirts.so
#5 0x00007fffbabe9f7d in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from ./rts/game_MC/minirts.so
#6 0x00000000004e9bc7 in PyCFunction_Call ()
#7 0x00000000005b7167 in PyObject_Call ()
#8 0x00000000004f413e in ?? ()
#9 0x00000000005b7167 in PyObject_Call ()
#10 0x000000000054d359 in ?? ()
#11 0x000000000055d17c in ?? ()
#12 0x00000000005b7167 in PyObject_Call ()
#13 0x0000000000528d06 in PyEval_EvalFrameEx ()
#14 0x0000000000528814 in PyEval_EvalFrameEx ()
#15 0x000000000052d2e3 in ?? ()
#16 0x000000000052dfdf in PyEval_EvalCode ()
#17 0x00000000005fd2c2 in ?? ()
#18 0x00000000005ff76a in PyRun_FileExFlags ()
#19 0x00000000005ff95c in PyRun_SimpleFileExFlags ()
#20 0x000000000063e7d6 in Py_Main ()
#21 0x00000000004cfe41 in main ()

@EasyHard
Contributor

EasyHard commented Aug 4, 2017

@gchlodzinski Your stack looks similar to what I encountered. Compiling PyTorch from source with gcc-5.4 fixed it for me. I haven't had a chance to figure out why this happens, though.

@gchlodzinski

gchlodzinski commented Aug 5, 2017

@EasyHard Thanks, that got things started.
But now the sample training only reaches step 147 and then fails with this error (at the end of the traceback):

RuntimeError: input and target have different number of elements: input[128 x 1] has 128 elements, while target[128 x 128] has 16384 elements at /home/grzegorz/pytorch/torch/lib/THCUNN/generic/SmoothL1Criterion.cu:12

Edit: I get the same result even after reinstalling the whole system from scratch, this time using conda for Python and its packages. It still crashes when I change the batch size to other values (all powers of 2), just at a different iteration number.
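
The element counts in the RuntimeError follow directly from the two shapes: 128 x 1 holds 128 elements, while 128 x 128 holds 16384. A minimal sketch of a guard that checks this before a loss is computed, using plain shape tuples instead of tensors so it runs without PyTorch; `numel` and `check_same_numel` are hypothetical helpers, not part of ELF or PyTorch, and the sketch does not identify why the target shape grew:

```python
from math import prod


def numel(shape):
    """Number of elements held by a tensor of the given shape."""
    return prod(shape)


def check_same_numel(input_shape, target_shape):
    """Raise ValueError early if input and target hold different
    numbers of elements; otherwise return the shared count."""
    n_in, n_tgt = numel(input_shape), numel(target_shape)
    if n_in != n_tgt:
        raise ValueError(
            "input %s has %d elements, while target %s has %d elements"
            % (tuple(input_shape), n_in, tuple(target_shape), n_tgt)
        )
    return n_in
```

With real tensors, the same check could be applied to `tuple(x.size())` of the value prediction and its target just before the loss call, so the mismatch is reported at the step where the shapes first diverge.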

@LinZichuan

@gchlodzinski Hi, have you solved the above problem?

@LinZichuan

LinZichuan commented Aug 15, 2017

@yuandong-tian
I updated the repo to the latest version and recompiled everything, but training still does not start.

Version: 99b9e219b9e23bdc7c5e710c0aec531219d5e9e0_
Num Actions: 9
Num unittype: 6
#recv_thread = 4
0%| | 0/5000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run.py", line 194, in <module>
    runner.run()
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 179, in run
    self.GC.Run()
  File "/home/ziclin/ELF/elf/utils_elf.py", line 254, in Run
    res = self._call(self.infos)
  File "/home/ziclin/ELF/elf/utils_elf.py", line 245, in _call
    reply = self._cb[infos.gid](sel, sel_gpu)
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 109, in actor
    self.stats.feed_batch(sel)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 188, in feed_batch
    return self.collector.feed_batch(batch, hist_idx=hist_idx)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 68, in feed_batch
    ids = batch["id"][hist_idx]
  File "/home/ziclin/ELF/elf/utils_elf.py", line 84, in __getitem__
    raise KeyError("Batch(): specified key: %s or %s not found!" % (key, key_with_last))
KeyError: 'Batch(): specified key: id or last_id not found!'

./script.sh: line 1: 18981 Segmentation fault (core dumped) game=./rts/game_MC/game model=actor_critic model_file=./rts/game_MC/model python3 run.py --num_games 1024 --batchsize 128 --freq_update 50 --fs_opponent 20 --latest_start 500 --latest_start_decay 0.99 --opponent_type AI_SIMPLE --tqdm --gpu 0 --T 20
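
The KeyError message names two keys, `id` and `last_id`, which suggests that the batch lookup tries the requested key first and then the same key with a `last_` prefix. A minimal sketch of that lookup pattern, inferred only from the error text in the traceback; the class below is a hypothetical stand-in and not the actual code in elf/utils_elf.py:

```python
class Batch:
    """Hypothetical stand-in for the batch wrapper named in the traceback."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        # Assumption: fall back to "last_<key>" when "<key>" is absent,
        # as the wording of the error message implies.
        key_with_last = "last_" + key
        if key in self._fields:
            return self._fields[key]
        if key_with_last in self._fields:
            return self._fields[key_with_last]
        raise KeyError(
            "Batch(): specified key: %s or %s not found!" % (key, key_with_last)
        )
```

Under this reading, the failure means the batch handed to the stats collector carried neither field, which points at what the game was asked to send rather than at the lookup itself.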

@gchlodzinski

gchlodzinski commented Aug 15, 2017

@LinZichuan, I was not able to find a solution to my runtime error. I also tried to run ELF on Mac OS, but it failed there as well (with a strange CUDA error message).
Edit:
@LinZichuan, after pulling the new sources I now have the same problem as you.
It goes away when I change the command line, though.

@yuandong-tian
Contributor

@LinZichuan See #45

@yuandong-tian
Contributor

yuandong-tian commented Aug 24, 2017

@LinZichuan @gchlodzinski @git-hcLee This commit f268feb might address your issue.
