Problem during training model? #14
Comments
I have the same problem. Please share the solution if you find one.
Same problem here. I found this and tried Python 3.7.1, but then I can't compile spconv.
Same problem. @Vegeta2020, could you share your exact running environment? Thanks.
@FireflyGao From your traceback, the problem seems to come from deepcopy, though I haven't run into it before. @WWW2323 In my environment, both Python 3.7 and 3.6 work. My codebase is based on Det3D and CIA-SSD, so you may refer to those codebases for more details. For this issue, how about replacing the deepcopy with a different function, or with one you write yourself (see the update paras function in trainer_ssd)?
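To illustrate the suggestion above: the traceback ends in `TypeError: can't pickle _thread.RLock objects`, which is the kind of error `copy.deepcopy` raises when some object reachable from the model holds a thread lock. The sketch below reproduces that with a toy class and then avoids it by building a second instance and copying only the weights. `ToyModel`, its `lock` attribute, and `copy_weights` are hypothetical names for illustration, not part of SE-SSD or Det3D, and which attribute holds the lock in the real model is an assumption.

```python
import copy
import threading


class ToyModel:
    """Hypothetical stand-in for a detector that holds a non-copyable attribute."""

    def __init__(self):
        self.weights = {"w": [1.0, 2.0], "b": [0.5]}
        # Assumption: something reachable from the real model (for example a
        # logger handler) owns a lock like this one, which deepcopy cannot copy.
        self.lock = threading.RLock()


def copy_weights(src, dst):
    """Copy only the numeric state from src into dst; other attributes are untouched."""
    dst.weights = {name: list(values) for name, values in src.weights.items()}


model = ToyModel()
model.weights["w"][0] = 3.0  # pretend some training has happened

try:
    copy.deepcopy(model)
    deepcopy_failed = False
except TypeError:
    # Same failure class as in the traceback: the RLock cannot be pickled.
    deepcopy_failed = True

# Workaround: construct a fresh instance, then transfer the weights.
model_ema = ToyModel()
copy_weights(model, model_ema)
```

The point of the design is that only the state you actually need (the weights) crosses over, so locks, loggers, and other process-bound objects never have to be copied.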
Same error. Can anyone provide a proper solution?
@Kzmc-China Instead of copying the model, you can try building model_ema directly:
    model_ema = build_detector(cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)
    for param in model_ema.parameters():
        param.detach_()
Then load the weights and distribute the model in parallel as usual.
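A self-contained sketch of this approach, assuming PyTorch is installed. `build_toy_detector` is a hypothetical stand-in for `build_detector(cfg.model, ...)`, and the `update_ema` rule with its `decay` value is the usual exponential-moving-average update, assumed here rather than taken from the repository's trainer.

```python
import torch
import torch.nn as nn


def build_toy_detector():
    # Hypothetical stand-in for build_detector(cfg.model, ...).
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


model = build_toy_detector()      # student, trained by backprop
model_ema = build_toy_detector()  # teacher, built fresh instead of deepcopy(model)

# Start the teacher from the student's current weights.
model_ema.load_state_dict(model.state_dict())

# The teacher is not trained by backprop, so detach its parameters.
for param in model_ema.parameters():
    param.detach_()


@torch.no_grad()
def update_ema(student, teacher, decay=0.999):
    """Assumed EMA rule: teacher = decay * teacher + (1 - decay) * student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```

In a training loop, `update_ema(model, model_ema)` would typically be called after each optimizer step. Because the second model is constructed rather than copied, nothing attached to the first model needs to be picklable.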
Thank you very much. I have solved it with your method.
@maudzung Thanks for your method; it solved the problem for me.
It works for me too. Thanks.
Hi @maudzung, have you already reproduced the paper's results on the val dataset? If so, could you give me some advice on how to do it, and tell me which pretrained model you used? Thank you very much.
Hi @maudzung, after making this change I got "RuntimeError: Expected object of device type cuda but got device type cpu for argument #2 'mat2' in call to _th_mm_out". #63
@bingo830422 Please try "model_ema = model_ema.cuda()", which puts the model on the GPU.
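A minimal sketch of why this helps, assuming PyTorch is installed: a freshly built module lives on the CPU, so it has to be moved to the same device as its inputs before the first forward pass. The `nn.Linear` layer is a hypothetical stand-in for the detector, and the CPU fallback is only there so the sketch runs on machines without a GPU.

```python
import torch
import torch.nn as nn

# Use the GPU when one is visible; fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_ema = nn.Linear(4, 2)       # freshly built modules start on the CPU
model_ema = model_ema.to(device)  # same effect as .cuda() when device is a GPU

x = torch.randn(3, 4, device=device)
out = model_ema(x)                # inputs and weights now share a device
```

If the input batch is on the GPU while the weights are still on the CPU, the matrix multiply inside the layer fails with a device mismatch like the one quoted above.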
Thanks for your work and project!
When I was training the model, I got the following error:
python3 -m torch.distributed.launch --nproc_per_node=4 train.py
True
True
True
True
/home/firefly/project/se-ssd/se-ssd_vegeta2020/SE-SSD/examples/second/configs/config.py
/home/firefly/project/se-ssd/se-ssd_vegeta2020/SE-SSD/examples/second/configs/config.py
/home/firefly/project/se-ssd/se-ssd_vegeta2020/SE-SSD/examples/second/configs/config.py
/home/firefly/project/se-ssd/se-ssd_vegeta2020/SE-SSD/examples/second/configs/config.py
2021-07-20 06:57:03,695 - INFO - Distributed training: True
2021-07-20 06:57:03,695 - INFO - torch.backends.cudnn.benchmark: False
2021-07-20 06:57:03,746 - INFO - Finish RPN Initialization
2021-07-20 06:57:03,747 - INFO - num_classes: [1], num_preds: [14], num_dirs: [4]
2021-07-20 06:57:03,748 - INFO - Finish MultiGroupHead Initialization
2021-07-20 06:57:08,602 - INFO - {'Car': 5}
2021-07-20 06:57:08,603 - INFO - [-1]
2021-07-20 06:57:08,675 - INFO - load 2207 Pedestrian database infos
2021-07-20 06:57:08,675 - INFO - load 14357 Car database infos
2021-07-20 06:57:08,675 - INFO - load 734 Cyclist database infos
2021-07-20 06:57:08,676 - INFO - load 1297 Van database infos
2021-07-20 06:57:08,676 - INFO - load 488 Truck database infos
2021-07-20 06:57:08,676 - INFO - load 224 Tram database infos
2021-07-20 06:57:08,676 - INFO - load 337 Misc database infos
2021-07-20 06:57:08,676 - INFO - load 56 Person_sitting database infos
2021-07-20 06:57:08,706 - INFO - After filter database:
2021-07-20 06:57:08,706 - INFO - load 2104 Pedestrian database infos
2021-07-20 06:57:08,706 - INFO - load 10520 Car database infos
2021-07-20 06:57:08,706 - INFO - load 594 Cyclist database infos
2021-07-20 06:57:08,706 - INFO - load 826 Van database infos
2021-07-20 06:57:08,706 - INFO - load 321 Truck database infos
2021-07-20 06:57:08,706 - INFO - load 199 Tram database infos
2021-07-20 06:57:08,706 - INFO - load 259 Misc database infos
2021-07-20 06:57:08,706 - INFO - load 53 Person_sitting database infos
2021-07-20 06:57:08,825 - INFO - {'Car': 5}
......
......
~/se-ssd_vegeta2020/SE-SSD/det3d/torchie/apis/train_sessd.py", line 301, in train_detector
model_ema = copy.deepcopy(model)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 306, in _reconstruct
value = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 306, in _reconstruct
value = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 215, in _deepcopy_list
append(deepcopy(a, memo))
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 169, in deepcopy
rv = reductor(4)
TypeError: can't pickle _thread.RLock objects
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.
How can I solve this problem? Could you give me some tips? Thanks!