I have successfully trained ResNet50 with bagua on 2*4 GPUs. Now I am trying to integrate bagua into the wenet training code, and I changed the training code following the examples and tutorials. After the change, the program hangs without printing any message. I debugged the code and found that it hangs here (see the debugger trace below). It seems to call into a Rust module.
I am not familiar with Rust. Any pointers on how to solve this would be appreciated.
> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(339)get_communicator()
338 device_id=get_local_rank(),
--> 339 stream_ptr=pg.stream.cuda_stream,
340 nccl_unique_id_str=nccl_unique_id,
2022-03-15 06:19:07,718 DEBUG Using selector: EpollSelector
ipdb> s
> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(340)get_communicator()
339 stream_ptr=pg.stream.cuda_stream,
--> 340 nccl_unique_id_str=nccl_unique_id,
341 )
2022-03-15 06:19:09,976 DEBUG Using selector: EpollSelector
ipdb> pp nccl_unique_id
'AgDQScCooVoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA='
2022-03-15 06:19:21,795 DEBUG Using selector: EpollSelector
ipdb> s
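Since the last `s` never returns, it may help to see where every rank is blocked instead of stepping one process in ipdb. Below is a minimal sketch using Python's standard `faulthandler` module. It arms a watchdog that writes the stacks of all Python threads to a per-rank file if the process is still inside the guarded region after a timeout. The helper names `enable_hang_dump` and `disable_hang_dump` are made up for illustration. Reading the rank from a `RANK` environment variable is an assumption about the launcher, not something bagua is known to require.

```python
import faulthandler
import os


def enable_hang_dump(timeout_s, log_dir):
    """Arm a watchdog: if it is not cancelled within timeout_s seconds,
    the stacks of all Python threads are written to a per-rank log file."""
    # Assumption: the launcher exports RANK; fall back to "0" otherwise.
    rank = os.environ.get("RANK", "0")
    path = os.path.join(log_dir, "hang_rank{}.log".format(rank))
    # faulthandler writes through the file descriptor, so the file
    # object has to stay open until the watchdog is cancelled.
    log_file = open(path, "w")
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file, exit=False)
    return path, log_file


def disable_hang_dump(log_file):
    """Cancel the watchdog once the guarded region has finished."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Hypothetical usage around the suspected call in the training script:
#   path, f = enable_hang_dump(120, "/tmp")
#   ...  # bagua initialization / first communication step
#   disable_hang_dump(f)
```

The dump contains Python frames only, so a hang inside the native extension would show up as the Python line that called into it. Comparing the files from all ranks can show whether every process actually reached `get_communicator()`, or whether some rank is stuck earlier and never joined.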
CaRRotOne changed the title from "merge with training code issue" to "issue:merge with wenet training code" on Mar 18, 2022