issue:merge with wenet training code #559

Closed
CaRRotOne opened this issue Mar 15, 2022 · 0 comments

I have successfully trained ResNet50 Alg on 2*4 GPUs with bagua. Now I am trying to merge bagua into the training code of wenet, and I changed the training code following the examples and tutorials. After the change, the program hangs without printing any message. I debugged the code and found that it hangs at the point shown in the ipdb session below. It seems to call into a Rust module.
I am not familiar with Rust. Please give me some information that would help solve the problem.

> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(339)get_communicator()
    338         device_id=get_local_rank(),
--> 339         stream_ptr=pg.stream.cuda_stream,
    340         nccl_unique_id_str=nccl_unique_id,

2022-03-15 06:19:07,718 DEBUG Using selector: EpollSelector
ipdb> s
> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(340)get_communicator()
    339         stream_ptr=pg.stream.cuda_stream,
--> 340         nccl_unique_id_str=nccl_unique_id,
    341     )

2022-03-15 06:19:09,976 DEBUG Using selector: EpollSelector
ipdb> pp nccl_unique_id
'AgDQScCooVoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA='
2022-03-15 06:19:21,795 DEBUG Using selector: EpollSelector
ipdb> s
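
The session above stops after the last `s` with no further output, so it only shows where one process is stuck. A minimal sketch of one way to collect the Python stack from every process without attaching a debugger to each, using only the standard-library `faulthandler` module. The helper names `install_hang_dump` and `cancel_hang_dump`, the log file name, and the use of a `RANK` environment variable are assumptions for illustration, not part of bagua or wenet.

```python
import faulthandler


def install_hang_dump(timeout_s, log_path):
    """Arrange for the stacks of all threads to be written to log_path
    if the process is still running after timeout_s seconds.

    Hypothetical helper: the caller must keep the returned file open
    until the dump has happened or has been cancelled.
    """
    log_file = open(log_path, "w")
    # A watchdog thread writes the traceback once the timeout expires;
    # exit=False lets the process keep running afterwards.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=log_file, exit=False
    )
    return log_file


def cancel_hang_dump(log_file):
    """Cancel the pending dump once the suspect call has returned."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Possible usage around the suspect call (RANK is an assumed variable
# set by the launcher; adjust to whatever your launcher provides):
#
#   import os
#   rank = os.environ.get("RANK", "0")
#   f = install_hang_dump(120, f"hang_rank{rank}.log")
#   ...  # code that eventually reaches get_communicator()
#   cancel_hang_dump(f)
```

Comparing the per-process logs shows whether every process reached `get_communicator()` or whether some are stuck earlier. The dump covers Python frames only, so time spent inside the native module appears as the Python line that called it.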

CaRRotOne changed the title from "merge with training code issue" to "issue:merge with wenet training code" on Mar 18, 2022