Question about single-machine multi-GPU training hanging #49

Closed
zhoujx4 opened this issue Apr 9, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@zhoujx4

zhoujx4 commented Apr 9, 2023

Hello, when training on multiple GPUs, i.e. running bash finetune.sh, should I be able to see the training progress?
image

@Facico
Owner

Facico commented Apr 9, 2023

Yes, you should be able to see it.

@zhoujx4
Author

zhoujx4 commented Apr 9, 2023

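Have you tried the single-machine multi-GPU case (not multi-machine multi-GPU)? On a single machine with multiple GPUs, bash finetune.sh hangs for me. There is no error, and no training loss is ever logged.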

@Facico
Owner

Facico commented Apr 9, 2023

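Our program currently runs as single-machine multi-GPU. Do you see the data-loading screen on your side? My guess is that it is stuck at data loading.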

@Facico
Owner

Facico commented Apr 9, 2023

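If it is stuck at data loading, a likely cause is that you are using our earlier version of the data, which was not in UTF-8 format and does not show readable Chinese; that version can cause problems on some systems. Check whether your data shows normal Chinese characters. If it does not, you can refer to this issue, or pull the current dataset from Hugging Face or the network drive.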
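One way to run the check described above is to test whether the data file decodes as UTF-8 at the byte level. The following is a minimal sketch, not the project's own tooling; the function name `find_non_utf8_lines` and the file name in the usage note are hypothetical. It only detects invalid UTF-8 bytes, so a file that is valid UTF-8 but still unreadable for another reason would pass.

```python
def find_non_utf8_lines(path, max_report=5):
    """Return (line_number, error_message) pairs for lines that fail UTF-8 decoding.

    Hypothetical helper for illustration; it reads the file as raw bytes so that
    the platform's default text encoding does not affect the result.
    """
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                bad.append((lineno, str(exc)))
                if len(bad) >= max_report:
                    break  # a few examples are enough to confirm the problem
    return bad
```

For example, `find_non_utf8_lines("your_data.json")` (a placeholder path) returns an empty list when every line decodes cleanly.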

@zhoujx4 zhoujx4 changed the title from "Question about visualizing multi-GPU distributed training" to "Question about single-machine multi-GPU training hanging" Apr 9, 2023
@zhoujx4
Author

zhoujx4 commented Apr 9, 2023

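Our program currently runs as single-machine multi-GPU. Do you see the data-loading screen on your side? My guess is that it is stuck at data loading.

Hello, I tried it, and it does not seem to be stuck at data loading.
Single-machine single-GPU training runs fine, as shown below:
image
But single-machine multi-GPU training hangs. There is no error; it just stays stuck, as shown below. Could it be a problem with the torchrun arguments?
image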

@Facico
Owner

Facico commented Apr 9, 2023

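If data loading is fine and only the multi-GPU run has the problem, check the following:
1. torchrun has a bug under "pytorch3.11" (presumably Python 3.11); try switching to a different version.
2. Confirm that the GPUs were actually assigned as intended and that none of those cards is faulty; while the job runs, use nvidia-smi to check GPU memory usage.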
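The second check can be scripted so that it is easy to repeat while the job hangs. This is a rough sketch that assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names (`parse_gpu_rows`, `query_gpus`, `idle_gpus`) and the 100 MiB threshold are illustrative choices, not anything from this repository.

```python
import shutil
import subprocess

QUERY = "index,name,memory.used,memory.total"


def parse_gpu_rows(csv_text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    gpus = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Index is first, the two memory values are last; the name is in between.
        gpus.append({
            "index": int(fields[0]),
            "name": ",".join(fields[1:-2]),
            "used_mib": int(fields[-2]),
            "total_mib": int(fields[-1]),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi if it is available; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_rows(out)


def idle_gpus(gpus, threshold_mib=100):
    """Indices of GPUs using less memory than the threshold (assumed idle)."""
    return [g["index"] for g in gpus if g["used_mib"] < threshold_mib]
```

Calling `idle_gpus(query_gpus())` during a run lists the cards that hold almost no memory. If a card that the job was supposed to use shows up there, it was probably never picked up by the launch.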

@zhoujx4
Author

zhoujx4 commented Apr 10, 2023

Found the problem, and it is solved. My machine is a single node with 8 A6000 cards.
Changing the BIOS settings to disable ACS fixed it.
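The thread does not say how ACS was checked or which BIOS option was changed. As a rough aid, NVIDIA's NCCL troubleshooting guide suggests running `sudo lspci -vvv | grep ACSCtl` and treating lines that show `SrcValid+` as a sign that PCIe ACS may be enabled. The sketch below applies that same test to captured `lspci -vvv` output; the function name `acs_enabled_lines` is hypothetical, and the exact output layout is an assumption that may vary between systems.

```python
def acs_enabled_lines(lspci_text):
    """Return ACSCtl lines from `lspci -vvv` output that show SrcValid+.

    Hypothetical helper: a non-empty result suggests ACS may be enabled on at
    least one PCIe bridge. An empty result is not proof that ACS is off, since
    lspci hides capability details when it is run without root privileges.
    """
    return [
        line.strip()
        for line in lspci_text.splitlines()
        if "ACSCtl:" in line and "SrcValid+" in line
    ]
```

To use it, save the output of `sudo lspci -vvv` to a file, read the file into a string, and pass the string to `acs_enabled_lines`.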

@Facico Facico added the bug Something isn't working label Apr 11, 2023
@Facico Facico closed this as completed Apr 11, 2023
@cbzhao79

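Found the problem, and it is solved. My machine is a single node with 8 A6000 cards. Changing the BIOS settings to disable ACS fixed it.

Could you explain how you solved it? I am running into the same problem right now. Thank you very much!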
