[Question] Pretraining time and pretraining data #43
Comments
A rough estimate: for a 7B model trained on 1.2 trillion tokens with 1000 A800s at 0.58 utilization, one epoch takes about 4 days.
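That estimate can be sanity-checked with the common "6 × parameters × tokens" FLOPs rule of thumb. The numbers below are a back-of-the-envelope sketch, assuming dense BF16 compute at the A800's nominal peak; the exact training recipe is not stated in the thread.

```python
# Back-of-the-envelope check of the training-time estimate above,
# using the ~6 * params * tokens approximation for transformer training FLOPs.
params = 7e9          # 7B model
tokens = 1.2e12       # 1.2 trillion tokens, one epoch
gpus = 1000           # A800 cards
peak_flops = 312e12   # assumed A800 peak BF16 throughput (FLOP/s)
mfu = 0.58            # reported utilization

total_flops = 6 * params * tokens         # ~5.04e22 FLOPs for one epoch
cluster_flops = gpus * peak_flops * mfu   # effective cluster FLOP/s
days = total_flops / cluster_flops / 86400
print(f"{days:.1f} days")                 # roughly 3-4 days, consistent with the comment
```

The result lands near 3.2 days, which matches the "about 4 days" figure once checkpointing and other overheads are accounted for.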
Judging from the config, it looks like pure data parallelism. Was tensor parallelism not enabled?
My guess is that tensor and pipeline parallelism were enabled; otherwise it would be hard to reach 0.58 utilization.
Pipeline parallelism shouldn't be necessary for a 7B model; if TP is enabled, it's probably only 2. With a sequence length of 4096 and an inferred global batch size of 4M, the micro batch size and gradient accumulation steps are both 1, so on a thousand cards it should be pure DP, unless it's actually 2000 cards...
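The batch-size arithmetic in the comment above can be sketched as follows. Note the 4M-token global batch is the commenter's inference, not a confirmed config value:

```python
# Sketch of the reasoning: if the global batch is ~4M tokens and each
# sequence is 4096 tokens, how many data-parallel ranks does that imply?
global_batch_tokens = 4 * 1024 * 1024   # assumed 4M-token global batch
seq_len = 4096
micro_batch = 1                          # assumed per-rank micro batch size
grad_accum = 1                           # assumed gradient accumulation steps

sequences_per_step = global_batch_tokens // seq_len           # 1024 sequences
dp_degree = sequences_per_step // (micro_batch * grad_accum)  # 1024 ranks
print(dp_degree)  # 1024, i.e. roughly "a thousand cards" of pure DP
```

This is why the comment concludes the cluster is pure DP: 1024 sequences per step at micro batch 1 with no accumulation requires about 1000 ranks with no model parallelism left over.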
A 7B model on 80 GB GPUs doesn't need TP; sharding=8 within each node is enough, with pure DP across nodes. That should give the best training speed and throughput.
I'd like to ask: this code seems to load the data into memory all at once. If the dataset is very large, say 1.4T tokens, which is roughly 5 TB of data, won't it exceed available memory?
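A common way around this concern (a sketch, not this repo's actual loader) is to memory-map the pre-tokenized data so the OS pages it in on demand instead of reading all of it into RAM. The file name `train.bin` and the uint16 token dtype below are assumptions for illustration:

```python
import numpy as np

# Sketch: memory-map a pre-tokenized binary file (e.g. "train.bin") so that
# only the pages actually touched are loaded, instead of ~5 TB of RAM.
def sample_batch(path, seq_len=4096, batch_size=8, dtype=np.uint16):
    data = np.memmap(path, dtype=dtype, mode="r")  # no full read into memory
    starts = np.random.randint(0, len(data) - seq_len, size=batch_size)
    # Copy out just the sampled windows, widening to int64 for the model.
    return np.stack([data[s : s + seq_len].astype(np.int64) for s in starts])
```

With this pattern, resident memory stays proportional to the batch size rather than the corpus size, so dataset scale is limited by disk, not RAM.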
Thanks for the reply!