
[Question] Pre-training time and pre-training data #43

Open · 5 tasks done
coye01 opened this issue Jun 16, 2023 · 6 comments
Labels: question (Further information is requested)

Comments

@coye01

coye01 commented Jun 16, 2023

Required prerequisites

Questions

  1. How long was the model trained on the thousand-GPU cluster?
  2. The README mentions pre-training on roughly 1.2T tokens; what is the language distribution of the data?
     Thanks for your reply!

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.
@coye01 coye01 added the question Further information is requested label Jun 16, 2023
@coye01 coye01 changed the title [Question] Pre-training time to [Question] Pre-training time and pre-training data Jun 16, 2023
@formath

formath commented Jun 16, 2023

A rough estimate: with a 7B model, 1.2 trillion tokens, 1,000 A800s, and 0.58 utilization, one epoch of training takes about 4 days.
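For reference, a minimal sketch of that estimate using the common 6·N·D training-FLOPs rule of thumb; the A800 peak throughput and the 0.58 utilization are assumptions carried over from this thread, not numbers confirmed by the maintainers:

```python
# Back-of-the-envelope check of the ~4-day figure (assumed, not official numbers).
params = 7e9            # 7B parameters
tokens = 1.2e12         # 1.2T training tokens
flops_needed = 6 * params * tokens           # ~5.0e22 FLOPs via the 6*N*D rule of thumb

gpus = 1000
peak_flops = 312e12     # assumed A800 BF16 dense peak, ~312 TFLOPS per GPU
utilization = 0.58      # utilization figure quoted in the comment above
cluster_flops = gpus * peak_flops * utilization

days = flops_needed / cluster_flops / 86400
print(f"~{days:.1f} days per epoch")         # ~3.2 days, consistent with "about 4 days"
```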

@miraclezqc

> A rough estimate: with a 7B model, 1.2 trillion tokens, 1,000 A800s, and 0.58 utilization, one epoch of training takes about 4 days.

The config looks like pure data parallelism; was tensor parallelism not enabled?

@formath

formath commented Jun 19, 2023

> A rough estimate: with a 7B model, 1.2 trillion tokens, 1,000 A800s, and 0.58 utilization, one epoch of training takes about 4 days.

> The config looks like pure data parallelism; was tensor parallelism not enabled?

My guess is that tensor and pipeline parallelism were enabled; otherwise it would be hard to reach 0.58 utilization.

@miraclezqc

> A rough estimate: with a 7B model, 1.2 trillion tokens, 1,000 A800s, and 0.58 utilization, one epoch of training takes about 4 days.

> The config looks like pure data parallelism; was tensor parallelism not enabled?

> My guess is that tensor and pipeline parallelism were enabled; otherwise it would be hard to reach 0.58 utilization.

A 7B model shouldn't need pipeline parallelism, and if TP is enabled it is probably only 2. With a seq length of 4096 and an inferred global batch size of 4M tokens, if micro batch size and gradient accumulation are both 1, then a thousand GPUs would mean pure DP, unless it's actually 2,000 GPUs...
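A quick sanity check of that arithmetic; the 4M-token global batch, micro batch size, and gradient accumulation values are the assumptions stated in the comment above:

```python
# How many GPUs the assumed batch configuration implies.
global_batch_tokens = 4 * 1024 * 1024        # assumed 4M-token global batch
seq_len = 4096
sequences_per_step = global_batch_tokens // seq_len          # 1024 sequences per step

micro_batch = 1
grad_accum = 1
dp_degree = sequences_per_step // (micro_batch * grad_accum) # 1024 data-parallel ranks

for tp in (1, 2):
    print(f"TP={tp}: {dp_degree * tp} GPUs")  # TP=1 -> 1024 (~1,000 GPUs, pure DP); TP=2 -> 2048
```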

@Luoyingfeng8

> A rough estimate: with a 7B model, 1.2 trillion tokens, 1,000 A800s, and 0.58 utilization, one epoch of training takes about 4 days.

> The config looks like pure data parallelism; was tensor parallelism not enabled?

> My guess is that tensor and pipeline parallelism were enabled; otherwise it would be hard to reach 0.58 utilization.

> A 7B model shouldn't need pipeline parallelism, and if TP is enabled it is probably only 2. With a seq length of 4096 and an inferred global batch size of 4M tokens, if micro batch size and gradient accumulation are both 1, then a thousand GPUs would mean pure DP, unless it's actually 2,000 GPUs...

A 7B model doesn't need TP on 80GB GPUs; sharding=8 within a node is enough, and across machines it is pure DP. That should give the best training speed and throughput.
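A rough memory estimate behind that suggestion, assuming the usual mixed-precision Adam state layout; the byte counts are assumptions, not measured figures from this repo:

```python
# Model-state memory for a 7B model, sharded ZeRO-style across one 8-GPU node.
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # fp16 weights + fp16 grads + fp32 master + Adam m + Adam v
model_states_gib = params * bytes_per_param / 2**30          # ~104 GiB in total

per_gpu_gib = model_states_gib / 8    # sharding=8 within a node
print(f"total ~{model_states_gib:.0f} GiB, ~{per_gpu_gib:.0f} GiB per GPU")
# ~13 GiB of model states per 80 GiB A800 leaves plenty of room for activations,
# which is why no tensor parallelism is needed at 7B scale.
```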

@mynewstart

I'd like to ask: does this code load all the data into memory at once? If the dataset is very large, 1.4T tokens is roughly 5TB of data; wouldn't that not fit in memory?
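In case it helps, one common way to stream a corpus of that size without holding it in RAM is a memory-mapped dataset; this is only a hypothetical sketch (path, dtype, and sequence length are placeholders), not a description of how this repository's loader actually works:

```python
# Memory-map the token file so the OS pages data in lazily instead of loading ~5TB into RAM.
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapTokenDataset(Dataset):
    def __init__(self, path: str, seq_len: int = 4096, dtype=np.uint16):
        # np.memmap keeps the tokens on disk and reads them on demand.
        self.tokens = np.memmap(path, dtype=dtype, mode="r")
        self.seq_len = seq_len

    def __len__(self):
        return len(self.tokens) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = np.asarray(self.tokens[start : start + self.seq_len], dtype=np.int64)
        return torch.from_numpy(chunk)
```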
