
Pretraining dataset #34

Closed
rabintang opened this issue Mar 12, 2024 · 2 comments


@rabintang

Hi, author. Why does the pretraining-stage data also use the prompt/response format?
My understanding is that pretraining should be unsupervised language learning. In terms of results, what difference does feeding raw natural-language text directly make compared with feeding prompt/response-formatted data?
What was your reason for organizing the pretraining data this way?

@charent
Owner

charent commented Mar 12, 2024

T5 is an encoder-decoder model that does conditional generation. Pretraining it unsupervised on a causal-LM-style format would require major changes to the training code.
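
To make the contrast concrete, here is a minimal sketch (not this repo's actual training code; the checkpoint name and example strings are placeholders) of how a <prompt, response> pair drives conditional generation in a T5-style model with Hugging Face transformers: the prompt goes through the encoder, and the response supplies the decoder's labels.

```python
# Minimal sketch: conditional generation on a <prompt, response> pair.
# "t5-small" and the example strings are illustrative placeholders.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = "What is the capital of France?"     # encoder input
response = "The capital of France is Paris."  # decoder target

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(response, return_tensors="pt").input_ids

# Cross-entropy loss of the decoder predicting `labels`, conditioned on
# the encoder's representation of `prompt`.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
```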

Using the <prompt, response> format for pretraining was something of an experiment. As for the effect, I haven't run an ablation study, so I can't draw a conclusion.

If you need a decoder-only model pretrained unsupervised, please see the project Phi2-mini-Chinese.
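
For contrast, a causal-LM setup (as in decoder-only projects like Phi2-mini-Chinese) needs no <prompt, response> split during pretraining: raw text serves as its own label, and the model learns to predict each next token. A minimal sketch, assuming a generic Hugging Face causal-LM checkpoint ("gpt2" is just a stand-in):

```python
# Minimal sketch: one unsupervised causal-LM pretraining step on raw text.
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Any raw natural-language text works here.",
          return_tensors="pt").input_ids

# labels == input_ids: the model shifts them right by one position
# internally, so the loss is next-token prediction over the sequence.
loss = lm(input_ids=ids, labels=ids).loss
loss.backward()
```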

Why the pretraining data was organized as text-to-text: it was inspired by Microsoft's phi series of models, which are pretrained on high-quality datasets with very good results, and high-quality text-to-text data is the easiest to obtain.

@rabintang
Author

Thanks for the reply!

@charent charent closed this as completed Mar 13, 2024