Why does the loss fluctuate wildly during fine-tuning? #39

Fine-tuning 13B LLaMA on a 700k-sample open-source corpus, with 10k of it held out as a test set: why does the fine-tuning loss start at around 1.0, then swing between very large values and 0? Is it because the test set is too small relative to the fine-tuning set?

Comments
It's a V100. Back when I was setting up the environment I already felt that bitsandbytes wasn't behaving normally, but I don't know GPUs very well. Since fine-tuning has never seemed right, I haven't gotten to generate and interaction yet.
By rights, with load_in_8bit=True a V100 should throw an error when running the script — it always did for me. It only behaved after I changed it to load_in_8bit=False and used .half() to load the model for training; otherwise the 32GB V100 reports OOM.
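For reference, a minimal sketch of this workaround (the checkpoint path and loading arguments are assumptions, in the style of alpaca-lora-like finetune scripts):

```python
import torch
from transformers import LlamaForCausalLM

# Load in fp16 instead of int8: load_in_8bit=False sidesteps the bitsandbytes
# int8 path that misbehaves on V100, and .half() keeps the weights in fp16 so
# a 32GB V100 does not OOM on a full fp32 copy.
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",  # placeholder checkpoint path
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map="auto",
).half()
```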
OK, I'll give that a try, thanks. Can I ask what error the V100 gave you when load_in_8bit=True? I also get error messages, but fine-tuning still runs — it's just that the loss is wrong. Oddly, after switching to an A100 the error messages look the same, yet according to the hardware specs the A100 does support int8 ~
No error — it looked like a strange bug, but I quickly realized load_in_8bit was probably set to True, and changing it fixed things. At inference time, setting load_in_8bit in generate.py doesn't raise an error either.
@alisyzhu Try removing prepare_model_for_int8_training(model) and adding model.half() after get_peft_model.
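A sketch of that change, continuing from the fp16 model loaded above (the LoRA hyperparameters are illustrative, not from this thread):

```python
from peft import LoraConfig, get_peft_model
# from peft import prepare_model_for_int8_training  # removed, per the suggestion

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.half()  # cast to fp16 *after* wrapping with LoRA, as suggested above
```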
I changed the model's half() part as described above, but it still OOMs, sigh ~

With load_in_8bit=False, can a V100 just not run the 13B model at all? It keeps OOMing.
@alisyzhu I seem to have forgotten to ask what micro batch size you're using — you could try setting it to 1.
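In alpaca-lora-style scripts the micro batch size usually trades off against gradient accumulation so the effective batch size stays constant; a sketch of the arithmetic (variable names and the default of 128 are assumptions):

```python
BATCH_SIZE = 128       # effective batch size (assumed value)
MICRO_BATCH_SIZE = 1   # per-step batch on the GPU; 1 minimizes activation memory
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # = 128
```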
A V100 can train 7B; it's just that int8 is very slow (the int8 path inside Hugging Face may be the slow part).
This may be due to hardware reasons. On some hardware, the quantized model is not compatible with fp16. You can try setting fp16=False; it works for me.
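A minimal sketch of that suggestion via the Hugging Face Trainer arguments (every value other than fp16 here is illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-out",          # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=128,
    learning_rate=3e-4,
    fp16=False,  # disable fp16 if the quantized model is unstable with it
)
```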