Why does the loss swing between very large and very small during fine-tuning? #39

alisyzhu · 2023-04-06T03:13:26Z

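I am fine-tuning 13B LLaMA on 700k open-source samples, with 10k of them held out as the test set. Why does the fine-tuning loss start at around 1.0 and then alternate between very large values and 0? Is it because the test set is too small relative to the fine-tuning set?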

Facico · 2023-04-06T03:44:46Z

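@alisyzhu
Thanks for raising this. It may be a bitsandbytes problem: loading in 8-bit on a V100 makes the loss blow up very easily. Is your GPU a V100? We had not noticed this issue before. The V100 does not support 8-bit tensor cores, as shown in the figure below.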

Related issues: issue1, issue2, issue3

One more question: can you get normal output from our test scripts such as generate and interaction? (Our inference scripts also load in 8-bit automatically.)

If you are on a V100, you can change load_in_8bit to False, and the loss should be much less likely to blow up.
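The choice of load_in_8bit could also be made from the GPU's compute capability instead of by hand. The sketch below is only an illustration: the helper names are hypothetical, and the 7.5 threshold rests on the assumption that INT8 tensor cores first appeared with Turing (compute capability 7.5), while the V100 (Volta) reports 7.0.

```python
from typing import Optional, Tuple

# Assumption: INT8 tensor cores first appeared with compute capability 7.5
# (Turing); the V100 (Volta) reports 7.0.
MIN_INT8_CAPABILITY: Tuple[int, int] = (7, 5)


def should_load_in_8bit(capability: Tuple[int, int]) -> bool:
    """Return True if the (major, minor) capability meets the INT8 threshold."""
    return tuple(capability) >= MIN_INT8_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Query the first CUDA device, or return None if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected; cannot decide automatically.")
    else:
        print(f"capability={cap}, load_in_8bit={should_load_in_8bit(cap)}")
```

The returned flag would then be passed as the load_in_8bit argument wherever the model is loaded.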

alisyzhu · 2023-04-06T07:08:59Z

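Yes, it is a V100. While setting up the environment I already felt that bitsandbytes was not behaving quite right, but I don't know much about GPUs. Because fine-tuning never looked right, I have not tried generate or interaction yet.
If I can only use V100s for fine-tuning and generation, is it enough to set load_in_8bit to False everywhere? Do I still need the bitsandbytes dependency in that case?
Also, if I no longer use 8-bit, do I still need to run the scripts under tools to quantize LLaMA afterwards? I have never understood when quantization is needed or what it is for. My understanding is that it just makes the model smaller?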

alisyzhu · 2023-04-06T07:33:47Z

After setting load_in_8bit to False in fine-tune, the 4 GPUs report

rookiebird · 2023-04-06T08:47:08Z

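In principle, with load_in_8bit=True a V100 should raise an error when running the program. I kept getting errors, and things only worked after I switched to load_in_8bit=False. I then used .half() to load the model for training; otherwise a 32 GB V100 goes OOM.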

alisyzhu · 2023-04-06T08:49:59Z

But after switching to load_in_8bit=False I get OOM, and I still get OOM with 8 V100s. How many V100s do you have, and what data size and batch_size let you run it?
How do you apply .half()?

alisyzhu · 2023-04-06T08:53:52Z

Is this how this part of finetune.py should be modified?

rookiebird · 2023-04-06T08:55:30Z

I am not using 8 V100s. I have two, each a 32 GB V100. Yes, change it the way you did. My micro_batch_size is 4, which I believe is also the default, and I did not change batch_size.

alisyzhu · 2023-04-06T09:04:16Z

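OK, I'll try it, thanks. What error did the V100 give you earlier with load_in_8bit=True? I also see bug messages, but fine-tuning still runs and only the loss is wrong. When I switched to an A100 the bug messages seemed unchanged, even though according to the hardware specs the A100 does support int8.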

rookiebird · 2023-04-06T09:25:09Z

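It did not raise an error as such. It looked like a strange bug, but I quickly realized it was probably because load_in_8bit was set to True, and changing it fixed things. For inference, setting load_in_8bit in generate.py does not produce an error either.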

Facico · 2023-04-06T09:27:06Z

@alisyzhu Try removing prepare_model_for_int8_training(model), then add model.half() after get_peft_model.
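A minimal sketch of what that change might look like. This is not the project's actual finetune.py: the function name, the LoRA hyperparameters, the fp16 torch_dtype and device_map="auto" are all assumptions made for illustration.

```python
def build_fp16_lora_model(base_model: str):
    """Load LLaMA without 8-bit quantization and wrap it with LoRA in fp16."""
    # Imports are local so the sketch can be read without the libraries installed.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import LlamaForCausalLM

    model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=False,         # skip the bitsandbytes int8 path
        torch_dtype=torch.float16,  # assumption: load the weights in fp16
        device_map="auto",          # assumption: let accelerate place the layers
    )
    # prepare_model_for_int8_training(model) is intentionally not called here.

    lora_config = LoraConfig(       # hypothetical hyperparameters
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.half()                    # cast the added LoRA weights to fp16 too
    return model
```

Calling build_fp16_lora_model with the path of the base checkpoint returns a model that can be handed to the trainer in place of the 8-bit one.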

alisyzhu · 2023-04-06T09:28:33Z

Is it this kind of bug message?

I changed the model's half part as described above, but I still get OOM. So frustrating.

alisyzhu · 2023-04-06T09:33:43Z

It still does not work.

alisyzhu · 2023-04-06T09:46:36Z

Does load_in_8bit=False mean a V100 simply cannot run the 13B model? It keeps going OOM.

Facico · 2023-04-06T09:56:43Z

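@alisyzhu We have not run this on a V100. As a rough estimate, 7B with 8-bit is about 1 GB per 1B parameters, and 13B with 16-bit may be about 4 GB per 1B parameters. You can take the GPU memory you used earlier with 8-bit enabled (even though it was misbehaving) and multiply it by 2 to get an estimate.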
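For a lower bound, the memory taken by the weights alone can be computed directly. The helper below is a hypothetical illustration and counts only the parameters. Activations, gradients and optimizer state come on top, which may be why the rough per-billion figures quoted for training are higher.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


print(weight_memory_gb(7, 8))    # 7B in 8-bit   -> 7.0
print(weight_memory_gb(13, 8))   # 13B in 8-bit  -> 13.0
print(weight_memory_gb(13, 16))  # 13B in 16-bit -> 26.0
```

At 26 GB, the fp16 weights of a 13B model already take most of a single 32 GB V100 before any training overhead is counted.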

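If you really need to run it, you will need DeepSpeed with ZeRO-2/3 offload, which is easy to add on top of the Hugging Face Trainer, so you can give it a try (these blog posts are a useful reference). Because code with ZeRO offload runs very slowly, we did not consider it at the time; we will add it if there is demand later. And if you manage to add it and get it running, we would be very grateful for your contribution.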
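As a starting point, a ZeRO-2 configuration with CPU optimizer offload might look like the sketch below. The helper names and the file name are hypothetical, and the values are a minimal assumption rather than a tuned setup. The "auto" entries rely on the Hugging Face Trainer filling them in from its own arguments.

```python
import json


def make_zero2_offload_config() -> dict:
    """Build a DeepSpeed ZeRO-2 config that offloads optimizer state to CPU."""
    return {
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
        "fp16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
    }


def write_config(path: str = "ds_zero2_offload.json") -> str:
    """Write the config to disk and return the path (file name is hypothetical)."""
    with open(path, "w") as f:
        json.dump(make_zero2_offload_config(), f, indent=2)
    return path
```

The resulting file (or the dict itself) can be passed through the deepspeed argument of transformers.TrainingArguments, with the script started by the deepspeed launcher.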

Facico · 2023-04-08T17:00:53Z

@alisyzhu I think I forgot to ask what micro batch size you are using. You can try setting it to 1.
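Lowering the micro batch size reduces per-step memory while the effective batch size can stay the same through gradient accumulation. The sketch below assumes the training script derives the accumulation steps as batch_size divided by micro_batch_size. The function name and the batch size of 128 are illustrative assumptions, not values confirmed in this thread.

```python
def gradient_accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    if micro_batch_size < 1 or batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must be a positive multiple of micro_batch_size")
    return batch_size // micro_batch_size


# Hypothetical batch size of 128: micro batch 4 vs. micro batch 1.
print(gradient_accumulation_steps(128, 4))  # 32
print(gradient_accumulation_steps(128, 1))  # 128
```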

lucasjinreal · 2023-05-26T11:02:28Z

7B can be trained on a V100; int8 is just too slow (possibly because int8 is slow in Hugging Face).
A combination that does train is fp16 loading + DeepSpeed offload.

lyccyl1 · 2023-12-11T02:26:39Z

This may be due to hardware. On some hardware, the quantized model is not compatible with fp16. You can try setting fp16=False. It works for me.

Facico mentioned this issue Apr 6, 2023

Fine-tuning the 13B LLaMA model gives a very large loss and the lr starts at 0. Is this normal? #32

Closed

Facico added the bug label Apr 6, 2023

EricHou89 mentioned this issue Apr 7, 2023

Abnormal loss when doing LoRA fine-tuning on V100 GPUs LianjiaTech/BELLE#122

Closed

hiyouga referenced this issue in hiyouga/ChatGLM-Efficient-Tuning Apr 18, 2023

support quantization

4a90242

Facico mentioned this issue Apr 19, 2023

Length 256 #84

Closed

Facico added the good first issue label Apr 21, 2023

Facico closed this as completed Apr 24, 2023

Facico mentioned this issue Apr 26, 2023

About training stopping unexpectedly midway #98

Closed

xianghuisun mentioned this issue May 23, 2023

expected scalar type Half but found Float LianjiaTech/BELLE#389

Closed

uuser0748 mentioned this issue May 26, 2023

Cannot use int8 training on V100 LianjiaTech/BELLE#406

Closed

jianghushinian mentioned this issue Jun 8, 2023

Why does my LoRA model trained on kaggle.com perform quite well, but perform poorly when downloaded and run for inference locally? #217

Closed
