Loss is Nan #2

Open
lonleyodd opened this issue Jul 29, 2024 · 7 comments

Comments

@lonleyodd

lonleyodd commented Jul 29, 2024

Thanks for your work. However, the loss becomes NaN during the training stage when I follow the Usage steps in the README. Here is the error output:

[2024-07-25 11:44:08 root](affinequant.py 159): INFO === Start quantize layer 3 ===
[2024-07-25 11:44:48 root](affinequant.py 282): INFO layer 3 iter 0 loss:0.012547658756375313 norm:0.005398863460868597 max memory_allocated 25510.10009765625
[2024-07-25 11:45:24 root](affinequant.py 282): INFO layer 3 iter 1 loss:0.010688988491892815 norm:0.002033534459769726 max memory_allocated 25510.10009765625
[2024-07-25 11:46:00 root](affinequant.py 282): INFO layer 3 iter 2 loss:0.010013626888394356 norm:0.0012934970436617732 max memory_allocated 25510.10009765625
[2024-07-25 11:46:35 root](affinequant.py 282): INFO layer 3 iter 3 loss:0.009677225723862648 norm:0.0008282792987301946 max memory_allocated 25510.10009765625
[2024-07-25 11:47:11 root](affinequant.py 282): INFO layer 3 iter 4 loss:0.009517742320895195 norm:0.00047518432256765664 max memory_allocated 25510.10009765625
[2024-07-25 11:47:47 root](affinequant.py 282): INFO layer 3 iter 5 loss:0.009451627731323242 norm:0.00032417610054835677 max memory_allocated 25510.10009765625
[2024-07-25 11:48:23 root](affinequant.py 282): INFO layer 3 iter 6 loss:0.009420475922524929 norm:0.0002503570867702365 max memory_allocated 25510.10009765625
[2024-07-25 11:48:58 root](affinequant.py 282): INFO layer 3 iter 7 loss:0.009402991272509098 norm:0.0002355567121412605 max memory_allocated 25510.10009765625
[2024-07-25 11:49:34 root](affinequant.py 282): INFO layer 3 iter 8 loss:0.009390492923557758 norm:0.00022407056530937552 max memory_allocated 25510.10009765625
[2024-07-25 11:50:10 root](affinequant.py 282): INFO layer 3 iter 9 loss:0.00937829539179802 norm:0.00022252841154113412 max memory_allocated 25510.10009765625
[2024-07-25 11:50:47 root](affinequant.py 282): INFO layer 3 iter 10 loss:0.00936979427933693 norm:0.00040410575456917286 max memory_allocated 25510.10009765625
[2024-07-25 11:51:23 root](affinequant.py 282): INFO layer 3 iter 11 loss:0.009373501874506474 norm:0.0007587362779304385 max memory_allocated 25510.10009765625
[2024-07-25 11:51:59 root](affinequant.py 282): INFO layer 3 iter 12 loss:0.009358054026961327 norm:0.0008263972704298794 max memory_allocated 25510.10009765625
[2024-07-25 11:52:35 root](affinequant.py 282): INFO layer 3 iter 13 loss:0.009706217795610428 norm:0.004431357607245445 max memory_allocated 25510.10009765625
[2024-07-25 11:53:12 root](affinequant.py 282): INFO layer 3 iter 14 loss:0.009572970680892467 norm:0.001220526173710823 max memory_allocated 25510.10009765625
[2024-07-25 11:53:48 root](affinequant.py 282): INFO layer 3 iter 15 loss:0.009499846026301384 norm:0.0015964835183694959 max memory_allocated 25510.10009765625
[2024-07-25 11:54:25 root](affinequant.py 282): INFO layer 3 iter 16 loss:0.009589510038495064 norm:0.0031670858152210712 max memory_allocated 25510.10009765625
[2024-07-25 11:55:02 root](affinequant.py 282): INFO layer 3 iter 17 loss:0.009425442665815353 norm:0.0016205854481086135 max memory_allocated 25510.10009765625
[2024-07-25 11:55:14 root](affinequant.py 272): INFO Loss is NAN, stopping training

/data/users/wangh/code/AffineQuant/quantize/affinequant.py(275)affinequant()
-> loss_list.append(loss.data)
(Pdb)

This is my bash script. How can I fix this?

CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /data2/models/Llama-2/llama-2-7b-hf \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
    --tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande

@bobma-bytedance
Collaborator

Hello,

You can try reducing the learning rate for the learnable transformation parameters (e.g., halving it). Additionally, you can try adjusting the initialization parameter --alpha within the range 0.5 to 0.75.

Best regards.
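For reference, a minimal sketch of how those adjustments might look on the command line. The flag names --let_lr and --alpha, and the 5e-3 default being halved, are assumptions based on OmniQuant-style argument parsers, not confirmed from this repository; check main.py for the exact names and defaults.

# Hedged sketch: halve the learning rate of the learnable transformation (LET)
# parameters and raise the initialization alpha.
# --let_lr and --alpha are assumed OmniQuant-style flags; verify them in main.py.
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /data2/models/Llama-2/llama-2-7b-hf \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
    --let_lr 2.5e-3 --alpha 0.6 \
    --tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande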

@bobma-bytedance
Collaborator

I have provided our training log, where you can compare hyperparameters and the changes in loss, among other information.
log_rank0_1706942168.txt

@lonleyodd
Author

Thanks, I will try it again following your hyperparameters.

@lonleyodd
Author

lonleyodd commented Aug 26, 2024

@bobma-bytedance Well, I trained the model following your config, but it hits the same NaN-loss error. From layer 0 onward my loss and norm are slightly lower than in your log. The only difference is that your act scales were created by OmniQuant, which I don't think should matter.
log_rank0_1723116003.txt

@bobma-bytedance
Collaborator

It is best to use the scale and shift files provided by OmniQuant, which are initialized with the outlier suppression method. We have tried generating the scale and shift with generate_act_scale_shift.py, and it also results in a NaN loss, the same as you observed.
I'm not quite sure what hyperparameters are needed to generate a usable scale and shift.
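As a hedged sketch, training could be pointed at the scale/shift files released with OmniQuant instead of the ones produced locally by generate_act_scale_shift.py. The --act-scales/--act-shifts flag names and the ./act_scales/, ./act_shifts/ file paths below follow OmniQuant's conventions and are assumptions to verify against this repo's main.py.

# Hedged sketch: reuse OmniQuant's released activation scales and shifts rather than
# regenerating them with generate_act_scale_shift.py.
# Flag names and file paths are assumptions borrowed from OmniQuant; verify in main.py.
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /data2/models/Llama-2/llama-2-7b-hf \
    --act-scales ./act_scales/Llama-2-7b-hf.pt \
    --act-shifts ./act_shifts/Llama-2-7b-hf.pt \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
    --tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande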

@lonleyodd
Author

Using the act scale and shift files provided by OmniQuant was indeed the fix; after training I have successfully replicated the results presented in your paper. Thanks a lot, best regards!

@bobma-bytedance
Collaborator

You're welcome. Feel free to ask any questions you may have.
