Loss is Nan #2

Open
lonleyodd opened this issue Jul 29, 2024 · 7 comments

Comments

@lonleyodd

lonleyodd commented Jul 29, 2024

Thanks for your work. However, the loss becomes NaN during the training stage when I follow the Usage steps in the README. Here is the error output:

[2024-07-25 11:44:08 root](affinequant.py 159): INFO === Start quantize layer 3 ===
[2024-07-25 11:44:48 root](affinequant.py 282): INFO layer 3 iter 0 loss:0.012547658756375313 norm:0.005398863460868597 max memory_allocated 25510.10009765625
[2024-07-25 11:45:24 root](affinequant.py 282): INFO layer 3 iter 1 loss:0.010688988491892815 norm:0.002033534459769726 max memory_allocated 25510.10009765625
[2024-07-25 11:46:00 root](affinequant.py 282): INFO layer 3 iter 2 loss:0.010013626888394356 norm:0.0012934970436617732 max memory_allocated 25510.10009765625
[2024-07-25 11:46:35 root](affinequant.py 282): INFO layer 3 iter 3 loss:0.009677225723862648 norm:0.0008282792987301946 max memory_allocated 25510.10009765625
[2024-07-25 11:47:11 root](affinequant.py 282): INFO layer 3 iter 4 loss:0.009517742320895195 norm:0.00047518432256765664 max memory_allocated 25510.10009765625
[2024-07-25 11:47:47 root](affinequant.py 282): INFO layer 3 iter 5 loss:0.009451627731323242 norm:0.00032417610054835677 max memory_allocated 25510.10009765625
[2024-07-25 11:48:23 root](affinequant.py 282): INFO layer 3 iter 6 loss:0.009420475922524929 norm:0.0002503570867702365 max memory_allocated 25510.10009765625
[2024-07-25 11:48:58 root](affinequant.py 282): INFO layer 3 iter 7 loss:0.009402991272509098 norm:0.0002355567121412605 max memory_allocated 25510.10009765625
[2024-07-25 11:49:34 root](affinequant.py 282): INFO layer 3 iter 8 loss:0.009390492923557758 norm:0.00022407056530937552 max memory_allocated 25510.10009765625
[2024-07-25 11:50:10 root](affinequant.py 282): INFO layer 3 iter 9 loss:0.00937829539179802 norm:0.00022252841154113412 max memory_allocated 25510.10009765625
[2024-07-25 11:50:47 root](affinequant.py 282): INFO layer 3 iter 10 loss:0.00936979427933693 norm:0.00040410575456917286 max memory_allocated 25510.10009765625
[2024-07-25 11:51:23 root](affinequant.py 282): INFO layer 3 iter 11 loss:0.009373501874506474 norm:0.0007587362779304385 max memory_allocated 25510.10009765625
[2024-07-25 11:51:59 root](affinequant.py 282): INFO layer 3 iter 12 loss:0.009358054026961327 norm:0.0008263972704298794 max memory_allocated 25510.10009765625
[2024-07-25 11:52:35 root](affinequant.py 282): INFO layer 3 iter 13 loss:0.009706217795610428 norm:0.004431357607245445 max memory_allocated 25510.10009765625
[2024-07-25 11:53:12 root](affinequant.py 282): INFO layer 3 iter 14 loss:0.009572970680892467 norm:0.001220526173710823 max memory_allocated 25510.10009765625
[2024-07-25 11:53:48 root](affinequant.py 282): INFO layer 3 iter 15 loss:0.009499846026301384 norm:0.0015964835183694959 max memory_allocated 25510.10009765625
[2024-07-25 11:54:25 root](affinequant.py 282): INFO layer 3 iter 16 loss:0.009589510038495064 norm:0.0031670858152210712 max memory_allocated 25510.10009765625
[2024-07-25 11:55:02 root](affinequant.py 282): INFO layer 3 iter 17 loss:0.009425442665815353 norm:0.0016205854481086135 max memory_allocated 25510.10009765625
[2024-07-25 11:55:14 root](affinequant.py 272): INFO Loss is NAN, stopping training

/data/users/wangh/code/AffineQuant/quantize/affinequant.py(275)affinequant()
-> loss_list.append(loss.data)
(Pdb)

This is my bash script. How can I fix this?

CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /data2/models/Llama-2/llama-2-7b-hf \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
    --tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande

@bobma-bytedance
Collaborator

Hello,

You can try reducing the learning rate for the learnable transformation parameters (e.g., halving it). Additionally, you can try adjusting the initialization parameter --alpha within the range 0.5 to 0.75.

Best regards.
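For reference, a minimal sketch of how those adjustments might look on the command line. The flag names --let_lr and --alpha, and the 5e-3 default being halved, are assumptions based on OmniQuant-style argument parsers, not confirmed from this repository; check main.py for the exact names and defaults.

# Hedged sketch: halve the learning rate of the learnable transformation (LET)
# parameters and raise the initialization alpha.
# --let_lr and --alpha are assumed OmniQuant-style flags; verify them in main.py.
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /data2/models/Llama-2/llama-2-7b-hf \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
    --let_lr 2.5e-3 --alpha 0.6 \
    --tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande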

@bobma-bytedance
Collaborator

I have provided our training log, where you can compare hyperparameters and the changes in loss, among other information.
log_rank0_1706942168.txt

@lonleyodd
Author

Thanks, I will try it again following your hyperparameters.

@lonleyodd
Author

lonleyodd commented Aug 26, 2024

@bobma-bytedance Well, I trained the model following your config, but it hits the same NaN-loss error. From layer 0 onward my loss and norm are slightly lower than in your log. The only difference is that your act scales were created by OmniQuant, which I don't think should matter.
log_rank0_1723116003.txt

@bobma-bytedance
Collaborator

It is best to use the scale and shift files provided by OmniQuant, which are initialized with the outlier suppression method. We have tried generating the scale and shift with generate_act_scale_shift.py, and it also results in a NaN loss, the same as you observed.
I'm not quite sure what hyperparameters are needed to generate a usable scale and shift.
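As a hedged sketch, training could be pointed at the scale/shift files released with OmniQuant instead of the ones produced locally by generate_act_scale_shift.py. The --act-scales/--act-shifts flag names and the ./act_scales/, ./act_shifts/ file paths below follow OmniQuant's conventions and are assumptions to verify against this repo's main.py.

# Hedged sketch: reuse OmniQuant's released activation scales and shifts rather than
# regenerating them with generate_act_scale_shift.py.
# Flag names and file paths are assumptions borrowed from OmniQuant; verify in main.py.
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /data2/models/Llama-2/llama-2-7b-hf \
    --act-scales ./act_scales/Llama-2-7b-hf.pt \
    --act-shifts ./act_shifts/Llama-2-7b-hf.pt \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
    --tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande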

@lonleyodd
Author

Using the act scale and shift files provided by OmniQuant was indeed the fix; after training I have successfully replicated the results presented in your paper. Thanks a lot, best regards!

@bobma-bytedance
Collaborator

You're welcome. Feel free to ask any questions you may have.
