
Recent PR #23 generating jammed sentence output with new quantized neox20b 4-bit model #26

Closed
GenTxt opened this issue Apr 27, 2023 · 4 comments

Comments

@GenTxt

GenTxt commented Apr 27, 2023

#23

neox-20b 4-bit models quantized with the above generate jammed sentences, as in the example below.

The smell of tobacco smoke in theseemingly ceaseless breeze which swept through during these conversationswas unmistakable evidence of his presence to anyone who heard that faintpunctual pungency overrode any other possible olfactory suggestion; butRobert could sense without really seeing more than once how he etc.

Previous main version using same seed generates above correctly as:

The smell of tobacco smoke in the seemingly ceaseless breeze which swept through during these conversations was unmistakable evidence of his presence to anyone who heard that faint punctual pungency overrode any other possible olfactory suggestion; but Robert could sense without really seeing more than once how he etc.
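For reference, a minimal sketch of how such a same-seed comparison can be run with the standard transformers generation API; the model path and prompt below are placeholders, not the exact setup used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "models/neox20b_8192_safe"        # placeholder path
PROMPT = "The smell of tobacco smoke in the"  # placeholder prompt

torch.manual_seed(0)  # fix the seed so both runs sample identically

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```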

@qwopqwop200
Collaborator

qwopqwop200 commented Apr 27, 2023

If this is true, it's very strange. I wrote the code so that the results should not change.

@PanQiWei
Collaborator

@GenTxt can you share your quantization code and model with us so that we can try to reproduce this and figure out what went wrong?

Also, you may try the up-to-date commit on the main branch; maybe it will solve your problem.

@GenTxt
Author

GenTxt commented Apr 27, 2023

https://huggingface.co/kz919/gpt-neox-20b-8k-longtuning/tree/main

Converted the above to safetensors with the text-generation-webui script.
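(The webui script essentially re-saves the checkpoint weights in the safetensors format; a generic sketch of that step with the safetensors library, assuming a single-file checkpoint for simplicity, would look like this. It is not the actual webui script, and the paths are placeholders:)

```python
import torch
from safetensors.torch import save_file

# Placeholder paths; real neox-20b checkpoints are sharded across several files.
state_dict = torch.load("models/neox20b/pytorch_model.bin", map_location="cpu")
# safetensors requires contiguous tensors with no shared storage between entries.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "models/neox20b_8192_safe/model.safetensors")
```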

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/neox20b_8192_safe --quantized_model_dir 4bit_converted --bits 4 --group_size 128 --fast_tokenizer --save_and_reload
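Roughly, that command drives the AutoGPTQ API with these settings (a simplified sketch; the real quant_with_alpaca.py also builds its calibration examples from the Alpaca dataset, and the example prompt below is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_dir = "models/neox20b_8192_safe"
quantized_dir = "4bit_converted"

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)  # --fast_tokenizer
model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)

# Placeholder calibration data; quant_with_alpaca.py uses Alpaca prompts here.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized(quantized_dir)
```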

Old models were deleted, as the current Triton kernel can cause errors on the refurbished 6000 (see triton-lang/triton#1556).

For the specific code above, the error:

Occurs on the NVIDIA GeForce RTX 2080 Ti (similar to the original 6000 - gpu1)
Doesn't occur on the NVIDIA GeForce RTX 3090 (works fine on the same setup - gpu0)

Quantized with the latest CUDA main and I'm no longer encountering the error. False alarm. Closing here and will test each update carefully.

Thanks

@GenTxt GenTxt closed this as completed Apr 27, 2023
@PeiyuZ-star


Hi, I've also tried neox20b quantization. The inference speed I got is 16 tokens/s, which isn't fast enough. May I ask what results you got?
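For comparison, a rough way to measure tokens/s (a minimal sketch assuming the model was quantized with AutoGPTQ and is loaded via from_quantized; the path and prompt are placeholders, and throughput depends heavily on the kernel, GPU, and generation settings):

```python
import time
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "4bit_converted"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(quantized_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

inputs = tokenizer("The smell of tobacco smoke", return_tensors="pt").to("cuda:0")
torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```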
