
Recent PR #23 generating jammed sentence output with new quantized neox20b 4-bit model #26

Closed
GenTxt opened this issue Apr 27, 2023 · 4 comments

Comments

@GenTxt

GenTxt commented Apr 27, 2023

#23

neox-20b 4-bit models quantized with the above generate jammed sentences, as in the example below.

The smell of tobacco smoke in theseemingly ceaseless breeze which swept through during these conversationswas unmistakable evidence of his presence to anyone who heard that faintpunctual pungency overrode any other possible olfactory suggestion; butRobert could sense without really seeing more than once how he etc.

Previous main version using same seed generates above correctly as:

The smell of tobacco smoke in the seemingly ceaseless breeze which swept through during these conversations was unmistakable evidence of his presence to anyone who heard that faint punctual pungency overrode any other possible olfactory suggestion; but Robert could sense without really seeing more than once how he etc.
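For reference, a minimal sketch of how such a same-seed comparison can be run with the standard transformers generation API; the model path and prompt below are placeholders, not the exact setup used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "models/neox20b_8192_safe"        # placeholder path
PROMPT = "The smell of tobacco smoke in the"  # placeholder prompt

torch.manual_seed(0)  # fix the seed so both runs sample identically

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```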

@qwopqwop200
Collaborator

qwopqwop200 commented Apr 27, 2023

If this is true, it's very strange. I wrote the code so that the results should not change.

@PanQiWei
Collaborator

@GenTxt can you share your quantization code and model with us so that we can try to reproduce this and figure out what went wrong?

Also, you may try the up-to-date commit on the main branch; maybe it will solve your problem.

@GenTxt
Author

GenTxt commented Apr 27, 2023

https://huggingface.co/kz919/gpt-neox-20b-8k-longtuning/tree/main

Converted the above to safetensors with the text-generation-webui script.
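(The webui script essentially re-saves the checkpoint weights in the safetensors format; a generic sketch of that step with the safetensors library, assuming a single-file checkpoint for simplicity, would look like this. It is not the actual webui script, and the paths are placeholders:)

```python
import torch
from safetensors.torch import save_file

# Placeholder paths; real neox-20b checkpoints are sharded across several files.
state_dict = torch.load("models/neox20b/pytorch_model.bin", map_location="cpu")
# safetensors requires contiguous tensors with no shared storage between entries.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "models/neox20b_8192_safe/model.safetensors")
```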

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/neox20b_8192_safe --quantized_model_dir 4bit_converted --bits 4 --group_size 128 --fast_tokenizer --save_and_reload
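Roughly, that command drives the AutoGPTQ API with these settings (a simplified sketch; the real quant_with_alpaca.py also builds its calibration examples from the Alpaca dataset, and the example prompt below is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_dir = "models/neox20b_8192_safe"
quantized_dir = "4bit_converted"

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)  # --fast_tokenizer
model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)

# Placeholder calibration data; quant_with_alpaca.py uses Alpaca prompts here.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized(quantized_dir)
```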

Old models were deleted, as the current Triton kernel can cause errors on the refurbished 6000 (see triton-lang/triton#1556).

For the specific code above, the error:

Occurs on the NVIDIA GeForce RTX 2080 Ti (similar to the original 6000 - gpu1)
Doesn't occur on the NVIDIA GeForce RTX 3090 (works fine on the same setup - gpu0)

Quantized with the latest CUDA main and I'm no longer encountering the error. False alarm. Closing here and will test each update carefully.

Thanks

@GenTxt GenTxt closed this as completed Apr 27, 2023
@PeiyuZ-star


Hi, I've also tried neox20b quantization. The inference speed I got is 16 tokens/s, which isn't fast enough. May I ask what results you got?
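For comparison, a rough way to measure tokens/s (a minimal sketch assuming the model was quantized with AutoGPTQ and is loaded via from_quantized; the path and prompt are placeholders, and throughput depends heavily on the kernel, GPU, and generation settings):

```python
import time
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "4bit_converted"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(quantized_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

inputs = tokenizer("The smell of tobacco smoke", return_tensors="pt").to("cuda:0")
torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```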
