The quantized model is not performant #35
Comments
This was implemented inefficiently due to the complexity of implementing act-order and groupsize at the same time. This is also why I recommend triton in general.
Thanks, that actually solved my problem. It seems triton moves the whole model to VRAM, which makes sense why it's faster; I wasn't aware that the default cuda version uses VRAM + DRAM, no wonder it's slow. I was working on an embedding project, and being able to load a large model in small VRAM really helped, since most people would not like to feed sensitive data to an OpenAI model. BTW, there is maybe a typo in the warning message when I try to load it.
It does, and I've just pushed a PR to fix the typo: #40
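For anyone else landing here, this is a minimal sketch of what loading an already-quantized model with the triton backend looks like, assuming a recent AutoGPTQ install with triton available; the model paths and prompt are placeholders, not the exact ones from this thread:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

base_model = "eachadea/vicuna-13b-1.1"   # original model, used here for the tokenizer
quantized_dir = "vicuna-13b-4bit-128g"   # hypothetical path to the quantized output

tokenizer = AutoTokenizer.from_pretrained(base_model)

# use_triton=True selects the triton kernels and keeps the model in VRAM;
# use_triton=False falls back to the default CUDA kernels
model = AutoGPTQForCausalLM.from_quantized(
    quantized_dir,
    device="cuda:0",
    use_triton=True,
)

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```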
I'm not sure if I'm the only one or not. I used this to quantize two models, one is
models--eachadea--vicuna-13b-1.1
and the other is models--decapoda-research--llama-7b-hf.
Both quantize fine, but when I try to run inference with them, they are very slow: token generation is slow and sometimes it just gets stuck at 100% GPU usage, and I have to Ctrl-C.
This is where it gets stuck.
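For reference, a minimal sketch of the AutoGPTQ quantize-then-save flow being described here; the 4-bit / group-size-128 settings, the output directory name, and the one-line calibration sample are assumptions for illustration, not necessarily the exact settings used:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "decapoda-research/llama-7b-hf"
quantized_dir = "llama-7b-4bit-128g"  # hypothetical output directory

tokenizer = AutoTokenizer.from_pretrained(base_model)

# assumed settings: 4-bit weights, group size 128, no act-order
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# tiny calibration sample just for illustration; a real run needs more data
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library.", return_tensors="pt")
]
model.quantize(examples)
model.save_quantized(quantized_dir)
```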
Since I came from the GPTQ-for-LLaMa cuda branch, I noticed that the old cuda branch fork is quite performant.
Both vicuna-13b-GPTQ-4bit-128g and gpt4-x-alpaca-13b-native-4bit-128g were quantized by that old cuda branch of GPTQ-for-LLaMa, and they are fast. I'm wondering what changed.
Neither model can be loaded by AutoGPTQ because of some layer issue; they can only be loaded with the old cuda branch of GPTQ-for-LLaMa, installed via
python setup_cuda.py install