the model quantized is not performant #35

Closed
cxfcxf opened this issue Apr 30, 2023 · 3 comments

cxfcxf commented Apr 30, 2023

I'm not sure if I'm the only one seeing this.

I used this to quantize two models: models--eachadea--vicuna-13b-1.1 and models--decapoda-research--llama-7b-hf.

Both quantize fine, but when I try to run inference with them they are very slow: token generation is slow, and sometimes it just gets stuck at 100% GPU usage until I Ctrl-C.
This is where it gets stuck:

>>> from transformers import pipeline
>>> generate = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0, max_length=512)
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
>>> generate("write openai ceo an email to address about how important is to opensource gpt-4")
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1109, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1116, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1015, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 251, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 269, in generate
    return self.model.generate(**kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 157, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear.py", line 200, in forward
    ).to(torch.int16 if self.bits == 8 else torch.int8)
KeyboardInterrupt
>>>
>>> quit()

Since I came from the GPTQ-for-LLaMa CUDA branch, I noticed that the old CUDA-branch fork is pretty performant.

Both vicuna-13b-GPTQ-4bit-128g and gpt4-x-alpaca-13b-native-4bit-128g were quantized with that old CUDA branch of GPTQ-for-LLaMa, and they are fast. I'm wondering what changed.

Neither model can be loaded by AutoGPTQ because of a layer issue; they can only be loaded after installing the old CUDA branch of GPTQ-for-LLaMa with python setup_cuda.py install.
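For reference, this is roughly how I load the quantized model before building the pipeline (a minimal sketch; the directory path is a placeholder, and use_safetensors depends on how the model was saved):

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

# placeholder path to the quantized model directory
quantized_dir = "path/to/vicuna-13b-1.1-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(quantized_dir, use_fast=True)

# default CUDA kernel path (use_triton not set); this is the setup that is slow for me
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0", use_safetensors=True)

generate = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0, max_length=512)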

qwopqwop200 (Collaborator)

This was implemented inefficiently due to the complexity of implementing act-order and groupsize at the same time. This is also why I recommend triton in general.
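For example, loading with the triton backend looks roughly like this (a sketch; the model path is a placeholder):

from auto_gptq import AutoGPTQForCausalLM

# use_triton=True selects the triton kernels instead of the default CUDA kernels;
# note this moves the whole model onto the GPU, so enough VRAM is required
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",  # placeholder
    device="cuda:0",
    use_triton=True,
    use_safetensors=True,
)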

cxfcxf (Author) commented Apr 30, 2023

> This was implemented inefficiently due to the complexity of implementing act-order and groupsize at the same time. This is also why I recommend triton in general.

Thanks, that actually solved my problem. It seems triton moves the whole model to VRAM, which makes sense given that it is faster; I was not aware that the default CUDA version uses VRAM + DRAM, so no wonder it is slow.
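In case it helps anyone else, this is roughly how I checked where the weights ended up after loading (a sketch; the AutoGPTQ wrapper keeps the underlying transformers model in .model, as the traceback above shows):

import itertools
import torch

# collect the devices holding the model's tensors; include buffers as well as
# parameters, since the quantized weights may be stored as buffers
devices = {t.device for t in itertools.chain(model.model.parameters(), model.model.buffers())}
print(devices)
print(f"{torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB allocated on cuda:0")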

I am working on an embeddings project, and being able to load a large model with limited VRAM really helps, since most people would rather not feed sensitive data to an OpenAI model.

By the way, there may be a typo in the warning message when I try to load it:
WARNING - use_triton will force moving the hole model to GPU, make sure you have enough VRAM.
This should say "whole", right?

TheBloke (Contributor) commented May 1, 2023

> By the way, there may be a typo in the warning message when I try to load it:
> WARNING - use_triton will force moving the hole model to GPU, make sure you have enough VRAM.
> This should say "whole", right?

It does, and I've just pushed a PR to fix the typo: #40
