The quantized model is not performant #35
Comments
This was implemented inefficiently due to the complexity of implementing act-order and groupsize at the same time. This is also why I recommend triton in general.
Thanks, that actually solved my problem. It seems triton moves the whole model to VRAM, which makes sense why it's faster; I wasn't aware that the default cuda version uses VRAM + DRAM, no wonder it's slow. I was working on an embedding project, and being able to load a large model in small VRAM really helped, since most people would not like to feed sensitive data to an OpenAI model. BTW, there is maybe a typo in the warning message when I try to load it.
It does, and I've just pushed a PR to fix the typo: #40
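For anyone else landing here, this is a minimal sketch of what loading an already-quantized model with the triton backend looks like, assuming a recent AutoGPTQ install with triton available; the model paths and prompt are placeholders, not the exact ones from this thread:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

base_model = "eachadea/vicuna-13b-1.1"   # original model, used here for the tokenizer
quantized_dir = "vicuna-13b-4bit-128g"   # hypothetical path to the quantized output

tokenizer = AutoTokenizer.from_pretrained(base_model)

# use_triton=True selects the triton kernels and keeps the model in VRAM;
# use_triton=False falls back to the default CUDA kernels
model = AutoGPTQForCausalLM.from_quantized(
    quantized_dir,
    device="cuda:0",
    use_triton=True,
)

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```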
I'm not sure if I'm the only one or not. I used this to quantize two models, one is
models--eachadea--vicuna-13b-1.1
and the other is models--decapoda-research--llama-7b-hf.
Both quantize fine, but when I try to run inference with them, they are very slow: token generation is slow and sometimes it just gets stuck at 100% GPU usage, and I have to Ctrl-C.
This is where it gets stuck.
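For reference, a minimal sketch of the AutoGPTQ quantize-then-save flow being described here; the 4-bit / group-size-128 settings, the output directory name, and the one-line calibration sample are assumptions for illustration, not necessarily the exact settings used:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "decapoda-research/llama-7b-hf"
quantized_dir = "llama-7b-4bit-128g"  # hypothetical output directory

tokenizer = AutoTokenizer.from_pretrained(base_model)

# assumed settings: 4-bit weights, group size 128, no act-order
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# tiny calibration sample just for illustration; a real run needs more data
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library.", return_tensors="pt")
]
model.quantize(examples)
model.save_quantized(quantized_dir)
```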
Since I came from the GPTQ-for-LLaMa cuda branch, I noticed that the old cuda branch fork is quite performant.
Both vicuna-13b-GPTQ-4bit-128g and gpt4-x-alpaca-13b-native-4bit-128g were quantized by that old cuda branch of GPTQ-for-LLaMa, and they are fast. I'm wondering what changed.
Neither model can be loaded by AutoGPTQ because of some layer issue; they can only be loaded with the old cuda branch of GPTQ-for-LLaMa, installed via
python setup_cuda.py install