v1.1.3.4
Exllama support
A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It uses PyTorch and SentencePiece to run the model.
It is intended to run only in a local environment, and at least one NVIDIA CUDA GPU is required. You have to download the tokenizer, config, and GPTQ weight files from Hugging Face and put them in the llama_models/gptq/YOUR_MODEL_FOLDER folder, as sketched below.
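For example, the folder might look like this; a GPTQ download usually ships a SentencePiece tokenizer, a config.json, and the quantized weights, but the exact file names depend on the model you download:

```
llama_models/gptq/YOUR_MODEL_FOLDER/
├── config.json           # model config from the Hugging Face repo
├── tokenizer.model       # SentencePiece tokenizer
└── model.safetensors     # 4-bit GPTQ weights (name varies per model)
```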
Then define an LLMModel in app/models/llms.py. There are a few examples there, so you can easily define your own model; a rough sketch follows below. Refer to the exllama repository for more detailed information: https://github.com/turboderp/exllama
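As an illustration only, a model definition might look like the sketch below. The class and field names here are assumptions made for the sketch, not the project's actual API; copy one of the existing examples in app/models/llms.py for the real attribute names.

```python
from dataclasses import dataclass


@dataclass
class ExllamaModel:
    # Sketch of a model descriptor; the real class lives in app/models/llms.py.
    name: str              # identifier used to select the model
    model_path: str        # folder under llama_models/gptq/ with tokenizer/config/weights
    max_total_tokens: int  # context window of the model


# Hypothetical entry; point model_path at the folder you created above.
MY_GPTQ_MODEL = ExllamaModel(
    name="my-gptq-model",
    model_path="llama_models/gptq/YOUR_MODEL_FOLDER",
    max_total_tokens=2048,
)
```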
Important!
NVIDIA GPU only. To use an exllama model, you have to install PyTorch and SentencePiece manually and define an ExllamaModel in llms.py.
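For reference, both dependencies are published on PyPI; the exact torch wheel should match your CUDA version (see pytorch.org for the correct install command):

```bash
pip install torch sentencepiece
```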