
v1.1.3.4

@c0sogi released this 03 Jul 14:02 · 15 commits to master since this release

Exllama support

A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It uses PyTorch and SentencePiece to run the model.

It is assumed to work only in a local environment, and at least one NVIDIA CUDA GPU is required. You have to download the tokenizer, config, and GPTQ weight files from Hugging Face and put them in the `llama_models/gptq/YOUR_MODEL_FOLDER` folder, as in the sketch below.
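
For example, here is a minimal sketch of fetching those files with the `huggingface_hub` package. The repo id is a placeholder, and the `allow_patterns` entries are assumptions about which files a typical GPTQ repo ships; adjust them to the actual repository you use.

```python
# Sketch: download a GPTQ model's files into the expected local folder.
# SOME_USER/SOME_GPTQ_MODEL is a placeholder repo id; the allow_patterns
# are assumptions (tokenizer, config, and quantized weight files).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="SOME_USER/SOME_GPTQ_MODEL",
    local_dir="llama_models/gptq/YOUR_MODEL_FOLDER",
    allow_patterns=["*.json", "*.model", "*.safetensors"],
)
```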

Define an `LLMModel` in `app/models/llms.py`. There are a few examples there, so you can easily define your own model; a rough sketch follows below. Refer to the exllama repository for more detailed information: https://github.com/turboderp/exllama
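
As a rough illustration only: the actual class and its fields live in `app/models/llms.py`, and the field names below are assumptions, so mirror one of the existing examples in that file rather than this sketch.

```python
# Illustrative stand-in for a model definition in app/models/llms.py.
# Field names here are assumptions; copy an existing example from llms.py.
from dataclasses import dataclass

@dataclass
class ExllamaModel:
    name: str              # display name of the model
    model_path: str        # folder name under llama_models/gptq/
    max_total_tokens: int  # context window size

my_gptq_model = ExllamaModel(
    name="my-gptq-model",
    model_path="YOUR_MODEL_FOLDER",
    max_total_tokens=2048,
)
```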

Important!

NVIDIA GPU only. To use an exllama model, you have to install PyTorch and SentencePiece manually (e.g. `pip install torch sentencepiece`; the exact command depends on your CUDA version) and define an `ExllamaModel` in `llms.py`.