
v1.1.3.4

@c0sogi released this 03 Jul 14:02 · 15 commits to master since this release

Exllama support

A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It uses PyTorch and SentencePiece to run the model.

It is assumed to work only in a local environment, and at least one NVIDIA CUDA GPU is required. You have to download the tokenizer, config, and GPTQ weight files from Hugging Face and put them in the `llama_models/gptq/YOUR_MODEL_FOLDER` folder, as in the sketch below.
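
For example, here is a minimal sketch of fetching those files with the `huggingface_hub` package. The repo id is a placeholder, and the `allow_patterns` entries are assumptions about which files a typical GPTQ repo ships; adjust them to the actual repository you use.

```python
# Sketch: download a GPTQ model's files into the expected local folder.
# SOME_USER/SOME_GPTQ_MODEL is a placeholder repo id; the allow_patterns
# are assumptions (tokenizer, config, and quantized weight files).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="SOME_USER/SOME_GPTQ_MODEL",
    local_dir="llama_models/gptq/YOUR_MODEL_FOLDER",
    allow_patterns=["*.json", "*.model", "*.safetensors"],
)
```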

Define an `LLMModel` in `app/models/llms.py`. There are a few examples there, so you can easily define your own model; a rough sketch follows below. Refer to the exllama repository for more detailed information: https://github.com/turboderp/exllama
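
As a rough illustration only: the actual class and its fields live in `app/models/llms.py`, and the field names below are assumptions, so mirror one of the existing examples in that file rather than this sketch.

```python
# Illustrative stand-in for a model definition in app/models/llms.py.
# Field names here are assumptions; copy an existing example from llms.py.
from dataclasses import dataclass

@dataclass
class ExllamaModel:
    name: str              # display name of the model
    model_path: str        # folder name under llama_models/gptq/
    max_total_tokens: int  # context window size

my_gptq_model = ExllamaModel(
    name="my-gptq-model",
    model_path="YOUR_MODEL_FOLDER",
    max_total_tokens=2048,
)
```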

Important!

NVIDIA GPU only. To use an exllama model, you have to install PyTorch and SentencePiece manually (e.g. `pip install torch sentencepiece`; the exact command depends on your CUDA version) and define an `ExllamaModel` in `llms.py`.