
Releases: c0sogi/LLMChat

v1.1.3.4.1

03 Jul 17:29

Hotfix

  • Fixed error loading LlamaTokenizer
  • Added a script that automatically builds the cuBLAS DLL on Windows when importing llama_cpp from the llama-cpp-python repository

v1.1.3.4

03 Jul 14:02

Exllama support

A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It uses PyTorch and SentencePiece to run the model.

It is assumed to work only in a local environment, and at least one NVIDIA CUDA GPU is required. You have to download the tokenizer, config, and GPTQ files from Hugging Face and put them in the llama_models/gptq/YOUR_MODEL_FOLDER folder.

Define an LLMModel in app/models/llms.py. There are a few examples, so you can easily define your own model. Refer to the exllama repository for more detailed information: https://github.com/turboderp/exllama

Important!

NVIDIA GPU only. To use an exllama model, you have to install PyTorch and SentencePiece manually and define an ExllamaModel in llms.py.
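
A minimal sketch of what an ExllamaModel definition in app/models/llms.py might look like. The field names below (name, model_path, max_total_tokens) and the import path are illustrative assumptions; check the existing examples in llms.py for the actual attributes:

```python
# Hypothetical sketch only: the field names are assumptions, not the project's actual API.
from app.models.llms import ExllamaModel  # assumed import path

my_gptq_model = ExllamaModel(
    name="my-gptq-model",                              # display name (assumption)
    model_path="llama_models/gptq/YOUR_MODEL_FOLDER",  # folder with tokenizer, config, and GPTQ weights
    max_total_tokens=2048,                             # context window size (assumption)
)
```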

v1.1.3.3

29 Jun 07:51
  1. The underlying Llama.cpp API server process that drives the local LLM model is now monitored automatically. The previous IPC approach, which used Queue and Event in the existing process pool, has been replaced with a more flexible network-based communication method.

  2. Local embeddings via a Llama.cpp model or a Hugging Face embedding model. For the former, set the embedding=True option when defining a LlamaCppModel. For the latter, install PyTorch additionally and set a Hugging Face repository such as intfloat/e5-large-v2 as the value of LOCAL_EMBEDDING_MODEL in the .env file. A sketch of both options follows this list.
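
A hedged sketch of both options described in item 2. The LlamaCppModel field names other than embedding are assumptions, and the .env key is the one named above:

```python
# Option 1: local embeddings via a Llama.cpp model.
# Field names other than embedding are assumptions; see app/models/llms.py for the real ones.
from app.models.llms import LlamaCppModel  # assumed import path

my_llama_model = LlamaCppModel(
    name="my-llama-model",                        # display name (assumption)
    model_path="llama_models/ggml/my-model.bin",  # hypothetical GGML model path
    embedding=True,                               # enable local embeddings via Llama.cpp
)

# Option 2: Hugging Face embeddings (requires PyTorch); set this in the .env file:
# LOCAL_EMBEDDING_MODEL=intfloat/e5-large-v2
```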

v1.1.3.2

15 Jun 01:14

The default web browsing mode is now Full browsing.

  • Full browsing: Clicks links and scrolls through webpages based on the provided query. This consumes a lot of tokens.
  • Light browsing: Composes an answer based on search engine snippets for the provided query. This consumes fewer tokens.

v1.1.3.1

12 Jun 15:07

Added Web browsing mode (beta). It is available by enabling the Browse toggle button. By default, browsing is performed using the DuckDuckGo search engine.

v1.1.3.0

01 Jun 03:08
  1. The way the chat message list is loaded from Redis has been changed from eager loading to lazy loading. All of the user's chat profiles are loaded first, and the messages are loaded only when they enter a chat. This dramatically reduces the initial loading time if you already have a large list of messages.

  2. You can set a User role, AI role, and System role for each LLM. For OpenAI's ChatGPT, user, assistant, and system are used by default. For other LLaMA models, you can set other kinds of roles, which can help the LLM recognize the conversation roles.

  3. Auto summarization is now applied. By default, when you send or receive a long message of 512 tokens or more, a summarization background task runs for that message and, when it finishes, the result is quietly saved to the message list. The summarized content is invisible to the user, but when sending messages to the LLM, the summarized message is passed along instead, which can yield huge savings in token usage (and cost).

  4. To overcome the performance limitations of the Redis vectorstore (single-threaded) and replace the inaccurate KNN similarity search with cosine similarity search, the Qdrant vectorstore was introduced. It enables fast asynchronous vector queries in microseconds via gRPC, a low-level API (a minimal sketch follows this list).
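
As referenced in item 4, a minimal sketch of an asynchronous cosine-similarity query over gRPC with the qdrant-client package; the host, collection name, and vector size are assumptions for illustration, not the project's actual configuration:

```python
import asyncio

from qdrant_client import AsyncQdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams


async def main() -> None:
    # prefer_grpc=True routes requests over Qdrant's low-level gRPC API.
    client = AsyncQdrantClient(host="localhost", prefer_grpc=True)

    # Hypothetical collection using cosine similarity (name and size are assumptions).
    await client.recreate_collection(
        collection_name="chat-embeddings",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )
    await client.upsert(
        collection_name="chat-embeddings",
        points=[PointStruct(id=1, vector=[0.1] * 1024, payload={"text": "hello"})],
    )

    # Asynchronous nearest-neighbour query by cosine similarity.
    hits = await client.search(
        collection_name="chat-embeddings",
        query_vector=[0.1] * 1024,
        limit=3,
    )
    print(hits)


asyncio.run(main())
```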

v1.1.2.1

25 May 11:39
  1. PC/Mobile responsive frontend
  2. Refactored text generation code
  3. Added an admin status field to the MySQL Users table.

v1.1.1

20 May 08:21
  1. Supports dropdown chat model selection.
  2. Added admin console endpoint /admin.
  3. The vectorstore is no longer shared across all accounts. Every account has its own vectorstore but shares a public database, into which documents can be embedded with the /share command.
  4. Added a token status box to the frontend.
  5. LLaMA supports GPU offloading when using cuBLAS (see the sketch after this list).
  6. The /query command no longer puts the queried texts into the chat context. The queried data is only used to generate the current response.
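
A minimal llama-cpp-python sketch of the GPU offloading mentioned in item 5, assuming the package was built with cuBLAS; the model path and layer count are assumptions:

```python
from llama_cpp import Llama

# With llama-cpp-python compiled against cuBLAS, n_gpu_layers controls how many
# transformer layers are offloaded to the GPU (the path below is hypothetical).
llm = Llama(
    model_path="llama_models/ggml/my-model.bin",
    n_gpu_layers=30,  # assumption: tune to your VRAM; 0 keeps everything on the CPU
)
print(llm("Q: What is LLMChat? A:", max_tokens=32)["choices"][0]["text"])
```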

v1.1.0

17 May 06:18
Removed `gpt` terms and changed the project name.

v1.0.1

15 May 14:45
New features: editable chat title, copy to clipboard.