# embeddings

This project binds to a Redis Stack server for persistent storage, so if you don't have Redis running, it will not work.
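
A quick way to verify the connection before running anything (a minimal sketch; it assumes the `redis` Python package is installed and Redis Stack listens on the default `localhost:6379`):

```python
import redis

# Raises redis.exceptions.ConnectionError if Redis Stack is not reachable.
r = redis.Redis(host="localhost", port=6379)
r.ping()
```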

## Prerequisites

You will need the CUDA extension installed and compatible with PyTorch version 2.0.0.
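
A quick sanity check (a sketch) to confirm the installed PyTorch version and that CUDA is usable:

```python
import torch

print(torch.__version__)          # expect something like "2.0.0+cu117"
print(torch.cuda.is_available())  # should print True, otherwise the CUDA kernels cannot run
```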

## Usage

The basic idea is to load a GPTQ model and run the embeddings/Q&A workflow against it locally, instead of requiring an OpenAI connection.

The GPTQ model is quantized to 4 bits with a group size of 128, which loses some precision but allows you to fit a larger model in VRAM; for reference, see GPTQ.
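
Loading such a model looks roughly like this (a sketch using AutoGPTQ; the path is the example one from the usage section below, and the exact arguments in embeddings.py may differ):

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "/home/siegfried/model-gptq/vicuna-13b-4bits"  # example path from the usage section

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
# Loads the 4-bit / group-size-128 checkpoint onto the GPU.
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")
```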

Example chat on the State of the Union address:

*(screenshot: Chat Example)*

### Storing documents into the vector store

```
❯ python embeddings.py --index-name state_of_the_union store --docs state_of_the_union.txt
INFO    - Loading encoding model sentence-transformers/all-MiniLM-L6-v2...
INFO    - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO    - Storing vector data to redis...
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.38it/s]
INFO    - Index already exists
INFO    - finished, exiting...
```

`--docs` takes multiple files; currently only `.txt` and `.pdf` files are supported.
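
Conceptually, the store step does something along these lines (a sketch using LangChain; the loaders, splitter settings, and vector-store class here are assumptions, not necessarily what embeddings.py uses internally):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Redis

# Load and chunk the document(s) passed via --docs.
docs = TextLoader("state_of_the_union.txt").load()
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(docs)

# Embed the chunks and write them to the Redis index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Redis.from_documents(
    chunks,
    embeddings,
    redis_url="redis://localhost:6379",
    index_name="state_of_the_union",
)
```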

### Loading the vector store with a model

```
❯ python embeddings.py --index-name state_of_the_union run --model-dir /home/siegfried/model-gptq --model-name vicuna-13b-4bits
INFO    - Loading encoding model sentence-transformers/all-MiniLM-L6-v2...
INFO    - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO    - Index already exists
INFO    - Loading Tokenizer from /home/siegfried/model-gptq/vicuna-13b-4bits...
INFO    - Loading the model from /home/siegfried/model-gptq/vicuna-13b-4bits...
INFO    - Loading gptq quantized models...
WARNING - CUDA extension not installed.
INFO    - creating transformer pipeline...
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO    - creating chain...
INFO    - Loading Q&A chain...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
```
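
The "transformer pipeline" and "Q&A chain" steps in the log correspond roughly to wiring like the following (a sketch; `model` and `tokenizer` are the GPTQ model and tokenizer from the loading sketch above, and the LangChain class names are assumptions about the implementation):

```python
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Redis

# Reopen the existing Redis index with the same encoder used at store time.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Redis.from_existing_index(
    embeddings, index_name="state_of_the_union", redis_url="redis://localhost:6379"
)

# Wrap the local GPTQ model as a text-generation pipeline and use it as the chain's LLM.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=pipe)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())

print(qa.run("What did the president say about the economy?"))
```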

## About loading models

There are two ways of loading models. Normally it expects a model quantized with GPTQ at 4 bits with a group size of 128.

You can also specify `--no-gptq`, which loads the model normally (you can probably fit a 13B model in 8 bits within 24 GB of VRAM).
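
For reference, a non-GPTQ 8-bit load with plain transformers looks something like this (a sketch; it requires `bitsandbytes`, and the path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/vicuna-13b"  # a full-precision checkpoint, not a GPTQ one

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# load_in_8bit quantizes on the fly via bitsandbytes; device_map="auto" spreads layers across available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", load_in_8bit=True)
```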

## About converting a model to 4 bits

Please use AutoGPTQ to quantize it; a 13B model needs about 35 GB of RAM.
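
A minimal quantization sketch with AutoGPTQ (paths are placeholders, and a real run should use a proper calibration dataset rather than a single example):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base_model = "/path/to/vicuna-13b"           # full-precision source checkpoint
quantized_dir = "/path/to/vicuna-13b-4bits"  # output directory

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
# Calibration examples: tokenized text the quantizer measures error against.
examples = [tokenizer("The State of the Union is an annual address by the president.")]
model.quantize(examples)
model.save_quantized(quantized_dir)
```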

## Current issues

- It seems the GPTQ-quantized model produced by AutoGPTQ (which derives from GPTQ-for-LLaMa) is not very performant, while the old CUDA branch of GPTQ-for-LLaMa is. AutoGPTQ makes it really easy to use, though, so I stick with it.

^ This issue is resolved by forcing the model to load with `use_triton=True`, which loads the whole model into VRAM.
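
In code, the workaround looks like this (a sketch; the path is the example one from above):

```python
from auto_gptq import AutoGPTQForCausalLM

# use_triton=True switches to the Triton kernels and keeps the whole model in VRAM.
model = AutoGPTQForCausalLM.from_quantized(
    "/home/siegfried/model-gptq/vicuna-13b-4bits", device="cuda:0", use_triton=True
)
```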