llama : add BERT support #2872

Closed
ggerganov opened this issue Aug 29, 2023 · 42 comments
Labels
model Model specific

Comments

@ggerganov
Owner

There is a working bert.cpp implementation.
We should try to implement this in llama.cpp and update the embedding example to use it.

The implementation should mostly follow what we did to integrate Falcon.
Here are the main steps:

  • Update gguf.py with BERT arch KV pairs and tensors
  • Python convert script using gguf.py to generate F16 model
  • add tokenizer implementation in llama.cpp
  • add function to build BERT graph
  • add any new ops in ggml if needed
  • add CUDA offloading
  • add tokenizer tests
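For the first two steps in that checklist, the convert script ends up writing architecture-prefixed KV pairs (e.g. a hypothetical bert.embedding_length, following the general {arch}.* GGUF convention) alongside the tensors. As a rough, illustrative sketch only, this is the kind of round-trip the C++ side would do with ggml's gguf API to read such a pair back; the key name and program are assumptions, not existing code:

#include "ggml.h" // the gguf_* API ships alongside ggml

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }

    // hypothetical key written by the convert script for a BERT model
    const int kid = gguf_find_key(ctx, "bert.embedding_length");
    if (kid >= 0) {
        printf("bert.embedding_length = %u\n", gguf_get_val_u32(ctx, kid));
    }

    gguf_free(ctx);
    return 0;
}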
@dranger003
Contributor

@monatis Happy to help with testing or any other tasks you may need help with. Just keep in mind my understanding of LLM architecture is quite limited.

@m0saan

m0saan commented Sep 1, 2023

@ggerganov Is this claimed by somebody? Otherwise, I can give it a shot!

@ggerganov
Owner Author

I think @monatis is already working on it

@monatis
Collaborator

monatis commented Sep 3, 2023

Yes, I'm on it; I'll raise a PR early this week.

@nortekax

nortekax commented Oct 10, 2023

@monatis did this ever happen? I did not see a PR. RAG (retrieval-augmented generation) is a major application in which BERT models are heavily used, so it would be good to add BERT model support.

@ggerganov
Owner Author

Field report: https://bloop.ai/blog/gpu_with_ggml

@monatis
Collaborator

monatis commented Oct 16, 2023

Ouch, after that I dove deep into LLaVA. Now that it's shipped, I can get back to this one.

@yunghoy

yunghoy commented Oct 24, 2023

Maybe m0saan can take care of this?

@CyrilPeponnet

While we're at it, it might be worth adding https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/ :p It doesn't seem to be a regular BERT model though; see https://huggingface.co/jinaai/jina-bert-implementation/tree/main

@xbotter

xbotter commented Nov 12, 2023

Is there any progress?

@m0saan

m0saan commented Nov 12, 2023

Hello @yunghoy, is anyone working on this? Otherwise I'll work on it! Thankies

@nortekax

@m0saan, I don't think anyone is working on it; it was originally assigned to monatis, but that never happened, so in my humble opinion it would be great if you worked on it.

@wsxiaoys
Contributor

wsxiaoys commented Nov 22, 2023

Hi @m0saan, could you please share how the implementation is going? I'm also interested in the topic and am willing to help with testing or implementation.

@snowyu
Contributor

snowyu commented Nov 28, 2023

Reference: HF's tokenizers wrapper, tokenizers-cpp.

@obezzad

obezzad commented Dec 3, 2023

I can take care of the integration if no one is working on it.

@dranger003
Contributor

https://github.com/FFengIll/embedding.cpp

@sandangel

@obezzad , thank you so much. That would be awesome. I can help with testing when a PR is ready.

@Lurrobert

Lurrobert commented Dec 19, 2023

@obezzad, @monatis any updates?

@xyzhang626

xyzhang626 commented Dec 20, 2023

Hey guys, I made some progress based on a fork of bert.cpp; please check it out if you are still interested. It supports the SOTA BGE series models, real batch inference, and a multilingual tokenizer, with no third-party dependencies: embeddings.cpp

@snowyu
Contributor

snowyu commented Dec 20, 2023

@xyzhang626 all tokenizer tests failed after using models/bge-small-zh-v1.5/ggml-model-q4_0.bin

// examples/test_tokenizer.cpp
int main(int argc, char ** argv) {

    bert_params params;
    params.model = "models/bge-base-zh-v1.5/ggml-model-q4_0.bin";
    ...

@xyzhang626

Hey @snowyu, that's because the test is written for all-MiniLM-L6-v2 rather than bge-small-zh-v1.5. The tokenizer actually works well, and I've pushed a script for a better test; refer to these instructions.

To avoid polluting this thread, we can discuss further details in this post.

@snowyu
Contributor

snowyu commented Dec 28, 2023

The Pre-Tokenizer component in Hugging Face Transformers is highly versatile, offering various types that can be dynamically applied to text before it's tokenized by the main Tokenizer class.

Currently, bert.cpp and embeddings.cpp implement only one type of pre-processing, WordPiece, and use it exclusively for all BERT-based models. However, relying solely on WordPiece may not be sufficient or appropriate in scenarios where other types of pre-tokenization are required to handle specific text patterns effectively.

To achieve a pure C++ implementation that fully leverages the capabilities of Hugging Face's Tokenizers library and its tokenizer_config.json file, we need to implement all of the available tokenizers in addition to reading this configuration file correctly. This would allow us to apply the different pre-processing steps specified for each model checkpoint directory within llama.cpp or other similar tools.

For example, consider the tokenizer.json from the pretrained paraphrase-multilingual-MiniLM model:

{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "WhitespaceSplit"},
      {"type": "Metaspace", "replacement": "▁", ...}
    ]
  }
}

This configuration specifies a sequence of pre-tokenization steps to apply to the text before it is tokenized. Implementing such configurations in C++ would require parsing this JSON structure and applying each step accordingly, ensuring compatibility with the Hugging Face Tokenizers library while maintaining performance within llama.cpp.
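As a rough illustration of that parsing step (not taken from any existing implementation), a minimal C++ sketch that collects the pre_tokenizer step names with nlohmann::json, the single-header JSON library llama.cpp already vendors for its examples, might look like this; the file name and dispatch are assumptions:

#include <nlohmann/json.hpp> // or the vendored json.hpp

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using json = nlohmann::json;

int main() {
    // illustrative only: collect the pre-tokenizer step names from tokenizer.json
    std::ifstream f("tokenizer.json");
    json config = json::parse(f);

    std::vector<std::string> steps;
    if (config.contains("pre_tokenizer")) {
        const json & pre = config["pre_tokenizer"];
        if (pre.value("type", "") == "Sequence") {
            for (const json & p : pre["pretokenizers"]) {
                steps.push_back(p.value("type", ""));
            }
        } else {
            steps.push_back(pre.value("type", ""));
        }
    }

    // each name (WhitespaceSplit, Metaspace, ...) would dispatch to a C++
    // pre-tokenization routine applied in order before the main tokenizer
    for (const std::string & s : steps) {
        std::cout << "pre-tokenizer step: " << s << "\n";
    }
    return 0;
}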

For a clear example of how these pre-tokenization steps can be implemented in a single JavaScript file, see the source of transformers.js's tokenizer implementation: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js

snowyu added a commit to snowyu/llama.cpp that referenced this issue Jan 26, 2024
The content of the OBJ type is actually a list of all key names of the object. This change includes several improvements and additions to the codebase:

* GGUFWriter:
  * Added `def add_kv(self, key: str, val: Any) -> None`: automatically determines the appropriate value type based on `val`.
  * Added `def add_dict(self, key: str, val: dict) -> None`: adds an object (dict) key-value pair.
* constants:
  * Revised `GGUFValueType.get_type(val)`: added support for numpy integers and floating-point numbers, choosing the appropriate integer width based on the size of the value.
* gguf_reader:
  * Added `ReaderField.get()`: returns the value of this ReaderField.
* Unit tests have been added to cover these changes.

Related Issues: ggerganov#4868, ggerganov#2872
snowyu added a commit to snowyu/llama.cpp that referenced this issue Jan 26, 2024
The content of the OBJ type is actually a list of all key names of the object.

* GGUFWriter:
  * add `def add_kv(self, key: str, val: Any) -> None`: determines the value type automatically based on `val`
  * add `def add_dict(self, key: str, val: dict) -> None`: adds an object (dict) value
* constants:
  * `GGUFValueType.get_type`: added support for numpy integers and floating-point numbers, choosing the appropriate integer width based on the size of the value
* gguf_reader:
  * add `ReaderField.get`: to return the value of the field
* Unit test added.

Related Issues: ggerganov#4868, ggerganov#2872
@iamlemec
Collaborator

iamlemec commented Feb 2, 2024

Hi folks! I went ahead and updated the existing BERT implementations to work with GGML master. See here: https://github.com/iamlemec/bert.cpp/. This is forked from @xyzhang626's embeddings.cpp and consequently @skeskinen's bert.cpp. The changes are mostly graph- and allocation-related, and the recent 4D copy commits ended up being necessary for it to work.

I've tested CPU and CUDA with various quantization levels, but I don't have access to Metal. Preliminary benchmarks on bge-class models show a roughly 3x speedup relative to ONNX on CPU, while on CUDA ONNX is actually about 2x faster. But maybe flash attention will change that?

I don't want to be too presumptuous, as I'm not the original author, but it would seem to me that including this in ggml would increase its visibility and improve the chances that it stays up to date with the evolution of the ggml interface.

@sroussey
Contributor

sroussey commented Feb 3, 2024

Nice!

You might add to the README that you need to pip install gguf (and a few other packages I happened to have installed), or create a requirements file.

@sroussey
Contributor

sroussey commented Feb 3, 2024

For fun and speed, I tried adding to CMakeLists.txt:
option(GGML_METAL "ggml: use Metal" ON)

But

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=260 "The file “ggml-metal.metal” couldn’t be opened because there is no such file." UserInfo={NSFilePath=ggml-metal.metal, NSUnderlyingError=0x12ebbb020 {Error Domain=NSPOSIXErrorDomain Code=2 "No such file or directory"}}
bert_load_from_file: ggml_backend_metal_init() failed
bert_load_from_file: using CPU backend

So I tried like this:

❯ GGML_METAL_PATH_RESOURCES=ggml/src/ build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "Hello world"                                                                
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = ggml/src/
ggml_metal_init: loading 'ggml/src/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   209.09 MiB, (  210.72 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (  210.73 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1109.16 MiB, ( 1319.88 / 49152.00)
101 -> [CLS]
7592 -> hello
2088 -> world
102 -> [SEP]

ggml_metal_graph_compute_block_invoke: error: unsupported op 'REPEAT'
GGML_ASSERT: /Users/steve/Code/bert.cpp/ggml/src/ggml-metal.m:760: !"unsupported op"
[1]    28616 abort      GGML_METAL_PATH_RESOURCES=ggml/src/ build/bin/main -m  -p "Hello world"

I'm looking forward to this work getting integrated into a PR for llama.cpp so that embedding (a real practical use case, and ideal for client-side indexing) gets more attention!

@iamlemec
Collaborator

iamlemec commented Feb 3, 2024

@sroussey Thanks for the feedback! Just added in a requirements.txt and updated the CMakeLists.txt file to copy over the ggml-metal.metal file on configuration. I think that should do the trick.

Agreed on the last point, and in that case we could use the better-developed tokenization infrastructure from llama.cpp as well.

@sroussey
Contributor

sroussey commented Feb 3, 2024

@iamlemec Some other things

  1. How do you get the embeddings from main? I want to compare against the same model running elsewhere, which should produce nearly the same numbers, so main should output all dimensions of the result.
  2. Why does it create 32 copies of the tokens to build a batch? I took this out and of course it's faster. ;)

@slaren
Collaborator

slaren commented Feb 3, 2024

ggml_metal_graph_compute_block_invoke: error: unsupported op 'REPEAT'

Regarding this error: the CPU, CUDA and Metal backends support full broadcasting in the ggml_add operation, so it should be possible to remove the ggml_repeat operation entirely. Looking at the code, this may require reshaping some tensors, but I think it can be done.
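For concreteness, a sketch of what the suggested change could look like in the graph-building code; the helper function and tensor shapes are hypothetical, and only the ggml_repeat/ggml_add calls are the point:

#include "ggml.h"

// illustrative helper; KQ is [n_kv, n_tokens, n_head, n_batch], attn_mask is [n_kv, n_tokens, 1, 1]
static struct ggml_tensor * apply_attn_mask(
        struct ggml_context * ctx0,
        struct ggml_tensor  * KQ,
        struct ggml_tensor  * attn_mask) {
    // before: materialize the broadcast explicitly
    //   KQ = ggml_add(ctx0, KQ, ggml_repeat(ctx0, attn_mask, KQ));
    // after: let ggml_add broadcast the smaller operand across the
    //        head/batch dimensions, as the CPU/CUDA/Metal backends support
    return ggml_add(ctx0, KQ, attn_mask);
}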

@iamlemec
Collaborator

iamlemec commented Feb 3, 2024

@sroussey For any testing and benchmarking, I usually just use the Python interface, which should have ~zero overhead. But yeah, the 32 was to test batching, and I was just printing the first 8 to make sure they didn't differ from the baseline. Now I've changed main so it just does an n=1 batch and outputs the full embedding vector.

@slaren Yeah, seems like it. I guess I could keep the attention mask as a 4d tensor rather than combining the head and batch dimensions as it does currently. Then adding it to the KQ matrix will broadcast implicitly.

@iamlemec
Collaborator

iamlemec commented Feb 3, 2024

@sroussey Ok, I believe ggml_repeat has been excised, and removing it actually obviated a bunch of other reshape ops. It might work on Metal now!

@sroussey
Contributor

sroussey commented Feb 3, 2024

Metal works now! Embedding two words is actually slower, but I'm not that surprised. I'll need to do some testing later with real data.

BTW: I made a PR to fix a segfault, and another to have all the messages go to stderr except the array itself so it can be piped around.

@ggerganov
Owner Author

ggerganov commented Feb 3, 2024

You can replace these with ggml_soft_max_ext():

// before
KQ = ggml_scale_inplace(ctx0, KQ, 1.0f / sqrt((float)d_head));
KQ = ggml_add(ctx0, KQ, attn_mask);
KQ = ggml_soft_max(ctx0, KQ);

// after
KQ = ggml_soft_max_ext(ctx0, KQ, attn_mask, 1.0f / sqrt((float)d_head));

Edit: nvm, it's okay the way it is (iamlemec/bert.cpp#3)

@cebtenzzre
Collaborator

Which repo do we plan to eventually integrate BERT into - is it still llama.cpp? I would like to port Nomic Embed to GGML so GPT4All can use it locally (its architecture is very similar to BERT), and also drop our non-GPU-accelerated, non-batching variant of bert.cpp in the process. But I'm not sure how best to contribute.

Also @iamlemec - could you shed light on why this normalization of the output was added?

    inpL = ggml_rms_norm(ctx0, inpL, layer_norm_eps); // [E, B]
    inpL = ggml_scale_inplace(ctx0, inpL, 1.0f / sqrt((float)n_embd)); // [E, B] (since rms_norm does mean instead of sum)

cc @ggerganov

@ggerganov
Owner Author

I would like at some point to have embedding models supported in llama.cpp. Not sure if the existing llama_get_embeddings API would be enough though - any thoughts on this?

But I'm not sure how best to contribute.

Let me know about any specific roadblocks and we can discuss. From a brief look at bert.cpp, I think it is entirely possible to add BERT support to llama.cpp.

@iamlemec
Collaborator

iamlemec commented Feb 7, 2024

@cebtenzzre That final normalization is just the usual L2-normalization. Looking into it, I realized that most implementations of BERT (like HF) don't normalize in the core model and only do it later at a higher level. So I just added a normalize flag to the embed interface that gets passed to the graph construction code.
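To spell out why that works: ggml_rms_norm divides by sqrt(mean(x^2) + eps), so the extra 1.0f/sqrt(n_embd) scale turns the mean into a sum, which is exactly an L2 normalization (up to the small eps). A standalone sketch of the equivalent computation in plain C++, with no ggml, just to make the math explicit:

#include <cmath>
#include <vector>

// L2-normalize a single embedding vector: x / sqrt(sum(x^2))
static std::vector<float> l2_normalize(const std::vector<float> & x) {
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += (double) v * v;
    }
    const float inv_norm = 1.0f / std::sqrt((float) sum_sq + 1e-12f);

    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * inv_norm;
    }
    return out;
}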

@ggerganov I could have a PR for you later today. Were you thinking of putting it in examples a la LLaVA, or more integrated into the core? I basically have the examples approach working right now. For the more integrated approach, it seems like llama_get_embeddings would work, and I could just add a build_bert function and work out any conversion issues.

@cebtenzzre
Collaborator

@iamlemec Here is my rough initial attempt at integrating BERT into core llama.cpp, which I started before I found your fork. Hopefully you find some part of it useful; I have not spent much time comparing it to your implementation, but I know it is missing some of your improvements: https://github.com/ggerganov/llama.cpp/tree/ceb/bert

@iamlemec
Collaborator

iamlemec commented Feb 7, 2024

@cebtenzzre Thanks! This is really helpful. Seems like I can build off of this and just add in my changes to build_bert.

I think the biggest difference going to BERT-style models is that the attention is not causal. Right now it looks like causal attention is hard-coded in llama_build_graph. I'll add a bool causal_attn flag, defaulting to true, to control this. Does it make more sense for this to go in llama_cparams or llama_hparams?

@cebtenzzre
Collaborator

Does it make more sense for this to go in llama_cparams or llama_hparams?

It should be in llama_hparams. llama_cparams is for things the user can change via llama_new_context_with_model.
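A minimal sketch of the proposed flag, with the field name as discussed in this thread; the actual load-time plumbing (e.g. reading a bert.attention.causal key from the GGUF metadata into it) is omitted here:

struct llama_hparams {
    // ... existing fields ...
    bool causal_attn = true; // encoder-only models such as BERT would load this as false
};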

@ggerganov
Owner Author

Were you thinking of putting it in examples a la LLaVa or more integrated into the core?

It should be integrated into the core library, with support for converting BERT models to GGUF format and loading them like any other model. There could be an accompanying example to demonstrate basic usage.

I think the biggest difference going to BERT-style models is that the attention is not causal. Right now it looks like causal attention is hard-coded in llama_build_graph

You can change the KQ_mask to non-causal if the model arch is BERT. Currently, it always constructs a causal mask:

llama.cpp/llama.cpp

Lines 7047 to 7070 in 8504d2d

{
    const int64_t n_kv     = llm.n_kv;
    const int64_t n_tokens = batch.n_tokens;

    GGML_ASSERT(ggml_backend_buffer_is_host(lctx.inp_KQ_mask->buffer));

    float * data = (float *) lctx.inp_KQ_mask->data;

    for (int h = 0; h < 1; ++h) {
        for (int j = 0; j < n_tokens; ++j) {
            const llama_pos pos = batch.pos[j];
            const llama_seq_id seq_id = batch.seq_id[j][0];

            for (int i = 0; i < n_kv; ++i) {
                float f;
                if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) {
                    f = -INFINITY;
                } else {
                    f = 0;
                }

                data[h*(n_kv*n_tokens) + j*n_kv + i] = f;
            }
        }
    }
}

But we can add a check for whether it is an embedding model and construct the appropriate mask. This way you wouldn't need an extra flag, and if other embedding models are added in the future they can reuse that mask. Would that work?
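For reference, a sketch of the non-causal variant of the quoted loop, using the same surrounding variables: keep the sequence-id check so tokens only attend within their own sequence, but drop the position comparison so every token can attend to the whole sequence.

for (int h = 0; h < 1; ++h) {
    for (int j = 0; j < n_tokens; ++j) {
        const llama_seq_id seq_id = batch.seq_id[j][0];

        for (int i = 0; i < n_kv; ++i) {
            // non-causal: mask only tokens from other sequences
            const bool same_seq = lctx.kv_self.cells[i].has_seq_id(seq_id);
            data[h*(n_kv*n_tokens) + j*n_kv + i] = same_seq ? 0.0f : -INFINITY;
        }
    }
}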

@iamlemec
Collaborator

iamlemec commented Feb 8, 2024

Ok, I think I have it working! Current status is at: https://github.com/iamlemec/llama.cpp. Happy to make it a PR if you think it's ready.

I ended up adding a WordPiece tokenizer called llm_tokenizer_wpm. This should be similar to llm_tokenizer_spm, but I was having issues with spm not choosing the longest matching token. It seems possible there's some way to unify these, but I couldn't quite figure it out.
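For reference, the core of WordPiece is a greedy longest-match-first loop over each whitespace-split word, which is the behavior the spm path wasn't giving. A self-contained sketch of that loop (ASCII-only and simplified; a real implementation also has to respect UTF-8 boundaries, lowercasing, and accent stripping):

#include <string>
#include <unordered_set>
#include <vector>

static std::vector<std::string> wordpiece_tokenize(
        const std::string & word,
        const std::unordered_set<std::string> & vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;

    while (start < word.size()) {
        size_t end = word.size();
        std::string piece;

        // take the longest vocab entry that matches at `start`
        while (end > start) {
            std::string cand = word.substr(start, end - start);
            if (start > 0) {
                cand = "##" + cand; // continuation-piece prefix
            }
            if (vocab.count(cand)) {
                piece = cand;
                break;
            }
            --end;
        }

        if (piece.empty()) {
            return { "[UNK]" }; // no piece matches: the whole word is unknown
        }

        pieces.push_back(piece);
        start = end;
    }
    return pieces;
}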

As you suggested, I added a causal_attn bool to hparams and a corresponding key bert.attention.causal to the GGUF converter. These are then consulted around the code you quoted above.

I was a little unsure about how to get the output correctly. Right now, it looks at the last node, and if it's called "result_embed" it treats it as a pooled embedding. Otherwise, it just executes the old logic (which might include getting a last token embedding from a proper LLM).

Also, the pooling layer is a bit tricky. Right now it averages over the entire batch, but if you had multiple sequences it would average over all of them, so it essentially only works properly for single-sequence batches. Is there some way to sum within each sequence? I was thinking you could preconstruct a matrix of size [n_seqs, n_tokens] and matmul with that.
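Sketching that idea outside the graph (plain C++, names hypothetical): build P with P[s][t] = 1/len(s) when token t belongs to sequence s and 0 otherwise, so a single matrix multiply of P against the token embeddings (a ggml_mul_mat inside the graph) yields one mean-pooled embedding per sequence.

#include <vector>

// seq_id[t] gives the sequence index of token t in the batch
static std::vector<std::vector<float>> build_pooling_matrix(
        const std::vector<int> & seq_id, int n_seqs) {
    const int n_tokens = (int) seq_id.size();

    std::vector<int> seq_len(n_seqs, 0);
    for (int t = 0; t < n_tokens; ++t) {
        seq_len[seq_id[t]]++;
    }

    // P is [n_seqs, n_tokens]; row s averages the tokens of sequence s
    std::vector<std::vector<float>> P(n_seqs, std::vector<float>(n_tokens, 0.0f));
    for (int t = 0; t < n_tokens; ++t) {
        P[seq_id[t]][t] = 1.0f / seq_len[seq_id[t]];
    }
    return P;
}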

@slaren
Collaborator

slaren commented Feb 8, 2024

It would be good to open a PR even if it is still a draft, so that we can see the overall changes and give more specific feedback.
