Model Request for BAAI/bge-m3 (XLMRoberta-based Multilingual Embedding Model) #6007

Open
4 tasks done
mofanke opened this issue Mar 12, 2024 · 7 comments
Labels
enhancement New feature or request

mofanke commented Mar 12, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Supporting a multilingual embedding.
https://huggingface.co/BAAI/bge-m3

Motivation

There are some differences between multilingual embedding models such as bge-m3 and BERT.

Possible Implementation

Sorry, no idea. I tried it: the model architecture seems to be the same as BERT, but the tokenizer is XLMRobertaTokenizer, not BertTokenizer.

@mofanke mofanke added the enhancement New feature or request label Mar 12, 2024
@github-actions github-actions bot added the stale label Apr 12, 2024
@RoggeOhta

I would also like this model to be supported.

@github-actions github-actions bot removed the stale label Apr 24, 2024

vonjackustc commented May 4, 2024

I tried to support it using BertModel and the SPM tokenizer:
https://huggingface.co/vonjack/bge-m3-gguf

Tested cosine similarity between "中国" and "中华人民共和国":
bge-m3-f16: 0.9993230772798457
mxbai-embed-large-v1-f16: 0.7287733321223814
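
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how a cosine similarity score can be computed from two embedding vectors. The vectors below are made-up placeholders, not output from bge-m3 or mxbai-embed-large-v1, and this is not necessarily the script used for the numbers above.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors (hypothetical values, NOT real model embeddings).
emb_a = [0.1, 0.3, -0.2, 0.7]
emb_b = [0.2, 0.6, -0.4, 1.4]  # emb_a scaled by 2, so the two point the same way

print(cosine_similarity(emb_a, emb_b))
```

In practice the two lists would be replaced by the embeddings the model returns for the two input strings. Scores very close to 1 mean the vectors point in nearly the same direction.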


vuminhquang commented May 12, 2024

I got an error when using it with LangChain:
"terminate called after throwing an instance of 'std::out_of_range'"


ciekawy commented May 21, 2024

Same here with llama.cpp. The full error:

libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found


ciekawy commented May 21, 2024

The _bert version does not crash, but the embeddings do not seem to make any sense...


ciekawy commented May 21, 2024

I also tried to follow the instructions at https://github.com/PrithivirajDamodaran/blitz-embed, but after converting to GGUF I get this error:

llama_model_quantize: failed to quantize: key not found in model: bert.context_length


ciekawy commented May 22, 2024

@vonjackustc, can you share the params you used with llama.cpp?
